Politecnico di Torino | Teaching Services


Explaining failures in HPC and Data Centers

Keywords: HPC, RELIABILITY, DATA CENTERS, CLOUD, FAULTS

Reference persons: JOSIE ESTEBAN RODRIGUEZ CONDIA, MATTEO SONZA REORDA

Research groups: DAUIN - GR-05 - ELECTRONIC CAD & RELIABILITY GROUP - CAD

Description: High Performance Computing (HPC) systems and large data centers for cloud applications are increasingly affected by Silent Data Corruptions (SDCs), i.e., errors in the produced results caused by faults affecting the hardware. Recent papers by Google, Meta and Alibaba discussed the size and importance of this issue. However, there is still little evidence on its magnitude, in particular on the likelihood of SDC occurrences, or on the underlying causes of SDCs. This thesis deals with the construction of a software environment able to extract meaningful data from the GPU devices used in an HPC system (e.g., the number of active modules, the behaviour of the caches, the degree of activity), so that machine learning algorithms or deep neural networks can be used to discover anomaly patterns. The thesis also aims at validating the environment on a real HPC cluster of GPUs, suitably stressed in order to maximise the probability of SDC occurrence.
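
As a rough illustration of what "discovering anomaly patterns" in extracted GPU data could look like, the sketch below flags outliers in a single metric using a modified z-score (median and median absolute deviation). This is only a simple statistical stand-in, not the machine learning or deep neural network approach the proposal has in mind. The function name `flag_anomalies`, the threshold value, and the cache hit rate samples are all hypothetical and do not come from the proposal.

```python
from statistics import median


def flag_anomalies(samples, threshold=3.5):
    """Return the indices of samples whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single extreme value does not distort the baseline.
    """
    med = median(samples)
    deviations = [abs(x - med) for x in samples]
    mad = median(deviations)
    if mad == 0:
        # No spread in the data: nothing can be ranked as an outlier.
        return []
    return [i for i, d in enumerate(deviations)
            if 0.6745 * d / mad > threshold]


# Hypothetical cache hit rates sampled at regular intervals on one GPU.
cache_hit_rate = [0.91, 0.90, 0.92, 0.89, 0.91, 0.42, 0.90, 0.92]
anomalous = flag_anomalies(cache_hit_rate)
print(anomalous)
```

In a real environment the samples would come from the GPU monitoring layer rather than a hard-coded list, and a flagged interval would only be a candidate for closer inspection, not proof of an SDC.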

Required knowledge: Basic knowledge of GPU architecture and programming

Proposal valid until: 12/06/2025