Explaining failures in HPC and Data Centers

 

KEYWORDS HPC, RELIABILITY, DATA CENTERS, CLOUD, FAULTS

CONTACTS JOSIE ESTEBAN RODRIGUEZ CONDIA, MATTEO SONZA REORDA

DESCRIPTION High Performance Computing (HPC) systems and large data centers for cloud applications are experiencing growing problems with Silent Data Corruptions (SDCs), i.e., incorrect results caused by faults affecting the hardware. Recent papers by Google, Meta and Alibaba have discussed the scale and importance of this issue. However, there is still little evidence on its magnitude, specifically on how likely SDCs are to occur and on the underlying causes that contribute to them. This thesis deals with the construction of a software environment able to extract meaningful data from the GPU devices used in an HPC system (e.g., the number of active modules, the behaviour of the caches, the degree of activity), so that machine learning algorithms or deep neural networks can be used to discover anomaly patterns. The thesis also aims to validate the environment on a real HPC cluster of GPUs, suitably stressed in order to maximise the probability of SDC occurrence.
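As a rough illustration of the anomaly-discovery step described above, the following minimal Python sketch flags outliers in a series of GPU measurements. It is not the method the thesis prescribes: a simple robust statistic (the modified z-score based on the median absolute deviation) stands in for the machine learning models mentioned in the description. The function name `mad_anomalies`, the `cache_hit_rate` values, and the threshold of 3.5 are all assumptions made for the example; no real GPU is queried.

```python
from statistics import median


def mad_anomalies(samples, threshold=3.5):
    """Return the indices of samples whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not distort the
    baseline the way it would with a mean and standard deviation.
    """
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # At least half the samples equal the median: the score is undefined.
        return []
    return [
        i
        for i, x in enumerate(samples)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical per-interval cache hit rates (synthetic, not measured data).
cache_hit_rate = [0.91, 0.92, 0.90, 0.93, 0.91, 0.42, 0.92, 0.90]

# Indices of intervals that deviate strongly from the typical behaviour.
anomalous = mad_anomalies(cache_hit_rate)
print(anomalous)
```

In a real environment the input series would come from the data extracted on the GPUs (for instance cache behaviour or module activity), and a flagged interval would only be a candidate for closer inspection, not proof of an SDC.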

REQUIRED KNOWLEDGE Basic knowledge of GPU architecture and programming


PROPOSAL VALID UNTIL 12/06/2025