Area Engineering
Explaining failures in HPC and Data Centers
Keywords HPC, RELIABILITY, DATA CENTERS, CLOUD, FAULTS
Reference persons JOSIE ESTEBAN RODRIGUEZ CONDIA, MATTEO SONZA REORDA
Research Groups DAUIN - GR-05 - ELECTRONIC CAD & RELIABILITY GROUP - CAD
Description High Performance Computing (HPC) systems and large data centers for cloud applications are increasingly affected by Silent Data Corruptions (SDCs), i.e., incorrect results caused by faults affecting the hardware. Recent papers by Google, Meta and Alibaba discussed the size and importance of this issue. However, there is still little evidence about its magnitude, in particular about the likelihood of SDC occurrences and the underlying causes that contribute to them. This thesis deals with the construction of a software environment able to extract meaningful data from the GPU devices used in an HPC system (e.g., the number of active modules, the behaviour of the caches, the degree of activity), so that machine learning algorithms or deep neural networks can be used to discover anomalous patterns. The thesis also aims to validate the environment on a real HPC cluster of GPUs, suitably stressed in order to maximise the probability of SDC occurrence.
Required skills Basic knowledge about GPU architecture and programming
Deadline 12/06/2025