Optimizing Large Language Models Inference through KV Cache Compression
Keywords: ATTENTION MECHANISM, COMPRESSION, KV CACHE, LARGE LANGUAGE MODEL, OPTIMIZATION, TRANSFORMER
Reference persons: ALESSIO BURRELLO, DANIELE JAHIER PAGLIARI
External reference persons: Luca Benfenati
Research groups: DAUIN - GR-06 - ELECTRONIC DESIGN AUTOMATION - EDA, ELECTRONIC DESIGN AUTOMATION - EDA, GR-06 - ELECTRONIC DESIGN AUTOMATION - EDA
Thesis type: EXPERIMENTAL, RESEARCH, SOFTWARE DEVELOPMENT
Description: Large Language Models (LLMs), such as GPT, BERT, and Llama, have revolutionized natural language processing by enabling human-like text understanding and generation. While much attention has focused on optimizing LLM training, which is a one-time effort, optimizing inference is arguably even more critical, since the cost of inference is incurred every time the model is used.
LLMs use a Key-Value (KV) cache to store the key and value tensors that the attention layers compute for previous tokens, so that these tensors do not have to be recomputed at every generation step. However, the KV cache grows linearly with sequence length, creating significant memory demands that challenge deployment on resource-limited devices.
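To make the memory growth concrete, the following minimal sketch estimates the KV cache size from a model configuration. The function name `kv_cache_bytes` and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit storage, roughly the shape of a 7B-parameter decoder) are assumptions chosen for illustration and are not taken from the proposal.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size


# Illustrative configuration (assumed): 32 layers, 32 KV heads,
# head dimension 128, 2 bytes per element (fp16).
for seq_len in (1024, 4096, 16384):
    size = kv_cache_bytes(32, 32, 128, seq_len)
    print(f"{seq_len:>6} tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions each cached token costs 0.5 MiB, so a 16384-token context alone needs 8 GiB, which already exceeds the memory of many edge devices.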
This thesis proposes a compression strategy for the KV cache to reduce its memory demands without sacrificing inference efficiency. Two methods are of particular interest:
• Hyper-Dimensional Computing (HDC): inspired by brain-like computation, HDC encodes information into high-dimensional, redundant vectors, enabling compact representations of key-value tensors. This method can potentially reduce cache size significantly while preserving essential information.
• Principal Component Analysis (PCA): by transforming key and value tensors into a lower-dimensional space, PCA retains the most informative components, cutting down the KV cache size and optimizing memory usage without major inference accuracy loss.
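As a rough illustration of the PCA idea only (HDC is not covered here), the sketch below projects the cached vectors of one attention head onto their top principal components and reconstructs them. It uses NumPy and synthetic data with an artificially low-rank structure. The helper names `fit_pca`, `compress`, and `decompress`, the dimensions, and the data are all assumptions; whether real key and value tensors are this compressible is exactly what the thesis would need to measure.

```python
import numpy as np


def fit_pca(samples, rank):
    """Return the mean and top-`rank` principal directions of `samples`.

    samples: array of shape (num_tokens, head_dim).
    """
    mean = samples.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:rank]


def compress(x, mean, components):
    """Project (num_tokens, head_dim) vectors onto the kept components."""
    return (x - mean) @ components.T


def decompress(z, mean, components):
    """Map (num_tokens, rank) coefficients back to head_dim."""
    return z @ components + mean


rng = np.random.default_rng(0)
head_dim, rank, num_tokens = 64, 16, 512

# Synthetic "keys" with approximately low-rank structure (assumed for
# illustration; real caches must be measured).
latent = rng.normal(size=(num_tokens, rank))
mixing = rng.normal(size=(rank, head_dim))
keys = latent @ mixing + 0.01 * rng.normal(size=(num_tokens, head_dim))

mean, components = fit_pca(keys, rank)
codes = compress(keys, mean, components)
restored = decompress(codes, mean, components)

rel_error = np.linalg.norm(keys - restored) / np.linalg.norm(keys)
print(f"stored floats per token: {head_dim} -> {rank}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

Note that the mean and the projection matrix must also be stored, but their cost is fixed per head and is amortized over the sequence length.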
The primary objectives are:
1. To evaluate the memory demands of the KV cache and the limits it imposes on device-level inference.
2. To develop and test KV cache compression methods using HDC and PCA.
3. To benchmark these methods for memory efficiency, inference latency, and impact on model accuracy.
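A benchmark for the third objective could be organized along the lines of the sketch below, which reports a compression ratio, a round-trip latency, and a relative reconstruction error for any compress/decompress pair. The `benchmark` function, the array shape, and the float16 cast used as a placeholder compressor are hypothetical. The reconstruction error is also only a proxy: the impact on model accuracy would have to be measured on the model's actual outputs.

```python
import time

import numpy as np


def benchmark(compress, decompress, cache, repeats=10):
    """Measure size ratio, round-trip latency, and reconstruction error."""
    compressed = compress(cache)
    start = time.perf_counter()
    for _ in range(repeats):
        restored = decompress(compress(cache))
    latency_ms = (time.perf_counter() - start) / repeats * 1e3
    error = np.linalg.norm(cache - restored) / np.linalg.norm(cache)
    return {
        "compression_ratio": cache.nbytes / compressed.nbytes,
        "latency_ms": latency_ms,
        "relative_error": float(error),
    }


rng = np.random.default_rng(0)
# Hypothetical cache slice: (num_tokens, head_dim) in float32.
cache = rng.normal(size=(1024, 128)).astype(np.float32)

# Placeholder compressor: a float16 cast, standing in for HDC or PCA.
report = benchmark(
    compress=lambda x: x.astype(np.float16),
    decompress=lambda z: z.astype(np.float32),
    cache=cache,
)
print(report)
```

Keeping the compressor behind two plain callables lets the same harness compare HDC, PCA, and simple baselines on identical data.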
Required skills: Proficiency in Python is required. Familiarity with Deep Learning, PyTorch, and the Hugging Face libraries is a plus.
Proposal valid until: 31/10/2025