Optimizing Large Language Models Inference through KV Cache Compression

keywords ATTENTION MECHANISM, COMPRESSION, KV CACHE, LARGE LANGUAGE MODEL, OPTIMIZATION, TRANSFORMER

Reference persons ALESSIO BURRELLO, DANIELE JAHIER PAGLIARI

External reference persons Luca Benfenati

Research Groups ELECTRONIC DESIGN AUTOMATION - EDA

Thesis type EXPERIMENTAL

Description Large Language Models (LLMs), such as GPT, BERT, and Llama, have revolutionized natural language processing by enabling human-like text understanding and generation. While much attention has focused on optimizing LLM training, which is a one-time effort, inference optimization is arguably even more critical, because the cost of inference is paid every time the model is used.

During autoregressive generation, LLMs use a Key-Value (KV) cache that stores the key and value tensors of each attention layer for the tokens already processed, so that they do not have to be recomputed for every new token. However, the KV cache grows linearly with the sequence length, creating significant memory demands that challenge deployment on resource-limited devices.
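To make the memory growth concrete, the following minimal sketch estimates the KV cache footprint as a function of sequence length. The function name `kv_cache_bytes` and the configuration values (32 layers, 32 key/value heads, head dimension 128, 16-bit elements) are illustrative assumptions, not figures from this proposal.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the KV cache size in bytes (hypothetical helper)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed, illustrative model configuration with 16-bit cache entries.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (1024, 4096, 32768):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.2f} GiB")
```

Under these assumed values the cache costs 0.5 MiB per token, which amounts to 2 GiB at 4096 tokens for a single sequence.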

This thesis proposes a compression strategy for the KV cache to reduce its memory demands without sacrificing inference efficiency. Two methods are of particular interest:
• Hyper-Dimensional Computing (HDC): inspired by brain-like computation, HDC encodes information into high-dimensional, redundant vectors, enabling compact representations of key-value tensors. This method can potentially reduce cache size significantly while preserving essential information.
• Principal Component Analysis (PCA): by projecting key and value tensors onto a lower-dimensional space, PCA retains the most informative components, cutting down the KV cache size and memory usage without a major loss of inference accuracy.
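As an illustration of the PCA idea only, the sketch below fits a projection basis on a set of calibration vectors, stores the low-dimensional coefficients, and reconstructs the vectors when needed. It is a minimal NumPy example under stated assumptions: the helper names (`fit_pca`, `compress`, `decompress`), the dimensions, and the synthetic low-rank data are hypothetical, and a real study would use key and value tensors extracted from an actual model.

```python
import numpy as np


def fit_pca(calib, k):
    """Return the mean and the top-k principal directions of `calib` (n, d)."""
    mean = calib.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt[:k].T  # basis has shape (d, k)


def compress(x, mean, basis):
    """Project (n, d) vectors onto the k-dimensional PCA subspace."""
    return (x - mean) @ basis


def decompress(z, mean, basis):
    """Map (n, k) coefficients back to the original d-dimensional space."""
    return z @ basis.T + mean


# Synthetic stand-in for cached keys: approximately rank-k data plus noise.
rng = np.random.default_rng(0)
n, d, k = 1024, 128, 16
keys = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
        + 0.01 * rng.standard_normal((n, d)))

mean, basis = fit_pca(keys, k)
z = compress(keys, mean, basis)       # what would be stored in the cache
recon = decompress(z, mean, basis)    # what attention would consume

rel_err = np.linalg.norm(keys - recon) / np.linalg.norm(keys)
print(f"stored shape {z.shape}, relative reconstruction error {rel_err:.4f}")
```

The per-token storage drops from `d` to `k` values, at the price of keeping one `mean` vector and one `basis` matrix per compressed tensor. How well real key and value tensors are approximated by a low-rank subspace is exactly what the thesis would need to measure.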

The primary objectives are:
1. To evaluate the memory demands of the KV cache and the limits it imposes on device-level inference.
2. To develop and test KV cache compression methods using HDC and PCA.
3. To benchmark these methods for memory efficiency, inference latency, and impact on model accuracy.
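The third objective could be approached with a small harness like the one sketched below, which reports a compression ratio, an average latency, and the error introduced in the attention output. Everything here is an assumption made for illustration: the `benchmark` and `attention` helpers are hypothetical, the data is random, and a plain cast to 16-bit floats stands in for the HDC and PCA compressors that the thesis would actually plug in.

```python
import time

import numpy as np


def attention(q, k, v):
    """Single-head scaled dot-product attention on 2-D arrays."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def benchmark(compress, decompress, q, k, v, repeats=20):
    """Measure memory, latency, and output error for one compressor."""
    ck, cv = compress(k), compress(v)
    ratio = (k.nbytes + v.nbytes) / (ck.nbytes + cv.nbytes)

    start = time.perf_counter()
    for _ in range(repeats):
        out = attention(q, decompress(ck), decompress(cv))
    latency = (time.perf_counter() - start) / repeats

    ref = attention(q, k, v)  # uncompressed reference output
    err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
    return {"compression_ratio": ratio,
            "latency_s": latency,
            "rel_output_error": float(err)}


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64)).astype(np.float32)
k = rng.standard_normal((512, 64)).astype(np.float32)
v = rng.standard_normal((512, 64)).astype(np.float32)

# Placeholder compressor: float32 -> float16 and back.
results = benchmark(lambda x: x.astype(np.float16),
                    lambda x: x.astype(np.float32),
                    q, k, v)
print(results)
```

The attention output error is only a proxy. Assessing the impact on model accuracy would still require end-to-end evaluation on downstream tasks with a real model.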

Required skills Proficiency in Python is required. Familiarity with Deep Learning, PyTorch and HuggingFace libraries is a plus.


Deadline 31/10/2025