PORTALE DELLA DIDATTICA


Engineering Area

Simplifying Transformer Architecture for Efficient Large Language Model Inference

Keywords ATTENTION MECHANISM, COMPRESSION, KV CACHE, LARGE LANGUAGE MODEL, OPTIMIZATION, TRANSFORMER

Supervisors ALESSIO BURRELLO, DANIELE JAHIER PAGLIARI

External supervisor Luca Benfenati

Research group DAUIN - GR-06 - ELECTRONIC DESIGN AUTOMATION - EDA

Thesis type EXPERIMENTAL, RESEARCH, SOFTWARE DEVELOPMENT

Description Large Language Models (LLMs), such as GPT, BERT, and Llama, have revolutionized natural language processing by enabling human-like text understanding and generation. While much attention has focused on optimizing LLM training, which is a one-time effort, optimizing inference is arguably even more critical, since inference costs are incurred every time the model is used.

The transformer architecture underlying LLMs relies on stacked attention layers to capture dependencies in text. Each layer contains multiple attention heads and the associated per-layer computations, and autoregressive decoding typically maintains a Key-Value (KV) cache to avoid recomputing keys and values for past tokens. However, this complexity is not always essential for every task, and inference cost grows with both sequence length and model depth.
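
To make the role of the KV cache concrete, the toy single-head example below is a minimal sketch (made-up dimensions, plain PyTorch, not part of the proposal): past keys and values are stored so that each decoding step only computes attention for the newest token against the cached entries instead of re-encoding the whole prefix.

    import torch

    d_model = 64
    W_q = torch.randn(d_model, d_model)
    W_k = torch.randn(d_model, d_model)
    W_v = torch.randn(d_model, d_model)

    k_cache, v_cache = [], []

    def decode_step(x_t):
        """x_t: (1, d_model) embedding of the newest token."""
        q = x_t @ W_q
        k_cache.append(x_t @ W_k)      # grow the cache instead of recomputing past keys
        v_cache.append(x_t @ W_v)
        K = torch.cat(k_cache, dim=0)  # (t, d_model)
        V = torch.cat(v_cache, dim=0)
        attn = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)  # scaled dot-product
        return attn @ V                # (1, d_model) context vector

    # Simulate decoding 5 tokens; per-step cost stays O(t) instead of O(t^2),
    # at the price of storing K and V for every generated token.
    for t in range(5):
        out = decode_step(torch.randn(1, d_model))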

This thesis proposes simplifying the transformer architecture by removing components that may not contribute significantly to specific tasks. Key strategies include:
• Selective Attention Head Pruning: pruning redundant heads that add minimal value, reducing computational load without major accuracy loss (see the sketch after this list).
• Layer Reduction: streamlining the model by removing less relevant layers to cut down memory and processing requirements.
• Shared KV Cache Across Layers: reducing redundancy by sharing the KV cache across layers, particularly where cross-layer dependencies are less critical.
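
As an illustration of the first strategy, the following sketch (an assumed toy setup built on torch.nn.MultiheadAttention, not the thesis implementation) ranks heads with a simple importance proxy and silences the least important half by zeroing their slice of the output projection; analogous model surgery, applied to whole decoder layers, would implement layer reduction.

    import torch
    import torch.nn as nn

    n_heads, d_head = 8, 32
    d_model = n_heads * d_head
    mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    x = torch.randn(2, 16, d_model)    # (batch, seq_len, d_model)

    with torch.no_grad():
        # Importance proxy: L2 norm of the output-projection columns fed by each head.
        w_o = mha.out_proj.weight.view(d_model, n_heads, d_head)
        importance = w_o.pow(2).sum(dim=(0, 2)).sqrt()           # one score per head
        keep = importance.argsort(descending=True)[: n_heads // 2]

        # "Prune" by zeroing the columns that carry the discarded heads' outputs.
        col_mask = torch.zeros(n_heads)
        col_mask[keep] = 1.0
        mha.out_proj.weight.mul_(col_mask.repeat_interleave(d_head))

    out, _ = mha(x, x, x)              # forward pass with half of the heads silenced
    print(out.shape)                   # torch.Size([2, 16, 256])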

The primary objectives of this thesis are:
1. To analyze the transformer architecture and identify which components contribute the least to performance for specific tasks.
2. To develop and evaluate methods for attention head pruning, layer reduction, and shared KV cache across layers.
3. To benchmark these strategies for inference speed, memory efficiency, and impact on task accuracy (a minimal benchmarking sketch follows below).
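
For the third objective, the snippet below is a hedged sketch of the kind of measurement involved: wall-clock latency and peak GPU memory of a HuggingFace generate() call, using a small placeholder checkpoint ("gpt2") and an arbitrary prompt rather than the actual models and benchmarks of the thesis; task accuracy would be evaluated separately with standard benchmark suites.

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"                      # placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    inputs = tok("The quick brown fox", return_tensors="pt").to(device)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=64)
    latency = time.perf_counter() - start

    print(f"latency: {latency:.3f} s")
    if device == "cuda":
        print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")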

Required knowledge Proficiency in Python is required. Familiarity with deep learning, PyTorch, and the HuggingFace libraries is a plus.


Proposal valid until 31/10/2025