Politecnico di Torino | Servizi per la didattica

Area Engineering

Simplifying Transformer Architecture for Efficient Large Language Model Inference

Keywords ATTENTION MECHANISM, COMPRESSION, KV CACHE, LARGE LANGUAGE MODEL, OPTIMIZATION, TRANSFORMER

Reference persons ALESSIO BURRELLO, DANIELE JAHIER PAGLIARI

External reference persons Luca Benfenati

Research Groups DAUIN - GR-06 - ELECTRONIC DESIGN AUTOMATION - EDA

Thesis type EXPERIMENTAL, RESEARCH, SOFTWARE DEVELOPMENT

Description Large Language Models (LLMs), such as GPT, BERT, and Llama, have transformed natural language processing by enabling human-like text understanding and generation. Much attention has focused on optimizing LLM training, but training cost is paid once per model, whereas inference cost is incurred every time the model is used. Inference optimization is therefore arguably even more critical.

The transformer architecture in LLMs stacks multiple layers, each containing several attention heads, to capture dependencies in text. During generation, a Key-Value (KV) cache typically stores the keys and values of previous tokens so that they are not recomputed at every step. However, not all of this complexity is essential for every task, and inference compute and memory demands grow with sequence length and model depth.
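To make the growth with sequence length and depth concrete, the sketch below estimates the size of a KV cache. The function name `kv_cache_bytes` and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from this proposal.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    Each layer stores one key and one value vector of size
    num_kv_heads * head_dim for every token of every sequence.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim  # keys + values
    return (num_layers * per_token_per_layer * seq_len
            * batch_size * bytes_per_value)


# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```

Under these assumptions the estimate is linear in both `seq_len` and `num_layers`, which is why reducing layers or sharing the cache across layers directly lowers the memory footprint.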

This thesis proposes simplifying the transformer architecture by removing components that may not contribute significantly to specific tasks. Key strategies include:
• Selective Attention Head Pruning: pruning redundant heads that add minimal value to reduce computational load without major accuracy loss.
• Layer Reduction: streamlining the model by removing less relevant layers to cut down memory and processing requirements.
• Shared KV Cache Across Layers: reducing redundancy by sharing the KV cache across layers, particularly where cross-layer dependencies are less critical.
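The three strategies above can be sketched as simple selection rules over head and layer indices. This is a minimal illustration only: the helper names, the importance scores, and the choice of grouping consecutive layers for cache sharing are all assumptions, and how importance is actually measured is left open by the proposal.

```python
def heads_to_prune(importance, prune_fraction):
    """Return indices of the least important heads.

    importance: one score per head (higher = more useful). The scores
    are assumed to be given; computing them is not shown here.
    """
    num_prune = int(len(importance) * prune_fraction)
    ranked = sorted(range(len(importance)), key=lambda h: importance[h])
    return sorted(ranked[:num_prune])


def layers_to_keep(num_layers, drop):
    """Return the layer indices that remain after removing `drop`."""
    dropped = set(drop)
    return [i for i in range(num_layers) if i not in dropped]


def shared_kv_owner(num_layers, group_size):
    """Map each layer to the layer whose KV cache it reuses.

    Consecutive layers are grouped (an assumed policy); the first
    layer of each group owns the cache and the others read from it.
    """
    return [(i // group_size) * group_size for i in range(num_layers)]


# Toy example with made-up importance scores for 8 heads.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.8, 0.2]
print(heads_to_prune(scores, prune_fraction=0.25))  # [1, 3]
print(layers_to_keep(num_layers=6, drop=[2, 4]))    # [0, 1, 3, 5]
owners = shared_kv_owner(num_layers=6, group_size=2)
print(owners)                                       # [0, 0, 2, 2, 4, 4]
print(len(set(owners)) / len(owners))               # 0.5
```

In this toy setting, sharing the cache within groups of two layers halves the number of caches that must be stored, while pruning a quarter of the heads removes the two lowest-scoring ones.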

The primary objectives of this thesis are:
1. To analyze the transformer architecture and identify which components contribute the least to performance for specific tasks.
2. To develop and evaluate methods for attention head pruning, layer reduction, and shared KV cache across layers.
3. To benchmark these strategies for inference speed, memory efficiency, and impact on task accuracy.
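A benchmark along the lines of objective 3 could be organized as in the following sketch, which reports latency, peak memory, and accuracy for any callable. The `benchmark` helper and the toy "models" are hypothetical stand-ins; `tracemalloc` only tracks Python-level allocations and adds timing overhead, so a real evaluation of an LLM would need framework-specific memory and timing tools instead.

```python
import time
import tracemalloc


def benchmark(fn, inputs, repeats=5):
    """Measure mean latency, peak Python memory, and accuracy of fn.

    fn: callable mapping one input to a prediction.
    inputs: list of (x, expected) pairs.
    """
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        predictions = [fn(x) for x, _ in inputs]
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    correct = sum(p == y for p, (_, y) in zip(predictions, inputs))
    return {
        "latency_s": elapsed / (repeats * len(inputs)),
        "peak_bytes": peak,
        "accuracy": correct / len(inputs),
    }


# Toy stand-ins for a baseline and a simplified model (illustrative only).
data = [(x, x % 3) for x in range(10)]
baseline = lambda x: x % 3
simplified = lambda x: x % 3 if x < 8 else 0

for name, model in [("baseline", baseline), ("simplified", simplified)]:
    result = benchmark(model, data)
    print(name, result["accuracy"])
```

Running the same harness on the original and the simplified model keeps the comparison fair, since both are measured on identical inputs and with the same number of repeats.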

Required skills Proficiency in Python is required. Familiarity with Deep Learning, PyTorch and HuggingFace libraries is a plus.

Deadline 31/10/2025