Mentor: Dominik Hildebrand
Email: Dominik.Hildebrand@uni-tuebingen.de
Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, whether you need to draft an email, translate a paragraph, or get a quick summary. They are thus becoming an ever larger part of our everyday lives by making everyday tasks faster and easier.
To compare LLMs in terms of ability, they are often evaluated using standardized benchmarks such as “ARC”, “HellaSwag”, or “MMLU”, as well as benchmark compilations like “HELM”. However, LLMs are, as their name suggests, indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B.
As such, running inference with these models is expensive. For instance, the 671B model (called “DeepSeek R1”) requires, without optimization and as a lower bound, ~1.3 TB of (GPU) memory, which corresponds to 16 H100 GPUs just to load it (market price as of April 2025: ~30,000€ per unit). And comprehensive benchmarking requires extensive amounts of inference…
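The ~1.3 TB figure follows from simple back-of-the-envelope arithmetic. A minimal sketch, assuming 2 bytes per parameter (FP16/BF16 weights) and 80 GB of HBM per H100; activations and KV-cache would come on top:

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the model weights (no activations, no KV-cache)."""
    return num_params * bytes_per_param / 1e9  # decimal GB

# DeepSeek R1: 671B parameters at an assumed 2 bytes (16 bit) per parameter
weights_gb = estimate_weight_memory_gb(671e9)
print(f"Weights alone: ~{weights_gb / 1000:.2f} TB")  # ~1.34 TB

# An H100 has 80 GB of HBM; how many are needed just to load the weights?
H100_GB = 80
print(f"H100s needed: {weights_gb / H100_GB:.1f}")  # ~16.8, i.e. on the order of 16-17 GPUs
```

Note that this is a lower bound: serving the model additionally needs memory for activations and the KV-cache, which grows with batch size and context length.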
To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Using such methods (mainly a subset called ‘quantization’), Unsloth shrank the R1 model enough to fit it onto a single consumer-grade GPU (RTX 4090). However, while this potentially addresses hardware concerns, it often comes at the cost of (much) slower inference and thus longer benchmarking times.
Further, compression methods can feature a significant number of hyperparameters, which necessitates a grid search. Ideally, there should be a compilation of benchmark subsets that is small but allows one to accurately estimate an LLM's performance on the full benchmarks.
Creating such subsets is the goal of this thesis.
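To convey the core intuition (this is a toy illustration on synthetic data, not the tinyBenchmarks method itself): given a matrix recording which models answered which benchmark items correctly, one can stratify items by empirical difficulty, keep one “anchor” item per stratum, and estimate each model's full-benchmark accuracy from the anchors alone. All numbers and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correctness matrix: rows = models, columns = benchmark items.
# Entry is 1 if the model answered the item correctly (Rasch-style toy data).
n_models, n_items = 30, 400
ability = rng.normal(0.0, 1.0, size=(n_models, 1))
difficulty = rng.normal(0.0, 1.0, size=(1, n_items))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
correct = (rng.random((n_models, n_items)) < p_correct).astype(float)

def pick_anchor_items(correct: np.ndarray, k: int) -> np.ndarray:
    """Stratify items by empirical difficulty and keep one representative per stratum."""
    item_acc = correct.mean(axis=0)    # fraction of models solving each item
    order = np.argsort(item_acc)       # items sorted from hard to easy
    strata = np.array_split(order, k)  # k contiguous difficulty strata
    # Within each stratum, take the item closest to the stratum's mean difficulty.
    return np.array([s[np.argmin(np.abs(item_acc[s] - item_acc[s].mean()))] for s in strata])

anchors = pick_anchor_items(correct, k=40)   # 40 anchor items instead of 400
estimate = correct[:, anchors].mean(axis=1)  # per-model accuracy on the subset
truth = correct.mean(axis=1)                 # per-model accuracy on the full benchmark
print(f"mean absolute estimation error: {np.abs(estimate - truth).mean():.3f}")
```

Methods like tinyBenchmarks go further, e.g. by fitting item-response-theory parameters rather than raw difficulty; the sketch only shows why a well-chosen small subset can preserve full-benchmark scores.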
Specifically, the student should
- Gain an overview of existing benchmarks used to compare popular LLMs (Llama, Qwen, DeepSeek, …)
- Do a literature review of methods for creating representative subsets of such benchmarks (starting points can be e.g. “tinyBenchmarks” or “Reliable and Efficient Amortized Model-based Evaluation”)
- Compile representative subsets (validation and test) of the benchmarks from (1.) using the most promising method(s) found in (2.), keeping the subsets as small as possible
Necessary Background:
- You know what PyTorch is
- You can work independently
- You can follow basic instructions such as those found under “Contact Details”
Recommended Background:
- Experience with at least one LLM inference backend (e.g. Transformers, vLLM, …)
- Familiarity with cluster-based computing (e.g. SLURM)
- Basic understanding of the transformer architecture (e.g. attention mechanism, auto-regressive decoding, KV-cache, …)
- Solid grasp on statistics / data processing
Contact Details:
- Please contact me only via e-mail
- Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have taken)
- I try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.