Mentor: Dominik Hildebrand
Email: Dominik.Hildebrand@uni-tuebingen.de
Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear, helpful responses, whether you need to draft an email, translate a paragraph, or get a quick summary. They are therefore becoming an ever larger part of our everyday lives by making routine tasks faster and easier.
However, LLMs - as their name suggests - are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B. As such, running inference with these models is expensive. For instance, the 671B model (called “DeepSeek R1”) requires ~1.3 TB of (GPU) memory (without optimization, as a lower bound), which means 16 H100 GPUs just to load it (market price as of April 2025: ~30,000€ per unit). Thus, these models are usually run via cloud-based solutions, where your query is sent to and processed by a server cluster.
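The memory figure follows from simple arithmetic. A rough sketch (assuming 16-bit weights, i.e. 2 bytes per parameter; activations and the KV-cache push the real requirement higher):

```python
# Back-of-the-envelope check: memory for the weights alone of a
# 671B-parameter model stored in 16-bit precision.
params = 671e9          # 671B parameters
bytes_per_param = 2     # fp16 / bf16
weights_tb = params * bytes_per_param / 1e12
print(f"Weights alone: ~{weights_tb:.2f} TB")  # ~1.34 TB

# An H100 has 80 GB of memory, so on the order of 16-17 units
# are needed just to hold the weights.
print(f"In 80 GB H100s: ~{weights_tb * 1e12 / 80e9:.1f} units")
```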
This creates a number of issues for the user, such as potentially high latency, no way to query the model offline, and privacy concerns regarding both your own data and that of others. For instance, using ChatGPT to summarize your chat messages means giving away not just your data but also that of the other participants.
To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Ideally, this allows running those models locally, even in resource-constrained settings (on so-called “edge devices” like a smartphone). However, the effectiveness of such methods should be verified empirically by actually deploying them on edge devices.
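As a rough illustration of why “shrinking” helps (a sketch of the memory side only; real compression methods such as quantization also affect accuracy and speed, which is exactly what deployment experiments measure):

```python
# Weight-memory footprint of a 7B-parameter model at different
# numeric precisions: quantization shrinks the weights roughly
# in proportion to the bit width.
params = 7e9
footprints_gb = {bits: params * bits / 8 / 1e9 for bits in (16, 8, 4)}
for bits, gb in footprints_gb.items():
    print(f"{bits}-bit: ~{gb:.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```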
The goal of this thesis is to facilitate the deployment of various LLMs on an edge device, namely the Nvidia Orin AGX Development Kit.
Specifically, the student should
- Set up a working environment on the edge device
- Use said environment to run a selection of LLMs (e.g. Llama-3.2-1B, Llama-3.2-3B, Mistral-7B, …)
- Benchmark inference speed
- (Optional:) Apply various compression techniques to shrink the models deployed in the second step
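A minimal sketch of what the benchmarking step could look like. The helper below is hypothetical; the commented usage assumes a model loaded via the Hugging Face transformers library (names are illustrative):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time one generation call and return throughput in tokens/s.

    `generate_fn` is any zero-argument callable that produces
    `n_new_tokens` tokens, e.g. a wrapper around `model.generate`
    from the transformers library.
    """
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# On the device, usage could look like (names illustrative):
#   throughput = tokens_per_second(
#       lambda: model.generate(**inputs, max_new_tokens=128), 128)
```

In practice, one would average over several runs and discard a warm-up call, since the first invocation often includes one-time setup costs.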
Necessary Background:
- You can work independently
- You can follow basic instructions such as those found under “Contact Details”
Recommended Background:
- You have used a package manager like Anaconda before
Ideal Background:
- You have some experience using the transformers library
- You know what CUDA is
- You have a basic understanding of the transformer architecture (e.g. attention mechanism, auto-regressive decoding, KV-cache, …)
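For intuition on the KV-cache mentioned in the last point, its memory footprint can be estimated from a model's configuration. The numbers below are illustrative placeholders, not taken from a specific checkpoint:

```python
# Rough KV-cache size estimate: for each generated token, every
# layer stores a K and a V tensor of kv_heads * head_dim entries,
# here in fp16 (2 bytes each).
layers, kv_heads, head_dim = 16, 8, 64   # illustrative values
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V, fp16
context_len = 4096
cache_mb = bytes_per_token * context_len / 1e6
print(f"KV-cache at {context_len} tokens: ~{cache_mb:.0f} MB")  # ~134 MB
```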
Contact Details:
- Please contact me only via e-mail
- Attach your Transcript of Records (feel free to hide your grades; I only want to see which lectures you have taken)
- I try to get back to you within a week. If I don't, please contact me again (ideally by just resending your original mail). If you don't, I'll assume you are no longer interested.