Bachelor Theses at the Chair of Cognitive Systems (Prof. Dr. Andreas Zell)

Students who want to write a bachelor thesis should have attended at least one lecture by Prof. Zell and passed it with a good, or at least satisfactory, grade. Alternatively, they may have obtained the relevant background knowledge for the thesis from other, similar lectures.

Open Topics

Generating synthetic data from GTAV for person detection

Mentor: Yitong Quan

Email: yitong.quan at

This thesis is related to the SafeAI project, which aims to provide additional security to a system, e.g., by using sensor data of the environment to detect nearby persons. A robust object detection system should ideally be able to handle special cases such as persons being partially occluded by other objects. For example, in a real farming scenario, persons inside the field are very likely occluded by the plants.

However, generating a data set from real scenarios to train such a detector requires vast effort for data collection and labeling.

This thesis aims to shortcut the path to building such a data set by collecting data from the game GTAV with the plugin DeepGTAV, where the labeling (bounding boxes and masks) comes almost for free. To this end, the student is expected to complete the following tasks:

1. Set up reasonable environment variables, e.g., poses and trajectories of the camera and the persons in the files, as well as the time of day, season, and weather conditions.

2. Collect data (RGB images, depth images, and person labels) from the game engine.

3. Perform a statistical analysis of the collected data set.

4. Train publicly available models on our synthetic data set and compare their performance to that of models trained on other public real-world data sets.
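As an illustration of the statistical-analysis step, the sketch below computes simple statistics over a toy list of person annotations. The annotation schema here is purely hypothetical and would depend on how the DeepGTAV exports are post-processed.

```python
from collections import Counter

def bbox_stats(annotations):
    """Summarize a list of person annotations.

    Each annotation is assumed (hypothetically) to look like
    {"bbox": [x, y, w, h], "occluded": bool}; the exact schema
    depends on how the DeepGTAV exports are post-processed.
    """
    areas = [a["bbox"][2] * a["bbox"][3] for a in annotations]
    occ = Counter(a["occluded"] for a in annotations)
    return {
        "num_boxes": len(annotations),
        "mean_area": sum(areas) / len(areas) if areas else 0.0,
        "occluded_fraction": occ[True] / len(annotations) if annotations else 0.0,
    }

# Toy example with three synthetic boxes:
anns = [
    {"bbox": [10, 20, 30, 60], "occluded": False},
    {"bbox": [5, 5, 20, 40], "occluded": True},
    {"bbox": [0, 0, 10, 20], "occluded": True},
]
stats = bbox_stats(anns)
```

Statistics like occlusion fraction and box-size distribution are also exactly what one would compare against the real data sets in the final task.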

The requirements are knowledge in Python and PyTorch, as well as experience in training neural networks.



Event-camera, camera and robot arm calibration

Mentor: Andreas Ziegler

Email: andreas.zieglerspam

The cognitive systems group at the University of Tübingen uses a table tennis robot system to conduct research on various topics around robotics, control, computer vision, machine learning and reinforcement learning. So far the system uses up to five cameras for the perception pipeline. Recently, the group added event-based cameras to the sensor suite.

Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of μs), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision.
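To give a concrete sense of the output described above, the following minimal sketch accumulates an event stream into a signed 2D frame. The (t, x, y, polarity) row layout is an assumption for illustration; real event camera drivers each use their own format.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate a stream of events into a signed 2D frame.

    `events` is an (N, 4) array of (t, x, y, polarity) rows with
    polarity in {-1, +1}. Each event adds its sign at its pixel,
    so opposite-polarity events at the same location cancel out.
    """
    frame = np.zeros((height, width), dtype=np.int32)
    for _, x, y, p in events:
        frame[int(y), int(x)] += int(p)
    return frame

# Three events; the first two hit the same pixel with opposite polarity:
ev = np.array([
    [0.001, 2, 3, +1],
    [0.002, 2, 3, -1],
    [0.003, 5, 1, +1],
])
frame = events_to_frame(ev, height=8, width=8)
```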

With these new sensors in place, the question arises how the whole system should be calibrated. Eye-to-hand calibration (calibration between the camera and the robot arm) and camera calibration are well-studied topics in the literature. However, since event-based cameras are still relatively new, there is only very little literature on event-based camera calibration, especially for eye-to-hand calibration.

In a first step, the student should study the state-of-the-art calibration methods relevant for our table tennis robot system. Based on this, the goal of the thesis is to develop new methods for a calibration toolbox that allows the system to be calibrated automatically.
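For orientation, eye-to-hand calibration is classically posed as solving A·X = X·B for the unknown camera-to-gripper transform X. The numpy sketch below (all transforms are made-up toy values) constructs consistent gripper and camera motions from a known X and checks the residual; a real toolbox would instead estimate X from measured poses, e.g. with OpenCV's `cv2.calibrateHandEye`.

```python
import numpy as np

def rot_z(theta):
    """Homogeneous 4x4 rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

def translate(x, y, z):
    """Homogeneous 4x4 translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# Hypothetical ground-truth hand-eye transform X (camera w.r.t. gripper).
X = rot_z(0.3) @ translate(0.1, -0.05, 0.2)

# A relative gripper motion A; the corresponding camera motion
# must satisfy B = X^-1 A X, which is exactly the A X = X B relation.
A = rot_z(0.7) @ translate(0.2, 0.0, 0.0)
B = np.linalg.inv(X) @ A @ X

residual = np.linalg.norm(A @ X - X @ B)
```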

Requirements: Familiarity with "traditional" computer vision, C++ and/or Python

Drawing People with a Robot Arm

Mentor: Mario Laux

Email: mario.lauxspam

Description: In his recently completed bachelor thesis, Adrian Müller developed a system that takes the image of a person, performs line detection operations on the image, converts the binarized image to vector line segments, and finally draws the line segments with a robot arm. While this works well for features with high contrast, features with low contrast, like the nose in frontal images, are not well detected. The aim of this thesis is to train a deep neural network to recognize typical features of a human face in an image and to convert them into a line drawing sketch, improving the existing system. The sketch should then be drawn on a whiteboard using a Franka Emika Panda robot arm.
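The raster-to-vector step in the existing pipeline can be illustrated with a deliberately simplified sketch: it extracts horizontal runs of foreground pixels from a binarized image as line segments. A real vectorizer traces curves in arbitrary directions; this row-wise version only shows the principle.

```python
import numpy as np

def rows_to_segments(binary):
    """Convert a binarized image into horizontal line segments.

    Returns a list of ((x0, y), (x1, y)) pixel pairs, one per
    contiguous run of foreground pixels in each row.
    """
    segments = []
    for y, row in enumerate(binary):
        x, n = 0, len(row)
        while x < n:
            if row[x]:
                start = x
                while x < n and row[x]:
                    x += 1
                segments.append(((start, y), (x - 1, y)))
            else:
                x += 1
    return segments

img = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
], dtype=bool)
segs = rows_to_segments(img)
```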

Requirements: C++, DNN, ROS

Implementation of Quantization Algorithms for Model Compression

Mentor: Rafia Rahim


Description: Deep-neural-network-based algorithms have brought huge accuracy improvements to stereo vision. However, they result in large models with long inference times. Our goal here is to implement algorithms for quantization and training of deep stereo vision models for model compression. To this end, one part will involve implementing algorithms for quantizing and compressing existing state-of-the-art deep stereo algorithms during training. The second part will focus on how to exploit model quantization and compression at inference time.
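As a starting point, the core operation behind quantization-aware training can be sketched as a uniform symmetric quantize-dequantize step. This minimal numpy illustration is not tied to any particular stereo model; in actual training, gradients would additionally flow through the rounding via a straight-through estimator.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulated uniform symmetric quantization (quantize-dequantize).

    Weights are rounded to a low-bit integer grid and mapped back to
    floats, so the rest of the network sees quantization error during
    training while staying in floating point.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0])
w_q, scale = fake_quantize(w, num_bits=8)
```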

Requirements: good programming skills, deep learning knowledge.

Knowledge Distillation for Training a Lean Student Stereo Network

Mentor: Rafia Rahim


Description: Knowledge distillation is a way of transferring model capabilities from a deep, computationally expensive network to a lean, compact, and computationally efficient student network. The goal here is to explore knowledge distillation methods for training a lean student stereo network by distilling knowledge from a state-of-the-art 3D teacher network. To this end, one will experiment with different knowledge distillation methods for training student networks.
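The central ingredient of classical distillation is a KL-divergence loss between temperature-softened teacher and student distributions (Hinton-style, scaled by T²). A minimal numpy sketch with toy logits; in practice this term is mixed with the ordinary supervised loss.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)   # teacher "soft targets"
    q = softmax(student_logits, T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

# Identical logits give zero loss; disagreeing logits give a larger loss.
loss_same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_diff = distillation_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])
```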

Requirements: good programming skills, deep learning knowledge.

Multi-sensor object detection on UAVs using RGB and thermal footage

Mentor: Benjamin Kiefer


While RGB cameras have high resolution, lighting conditions severely affect object detection performance from a UAV's point of view. Thermal cameras, on the other hand, are very robust to different lighting conditions but have low resolution.

In this thesis, you should explore how the two sensors can be leveraged simultaneously to improve the performance of an object detector onboard a UAV in a maritime search and rescue scenario. To this end, you are given video footage of the same scenes shot simultaneously with an RGB and a thermal camera. By adapting simple out-of-the-box object detectors, the goal is to explore whether using an additional (thermal) channel helps detection performance. This should be evaluated in a variety of experiments.
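One simple way to use both sensors is "early fusion": stack the thermal image onto the RGB image as a fourth input channel for a detector whose first convolution is widened to four channels. The numpy sketch below assumes the two images are already registered and that the resolutions differ by integer factors (both assumptions; real footage needs alignment first).

```python
import numpy as np

def early_fusion(rgb, thermal):
    """Stack a lower-resolution thermal image onto an RGB image as a
    fourth channel. The thermal image is upsampled by nearest-neighbor
    replication; inputs are assumed to be registered.
    """
    h, w, _ = rgb.shape
    th, tw = thermal.shape
    assert h % th == 0 and w % tw == 0, "expects integer scale factors"
    up = np.repeat(np.repeat(thermal, h // th, axis=0), w // tw, axis=1)
    return np.concatenate([rgb, up[..., None]], axis=2)

rgb = np.zeros((4, 4, 3))
thermal = np.array([[0.2, 0.8],
                    [0.5, 0.1]])
fused = early_fusion(rgb, thermal)
```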

Requirements: Knowledge in Deep learning and Computer Vision, Python (PyTorch)

How can zooming in improve object detection on UAVs?

Mentor: Benjamin Kiefer


Object detection on UAVs is hard because objects of interest are often very small due to high flying altitudes. As a result, objects are barely visible in the footage: the ground sample distance is too large to capture them in detail.
In this thesis, you should explore the use of data coming from a camera with zoom-in functionality. Given data samples to experiment with, you should find ways to include the zoomed-in data to enhance object detection performance.
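One recurring bookkeeping step when working with zoomed-in data is mapping detections from the zoomed crop back to full-frame coordinates, so results from both views can be compared or merged. A minimal sketch (the parameter names and crop model are illustrative only):

```python
def crop_to_frame(box, crop_origin, zoom):
    """Map a detection box from a zoomed-in crop back to full-frame
    pixel coordinates.

    `box` is (x0, y0, x1, y1) in crop pixels; the crop covers the
    full-frame region starting at `crop_origin` with magnification
    `zoom`, so crop coordinates shrink by 1/zoom in the full frame.
    """
    ox, oy = crop_origin
    x0, y0, x1, y1 = box
    return (ox + x0 / zoom, oy + y0 / zoom,
            ox + x1 / zoom, oy + y1 / zoom)

# A box detected in a 2x zoomed crop whose top-left corner maps to
# full-frame pixel (100, 50):
full_box = crop_to_frame((40, 20, 80, 60), crop_origin=(100, 50), zoom=2.0)
```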

Requirements: Knowledge in Deep learning and Computer Vision, Python (PyTorch)

On the Necessity of Anchors in Object Detection

Mentor: Martin Meßmer

Email: martin.messmerspam

For a long time, anchor-based object detection has been the ne plus ultra in the research community. Many well-known one-shot detectors, like YOLOv2 and v3, SSD, and most recently EfficientDet, employ anchors with great success. Since FCOS (Zhi Tian et al., 2019), however, doubts have formed about the necessity of anchors in object detection. In this thesis, the student should examine both approaches theoretically and practically and compare them.
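To make the contrast concrete: the sketch below generates the dense set of anchor boxes that an anchor-based detector places on every feature-map cell; anchor-free detectors like FCOS instead regress box extents directly from each cell's single center point. The sizes and ratios here are illustrative defaults, not taken from any particular detector.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x0, y0, x1, y1) centered on every cell of a
    feature map, one per (size, aspect-ratio) pair. All ratios of a given
    size share the same area (size**2)."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# 2x2 feature map with stride 16 -> 2*2 cells * 2 sizes * 3 ratios anchors.
a = make_anchors(feat_h=2, feat_w=2, stride=16)
```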

Requirements: basic deep learning knowledge, Python, good English or German

Racket tracking and 3D position estimation from a stereo event camera

Mentor: Thomas Gossard


Description: Professional table tennis players use the stroke movement to predict the ball's return trajectory and spin before it is even hit. Using a Convolutional Pose Machine to get the body pose gives a general idea of the stroke but is not enough. Indeed, there are different possible grips (Penhold, Shakehand, and Seemiller) and the wrist angles are hard to estimate. These factors have a huge impact on the returned ball. Thus, estimating the 3D position of the racket is very useful. Frame-based camera estimation has already been implemented [1]. However, due to the high speed of the movement, the racket appears quite blurry. To compensate for that, event cameras [2] can be used for sharper edge detection.

The objective of the thesis is to first implement a stereo event-camera simulator using ESIM [3] to generate data with a moving 3D racket model. The racket should then be tracked [4] and its 3D position estimated (using either machine learning or classic computer vision [4]). The developed algorithm will finally be tested on real data generated by our table tennis setup.
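The classic computer-vision route to the 3D position is stereo triangulation. The numpy sketch below implements the standard linear (DLT) triangulation of one point from two views; the camera intrinsics and baseline are made-up toy values, not those of our setup.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two pixel
    observations x1, x2 under 3x4 projection matrices P1, P2.
    Each observation contributes two rows to a homogeneous system
    A X = 0, solved via SVD."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two hypothetical rectified cameras with a 0.2 m baseline along x:
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

# Project a known 3D point into both views, then recover it.
X_true = np.array([0.1, -0.05, 2.0])
h1 = P1 @ np.append(X_true, 1.0)
x1 = h1[:2] / h1[2]
h2 = P2 @ np.append(X_true, 1.0)
x2 = h2[:2] / h2[2]
X_est = triangulate(P1, P2, x1, x2)
```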

Requirements: Python, Pytorch, Computer Vision (triangulation,...), ROS experience is useful in order to use rviz


Exploiting dependencies of output variables in neural networks for multi-output regression

Mentor: Valentin Bolz

E-Mail: valentin.bolz ( a t )

Description: Neural networks are powerful tools for performing regressions with multiple output variables. In this case, the neural network is trained on a data set with multiple output variables, each of which has a continuous state space. In this thesis, we aim to find different methods to incorporate the information about the dependencies of the output variables into the training process of the neural network. This could include loss function modifications and their influence on the gradient calculation or weight sharing between dependent variables.
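As one concrete example of a loss-function modification, a known dependency between outputs can be added as a penalty term. The sketch below assumes a purely hypothetical linear dependency y2 = 2·y1 between two outputs; predictions that violate it are penalized even when their per-output errors are similar.

```python
import numpy as np

def dependent_mse(pred, target, weight=0.1):
    """MSE plus a penalty encoding a known (hypothetical) dependency
    y2 = 2 * y1 between the two output variables. The penalty pulls
    predictions toward the dependency manifold."""
    mse = np.mean((pred - target) ** 2)
    dependency = np.mean((pred[:, 1] - 2.0 * pred[:, 0]) ** 2)
    return mse + weight * dependency

target = np.array([[1.0, 2.0], [0.5, 1.0]])       # satisfies y2 = 2*y1
consistent = np.array([[1.1, 2.2], [0.4, 0.8]])   # on the manifold
inconsistent = np.array([[1.1, 1.0], [0.4, 2.0]]) # off the manifold
loss_c = dependent_mse(consistent, target)
loss_i = dependent_mse(inconsistent, target)
```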

Requirements: good mathematical foundation, programming skills (Python), basic knowledge in neural networks

Dealing with different output scales in multi-output regression

Mentor: Valentin Bolz

E-Mail: valentin.bolz ( a t )

Description: In multi-output regression, multiple non-independent variables, each with a continuous state space, are approximated. A fairly simple method is linear regression, while more advanced methods can be realized with neural networks. While it is well known that many regression methods work best on normalized input variables, the scale of the output variables also plays an important role. The goal of this thesis is to investigate the robustness of regression methods with respect to the scaling of the output values and to find out how the methods behave under different scaling approaches. This requires a thorough look at the mathematical background of the regression methods.
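The baseline scaling approach to compare against is per-output standardization, the output-side analogue of input normalization: each target variable is mapped to zero mean and unit variance before fitting and predictions are mapped back afterwards. A minimal numpy sketch:

```python
import numpy as np

class TargetScaler:
    """Standardize each output variable to zero mean and unit variance
    before regression and map predictions back afterwards."""

    def fit(self, Y):
        self.mean = Y.mean(axis=0)
        self.std = Y.std(axis=0)
        return self

    def transform(self, Y):
        return (Y - self.mean) / self.std

    def inverse_transform(self, Z):
        return Z * self.std + self.mean

Y = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])   # outputs on very different scales
scaler = TargetScaler().fit(Y)
Z = scaler.transform(Y)
```

After scaling, both output dimensions contribute comparably to a squared-error loss, which is exactly the effect whose absence this thesis would study.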

Requirements: very good mathematical foundation, programming skills (Python), basic knowledge in neural networks