Master Theses at the Chair of Cognitive Systems (Prof. Dr. Andreas Zell)

Students who want to write a master's thesis should have attended at least one of Prof. Zell's lectures and passed it with a good or at least satisfactory grade. Alternatively, they may have obtained the relevant background knowledge for the thesis from other, similar lectures.

Eye-Tracking-Based Interaction System for Robot Manipulation

Mentor: Yuzhi Lai

Email: yuzhi.lai@uni-tuebingen.de

Traditional interaction methods such as gesture and voice commands present significant challenges. Gesture-based interactions require users to perform precise movements within the field of view of a camera, restricting their mobility and range of interaction. On the other hand, speech-based commands can be ambiguous, as verbal descriptions often lack the spatial specificity needed for accurate robotic response. In contrast, gaze-based interaction offers a silent, unobtrusive, and intuitive alternative that enables hands-free control. By leveraging AI-driven eye-tracking technology, we aim to develop a gaze-based control system that enhances accessibility and interaction efficiency, providing a versatile solution for users in both general and specialized settings.


This project aims to develop a gaze-driven dialogue system for ARIA glasses that allows users to form and communicate simple sentences using only eye movements. Unlike existing assistive communication tools that rely on manual input, predefined selection grids, or GUIs, this system will leverage real-time eye tracking to give users an intuitive way to collaborate with robots. The goal is a voice- and hands-free communication solution that significantly enhances the quality of life of individuals with severe motor and speech impairments.

Building and Training a Grasp-Point Detector

Mentor: David Ott
E-Mail: david.ott@uni-tuebingen.de

The goal of this thesis is to implement and train a modern deep learning-based grasp-point detection model for robotic arms, inspired by the AnyGrasp architecture.

Grasp-point detection is a core problem in robotic manipulation, where the objective is to determine 6-DoF poses in 3D space that a robotic gripper can move to in order to successfully grasp an object. A common pipeline involves detecting these grasp-points from raw 3D point-cloud data and then using off-the-shelf motion planning and control frameworks to execute the grasp. A current state-of-the-art approach is AnyGrasp. Unfortunately, no fully open-source training pipeline has been released for it, and it is based on outdated dependencies and frameworks.

This thesis will focus on re-implementing a grasp-point detection architecture similar to AnyGrasp in a modern PyTorch-based machine learning stack. The implementation should be modular and clean, and it should follow current PyTorch best practices. The model will be trained and evaluated on public datasets relevant to grasp-point detection, such as GraspNet or similar benchmarks. The model will then be integrated into a real-world robotic grasping pipeline using a Franka Emika Panda robot arm.

Depending on progress and interest, the work may be extended to improve the performance of the system by introducing newer components (e.g., updated feature extractors, point-cloud encoders, or transformer-based modules), or by augmenting the dataset with more challenging objects and scenes.

Necessary Prerequisites: Solid understanding of PyTorch and deep learning training pipelines, strong Python programming skills
Useful but not required Skills: Experience with SLURM, Docker, distributed ML training, robotics, 3D coordinate transforms, point-cloud data processing, ROS / ROS 2

Building a Robust Point Cloud Scene Representation for Robot Arms

Mentor: David Ott
E-Mail: david.ott@uni-tuebingen.de

The goal of this thesis is to investigate and evaluate techniques for building robust, persistent point cloud representations of the world that can serve as input to robot arm control systems. Such representations go beyond simple RGB(D) camera inputs by integrating observations over time and across viewpoints to create a more consistent and complete 3D understanding of the scene.

In many robotic manipulation scenarios, intermediate sensor representations are essential. Robot policies often require a reliable model of the world that handles occlusion, tracks object movement, and fuses multiple sensory inputs. Instead of using raw depth maps or RGBD frames, a robust point cloud can serve as a more expressive and stable input to downstream tasks such as grasp planning, motion execution, or object manipulation.

This thesis will explore and compare various techniques for generating such scene representations, focusing on fusing multiple point clouds obtained at different times and/or from different viewpoints into a single, coherent world-model. Particular attention will be paid to the following sub-problems:
- Point Cloud Registration: Aligning and fusing point clouds collected from different time steps or different camera viewpoints.
- Downsampling and Filtering: Efficiently reducing the size of point clouds while preserving relevant structural information.
- Handling Dynamic Objects: Updating the scene representation when objects are moved, and modeling uncertainty or “staleness” of unobserved areas.
- Sensor and Camera Calibration: Transforming point clouds into a common world frame, especially when using multiple RGBD sensors.
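
To make the registration and downsampling sub-problems above more concrete, the following is a minimal sketch in plain Python. It assumes that each camera pose is already known as a rotation matrix and a translation vector (in practice these would come from calibration or a registration algorithm), and it uses voxel-grid averaging as one possible downsampling strategy. The function names and the voxel size are illustrative assumptions and are not part of the thesis description.

```python
from collections import defaultdict


def transform_points(points, rotation, translation):
    """Map points from a camera frame into the world frame: p_world = R * p + t.

    `rotation` is a 3x3 nested list, `translation` a length-3 sequence.
    """
    transformed = []
    for x, y, z in points:
        transformed.append(tuple(
            rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
            for i in range(3)
        ))
    return transformed


def voxel_downsample(points, voxel_size):
    """Replace all points that fall into the same voxel by their centroid."""
    buckets = defaultdict(list)
    for point in points:
        key = tuple(int(coord // voxel_size) for coord in point)
        buckets[key].append(point)
    return [
        tuple(sum(axis) / len(bucket) for axis in zip(*bucket))
        for bucket in buckets.values()
    ]


def fuse(clouds_with_poses, voxel_size):
    """Merge several (points, rotation, translation) observations into one cloud."""
    merged = []
    for points, rotation, translation in clouds_with_poses:
        merged.extend(transform_points(points, rotation, translation))
    return voxel_downsample(merged, voxel_size)


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Two hypothetical observations: the second camera is shifted 1 m along x.
cloud_a = [(0.0, 0.0, 0.0), (0.02, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0)]
fused = fuse(
    [(cloud_a, IDENTITY, (0.0, 0.0, 0.0)), (cloud_b, IDENTITY, (1.0, 0.0, 0.0))],
    voxel_size=0.1,
)
```

This sketch ignores dynamic objects and staleness entirely; handling those would require keeping per-voxel timestamps or confidence values, which is part of what the thesis is meant to investigate.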

The resulting fused scene representations will be evaluated in terms of their robustness and usefulness for downstream robotic tasks. Depending on progress and interest, the student may implement simple evaluation tasks (e.g., pick-and-place) or integrate the system into a simulated or physical robot platform for testing.

Necessary Prerequisites: Python, Basic understanding of 3D geometry and linear algebra

Useful but not required Skills: C++, ROS / ROS2, experience with 3D coordinate transforms and point cloud processing, familiarity with robot arm platforms and RGB(D) sensors
 

Distilling Knowledge from Foundation Stereo Networks for Efficient and General-Purpose Depth Estimation

Mentor: Rafia Rahim

Email: rafia.rahim@uni-tuebingen.de

Recent advances in stereo vision have led to the emergence of "foundation" stereo networks—such as FoundationStereo, Stereo Anything, and MonSter—which demonstrate robust generalization and adaptability across diverse domains and scene types. However, these state-of-the-art models are typically very large, computationally intensive, and often impractical for deployment on edge devices or in latency-sensitive applications. This thesis investigates the use of knowledge distillation to compress these foundation models into lightweight, task-agnostic student networks, aiming to preserve their broad generalization and high performance while drastically reducing computational requirements. The research will explore multi-teacher distillation, where a student model learns from multiple foundation networks simultaneously, as well as specialized distillation losses tailored to stereo depth estimation (e.g., disparity-aware and confidence-aware losses). The thesis will benchmark distilled student models on a range of stereo datasets (KITTI, Middlebury, ETH3D, indoor/outdoor datasets) and assess performance not just on accuracy and runtime, but also on cross-domain robustness, thereby demonstrating the practical benefits of distilled foundation stereo models.
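
As an illustration of what a confidence-aware multi-teacher distillation loss could look like, here is a minimal sketch in plain Python. It is an assumption made for illustration, not the loss prescribed by the thesis: each teacher provides a per-pixel disparity and a per-pixel confidence, the distillation target is the confidence-weighted mean of the teacher disparities, and the student is penalized with a weighted L1 error. A real implementation would operate on PyTorch tensors instead of lists.

```python
def multi_teacher_target(teacher_disparities, teacher_confidences):
    """Return per-pixel targets and weights from several teachers.

    Both arguments are lists with one entry per teacher; each entry is a
    flat list of per-pixel values of equal length.
    """
    targets, weights = [], []
    per_pixel = zip(zip(*teacher_disparities), zip(*teacher_confidences))
    for pixel_disparities, pixel_confidences in per_pixel:
        total = sum(pixel_confidences)
        if total == 0:
            # No teacher is confident here, so the pixel is ignored.
            targets.append(0.0)
            weights.append(0.0)
        else:
            weighted = sum(d * c for d, c in zip(pixel_disparities, pixel_confidences))
            targets.append(weighted / total)
            # Mean teacher confidence serves as the pixel weight.
            weights.append(total / len(pixel_confidences))
    return targets, weights


def distillation_loss(student_disparity, teacher_disparities, teacher_confidences):
    """Confidence-weighted L1 distance between student and fused teacher target."""
    targets, weights = multi_teacher_target(teacher_disparities, teacher_confidences)
    weight_sum = sum(weights)
    if weight_sum == 0:
        return 0.0
    weighted_error = sum(
        w * abs(s - t) for s, t, w in zip(student_disparity, targets, weights)
    )
    return weighted_error / weight_sum


# Hypothetical two-pixel example with two teachers.
teachers = [[10.0, 20.0], [12.0, 24.0]]
confidences = [[1.0, 0.0], [1.0, 1.0]]
loss = distillation_loss([11.0, 22.0], teachers, confidences)
```

In this example the first pixel has target 11 and full weight, the second has target 24 and half weight, so only the second pixel contributes to the loss.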

 

Requirements: good programming skills, deep learning knowledge.
 

Efficient Evaluation of Large Language Models

Mentor: Dominik Hildebrand

Email: Dominik.Hildebrand@uni-tuebingen.de

Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, for example when you need to draft an email, translate a paragraph, or get a quick summary. They are thus becoming an ever larger part of our everyday lives by making routine tasks faster and easier.

To compare LLMs in terms of ability, they are often evaluated using standardized benchmarks such as “ARC”, “HellaSwag”, or “MMLU” as well as benchmark compilations like “HELM”. However, LLMs, as their name suggests, are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B.

As such, running inference with these models is expensive. For instance, the 671B model (“DeepSeek R1”) requires at least ~700 GB of GPU memory without optimization, which takes 8 H100 GPUs just to load (market price as of March 2026: ~40,000€ per unit). On top of that, comprehensive benchmarking requires extensive amounts of inference…
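
The order of magnitude of such memory requirements can be checked with simple arithmetic. The sketch below only counts the memory for the weights themselves (activations and the KV cache come on top), and the bit widths are illustrative assumptions rather than statements about how any particular model is stored.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


PARAMS_671B = 671_000_000_000

# Weights-only footprint for a few assumed storage precisions.
footprint = {
    bits: weight_memory_gb(PARAMS_671B, bits) for bits in (16, 8, 4)
}
for bits, gigabytes in footprint.items():
    print(f"{bits:>2}-bit weights: {gigabytes:8.1f} GB")
```

At one byte per parameter, the weights of a 671B model alone already occupy roughly 671 GB, which matches the order of magnitude quoted above.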

To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Using such methods (mainly a subset called ‘quantization’), Unsloth shrank the R1 model enough to fit it onto a single consumer-grade GPU (RTX 4090). However, while this potentially addresses hardware concerns, it often comes with the trade-off of (much) slower inference and thus longer benchmarking times.
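
To give an idea of what quantization does, the following minimal sketch implements plain symmetric absmax quantization in Python. This is a textbook scheme chosen for illustration; it is not the method Unsloth used, and the example weights are made up.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integer codes using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


example_weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_absmax(example_weights, bits=8)
restored = dequantize(codes, scale)

# The reconstruction error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(example_weights, restored))
```

Storing the integer codes plus one scale factor instead of 16-bit or 32-bit floats is what reduces the memory footprint. The price is the reconstruction error computed in the last line.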

Further, compression methods can have a significant number of hyperparameters, which necessitates a grid search and thus many benchmark runs. Ideally, there should be a compilation of benchmark subsets that is small but still allows the LLM's performance on the full benchmark to be estimated accurately.

Creating such subsets is the goal of this thesis.

Specifically, the student should

  1. Gain an overview of existing benchmarks used to compare popular LLMs (Llama, Qwen, DeepSeek, …)
  2. Do a literature review of methods for creating representative subsets of such benchmarks (starting points can be e.g. tinyBenchmarks or Reliable and Efficient Amortized Model-based Evaluation)
  3. Compile representative subsets (validation and test) of (1.) using the most promising method(s) found in (2.) that are as small as possible
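
As a point of reference for step 3, the simplest conceivable baseline is a uniformly random subset. The sketch below is not one of the methods named above; it only illustrates the basic idea of estimating full-benchmark accuracy from a subset and attaching an uncertainty to that estimate. The per-item results in `correct_flags` are made-up data for a hypothetical model.

```python
import math
import random


def sample_subset(num_items, subset_size, seed=0):
    """Draw a reproducible uniform random subset of benchmark item indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_items), subset_size))


def estimate_accuracy(correct_flags, subset_indices):
    """Estimate accuracy and its standard error from the selected items.

    The standard error uses the simple binomial formula and ignores the
    finite-population correction, so it is slightly conservative.
    """
    n = len(subset_indices)
    accuracy = sum(correct_flags[i] for i in subset_indices) / n
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy, standard_error


# Hypothetical benchmark with 1000 items, 70% of which the model answers correctly.
correct_flags = [1] * 700 + [0] * 300
subset = sample_subset(len(correct_flags), subset_size=100)
accuracy, standard_error = estimate_accuracy(correct_flags, subset)
print(f"estimated accuracy: {accuracy:.3f} +/- {standard_error:.3f}")
```

More sophisticated methods aim to reach the same estimation accuracy with far fewer items than random sampling needs, which is what makes them interesting for this thesis.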

Necessary Background:

  • You know what PyTorch is
  • You can work independently
  • You can follow basic instructions such as those found under “Contact Details” 

Recommended Background:

  • Experience with at least one LLM inference backend (e.g. Transformers, vLLM, …)
  • Familiar with cluster-based computing (e.g. SLURM)
  • Basic understanding of the transformer architecture (e.g. attention mechanism, auto-regressive decoding, kv-cache, …)
  • Solid grasp of statistics / data processing

Contact Details:

  • Please contact me only via e-mail
  • Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have attended)
  • I will try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.

In-depth Analysis and Optimization of LLM inference on the Edge

Mentor: Dominik Hildebrand

Email: Dominik.Hildebrand@uni-tuebingen.de

Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, for example when you need to draft an email, translate a paragraph, or get a quick summary. They are thus becoming an ever larger part of our everyday lives by making routine tasks faster and easier.

However, LLMs, as their name suggests, are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B. As such, running inference with these models is expensive. For instance, the 671B model (“DeepSeek R1”) requires at least ~700 GB of GPU memory without optimization, which takes 8 H100 GPUs just to load (market price as of March 2026: ~40,000€ per unit). These models are therefore usually deployed in the cloud, where your query is sent to and processed by a server cluster.

This raises a number of issues for the user, such as potentially high latency, no way to query the model offline, and privacy concerns regarding both your own and others' data. For instance, using ChatGPT to summarize your chat messages means you are giving away not just your data but also that of the other participants.

To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Ideally, this allows running those models locally and even in resource-constrained settings (on “edge devices” such as a smartphone). However, the effectiveness of such methods should be verified empirically through actual deployment on edge devices. A previous thesis has established a framework for deploying LLMs on the Nvidia Jetson AGX Orin Developer Kit. This thesis dives deeper into both hardware and software to establish what is going on “behind the scenes”.

Specifically, the student should

  1. Set up profiling software (e.g. Nvidia Nsight) on the edge device
  2. Use it to profile the framework established in the previous thesis
  3. Based on the results, identify and - ideally - fix potential bottlenecks
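
Before working with a full profiler, a coarse wall-clock measurement can already separate the time until the first token from the steady-state decoding speed. The following sketch is only an illustration: `fake_token_stream` is a hypothetical stand-in for whatever streaming interface the existing framework exposes, and the sleep durations are made up.

```python
import time


def measure_generation(token_stream):
    """Consume a token iterator and report coarse latency statistics."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if count == 0:
        return {"tokens": 0, "time_to_first_token_s": None, "decode_tokens_per_s": None}

    decode_time = end - first_token_time
    decode_rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {
        "tokens": count,
        "time_to_first_token_s": first_token_time - start,
        "decode_tokens_per_s": decode_rate,
    }


def fake_token_stream(num_tokens, prefill_s=0.005, per_token_s=0.001):
    """Hypothetical model: one slow prefill step, then fast decoding steps."""
    time.sleep(prefill_s)
    for index in range(num_tokens):
        if index > 0:
            time.sleep(per_token_s)
        yield index


stats = measure_generation(fake_token_stream(5))
print(stats)
```

Measurements like this show where time is spent at a high level, but not why. Answering the latter is the job of a kernel-level profiler on the device.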

Necessary Background:

  • You can work independently
  • You can follow basic instructions such as those found under “Contact Details” 

Recommended Background:

  • Has used a package manager like Anaconda before

Ideal Background:

  • Some experience using transformers, llama.cpp or MLC-LLM
  • Knows what CUDA is
  • Basic understanding of the transformer architecture (e.g. attention mechanism, auto-regressive decoding, kv-cache, …)

Contact Details:

  • Please contact me only via e-mail
  • Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have attended)
  • I will try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.

Agentic AI on the Edge

Mentor: Dominik Hildebrand

Email: Dominik.Hildebrand@uni-tuebingen.de

Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, for example when you need to draft an email, translate a paragraph, or get a quick summary. They are thus becoming an ever larger part of our everyday lives by making routine tasks faster and easier. Many recent models can also process image inputs. In this case, they are called “Vision Language Models” (VLMs) instead.

However, both VLMs and LLMs, as their name suggests, are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B. As such, running inference with these models is expensive. For instance, the 671B model (“DeepSeek R1”) requires at least ~700 GB of GPU memory without optimization, which takes 8 H100 GPUs just to load (market price as of March 2026: ~40,000€ per unit). These models are therefore usually deployed in the cloud, where your query is sent to and processed by a server cluster.

This raises a number of issues for the user, such as potentially high latency, no way to query the model offline, and privacy concerns regarding both your own and others' data. For instance, using ChatGPT to summarize your chat messages means you are giving away not just your data but also that of the other participants.

To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Ideally, this allows running those models locally and even in resource-constrained settings (on “edge devices” such as a smartphone). However, the effectiveness of such methods should be verified empirically through actual deployment on edge devices. A previous thesis has established a framework for deploying and benchmarking VLMs on the Nvidia Jetson Thor. As a next step, this thesis should evaluate a deployment scenario on said platform.

Specifically, the student should

  1. Set up an environment to test agentic capabilities of VLMs (e.g. a game such as Factorio or Minecraft)
  2. Train & deploy agents for this environment on the Thor
  3. Extend the framework with this environment
  4. Establish a Pareto frontier of model scaling for this environment (i.e. answer: given a fixed compute budget, is it better to compress a large model and then finetune it, or to finetune a smaller model for longer?)
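
For the last step, a Pareto frontier can be computed from a list of finished runs with a few lines of Python. In the sketch below, each run is described by its compute cost (lower is better) and its task score (higher is better). The run names and numbers are made up for illustration.

```python
def pareto_frontier(runs):
    """Return the runs that no other run beats on both cost and score.

    `runs` is a list of (name, compute_cost, score) tuples. A run is kept
    if no other run is at most as expensive and scores strictly higher.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by ascending cost; among equal costs, the best score comes first.
    for name, cost, score in sorted(runs, key=lambda run: (run[1], -run[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


# Hypothetical results: (configuration, compute budget, task score).
runs = [
    ("small-long", 10, 0.60),
    ("large-compressed", 10, 0.65),
    ("small-short", 4, 0.50),
    ("large-full", 30, 0.64),
]
frontier = pareto_frontier(runs)
print([name for name, _, _ in frontier])
```

In this made-up example, "small-long" is dominated by "large-compressed" at the same budget, and "large-full" costs more without scoring higher, so neither appears on the frontier.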

Necessary Background:

  • You can work independently
  • You can follow basic instructions such as those found under “Contact Details” 

Recommended Background:

  • Has used a package manager like Anaconda before

Ideal Background:

  • Some experience using transformers or vLLM
  • Knows what CUDA is
  • Basic understanding of the transformer architecture (e.g. attention mechanism, auto-regressive decoding, kv-cache, …)
  • Some knowledge about finetuning techniques for VLMs (especially reinforcement learning)

Contact Details:

  • Please contact me only via e-mail
  • Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have attended)
  • I will try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.