Master Theses at the Chair of Cognitive Systems (Prof. Dr. Andreas Zell)

Students who want to do a master thesis should have attended at least one of Prof. Zell's lectures and passed it with a good or at least satisfactory grade. Alternatively, they may have obtained the background knowledge relevant to the thesis from other, similar lectures.

Event-based vision for a tactile sensor

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Intelligent interaction with the physical world requires perceptual abilities beyond vision and hearing; vibrant tactile sensing is essential for autonomous robots to dexterously manipulate unfamiliar objects or safely contact humans. Therefore, robotic manipulators need high-resolution touch sensors that are compact, robust, inexpensive, and efficient. In recent work, our collaborators at MPI presented Minsight [1], a soft vision-based haptic sensor, which is a miniaturized and optimized version of the previously published sensor Insight. Minsight has the size and shape of a human fingertip and uses machine learning methods to output high-resolution maps of 3D contact force vectors at 60 Hz.

To capture the high-frequency aspects of textures, an update rate of 60 Hz is not enough. Event-based cameras [2], which are becoming more and more popular, could be a good alternative to the classical, frame-based camera used so far. Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low-redundancy sensing and high dynamic range. Hence, event cameras have large potential for robotics and computer vision.

In this thesis, the student is tasked with using a new, miniature event-based camera together with Deep Learning to bring Minsight to the next level.

The student should be familiar with Computer Vision and Deep Learning and should ideally have already used Deep Learning frameworks like PyTorch in previous projects or coursework.

[1] I. Andrussow, H. Sun, K. J. Kuchenbecker, and G. Martius, "Minsight: A Fingertip-Sized Vision-Based Tactile Sensor for Robotic Manipulation," Advanced Intelligent Systems, vol. 5, no. 8, p. 2300042, August 2023.

[2] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, D. Scaramuzza, Event-based Vision: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 44, no. 1, pp. 154-180, 1 Jan. 2022.

Various Topics in Maritime Computer Vision, Robot arms, or GenAI

Mentor: Benjamin Kiefer
Email: benjamin.kiefer@uni-tuebingen.de

Reach out for an up-to-date list of topics.

Spiking neural network for event-based ball detection

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of μs), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision.

So far, most learning approaches applied to event data convert a batch of events into a tensor and then use a conventional CNN as the network. While such approaches achieve state-of-the-art performance, they do not make use of the asynchronous nature of the event data. Spiking Neural Networks (SNNs), on the other hand, are bio-inspired networks that can process the output of event-based cameras directly. SNNs process information conveyed as temporal spikes rather than numeric values. This makes SNNs an ideal counterpart for event-based cameras.
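To make the contrast concrete, below is a minimal Python sketch of the conventional dense pipeline described above: a batch of events, given as (t, x, y, polarity) tuples, is accumulated into a voxel-grid tensor that a standard CNN can consume. The sensor resolution and the number of time bins are illustrative assumptions, not the settings of any particular published method.

    import numpy as np

    def events_to_voxel_grid(events, num_bins=5, height=260, width=346):
        # events: (N, 4) array with columns (t, x, y, polarity); x/y are assumed
        # to lie inside the sensor resolution. Resolution and bin count are
        # illustrative assumptions, not the settings of a particular method.
        t = events[:, 0]
        x = events[:, 1].astype(int)
        y = events[:, 2].astype(int)
        p = np.where(events[:, 3] > 0, 1.0, -1.0)        # ON -> +1, OFF -> -1
        # Normalize timestamps into [0, num_bins - 1] and assign each event a bin.
        t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
        bins = t_norm.astype(int)
        voxel = np.zeros((num_bins, height, width), dtype=np.float32)
        np.add.at(voxel, (bins, y, x), p)                 # signed event accumulation
        return voxel   # e.g. torch.from_numpy(voxel)[None] before feeding a CNN

An SNN, by contrast, would consume the event stream spike by spike instead of first densifying it into such a tensor.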

The goal of this thesis is to investigate and evaluate how an SNN can be used together with our event-based cameras to detect and track table tennis balls. The Cognitive Systems group has a table tennis robot system in which the developed ball tracker can be deployed and compared to other methods.

Requirements: Familiarity with "traditional" Computer Vision, Deep Learning, and Python

Asynchronous Graph-based Neural Networks for Ball Detection with Event Cameras

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.

State-of-the-art machine-learning methods for event cameras treat events as dense representations and process them with CNNs. Thus, they fail to maintain the sparsity and asynchronous nature of event data, thereby imposing significant computation and latency constraints. A recent line of work [1]–[5] tackles this issue by modeling events as spatio-temporally evolving graphs that can be efficiently and asynchronously processed using graph neural networks. These works showed impressive reductions in computation.
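As a rough illustration of this idea, the sketch below builds a spatio-temporal graph from a batch of events using PyTorch Geometric. Connecting events within a radius in a scaled (x, y, t) space follows the general recipe of the cited works, but the time scale, radius, and neighbor limit are illustrative choices rather than the exact settings of [1]-[5].

    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import radius_graph   # requires torch-cluster

    def events_to_graph(events, time_scale=1e-4, radius=3.0, max_neighbors=16):
        # events: (N, 4) float tensor with columns (t, x, y, polarity).
        # Nodes are events; edges connect events that are close in the scaled
        # spatio-temporal space (x, y, time_scale * t). The time scale depends
        # on the timestamp unit and is an assumption here.
        xy = events[:, 1:3]
        t = events[:, 0:1] * time_scale        # compress time so it is comparable to pixels
        pos = torch.cat([xy, t], dim=1)        # 3D spatio-temporal node coordinates
        edge_index = radius_graph(pos, r=radius, max_num_neighbors=max_neighbors)
        return Data(x=events[:, 3:4], pos=pos, edge_index=edge_index)  # polarity as node feature

A graph neural network can then process such a graph and, when new events arrive, update only the affected nodes and edges instead of recomputing everything.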

The goal of this thesis is to apply these graph-based networks to ball detection with event cameras. Existing graph-based networks were designed for more general object detection tasks [4], [5]. Since we only want to detect balls, in a first step the student will investigate whether a network architecture tailored to our use case could further improve the inference time.

The student should be familiar with "traditional" Computer Vision and Deep Learning. Experience with Python and PyTorch from previous projects would be beneficial.

[1] Y. Li et al., “Graph-based Asynchronous Event Processing for Rapid Object Recognition,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, Oct. 2021, pp. 914–923. doi: 10.1109/ICCV48922.2021.00097.

[2] Y. Deng, H. Chen, H. Liu, and Y. Li, “A Voxel Graph CNN for Object Classification with Event Cameras,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 1162–1171. doi: 10.1109/CVPR52688.2022.00124.

[3] A. Mitrokhin, Z. Hua, C. Fermuller, and Y. Aloimonos, “Learning Visual Motion Segmentation Using Event Surfaces,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, Jun. 2020, pp. 14402–14411. doi: 10.1109/CVPR42600.2020.01442.

[4] S. Schaefer, D. Gehrig, and D. Scaramuzza, “AEGNN: Asynchronous Event-based Graph Neural Networks,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 12361–12371. doi: 10.1109/CVPR52688.2022.01205.

[5] D. Gehrig and D. Scaramuzza, "Pushing the Limits of Asynchronous Graph-based Object Detection with Event Cameras." arXiv, Nov. 22, 2022. Accessed: Dec. 16, 2022. [Online]. Available: https://arxiv.org/abs/2211.12324

Multi-object tracking via event-based motion segmentation with event cameras

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.

Since event cameras report per-pixel intensity changes, their output resembles an image gradient in which mainly edges and corners are present. The contrast maximization framework (CMax) [1] exploits this fact by optimizing the sharpness of accumulated events to solve computer vision tasks such as the estimation of motion, depth, or optical flow. Most recent works on event-based (multi-)object segmentation [2]–[4] apply this CMax framework. The common scheme is to jointly assign events to objects and fit a motion model that best explains the data.
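As a minimal illustration of the CMax principle, the Python sketch below scores a candidate image-plane velocity by warping all events to a reference time, accumulating them into an image of warped events, and measuring the image variance, one of the sharpness objectives discussed in [1]. The constant-velocity motion model and the sensor resolution are simplifying assumptions; an outer optimizer would search over the motion parameters.

    import numpy as np

    def contrast_of_warped_events(events, velocity, height=260, width=346):
        # events: (N, 3+) array with columns (t, x, y, ...); velocity: (vx, vy)
        # in pixels per second, i.e. a constant image-plane motion model.
        t, x, y = events[:, 0], events[:, 1], events[:, 2]
        t_ref = t[0]
        # Warp every event back to the reference time along the candidate motion.
        x_w = np.round(x - velocity[0] * (t - t_ref)).astype(int)
        y_w = np.round(y - velocity[1] * (t - t_ref)).astype(int)
        valid = (x_w >= 0) & (x_w < width) & (y_w >= 0) & (y_w < height)
        iwe = np.zeros((height, width), dtype=np.float32)   # image of warped events
        np.add.at(iwe, (y_w[valid], x_w[valid]), 1.0)
        return float(iwe.var())   # sharpness score, maximized over `velocity`

In the multi-object setting, each object gets its own motion model, and events are (softly) assigned to the model that renders them sharpest.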

The goal of this thesis is to develop a real-time-capable (multi-)object tracking pipeline based on multi-object segmentation. After becoming familiar with the recent literature, the student should choose a suitable multi-object segmentation approach and adapt it to our use case, namely a table tennis setup. Afterwards, different object tracking approaches should be developed, evaluated, and compared against each other.

The student should be familiar with "traditional" Computer Vision. Experience with C++ and/or optimization from previous projects or coursework would be beneficial.

[1] G. Gallego, M. Gehrig, and D. Scaramuzza, “Focus Is All You Need: Loss Functions for Event-Based Vision,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 12272–12281. doi: 10.1109/CVPR.2019.01256.

[2] X. Lu, Y. Zhou, and S. Shen, “Event-based Motion Segmentation by Cascaded Two-Level Multi-Model Fitting.” arXiv, Nov. 05, 2021. Accessed: Jan. 05, 2023. [Online]. Available: http://arxiv.org/abs/2111.03483

[3] T. Stoffregen, G. Gallego, T. Drummond, L. Kleeman, and D. Scaramuzza, "Event-Based Motion Segmentation by Motion Compensation," arXiv:1904.01293, Aug. 2019, Accessed: Jun. 14, 2021. [Online]. Available: http://arxiv.org/abs/1904.01293

[4] Y. Zhou, G. Gallego, X. Lu, S. Liu, and S. Shen, “Event-based Motion Segmentation with Spatio-Temporal Graph Cuts,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, 2021, doi: 10.1109/TNNLS.2021.3124580.

Exploiting Drone Metadata for Multi-Object Tracking (MOT)

Mentor: Martin Meßmer

Email: martin.messmer@uni-tuebingen.de

Although learning-based methods such as correlation filters and Siamese networks show great promise for multi-object tracking, these approaches are far from working perfectly. Therefore, in specific use cases it is necessary to impose additional priors or leverage additional data. Luckily, when working with drones, metadata such as the drone's altitude and velocity is freely available. In this thesis, the student should develop ideas for exploiting this data to increase the performance of an MOT model, implement them, and compare them with other approaches.
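One hypothetical example of such a prior, sketched below, uses the drone's altitude and the camera's focal length in a pinhole model to predict how large an object of known physical size should appear in the image; detections or tracks whose size deviates strongly from this prediction can then be rejected or down-weighted. The function and parameter names are illustrative, and a real setup would also need the camera's tilt angle from the metadata.

    def expected_size_px(object_size_m, altitude_m, focal_px):
        # Pinhole model for a nadir-looking camera: size_px = f * size_m / distance_m.
        # Hypothetical helper, not part of any existing MOT framework.
        return focal_px * object_size_m / max(altitude_m, 1e-6)

    def scale_gate(box_size_px, object_size_m, altitude_m, focal_px, tolerance=0.5):
        # Accept a detection/track only if its size is within +-50% of the prior.
        expected = expected_size_px(object_size_m, altitude_m, focal_px)
        return abs(box_size_px - expected) / expected <= tolerance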

Requirements: deep learning knowledge, Python, good English or German

Eye-Tracking-Based Interaction System for Robot Manipulation

Mentor: Yuzhi Lai

Email: yuzhi.lai@uni-tuebingen.de

Traditional interaction methods such as gesture and voice commands present significant challenges. Gesture-based interactions require users to perform precise movements within the field of view of a camera, restricting their mobility and range of interaction. On the other hand, speech-based commands can be ambiguous, as verbal descriptions often lack the spatial specificity needed for accurate robotic response. In contrast, gaze-based interaction offers a silent, unobtrusive, and intuitive alternative that enables hands-free control. By leveraging AI-driven eye-tracking technology, we aim to develop a gaze-based control system that enhances accessibility and interaction efficiency, providing a versatile solution for users in both general and specialized settings.


This project aims to develop a gaze-driven dialogue system for ARIA glasses that allows users to form and communicate simple sentences using only eye movements. Unlike existing assistive communication tools that rely on manual input, predefined selection grids, or GUIs, this system will leverage real-time eye-tracking to provide an intuitive way for users to collaborate with robots. The goal is a voice- and hands-free communication solution that significantly enhances the quality of life for individuals with severe motor and speech impairments.

Building and Training a Grasp-Point Detector

Mentor: David Ott
E-Mail: david.ott@uni-tuebingen.de

The goal of this thesis is to implement and train a modern deep learning-based grasp-point detection model for robotic arms, inspired by the AnyGrasp architecture.

Grasp-point detection is a core problem in robotic manipulation, where the objective is to determine 6-DoF poses in 3D space that a robotic gripper can move to in order to successfully grasp an object. A common pipeline involves detecting these grasp-points from raw 3D point-cloud data and then using off-the-shelf motion planning and control frameworks to execute the grasp. A current state-of-the-art approach for this is AnyGrasp, but unfortunately no fully open-source training pipeline has been released, and it is based on outdated dependencies and frameworks.
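To make the interface of such a detector concrete, the sketch below shows one possible way to represent 6-DoF grasp hypotheses and to filter them before handing them to a motion planner. The field names are illustrative and do not reflect the exact API of AnyGrasp or GraspNet.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Grasp:
        # One 6-DoF grasp hypothesis; field names are illustrative assumptions.
        rotation: np.ndarray     # (3, 3) gripper orientation in the camera frame
        translation: np.ndarray  # (3,)  gripper position in the camera frame
        width: float             # gripper opening in meters
        score: float             # predicted grasp quality

    def select_grasps(grasps, max_width=0.08, top_k=10):
        # Drop grasps wider than the gripper can open and keep the k best-scoring ones.
        feasible = [g for g in grasps if g.width <= max_width]
        return sorted(feasible, key=lambda g: g.score, reverse=True)[:top_k]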

This thesis will focus on re-implementing a grasp-point detection architecture similar to AnyGrasp in a modern PyTorch-based machine learning stack. The implementation should be modular, clean, and leverage current PyTorch best practices. The model will be trained and evaluated on public datasets relevant to grasp-point detection such as GraspNet or similar benchmarks. The model will then be integrated into a real-world robotic grasping pipeline using a Franka Emika Panda robot arm.

Depending on progress and interest, the work may be extended to improve the performance of the system by introducing newer components (e.g., updated feature extractors, point-cloud encoders, or transformer-based modules), or by augmenting the dataset with more challenging objects and scenes.

Necessary Prerequisites: Solid understanding of PyTorch and deep learning training pipelines, strong Python programming skills
Useful but not required Skills: Experience with SLURM, Docker, distributed ML training, robotics, 3D coordinate transforms, point-cloud data processing, ROS / ROS 2

Building a Robust Point Cloud Scene Representation for Robot Arms

Mentor: David Ott
E-Mail: david.ott@uni-tuebingen.de

The goal of this thesis is to investigate and evaluate techniques for building robust, persistent point cloud representations of the world that can serve as input to robot arm control systems. Such representations go beyond simple RGB(D) camera inputs by integrating observations over time and across viewpoints to create a more consistent and complete 3D understanding of the scene.

In many robotic manipulation scenarios, intermediate sensor representations are essential. Robot policies often require a reliable model of the world that handles occlusion, tracks object movement, and fuses multiple sensory inputs. Instead of using raw depth maps or RGBD frames, a robust point cloud can serve as a more expressive and stable input to downstream tasks such as grasp planning, motion execution, or object manipulation.

This thesis will explore and compare various techniques for generating such scene representations, focusing on fusing multiple point clouds obtained at different times and/or from different viewpoints into a single, coherent world-model. Particular attention will be paid to the following sub-problems:
- Point Cloud Registration: Aligning and fusing point clouds collected from different time steps or different camera viewpoints (a minimal sketch follows after this list).
- Downsampling and Filtering: Efficiently reducing the size of point clouds while preserving relevant structural information.
- Handling Dynamic Objects: Updating the scene representation when objects are moved, and modeling uncertainty or “staleness” of unobserved areas.
- Sensor and Camera Calibration: Transforming point clouds into a common world frame, especially when using multiple RGBD sensors.
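
As a starting point, the sketch below shows how two point clouds could be registered and merged with Open3D, using voxel downsampling and point-to-plane ICP. The voxel size and ICP threshold are illustrative values, and a complete pipeline would typically add a coarse global registration step before ICP.

    import numpy as np
    import open3d as o3d

    def fuse_point_clouds(source, target, voxel_size=0.01, max_corr_dist=0.02):
        # Register a newly captured cloud (`source`) against the current scene
        # model (`target`) with point-to-plane ICP and merge them.
        src = source.voxel_down_sample(voxel_size)
        tgt = target.voxel_down_sample(voxel_size)
        for pc in (src, tgt):
            pc.estimate_normals(
                o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel_size, max_nn=30))
        result = o3d.pipelines.registration.registration_icp(
            src, tgt, max_corr_dist, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPlane())
        src.transform(result.transformation)     # move source into the target frame
        return (tgt + src).voxel_down_sample(voxel_size)   # merged, compact scene model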

The resulting fused scene representations will be evaluated in terms of their robustness and usefulness for downstream robotic tasks. Depending on progress and interest, the student may implement simple evaluation tasks (e.g., pick-and-place) or integrate the system into a simulated or physical robot platform for testing.

Necessary Prerequisites: Python, Basic understanding of 3D geometry and linear algebra

Useful but not required Skills: C++, ROS / ROS2, experience with 3D coordinate transforms and point cloud processing, familiarity with robot arm platforms and RGB(D) sensors
 
