Proseminar/Seminar: 3D Vision

This proseminar/seminar is on 3D vision. It can be taken by both Bachelor students (as a Proseminar) and Master students (as a Seminar). Students write and review reports and present a topic in the field of 3D vision in groups of two. Participation is limited to 8 Bachelor students and 8 Master students. Registration is now open via the two booking pools in ILIAS. First come, first served.

Overview

  • Course number: ML-4507
  • Credits: 3 ECTS (2h)
  • Total Workload: 90h
  • The seminar is held in a physical format in the MvL6 lecture hall. Students must bring a 3G proof, a mobile phone with a QR-scanning app, and their university credentials (username/password) for registration and contact tracing.
  • Attendance is mandatory during all scheduled sessions

Prerequisites

  • Basic Computer Science skills: Variables, functions, loops, classes, algorithms
  • Basic Math skills: Linear algebra, analysis, probability theory
  • Basic knowledge of Deep Learning is beneficial, but not required

Registration

  • To participate in this seminar, you must register in the ILIAS booking pool

Schedule

Date        Topic
29.10.      Introduction and Assignment of Topics and Reviews
05.11.      Introduction to Scientific Writing and Presenting
12.11.      No Seminar
19.11.      TA Feedback Sessions
26.11.      No Seminar
03.12.      TA Feedback Sessions
10.12.      Deadline for Initial Reports and Slides
17.12.      Presentation 1 and 2, Deadline for all Reviews
            Christmas Break
14.01.      Presentation 3 and 4
21.01.      Presentation 5 and 6
28.01.      Presentation 7 and 8
04.02.      Deadline for Final Drafts and Slides

Topics

Students may choose among the following topics for their seminar paper.

1. Novel View Synthesis
Novel View Synthesis (NVS) addresses the problem of rendering a scene from unobserved viewpoints, given a number of RGB images and camera poses as input, e.g., for interactive exploration.
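
To make the rendering step concrete, below is a minimal Python/NumPy sketch of the volume-rendering formulation used by NeRF-style NVS methods. The function radiance_field is a hypothetical placeholder for a trained scene model, and all constants are illustrative, not taken from any particular paper.

    import numpy as np

    def render_ray(radiance_field, origin, direction, near=0.1, far=4.0, n_samples=64):
        """Volume-render a single camera ray, NeRF-style.

        radiance_field(points) -> (rgb, sigma) is an assumed scene model,
        e.g. a trained MLP queried at 3D points.
        """
        # Sample 3D points along the ray between the near and far planes.
        t = np.linspace(near, far, n_samples)
        points = origin[None, :] + t[:, None] * direction[None, :]

        rgb, sigma = radiance_field(points)      # (n, 3) colors, (n,) densities

        # Distances between consecutive samples (the last gets a large value).
        delta = np.append(t[1:] - t[:-1], 1e10)

        # Alpha compositing: per-segment opacity and accumulated transmittance.
        alpha = 1.0 - np.exp(-sigma * delta)
        trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))
        weights = alpha * trans

        return (weights[:, None] * rgb).sum(axis=0)   # final pixel color

    # Toy example: a soft gray sphere of radius 0.5 at the origin.
    def dummy_field(points):
        sigma = 10.0 * (np.linalg.norm(points, axis=-1) < 0.5)
        return np.full((len(points), 3), 0.5), sigma

    color = render_ray(dummy_field, np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))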

2. Generative models for 3D objects and scenes
Generative models like Generative Adversarial Networks allow generating images that resemble objects or scenes from the training dataset. However, by default, it is not possible to perform 3D manipulations like viewpoint changes or transformations of individual objects in the generated scenes. Recently, several works have added inductive biases to generative models that enable such 3D controllability.
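
One way to picture such an inductive bias: condition the generator on an explicit camera pose, so that changing the pose changes the rendered viewpoint. The PyTorch sketch below is an illustrative assumption of this idea, not the architecture of any specific published method; all layer sizes are made up.

    import torch
    import torch.nn as nn

    class PoseConditionedGenerator(nn.Module):
        """Toy generator conditioned on a camera pose (hypothetical sizes)."""

        def __init__(self, z_dim=128, pose_dim=12, img_size=32):
            super().__init__()
            self.img_size = img_size
            self.net = nn.Sequential(
                nn.Linear(z_dim + pose_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 3 * img_size * img_size), nn.Tanh(),
            )

        def forward(self, z, pose):
            # pose: flattened 3x4 camera matrix; varying it while keeping z
            # fixed should change only the viewpoint of the generated image.
            x = self.net(torch.cat([z, pose], dim=-1))
            return x.view(-1, 3, self.img_size, self.img_size)

    z = torch.randn(4, 128)        # latent codes
    pose = torch.randn(4, 12)      # flattened camera matrices (dummy values)
    imgs = PoseConditionedGenerator()(z, pose)   # (4, 3, 32, 32)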

3. 3D Reconstruction based on Deep Implicit Representations
A 3D reconstruction pipeline receives one or multiple RGB images, and optionally depth maps, and tries to infer the underlying geometry from these sparse inputs. Deep Implicit Representations compactly encode the geometry of a scene as the level set of a deep neural network. Recently, promising results have been achieved by adopting Deep Implicit Representations for 3D reconstruction.
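
Below is a minimal PyTorch sketch of such a representation, under the assumption that the network predicts a signed distance: the surface is the zero level set of an MLP, and the geometry can be read out by querying the network on a grid. Layer sizes and grid resolution are illustrative.

    import torch
    import torch.nn as nn

    class ImplicitSurface(nn.Module):
        """An MLP mapping a 3D coordinate to a signed distance value.
        The encoded surface is the zero level set of this function."""

        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),    # signed distance to the surface
            )

        def forward(self, xyz):
            return self.net(xyz)

    # Query the (untrained) field on a dense grid; the zero crossings
    # approximate the surface and could be meshed with marching cubes.
    model = ImplicitSurface()
    axis = torch.linspace(-1.0, 1.0, 32)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    with torch.no_grad():
        sdf = model(grid.reshape(-1, 3)).reshape(32, 32, 32)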

4. 3D Reconstruction based on Multi-View Stereo
Multi-View Stereo (MVS) is a classical technique for dense 3D reconstruction that takes as input multiple RGB images of a scene and the corresponding camera poses. An MVS algorithm tries to find pixels in different views (images) that correspond to the same 3D point. Solving this correspondence problem allows estimating depth and normal maps, which subsequent stages of the reconstruction pipeline can fuse into a mesh. Recently, deep neural networks have been used to improve the performance of MVS.
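
The NumPy sketch below illustrates the correspondence test at the heart of plane-sweep MVS for a single pixel and a single source view, using a plain per-pixel color difference as the matching cost; real systems use patch-based costs, many views, and careful regularization. K (intrinsics) and R, t (relative pose) are assumed to be known.

    import numpy as np

    def best_depth(ref_img, src_img, K, R, t, u, v, depths):
        """Score a set of depth hypotheses for reference pixel (u, v) by
        projecting it into the source view and comparing colors."""
        costs = []
        ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray
        for d in depths:
            X = d * ray                                  # 3D point at depth d
            x = K @ (R @ X + t)                          # project into source view
            us, vs = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            if 0 <= vs < src_img.shape[0] and 0 <= us < src_img.shape[1]:
                costs.append(np.abs(ref_img[v, u] - src_img[vs, us]).sum())
            else:
                costs.append(np.inf)                     # outside the source image
        return depths[int(np.argmin(costs))]             # lowest-cost hypothesis

    # Smoke test with random 64x64 grayscale images and 32 depth hypotheses.
    K = np.array([[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]])
    ref, src = np.random.rand(64, 64), np.random.rand(64, 64)
    d = best_depth(ref, src, K, np.eye(3), np.zeros(3), 32, 32, np.linspace(0.5, 4.0, 32))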

5. Semantic segmentation of 3D scenes
In semantic segmentation of 3D data, the goal is to partition the input space into segments and assign a semantic label to each segment. Since in the majority of 3D vision applications the input data comes in the form of point clouds (e.g., LIDAR sensor data), the input 3D space can be discretized into a regular grid of voxels, making the problem more similar to 2D image segmentation. With the popularization of coordinate MLPs, recent methods can be applied directly to 3D points and are able to efficiently process large point clouds of indoor and outdoor scenes.
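
As a rough illustration of the discretization step mentioned above, the NumPy sketch below bins a labeled point cloud into a regular voxel grid; the grid size and voxel size are arbitrary assumptions.

    import numpy as np

    def voxelize(points, labels, voxel_size=0.1, grid_dim=64):
        """Discretize labeled 3D points into a dense voxel grid of class ids
        (0 = empty). If several points fall into a voxel, the last one wins."""
        grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=np.int64)
        idx = np.floor(points / voxel_size).astype(int) + grid_dim // 2
        inside = np.all((idx >= 0) & (idx < grid_dim), axis=1)
        idx, lab = idx[inside], labels[inside]
        grid[idx[:, 0], idx[:, 1], idx[:, 2]] = lab
        return grid

    # Example: 1000 random points with labels from 4 semantic classes.
    pts = np.random.uniform(-3.0, 3.0, size=(1000, 3))
    vox = voxelize(pts, np.random.randint(1, 5, size=1000))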

6. 3D object detection (and tracking)
The classical task of 2D object detection is to predict a 2D bounding box that localizes an object in image space. In contrast, a 3D object detector should output a 3D bounding box that estimates the object's location, pose, and size in 3D space, which is especially useful for applications such as self-driving and surveillance. Additionally, tracking detected objects across time might be of interest. 3D detection methods can work on RGB images, point clouds, or combine multiple modalities in order to solve the task.
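
For reference, 3D detectors commonly parameterize a box by its center, size, and a yaw angle around the vertical axis. The NumPy sketch below converts this parameterization (an assumed but typical convention, with z up) into the 8 box corners.

    import numpy as np

    def box_corners_3d(center, size, yaw):
        """Return the 8 corners of a 3D box given center (x, y, z),
        size (length, width, height) and yaw (rotation around the z axis)."""
        l, w, h = size
        # Corner offsets in the box's local frame, centered at the origin.
        x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
        y = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
        z = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * h / 2.0
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])   # rotation about z
        return (R @ np.stack([x, y, z])).T + np.asarray(center)  # (8, 3)

    # A car-sized box 10 m ahead, slightly rotated.
    corners = box_corners_3d(center=(10.0, 2.0, 0.8), size=(4.5, 1.8, 1.6), yaw=0.3)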

7. Optical and scene flow
Both optical flow and scene flow deal with the problem of estimating a dense motion field between two consecutive time frames (e.g., two image captures) of a dynamic scene. Scene flow estimates, for each pixel, a 3D vector that represents the motion of the scene surface point visible in that pixel, while optical flow describes the 2D displacement of each pixel between the two frames.
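
The two quantities are directly related when per-pixel depth is available: lift a pixel to 3D in the first frame, follow the optical flow to its match in the second frame, lift that pixel as well, and take the 3D difference. The NumPy sketch below illustrates this under simplifying assumptions (static calibrated camera, valid flow and depth everywhere, nearest-neighbor lookup).

    import numpy as np

    def scene_flow_from_depth(flow, depth1, depth2, K):
        """Derive per-pixel 3D scene flow from 2D optical flow and two depth
        maps, assuming a static camera with intrinsics K."""
        h, w = depth1.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        K_inv = np.linalg.inv(K)

        def backproject(us, vs, depth):
            pix = np.stack([us, vs, np.ones_like(us, dtype=float)], axis=-1)
            return depth[..., None] * (pix @ K_inv.T)    # (h, w, 3) points

        p1 = backproject(u, v, depth1)
        # Follow the optical flow (rounded to the nearest pixel) into frame 2.
        u2 = np.clip(np.round(u + flow[..., 0]).astype(int), 0, w - 1)
        v2 = np.clip(np.round(v + flow[..., 1]).astype(int), 0, h - 1)
        p2 = backproject(u2, v2, depth2[v2, u2])
        return p2 - p1                                   # per-pixel 3D motion

    # Smoke test: zero flow and constant depth give zero scene flow.
    K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 12.0], [0.0, 0.0, 1.0]])
    sf = scene_flow_from_depth(np.zeros((24, 32, 2)), np.ones((24, 32)), np.ones((24, 32)), K)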

8. Visual SLAM
Visual Simultaneous Localization And Mapping (SLAM) is a technique for estimating the pose of an agent and, at the same time, a reconstruction of the surrounding environment (the map) using only visual input (images). Visual SLAM is a crucial component of many autonomous systems. However, despite the popularity and breadth of deep learning applications in computer vision, current state-of-the-art systems are still based on more traditional approaches (optimization, geometry, etc.).
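
The classical flavor of such a pipeline shows in its two-view front end: match features between consecutive frames and recover the relative camera motion from the essential matrix. The sketch below uses OpenCV for this step; the image paths and the intrinsics K are hypothetical placeholders.

    import cv2
    import numpy as np

    # Hypothetical inputs: two consecutive grayscale frames and the intrinsics.
    img1 = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
    K = np.array([[525.0, 0.0, 319.5], [0.0, 525.0, 239.5], [0.0, 0.0, 1.0]])

    # Detect and match ORB features between the two frames.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Estimate the essential matrix with RANSAC, then decompose it into the
    # rotation R and (up-to-scale) translation t between the two camera poses.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)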