Proseminar/Seminar: 3D Vision

This proseminar/seminar is on 3D vision. It can be taken by both Bachelor students (as a Proseminar) and Master students (as a Seminar). In groups of two, students write and review reports and present a topic in the field of 3D vision. Participation is limited to 8 Bachelor students and 8 Master students. Registration is now open via the two booking pools in ILIAS; places are assigned first come, first served.


Course Information

  • Course number: ML-4507
  • Credits: 3 ECTS (2h)
  • Total workload: 90h
  • The seminar is planned to be held in person


Prerequisites

  • Basic computer science skills: variables, functions, loops, classes, algorithms
  • Basic math skills: linear algebra, analysis, probability theory
  • Basic knowledge of deep learning is beneficial, but not required


Registration

  • To participate in this seminar, you must register in the ILIAS booking pool
  • Participation is limited to 8 Bachelor students and 8 Master students, first come, first served





Schedule

  • Introduction of Topics and Voting, Assignment of Topics and Reviews
  • Introduction to Scientific Writing and Presenting
  • TA Feedback Sessions
  • Deadline for Initial Drafts and Slides
  • Presentations 1 and 2, Deadline for Reviews
  • Presentations 3 and 4
  • Presentations 5 and 6
  • Presentations 7 and 8
  • Deadline for Final Drafts and Slides



Topics

The students may choose among the following topics for their seminar paper.

Novel View Synthesis
Novel View Synthesis (NVS) addresses the problem of rendering a scene from unobserved viewpoints, given a number of RGB images and camera poses as input, e.g., for interactive exploration.

  • Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., & Sheikh, Y. (2019). Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4), 1-14.
  • Riegler, G., & Koltun, V. (2020, August). Free view synthesis. In European Conference on Computer Vision (pp. 623-640). Springer, Cham.
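Many NVS methods, including Neural Volumes, rely on differentiable volume rendering: colors and densities sampled along each camera ray are alpha-composited into a pixel. A minimal sketch of this compositing step for a single ray, with toy hand-picked densities and colors:

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: (N,) non-negative volume densities at the samples
    colors:    (N, 3) RGB colors at the samples
    deltas:    (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)   # opacity of each ray segment
    trans = np.cumprod(1.0 - alphas)             # transmittance *after* each sample
    trans = np.concatenate([[1.0], trans[:-1]])  # transmittance *before* each sample
    weights = alphas * trans                     # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0), weights

# A ray that crosses empty space and then hits a nearly opaque red region:
densities = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
deltas = np.full(4, 0.1)
rgb, weights = composite_ray(densities, colors, deltas)
```

The empty samples receive zero weight, so the composited pixel comes out red; the weights sum to at most one, with the remainder corresponding to light that passes through the whole volume.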

Generative models for 3D objects and scenes
Generative models such as Generative Adversarial Networks can generate images that resemble the objects or scenes in their training data. By default, however, it is not possible to perform 3D manipulations such as viewpoint changes or transformations of individual objects in the generated scenes. Recently, several works have added inductive biases to generative models that enable 3D controllability.

  • Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). HoloGAN: Unsupervised learning of 3D representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7588-7597).
  • Liao, Y., Schwarz, K., Mescheder, L., & Geiger, A. (2020, June). Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5870-5879). IEEE Computer Society.
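The inductive bias in works like HoloGAN is, roughly, that the generator maintains a 3D feature volume to which a rigid transformation is applied before projection to 2D, so that rotating the volume changes the rendered viewpoint. A toy sketch of this idea (the random feature volume and the sum-projection "renderer" are stand-ins, not the actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent 3D feature volume (C, D, H, W), standing in for a
# generator's intermediate representation.
C, D, H, W = 8, 16, 16, 16
features = rng.standard_normal((C, D, H, W))

def render(volume):
    """Toy 'renderer': orthographic projection along depth, then a channel mean."""
    projected = volume.sum(axis=1)  # (C, H, W): collapse the depth axis
    return projected.mean(axis=0)   # (H, W) pseudo-image

def rotate_90(volume):
    """Rotate the feature volume 90 degrees about the vertical axis."""
    return np.rot90(volume, k=1, axes=(1, 3))

img_front = render(features)
img_side = render(rotate_90(features))
```

Because the transformation acts on the 3D volume rather than on the 2D output, the two renderings are consistent views of the same latent content.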

3D Reconstruction based on Deep Implicit Representations
A 3D reconstruction pipeline receives one or more RGB images, and optionally depth maps, and tries to infer the underlying geometry from these sparse inputs. Deep implicit representations compactly encode the geometry of a scene as a level set of a deep neural network. Recently, promising results have been achieved by adopting deep implicit representations for 3D reconstruction.

  • Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., & Geiger, A. (2019). Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4460-4470).
  • Niemeyer, M., Mescheder, L., Oechsle, M., & Geiger, A. (2020, June). Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3501-3512). IEEE.
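To make the level-set idea concrete: an occupancy network is a function f: R^3 -> [0, 1], and the reconstructed surface is the set of points where f crosses a threshold (e.g. 0.5). A minimal sketch, with an analytic sphere standing in for a trained network:

```python
import numpy as np

def occupancy(points):
    """Stand-in for a trained occupancy network f: R^3 -> [0, 1].
    Here: a sphere of radius 0.5, occupied inside and free outside."""
    d = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp(10.0 * (d - 0.5)))  # smooth transition at the surface

# Evaluate on a dense grid; the surface is the 0.5 level set of f.
n = 32
xs = np.linspace(-1, 1, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)  # (n, n, n, 3)
occ = occupancy(grid.reshape(-1, 3)).reshape(n, n, n)
inside = occ > 0.5  # voxels classified as inside the object
```

In practice the binary grid would be turned into a mesh (e.g. with marching cubes), but the representation itself is just the network: arbitrary resolution, fixed memory.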

Semantic segmentation of 3D scenes
In semantic segmentation of 3D data, the goal is to partition the input space into segments and assign a semantic label to each segment. Since in the majority of 3D vision applications the input data comes in the form of point clouds (e.g. LIDAR sensor data), the input 3D space can be discretized into a regular grid of voxels, making the problem more similar to 2D image segmentation. With the popularization of coordinate MLPs, recent methods can be applied directly to 3D points and are able to efficiently process large point clouds of indoor and outdoor scenes.

  • Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 652-660).
  • Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., ... & Markham, A. (2020). RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11108-11117).
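The voxelization step mentioned above amounts to mapping each point to an integer cell index, after which points can be grouped per cell. A minimal sketch on a synthetic point cloud:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel index: a common preprocessing
    step before grid-based segmentation."""
    return np.floor(points / voxel_size).astype(np.int64)

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 4.0, size=(1000, 3))  # synthetic "scan" in a 4 m cube
idx = voxelize(points, voxel_size=1.0)

# Group points by voxel, e.g. to count occupancy per cell.
unique_voxels, counts = np.unique(idx, axis=0, return_counts=True)
```

Point-based methods like PointNet skip this discretization entirely and apply a shared MLP to each raw point, which avoids the quantization error and memory cost of the grid.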

3D object detection (and tracking)
The classical task of 2D object detection predicts a 2D bounding box that localizes an object in image space. In contrast, a 3D object detector should output a 3D bounding box that estimates the object's location, pose, and size in 3D space, which is especially useful for applications such as self-driving and surveillance. Additionally, tracking detected objects across time might be of interest. 3D detection methods can work on RGB images, point clouds, or a combination of multiple modalities.

  • Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., & Urtasun, R. (2016). Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2147-2156).
  • Yin, T., Zhou, X., & Krahenbuhl, P. (2021). Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11784-11793).
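A common output parameterization for such detectors (e.g. in driving benchmarks) is a box given by its center, size, and a heading angle about the vertical axis. A small sketch of this parameterization, used here to test which points fall inside a detected box; the specific numbers are made up:

```python
import numpy as np

def points_in_box(points, center, size, yaw):
    """Test which 3D points lie inside a box parameterized as
    (center, size, yaw): a typical 3D-detector output format."""
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotate points into the box's local frame (inverse yaw about z).
    local = (points - center) @ np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return np.all(np.abs(local) <= np.asarray(size) / 2.0, axis=-1)

center = np.array([10.0, 5.0, 0.0])
size = (4.0, 2.0, 1.5)                 # length, width, height in meters
yaw = np.pi / 4                        # heading angle
pts = np.array([[10.0, 5.0, 0.0],     # at the box center -> inside
                [20.0, 5.0, 0.0]])    # far away         -> outside
mask = points_in_box(pts, center, size, yaw)
```

The same representation (7 numbers per box, plus a class score) is what tracking-by-detection methods associate across frames.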

Optical and scene flow
Both optical and scene flow deal with the problem of estimating a dense motion field between two consecutive frames (e.g. two image captures) of a dynamic scene. Scene flow estimates, for each pixel, a 3D vector that represents the motion of the (scene) surface point visible in that pixel, while optical flow describes the 2D displacement of each pixel between the two frames.

  • Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., ... & Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
  • Menze, M., Heipke, C., & Geiger, A. (2018). Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing, 140, 60-76.
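A 2D flow field can be visualized concretely by using it to warp one frame into the other. A toy sketch of backward warping with nearest-neighbor sampling (one convention among several; real pipelines use bilinear sampling and handle occlusions):

```python
import numpy as np

def warp_backward(img, flow):
    """Warp an image with a dense flow field (nearest-neighbor sampling).
    Convention assumed here: flow[y, x] = (dx, dy) means pixel (x, y) in
    frame 2 came from (x + dx, y + dy) in frame 1; samples are clamped
    to the image bounds."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[src_y, src_x]

# A frame with one bright pixel that moves 2 pixels to the right:
frame1 = np.zeros((8, 8))
frame1[4, 3] = 1.0
flow = np.zeros((8, 8, 2))
flow[..., 0] = -2.0  # every pixel in frame 2 maps back 2 pixels to the left
frame2 = warp_backward(frame1, flow)
```

Scene flow adds a third (depth) component per pixel; projecting the 3D motion through the camera recovers the 2D optical flow.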

Visual SLAM
Visual Simultaneous Localization And Mapping (SLAM) is a technique for estimating the pose of an agent while simultaneously reconstructing a map of the surrounding environment, using only visual input (images). Visual SLAM is a crucial component of many autonomous systems; however, despite the popularity and breadth of deep learning applications in computer vision, current state-of-the-art systems are still based on more traditional approaches (optimization, geometry, etc.).

  • Kerl, C., Sturm, J., & Cremers, D. (2013, November). Dense visual SLAM for RGB-D cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2100-2106). IEEE.
  • Mur-Artal, R., & Tardós, J. D. (2017). ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5), 1255-1262.
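At the core of the localization half of SLAM is the composition of rigid-body transforms: chaining estimated frame-to-frame motions yields the agent's trajectory (and accumulates drift, which loop closure later corrects). A minimal sketch in 2D (SE(2)), standing in for the SE(3) poses a real system uses:

```python
import numpy as np

def se2(x, y, theta):
    """Homogeneous 2D rigid-body transform: a simplified stand-in for the
    SE(3) poses used in visual SLAM."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0,  0, 1]])

# Dead-reckoning: chain relative motions, as a SLAM front-end would
# between keyframes. Moving 1 m forward and turning 90 degrees left,
# four times, traces a square and returns to the start.
pose = np.eye(3)
trajectory = [pose[:2, 2].copy()]
for _ in range(4):
    pose = pose @ se2(1.0, 0.0, np.pi / 2)
    trajectory.append(pose[:2, 2].copy())
```

In a full system each relative motion comes from feature matching or direct image alignment, and all poses are then jointly refined by optimizing reprojection or photometric error over the map.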