Proseminar/Seminar: 3D Vision
This combined proseminar/seminar covers 3D computer vision. It can be taken both by Bachelor students (as a proseminar) and by Master students (as a seminar). Working in groups of two, students write and review reports and present a topic in the field of 3D vision.
Qualification Goals
Students gain a deep understanding of a scientific topic. They learn to efficiently search, navigate, and read relevant literature, and to summarize a topic clearly in their own words in a written report. Moreover, students present their topic to an audience of students and researchers, and provide feedback to others in the form of reviews and discussions. During the seminar, students learn to put scientific research into context, practice critical thinking, and identify the advantages and problems of a studied scientific method.
Overview
- Course number: ML-4507
- Credits: 3 ECTS (2h)
- Total Workload: 90h
- The seminar is held in a physical format in the MvL6 lecture hall. Students must bring proof of 3G status (vaccinated, recovered, or tested), a mobile phone with a QR code scanning app, and their university credentials (username/password) for registration and contact tracing.
- Attendance is mandatory at all scheduled sessions
Deliverables
- Report (5-6 pages, double column, excluding references)
- Presentation (25-30 minutes, max. 20 slides)
- Review of another report (1 page, double column)
- Discussion (during all presentations)
Prerequisites
- Basic Computer Science skills: Variables, functions, loops, classes, algorithms
- Basic Math skills: Linear algebra, analysis, probability theory
- Basic knowledge of Deep Learning is beneficial, but not required
Registration
- To participate in this seminar, you must register in the ILIAS booking pool
Templates
Links to LaTeX/Overleaf templates for reports, reviews, and slides. Reports and reviews must use the corresponding template. Presentation slides may also be created with other tools, e.g., PowerPoint or Keynote.
- Report template: https://www.overleaf.com/read/cwjmyfnhjmgh
- Review template: https://www.overleaf.com/read/xqfbmcpqqddq
- Slide template: https://www.overleaf.com/read/wxxwtgbrgcbm
Schedule
Topics
Students may choose among the following topics for their report and presentation.
1. Novel View Synthesis
Novel View Synthesis (NVS) addresses the problem of rendering a scene from unobserved viewpoints, given a number of RGB images and their camera poses as input, e.g., for interactive exploration. A code sketch of the core rendering step follows the references.
- Lombardi, S., Simon, T., Saragih, J., Schwartz, G., Lehrmann, A., & Sheikh, Y. (2019). Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4), 1-14.
- Riegler, G., & Koltun, V. (2020, August). Free view synthesis. In European Conference on Computer Vision (pp. 623-640). Springer, Cham.
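As a rough illustration of the volumetric rendering at the core of methods such as Neural Volumes, the following Python sketch alpha-composites color samples along a single camera ray. The function name and the toy density/color values are illustrative; in an actual method, the samples would be predicted by a learned model.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite color samples along one camera ray.

    densities: (N,) non-negative volume densities at N samples
    colors:    (N, 3) RGB values at the samples
    deltas:    (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)      # opacity per sample
    # transmittance: probability the ray reaches each sample unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans                        # contribution per sample
    return (weights[:, None] * colors).sum(axis=0)  # final pixel color

# toy example: a single ray with 64 samples
densities = np.linspace(0.0, 2.0, 64)
colors = np.random.rand(64, 3)
deltas = np.full(64, 0.05)
pixel = composite_ray(densities, colors, deltas)
```

Repeating this per pixel renders the scene from any camera pose, which is what makes such representations suitable for free-viewpoint rendering.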
2. Generative models for 3D objects and scenes
Generative models such as Generative Adversarial Networks allow generating images that resemble the objects or scenes in the training dataset. By default, however, it is not possible to perform 3D manipulations like viewpoint changes or transformations of individual objects in the generated scenes. Recently, several works have added inductive biases to generative models that enable such 3D controllability. A code sketch of one such bias follows the references.
- Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). HoloGAN: Unsupervised learning of 3D representations from natural images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7588-7597).
- Liao, Y., Schwarz, K., Mescheder, L., & Geiger, A. (2020, June). Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5870-5879). IEEE Computer Society.
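To illustrate one possible inductive bias, the following sketch rotates a learned 3D feature volume by a controllable yaw angle before projecting it to 2D, loosely in the spirit of HoloGAN. All names, shapes, and the simple sum-projection are illustrative assumptions; a real generator would decode the projected features into an image.

```python
import torch
import torch.nn.functional as F

def rotate_and_project(features, yaw):
    """Rotate a 3D feature volume about the vertical axis, then
    orthographically project it to a 2D feature map.

    features: (B, C, D, H, W) feature volume produced by a generator
    yaw:      (B,) rotation angles in radians (the controllable pose)
    """
    B = features.shape[0]
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    zeros, ones = torch.zeros(B), torch.ones(B)
    # rotation about the vertical (y) axis as a (B, 3, 4) affine matrix
    theta = torch.stack([
        torch.stack([cos,  zeros, sin,   zeros], dim=1),
        torch.stack([zeros, ones, zeros, zeros], dim=1),
        torch.stack([-sin, zeros, cos,   zeros], dim=1),
    ], dim=1)
    grid = F.affine_grid(theta, features.shape, align_corners=False)
    rotated = F.grid_sample(features, grid, align_corners=False)
    return rotated.sum(dim=2)  # collapse depth: simple orthographic projection

feats = torch.randn(1, 16, 32, 32, 32)   # stand-in for generator features
img_feats = rotate_and_project(feats, torch.tensor([0.5]))
```

Because the rotation is applied in 3D before projection, the viewpoint becomes an explicit, controllable input to the generator rather than an entangled latent factor.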
3. 3D Reconstruction based on Deep Implicit Representations
A 3D reconstruction pipeline receives one or more RGB images, and optionally depth maps, and tries to infer the underlying geometry from these sparse inputs. Deep Implicit Representations compactly encode the geometry of a scene as the level set of a deep neural network. Recently, promising results have been achieved by adopting Deep Implicit Representations for 3D reconstruction. A code sketch of such a representation follows the references.
- Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., & Geiger, A. (2019). Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4460-4470).
- Niemeyer, M., Mescheder, L., Oechsle, M., & Geiger, A. (2020, June). Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3501-3512). IEEE.
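As a minimal sketch of a deep implicit representation, assuming a conditional occupancy-style model (all names and layer sizes are illustrative), the following toy network maps 3D query points and a shape code to occupancy probabilities; the surface is the 0.5 level set.

```python
import torch
import torch.nn as nn

class OccupancyNet(nn.Module):
    """Tiny MLP that maps a 3D point, conditioned on a shape code z,
    to an occupancy probability; the surface is the 0.5 level set."""
    def __init__(self, z_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, z):
        # points: (B, N, 3), z: (B, z_dim) -> occupancy in (0, 1): (B, N)
        z = z.unsqueeze(1).expand(-1, points.shape[1], -1)
        return torch.sigmoid(self.net(torch.cat([points, z], dim=-1))).squeeze(-1)

model = OccupancyNet()
pts = torch.rand(1, 1024, 3) * 2 - 1   # query points in [-1, 1]^3
z = torch.randn(1, 128)                # shape code, e.g., from an image encoder
occ = model(pts, z)                    # a mesh can be extracted from the 0.5
                                       # level set, e.g., with marching cubes
```

The representation is resolution-independent: geometry can be queried at arbitrary continuous locations, unlike a fixed voxel grid.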
4. 3D Reconstruction based on Multi-View Stereo
Multi-View Stereo (MVS) is a classical technique for dense 3D reconstruction that takes as input multiple RGB images of a scene together with the corresponding camera poses. An MVS algorithm tries to find pixels in different views (images) that correspond to the same 3D point. Solving this correspondence problem allows estimating depth and normal maps, which subsequent stages of the reconstruction pipeline can fuse into a mesh. Recently, Deep Neural Networks have been used to improve the performance of MVS. A code sketch of the correspondence search follows the references.
- Schönberger, J. L., Zheng, E., Frahm, J. M., & Pollefeys, M. (2016, October). Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (pp. 501-518). Springer, Cham.
- Yao, Y., Luo, Z., Li, S., Fang, T., & Quan, L. (2018). MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 767-783).
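The following sketch illustrates the correspondence search at the heart of MVS for a single reference pixel: a sweep over depth hypotheses, scoring each by a photometric cost. The function names and the single-pixel squared color difference are simplifying assumptions; real systems use patch-based costs and robust view selection.

```python
import numpy as np

def depth_costs(ref_px, ref_color_fn, src_color_fn, K, R, t, depths):
    """Photometric matching cost for one reference pixel over depth hypotheses.

    ref_px: (u, v) pixel in the reference view
    *_color_fn: functions returning the RGB color at a (u, v) location
    K: (3, 3) intrinsics; R, t: relative pose from reference to source view
    """
    u, v = ref_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-projected ray
    costs = []
    for d in depths:
        X = d * ray                                  # 3D hypothesis in ref frame
        x = K @ (R @ X + t)                          # project into source view
        us, vs = x[0] / x[2], x[1] / x[2]
        diff = ref_color_fn(u, v) - src_color_fn(us, vs)
        costs.append(float(diff @ diff))             # squared color difference
    return np.array(costs)  # argmin over depths gives the depth estimate

# toy usage with constant-color images
K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
const = lambda u, v: np.array([0.5, 0.5, 0.5])
c = depth_costs((320, 240), const, const, K, R, t, np.linspace(0.5, 5.0, 32))
```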
5. Semantic segmentation of 3D scenes
In semantic segmentation of 3D data, the goal is to partition the input space into segments and assign a semantic label to each segment. Since in the majority of 3D vision applications the input data comes in the form of point clouds (e.g., LiDAR sensor data), the input 3D space can be discretized into a regular grid of voxels, making the problem more similar to 2D image segmentation. With the popularization of point-wise MLPs, recent methods can instead operate directly on 3D points and are able to efficiently process large point clouds of indoor and outdoor scenes. A code sketch of a point-wise architecture follows the references.
- Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 652-660).
- Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., ... & Markham, A. (2020). RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11108-11117).
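A minimal PointNet-style sketch (names and layer sizes are illustrative): a shared per-point MLP, a permutation-invariant global feature obtained by max pooling, and a per-point classifier on the concatenation of both.

```python
import torch
import torch.nn as nn

class TinyPointSeg(nn.Module):
    """PointNet-style segmentation head: shared per-point MLP, a global
    max-pooled feature, and a per-point classifier on their concatenation."""
    def __init__(self, num_classes=13):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(128 + 128, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, points):                    # points: (B, N, 3)
        feats = self.point_mlp(points)            # (B, N, 128) per-point features
        global_feat = feats.max(dim=1).values     # (B, 128) permutation-invariant
        global_feat = global_feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.head(torch.cat([feats, global_feat], dim=-1))  # (B, N, C)

logits = TinyPointSeg()(torch.rand(2, 2048, 3))   # per-point class scores
```

The max pooling is what makes the network invariant to the order of the input points, a key property for unordered point clouds.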
6. 3D object detection (and tracking)
Classical 2D object detection predicts a 2D bounding box that localizes an object in image space. In contrast, a 3D object detector outputs a 3D bounding box that estimates the object's location, pose, and size in 3D space, which is especially useful for applications such as self-driving and surveillance. Additionally, tracking detected objects across time might be of interest. 3D detection methods can work on RGB images, point clouds, or a combination of multiple modalities. A code sketch of a common 3D box parameterization follows the references.
- Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., & Urtasun, R. (2016). Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2147-2156).
- Yin, T., Zhou, X., & Krahenbuhl, P. (2021). Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11784-11793).
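As a small illustration of what a 3D detector outputs, the following sketch converts a box parameterization commonly used in driving benchmarks (center, size, heading angle) into its 8 corners; the function name and axis conventions are illustrative assumptions.

```python
import numpy as np

def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box given its center (x, y, z),
    size (l, w, h), and heading angle yaw about the vertical axis."""
    l, w, h = size
    # axis-aligned corners around the origin
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    z = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2
    corners = np.stack([x, y, z])                     # (3, 8)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])  # rotation about z
    return (R @ corners).T + np.asarray(center)       # (8, 3) in world frame

corners = box3d_corners(center=(10.0, 2.0, 0.9), size=(4.2, 1.8, 1.6), yaw=0.3)
```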
7. Optical and scene flow
Both optical and scene flow deal with the problem of estimating a dense motion field between two consecutive frames (e.g., two image captures) of a dynamic scene. Optical flow describes the 2D displacement of each pixel between the two frames, while scene flow estimates for each pixel a 3D vector that represents the motion of the scene surface point visible in that pixel. A code sketch of flow-based warping follows the references.
- Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., ... & Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766).
- Menze, M., Heipke, C., & Geiger, A. (2018). Object scene flow. ISPRS Journal of Photogrammetry and Remote Sensing, 140, 60-76.
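The following sketch illustrates how a 2D flow field is used in practice: backward-warping the second frame toward the first (here with simple nearest-neighbor lookup; real pipelines use bilinear sampling). If the flow is accurate, the warped image matches the first frame, which is the basis of photometric losses in unsupervised flow learning.

```python
import numpy as np

def warp_backward(img2, flow):
    """Warp the second image toward the first using a dense flow field:
    out[y, x] = img2[y + flow_y, x + flow_x] (nearest-neighbor lookup)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xs2 = np.clip(np.round(xs + flow[..., 0]), 0, w - 1).astype(int)
    ys2 = np.clip(np.round(ys + flow[..., 1]), 0, h - 1).astype(int)
    return img2[ys2, xs2]

img2 = np.random.rand(120, 160, 3)
flow = np.full((120, 160, 2), 2.0)     # constant 2-pixel motion everywhere
img1_hat = warp_backward(img2, flow)   # the photometric error against the real
                                       # first frame is a common training signal
```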
8. Visual SLAM
Visual Simultaneous Localization And Mapping (SLAM) estimates the pose of an agent and, at the same time, a map of the surrounding environment, using only visual input (images). Visual SLAM is a crucial component of many autonomous systems. However, despite the popularity and breadth of deep learning applications in computer vision, current state-of-the-art systems are still based on more traditional approaches (optimization, geometry, etc.). A code sketch of the reprojection error at the heart of such systems follows the references.
- Kerl, C., Sturm, J., & Cremers, D. (2013, November). Dense visual SLAM for RGB-D cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2100-2106). IEEE.
- Mur-Artal, R., & Tardós, J. D. (2017). ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5), 1255-1262.
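As a minimal sketch of the geometric core of visual SLAM, the following code computes the reprojection residuals that a least-squares solver would minimize when estimating a camera pose against known map points. All names are illustrative; real systems add robust losses, keyframe management, and joint optimization of poses and points (bundle adjustment).

```python
import numpy as np

def reprojection_residuals(points_w, observations, R, t, K):
    """Residuals minimized when estimating a camera pose: project known
    3D map points with pose (R, t) and compare against the matched 2D
    keypoint observations in the current image.

    points_w: (N, 3) map points, observations: (N, 2) matched keypoints
    """
    cam = points_w @ R.T + t             # world -> camera frame
    proj = cam @ K.T                     # apply intrinsics
    uv = proj[:, :2] / proj[:, 2:3]      # perspective division
    return (uv - observations).ravel()   # stacked 2D reprojection errors

K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
pts = np.random.rand(50, 3) + np.array([0, 0, 5.0])  # points in front of camera
obs = (pts @ K.T)[:, :2] / (pts @ K.T)[:, 2:3]       # perfect observations
res = reprojection_residuals(pts, obs, np.eye(3), np.zeros(3), K)
# a least-squares solver (e.g., scipy.optimize.least_squares) would
# minimize these residuals over the pose parameters
```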