Seminar series of the Autonomous Vision Group with invited talks by experts in the field.
Video recordings of most talks can be found on our YouTube channel.
Bio: Jia Deng is an Assistant Professor of Computer Science at Princeton University. His research focuses on computer vision and machine learning. He received his Ph.D. from Princeton University and his B.Eng. from Tsinghua University, both in computer science. He is a recipient of the Sloan Research Fellowship, the NSF CAREER award, the ONR Young Investigator award, an ICCV Marr Prize, and two ECCV Best Paper Awards. https://www.cs.princeton.edu/~jiadeng/
Abstract: I will present our recent work on unsupervised disentanglement of the underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they will transfer to natural scenes. Here we provide evidence that objects in segmented natural movies undergo transitions that are typically small in magnitude with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for unsupervised representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several established benchmark datasets, often surpassing the current state-of-the-art. We additionally demonstrate transferability to video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.
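The temporally sparse transitions described in the abstract can be illustrated with a toy version of such a prior: a Laplace distribution centered on the previous latent assigns lower cost to one large jump than to many small simultaneous changes of the same overall magnitude. A minimal sketch, with an illustrative rate value rather than SlowVAE's actual hyperparameters:

```python
import numpy as np

def laplace_transition_nll(z_t, z_prev, rate=6.0):
    """Negative log-likelihood of a Laplace transition prior
    p(z_t | z_prev) with location z_prev: an L1 penalty on the change,
    which favors sparse transitions between adjacent frames."""
    return np.sum(rate * np.abs(z_t - z_prev) - np.log(rate / 2.0), axis=-1)

z0 = np.zeros(10)
dense = z0 + 0.1                       # every factor changes a little
sparse = z0.copy()
sparse[0] = np.sqrt(10) * 0.1          # one factor jumps (same L2 norm)

nll_dense = laplace_transition_nll(dense, z0)
nll_sparse = laplace_transition_nll(sparse, z0)
```

Under a Gaussian prior both transitions would cost the same; the L1 geometry of the Laplace prior is what prefers the sparse change.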
Bio: Yash Sharma is a PhD student at the International Max Planck Research School for Intelligent Systems (IMPRS-IS), advised by Matthias Bethge and Wieland Brendel. His research focuses on the robustness of machine learning models, improving their performance beyond the i.i.d. setting. He has researched previously at Borealis AI and IBM Research, and is a Kaggle Competitions Master. https://www.yash-sharma.com/
Abstract: Reproducing the appearance of real-world objects and scenes is a long-standing task in computer graphics and computer vision. With the advent of virtual reality and augmented reality, the demand for efficient and accurate methods for digital 3D content creation has been increasing at an unprecedented rate. In this talk, I will present our research on recovering the faithful appearance of real-world scenes with different representations. First, I will introduce our work on recovering high-quality texture maps for RGB-D reconstructions with rough geometries and inaccurate camera poses. Then we take one step further and propose a learning-based method to reconstruct high-quality meshes with per-vertex BRDFs from a sparse set of images captured under a collocated camera and light. Finally, we go beyond traditional mesh representations and propose to learn a volumetric representation from unstructured mobile phone captures under flashlight for joint view synthesis and relighting. In the end, I will discuss future work on the topic of appearance acquisition.
Bio: Sai Bi is currently a PhD candidate in the Department of Computer Science and Engineering at UC San Diego, advised by Prof. Ravi Ramamoorthi. Before coming to UCSD, he obtained a Bachelor of Engineering in Computer Science with First Class Honors from The University of Hong Kong. He was also a research assistant in the Department of Computer Science at HKU, advised by Prof. Yizhou Yu. His research interests lie in appearance acquisition and 3D reconstruction. http://cseweb.ucsd.edu/~bisai/
Abstract: Image formation is a complex process where lighting, geometry and materials interact to determine appearance. Recovery of those factors from images is a highly ill-posed problem, especially when materials are diverse, distant parts of the scene influence appearance and lighting displays local variations due to inter-reflections, shadows and refractions. Advances in deep learning have led to impressive gains, but generalization of a purely data-driven approach to handle such effects is expensive. This talk describes our recent progress towards recovering shape, spatially-varying material and lighting in complex scenes, using just a single or few images. We achieve this through novel designs such as differentiable rendering layers that incorporate knowledge of image formation, with parsimonious yet effective representations that leverage physical insights. Since real data is hard to obtain for complex light paths, we develop new synthetic datasets rendered with a high degree of photorealism. Together, these enable applications that may augment a scene to insert new objects, edit materials in the scene, or visualize it under a different illumination. We envision that our work will help transform mobile phones into cheap and accessible augmented reality devices.
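The notion of a differentiable rendering layer can be made concrete with a deliberately tiny example: when image formation is written as ordinary differentiable math, a scene parameter can be recovered by gradient descent through the renderer. The sketch below uses a single global Lambertian albedo, a toy stand-in for the spatially-varying materials and complex light paths discussed in the talk, not the speaker's actual architecture:

```python
import numpy as np

def render(albedo, normals, light):
    # Lambertian image formation: intensity = albedo * max(0, n . l).
    return albedo * np.clip(normals @ light, 0.0, None)

rng = np.random.default_rng(0)
normals = rng.normal(size=(100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
light = np.array([0.0, 0.0, 1.0])

a_true = 0.6
I_obs = render(a_true, normals, light)  # "observed" image

# Recover the albedo by descending the MSE through the rendering layer.
a = 0.1
shading = np.clip(normals @ light, 0.0, None)
for _ in range(200):
    grad = np.mean(2.0 * (a * shading - I_obs) * shading)  # d MSE / d a
    a -= 0.5 * grad
```

The same pattern, with the analytic gradient replaced by automatic differentiation, is what lets such layers sit inside a larger network.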
Bio: Manmohan Chandraker is an assistant professor in the CSE department at the University of California, San Diego. His research interests are in computer vision and machine learning, with applications in self-driving and augmented reality. His work has been recognized with the 2019 and 2018 Google Research Awards, the 2018 NSF CAREER Award, the Best Paper Award at CVPR 2014 and the Marr Prize honorable mention at ICCV 2007. https://cseweb.ucsd.edu/~mkchandraker/
Abstract: In this talk, I will cover three works related to scene analysis of non-Lambertian objects from a single viewpoint. First, I will introduce our work on learning transparent object matting from a single image. Second, I will introduce a flexible network architecture to handle calibrated photometric stereo for non-Lambertian objects under directional lighting (assumed to be known at test time). Last, I will introduce a convolutional network to estimate directional lighting from photometric stereo images, and analyse what features the network has learned to resolve the ambiguity in lighting estimation. Integrated with our lighting estimation network, existing calibrated methods can be adapted to handle the problem of uncalibrated photometric stereo and achieve state-of-the-art results.
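For context, the classical calibrated Lambertian case that these non-Lambertian networks generalize has a closed-form least-squares solution: with known light directions, per-pixel intensities are linear in the albedo-scaled normal. A minimal sketch under idealized assumptions (no shadows or specularities, so the linear model is exact):

```python
import numpy as np

# Classical calibrated photometric stereo under a Lambertian model:
# with known light directions L (one row per image), the per-pixel
# intensities satisfy I = L b with b = albedo * normal, so b is the
# least-squares solution of that linear system.
def lambertian_ps(I, L):
    b, *_ = np.linalg.lstsq(L, I, rcond=None)
    albedo = np.linalg.norm(b)
    return albedo, b / albedo

rng = np.random.default_rng(0)
L = rng.normal(size=(8, 3))
L /= np.linalg.norm(L, axis=1, keepdims=True)  # 8 known light directions

n_true = np.array([0.0, 0.0, 1.0])
rho_true = 0.7
I = rho_true * L @ n_true        # idealized: no shadows or specularities
rho_hat, n_hat = lambertian_ps(I, L)
```

It is precisely the failure of this linear model on real, non-Lambertian materials that motivates the learned approaches in the talk.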
Bio: Guanying Chen is currently a final year Ph.D. student in the Department of Computer Science at The University of Hong Kong, supervised by Prof. Kenneth K.Y. Wong. He was a research intern at Osaka University, supervised by Prof. Yasuyuki Matsushita, in 2019. He received his B.E. degree from Sun Yat-sen University in 2016. His recent research interests are learning-based methods for low-level vision, physics-based vision, and 3D vision. http://guanyingc.github.io/
Abstract: How we represent signals has major implications for the algorithms we build to analyze them. Today, most signals are represented discretely: images as grids of pixels, shapes as point clouds, audio as grids of amplitudes, etc. If images weren't pixel grids, would we be using convolutional neural networks today? What makes a good or bad representation? Can we do better? I will talk about leveraging emerging implicit neural representations for complex and large signals, such as room-scale geometry, images, audio, video, and physical signals defined via partial differential equations. By embedding an implicit scene representation in a neural rendering framework and learning a prior over these representations, I will show how we can enable 3D reconstruction from only a single posed 2D image. Finally, I will show how gradient-based meta-learning can enable fast inference of implicit representations, and how the features we learn in the process are already useful to the downstream task of semantic segmentation.
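The contrast with pixel grids can be made concrete: an implicit representation stores a signal as a continuous function f(coordinate) -> value, so it can be queried at any resolution. In the sketch below the "network" is random nonlinear features with a closed-form last layer, a shortcut standing in for the trained MLPs discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-1, 1, 64)[:, None]   # 64 training coordinates
y = np.sin(3 * np.pi * x)             # the signal to represent

# Random fixed first layer; only the linear readout is fit (in closed
# form here, instead of the gradient-based training used in practice).
W1 = rng.normal(0.0, 8.0, (1, 128))
b1 = rng.uniform(-8.0, 8.0, 128)
feats = lambda v: np.tanh(v @ W1 + b1)
W2, *_ = np.linalg.lstsq(feats(x), y, rcond=None)

f = lambda v: feats(v) @ W2           # the continuous representation

# Query at coordinates never stored anywhere: there is no pixel grid.
x_fine = np.linspace(-1, 1, 512)[:, None]
y_fine = f(x_fine)
train_mse = float(np.mean((f(x) - y) ** 2))
```

The signal now lives entirely in the network weights, which is what makes learning priors over such representations (as in the talk) possible.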
Bio: Vincent Sitzmann just finished his PhD at Stanford University with a thesis on "Self-Supervised Scene Representation Learning". His research interest lies in neural scene representations - the way neural networks learn to represent information about our world. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene with information on geometry, material, lighting etc. from only a few observations, a task that is simple for humans, but currently impossible for AI. In July, Vincent will join Joshua Tenenbaum's group at MIT CSAIL as a postdoc. https://vsitzmann.github.io/
Abstract: In this talk I will present our recent work on Neural Radiance Fields (NeRFs) for view synthesis. We are able to achieve state-of-the-art results for synthesizing novel views of scenes with complex geometry and view-dependent effects from a sparse set of input views by optimizing an underlying continuous volumetric scene function parameterized as a fully-connected deep network. In this work we combine the recent advances in coordinate-based neural representations with classic methods for volumetric rendering. In order to recover high-frequency content in the scene, we find that it is necessary to map the input coordinates to a higher-dimensional space using Fourier features before feeding them through the network. In our follow-up work we use Neural Tangent Kernel analysis to show that this is equivalent to transforming our network into a stationary kernel with tunable bandwidth.
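The Fourier feature mapping is short to write down: project the coordinate with a random Gaussian matrix B and take sines and cosines, with the standard deviation of B acting as the tunable bandwidth from the NTK analysis. In this sketch a linear least-squares fit stands in for the fully-connected network; sizes and scales are illustrative:

```python
import numpy as np

def fourier_features(v, B):
    # gamma(v) = [cos(2*pi*B v), sin(2*pi*B v)]
    proj = 2.0 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 128)[:, None]
y = np.sin(16 * np.pi * x)            # high-frequency target signal

# Raw coordinates: a linear fit cannot capture the high frequencies.
A = np.hstack([x, np.ones_like(x)])
w_raw, *_ = np.linalg.lstsq(A, y, rcond=None)
mse_raw = float(np.mean((A @ w_raw - y) ** 2))

# Fourier features: the same linear fit now recovers the signal.
B = rng.normal(0.0, 10.0, (96, 1))    # std of B sets the bandwidth
F = fourier_features(x, B)
w_ff, *_ = np.linalg.lstsq(F, y, rcond=None)
mse_ff = float(np.mean((F @ w_ff - y) ** 2))
```

Choosing the standard deviation of B too small reproduces the low-frequency bias of the raw coordinates; too large and the fit stops generalizing between samples, which is the bandwidth trade-off the NTK view makes precise.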
Bio: Matthew studied Computer Science and Physics at the Massachusetts Institute of Technology, where he received his bachelor's degree in 2016 and master's degree in 2017. During his master's degree he worked under the supervision of Ramesh Raskar and Fredo Durand and published his thesis on non-line-of-sight imaging using data-driven approaches. He began his PhD in 2018 at UC Berkeley under the supervision of Ren Ng. He is currently interested in exploring the intersection of vision and graphics for robotic perception and view synthesis applications. https://www.matthewtancik.com/
Abstract: In this talk, I will cover three works related to 3D scene understanding. I will start with a scene analysis network whose design is inspired by classical robust / non-convex optimization for permutation-equivariant learning (i.e. wide-baseline stereo, robust parametric fitting, and robust classification). I will then introduce a representation of digital humans based on pose-conditioned implicit functions, which, in contrast to classical mesh-based representations (SMPL), can be learned end-to-end, and illustrate their immediate applicability in classical downstream dense tracking applications. I will conclude by introducing a hybrid implicit/explicit differentiable representation of 3D geometry, one that is easy to train (as it uses implicit functions) yet generates polygonal meshes without the need for iso-surface extraction (e.g. marching cubes).
Bio: Andrea Tagliasacchi is a senior research scientist in Google Brain's Toronto office, headed by Geoffrey Hinton, where he leads the inverse graphics research pillar. With the exception of a brief hiatus at Google/Stadia, he spent most of 2018 as visiting faculty in Daydream Augmented Perception (Shahram Izadi), working on 4D data (capture, tracking, compression, modeling, and simulation of geometry). Before joining Google, he was an assistant professor at the University of Victoria (2015-2017), where he held the industrial research chair in 3D sensing. His academic background includes EPFL (postdoc with Mark Pauly), SFU (PhD with Richard Zhang and Daniel Cohen-Or, as an NSERC Alexander Graham Bell fellow), and Politecnico di Milano (gold medalist). He is also an adjunct faculty member at the University of Toronto. http://gfx.uvic.ca/people/ataiya/