SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention

Simon Doll¹,³, Richard Schulz¹, Lukas Schneider¹, Viviane Benzin¹,
Markus Enzweiler², and Hendrik P.A. Lensch³
¹ Mercedes-Benz, simon.doll@mercedes-benz.com
² Esslingen University of Applied Sciences
³ University of Tübingen

Abstract

Based on the key idea of DETR, this paper introduces an
object-centric 3D object detection framework that operates on a limited
number of 3D object queries instead of dense bounding box proposals
followed by non-maximum suppression. After image feature extraction, a
decoder-only transformer architecture is trained with a set-based loss. SpatialDETR
infers the classification and bounding box estimates based on
attention both spatially within each image and across the different views.
To fuse the multi-view information in the attention block, we introduce a
novel geometric positional encoding that incorporates the view ray geometry
to explicitly consider the extrinsic and intrinsic camera setup. This
way, the spatially-aware cross-view attention exploits arbitrary receptive
fields to integrate cross-sensor data and therefore global context. Extensive
experiments on the nuScenes benchmark demonstrate the potential
of global attention and result in state-of-the-art performance. Code is available
at https://github.com/cgtuebingen/SpatialDETR.
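
The view ray geometry mentioned above can be illustrated with a minimal sketch. It is not the authors' positional encoding, only the standard geometric ingredient such an encoding could build on: each pixel is back-projected through the camera intrinsics and rotated by the camera extrinsics into a reference frame shared by all cameras. The function name `view_rays`, its parameters, and the choice of pixel centres at half-integer coordinates are assumptions made for this illustration.

```python
import numpy as np


def view_rays(K, R_cam_to_ref, height, width):
    """Unit view-ray direction for every pixel centre, in a shared reference frame.

    K            : (3, 3) pinhole intrinsic matrix (assumed, no lens distortion).
    R_cam_to_ref : (3, 3) rotation from the camera frame to the reference frame.
    Returns an array of shape (height, width, 3).
    """
    # Homogeneous pixel-centre coordinates [u, v, 1], shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project through the intrinsics: d_cam = K^{-1} [u, v, 1]^T.
    rays_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the shared frame so rays from different cameras are comparable.
    rays_ref = rays_cam @ R_cam_to_ref.T

    # Normalise to unit length; only the direction matters here.
    return rays_ref / np.linalg.norm(rays_ref, axis=-1, keepdims=True)
```

Because the rays of all cameras live in one frame, a feature from any view carries a direction that can be compared with a 3D query position, which is the intuition behind letting attention span several sensors. How SpatialDETR turns such rays into an encoding is described in the paper and the linked repository.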

Bibtex

@inproceedings{Doll2022ECCV,
 author = {Doll, Simon and Schulz, Richard and Schneider, Lukas and Benzin, Viviane and Enzweiler, Markus and Lensch, Hendrik P.A.},
 title = {SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention},
 booktitle = {European Conference on Computer Vision (ECCV)},
 year = {2022}
}