Surrounding-aware representation prediction in Birds-Eye-View using transformers

Front Neurosci. 2023 Jul 4:17:1219363. doi: 10.3389/fnins.2023.1219363. eCollection 2023.

Abstract

Birds-Eye-View (BEV) maps provide an accurate representation of sensory cues in a vehicle's surroundings, including both dynamic and static elements. Generating a semantic BEV representation is challenging because it relies on object detection and image segmentation. Recent studies have applied Convolutional Neural Networks (CNNs) to this problem. However, current CNN-based models struggle to perceive subtle nuances of information due to their limited capacity, which constrains the efficiency and accuracy of representation prediction, especially for multi-scale and multi-class elements. To address this issue, we propose novel neural networks for BEV semantic representation prediction that are built entirely on Transformers, without convolution layers, in contrast to existing pure-CNN and hybrid CNN-Transformer architectures. Given a sequence of image frames as input, the proposed networks directly output BEV maps with per-class probabilities in an end-to-end fashion. The core innovations of this study are (1) a new Transformer-based pixel generation method, (2) a novel algorithm for image-to-BEV transformation, and (3) a novel network for image feature extraction using attention mechanisms. We evaluate the proposed model's performance on two challenging benchmarks, the NuScenes and Argoverse 3D datasets, and compare it with state-of-the-art methods. Results show that the proposed model outperforms CNN-based baselines, achieving relative improvements of 2.4% and 5.2% on the NuScenes and Argoverse 3D datasets, respectively.
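The abstract describes mapping image features to a BEV grid with per-class probabilities using attention rather than convolutions. A minimal illustrative sketch of such an image-to-BEV transformation is given below: learned queries, one per BEV grid cell, cross-attend over flattened image feature tokens, and a linear head produces per-class logits. All layer sizes and module names here are hypothetical and do not reproduce the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ImageToBEV(nn.Module):
    """Sketch of attention-based image-to-BEV transformation (illustrative only)."""

    def __init__(self, dim=64, bev_h=8, bev_w=8, num_classes=4):
        super().__init__()
        # One learnable query per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        # Cross-attention: BEV queries attend over image feature tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Per-cell classification head producing class logits.
        self.head = nn.Linear(dim, num_classes)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, img_feats):
        # img_feats: (batch, num_tokens, dim) flattened image features.
        b = img_feats.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, img_feats, img_feats)   # cross-attention
        logits = self.head(bev)                        # (b, H*W, classes)
        return logits.view(b, self.bev_h, self.bev_w, -1)

model = ImageToBEV()
feats = torch.randn(2, 100, 64)        # e.g. 10x10 grid of image tokens
bev_logits = model(feats)              # (2, 8, 8, 4)
bev_probs = bev_logits.softmax(dim=-1) # per-class probabilities per cell
```

A softmax over the last dimension turns the logits into the per-class probabilities mentioned in the abstract; a real system would stack several such attention layers and add positional encodings.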

Keywords: BEV maps; attention; autonomous driving; deep learning; transformers.

Grants and funding

The authors would like to acknowledge the support from the Shenzhen Science and Technology Program (JCYJ20210324115604012, JSGG20220606140201003, JCYJ20220818103006012, and ZDSYS20220606100601002), the Guangdong Basic and Applied Basic Research Foundation (2021B1515120008 and 2023A1515011347), and the Institute of Artificial Intelligence and Robotics for Society.