TY - JOUR
T1 - Leveraging Frozen Foundation Models and Multimodal Fusion for BEV Segmentation and Occupancy Prediction
AU - Hayes, Seamie
AU - Sistu, Ganesh
AU - Eising, Ciaran
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - In Bird's Eye View (BEV) perception, significant emphasis is placed on deploying complex, well-performing model architectures and leveraging as many sensor modalities as possible to reach maximal performance. This paper investigates whether foundation models and multi-sensor deployments are essential for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the number of sensor modalities, and assess whether foundation models can address feature extraction limitations and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2 for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework yields a 7.4-point IoU increase in vehicle segmentation, a relative improvement of 22.4%, while requiring only half the training data and iterations of the original model. Furthermore, using Metric3Dv2’s depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU by 2.9 points, a 6.1% relative increase over the Camera-only setup. Finally, we extend the prominent Gaussian Splatting BEV perception models GaussianFormer and GaussianOcc through multimodal deployment. Adding LiDAR information to GaussianFormer results in a 9.4-point increase in mIoU, a 48.7% improvement over the Camera-only model, approaching state-of-the-art multimodal performance even with limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR yields a 0.36-point increase in mIoU, a 3.6% improvement over the Camera-only model; this limited gain can be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.
AB - In Bird's Eye View (BEV) perception, significant emphasis is placed on deploying complex, well-performing model architectures and leveraging as many sensor modalities as possible to reach maximal performance. This paper investigates whether foundation models and multi-sensor deployments are essential for enhancing BEV perception. We examine the relative importance of advanced feature extraction versus the number of sensor modalities, and assess whether foundation models can address feature extraction limitations and reduce the need for extensive training data. Specifically, incorporating the self-supervised DINOv2 for feature extraction and Metric3Dv2 for depth estimation into the Lift-Splat-Shoot framework yields a 7.4-point IoU increase in vehicle segmentation, a relative improvement of 22.4%, while requiring only half the training data and iterations of the original model. Furthermore, using Metric3Dv2’s depth maps as a pseudo-LiDAR point cloud within the Simple-BEV model improves IoU by 2.9 points, a 6.1% relative increase over the Camera-only setup. Finally, we extend the prominent Gaussian Splatting BEV perception models GaussianFormer and GaussianOcc through multimodal deployment. Adding LiDAR information to GaussianFormer results in a 9.4-point increase in mIoU, a 48.7% improvement over the Camera-only model, approaching state-of-the-art multimodal performance even with limited LiDAR scans. In the self-supervised GaussianOcc model, incorporating LiDAR yields a 0.36-point increase in mIoU, a 3.6% improvement over the Camera-only model; this limited gain can be attributed to the absence of LiDAR encoding and the self-supervised nature of the model. Overall, our findings highlight the critical role of foundation models and multi-sensor integration in advancing BEV perception. By leveraging sophisticated foundation models and multi-sensor deployment, we can further improve model performance and reduce data requirements, addressing key challenges in BEV perception.
KW - Bird's eye view
KW - foundation model
KW - LiDAR
KW - multimodal
KW - semantic occupancy
UR - https://www.scopus.com/pages/publications/105003498469
U2 - 10.1109/OJVT.2025.3563677
DO - 10.1109/OJVT.2025.3563677
M3 - Article
AN - SCOPUS:105003498469
SN - 2644-1330
VL - 6
SP - 1241
EP - 1261
JO - IEEE Open Journal of Vehicular Technology
JF - IEEE Open Journal of Vehicular Technology
ER -