TY - JOUR
T1 - Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
AU - Theodoridis, Nikos
AU - Brophy, Tim
AU - Mohandas, Reenu
AU - Sistu, Ganesh
AU - Collins, Fiachra
AU - Scanlan, Anthony
AU - Eising, Ciarán
N1 - Publisher Copyright:
© 2026 IEEE.
PY - 2026
Y1 - 2026
N2 - Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on tasks that require both visual and textual understanding. Their strong generalization abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Since critical objects and agents in traffic scenes are often at a distance, we require systems that are not “shortsighted,” i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. We evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (∼60% average accuracy for the best-performing small VLM versus ∼85% human performance). However, the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging. We hope our findings will encourage further research into improving the perception capabilities of small VLMs in traffic scenarios, making them more suitable for automated driving applications.
AB - Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on tasks that require both visual and textual understanding. Their strong generalization abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Since critical objects and agents in traffic scenes are often at a distance, we require systems that are not “shortsighted,” i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. We evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (∼60% average accuracy for the best-performing small VLM versus ∼85% human performance). However, the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging. We hope our findings will encourage further research into improving the perception capabilities of small VLMs in traffic scenarios, making them more suitable for automated driving applications.
KW - Automated driving
KW - perception
KW - Vision-Language Models
KW - VQA benchmark
UR - https://www.scopus.com/pages/publications/105021030159
U2 - 10.1109/OJVT.2025.3629318
DO - 10.1109/OJVT.2025.3629318
M3 - Article
AN - SCOPUS:105021030159
SN - 2644-1330
VL - 7
SP - 54
EP - 72
JO - IEEE Open Journal of Vehicular Technology
JF - IEEE Open Journal of Vehicular Technology
ER -