TY - JOUR
T1 - Mimicking human attention in driving scenarios for enhanced Visual Question Answering
T2 - Insights from eye-tracking and the human attention filter
AU - Rekanar, Kaavya
AU - Hayes, Martin J.
AU - Eising, Ciarán
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/12
Y1 - 2025/12
N2 - Visual Question Answering (VQA) models serve a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment. The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.
AB - Visual Question Answering (VQA) models serve a critical role in interpreting visual data and responding to textual queries, particularly within the domain of autonomous driving. These models enhance situational awareness and enable naturalistic interaction between passengers and vehicle systems. However, existing VQA architectures often underperform in driving contexts due to their generic design and lack of alignment with domain-specific perceptual cues. This study introduces a targeted enhancement strategy based on the integration of human visual attention patterns into VQA systems. The proposed approach investigates visual subjectivity by analysing human responses and gaze behaviours captured through an eye-tracking experiment conducted in a realistic driving scenario. This method enables the direct observation of authentic attention patterns and mitigates the limitations introduced by subjective self-reporting. From these findings, a Human Attention Filter (HAF) is constructed to selectively preserve task-relevant features while suppressing visually distracting but semantically irrelevant content. Three VQA models – LXMERT, ViLBERT, and ViLT – are evaluated to demonstrate the adaptability and impact of HAF across different visual representation strategies, including region-based and patch-based architectures. Case studies involving LXMERT and ViLBERT are conducted to assess the integration of the HAF within region-based multimodal pipelines, showing measurable improvements in performance and alignment with human-like attention. Quantitative analysis reveals statistically significant performance trends correlated with driving experience, highlighting cognitive variability among human participants and informing model interpretability. In addition, failure cases are examined to identify potential limitations introduced by attention filtering, offering critical insight into the boundaries of gaze-guided model alignment. The findings validate the effectiveness of human-informed filtering for improving both accuracy and transparency in autonomous VQA systems, and present HAF as a sustainable, cognitively aligned strategy for advancing trustworthy AI in real-world environments.
KW - Eye-tracking
KW - Feature embeddings
KW - Human-attention patterns
KW - Object detection
KW - VQA models
UR - https://www.scopus.com/pages/publications/105016461621
U2 - 10.1016/j.iswa.2025.200578
DO - 10.1016/j.iswa.2025.200578
M3 - Article
AN - SCOPUS:105016461621
SN - 2667-3053
VL - 28
JO - Intelligent Systems with Applications
JF - Intelligent Systems with Applications
M1 - 200578
ER -