Abstract
Automatic Audio Captioning (AAC) aims to generate natural language descriptions of audio content. However, existing methods are often affected by latent confounders and spurious co-occurrence patterns in the data, leading to bias and semantic inaccuracies. This paper proposes FD-DeCap, a framework for the AAC task based on front-door causal inference. The framework consists of three core components: (1) an AudioAug module, which introduces noise perturbations into audio features to enhance robustness against environmental interference; (2) a MedGate module, which explicitly introduces a mediator variable to satisfy the identifiability conditions of the front-door criterion, thereby disentangling direct and indirect effects; and (3) an MSeCE consistency loss, which jointly optimizes cross-entropy and MSE constraints, encouraging reliance on mediator representations rather than spurious correlations. Experimental results show that FD-DeCap achieves stable performance improvements over state-of-the-art frameworks on the Clotho and AudioCaps datasets, with SPIDEr scores of 0.282 and 0.429, respectively. A multi-perspective causal validation of the front-door adjustment, performed on the Clotho dataset, includes analyses of similarity-score distributions, feature distributions, and representative case studies. After debiasing, the similarity between generated and reference captions shifts upward overall, the mediator feature distributions become more dispersed, and the representative cases capture the true acoustic scenes more accurately. These findings indicate that FD-DeCap effectively alleviates bias caused by latent confounders and spurious co-occurrence, enhances the semantic consistency and robustness of generated captions, and provides a novel solution for the AAC task in complex acoustic scenarios.
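For readers unfamiliar with the front-door criterion mentioned in the abstract, its standard textbook form is sketched below. The mapping of symbols is an assumption made here for illustration and is not taken from the paper: $X$ stands for the audio input, $M$ for the mediator representation, and $Y$ for the generated caption. How FD-DeCap approximates these terms in practice is not stated in the abstract.

```latex
% Standard front-door adjustment (general form, not the paper's specific estimator).
% Assumed roles: X = audio input, M = mediator, Y = caption.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x')\, P(x')
```

The outer sum routes the effect of $x$ through the mediator, while the inner sum averages over other inputs $x'$, which is what removes the influence of an unobserved confounder acting on both $X$ and $Y$.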
| Original language | English |
|---|---|
| Pages (from-to) | 6029-6042 |
| Number of pages | 14 |
| Journal | IEEE Access |
| Volume | 14 |
| DOIs | |
| Publication status | Published - 2026 |
Keywords
- Automatic Audio Captioning (AAC)
- Bias
- Causal Inference
- Front-Door Adjustment
Fingerprint
Dive into the research topics of 'FD-DeCap: A Front-Door Causal Inference-based Framework for Debiasing Automatic Audio Captioning'. Together they form a unique fingerprint.