TY - GEN
T1 - From Sound to Sight
T2 - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
AU - Vitasovic, Leo
AU - Grasshof, Stella
AU - Kloft, Agnes Mercedes
AU - Lehtola, Ville V.
AU - Cunneen, Martin
AU - Starostka, Justyna
AU - McGarry, Glenn
AU - Li, Kun
AU - Brandt, Sami S.
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Conventional music visualisation systems rely on hand-crafted, ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified song, vocal or instrumental, using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we investigate how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. A generative model then produces the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherence and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.
KW - computational synesthesia
KW - contrastive language-audio pre-training
KW - generative AI
KW - large audio language models
KW - large language models
KW - music analysis
KW - music visualisation
KW - storytelling
KW - video generation
UR - https://www.scopus.com/pages/publications/105035144103
U2 - 10.1109/ICCVW69036.2025.00401
DO - 10.1109/ICCVW69036.2025.00401
M3 - Conference contribution
AN - SCOPUS:105035144103
T3 - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
SP - 3851
EP - 3861
BT - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 October 2025 through 20 October 2025
ER -