From Sound to Sight: Towards AI-authored Music Videos

  • Leo Vitasovic
  • Stella Grasshof
  • Agnes Mercedes Kloft
  • Ville V. Lehtola
  • Martin Cunneen
  • Justyna Starostka
  • Glenn McGarry
  • Kun Li
  • Sami S. Brandt

  • IT University of Copenhagen
  • Aalto University
  • University of Twente
  • University of Nottingham

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Conventional music visualisation systems rely on hand-crafted, ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified song, vocal or instrumental, using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we investigate how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. We then employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation, which demonstrates storytelling potential, visual coherence and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3851-3861
Number of pages: 11
ISBN (Electronic): 9798331589882
Publication status: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: 19 Oct 2025 to 20 Oct 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 19/10/25 to 20/10/25

Keywords

  • computational synesthesia
  • contrastive language-audio pre-training
  • generative AI
  • large audio language models
  • large language models
  • music analysis
  • music visualisation
  • storytelling
  • video generation
