Xavier Serra (UPF) — Large-scale Self-supervised Audio Representation Models for Music Understanding
📋 "Large-scale Self-supervised Audio Representation Models for Music Understanding" led by Xavier Serra from Music Technology Group at Universitat Pompeu Fabra - Barcelona
Digital music platforms and AI-powered tools are transforming the audio industry and the way people consume music. However, unlike the symbolic text processed by typical language models, audio is a physical signal, which presents unique challenges for modeling and training these AIs.
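To make that contrast concrete: a raw waveform is a one-dimensional array of air-pressure samples, not a sequence of symbolic tokens, and it is typically converted into spectrogram frames before any learning happens. Here is a minimal sketch with librosa (the file name, sample rate, and mel settings are illustrative assumptions, not the project's actual configuration):

```python
import librosa

# Load an audio file as a waveform: a 1-D array of air-pressure
# samples, not a sequence of symbolic tokens like text.
# "song.wav" and the 16 kHz sample rate are illustrative choices.
y, sr = librosa.load("song.wav", sr=16000)

# Convert the physical signal into log-mel spectrogram frames, a
# common input representation for audio models.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)  # shape: (80 mel bands, num_frames)
print(log_mel.shape)
```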
🖥️ Thanks to the RES supercomputer #MareNostrum5 ACC, the team could train and evaluate representation models on a large dataset of 300,000 hours of music. Such scale is usually beyond the scope and budget of open academic research, leaving state-of-the-art models inaccessible (e.g., ByteDance's MusicFM).
The models are based on BestRQ, a self-supervised learning paradigm that predicts masked features (or tokens) from the input data and achieves state-of-the-art results across the speech, sound, and music domains. The resulting models were able to predict several of these features simultaneously, and boosted their performance through specific combinations of target features.
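For the technically curious, here is a minimal sketch of the BestRQ-style objective described above, assuming a PyTorch setup: a frozen random projection and a frozen random codebook turn spectrogram frames into discrete target tokens, random frames are masked, and an encoder is trained to predict the tokens at the masked positions. The module names, sizes, and single-codebook setup are illustrative assumptions, not the project's actual implementation (which combines several target features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomProjectionQuantizer(nn.Module):
    """BestRQ-style target generator: a frozen random projection followed
    by nearest-neighbour lookup in a frozen random codebook. All sizes
    here are illustrative, not the project's actual settings."""
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192):
        super().__init__()
        self.register_buffer("proj", torch.randn(feat_dim, code_dim))
        self.register_buffer(
            "codebook", F.normalize(torch.randn(codebook_size, code_dim), dim=-1)
        )

    @torch.no_grad()
    def forward(self, mel):                       # mel: (batch, time, feat_dim)
        z = F.normalize(mel @ self.proj, dim=-1)  # project and L2-normalize
        b, t, d = z.shape
        dists = torch.cdist(z.reshape(-1, d), self.codebook)
        return dists.argmin(dim=-1).view(b, t)    # one target token per frame

def bestrq_loss(encoder, head, quantizer, mel, mask_prob=0.15):
    """Mask random frames, encode the corrupted input, and predict the
    quantizer's tokens only at the masked positions (cross-entropy).
    `encoder` and `head` are hypothetical modules mapping
    (batch, time, feat_dim) -> (batch, time, hidden) -> (batch, time, codebook_size)."""
    targets = quantizer(mel)                                # (batch, time)
    mask = torch.rand(mel.shape[:2], device=mel.device) < mask_prob
    corrupted = mel.masked_fill(mask.unsqueeze(-1), 0.0)    # zero out masked frames
    logits = head(encoder(corrupted))
    return F.cross_entropy(logits[mask], targets[mask])
```

Because the targets come from a frozen random quantizer rather than a learned one, the encoder can be trained at scale without the codebook-collapse issues of jointly learned tokenizers.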
These openly shared, state-of-the-art models enable reproducibility and practical use in applications such as next-generation music recommendation systems, AI-assisted composition tools, and sophisticated audio analysis.
📸 The image shows an audio processing pipeline that takes a raw audio signal as input and transforms it through multiple stages, using both knowledge-based and data-driven representations to analyze music.