PianoBind: A Multimodal Joint Embedding Model for Pop-Piano Music

Anonymous Authors · Anonymous Affiliations

Abstract

Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music.

We propose PianoBind, a piano-specific multimodal joint embedding model that integrates multiple modalities of solo piano music (audio, symbolic MIDI, and textual descriptions) within a unified embedding space, yielding a more comprehensive representation that captures the fine-grained semantic characteristics of piano music.

Figure 1: Illustration of PianoBind, a multimodal piano music representation model integrating audio, MIDI, and text.
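The exact encoders, embedding dimension, and training objective of PianoBind are specified in the paper; as a rough illustration of the joint embedding idea in Figure 1, the sketch below assumes generic (hypothetical) audio, MIDI, and text encoders whose outputs are projected into one shared space and aligned with a CLIP-style symmetric contrastive loss.

```python
# Minimal sketch of a tri-modal joint embedding model. All module names,
# dimensions, and the loss choice here are illustrative assumptions, not
# PianoBind's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps an encoder output into the shared embedding space (unit-norm)."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

class TriModalJointEmbedding(nn.Module):
    """Audio, MIDI, and text encoders projected into one shared space."""
    def __init__(self, audio_encoder, midi_encoder, text_encoder,
                 audio_dim, midi_dim, text_dim, embed_dim=512):
        super().__init__()
        self.audio_encoder = audio_encoder
        self.midi_encoder = midi_encoder
        self.text_encoder = text_encoder
        self.audio_proj = ProjectionHead(audio_dim, embed_dim)
        self.midi_proj = ProjectionHead(midi_dim, embed_dim)
        self.text_proj = ProjectionHead(text_dim, embed_dim)
        # Learnable temperature, initialized as in CLIP (log(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, audio, midi, text):
        a = self.audio_proj(self.audio_encoder(audio))
        m = self.midi_proj(self.midi_encoder(midi))
        t = self.text_proj(self.text_encoder(text))
        return a, m, t

def contrastive_loss(x, y, logit_scale):
    """Symmetric InfoNCE loss between two aligned batches of embeddings."""
    logits = logit_scale.exp() * x @ y.t()
    labels = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Training could sum pairwise losses over the three modality pairs, e.g.
#   loss = contrastive_loss(a, t, s) + contrastive_loss(m, t, s) + contrastive_loss(a, m, s)
# Whether PianoBind uses exactly this combination is described in the paper.
```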

Text-to-Music Retrieval

For each text query below, the top-3 retrieved tracks (Rank 1-3) of PianoBind (Ours) are compared against those of CLaMP3 (audio examples).

Query 1: Upbeat, Ragtime, Playful, Bright, Happy
Query 2: Pop-piano cover, Passionate, Speedy, Powerful, Difficult
Query 3: Bossa Nova, Jazz, Dreamy, Upbeat
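As a hypothetical illustration of how such rankings can be produced with a joint embedding model (reusing the TriModalJointEmbedding sketch above; tokenize stands in for whatever preprocessing the text encoder expects), a query is embedded once and candidate tracks are ranked by cosine similarity:

```python
# Hypothetical text-to-music retrieval routine: rank precomputed, unit-norm
# track embeddings against an embedded free-text query.
import torch

@torch.no_grad()
def retrieve_top_k(model, tokenize, query, audio_embeds, track_ids, k=3):
    """audio_embeds: (N, D) unit-norm track embeddings; returns [(track_id, score)]."""
    text_embed = model.text_proj(model.text_encoder(tokenize(query)))  # (1, D)
    sims = (text_embed @ audio_embeds.t()).squeeze(0)                  # cosine similarities
    top = sims.topk(k)
    return [(track_ids[i], sims[i].item()) for i in top.indices.tolist()]

# e.g. retrieve_top_k(model, tokenize, "Upbeat, Ragtime, Playful, Bright, Happy",
#                     audio_embeds, track_ids)  ->  Rank 1-3 tracks with scores
```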

Key Results

Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture the subtle nuances of piano music. Compared with general-purpose music joint embedding models, PianoBind achieves superior text-to-music retrieval performance on both in-domain and out-of-domain piano datasets.
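The paper reports the specific datasets and metrics; a common way to quantify such text-to-music retrieval comparisons is Recall@K over paired (text, audio) test items. A minimal sketch, assuming a precomputed N x N text-to-audio similarity matrix whose diagonal holds the matched pairs:

```python
# Standard Recall@K for cross-modal retrieval (assumed evaluation setup, not
# necessarily the paper's exact protocol).
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """sim: (N, N) text-to-audio similarities; sim[i, i] is the matched pair.
    Returns the fraction of queries whose matched track ranks in the top-k."""
    ranks = sim.argsort(dim=1, descending=True)        # (N, N) candidate indices, best first
    targets = torch.arange(sim.size(0)).unsqueeze(1)   # (N, 1) ground-truth indices
    hits = (ranks[:, :k] == targets).any(dim=1)
    return hits.float().mean().item()
```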