Abstract
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music.
We propose PianoBind, a piano-specific multimodal joint embedding model that integrates multiple modalities of solo piano music (audio, symbolic MIDI, and textual descriptions) within a unified embedding space, yielding a more comprehensive representation that captures the fine-grained semantic characteristics of piano music.

Figure 1: Illustration of PianoBind, a multimodal piano music representation model integrating audio, MIDI, and text.
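The paper specifies the actual architecture and training setup; purely as an illustration of the joint-embedding idea above, the sketch below shows how three modality-specific projection heads could map precomputed audio, MIDI, and text features into one shared space and align them with a CLIP-style symmetric contrastive loss. All names and dimensions here (ProjectionHead, TriModalEmbedder, embed_dim=512, etc.) are illustrative assumptions, not the actual PianoBind components.

```python
# Minimal sketch of a trimodal joint embedding model (not the actual PianoBind code).
# Assumption: each modality arrives as a precomputed feature vector, and alignment
# uses a CLIP-style symmetric InfoNCE loss between every pair of modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps modality-specific features into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


class TriModalEmbedder(nn.Module):
    """Hypothetical trimodal model: audio, MIDI, and text share one space."""
    def __init__(self, audio_dim=768, midi_dim=512, text_dim=768, embed_dim=512):
        super().__init__()
        self.audio_proj = ProjectionHead(audio_dim, embed_dim)
        self.midi_proj = ProjectionHead(midi_dim, embed_dim)
        self.text_proj = ProjectionHead(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def contrastive(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Symmetric InfoNCE between two aligned batches of embeddings."""
        logits = self.logit_scale.exp() * (a @ b.t())
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, audio_feat, midi_feat, text_feat):
        za = self.audio_proj(audio_feat)
        zm = self.midi_proj(midi_feat)
        zt = self.text_proj(text_feat)
        # Align every modality pair so all three modalities meet in one space.
        return (self.contrastive(za, zt) + self.contrastive(zm, zt) +
                self.contrastive(za, zm)) / 3


# Toy usage with random features standing in for real encoder outputs.
model = TriModalEmbedder()
loss = model(torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 768))
print(float(loss))
```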
Text-to-music Retrieval
Listening examples on this page compare the top three retrieved tracks (Rank 1-3) from PianoBind (Ours) and CLaMP3 for the following text queries:
- Upbeat, Ragtime, Playful, Bright, Happy
- Pop-piano cover, Passionate, Speedy, Powerful, Difficult
- Bossa Nova, Jazz, Dreamy, Upbeat
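The examples above rank candidate tracks against each text query. With a shared embedding space, a standard way to do this (assumed in the sketch below; the actual retrieval pipeline may differ) is to embed the query, embed every candidate track, and sort by cosine similarity.

```python
# Sketch of text-to-music retrieval over a shared embedding space.
# Assumption: embeddings are unit-normalized, so the dot product equals cosine similarity.
import torch
import torch.nn.functional as F


def retrieve(query_emb: torch.Tensor, track_embs: torch.Tensor, k: int = 3):
    """Return the indices of the top-k tracks for one text query embedding."""
    sims = track_embs @ query_emb          # (num_tracks,) cosine similarities
    return torch.topk(sims, k).indices.tolist()


# Toy example: 100 tracks, 512-dim unit-norm embeddings standing in for model outputs.
track_embs = F.normalize(torch.randn(100, 512), dim=-1)
query_emb = F.normalize(torch.randn(512), dim=-1)
print(retrieve(query_emb, track_embs, k=3))  # indices of the Rank 1-3 results
```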
Key Results
Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models.
- Significant improvement in Recall@10 (52.76% vs. 22.11% for the best baseline; see the metric sketch after this list)
- Lower Median Rank (10 vs. 39 for the best baseline) on in-domain evaluation
- Superior performance on out-of-domain evaluation with natural language queries
- Trimodal integration consistently outperforms bimodal approaches
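For reference, the two reported metrics can be computed from a query-by-track similarity matrix as in the generic sketch below; this is a standard formulation, not the paper's evaluation code, and it assumes query i's ground-truth item is track i.

```python
# Generic computation of Recall@K and Median Rank for text-to-music retrieval.
import torch


def retrieval_metrics(sims: torch.Tensor, k: int = 10):
    """sims[i, j] = similarity between text query i and track j (ground truth on the diagonal)."""
    num_queries = sims.size(0)
    order = sims.argsort(dim=1, descending=True)          # tracks sorted per query
    gt = torch.arange(num_queries, device=sims.device).unsqueeze(1)
    ranks = (order == gt).nonzero()[:, 1] + 1             # 1-based rank of the true track
    recall_at_k = (ranks <= k).float().mean().item()      # fraction of queries hit within top-k
    median_rank = ranks.float().median().item()
    return recall_at_k, median_rank


# Toy example with random similarities.
sims = torch.randn(200, 200)
r10, med = retrieval_metrics(sims, k=10)
print(f"Recall@10 = {r10:.2%}, Median Rank = {med:.0f}")
```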