Abstract
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music.
We propose PianoBind, a piano-specific multimodal joint embedding model that integrates multiple modalities of solo piano music (audio, symbolic MIDI, and textual descriptions) within a unified embedding space, yielding a more comprehensive representation that captures the fine-grained semantic characteristics of piano music.

Figure 1: Illustration of PianoBind, a multimodal piano music representation model integrating audio, MIDI, and text.
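The paper specifies the actual architecture and training setup; purely as an illustration of the joint-embedding idea above, the sketch below shows how three modality-specific projection heads could map precomputed audio, MIDI, and text features into one shared space and align them with a CLIP-style symmetric contrastive loss. All names and dimensions here (ProjectionHead, TriModalEmbedder, embed_dim=512, etc.) are illustrative assumptions, not the actual PianoBind components.

```python
# Minimal sketch of a trimodal joint embedding model (not the actual PianoBind code).
# Assumption: each modality arrives as a precomputed feature vector, and alignment
# uses a CLIP-style symmetric InfoNCE loss between every pair of modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps modality-specific features into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


class TriModalEmbedder(nn.Module):
    """Hypothetical trimodal model: audio, MIDI, and text share one space."""
    def __init__(self, audio_dim=768, midi_dim=512, text_dim=768, embed_dim=512):
        super().__init__()
        self.audio_proj = ProjectionHead(audio_dim, embed_dim)
        self.midi_proj = ProjectionHead(midi_dim, embed_dim)
        self.text_proj = ProjectionHead(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def contrastive(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Symmetric InfoNCE between two aligned batches of embeddings."""
        logits = self.logit_scale.exp() * (a @ b.t())
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def forward(self, audio_feat, midi_feat, text_feat):
        za = self.audio_proj(audio_feat)
        zm = self.midi_proj(midi_feat)
        zt = self.text_proj(text_feat)
        # Align every modality pair so all three modalities meet in one space.
        return (self.contrastive(za, zt) + self.contrastive(zm, zt) +
                self.contrastive(za, zm)) / 3


# Toy usage with random features standing in for real encoder outputs.
model = TriModalEmbedder()
loss = model(torch.randn(8, 768), torch.randn(8, 512), torch.randn(8, 768))
print(float(loss))
```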
Text-to-music Retrieval
Listening examples on this page compare the top three retrieved tracks (Rank 1-3) from PianoBind (Ours) and CLaMP3 for the following text queries:
- Upbeat, Ragtime, Playful, Bright, Happy
- Pop-piano cover, Passionate, Speedy, Powerful, Difficult
- Bossa Nova, Jazz, Dreamy, Upbeat
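The examples above rank candidate tracks against each text query. With a shared embedding space, a standard way to do this (assumed in the sketch below; the actual retrieval pipeline may differ) is to embed the query, embed every candidate track, and sort by cosine similarity.

```python
# Sketch of text-to-music retrieval over a shared embedding space.
# Assumption: embeddings are unit-normalized, so the dot product equals cosine similarity.
import torch
import torch.nn.functional as F


def retrieve(query_emb: torch.Tensor, track_embs: torch.Tensor, k: int = 3):
    """Return the indices of the top-k tracks for one text query embedding."""
    sims = track_embs @ query_emb          # (num_tracks,) cosine similarities
    return torch.topk(sims, k).indices.tolist()


# Toy example: 100 tracks, 512-dim unit-norm embeddings standing in for model outputs.
track_embs = F.normalize(torch.randn(100, 512), dim=-1)
query_emb = F.normalize(torch.randn(512), dim=-1)
print(retrieve(query_emb, track_embs, k=3))  # indices of the Rank 1-3 results
```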
Key Results
Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models.
- Significant improvement in Recall@10 (52.76% vs. 22.11% for the best baseline; see the metric sketch after this list)
- Lower Median Rank (10 vs. 39 for the best baseline) on in-domain evaluation
- Superior performance on out-of-domain evaluation with natural language queries
- Trimodal integration consistently outperforms bimodal approaches
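For reference, the two reported metrics can be computed from a query-by-track similarity matrix as in the generic sketch below; this is a standard formulation, not the paper's evaluation code, and it assumes query i's ground-truth item is track i.

```python
# Generic computation of Recall@K and Median Rank for text-to-music retrieval.
import torch


def retrieval_metrics(sims: torch.Tensor, k: int = 10):
    """sims[i, j] = similarity between text query i and track j (ground truth on the diagonal)."""
    num_queries = sims.size(0)
    order = sims.argsort(dim=1, descending=True)          # tracks sorted per query
    gt = torch.arange(num_queries, device=sims.device).unsqueeze(1)
    ranks = (order == gt).nonzero()[:, 1] + 1             # 1-based rank of the true track
    recall_at_k = (ranks <= k).float().mean().item()      # fraction of queries hit within top-k
    median_rank = ranks.float().median().item()
    return recall_at_k, median_rank


# Toy example with random similarities.
sims = torch.randn(200, 200)
r10, med = retrieval_metrics(sims, k=10)
print(f"Recall@10 = {r10:.2%}, Median Rank = {med:.0f}")
```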