Mood Classification
Mood and energy estimation is hard because there is no ground truth — “energy 0.75” is a subjective label, not a physical measurement. TuneLab uses MAEST (Music Audio Efficient Spectrogram Transformer) to produce rich 768-dimensional embeddings, then trains small MLP heads on top for each mood dimension. The result: six independent scores with known, published accuracy — not a black box regression with mystery numbers.
Why embeddings-first?
Training a separate CNN end-to-end for every mood feature is expensive, wasteful, and hard to maintain. The modern approach — proven across NLP and now standard in audio — is to pre-train one large foundation model on a massive corpus, freeze its embeddings, and fit lightweight heads on top for each downstream task.
Concretely, this gives us:
- A single encoder forward pass amortized across all 6 heads (plus similarity search, plus genre, plus future tasks).
- Heads that are 2–3 layer MLPs, trainable in minutes on a laptop. Easy to retrain when new labeled data arrives.
- A stable embedding space that lets Pro customers train their own classifiers for custom taxonomies (see below).
- Deterministic, reproducible pipelines — the embedding is a content hash of the audio, the heads are pure functions on top.
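The amortization pattern above can be sketched as follows. The stub encoder and the uniform head weights are placeholders for illustration, not TuneLab internals:

```python
import random

EMB_DIM = 768

def frozen_encoder(audio_path):
    # Stand-in for the frozen MAEST forward pass: one deterministic
    # 768-dim vector per track. NOT the real model.
    rng = random.Random(sum(map(ord, audio_path)))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def head(embedding, weight, bias):
    # Trivial linear head; the production heads are 2-3 layer MLPs.
    return sum(w * x for w, x in zip(weight, embedding)) + bias

HEAD_NAMES = ("energy", "danceability", "happiness",
              "acousticness", "instrumentalness", "speechiness")
heads = {name: ([0.01] * EMB_DIM, 0.0) for name in HEAD_NAMES}

# One encoder pass per track, amortized across every head.
emb = frozen_encoder("track.mp3")
scores = {name: head(emb, w, b) for name, (w, b) in heads.items()}
assert len(scores) == 6
```

The key property is that adding a seventh head costs one more cheap dot product, not another encoder pass.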
The MAEST encoder
- Architecture: Transformer pre-trained on ~5M music tracks with supervised training on style labels (see the MAEST paper under Further reading).
- Input: log-mel spectrogram, 128 mel bins × 30 seconds of audio.
- Output: a single 768-dimensional embedding per track (pooled from the encoder’s intermediate layer — layer 6 specifically, which empirically gives the best transfer performance).
- Inference: ~200 ms per track on CPU, no GPU needed.
- Size: 40M parameters — an order of magnitude smaller than MuLan or AudioMAE, but task-appropriate and efficient to serve at scale.
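The pooling step can be illustrated with toy numbers: mean-pooling a sequence of per-frame token vectors from an intermediate layer down to one 768-dimensional track embedding. The token count of 4 here is arbitrary; real sequences are much longer:

```python
# Mean-pool T token embeddings (T x 768) from an intermediate
# transformer layer into a single 768-dim track embedding.
T, D = 4, 768
tokens = [[float(t + d % 3) for d in range(D)] for t in range(T)]

pooled = [sum(tok[d] for tok in tokens) / T for d in range(D)]
assert len(pooled) == D
print(pooled[:3])  # → [1.5, 2.5, 3.5]
```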
The 6 heads
Each dimension is trained independently, with its own dataset and loss function:
| Dimension | Range | Training data | What it measures |
|---|---|---|---|
| energy | 0–100 | Internal 114K track corpus | Overall intensity — loud, fast, dense arrangements score high |
| danceability | 0–100 | Spotify-adjacent labels (external) | Rhythmic regularity + beat strength + tempo stability |
| happiness | 0–100 | MTG-Jamendo mood tags | Valence — major key bias, bright timbre, uptempo |
| acousticness | 0–100 | Internal corpus | Proportion of acoustic vs electronic instrumentation |
| instrumentalness | 0–100 | MIR vocal detection | Probability that the track contains no vocals |
| speechiness | 0–100 | GTZAN + podcast audio | Probability of spoken word content (not singing) |
Training methodology
Each head is a small 2-layer MLP: 768 → 256 → 1, with ReLU in the middle and a sigmoid on the output. Loss depends on the label type:
- MSE for continuous regression targets (energy, danceability, happiness, acousticness).
- Binary cross-entropy for binary-ish targets (instrumentalness, speechiness), which behave more like classification than regression.
Training runs for ~50 epochs with early stopping on validation Pearson correlation. Frozen embeddings mean each head fits in seconds; the expensive part is precomputing the 40M-parameter encoder pass on the training set once.
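A minimal sketch of one head's forward pass and the two loss functions, using toy dimensions and stdlib Python only. The production heads are 768 → 256 → 1 and would typically be trained in an ML framework; weights here are random placeholders:

```python
import math, random

random.seed(1)

IN, HID = 8, 4  # toy sizes; production heads are 768 -> 256 -> 1

W1 = [[random.uniform(-0.5, 0.5) for _ in range(IN)] for _ in range(HID)]
b1 = [0.0] * HID
W2 = [random.uniform(-0.5, 0.5) for _ in range(HID)]
b2 = 0.0

def head(x):
    # Hidden layer with ReLU, then a scalar output through a sigmoid.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def mse(pred, target):
    # Continuous dims: energy, danceability, happiness, acousticness.
    return (pred - target) ** 2

def bce(pred, target):
    # Binary-ish dims: instrumentalness, speechiness.
    eps = 1e-9
    return -(target * math.log(pred + eps)
             + (1 - target) * math.log(1 - pred + eps))

x = [random.uniform(-1.0, 1.0) for _ in range(IN)]
p = head(x)
assert 0.0 < p < 1.0
```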
Validation scores
Pearson correlation against human-labeled ground truth on a held-out 10K track set. Higher is better; values above ~0.7 indicate the head is useful in practice for ranking and filtering.
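For reference, the validation metric itself, sketched in plain Python on made-up score pairs:

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation between predictions and labels.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

preds  = [10, 35, 40, 70, 90]   # model scores (toy data)
labels = [12, 30, 50, 65, 95]   # human labels (toy data)
print(round(pearson_r(preds, labels), 3))  # → 0.979
```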
| Dimension | Pearson r | Ship? |
|---|---|---|
| energy | 0.812 | Yes |
| danceability | 0.776 | Yes |
| happiness | 0.703 | Yes |
| acousticness | 0.854 | Yes |
| instrumentalness | 0.791 | Yes |
| speechiness | 0.723 | Yes |
| liveness | 0.448 | Dropped in v1 |
A Pearson correlation of 0.448 is well below our ~0.7 usefulness threshold: too weak to support reliable ranking or filtering. We believe liveness needs full-track audio (not 30 s) plus a binary classification framing rather than regression. It is on the v1.1 backlog.
Integer 0–100, not float 0.0–1.0
The API returns mood scores as integers in the range 0–100. Internally the models produce floats, but we round at the API boundary for two reasons:
- Ordinal, not interval. Mood scores are fundamentally ranking-like — there’s no meaningful difference between 0.734 and 0.739. Integer precision matches what the data actually supports.
- Human-readable. An energy of 82 is more intuitive to developers building UIs than 0.8234167, and it cuts the JSON payload size. It also prevents developers from over-interpreting false precision.
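The boundary conversion is straightforward; a sketch (the function name is illustrative, not the actual TuneLab codebase):

```python
def to_api_score(raw: float) -> int:
    # Map an internal model output in [0.0, 1.0] to the 0-100
    # integer scale used at the API boundary: clamp, then round.
    return round(max(0.0, min(1.0, raw)) * 100)

print(to_api_score(0.8234167))  # → 82
print(to_api_score(1.07))       # → 100 (sigmoid heads rarely overflow, but clamp anyway)
print(to_api_score(-0.2))       # → 0
```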
Raw embedding access
Pro-tier customers can retrieve the raw 768-dimensional MAEST embedding via GET /v1/embedding/{id}. This unlocks a powerful pattern: train your own classifiers on top of our embeddings for custom taxonomies without ever touching an audio encoder.
Example use cases:
- A lo-fi / drill / phonk classifier for a playlist curation product.
- A decade classifier (60s / 70s / 80s / ...) trained on label metadata.
- A language detector for vocal music.
- A k-NN nearest-neighbor “more like this” index over your own catalog.
Training a 2-layer MLP on precomputed embeddings takes minutes on a laptop. You inherit all the benefits of the pretrained encoder — genre invariance, timbre robustness, noise suppression — without paying for GPU training or needing an ML engineer on staff.
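For example, a brute-force "more like this" lookup over precomputed embeddings can be sketched as follows. The 3-dimensional toy vectors stand in for real 768-dimensional embeddings, and the catalog IDs are invented:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def more_like_this(query_emb, catalog, k=2):
    # Brute-force k-NN; swap in an ANN index (FAISS, Annoy) at scale.
    ranked = sorted(catalog.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [track_id for track_id, _ in ranked[:k]]

catalog = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 0.0, 1.0],
}
print(more_like_this([1.0, 0.05, 0.0], catalog))  # → ['track_a', 'track_b']
```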
Known limitations
Honest trade-offs worth flagging:
- Cultural bias. The training data skews Western. Non-Western genres (Hindustani classical, gamelan, West African rhythms) may score accurately on some dimensions and wildly off on others.
- Instrumentalness can be fooled by vocoded vocals, heavily processed vocal samples, or vocals mixed very quietly underneath a dense instrumental bed.
- Happiness is correlated with mode. A minor-key cover of a “happy” song will score lower on happiness than the original — even if the arrangement is otherwise upbeat. This is a limitation of training labels, not the model.
- Danceability favors 4/4. Odd meters (7/8, 5/4) score lower even when they’re rhythmically compelling, because the training labels come from genres dominated by 4/4.
Code example
Request:

```bash
curl https://api.tunelab.dev/v1/analyze \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
```

Response:

```json
{
  "energy": 82,
  "danceability": 77,
  "happiness": 64,
  "acousticness": 12,
  "instrumentalness": 4,
  "speechiness": 3,
  "_meta": {
    "latency_ms": 214,
    "credits_used": 5,
    "credits_remaining": 995,
    "trace_id": "afa-7a2d9f"
  }
}
```
Further reading
- Alonso-Jiménez et al., Efficient supervised training of audio transformers for music representation learning (ISMIR 2023) — the paper behind the MAEST encoder.
- GET /v1/analyze — the unified analyze endpoint that returns all 6 mood dimensions in a single call.
- GET /v1/embedding/{id} — raw 768-dim embeddings for custom downstream classifiers.
- Song Structure — combine mood scores per segment for fine-grained timeline analytics.