Mood Classification
Mood and energy estimation is hard because there is no ground truth — “energy 0.75” is a subjective label, not a physical measurement. TuneLab uses MAEST (Music Audio Efficient Spectrogram Transformer) to produce rich 768-dimensional embeddings, then trains small MLP heads on top for each mood dimension. The result: six independent scores with known, published accuracy — not a black box regression with mystery numbers.
Why embeddings-first?
Training a separate CNN end-to-end for every mood feature is expensive, wasteful, and hard to maintain. The modern approach — proven across NLP and now standard in audio — is to pre-train one large foundation model on a massive corpus, freeze its embeddings, and fit lightweight heads on top for each downstream task.
Concretely, this gives us:
- A single encoder forward pass amortized across all 6 heads (plus similarity search, plus genre, plus future tasks).
- Heads that are 2–3 layer MLPs, trainable in minutes on a laptop. Easy to retrain when new labeled data arrives.
- A stable embedding space that lets Pro customers train their own classifiers for custom taxonomies (see below).
- Deterministic, reproducible pipelines — the embedding is a content hash of the audio, the heads are pure functions on top.
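The amortization pattern above can be sketched as follows. The stub encoder and the uniform head weights are placeholders for illustration, not TuneLab internals:

```python
import random

EMB_DIM = 768

def frozen_encoder(audio_path):
    # Stand-in for the frozen MAEST forward pass: one deterministic
    # 768-dim vector per track. NOT the real model.
    rng = random.Random(sum(map(ord, audio_path)))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def head(embedding, weight, bias):
    # Trivial linear head; the production heads are 2-3 layer MLPs.
    return sum(w * x for w, x in zip(weight, embedding)) + bias

HEAD_NAMES = ("energy", "danceability", "happiness",
              "acousticness", "instrumentalness", "speechiness")
heads = {name: ([0.01] * EMB_DIM, 0.0) for name in HEAD_NAMES}

# One encoder pass per track, amortized across every head.
emb = frozen_encoder("track.mp3")
scores = {name: head(emb, w, b) for name, (w, b) in heads.items()}
assert len(scores) == 6
```

The key property is that adding a seventh head costs one more cheap dot product, not another encoder pass.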
The MAEST encoder
- Architecture: Transformer pre-trained on ~5M music tracks with supervised training on style labels (see the MAEST paper under Further reading).
- Input: log-mel spectrogram, 128 mel bins × 30 seconds of audio.
- Output: a single 768-dimensional embedding per track (pooled from the encoder’s intermediate layer — layer 6 specifically, which empirically gives the best transfer performance).
- Inference: ~200 ms per track on CPU, no GPU needed.
- Size: 40M parameters — an order of magnitude smaller than MuLan or AudioMAE, but task-appropriate and efficient to serve at scale.
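The pooling step can be illustrated with toy numbers: mean-pooling a sequence of per-frame token vectors from an intermediate layer down to one 768-dimensional track embedding. The token count of 4 here is arbitrary; real sequences are much longer:

```python
# Mean-pool T token embeddings (T x 768) from an intermediate
# transformer layer into a single 768-dim track embedding.
T, D = 4, 768
tokens = [[float(t + d % 3) for d in range(D)] for t in range(T)]

pooled = [sum(tok[d] for tok in tokens) / T for d in range(D)]
assert len(pooled) == D
print(pooled[:3])  # → [1.5, 2.5, 3.5]
```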
The 6 heads
Each dimension is trained independently, with its own dataset and loss function:
| Dimension | Range | Training data | What it measures |
|---|---|---|---|
| energy | 0–100 | Internal 114K track corpus | Overall intensity — loud, fast, dense arrangements score high |
| danceability | 0–100 | Spotify-adjacent labels (external) | Rhythmic regularity + beat strength + tempo stability |
| happiness | 0–100 | MTG-Jamendo mood tags | Valence — major key bias, bright timbre, uptempo |
| acousticness | 0–100 | Internal corpus | Proportion of acoustic vs electronic instrumentation |
| instrumentalness | 0–100 | MIR vocal detection | Probability that the track contains no vocals |
| speechiness | 0–100 | GTZAN + podcast audio | Probability of spoken word content (not singing) |
Training methodology
Each head is a small 2-layer MLP: 768 → 256 → 1, with ReLU in the middle and a sigmoid on the output. Loss depends on the label type:
- MSE for continuous regression targets (energy, danceability, happiness, acousticness).
- Binary cross-entropy for binary-ish targets (instrumentalness, speechiness), which behave more like classification than regression.
Training runs for ~50 epochs with early stopping on validation Pearson correlation. Frozen embeddings mean each head fits in seconds; the expensive part is precomputing the 40M-parameter encoder pass on the training set once.
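A minimal sketch of one head's forward pass and the two loss functions, using toy dimensions and stdlib Python only. The production heads are 768 → 256 → 1 and would typically be trained in an ML framework; weights here are random placeholders:

```python
import math, random

random.seed(1)

IN, HID = 8, 4  # toy sizes; production heads are 768 -> 256 -> 1

W1 = [[random.uniform(-0.5, 0.5) for _ in range(IN)] for _ in range(HID)]
b1 = [0.0] * HID
W2 = [random.uniform(-0.5, 0.5) for _ in range(HID)]
b2 = 0.0

def head(x):
    # Hidden layer with ReLU, then a scalar output through a sigmoid.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def mse(pred, target):
    # Continuous dims: energy, danceability, happiness, acousticness.
    return (pred - target) ** 2

def bce(pred, target):
    # Binary-ish dims: instrumentalness, speechiness.
    eps = 1e-9
    return -(target * math.log(pred + eps)
             + (1 - target) * math.log(1 - pred + eps))

x = [random.uniform(-1.0, 1.0) for _ in range(IN)]
p = head(x)
assert 0.0 < p < 1.0
```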
Validation scores
Pearson correlation against human-labeled ground truth on a held-out 10K track set. Higher is better; values above ~0.7 indicate the head is useful in practice for ranking and filtering.
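For reference, the validation metric itself, sketched in plain Python on made-up score pairs:

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation between predictions and labels.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

preds  = [10, 35, 40, 70, 90]   # model scores (toy data)
labels = [12, 30, 50, 65, 95]   # human labels (toy data)
print(round(pearson_r(preds, labels), 3))  # → 0.979
```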
| Dimension | Pearson r | Ship? |
|---|---|---|
| energy | 0.812 | Yes |
| danceability | 0.776 | Yes |
| happiness | 0.703 | Yes |
| acousticness | 0.854 | Yes |
| instrumentalness | 0.791 | Yes |
| speechiness | 0.723 | Yes |
| liveness | 0.448 | Dropped in v1 |
A Pearson correlation of 0.448 is well below our ~0.7 usefulness threshold: too weak to support reliable ranking or filtering. We believe liveness needs full-track audio (not 30 s) plus a binary classification framing rather than regression. It is on the v1.1 backlog.
Integer 0–100, not float 0.0–1.0
The API returns mood scores as integers in the range 0–100. Internally the models produce floats, but we round at the API boundary for two reasons:
- Ordinal, not interval. Mood scores are fundamentally ranking-like — there’s no meaningful difference between 0.734 and 0.739. Integer precision matches what the data actually supports.
- Human-readable. An energy of 82 is more intuitive to developers building UIs than 0.8234167, and it cuts the JSON payload size. It also prevents developers from over-interpreting false precision.
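The boundary conversion is straightforward; a sketch (the function name is illustrative, not the actual TuneLab codebase):

```python
def to_api_score(raw: float) -> int:
    # Map an internal model output in [0.0, 1.0] to the 0-100
    # integer scale used at the API boundary: clamp, then round.
    return round(max(0.0, min(1.0, raw)) * 100)

print(to_api_score(0.8234167))  # → 82
print(to_api_score(1.07))       # → 100 (sigmoid heads rarely overflow, but clamp anyway)
print(to_api_score(-0.2))       # → 0
```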
Raw embedding access
Pro-tier customers can retrieve the raw 768-dimensional MAEST embedding via GET /v1/embedding/{id}. This unlocks a powerful pattern: train your own classifiers on top of our embeddings for custom taxonomies without ever touching an audio encoder.
Example use cases:
- A lo-fi / drill / phonk classifier for a playlist curation product.
- A decade classifier (60s / 70s / 80s / ...) trained on label metadata.
- A language detector for vocal music.
- A k-NN nearest-neighbor “more like this” index over your own catalog.
Training a 2-layer MLP on precomputed embeddings takes minutes on a laptop. You inherit all the benefits of the pretrained encoder — genre invariance, timbre robustness, noise suppression — without paying for GPU training or needing an ML engineer on staff.
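For example, a brute-force "more like this" lookup over precomputed embeddings can be sketched as follows. The 3-dimensional toy vectors stand in for real 768-dimensional embeddings, and the catalog IDs are invented:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def more_like_this(query_emb, catalog, k=2):
    # Brute-force k-NN; swap in an ANN index (FAISS, Annoy) at scale.
    ranked = sorted(catalog.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [track_id for track_id, _ in ranked[:k]]

catalog = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.9, 0.1, 0.0],
    "track_c": [0.0, 0.0, 1.0],
}
print(more_like_this([1.0, 0.05, 0.0], catalog))  # → ['track_a', 'track_b']
```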
Known limitations
Honest trade-offs worth flagging:
- Cultural bias. The training data skews Western. Non-Western genres (Hindustani classical, gamelan, West African rhythms) may score accurately on some dimensions and wildly off on others.
- Instrumentalness can be fooled by vocoded vocals, heavily processed vocal samples, or vocals mixed very quietly underneath a dense instrumental bed.
- Happiness is correlated with mode. A minor-key cover of a “happy” song will score lower on happiness than the original — even if the arrangement is otherwise upbeat. This is a limitation of training labels, not the model.
- Danceability favors 4/4. Odd meters (7/8, 5/4) score lower even when they’re rhythmically compelling, because the training labels come from genres dominated by 4/4.
Code example
Request:

```bash
curl https://api.tunelab.dev/v1/analyze \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
```

Response:

```json
{
  "energy": 82,
  "danceability": 77,
  "happiness": 64,
  "acousticness": 12,
  "instrumentalness": 4,
  "speechiness": 3,
  "_meta": {
    "latency_ms": 214,
    "credits_used": 5,
    "credits_remaining": 995,
    "trace_id": "afa-7a2d9f"
  }
}
```
Further reading
- Alonso-Jiménez et al., Efficient supervised training of audio transformers for music representation learning (ISMIR 2023) — the paper behind the MAEST encoder.
- GET /v1/analyze — the unified analyze endpoint that returns all 6 mood dimensions in a single call.
- GET /v1/embedding/{id} — raw 768-dim embeddings for custom downstream classifiers.
- Song Structure — combine mood scores per segment for fine-grained timeline analytics.