
Mood Classification

Mood and energy estimation is hard because there is no ground truth — “energy 0.75” is a subjective label, not a physical measurement. TuneLab uses MAEST (Music Audio Efficient Spectrogram Transformer) to produce rich 768-dimensional embeddings, then trains small MLP heads on top for each mood dimension. The result: six independent scores with known, published accuracy — not a black box regression with mystery numbers.

Why embeddings-first?

Training a separate CNN end-to-end for every mood feature is expensive, wasteful, and hard to maintain. The modern approach — proven across NLP and now standard in audio — is to pre-train one large foundation model on a massive corpus, freeze its embeddings, and fit lightweight heads on top for each downstream task.

Concretely, this splits the pipeline into two pieces:

The MAEST encoder

The 6 heads

Each dimension is trained independently, with its own dataset and loss function:

| Dimension | Range | Training data | What it measures |
| --- | --- | --- | --- |
| energy | 0–100 | Internal 114K track corpus | Overall intensity — loud, fast, dense arrangements score high |
| danceability | 0–100 | Spotify-adjacent labels (external) | Rhythmic regularity + beat strength + tempo stability |
| happiness | 0–100 | MTG-Jamendo mood tags | Valence — major key bias, bright timbre, uptempo |
| acousticness | 0–100 | Internal corpus | Proportion of acoustic vs electronic instrumentation |
| instrumentalness | 0–100 | MIR vocal detection | Probability that the track contains no vocals |
| speechiness | 0–100 | GTZAN + podcast audio | Probability of spoken word content (not singing) |

Training methodology

Each head is a small 2-layer MLP: 768 → 256 → 1, with ReLU in the middle and a sigmoid on the output. Loss depends on the label type:

Training runs for ~50 epochs with early stopping on validation Pearson correlation. Frozen embeddings mean each head fits in seconds; the expensive part is precomputing the 40M-parameter encoder pass on the training set once.
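The head architecture above can be sketched in a few lines. This is an illustrative forward pass only — the weights and labels here are random placeholders, not the production head, and the real training loop (optimizer, early stopping) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen MAEST embeddings for a toy batch (precomputed once up front).
X = rng.standard_normal((32, 768)).astype(np.float32)
y = rng.uniform(0.0, 1.0, size=(32, 1)).astype(np.float32)  # labels scaled to [0, 1]

# 2-layer MLP head: 768 -> 256 -> 1, ReLU in the middle, sigmoid on the output.
W1 = 0.02 * rng.standard_normal((768, 256)).astype(np.float32)
b1 = np.zeros(256, dtype=np.float32)
W2 = 0.02 * rng.standard_normal((256, 1)).astype(np.float32)
b2 = np.zeros(1, dtype=np.float32)

def head(x):
    h = np.maximum(x @ W1 + b1, 0.0)               # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid -> (0, 1)

pred = head(X)
mse = float(np.mean((pred - y) ** 2))  # e.g. MSE for regression-style labels
print(pred.shape, mse)
```

Because the 768-d embeddings are frozen, only the roughly 200K head parameters ever receive gradients, which is why each head trains in seconds.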

Validation scores

Pearson correlation against human-labeled ground truth on a held-out 10K track set. Higher is better; values above ~0.7 indicate the head is useful in practice for ranking and filtering.
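The validation metric itself is straightforward to reproduce. A minimal sketch with synthetic numbers (the labels and noise level below are made up, not our evaluation data):

```python
import numpy as np

rng = np.random.default_rng(1)
truth = rng.uniform(0, 100, size=1000)        # stand-in for human labels
pred = truth + rng.normal(0, 25, size=1000)   # stand-in for noisy model predictions

# Pearson r: off-diagonal entry of the 2x2 correlation matrix.
r = float(np.corrcoef(pred, truth)[0, 1])
print(round(r, 3))
```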

| Dimension | Pearson r | Ship? |
| --- | --- | --- |
| energy | 0.812 | Yes |
| danceability | 0.776 | Yes |
| happiness | 0.703 | Yes |
| acousticness | 0.854 | Yes |
| instrumentalness | 0.791 | Yes |
| speechiness | 0.723 | Yes |
| liveness | 0.448 | Dropped in v1 |
Why we dropped liveness.
A Pearson correlation of 0.448 is too weak for reliable ranking or filtering, especially at the tails of the distribution where the score matters most. We believe liveness needs full-track audio (not 30-second clips) plus a binary classification framing rather than regression. It is on the v1.1 backlog.

Integer 0–100, not float 0.0–1.0

The API returns mood scores as integers in the range 0–100. Internally the models produce floats, but we round at the API boundary for two reasons:
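The conversion at the boundary amounts to a scale, round, and clamp. A minimal sketch (`to_api_score` is a hypothetical helper name, not part of our SDK):

```python
def to_api_score(raw: float) -> int:
    """Map a model output in [0, 1] to the integer 0-100 API scale."""
    return max(0, min(100, round(raw * 100)))

print(to_api_score(0.823))  # 82
```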

Raw embedding access

Pro-tier customers can retrieve the raw 768-dimensional MAEST embedding via GET /v1/embedding/{id}. This unlocks a powerful pattern: train your own classifiers on top of our embeddings for custom taxonomies without ever touching an audio encoder.

Example use cases:

Training a 2-layer MLP on precomputed embeddings takes minutes on a laptop. You inherit all the benefits of the pretrained encoder — genre invariance, timbre robustness, noise suppression — without paying for GPU training or needing an ML engineer on staff.
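As a sketch of that pattern, here is a logistic-regression head trained on precomputed embeddings for a hypothetical binary taxonomy. The vectors below are synthetic stand-ins for what `GET /v1/embedding/{id}` would return, and the separation between classes is artificially easy:

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these 768-d vectors were fetched from the embedding endpoint.
pos = rng.standard_normal((200, 768)) + 0.5   # e.g. tracks tagged with your label
neg = rng.standard_normal((200, 768)) - 0.5   # tracks without it
X = np.vstack([pos, neg])
y = np.array([1.0] * 200 + [0.0] * 200)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Plain gradient descent on a logistic-regression head.
w = np.zeros(768)
b = 0.0
for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

acc = float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1.0)))
print(acc)
```

With real embeddings you would hold out a validation split and pick a threshold; the point is that no audio processing or GPU is involved on your side.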

Known limitations

Honest trade-offs worth flagging:

Code example

cURL
curl https://api.tunelab.dev/v1/analyze \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
response
{
  "energy":           82,
  "danceability":     77,
  "happiness":        64,
  "acousticness":     12,
  "instrumentalness":  4,
  "speechiness":       3,
  "_meta": {
    "latency_ms":   214,
    "credits_used": 5,
    "credits_remaining": 995,
    "trace_id":     "afa-7a2d9f"
  }
}
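Consuming the response is a single JSON parse; the `_meta` block is the only non-score key. A sketch with the sample body above inlined for illustration:

```python
import json

# The sample response from above, inlined rather than fetched.
body = """
{
  "energy": 82, "danceability": 77, "happiness": 64,
  "acousticness": 12, "instrumentalness": 4, "speechiness": 3,
  "_meta": {"latency_ms": 214, "credits_used": 5,
            "credits_remaining": 995, "trace_id": "afa-7a2d9f"}
}
"""
data = json.loads(body)

# Split mood scores from request metadata.
mood = {k: v for k, v in data.items() if not k.startswith("_")}
print(mood["energy"], data["_meta"]["credits_remaining"])
```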

Further reading