
Key Detection

TuneLab uses KeyNet, a CNN trained on Constant-Q Transform (CQT) spectrograms, to estimate the musical key of a track. Unlike chroma template matching — fast but fragile against kick drums, vocals, and reverb — a deep CNN learns to ignore distractors and focus on true tonal content. Inference runs in roughly 70 ms per track on a single CPU core.

Why CQT, not STFT?

The Constant-Q Transform uses logarithmically spaced frequency bins aligned to the equal-temperament scale. Every octave spans the same number of bins, and every semitone occupies one bin. This matters because transposing a song by an octave — or by any fixed interval — becomes a pure vertical shift in the CQT. A CNN operating on CQT images can learn pitch-invariant features with ordinary 2D convolutions, which is exactly the right inductive bias for key estimation.
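The "shift is a bin shift" property can be checked numerically: with 12 bins per octave, multiplying any frequency by 2^(1/12) moves it up exactly one bin. A quick sketch in plain NumPy (an illustration, not TuneLab's actual feature pipeline):

```python
import numpy as np

def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=12):
    """Index of the CQT bin whose center is nearest to freq_hz.

    f_min = 32.70 Hz is C1; with 12 bins per octave, each bin is one
    semitone, so bin index is just the semitone distance from f_min.
    """
    return round(bins_per_octave * np.log2(freq_hz / f_min))

semitone = 2 ** (1 / 12)

# A semitone shift is a one-bin shift at ANY frequency...
assert cqt_bin(100.0 * semitone) == cqt_bin(100.0) + 1
assert cqt_bin(1000.0 * semitone) == cqt_bin(1000.0) + 1

# ...and an octave shift is always exactly 12 bins.
assert cqt_bin(440.0 * 2) == cqt_bin(440.0) + 12
```

The same one-semitone shift in an STFT would move a 100 Hz partial by a fraction of one linear bin and a 1 kHz partial by several, which is exactly the inconsistency described below.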

An STFT, by contrast, spaces frequency bins linearly. A shift of one semitone at 100 Hz looks nothing like the same semitone shift at 1 kHz, forcing the network to learn the pitch grid from scratch. That wastes capacity and generalizes poorly across timbres.

The KeyNet architecture

Because the time axis is collapsed by global average pooling, the model is robust to the exact length of the analysis window. Feeding a shorter clip (10–30 s) works fine; feeding a full track works too, just at higher latency.

Training data

KeyNet is trained on the GiantSteps-MTG Key dataset (604 annotated tracks) plus an internal corpus curated to balance the 24 classes — raw MIR datasets skew heavily toward C major, which bakes a bias directly into any naively trained classifier. We rebalance so each class sees roughly equal training examples.
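The rebalancing can be as simple as oversampling rare classes until all 24 appear equally often; the sketch below is one way to do it, an assumption rather than TuneLab's documented pipeline:

```python
import random
from collections import defaultdict

def rebalance(examples, seed=0):
    """Oversample so every class matches the size of the largest class.

    `examples` is a list of (track_id, key_label) pairs; rare classes
    are padded with random repeats drawn from their own examples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for track, label in examples:
        by_class[label].append((track, label))
    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

# A corpus skewed toward C major comes out even:
skewed = [("t1", "C major")] * 9 + [("t2", "F# minor")] * 3
balanced = rebalance(skewed)
assert sum(1 for _, lbl in balanced if lbl == "F# minor") == 9
```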

Augmentation during training:
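The augmentation list itself isn't reproduced here, but one augmentation that CQT-based key models commonly use — stated here as an assumption, not a documented detail — is pitch-shifting a clip by n semitones and rotating its key label to match. Label rotation only touches the pitch class, never the mode:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def shift_label(key, mode, semitones):
    """Rotate a key label to match a pitch-shifted training clip.

    Shifting audio up by n semitones moves the tonic up by n pitch
    classes; major stays major and minor stays minor.
    """
    idx = (PITCH_CLASSES.index(key) + semitones) % 12
    return PITCH_CLASSES[idx], mode

# Shifting an A-minor clip up 3 semitones makes it C minor.
assert shift_label("A", "minor", 3) == ("C", "minor")
```

On a CQT the audio-side shift is also cheap: it is approximately a vertical roll of the spectrogram, which is part of why the representation suits this kind of augmentation.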

Output format

The POST /v1/key endpoint returns four fields:

Field        Type     Description
key          string   Pitch class: one of C C# D D# E F F# G G# A A# B
mode         string   "major" or "minor"
camelot      string   Camelot wheel code, e.g. "8A" for A minor
confidence   float    Softmax probability of the predicted class, in [0.0, 1.0]

DJs almost always prefer Camelot notation. It encodes the circle of fifths as a clock face: adjacent numbers are a fifth apart (harmonically compatible), and the A/B suffix distinguishes minor (A) from major (B). Mixing between tracks sharing a Camelot code — or differing by ±1 number, or switching the A/B suffix — produces harmonically smooth transitions without thinking about accidentals.
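Those mixing rules are easy to encode. A sketch of a compatibility check (a hypothetical helper, not part of the TuneLab API), treating the 12 numbers as a clock so that 12 and 1 are adjacent:

```python
def camelot_compatible(a, b):
    """True if two Camelot codes mix harmonically.

    Compatible moves: same code, same number with the other letter,
    or the same letter one step around the wheel (mod 12).
    """
    num_a, letter_a = int(a[:-1]), a[-1]
    num_b, letter_b = int(b[:-1]), b[-1]
    if num_a == num_b:
        return True                                 # same number, any letter
    steps = (num_a - num_b) % 12
    return letter_a == letter_b and steps in (1, 11)  # +/-1 on the wheel

assert camelot_compatible("8A", "8B")    # relative major/minor
assert camelot_compatible("8A", "9A")    # one step around the wheel
assert camelot_compatible("12A", "1A")   # the wheel wraps around
assert not camelot_compatible("8A", "3A")
```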

Camelot Wheel mapping

A subset to illustrate the relationship between musical key and Camelot code:

Key   Mode    Camelot
A     minor   8A
C     major   8B
E     minor   9A
G     major   9B
D     minor   7A
F     major   7B

A complete Camelot chart covers all 24 keys — see the Mixed In Key harmonic mixing guide for the full wheel and mixing rules.
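The full wheel follows a regular pattern, so the 24-key chart can be generated rather than hard-coded: each +1 step on the wheel raises the tonic by a fifth (7 semitones). A sketch anchored at the table above (8A = A minor, 8B = C major):

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def camelot_to_key(code):
    """Map a Camelot code like '8A' to a (pitch_class, mode) pair.

    Each +1 step on the wheel is +7 semitones (a perfect fifth).
    Anchors: 8A = A minor, 8B = C major.
    """
    number, letter = int(code[:-1]), code[-1]
    anchor = 9 if letter == "A" else 0          # A = index 9, C = index 0
    pitch = (anchor + 7 * (number - 8)) % 12
    return PITCH_CLASSES[pitch], "minor" if letter == "A" else "major"

# Reproduces the subset in the table above:
assert camelot_to_key("8A") == ("A", "minor")
assert camelot_to_key("9B") == ("G", "major")
assert camelot_to_key("7A") == ("D", "minor")
# And the F# minor -> 11A example from the API response below:
assert camelot_to_key("11A") == ("F#", "minor")
```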

Accuracy

Benchmarked on the GiantSteps evaluation set (held out from training):

Metric             Score   Notes
Exact match        82.1%   Predicted key + mode match the reference exactly
Pitch class only   91.3%   Correct pitch class; mode may differ
Within a fifth     94.7%   Relative key, parallel key, or dominant (harmonically interchangeable)

For reference, Essentia's built-in tonal/key extractor (the open-source standard, chroma + template matching) scores around 72% exact match on the same benchmark. The CNN's gap comes from its ability to suppress percussion and vocal artifacts that confuse template-based methods.

Known limitations

Where KeyNet struggles:

Low confidence? Treat it as a signal.
If confidence is below 0.5, the track is likely atonal, modulating, or ambiguous. Surface the raw value to end users rather than hiding it — DJ apps in particular benefit from flagging “uncertain key” rather than silently committing to a wrong answer.
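In client code that gating is a single comparison on the response body. A sketch assuming the JSON shape from the code example below (the function name and label format are mine):

```python
UNCERTAIN_THRESHOLD = 0.5  # below this, treat the predicted key as unreliable

def display_key(response):
    """Format a /v1/key response for a DJ-facing UI.

    Low-confidence results are flagged rather than hidden, per the
    guidance above.
    """
    label = f'{response["key"]} {response["mode"]} ({response["camelot"]})'
    if response["confidence"] < UNCERTAIN_THRESHOLD:
        return f"{label} (uncertain)"
    return label

assert display_key({"key": "F#", "mode": "minor",
                    "camelot": "11A", "confidence": 0.94}) == "F# minor (11A)"
assert "(uncertain)" in display_key({"key": "C", "mode": "major",
                                     "camelot": "8B", "confidence": 0.31})
```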

Code example

cURL
curl https://api.tunelab.dev/v1/key \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
response
{
  "key": "F#",
  "mode": "minor",
  "camelot": "11A",
  "confidence": 0.94,
  "_meta": {
    "latency_ms": 72,
    "credits_used": 2,
    "credits_remaining": 998,
    "trace_id": "afa-9c1d2e"
  }
}

Further reading