Key Detection
TuneLab uses KeyNet, a CNN trained on Constant-Q Transform (CQT) spectrograms, to estimate the musical key of a track. Unlike chroma template matching — fast but fragile against kick drums, vocals, and reverb — a deep CNN learns to ignore distractors and focus on true tonal content. Inference runs in roughly 70 ms per track on a single CPU core.
Why CQT, not STFT?
The Constant-Q Transform uses logarithmically-spaced frequency bins aligned to the equal-temperament scale. Every octave spans the same number of bins, and every semitone occupies one bin. This matters because transposing a song by an octave — or by any fixed interval — becomes a pure vertical shift in the CQT. A CNN operating on CQT images can learn pitch-invariant features with ordinary 2D convolutions, which is exactly the right inductive bias for key estimation.
An STFT, by contrast, spaces frequency bins linearly. A shift of one semitone at 100 Hz looks nothing like the same semitone shift at 1 kHz, forcing the network to learn the pitch grid from scratch. That wastes capacity and generalizes poorly across timbres.
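A quick numeric sketch of why the CQT grid turns transposition into a shift: bin center frequencies follow fmin · 2^(k/12), so multiplying every frequency by the semitone ratio moves each one exactly one bin up. This is illustrative NumPy only, not TuneLab's front end; the C1 starting frequency is an assumption matching the 84-bin input described below.

```python
import numpy as np

# Center frequencies of an 84-bin, 12-bins-per-octave CQT starting at C1 (~32.70 Hz).
fmin = 32.70
freqs = fmin * 2.0 ** (np.arange(84) / 12.0)

# Transposing up one semitone multiplies every frequency by 2^(1/12)...
semitone = 2.0 ** (1.0 / 12.0)

# ...which lands each bin exactly on its neighbor: a pure vertical shift.
assert np.allclose(freqs[:-1] * semitone, freqs[1:])

# One octave (12 bins) doubles the frequency, so every octave spans the
# same number of bins. In an STFT's linear grid, a semitone covers a
# different number of bins at 100 Hz than at 1 kHz, so no single shift aligns them.
assert abs(freqs[12] / freqs[0] - 2.0) < 1e-9
```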
The KeyNet architecture
- Input: CQT spectrogram, 84 bins (7 octaves × 12 semitones) × ~500 frames covering the first 30 seconds of audio.
- Backbone: 4 convolutional blocks, each Conv2D → BatchNorm → ReLU → MaxPool.
- Pooling: global average pooling across the time axis.
- Head: a single dense layer with 24 outputs and softmax activation.
- Classes: 24 = 12 pitch classes × 2 modes (major / minor).
- Parameters: ~1.2M — small enough for fast inference, large enough to survive real-world production noise.
Because the time axis is collapsed by global average pooling, the model is robust to the exact length of the analysis window. Feeding a shorter clip (10–30 s) works fine; feeding a full track works too, just at higher latency.
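The length-robustness of global average pooling can be sketched in a few lines. The channel and frequency shapes below are hypothetical stand-ins for whatever the conv blocks actually produce, not KeyNet's real dimensions:

```python
import numpy as np

def global_average_pool_time(feature_map: np.ndarray) -> np.ndarray:
    """Collapse the time axis of a (channels, freq, time) feature map.

    After pooling, the feature size no longer depends on how many frames
    the clip contained -- which is why the model accepts clips of any length.
    """
    return feature_map.mean(axis=-1)

short_clip = np.random.rand(64, 10, 120)   # a short analysis window (shapes hypothetical)
full_track = np.random.rand(64, 10, 4000)  # a full track

# Both collapse to the same fixed shape, ready for the 24-way dense head.
assert global_average_pool_time(short_clip).shape == (64, 10)
assert global_average_pool_time(full_track).shape == (64, 10)
```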
Training data
KeyNet is trained on the GiantSteps-MTG Key dataset (604 annotated tracks) plus an internal corpus curated to balance the 24 classes — raw MIR datasets skew heavily toward C major, which bakes a bias directly into any naively trained classifier. We rebalance so each class sees roughly equal training examples.
Augmentation during training:
- Pitch shift (±2 semitones, with label adjusted accordingly) — expands the effective training set up to fivefold (four shifted variants plus the original) and teaches pitch equivariance.
- Time stretch (±5%) — invariance to tempo is important because tonal content shouldn't depend on BPM.
- Random EQ — random parametric EQ curves simulate different mastering decisions and playback systems.
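Pitch-shift augmentation only works if the label moves with the audio. A minimal sketch of the label adjustment, with pitch classes encoded as integers 0–11 (C = 0); the encoding and function name are illustrative, not TuneLab's internals:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def shift_label(pitch_class: int, semitones: int) -> int:
    """Return the pitch class after shifting the audio by `semitones`.

    The mode (major/minor) is unchanged: transposition preserves it.
    """
    return (pitch_class + semitones) % 12

# An A-minor track shifted up 2 semitones is relabeled B minor.
assert PITCH_CLASSES[shift_label(PITCH_CLASSES.index("A"), +2)] == "B"
# A C-major track shifted down 1 semitone wraps around to B major.
assert PITCH_CLASSES[shift_label(PITCH_CLASSES.index("C"), -1)] == "B"
```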
Output format
The POST /v1/key endpoint returns four documented fields (plus a diagnostic `_meta` object, shown in the code example below):
| Field | Type | Description |
|---|---|---|
| key | string | Pitch class: one of C, C#, D, D#, E, F, F#, G, G#, A, A#, B |
| mode | string | "major" or "minor" |
| camelot | string | Camelot wheel code, e.g. "8A" for A minor |
| confidence | float | Softmax probability of the predicted class, in [0.0, 1.0] |
DJs almost always prefer Camelot notation. It encodes the circle of fifths as a clock face: adjacent numbers are a fifth apart (harmonically compatible), and the A/B suffix distinguishes minor (A) from major (B). Mixing between tracks sharing a Camelot code — or differing by ±1 number, or switching the A/B suffix — produces harmonically smooth transitions without thinking about accidentals.
Camelot Wheel mapping
A subset to illustrate the relationship between musical key and Camelot code:
| Key | Mode | Camelot |
|---|---|---|
| A | minor | 8A |
| C | major | 8B |
| E | minor | 9A |
| G | major | 9B |
| D | minor | 7A |
| F | major | 7B |
A complete Camelot chart covers all 24 keys — see the Mixed In Key harmonic mixing guide for the full wheel and mixing rules.
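Because the Camelot wheel is just the circle of fifths relabeled, the key/mode → code mapping can be computed rather than hard-coded. A sketch in Python; the closed-form offsets below are derived from the wheel itself, not taken from TuneLab's implementation:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def camelot(key: str, mode: str) -> str:
    """Map a key like ("A", "minor") to its Camelot code, e.g. "8A".

    Multiplying the pitch class by 7 (a fifth) mod 12 walks the circle of
    fifths, so adjacent Camelot numbers end up a fifth apart; minor keys
    take suffix A, major keys B.
    """
    pc = PITCH_CLASSES.index(key)
    fifths = (pc * 7) % 12
    offset = 5 if mode == "minor" else 8  # anchors A minor -> 8A, C major -> 8B
    number = (fifths + offset) % 12 or 12
    return f"{number}{'A' if mode == 'minor' else 'B'}"

# Matches the table above, and relative keys (A minor / C major) share a number.
assert camelot("A", "minor") == "8A" and camelot("C", "major") == "8B"
assert camelot("E", "minor") == "9A" and camelot("G", "major") == "9B"
assert camelot("F#", "minor") == "11A"
```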
Accuracy
Benchmarked on the GiantSteps evaluation set (held out from training):
| Metric | Score | Notes |
|---|---|---|
| Exact match | 82.1% | Predicted key + mode match reference exactly |
| Pitch class only | 91.3% | Correct pitch class; mode may differ |
| Within a fifth | 94.7% | Relative key, parallel key, or dominant — harmonically interchangeable |
For reference, Essentia's built-in tonal/key extractor (the open-source standard, chroma + template matching) scores around 72% exact match on the same benchmark. The CNN's gap comes from its ability to suppress percussion and vocal artifacts that confuse template-based methods.
Known limitations
Where KeyNet struggles:
- Atonal music — classical modernism, noise, ambient drones. No fixed tonal center means no correct answer; confidence will be low.
- Modulating tracks — songs that legitimately change key mid-way. The model reports the dominant key across the analysis window.
- Single-instrument recordings — solo piano, a cappella. Fewer harmonic cues mean higher variance, particularly between relative major / minor.
- Very short clips — under 10 seconds. Not enough data for robust estimation.
If confidence is below 0.5, the track is likely atonal, modulating, or ambiguous. Surface the raw value to end users rather than hiding it — DJ apps in particular benefit from flagging “uncertain key” rather than silently committing to a wrong answer.
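A sketch of the recommended handling; the 0.5 threshold comes from the guidance above, the field names from the response schema, and the formatting function itself is hypothetical:

```python
def display_key(result: dict, threshold: float = 0.5) -> str:
    """Format a /v1/key response for a DJ-facing UI, surfacing uncertainty."""
    label = f"{result['camelot']} ({result['key']} {result['mode']})"
    if result["confidence"] < threshold:
        # Likely atonal, modulating, or ambiguous: flag rather than hide.
        return f"{label} [uncertain: {result['confidence']:.0%}]"
    return label

assert display_key({"key": "F#", "mode": "minor", "camelot": "11A",
                    "confidence": 0.94}) == "11A (F# minor)"
assert "uncertain" in display_key({"key": "C", "mode": "major",
                                   "camelot": "8B", "confidence": 0.31})
```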
Code example
```
curl https://api.tunelab.dev/v1/key \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
```

Response:

```
{
  "key": "F#",
  "mode": "minor",
  "camelot": "11A",
  "confidence": 0.94,
  "_meta": {
    "latency_ms": 72,
    "credits_used": 2,
    "credits_remaining": 998,
    "trace_id": "afa-9c1d2e"
  }
}
```
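Client code should key off the four documented fields and treat `_meta` as diagnostic. A stdlib-only sketch that parses the sample response above (the embedded JSON is copied from it, not fetched live):

```python
import json

# Sample response from the curl example above.
sample = """{
  "key": "F#", "mode": "minor", "camelot": "11A", "confidence": 0.94,
  "_meta": {"latency_ms": 72, "credits_used": 2,
            "credits_remaining": 998, "trace_id": "afa-9c1d2e"}
}"""

result = json.loads(sample)

# The documented fields drive application logic; _meta is for debugging/billing.
key_label = f"{result['key']} {result['mode']} ({result['camelot']})"
assert key_label == "F# minor (11A)"
assert 0.0 <= result["confidence"] <= 1.0
```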
Further reading
- Korzeniowski & Widmer, End-to-End Musical Key Estimation Using a Convolutional Neural Network (EUSIPCO 2017) — the CQT-CNN baseline KeyNet builds on.
- POST /v1/key — full API endpoint reference, including file-upload vs cached lookup modes.
- Mixed In Key — Harmonic Mixing Guide — the definitive explainer on Camelot wheel DJ workflows.
- BPM Detection — paired with key, the two features every DJ app starts with.