Key Detection
TuneLab uses KeyNet, a CNN trained on Constant-Q Transform (CQT) spectrograms, to estimate the musical key of a track. Unlike chroma template matching — fast but fragile against kick drums, vocals, and reverb — a deep CNN learns to ignore distractors and focus on true tonal content. Inference runs in roughly 70 ms per track on a single CPU core.
Why CQT, not STFT?
The Constant-Q Transform uses logarithmically-spaced frequency bins aligned to the equal-temperament scale. Every octave spans the same number of bins, and every semitone occupies one bin. This matters because transposing a song by an octave — or by any fixed interval — becomes a pure vertical shift in the CQT. A CNN operating on CQT images can learn pitch-invariant features with ordinary 2D convolutions, which is exactly the right inductive bias for key estimation.
An STFT, by contrast, spaces frequency bins linearly. A shift of one semitone at 100 Hz looks nothing like the same semitone shift at 1 kHz, forcing the network to learn the pitch grid from scratch. That wastes capacity and generalizes poorly across timbres.
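A quick numeric sketch of why the CQT grid turns transposition into a shift: bin center frequencies follow fmin · 2^(k/12), so multiplying every frequency by the semitone ratio moves each one exactly one bin up. This is illustrative NumPy only, not TuneLab's front end; the C1 starting frequency is an assumption matching the 84-bin input described below.

```python
import numpy as np

# Center frequencies of an 84-bin, 12-bins-per-octave CQT starting at C1 (~32.70 Hz).
fmin = 32.70
freqs = fmin * 2.0 ** (np.arange(84) / 12.0)

# Transposing up one semitone multiplies every frequency by 2^(1/12)...
semitone = 2.0 ** (1.0 / 12.0)

# ...which lands each bin exactly on its neighbor: a pure vertical shift.
assert np.allclose(freqs[:-1] * semitone, freqs[1:])

# One octave (12 bins) doubles the frequency, so every octave spans the
# same number of bins. In an STFT's linear grid, a semitone covers a
# different number of bins at 100 Hz than at 1 kHz, so no single shift aligns them.
assert abs(freqs[12] / freqs[0] - 2.0) < 1e-9
```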
The KeyNet architecture
- Input: CQT spectrogram, 84 bins (7 octaves × 12 semitones) × ~500 frames covering the first 30 seconds of audio.
- Backbone: 4 convolutional blocks, each Conv2D → BatchNorm → ReLU → MaxPool.
- Pooling: global average pooling across the time axis.
- Head: a single dense layer with 24 outputs and softmax activation.
- Classes: 24 = 12 pitch classes × 2 modes (major / minor).
- Parameters: ~1.2M — small enough for fast inference, large enough to survive real-world production noise.
Because the time axis is collapsed by global average pooling, the model is robust to the exact length of the analysis window. Feeding a shorter clip (10–30 s) works fine; feeding a full track works too, just at higher latency.
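The length-robustness of global average pooling can be sketched in a few lines. The channel and frequency shapes below are hypothetical stand-ins for whatever the conv blocks actually produce, not KeyNet's real dimensions:

```python
import numpy as np

def global_average_pool_time(feature_map: np.ndarray) -> np.ndarray:
    """Collapse the time axis of a (channels, freq, time) feature map.

    After pooling, the feature size no longer depends on how many frames
    the clip contained -- which is why the model accepts clips of any length.
    """
    return feature_map.mean(axis=-1)

short_clip = np.random.rand(64, 10, 120)   # a short analysis window (shapes hypothetical)
full_track = np.random.rand(64, 10, 4000)  # a full track

# Both collapse to the same fixed shape, ready for the 24-way dense head.
assert global_average_pool_time(short_clip).shape == (64, 10)
assert global_average_pool_time(full_track).shape == (64, 10)
```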
Training data
KeyNet is trained on the GiantSteps-MTG Key dataset (604 annotated tracks) plus an internal corpus curated to balance the 24 classes — raw MIR datasets skew heavily toward C major, which bakes a bias directly into any naively trained classifier. We rebalance so each class sees roughly equal training examples.
Augmentation during training:
- Pitch shift (±2 semitones, with label adjusted accordingly) — expands the effective training set up to fivefold (four shifted variants plus the original) and teaches pitch equivariance.
- Time stretch (±5%) — invariance to tempo is important because tonal content shouldn't depend on BPM.
- Random EQ — random parametric EQ curves simulate different mastering decisions and playback systems.
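Pitch-shift augmentation only works if the label moves with the audio. A minimal sketch of the label adjustment, with pitch classes encoded as integers 0–11 (C = 0); the encoding and function name are illustrative, not TuneLab's internals:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def shift_label(pitch_class: int, semitones: int) -> int:
    """Return the pitch class after shifting the audio by `semitones`.

    The mode (major/minor) is unchanged: transposition preserves it.
    """
    return (pitch_class + semitones) % 12

# An A-minor track shifted up 2 semitones is relabeled B minor.
assert PITCH_CLASSES[shift_label(PITCH_CLASSES.index("A"), +2)] == "B"
# A C-major track shifted down 1 semitone wraps around to B major.
assert PITCH_CLASSES[shift_label(PITCH_CLASSES.index("C"), -1)] == "B"
```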
Output format
The POST /v1/key endpoint returns four documented fields (plus a diagnostic `_meta` object, shown in the code example below):
| Field | Type | Description |
|---|---|---|
| key | string | Pitch class: one of C, C#, D, D#, E, F, F#, G, G#, A, A#, B |
| mode | string | "major" or "minor" |
| camelot | string | Camelot wheel code, e.g. "8A" for A minor |
| confidence | float | Softmax probability of the predicted class, in [0.0, 1.0] |
DJs almost always prefer Camelot notation. It encodes the circle of fifths as a clock face: adjacent numbers are a fifth apart (harmonically compatible), and the A/B suffix distinguishes minor (A) from major (B). Mixing between tracks sharing a Camelot code — or differing by ±1 number, or switching the A/B suffix — produces harmonically smooth transitions without thinking about accidentals.
Camelot Wheel mapping
A subset to illustrate the relationship between musical key and Camelot code:
| Key | Mode | Camelot |
|---|---|---|
| A | minor | 8A |
| C | major | 8B |
| E | minor | 9A |
| G | major | 9B |
| D | minor | 7A |
| F | major | 7B |
A complete Camelot chart covers all 24 keys — see the Mixed In Key harmonic mixing guide for the full wheel and mixing rules.
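Because the Camelot wheel is just the circle of fifths relabeled, the key/mode → code mapping can be computed rather than hard-coded. A sketch in Python; the closed-form offsets below are derived from the wheel itself, not taken from TuneLab's implementation:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def camelot(key: str, mode: str) -> str:
    """Map a key like ("A", "minor") to its Camelot code, e.g. "8A".

    Multiplying the pitch class by 7 (a fifth) mod 12 walks the circle of
    fifths, so adjacent Camelot numbers end up a fifth apart; minor keys
    take suffix A, major keys B.
    """
    pc = PITCH_CLASSES.index(key)
    fifths = (pc * 7) % 12
    offset = 5 if mode == "minor" else 8  # anchors A minor -> 8A, C major -> 8B
    number = (fifths + offset) % 12 or 12
    return f"{number}{'A' if mode == 'minor' else 'B'}"

# Matches the table above, and relative keys (A minor / C major) share a number.
assert camelot("A", "minor") == "8A" and camelot("C", "major") == "8B"
assert camelot("E", "minor") == "9A" and camelot("G", "major") == "9B"
assert camelot("F#", "minor") == "11A"
```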
Accuracy
Benchmarked on the GiantSteps evaluation set (held out from training):
| Metric | Score | Notes |
|---|---|---|
| Exact match | 82.1% | Predicted key + mode match reference exactly |
| Pitch class only | 91.3% | Correct pitch class; mode may differ |
| Within a fifth | 94.7% | Relative key, parallel key, or dominant — harmonically interchangeable |
For reference, Essentia's built-in tonal/key extractor (the open-source standard, chroma + template matching) scores around 72% exact match on the same benchmark. The CNN's gap comes from its ability to suppress percussion and vocal artifacts that confuse template-based methods.
Known limitations
Where KeyNet struggles:
- Atonal music — classical modernism, noise, ambient drones. No fixed tonal center means no correct answer; confidence will be low.
- Modulating tracks — songs that legitimately change key mid-way. The model reports the dominant key across the analysis window.
- Single-instrument recordings — solo piano, a cappella. Fewer harmonic cues mean higher variance, particularly between relative major / minor.
- Very short clips — under 10 seconds. Not enough data for robust estimation.
If confidence is below 0.5, the track is likely atonal, modulating, or ambiguous. Surface the raw value to end users rather than hiding it — DJ apps in particular benefit from flagging “uncertain key” rather than silently committing to a wrong answer.
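A sketch of the recommended handling; the 0.5 threshold comes from the guidance above, the field names from the response schema, and the formatting function itself is hypothetical:

```python
def display_key(result: dict, threshold: float = 0.5) -> str:
    """Format a /v1/key response for a DJ-facing UI, surfacing uncertainty."""
    label = f"{result['camelot']} ({result['key']} {result['mode']})"
    if result["confidence"] < threshold:
        # Likely atonal, modulating, or ambiguous: flag rather than hide.
        return f"{label} [uncertain: {result['confidence']:.0%}]"
    return label

assert display_key({"key": "F#", "mode": "minor", "camelot": "11A",
                    "confidence": 0.94}) == "11A (F# minor)"
assert "uncertain" in display_key({"key": "C", "mode": "major",
                                   "camelot": "8B", "confidence": 0.31})
```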
Code example
```
curl https://api.tunelab.dev/v1/key \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
```

Response:

```
{
  "key": "F#",
  "mode": "minor",
  "camelot": "11A",
  "confidence": 0.94,
  "_meta": {
    "latency_ms": 72,
    "credits_used": 2,
    "credits_remaining": 998,
    "trace_id": "afa-9c1d2e"
  }
}
```
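Client code should key off the four documented fields and treat `_meta` as diagnostic. A stdlib-only sketch that parses the sample response above (the embedded JSON is copied from it, not fetched live):

```python
import json

# Sample response from the curl example above.
sample = """{
  "key": "F#", "mode": "minor", "camelot": "11A", "confidence": 0.94,
  "_meta": {"latency_ms": 72, "credits_used": 2,
            "credits_remaining": 998, "trace_id": "afa-9c1d2e"}
}"""

result = json.loads(sample)

# The documented fields drive application logic; _meta is for debugging/billing.
key_label = f"{result['key']} {result['mode']} ({result['camelot']})"
assert key_label == "F# minor (11A)"
assert 0.0 <= result["confidence"] <= 1.0
```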
Further reading
- Korzeniowski & Widmer, End-to-End Musical Key Estimation Using a Convolutional Neural Network (EUSIPCO 2017) — the CQT-CNN baseline KeyNet builds on.
- POST /v1/key — full API endpoint reference, including file-upload vs cached lookup modes.
- Mixed In Key — Harmonic Mixing Guide — the definitive explainer on Camelot wheel DJ workflows.
- BPM Detection — paired with key, the two features every DJ app starts with.