
BPM Detection

TuneLab’s BPM detector uses an ensemble of bidirectional LSTM models originally trained for beat tracking. Unlike metadata lookups that return integer BPMs (“128”), we report float precision (127.87). That precision matters for DJ software, beatmatching, and remixing.

Why float precision matters

Real tempo is rarely exactly integer. A 127.87 BPM house track is subtly different from 128.00 — DJs feel it when syncing decks over long mixes, and producers need the extra decimal places to time-stretch accurately without phase artifacts. Metadata scrapers round because their source data is rounded. We compute from raw audio, so we don’t have to.
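To put a number on that drift, here is a worked example (our own arithmetic, not a figure from the pipeline):

```python
# Drift incurred when a 127.87 BPM track is mixed as if it were 128.00 BPM
true_bpm, rounded_bpm = 127.87, 128.00
seconds = 300                                # a five-minute mix
beats_played = seconds * true_bpm / 60       # 639.35 beats actually played
beats_expected = seconds * rounded_bpm / 60  # 640.00 beats the synced deck expects
drift_beats = beats_expected - beats_played  # 0.65 beats out of phase
drift_ms = drift_beats * 60_000 / true_bpm   # roughly 305 ms of phase error
```

Two-thirds of a beat of drift over one five-minute transition is clearly audible, which is why the rounded metadata value is not good enough.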

The pipeline

Each track goes through six stages:

  1. Audio loading. Decode to mono 44.1 kHz PCM. Clip to the first 60 seconds for efficiency — tempo almost never changes enough mid-track to matter.
  2. 3-resolution spectrogram. Compute STFT at three FFT sizes (1024, 2048, 4096) for different time/frequency trade-offs. Stack into a 314-dimensional feature vector per frame at 100 Hz frame rate.
  3. BiLSTM inference. Run a bidirectional LSTM ensemble (three models, averaged). Output shape is (n_frames, 3): probabilities for [no-beat, beat, downbeat] at each frame.
  4. Peak-picking. Find local maxima in the beat probability curve with a minimum distance of 7 frames (70 ms at the 100 Hz frame rate, a floor comfortably shorter than the beat interval of even the fastest genres).
  5. IOI histogram. Compute inter-onset intervals between consecutive beats. Build a weighted histogram. The dominant peak corresponds to the track’s tempo.
  6. Float refinement. Fit a parabola around the histogram peak to get sub-bin precision. This is where float precision comes from — we’re interpolating between histogram bins.
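Stages 4 through 6 can be sketched in a few lines. This is a simplified illustration, not TuneLab's implementation; the peak threshold, BPM range, and bin count are assumed values, and the function names are hypothetical.

```python
import numpy as np

def pick_peaks(beat_prob, min_dist=7, threshold=0.3):
    """Stage 4: local maxima of the beat-probability curve, at least
    min_dist frames (70 ms at 100 Hz) apart."""
    peaks = []
    for i in range(1, len(beat_prob) - 1):
        is_max = beat_prob[i - 1] < beat_prob[i] >= beat_prob[i + 1]
        if is_max and beat_prob[i] >= threshold:
            if not peaks or i - peaks[-1] >= min_dist:
                peaks.append(i)
    return np.array(peaks)

def tempo_from_peaks(peaks, frame_rate=100.0, n_bins=300, bpm_range=(40.0, 240.0)):
    """Stages 5 and 6: IOI histogram, then parabolic sub-bin refinement."""
    iois = np.diff(peaks) / frame_rate       # inter-onset intervals in seconds
    hist, edges = np.histogram(60.0 / iois, bins=n_bins, range=bpm_range)
    k = int(np.argmax(hist))                 # dominant (integer-resolution) bin
    # Fit a parabola through the peak bin and its neighbours; its vertex
    # gives a sub-bin offset in [-0.5, 0.5] -> float BPM
    offset = 0.0
    if 0 < k < n_bins - 1:
        y0, y1, y2 = float(hist[k - 1]), float(hist[k]), float(hist[k + 1])
        denom = y0 - 2.0 * y1 + y2
        if denom != 0.0:
            offset = 0.5 * (y0 - y2) / denom
    bin_width = edges[1] - edges[0]
    return edges[k] + (0.5 + offset) * bin_width
```

The parabolic fit in the last step is where the float precision comes from: three histogram values around the peak determine a parabola whose vertex lands between bin edges.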

Viterbi phase analysis

For the beat grid endpoint (not just BPM), we additionally run Viterbi decoding on the LSTM output. This finds the globally optimal sequence of beat positions given a narrow BPM search range (±10 BPM around the histogram peak). The result guarantees that beats are evenly spaced across the whole track, even when individual frame probabilities are noisy during quiet intros or sparse breakdowns.

Trade-off: Viterbi is O(n × k²) where k is the number of tempo candidates — roughly 10× slower than peak-picking. For the /v1/bpm endpoint we skip it. For /v1/beatgrid we always run it.
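To make the idea concrete, here is a minimal dynamic-programming decoder in the same spirit. It is closer to Ellis-style DP beat tracking than a full Viterbi over tempo states, and it is not TuneLab's decoder; the tightness constant and search window are assumptions.

```python
import numpy as np

def dp_beat_track(beat_prob, bpm, frame_rate=100.0, tightness=100.0):
    """Decode a globally consistent beat grid from frame-wise beat
    probabilities, given a tempo estimate (e.g. the IOI-histogram peak).
    Each frame's score is its local beat probability plus the best
    predecessor score, penalised for deviating from the beat period."""
    period = frame_rate * 60.0 / bpm          # expected frames per beat
    n = len(beat_prob)
    score = beat_prob.astype(float).copy()
    backlink = np.full(n, -1, dtype=int)
    for t in range(n):
        lo = max(0, int(t - 2 * period))      # predecessors between half
        hi = int(t - period / 2)              # and double the beat period back
        if hi <= lo:
            continue
        prev = np.arange(lo, hi)
        # log-Gaussian penalty on interval deviation from the period
        penalty = -tightness * np.log((t - prev) / period) ** 2
        cand = score[prev] + penalty
        best = int(np.argmax(cand))
        if cand[best] > 0:
            score[t] += cand[best]
            backlink[t] = prev[best]
    # Backtrace from the best-scoring frame to recover the beat grid
    beats = [int(np.argmax(score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return np.array(beats[::-1])
```

Because every beat is chosen jointly with its predecessors, a few noisy frames in a quiet intro cannot pull the grid off the track-wide optimum.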

Accuracy

Dataset             Tolerance   Accuracy
GTZAN               ±2%         94.8%
Ballroom            ±2%         96.2%
MIREX-style         ±4%         98.1%
SMC (hard cases)    ±4%         79.3%

Accuracy degrades on electronic genres with complex rhythmic structures (trap, dubstep sub-bass drops, juke) where the perceptual “beat” is ambiguous. In those cases we report both primary and alternative BPM candidates so your application can decide.

Half/double tempo ambiguity

A classic problem: a 70 BPM hip-hop track could also be interpreted as 140 BPM (counting hi-hats as beats). We detect this by looking for harmonic ratios in the IOI histogram. When both 70 and 140 have strong peaks, we return:

ambiguous response
{
  "tempo": 70.14,
  "bpm_alt": 140.28,
  "confidence": 0.63
}

When confidence is high (>0.85), bpm_alt is null. Your client can use genre heuristics to pick the right one — hip-hop is usually half-time, drum & bass is usually full-time.
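A sketch of how the harmonic-ratio check might look. This is illustrative only: the bin count, the neighbourhood width around the harmonic bin, and the 1 - ratio confidence mapping are assumptions, and the function name is hypothetical.

```python
import numpy as np

def resolve_half_double(bpms, n_bins=200, bpm_range=(40.0, 240.0), conf_threshold=0.85):
    """Flag half/double-tempo ambiguity from per-interval BPM estimates.
    Returns (tempo, bpm_alt, confidence); bpm_alt is None when the
    dominant histogram peak has no strong harmonic partner."""
    hist, edges = np.histogram(bpms, bins=n_bins, range=bpm_range)
    centers = (edges[:-1] + edges[1:]) / 2
    k = int(np.argmax(hist))
    tempo = centers[k]
    # Look for a harmonic partner at 2x or 0.5x the dominant tempo
    best_alt, best_ratio = None, 0.0
    for factor in (2.0, 0.5):
        target = tempo * factor
        if not (bpm_range[0] <= target <= bpm_range[1]):
            continue
        j = int(np.argmin(np.abs(centers - target)))
        lo, hi = max(0, j - 2), min(n_bins, j + 3)   # small search window
        jj = lo + int(np.argmax(hist[lo:hi]))
        ratio = hist[jj] / hist[k] if hist[k] else 0.0
        if ratio > best_ratio:
            best_ratio, best_alt = ratio, centers[jj]
    confidence = 1.0 - best_ratio          # 1.0 when no harmonic peak exists
    bpm_alt = best_alt if confidence <= conf_threshold else None
    return round(float(tempo), 2), bpm_alt, round(float(confidence), 2)
```

When the 70 BPM and 140 BPM peaks are comparable in weight, the ratio is high, confidence drops, and both candidates are surfaced, matching the ambiguous response above.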

Known limitations

Confidence score

We report confidence as the ratio of the dominant IOI histogram peak to the second-highest peak. Rule of thumb: above 0.85 the tempo is unambiguous and bpm_alt is null; below that, check bpm_alt before trusting the primary value.
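The ratio itself is simple to compute. The sketch below uses the second-highest histogram bin as the runner-up for brevity; a real implementation would use the second-highest local maximum, since bins adjacent to the dominant peak are also tall. The 1 - ratio mapping into [0, 1] is our assumption to match the score range in the responses.

```python
import numpy as np

def confidence_from_hist(hist):
    """Confidence from the IOI histogram: how much the dominant peak
    towers over the runner-up, mapped into [0, 1]."""
    h = np.sort(np.asarray(hist, dtype=float))[::-1]  # descending bin heights
    if h[0] == 0.0:
        return 0.0          # empty histogram: no tempo evidence at all
    return float(1.0 - h[1] / h[0])
```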

Code example

cURL
curl https://api.tunelab.dev/v1/bpm \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
response
{
  "tempo": 127.87,
  "confidence": 0.93,
  "bpm_alt": null,
  "_meta": {
    "latency_ms": 1420,
    "credits_used": 2,
    "credits_remaining": 948
  }
}

Further reading