Song Structure
Song structure detection turns a flat audio file into a timeline: intro 0–8s, verse 8–24s, chorus 24–40s, ... This is a hard problem because sections are not labeled in audio — they have to be inferred from repetition and change. TuneLab combines a classic MIR technique (chroma self-similarity matrices) with a novelty detector (Foote kernel) to find section boundaries, then clusters similar segments to label them. No trained classifier, no genre bias — just signal geometry.
Step 1: Chroma features
Chroma is a 12-dimensional vector per frame representing the energy in each of the 12 pitch classes (C, C#, D, ..., B). It is deliberately octave-invariant — an A3 and an A4 map to the same chroma dimension — which makes it an excellent representation for harmonic content: chord progressions, melodies, key changes.
We compute chroma directly from a CQT (see Key Detection for the CQT rationale) at a frame rate of 10 Hz. That’s high enough to capture chord changes with precision but low enough to keep the self-similarity matrix a manageable size for a typical 4-minute track.
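To make the octave folding concrete, here is a minimal numpy sketch of the idea — folding an FFT magnitude spectrum into 12 pitch classes. This is illustrative only: TuneLab derives chroma from a CQT, not a plain FFT, and the reference pitch and frame length here are arbitrary choices for the example.

```python
import numpy as np

def chroma_frame(frame, sr=22050):
    """Fold an FFT magnitude spectrum into a 12-bin chroma vector.

    A sketch of the pitch-class folding idea; a production system
    (including TuneLab) would derive chroma from a CQT instead.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for mag, f in zip(spectrum[1:], freqs[1:]):  # skip the DC bin
        if f < 30:  # below the musical range
            continue
        # pitch class relative to C (C4 = 261.626 Hz), octave-folded
        pc = int(round(12 * np.log2(f / 261.626))) % 12
        chroma[pc] += mag
    return chroma / (chroma.sum() + 1e-12)

# A pure 440 Hz tone (A4) concentrates energy in pitch class A (index 9),
# and so would an A3 or A5 — that is the octave invariance.
sr = 22050
t = np.arange(2048) / sr
c = chroma_frame(np.sin(2 * np.pi * 440 * t), sr)
```

The normalization makes frames comparable regardless of absolute level, which matters for the cosine similarities computed in the next step.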
Step 2: Self-Similarity Matrix
Given chroma features at every frame, we compute the pairwise cosine similarity between every frame and every other frame. The result is a symmetric N × N matrix where cell (i, j) tells you how harmonically similar frame i is to frame j.
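The matrix itself is a few lines of numpy — a sketch assuming chroma arrives as an (n_frames, 12) array (here filled with random stand-in data):

```python
import numpy as np

def self_similarity(chroma):
    """Cosine-similarity SSM.

    chroma: (n_frames, 12) array of chroma vectors.
    Returns an (n_frames, n_frames) symmetric matrix where entry (i, j)
    is the cosine similarity between frames i and j.
    """
    norms = np.linalg.norm(chroma, axis=1, keepdims=True)
    unit = chroma / np.maximum(norms, 1e-12)  # unit-normalize each frame
    return unit @ unit.T                      # all pairwise dot products

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 12)))  # stand-in for real chroma frames
S = self_similarity(X)
```

At 10 Hz a 4-minute track gives N ≈ 2400, so the full matrix is a few million floats — comfortably in memory, which is the point of the low frame rate.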
When you visualize an SSM for a pop song, you see:
- A bright diagonal — every frame is maximally similar to itself.
- Bright off-diagonal blocks where the same section repeats — most notably the chorus.
- Less bright but still coherent blocks for verses, bridges, and other repeated material.
- “Paths” parallel to the diagonal when a section is repeated verbatim later in the track.
Step 3: Foote Novelty Kernel
The Foote novelty score (Foote, 2000) is a classic technique for detecting change points in an SSM. Slide a 2D “checkerboard” kernel — positive in the top-left and bottom-right quadrants, negative in the top-right and bottom-left — along the diagonal of the SSM. The convolution output is the novelty score at each timestep.
Intuitively: the kernel measures how different the past few seconds are from the next few seconds. A high novelty score means the audio’s character just changed — a transition into a new section. We use a kernel width of roughly 4 seconds.
Peak-picking on the novelty score, with a minimum distance constraint of 5 seconds between peaks, gives us candidate section boundaries. The min-distance constraint prevents false positives from single chord changes or fills.
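The two pieces — the checkerboard convolution and the min-distance peak picking — can be sketched as follows. The kernel width and distances here are in frames, not seconds, and the greedy peak picker is one simple choice among several; TuneLab's exact parameters are as described above (≈4 s kernel, 5 s minimum distance at 10 Hz).

```python
import numpy as np

def checkerboard(width):
    """Foote checkerboard kernel: +1 in the top-left and bottom-right
    quadrants, -1 in the cross quadrants. width is the full (even) size."""
    half = width // 2
    q = np.ones((half, half))
    return np.block([[q, -q], [-q, q]])

def novelty_curve(S, width=8):
    """Slide the checkerboard along the main diagonal of SSM S."""
    half = width // 2
    K = checkerboard(width)
    nov = np.zeros(S.shape[0])
    for i in range(half, S.shape[0] - half):
        nov[i] = np.sum(K * S[i - half:i + half, i - half:i + half])
    return nov

def pick_peaks(nov, min_dist=5):
    """Greedy peak picking with a minimum distance between peaks."""
    peaks = []
    for i in np.argsort(nov)[::-1]:       # strongest peaks first
        if nov[i] <= 0:
            break
        if all(abs(int(i) - p) >= min_dist for p in peaks):
            peaks.append(int(i))
    return sorted(peaks)

# Synthetic two-section SSM: frames 0-19 and 20-39 are internally similar
# but dissimilar to each other, so novelty should peak at frame 20.
S = np.zeros((40, 40))
S[:20, :20] = 1.0
S[20:, 20:] = 1.0
nov = novelty_curve(S, width=8)
peaks = pick_peaks(nov, min_dist=5)  # → [20]
```

Note how novelty stays at zero deep inside each block — the positive and negative quadrants cancel exactly when nothing changes — and spikes only at the boundary.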
Step 4: Segment clustering
Between consecutive boundaries we have contiguous segments. For each segment we compute a mean chroma vector (the segment’s “harmonic fingerprint”) and run agglomerative clustering with cosine distance.
Two segments with cosine similarity > 0.85 are merged into the same cluster — they are almost certainly the same section repeated. This gives us the section labels: cluster 1 might be “chorus,” cluster 2 “verse,” cluster 3 “bridge,” and so on.
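A stripped-down sketch of the clustering step, assuming fingerprints arrive as an (n_segments, 12) array. This greedy single-pass merge is a simplification for illustration — a production system would use proper agglomerative clustering — but the 0.85 cosine threshold is the one described above.

```python
import numpy as np

def cluster_segments(fingerprints, threshold=0.85):
    """Greedy clustering of segment fingerprints by cosine similarity.

    fingerprints: (n_segments, 12) mean chroma per segment.
    Returns an integer cluster label per segment.
    """
    norms = np.linalg.norm(fingerprints, axis=1, keepdims=True)
    unit = fingerprints / np.maximum(norms, 1e-12)
    labels = -np.ones(len(unit), dtype=int)
    next_label = 0
    for i in range(len(unit)):
        if labels[i] >= 0:
            continue  # already assigned to an earlier cluster
        labels[i] = next_label
        for j in range(i + 1, len(unit)):
            if labels[j] < 0 and unit[i] @ unit[j] > threshold:
                labels[j] = next_label
        next_label += 1
    return labels

# Three segments: two near-identical "chorus" fingerprints around a
# harmonically distinct "verse" — cosine similarity is scale-invariant,
# so the louder repeat still merges with the first chorus.
chorus = np.array([1.0, 0, 0, 0, 0.5, 0, 0, 1.0, 0, 0, 0, 0])
verse = np.array([0, 1.0, 0, 0.8, 0, 0, 1.0, 0, 0, 0, 0.3, 0])
labels = cluster_segments(np.stack([chorus, verse, chorus * 1.1]))
# → [0, 1, 0]: first and third segments are the same section
```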
Step 5: Labeling
Clusters are labeled by heuristics, not a trained classifier. This is deliberate — a trained classifier would bake in strong genre bias, and the heuristics are transparent, auditable, and easy to reason about:
- The first segment is labeled intro.
- The last segment is labeled outro.
- The most frequently repeated cluster is labeled chorus.
- Segments that flank chorus segments are labeled verse.
- A single-occurrence mid-track segment with lower energy is breakdown (EDM) or bridge (pop).
- A single-occurrence mid-track segment with a sharp energy rise right after its boundary is drop.
Drop detection specifically fuses structure with the energy contour: a segment boundary immediately followed by a steep rise in short-term loudness and low-frequency energy is the classic EDM drop signature. This is the one place where the structure module talks to the mood/energy module, and the handoff is explicit.
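The fusion itself reduces to a simple check: compare short-term energy just before and just after a candidate boundary. The window size and rise ratio below are illustrative placeholders, not TuneLab's actual thresholds.

```python
import numpy as np

def is_drop(rms, boundary, window=5, rise_ratio=2.0):
    """Flag a boundary as a drop candidate when short-term energy right
    after it is much higher than right before it.

    rms: per-frame loudness contour; boundary, window: in frames.
    Thresholds here are illustrative, not TuneLab's production values.
    """
    before = rms[max(0, boundary - window):boundary].mean()
    after = rms[boundary:boundary + window].mean()
    return after > rise_ratio * max(before, 1e-12)

# Quiet build into a loud section: the energy step at frame 20 is a drop
# signature; a mid-build boundary at frame 10 is not.
rms = np.concatenate([np.full(20, 0.1), np.full(20, 0.9)])
```

A fuller implementation would weight low-frequency energy specifically, as described above, but the before/after contrast is the core of the signature.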
Example output
A typical 4-minute EDM track produces something like this:
{
"structure": [
{ "label": "intro", "start_s": 0.0, "end_s": 8.2, "confidence": 0.89 },
{ "label": "verse", "start_s": 8.2, "end_s": 24.5, "confidence": 0.82 },
{ "label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.91 },
{ "label": "verse", "start_s": 40.1, "end_s": 56.3, "confidence": 0.78 },
{ "label": "drop", "start_s": 56.3, "end_s": 88.0, "confidence": 0.93 },
{ "label": "breakdown", "start_s": 88.0, "end_s": 104.2, "confidence": 0.71 },
{ "label": "chorus", "start_s": 104.2, "end_s": 120.5, "confidence": 0.88 },
{ "label": "outro", "start_s": 120.5, "end_s": 140.0, "confidence": 0.85 }
]
}
Use cases
- DJ apps: auto-cue at section boundaries, paint section markers on the waveform, jump-to-chorus button.
- Mashup tools: extract chorus sections for looping, align drops across two tracks for beatmatched transitions.
- Music discovery: compare tracks by their hook rather than the full song — a much better “sounds like” signal than averaging embeddings across a track.
- Podcast trimming: advertisement breaks are often detected as distinct segments with dissimilar chroma profiles. Not perfect, but a useful first pass.
- Video editing: cut on boundaries for automatic music-video sync.
Confidence scores
Each segment carries a confidence value. This is the strength of the SSM cluster the segment belongs to — roughly, how tightly the segment’s chroma fingerprint agrees with other members of its cluster. High confidence (> 0.85) means the segment is a clean, repeating section. Low confidence (< 0.70) means the segment is unique, ambiguous, or transitional.
Always surface the confidence values to your users or downstream logic. A chorus with confidence 0.92 should be treated very differently from a “chorus” with confidence 0.58.
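For example, downstream logic can split the response into trusted labels and raw boundaries — a minimal sketch assuming the JSON shape shown above, with a hypothetical 0.85 cutoff (tune it for your use case):

```python
# Response shape as returned by /v1/structure (abbreviated example data).
response = {
    "structure": [
        {"label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.91},
        {"label": "bridge", "start_s": 60.0, "end_s": 72.0, "confidence": 0.58},
    ]
}

# Trust labels only above a confidence cutoff...
trusted = [s for s in response["structure"] if s["confidence"] >= 0.85]

# ...but keep every boundary timestamp regardless, since boundary
# detection is independent of the labeling heuristics.
boundaries = sorted({s["start_s"] for s in response["structure"]} |
                    {s["end_s"] for s in response["structure"]})
```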
Known limitations
Cases where structure detection degrades gracefully or fails outright:
- Classical music — through-composed works have little literal repetition. Expect coarse, generic labels.
- Very short tracks (< 60s) — not enough duration for clustering to stabilize. May return only 1–2 sections.
- Ambient / drone — no clear boundaries because the material doesn’t change. The novelty score stays flat.
- Acoustic material with minimal dynamic variation — confidence values will be low across the board; use them to decide whether to trust the output.
- Unusual song forms (free jazz, prog rock, experimental) — labels like “verse” and “chorus” are approximations. The boundaries are still useful; the labels may not be.
If you’re building a DJ or editing tool and the labels don’t fit your genre, ignore them and keep the boundary timestamps. The boundary detection is independent of the labeling heuristics, and it works across a much wider range of material.
Code example
curl https://api.tunelab.dev/v1/structure \
-H "Authorization: Bearer tl_live_xxx" \
-H "Content-Type: audio/mpeg" \
--data-binary @track.mp3
Further reading
- Foote, Automatic Audio Segmentation Using A Measure Of Audio Novelty (ICME 2000) — the original novelty kernel paper; still the backbone of modern structure detection.
- Serrà et al., Unsupervised Detection of Music Boundaries by Time Series Structure Features (ISMIR 2012) — a more recent unsupervised approach that beats trained baselines.
- POST /v1/structure — full API endpoint reference.
- Beat Grid — pair with structure for quantized section markers aligned to downbeats.
- BPM Detection — tempo-aware tools frequently want both tempo and structure in the same request.