Song Structure
Song structure detection turns a flat audio file into a timeline: intro 0–8s, verse 8–24s, chorus 24–40s, ... This is a hard problem because sections are not labeled in audio — they have to be inferred from repetition and change. TuneLab combines a classic MIR technique (chroma self-similarity matrices) with a novelty detector (Foote kernel) to find section boundaries, then clusters similar segments to label them. No trained classifier, no genre bias — just signal geometry.
Step 1: Chroma features
Chroma is a 12-dimensional vector per frame representing the energy in each of the 12 pitch classes (C, C#, D, ..., B). It is deliberately octave-invariant — an A3 and an A4 map to the same chroma dimension — which makes it an excellent representation for harmonic content: chord progressions, melodies, key changes.
We compute chroma directly from a CQT (see Key Detection for the CQT rationale) at a frame rate of 10 Hz. That’s high enough to capture chord changes with precision but low enough to keep the self-similarity matrix a manageable size for a typical 4-minute track.
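To make the octave folding concrete, here is a minimal numpy sketch of the idea — folding an FFT magnitude spectrum into 12 pitch classes. This is illustrative only: TuneLab derives chroma from a CQT, not a plain FFT, and the reference pitch and frame length here are arbitrary choices for the example.

```python
import numpy as np

def chroma_frame(frame, sr=22050):
    """Fold an FFT magnitude spectrum into a 12-bin chroma vector.

    A sketch of the pitch-class folding idea; a production system
    (including TuneLab) would derive chroma from a CQT instead.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for mag, f in zip(spectrum[1:], freqs[1:]):  # skip the DC bin
        if f < 30:  # below the musical range
            continue
        # pitch class relative to C (C4 = 261.626 Hz), octave-folded
        pc = int(round(12 * np.log2(f / 261.626))) % 12
        chroma[pc] += mag
    return chroma / (chroma.sum() + 1e-12)

# A pure 440 Hz tone (A4) concentrates energy in pitch class A (index 9),
# and so would an A3 or A5 — that is the octave invariance.
sr = 22050
t = np.arange(2048) / sr
c = chroma_frame(np.sin(2 * np.pi * 440 * t), sr)
```

The normalization makes frames comparable regardless of absolute level, which matters for the cosine similarities computed in the next step.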
Step 2: Self-Similarity Matrix
Given chroma features at every frame, we compute the pairwise cosine similarity between every frame and every other frame. The result is a symmetric N × N matrix where cell (i, j) tells you how harmonically similar frame i is to frame j.
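The matrix itself is a few lines of numpy — a sketch assuming chroma arrives as an (n_frames, 12) array (here filled with random stand-in data):

```python
import numpy as np

def self_similarity(chroma):
    """Cosine-similarity SSM.

    chroma: (n_frames, 12) array of chroma vectors.
    Returns an (n_frames, n_frames) symmetric matrix where entry (i, j)
    is the cosine similarity between frames i and j.
    """
    norms = np.linalg.norm(chroma, axis=1, keepdims=True)
    unit = chroma / np.maximum(norms, 1e-12)  # unit-normalize each frame
    return unit @ unit.T                      # all pairwise dot products

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 12)))  # stand-in for real chroma frames
S = self_similarity(X)
```

At 10 Hz a 4-minute track gives N ≈ 2400, so the full matrix is a few million floats — comfortably in memory, which is the point of the low frame rate.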
When you visualize an SSM for a pop song, you see:
- A bright diagonal — every frame is maximally similar to itself.
- Bright off-diagonal blocks where the same section repeats — most notably the chorus.
- Less bright but still coherent blocks for verses, bridges, and other repeated material.
- “Paths” parallel to the diagonal when a section is repeated verbatim later in the track.
Step 3: Foote Novelty Kernel
The Foote novelty score (Foote, 2000) is a classic technique for detecting change points in an SSM. Slide a 2D “checkerboard” kernel — positive in the top-left and bottom-right quadrants, negative in the top-right and bottom-left — along the diagonal of the SSM. The convolution output is the novelty score at each timestep.
Intuitively: the kernel measures how different the past few seconds are from the next few seconds. A high novelty score means the audio’s character just changed — a transition into a new section. We use a kernel width of roughly 4 seconds.
Peak-picking on the novelty score, with a minimum distance constraint of 5 seconds between peaks, gives us candidate section boundaries. The min-distance constraint prevents false positives from single chord changes or fills.
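The two pieces — the checkerboard convolution and the min-distance peak picking — can be sketched as follows. The kernel width and distances here are in frames, not seconds, and the greedy peak picker is one simple choice among several; TuneLab's exact parameters are as described above (≈4 s kernel, 5 s minimum distance at 10 Hz).

```python
import numpy as np

def checkerboard(width):
    """Foote checkerboard kernel: +1 in the top-left and bottom-right
    quadrants, -1 in the cross quadrants. width is the full (even) size."""
    half = width // 2
    q = np.ones((half, half))
    return np.block([[q, -q], [-q, q]])

def novelty_curve(S, width=8):
    """Slide the checkerboard along the main diagonal of SSM S."""
    half = width // 2
    K = checkerboard(width)
    nov = np.zeros(S.shape[0])
    for i in range(half, S.shape[0] - half):
        nov[i] = np.sum(K * S[i - half:i + half, i - half:i + half])
    return nov

def pick_peaks(nov, min_dist=5):
    """Greedy peak picking with a minimum distance between peaks."""
    peaks = []
    for i in np.argsort(nov)[::-1]:       # strongest peaks first
        if nov[i] <= 0:
            break
        if all(abs(int(i) - p) >= min_dist for p in peaks):
            peaks.append(int(i))
    return sorted(peaks)

# Synthetic two-section SSM: frames 0-19 and 20-39 are internally similar
# but dissimilar to each other, so novelty should peak at frame 20.
S = np.zeros((40, 40))
S[:20, :20] = 1.0
S[20:, 20:] = 1.0
nov = novelty_curve(S, width=8)
peaks = pick_peaks(nov, min_dist=5)  # → [20]
```

Note how novelty stays at zero deep inside each block — the positive and negative quadrants cancel exactly when nothing changes — and spikes only at the boundary.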
Step 4: Segment clustering
Between consecutive boundaries we have contiguous segments. For each segment we compute a mean chroma vector (the segment’s “harmonic fingerprint”) and run agglomerative clustering with cosine distance.
Two segments with cosine similarity > 0.85 are merged into the same cluster — they are almost certainly the same section repeated. This gives us the section labels: cluster 1 might be “chorus,” cluster 2 “verse,” cluster 3 “bridge,” and so on.
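A stripped-down sketch of the clustering step, assuming fingerprints arrive as an (n_segments, 12) array. This greedy single-pass merge is a simplification for illustration — a production system would use proper agglomerative clustering — but the 0.85 cosine threshold is the one described above.

```python
import numpy as np

def cluster_segments(fingerprints, threshold=0.85):
    """Greedy clustering of segment fingerprints by cosine similarity.

    fingerprints: (n_segments, 12) mean chroma per segment.
    Returns an integer cluster label per segment.
    """
    norms = np.linalg.norm(fingerprints, axis=1, keepdims=True)
    unit = fingerprints / np.maximum(norms, 1e-12)
    labels = -np.ones(len(unit), dtype=int)
    next_label = 0
    for i in range(len(unit)):
        if labels[i] >= 0:
            continue  # already assigned to an earlier cluster
        labels[i] = next_label
        for j in range(i + 1, len(unit)):
            if labels[j] < 0 and unit[i] @ unit[j] > threshold:
                labels[j] = next_label
        next_label += 1
    return labels

# Three segments: two near-identical "chorus" fingerprints around a
# harmonically distinct "verse" — cosine similarity is scale-invariant,
# so the louder repeat still merges with the first chorus.
chorus = np.array([1.0, 0, 0, 0, 0.5, 0, 0, 1.0, 0, 0, 0, 0])
verse = np.array([0, 1.0, 0, 0.8, 0, 0, 1.0, 0, 0, 0, 0.3, 0])
labels = cluster_segments(np.stack([chorus, verse, chorus * 1.1]))
# → [0, 1, 0]: first and third segments are the same section
```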
Step 5: Labeling
Clusters are labeled by heuristics, not a trained classifier. This is deliberate — a trained classifier would bake in strong genre bias, and the heuristics are transparent, auditable, and easy to reason about:
- The first segment is labeled intro.
- The last segment is labeled outro.
- The most frequently repeated cluster is labeled chorus.
- Segments that flank chorus segments are labeled verse.
- A single-occurrence mid-track segment with lower energy is breakdown (EDM) or bridge (pop).
- A single-occurrence mid-track segment with a sharp energy rise right after its boundary is drop.
Drop detection specifically fuses structure with the energy contour: a segment boundary immediately followed by a steep rise in short-term loudness and low-frequency energy is the classic EDM drop signature. This is the one place where the structure module talks to the mood/energy module, and the handoff is explicit.
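The fusion itself reduces to a simple check: compare short-term energy just before and just after a candidate boundary. The window size and rise ratio below are illustrative placeholders, not TuneLab's actual thresholds.

```python
import numpy as np

def is_drop(rms, boundary, window=5, rise_ratio=2.0):
    """Flag a boundary as a drop candidate when short-term energy right
    after it is much higher than right before it.

    rms: per-frame loudness contour; boundary, window: in frames.
    Thresholds here are illustrative, not TuneLab's production values.
    """
    before = rms[max(0, boundary - window):boundary].mean()
    after = rms[boundary:boundary + window].mean()
    return after > rise_ratio * max(before, 1e-12)

# Quiet build into a loud section: the energy step at frame 20 is a drop
# signature; a mid-build boundary at frame 10 is not.
rms = np.concatenate([np.full(20, 0.1), np.full(20, 0.9)])
```

A fuller implementation would weight low-frequency energy specifically, as described above, but the before/after contrast is the core of the signature.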
Example output
A typical 4-minute EDM track produces something like this:
{
"structure": [
{ "label": "intro", "start_s": 0.0, "end_s": 8.2, "confidence": 0.89 },
{ "label": "verse", "start_s": 8.2, "end_s": 24.5, "confidence": 0.82 },
{ "label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.91 },
{ "label": "verse", "start_s": 40.1, "end_s": 56.3, "confidence": 0.78 },
{ "label": "drop", "start_s": 56.3, "end_s": 88.0, "confidence": 0.93 },
{ "label": "breakdown", "start_s": 88.0, "end_s": 104.2, "confidence": 0.71 },
{ "label": "chorus", "start_s": 104.2, "end_s": 120.5, "confidence": 0.88 },
{ "label": "outro", "start_s": 120.5, "end_s": 140.0, "confidence": 0.85 }
]
}
Use cases
- DJ apps: auto-cue at section boundaries, paint section markers on the waveform, jump-to-chorus button.
- Mashup tools: extract chorus sections for looping, align drops across two tracks for beatmatched transitions.
- Music discovery: compare tracks by their hook rather than the full song — a much better “sounds like” signal than averaging embeddings across a track.
- Podcast trimming: advertisement breaks are often detected as distinct segments with dissimilar chroma profiles. Not perfect, but a useful first pass.
- Video editing: cut on boundaries for automatic music-video sync.
Confidence scores
Each segment carries a confidence value. This is the strength of the SSM cluster the segment belongs to — roughly, how tightly the segment’s chroma fingerprint agrees with other members of its cluster. High confidence (> 0.85) means the segment is a clean, repeating section. Low confidence (< 0.70) means the segment is unique, ambiguous, or transitional.
Always surface the confidence values to your users or downstream logic. A chorus with confidence 0.92 should be treated very differently from a “chorus” with confidence 0.58.
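For example, downstream logic can split the response into trusted labels and raw boundaries — a minimal sketch assuming the JSON shape shown above, with a hypothetical 0.85 cutoff (tune it for your use case):

```python
# Response shape as returned by /v1/structure (abbreviated example data).
response = {
    "structure": [
        {"label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.91},
        {"label": "bridge", "start_s": 60.0, "end_s": 72.0, "confidence": 0.58},
    ]
}

# Trust labels only above a confidence cutoff...
trusted = [s for s in response["structure"] if s["confidence"] >= 0.85]

# ...but keep every boundary timestamp regardless, since boundary
# detection is independent of the labeling heuristics.
boundaries = sorted({s["start_s"] for s in response["structure"]} |
                    {s["end_s"] for s in response["structure"]})
```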
Known limitations
Cases where structure detection degrades gracefully or fails outright:
- Classical music — through-composed works have little literal repetition. Expect coarse, generic labels.
- Very short tracks (< 60s) — not enough duration for clustering to stabilize. May return only 1–2 sections.
- Ambient / drone — no clear boundaries because the material doesn’t change. The novelty score stays flat.
- Acoustic material with minimal dynamic variation — confidence values will be low across the board; use them to decide whether to trust the output.
- Unusual song forms (free jazz, prog rock, experimental) — labels like “verse” and “chorus” are approximations. The boundaries are still useful; the labels may not be.
If you’re building a DJ or editing tool and the labels don’t fit your genre, ignore them and keep the boundary timestamps. The boundary detection is independent of the labeling heuristics, and it works across a much wider range of material.
Code example
curl https://api.tunelab.dev/v1/structure \
-H "Authorization: Bearer tl_live_xxx" \
-H "Content-Type: audio/mpeg" \
--data-binary @track.mp3
Further reading
- Foote, Automatic Audio Segmentation Using A Measure Of Audio Novelty (ICME 2000) — the original novelty kernel paper; still the backbone of modern structure detection.
- Serrà et al., Unsupervised Detection of Music Boundaries by Time Series Structure Features (ISMIR 2012) — a more recent unsupervised approach that beats trained baselines.
- POST /v1/structure — full API endpoint reference.
- Beat Grid — pair with structure for quantized section markers aligned to downbeats.
- BPM Detection — tempo-aware tools frequently want both tempo and structure in the same request.