
Song Structure

Song structure detection turns a flat audio file into a timeline: intro 0–8s, verse 8–24s, chorus 24–40s, ... This is a hard problem because sections are not labeled in audio — they have to be inferred from repetition and change. TuneLab combines a classic MIR technique (chroma self-similarity matrices) with a novelty detector (Foote kernel) to find section boundaries, then clusters similar segments to label them. No trained classifier, no genre bias — just signal geometry.

Step 1: Chroma features

Chroma is a 12-dimensional vector per frame representing the energy in each of the 12 pitch classes (C, C#, D, ..., B). It is deliberately octave-invariant — an A3 and an A4 map to the same chroma dimension — which makes it an excellent representation for harmonic content: chord progressions, melodies, key changes.
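To make the octave invariance concrete, here is a minimal sketch in plain Python. It assumes the standard MIDI numbering (A4 = 69) and an input with one energy value per semitone, as a CQT frame at 12 bins per octave would give; the function names are illustrative and not part of TuneLab.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(midi_note: int) -> str:
    """Fold a MIDI note number onto its pitch class, discarding the octave."""
    return PITCH_CLASSES[midi_note % 12]


def fold_to_chroma(semitone_energy: list[float], lowest_midi: int) -> list[float]:
    """Sum per-semitone energies into 12 chroma bins.

    Assumes semitone_energy[i] is the energy at MIDI note lowest_midi + i.
    """
    chroma = [0.0] * 12
    for i, energy in enumerate(semitone_energy):
        chroma[(lowest_midi + i) % 12] += energy
    return chroma


# A3 is MIDI 57 and A4 is MIDI 69; both land in the same chroma bin.
print(pitch_class(57), pitch_class(69))  # prints "A A"
```

Every octave of the same note adds into one bin, which is why a chord voiced high or low produces nearly the same chroma vector.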

We compute chroma directly from a CQT (see Key Detection for the CQT rationale) at a frame rate of 10 Hz. That’s high enough to capture chord changes with precision but low enough to keep the self-similarity matrix a manageable size for a typical 4-minute track.
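The size trade-off is simple arithmetic. The sketch below assumes a dense matrix stored as 4-byte floats (an assumption about storage, not something stated here) and compares 10 Hz with an arbitrary higher rate of 100 Hz.

```python
def ssm_size(duration_s: float, frame_rate_hz: float = 10.0, bytes_per_cell: int = 4):
    """Return (frames, cells, megabytes) for a dense N x N self-similarity matrix."""
    frames = round(duration_s * frame_rate_hz)
    cells = frames * frames
    return frames, cells, cells * bytes_per_cell / 1e6


# A 4-minute track is 240 seconds.
for rate in (10.0, 100.0):
    frames, cells, megabytes = ssm_size(240.0, rate)
    print(f"{rate:>5.1f} Hz -> {frames} frames, {cells:,} cells, {megabytes:.1f} MB")
```

At 10 Hz a 4-minute track gives 2,400 frames and 5.76 million cells. Because the matrix grows with the square of the frame count, a tenfold increase in frame rate costs a hundredfold increase in memory.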

Step 2: Self-Similarity Matrix

Given chroma features at every frame, we compute the pairwise cosine similarity between every frame and every other frame. The result is a symmetric N × N matrix where cell (i, j) tells you how harmonically similar frame i is to frame j.
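A minimal sketch of this step, written in plain Python for readability (a production version would use a vectorized library). The toy chroma frames are made up for illustration: two C major frames followed by two G major frames.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def self_similarity_matrix(frames):
    """Pairwise cosine similarity between all frames, as a symmetric N x N list of lists."""
    n = len(frames)
    ssm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Similarity is symmetric, so compute each pair once.
            ssm[i][j] = ssm[j][i] = cosine_similarity(frames[i], frames[j])
    return ssm


# Chroma order is C, C#, D, ..., B.
c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # C, E, G
g_major = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]  # G, B, D
ssm = self_similarity_matrix([c_major, c_major, g_major, g_major])
for row in ssm:
    print([round(value, 2) for value in row])
```

The two chords share one note out of three, so cross-chord cells come out at about 0.33 while same-chord cells are 1.0. That contrast is what produces the block pattern the next step looks for.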

When you visualize an SSM for a pop song, you see:

Step 3: Foote Novelty Kernel

The Foote novelty score (Foote, 2000) is a classic technique for detecting change points in an SSM. Slide a 2D “checkerboard” kernel (positive in the top-left and bottom-right quadrants, negative in the top-right and bottom-left) along the diagonal of the SSM. At each position, the sum of the element-wise product of the kernel and the matrix underneath it is the novelty score for that timestep.

Intuitively: the kernel measures how different the past few seconds are from the next few seconds. A high novelty score means the audio’s character just changed — a transition into a new section. We use a kernel width of roughly 4 seconds.
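The kernel and the sliding sum can be sketched as follows. This is a simplified illustration, not TuneLab's code: it uses an untapered kernel (Foote's paper also describes a Gaussian-tapered variant), and `half_width` is in frames. If the 4-second width is the total kernel size, that would be a half-width of about 20 frames at 10 Hz, which is an assumption about how the width is measured.

```python
def checkerboard_kernel(half_width):
    """2L x 2L kernel: +1 in the top-left and bottom-right quadrants, -1 in the other two."""
    size = 2 * half_width
    return [
        [1.0 if (r < half_width) == (c < half_width) else -1.0 for c in range(size)]
        for r in range(size)
    ]


def novelty_curve(ssm, half_width):
    """Correlate the kernel along the SSM diagonal.

    novelty[t] compares frames [t - L, t) with frames [t, t + L).
    Positions too close to either edge are left at 0.0.
    """
    n = len(ssm)
    kernel = checkerboard_kernel(half_width)
    novelty = [0.0] * n
    for t in range(half_width, n - half_width + 1):
        total = 0.0
        for r in range(2 * half_width):
            for c in range(2 * half_width):
                total += kernel[r][c] * ssm[t - half_width + r][t - half_width + c]
        novelty[t] = total
    return novelty


# Toy SSM: frames 0-3 belong to one section, frames 4-7 to another.
toy_ssm = [[1.0 if (i < 4) == (j < 4) else 0.0 for j in range(8)] for i in range(8)]
print(novelty_curve(toy_ssm, half_width=2))
# prints [0.0, 0.0, 0.0, 2.0, 8.0, 2.0, 0.0, 0.0]
```

The score peaks at frame 4, exactly where the toy song switches sections, and is zero wherever the window sits entirely inside one section.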

Peak-picking on the novelty score, with a minimum distance constraint of 5 seconds between peaks, gives us candidate section boundaries. The min-distance constraint prevents false positives from single chord changes or fills.
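One simple way to enforce a minimum distance is greedy suppression: accept the strongest peak first, then discard any weaker peak that falls too close to one already accepted. The sketch below assumes that strategy and a hypothetical `threshold` parameter; the peak picker TuneLab actually uses is not specified here.

```python
def pick_peaks(novelty, min_distance, threshold=0.0):
    """Return indices of local maxima, strongest first, at least min_distance frames apart."""
    candidates = [
        i
        for i in range(1, len(novelty) - 1)
        if novelty[i] > threshold
        and novelty[i] >= novelty[i - 1]
        and novelty[i] > novelty[i + 1]
    ]
    # Visit the strongest peaks first so they win any conflict with a weaker neighbor.
    candidates.sort(key=lambda i: novelty[i], reverse=True)
    kept = []
    for i in candidates:
        if all(abs(i - k) >= min_distance for k in kept):
            kept.append(i)
    return sorted(kept)


FRAME_RATE_HZ = 10
min_distance_frames = round(5.0 * FRAME_RATE_HZ)  # 5 seconds -> 50 frames

# Small toy curve, using a 5-frame distance so the effect is visible.
toy_novelty = [0, 1, 0, 5, 0, 0, 0, 0, 4, 0, 2, 0]
print(pick_peaks(toy_novelty, min_distance=5))  # prints [3, 8]
```

The small bumps at frames 1 and 10 are dropped because each sits within five frames of a stronger peak, which is how a single chord change or fill near a real boundary gets ignored.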

Step 4: Segment clustering

Between consecutive boundaries we have contiguous segments. For each segment we compute a mean chroma vector (the segment’s “harmonic fingerprint”) and run agglomerative clustering with cosine distance.

Two segments with cosine similarity > 0.85 are merged into the same cluster, since they are almost certainly the same section repeated. The output is a cluster ID per segment rather than a name: cluster 1 might turn out to be the chorus, cluster 2 the verse, cluster 3 the bridge, and so on. Assigning those names is the job of the next step.
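A minimal sketch of the merge rule. The linkage criterion is not stated above, so this version assumes single linkage: any pair above the threshold joins two clusters, implemented with a small union-find. The function names and toy fingerprints are illustrative.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_chroma(frames):
    """Average a segment's chroma frames into one fingerprint vector."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def cluster_segments(fingerprints, threshold=0.85):
    """Return one cluster ID per segment, numbered in order of first appearance."""
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(fingerprints[i], fingerprints[j]) > threshold:
                ri, rj = find(i), find(j)
                parent[max(ri, rj)] = min(ri, rj)

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
g_major = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
c_noisy = [0.9, 0, 0, 0, 1.0, 0.1, 0, 1.0, 0, 0, 0, 0]
print(cluster_segments([c_major, g_major, c_noisy, g_major]))  # prints [0, 1, 0, 1]
```

Single linkage can chain loosely related segments together through intermediates; average or complete linkage would be stricter, at the cost of a slightly longer implementation.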

Step 5: Labeling

Clusters are labeled by heuristics, not a trained classifier. This is deliberate — a trained classifier would bake in strong genre bias, and the heuristics are transparent, auditable, and easy to reason about:

Drop detection specifically fuses structure with the energy contour: a segment boundary immediately followed by a steep rise in short-term loudness and low-frequency energy is the classic EDM drop signature. This is the one place where the structure module talks to the mood/energy module, and the handoff is explicit.

Example output

A short EDM track (2 minutes 20 seconds in this example) produces something like this:

response
{
  "structure": [
    { "label": "intro",     "start_s":   0.0, "end_s":   8.2, "confidence": 0.89 },
    { "label": "verse",     "start_s":   8.2, "end_s":  24.5, "confidence": 0.82 },
    { "label": "chorus",    "start_s":  24.5, "end_s":  40.1, "confidence": 0.91 },
    { "label": "verse",     "start_s":  40.1, "end_s":  56.3, "confidence": 0.78 },
    { "label": "drop",      "start_s":  56.3, "end_s":  88.0, "confidence": 0.93 },
    { "label": "breakdown", "start_s":  88.0, "end_s": 104.2, "confidence": 0.71 },
    { "label": "chorus",    "start_s": 104.2, "end_s": 120.5, "confidence": 0.88 },
    { "label": "outro",     "start_s": 120.5, "end_s": 140.0, "confidence": 0.85 }
  ]
}
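Consuming this response takes only the standard library. The sketch below parses a trimmed copy of the example (the first four segments), checks that segments are back to back, and totals the seconds spent under each label. The helper names are illustrative.

```python
import json
from collections import defaultdict

response_text = """
{
  "structure": [
    { "label": "intro",  "start_s":  0.0, "end_s":  8.2, "confidence": 0.89 },
    { "label": "verse",  "start_s":  8.2, "end_s": 24.5, "confidence": 0.82 },
    { "label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.91 },
    { "label": "verse",  "start_s": 40.1, "end_s": 56.3, "confidence": 0.78 }
  ]
}
"""

segments = json.loads(response_text)["structure"]


def is_contiguous(segments, tolerance_s=1e-6):
    """True if every segment starts where the previous one ends."""
    return all(
        abs(a["end_s"] - b["start_s"]) <= tolerance_s
        for a, b in zip(segments, segments[1:])
    )


def seconds_per_label(segments):
    """Total duration per label, rounded to 0.1 s to hide float noise."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["label"]] += seg["end_s"] - seg["start_s"]
    return {label: round(total, 1) for label, total in totals.items()}


print(is_contiguous(segments))
print(seconds_per_label(segments))
```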

Use cases

Confidence scores

Each segment carries a confidence value. This is the strength of the SSM cluster the segment belongs to — roughly, how tightly the segment’s chroma fingerprint agrees with other members of its cluster. High confidence (> 0.85) means the segment is a clean, repeating section. Low confidence (< 0.70) means the segment is unique, ambiguous, or transitional.

Always surface the confidence values to your users or downstream logic. A chorus with confidence 0.92 should be treated very differently from a “chorus” with confidence 0.58.
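One way to act on this downstream is to bucket segments by the two thresholds mentioned above. In this sketch the "medium" band for values between 0.70 and 0.85 is our own naming, and the sample segments are made up for illustration.

```python
def split_by_confidence(segments, high=0.85, low=0.70):
    """Group segments into high / medium / low confidence buckets."""
    buckets = {"high": [], "medium": [], "low": []}
    for seg in segments:
        confidence = seg["confidence"]
        if confidence > high:
            buckets["high"].append(seg)
        elif confidence < low:
            buckets["low"].append(seg)
        else:
            buckets["medium"].append(seg)
    return buckets


sample_segments = [
    {"label": "chorus", "start_s": 24.5, "end_s": 40.1, "confidence": 0.92},
    {"label": "verse", "start_s": 40.1, "end_s": 56.3, "confidence": 0.78},
    {"label": "chorus", "start_s": 56.3, "end_s": 70.0, "confidence": 0.58},
]

buckets = split_by_confidence(sample_segments)
for name, members in buckets.items():
    print(name, [(seg["label"], seg["confidence"]) for seg in members])
```

A tool might show high-confidence labels as is, mark medium ones as tentative, and fall back to a neutral name such as "section" for low ones.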

Known limitations

Cases where structure detection degrades gracefully or fails outright:

Labels optional, boundaries gold.
If you’re building a DJ or editing tool and the labels don’t fit your genre, ignore them and keep the boundary timestamps. The boundary detection is independent of the labeling heuristics, and it works across a much wider range of material.
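Dropping the labels is a short transformation. The sketch below assumes segments shaped like the example output; `boundary_times` and `nearest_boundary` are illustrative helper names, the second being the sort of lookup an editor might use to snap a cue point to a section edge.

```python
def boundary_times(segments):
    """Collect every distinct start/end timestamp, sorted, ignoring labels."""
    times = {seg["start_s"] for seg in segments} | {seg["end_s"] for seg in segments}
    return sorted(times)


def nearest_boundary(times, t):
    """Return the boundary timestamp closest to time t (in seconds)."""
    return min(times, key=lambda boundary: abs(boundary - t))


example_segments = [
    {"label": "intro", "start_s": 0.0, "end_s": 8.2},
    {"label": "verse", "start_s": 8.2, "end_s": 24.5},
    {"label": "chorus", "start_s": 24.5, "end_s": 40.1},
]

times = boundary_times(example_segments)
print(times)                          # prints [0.0, 8.2, 24.5, 40.1]
print(nearest_boundary(times, 23.0))  # prints 24.5
```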

Code example

cURL
curl https://api.tunelab.dev/v1/structure \
  -H "Authorization: Bearer tl_live_xxx" \
  -H "Content-Type: audio/mpeg" \
  --data-binary @track.mp3
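The same request can be made from Python with only the standard library. This is a sketch that mirrors the cURL call above (same URL, headers, and raw MP3 body). It assumes the response has the shape shown in the example output, and the function names are our own.

```python
import json
import urllib.request

API_URL = "https://api.tunelab.dev/v1/structure"


def build_request(audio_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build a POST request carrying the raw audio bytes, like curl --data-binary."""
    return urllib.request.Request(
        API_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/mpeg",
        },
        method="POST",
    )


def fetch_structure(path: str, api_key: str) -> list:
    """Upload an MP3 file and return the list under the "structure" key."""
    with open(path, "rb") as audio_file:
        request = build_request(audio_file.read(), api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["structure"]


# Example usage (performs a real network call, so it is left commented out):
# for segment in fetch_structure("track.mp3", "tl_live_xxx"):
#     print(segment["label"], segment["start_s"], segment["end_s"])
```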

Further reading