SoundScape-Bench — "can a model describe everything in an audio clip?"

Comparing the LAION Universal Audio Annotation pipeline (two ASR back-ends) against omni LLMs (Gemini & GPT-Audio), on 200 held-out soundscapes. Beginner-friendly explanations below.

1. What is being tested (in plain words)

We play the model a soundscape — a short clip where several things happen at once: people talking (often in different languages, sometimes over each other), background music, sound effects (a door, a dog, rain), and little vocal noises (a laugh, a sigh). The model must write down everything it hears as a list, and for each item say when it starts, when it ends, what type it is, and a description — for speech also the exact words, which speaker, the language, and the emotion/how it's said.

Because we built every clip by gluing together pieces we already understand, we know the perfect answer (the "answer key"). So grading can be 100% automatic.

2. How we grade (the metrics)

For each real event we ask two questions and multiply the answers (so you need both right): did the timing line up? × did the description match?

★ Headline score — "Reward" (the ranking metric). For every event in the answer key we take its best-matching guess of the same type and reward it for getting both the timing and the content right, then average over all events (a missed event scores 0). The content part depends on the event type:
  • Speech has two things to get right — the words and how they're said — so:
    reward = IoU × ( ½·cos(emotion/style) + ½·(1 − WER) )
    i.e. a weighted sum of IoU×cos and IoU×(1−WER) — so the transcription accuracy (1 − Word Error Rate) counts for ASR/speech segments, alongside the "how-it's-said" caption match.
  • Sound effects, music, vocal bursts (no transcription): reward = IoU × cos(caption).
Building blocks (also shown as their own columns):
  • IoU ∈ [0,1] = time-overlap ÷ time-union of the two events (1 = identical timing).
  • cos ∈ [0,1] = cosine similarity of the two captions in an embedding space (close meaning → 1).
  • 1−WER = word accuracy (1 minus Word Error Rate; per-character for Chinese).
Multiplying by IoU means a guess only scores if it lands at the right time AND describes the right thing. The plain IoU×cos column ignores WER (so you can see content-match without transcription); the IoU column is timing alone.
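The Reward formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's actual scoring code: the function names `iou`, `event_reward` and `clip_reward` are hypothetical, and clipping word accuracy at 0 when WER exceeds 1 is an assumption the text does not state.

```python
def iou(a_start, a_end, b_start, b_end):
    """Temporal intersection-over-union of two intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def event_reward(timing_iou, caption_cos, wer=None):
    """Reward for one answer-key event and its best same-type guess.

    Speech (wer given): IoU * (0.5 * cos + 0.5 * (1 - WER)).
    Other event types:  IoU * cos.
    """
    if wer is None:
        return timing_iou * caption_cos
    # Assumption: a WER above 1 is treated as zero word accuracy.
    word_acc = max(0.0, 1.0 - wer)
    return timing_iou * (0.5 * caption_cos + 0.5 * word_acc)


def clip_reward(per_event_rewards, n_true_events):
    """Average over all answer-key events; a missed event contributes 0."""
    if n_true_events == 0:
        return 0.0
    return sum(per_event_rewards) / n_true_events
```

For instance, a speech guess with IoU 0.5, caption cosine 0.8 and WER 0.4 gets 0.5 × (0.4 + 0.3) = 0.35, while a sound effect with the same IoU and cosine gets 0.5 × 0.8 = 0.4. If those are the only two caught events in a clip with four real events, the clip average is (0.35 + 0.4) / 4.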

The other columns break the Reward down so you can see where a model wins or loses:

  • IoU (timing): "Intersection over Union": of the time the real event and the guess cover between them, what fraction do they agree on? 1 = perfect overlap, 0 = no overlap.
  • meaning: How close the descriptions are. For speech: half is word-accuracy (1 − Word Error Rate) and half is how well the emotion/"how" caption matches (sentence-embedding cosine). For sounds/music/bursts: caption cosine similarity.
  • speaker: For speech/bursts, did it attach the line to the right person? (Speaker labels are matched up first, since names are arbitrary.)
  • F1: The stricter all-or-nothing grade (the ranking metric is Reward). Combines recall ("of all real events, how many did we catch & describe well?") and precision ("of all our guesses, how many were real?") into one number. Low if you miss things OR invent things.
  • WER: Word Error Rate on the transcription; lower is better (how wrong the words are; Chinese is scored per-character).
  • how (cos): Cosine similarity (0–1) between the model's emotion/style caption and the ground-truth one: "did it get how it was said?"
  • snd (cos): Caption cosine for sound-effect & music events: "did it describe the sounds/music right?"
  • spkAcc: Fraction of matched speech events attached to the correct speaker (diarization-with-identity).
  • halluc: Hallucination rate: fraction of the model's guesses that matched no real event (made-up).
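
To make the WER entry concrete, here is a small self-contained sketch of word-level and character-level error rate. The benchmark itself computes WER with jiwer; the hand-written edit distance below is a stand-in for illustration only, and the names `edit_distance` and `error_rate`, as well as the handling of an empty reference, are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # reference token deleted
                cur[j - 1] + 1,          # extra token inserted
                prev[j - 1] + (r != h),  # substitution (or free match)
            ))
        prev = cur
    return prev[-1]


def error_rate(reference, hypothesis, per_character=False):
    """Edits divided by reference length.

    per_character=True scores character by character (whitespace
    ignored), as the benchmark describes for Chinese.
    """
    if per_character:
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    else:
        ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Assumption: an empty reference scores 0 only if the guess is empty too.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

With the reference "the door is open" and the guess "the door was open", one of four words is wrong, so the error rate is 0.25 and the word accuracy used in the Reward is 0.75.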

Embedding model for all caption similarities: google/embeddinggemma-300m. WER via jiwer. Matching uses the Hungarian algorithm per event type (whitepaper recipe: event score = IoU × meaning × speaker → per-clip F1).
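
The matching and per-clip F1 step can be sketched as follows, for the events of a single type. Several things here are assumptions rather than the documented recipe: the score matrix is taken as already computed (IoU × meaning × speaker), precision and recall are "soft" sums of matched scores, and a guess counts as hallucinated when it ends up unmatched. The brute-force search over permutations is a plainly swapped-in stand-in for the Hungarian algorithm, which gives the same optimum far faster (SciPy's `linear_sum_assignment` is the usual choice); the names `best_assignment` and `clip_metrics` are hypothetical.

```python
from itertools import permutations


def best_assignment(score):
    """score[i][j] = score of answer-key event i paired with guess j.

    Returns the one-to-one pairing with the highest total score,
    keeping only pairs whose score is above zero.
    """
    n_true = len(score)
    n_pred = len(score[0]) if n_true else 0
    if n_true <= n_pred:
        candidates = (list(zip(range(n_true), p))
                      for p in permutations(range(n_pred), n_true))
    else:
        candidates = (list(zip(p, range(n_pred)))
                      for p in permutations(range(n_true), n_pred))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(score[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return [(i, j) for i, j in best_pairs if score[i][j] > 0.0]


def clip_metrics(score, n_pred):
    """Soft precision/recall/F1 and hallucination rate for one clip."""
    n_true = len(score)
    pairs = best_assignment(score) if n_true and n_pred else []
    matched = sum(score[i][j] for i, j in pairs)
    recall = matched / n_true if n_true else 0.0
    precision = matched / n_pred if n_pred else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    halluc = (n_pred - len(pairs)) / n_pred if n_pred else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "halluc": halluc}


# Two real events, three guesses; the third guess is left unmatched.
example = clip_metrics([[0.6, 0.0, 0.1],
                        [0.0, 0.5, 0.0]], n_pred=3)
```

In the example, the best pairing is event 0 with guess 0 and event 1 with guess 1, so one of the three guesses is unmatched and counts toward the hallucination rate under the assumption above.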

3. Systems compared

4. Results (200-clip subset)

rank | system | Reward | IoU×cos | IoU | F1 | recall | WER | how (cos) | snd (cos) | spkAcc | halluc | #pred / #true
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | Gemini 3.1 Pro (omni) | 0.297 | 0.303 | 0.615 | 0.270 | 0.293 | 72% | 0.257 | 0.385 | 99% | 23% | 5.2 / 4.0
2 | Gemini 3.5 Flash (omni) | 0.256 | 0.263 | 0.556 | 0.233 | 0.253 | 67% | 0.264 | 0.310 | 98% | 23% | 5.2 / 4.0
3 | Gemma-4-12B TEXT-only fusion (EXP2 + DiCoW experts, NO audio), replaces MOSS-8B (experimental: text-only 12B LLM fusion, no audio) | 0.253 | 0.262 | 0.515 | 0.149 | 0.238 | 59% | 0.313 | 0.273 | 82% | 43% | 9.2 / 4.0
4 | Gemma-4-12B TEXT-only fusion (EXP2 experts, NO audio), replaces MOSS-8B (experimental: text-only 12B LLM fusion, no audio) | 0.248 | 0.259 | 0.512 | 0.144 | 0.234 | 56% | 0.313 | 0.278 | 82% | 44% | 9.4 / 4.0
5 | Gemma-4-E4B TEXT-only fusion (EXP2 + DiCoW experts, NO audio), replaces MOSS-8B (experimental: text-only LLM fusion, no audio) | 0.244 | 0.253 | 0.490 | 0.151 | 0.231 | 59% | 0.319 | 0.269 | 86% | 44% | 8.5 / 4.0
6 | Gemma-4-E4B TEXT-only fusion (EXP2 experts, NO audio), replaces MOSS-8B (experimental: text-only LLM fusion, no audio) | 0.237 | 0.247 | 0.479 | 0.146 | 0.225 | 59% | 0.316 | 0.266 | 86% | 44% | 8.7 / 4.0
7 | UAAP EXP2: VibeVoice+Sortformer diarization + Nemotron 3.5 words + detailed SFX/music captions (→ MOSS-8B) (experimental configuration, not the standard pipeline) | 0.236 | 0.241 | 0.457 | 0.191 | 0.226 | 65% | 0.300 | 0.287 | 91% | 27% | 5.6 / 4.0
8 | Combo A: EXP2 + DiCoW overlap-aware ASR (diarization-conditioned Whisper) (experimental: EXP2 + new model) | 0.233 | 0.237 | 0.453 | 0.185 | 0.226 | 67% | 0.299 | 0.271 | 93% | 28% | 5.8 / 4.0
9 | Combo D: EXP2 + DiCoW + PretrainedSED + PANNs (full stack) (experimental: EXP2 + new models) | 0.229 | 0.233 | 0.442 | 0.174 | 0.223 | 67% | 0.296 | 0.249 | 96% | 30% | 6.4 / 4.0
10 | Combo C: EXP2 + PretrainedSED + PANNs music gate (experimental: EXP2 + new models) | 0.226 | 0.234 | 0.446 | 0.183 | 0.220 | 65% | 0.294 | 0.269 | 94% | 28% | 5.6 / 4.0
11 | Combo B: EXP2 + PretrainedSED strong-label sound events (experimental: EXP2 + new model) | 0.222 | 0.229 | 0.435 | 0.171 | 0.215 | 65% | 0.302 | 0.250 | 94% | 28% | 5.9 / 4.0
12 | Gemini 3 Flash (omni) | 0.212 | 0.217 | 0.450 | 0.172 | 0.209 | 66% | 0.255 | 0.262 | 98% | 33% | 6.6 / 4.0
13 | UAAP EXP3: pyannote diarization+overlap detection + Nemotron 3.5 (full + overlap-targeted re-ASR) + detailed SFX/music (→ MOSS-8B) (experimental configuration, not the standard pipeline) | 0.206 | 0.212 | 0.429 | 0.126 | 0.202 | 56% | 0.293 | 0.250 | 97% | 40% | 9.2 / 4.0
14 | UAAP pipeline: triple-ASR ensemble (VibeVoice + Parakeet + Qwen3 → MOSS-8B) | 0.196 | 0.200 | 0.388 | 0.145 | 0.189 | 66% | 0.296 | 0.226 | 93% | 32% | 6.5 / 4.0
15 | UAAP EXP1: VibeVoice diarization + Parakeet/Nemotron 3.5 wording + dual-caption SFX (→ MOSS-8B) (experimental configuration, not the standard pipeline) | 0.190 | 0.195 | 0.383 | 0.147 | 0.183 | 67% | 0.295 | 0.195 | 94% | 29% | 6.1 / 4.0
16 | UAAP pipeline: Sortformer + Nemotron 3.5 ASR (→ MOSS-8B) | 0.153 | 0.162 | 0.331 | 0.112 | 0.150 | 59% | 0.296 | 0.223 | 96% | 35% | 6.5 / 4.0
17 | GPT-Audio 1.5 (omni) | 0.097 | 0.104 | 0.223 | 0.097 | 0.106 | 61% | 0.251 | 0.152 | 98% | 36% | 4.5 / 4.0

Ranked by Reward (the headline score, explained above): weighted IoU×cos + IoU×(1−WER), so timing, description and transcription accuracy all count. F1 is the stricter all-or-nothing grade (it also punishes inventing events). WER lower = better. The benchmark is deliberately hard — overlapping multilingual speech over dense sound — so absolute numbers are low; the ranking and the per-dimension breakdown are what matter.

SoundScape-Bench · built from MLS / Emilia / AudioSet-grounded-captions / vocal-bursts / AI-music, answer keys in the UAAP schema. See PROTOCOL.md for how the data was made.