Comparing the LAION Universal Audio Annotation pipeline (two ASR back-ends) against omni LLMs (Gemini & GPT-Audio), on 200 held-out soundscapes. Beginner-friendly explanations below.
We play the model a soundscape — a short clip where several things happen at once: people talking (often in different languages, sometimes over each other), background music, sound effects (a door, a dog, rain), and little vocal noises (a laugh, a sigh). The model must write down everything it hears as a list, and for each item say when it starts, when it ends, what type it is, and a description — for speech also the exact words, which speaker, the language, and the emotion/how it's said.
Because we built every clip by gluing together pieces we already understand, we know the perfect answer (the "answer key"). So grading can be 100% automatic.
For each real event we ask two questions and multiply the answers (so you need both right): did the timing line up? × did the description match?
For speech: reward = IoU × ( ½·cos(emotion/style) + ½·(1 − WER) ), which is the equal-weight mean of IoU×cos and IoU×(1−WER), so the transcription accuracy (1 − Word Error Rate) counts for ASR/speech segments, alongside the "how-it's-said" caption match.
For sounds/music/bursts: reward = IoU × cos(caption).
The other columns break the Reward down so you can see where a model wins or loses:

| column | what it means |
|---|---|
| IoU (timing) | "Intersection over Union": of the time the real event and the guess cover between them, what fraction do they agree on? 1 = perfect overlap, 0 = no overlap. |
| meaning | How close the descriptions are. For speech: half is word-accuracy (1 − Word Error Rate) and half is how well the emotion/"how" caption matches (sentence-embedding cosine). For sounds/music/bursts: caption cosine similarity. |
| speaker | For speech/bursts, did it attach the line to the right person? (speaker labels are matched up first, since names are arbitrary). |
| F1 | The stricter, all-or-nothing grade. Combines recall ("of all real events, how many did we catch & describe well?") and precision ("of all our guesses, how many were real?") into one number. Low if you miss things OR invent things. |
| WER | Word Error Rate on the transcription — lower is better (how wrong the words are; Chinese is scored per-character). |
| how (cos) | Cosine similarity (0–1) between the model's emotion/style caption and the ground-truth one — "did it get how it was said?" |
| snd (cos) | Caption cosine for sound-effect & music events — "did it describe the sounds/music right?" |
| spkAcc | Fraction of matched speech events attached to the correct speaker (diarization-with-identity). |
| halluc | Hallucination rate: fraction of the model's guesses that matched no real event (made-up). |
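
The timing, WER and Reward definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the function names are hypothetical, the WER here is a plain token-level edit distance rather than a call into jiwer, and clamping the cosine at 0 and the WER at 1 is an assumption made so the reward stays between 0 and 1.

```python
def interval_iou(true_span, pred_span):
    """Temporal IoU of two (start, end) spans, both in seconds."""
    (ts, te), (ps, pe) = true_span, pred_span
    inter = max(0.0, min(te, pe) - max(ts, ps))
    union = (te - ts) + (pe - ps) - inter
    return inter / union if union > 0 else 0.0


def tokens(text, per_character=False):
    """Split into words, or into non-space characters (e.g. for Chinese)."""
    if per_character:
        return [c for c in text if not c.isspace()]
    return text.split()


def word_error_rate(reference, hypothesis, per_character=False):
    """Token-level Levenshtein distance divided by the reference length."""
    ref = tokens(reference, per_character)
    hyp = tokens(hypothesis, per_character)
    if not ref:
        # Assumption: an empty reference is only "correct" if nothing was transcribed.
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def event_reward(iou, caption_cos, wer=None):
    """Speech (wer given): IoU * (0.5*cos + 0.5*(1 - WER)); otherwise IoU * cos."""
    cos = max(0.0, caption_cos)        # assumption: negative cosines count as 0
    if wer is None:
        return iou * cos
    accuracy = 1.0 - min(wer, 1.0)     # assumption: WER above 100% counts as 100%
    return iou * (0.5 * cos + 0.5 * accuracy)


iou = interval_iou((2.0, 6.0), (3.0, 7.0))                # 3 s shared out of 5 s covered
wer = word_error_rate("the cat sat down", "the cat sat")  # one missing word out of four
print(round(iou, 3), round(wer, 3), round(event_reward(iou, 0.5, wer), 3))
```

In this toy case a guess that overlaps 60% in time, drops one word in four and gets a caption cosine of 0.5 earns a reward of about 0.375, which shows how quickly the multiplication pulls scores down.
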
Embedding model for all caption similarities: google/embeddinggemma-300m. WER via
jiwer. Matching uses the Hungarian algorithm per event type (whitepaper recipe:
event score = IoU × meaning × speaker → per-clip F1).
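
The matching and per-clip F1 step can be illustrated as follows. This is a hedged sketch under several assumptions: the one-to-one matching is found by exhaustive search over permutations, which returns the same optimal total as the Hungarian algorithm but is only practical for a handful of events (an assignment solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True` would be the usual choice); recall and precision are taken as "soft" sums of matched event scores; and a guess counts as hallucinated when its matched score is 0. The names `best_assignment` and `clip_metrics` are hypothetical, and the function would be called once per event type.

```python
from itertools import permutations


def best_assignment(score):
    """Return (true_idx, pred_idx) pairs that maximise the total event score.

    score[t][p] is the event score (IoU x meaning x speaker) of true event t
    against predicted event p. Exhaustive search, so keep the inputs small.
    """
    n_true = len(score)
    n_pred = len(score[0]) if score else 0
    if n_true == 0 or n_pred == 0:
        return []
    small, large = min(n_true, n_pred), max(n_true, n_pred)
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(large), small):
        if n_true <= n_pred:
            pairs = list(enumerate(perm))                 # each true event picks a guess
        else:
            pairs = [(t, p) for p, t in enumerate(perm)]  # each guess picks a true event
        total = sum(score[t][p] for t, p in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs


def clip_metrics(score, n_true, n_pred):
    """Soft recall, precision, F1 and hallucination rate for one clip and type."""
    matched = [(t, p) for t, p in best_assignment(score) if score[t][p] > 0]
    total = sum(score[t][p] for t, p in matched)
    recall = total / n_true if n_true else 0.0
    precision = total / n_pred if n_pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    halluc = (n_pred - len(matched)) / n_pred if n_pred else 0.0
    return {"recall": recall, "precision": precision, "f1": f1,
            "hallucination_rate": halluc}


# Two real events, three guesses: the third guess matches nothing.
scores = [[0.8, 0.0, 0.1],
          [0.2, 0.5, 0.0]]
print(clip_metrics(scores, n_true=2, n_pred=3))
```

Here the best pairing scores 0.8 + 0.5, so recall is 0.65 and F1 is about 0.52, while one of the three guesses (about 33%) is left unmatched and counts toward the hallucination rate.
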
… music type at all, so its music events score zero; EXP2 closes that gap.

| rank · system | Reward | IoU×cos | IoU | F1 | recall | WER | how (cos) | snd (cos) | spkAcc | halluc | #pred/#true |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Gemini 3.1 Pro (omni) | 0.297 | 0.303 | 0.615 | 0.270 | 0.293 | 72% | 0.257 | 0.385 | 99% | 23% | 5.2 / 4.0 |
| 2. Gemini 3.5 Flash (omni) | 0.256 | 0.263 | 0.556 | 0.233 | 0.253 | 67% | 0.264 | 0.310 | 98% | 23% | 5.2 / 4.0 |
| 3. Gemma-4-12B TEXT-only fusion (EXP2 + DiCoW experts, NO audio) — replaces MOSS-8B (experimental — text-only 12B LLM fusion, no audio) | 0.253 | 0.262 | 0.515 | 0.149 | 0.238 | 59% | 0.313 | 0.273 | 82% | 43% | 9.2 / 4.0 |
| 4. Gemma-4-12B TEXT-only fusion (EXP2 experts, NO audio) — replaces MOSS-8B (experimental — text-only 12B LLM fusion, no audio) | 0.248 | 0.259 | 0.512 | 0.144 | 0.234 | 56% | 0.313 | 0.278 | 82% | 44% | 9.4 / 4.0 |
| 5. Gemma-4-E4B TEXT-only fusion (EXP2 + DiCoW experts, NO audio) — replaces MOSS-8B (experimental — text-only LLM fusion, no audio) | 0.244 | 0.253 | 0.490 | 0.151 | 0.231 | 59% | 0.319 | 0.269 | 86% | 44% | 8.5 / 4.0 |
| 6. Gemma-4-E4B TEXT-only fusion (EXP2 experts, NO audio) — replaces MOSS-8B (experimental — text-only LLM fusion, no audio) | 0.237 | 0.247 | 0.479 | 0.146 | 0.225 | 59% | 0.316 | 0.266 | 86% | 44% | 8.7 / 4.0 |
| 7. UAAP EXP2 — VibeVoice+Sortformer diarization + Nemotron 3.5 words + detailed SFX/music captions (→ MOSS-8B) (experimental configuration — not the standard pipeline) | 0.236 | 0.241 | 0.457 | 0.191 | 0.226 | 65% | 0.300 | 0.287 | 91% | 27% | 5.6 / 4.0 |
| 8. Combo A — EXP2 + DiCoW overlap-aware ASR (diarization-conditioned Whisper) (experimental — EXP2 + new model) | 0.233 | 0.237 | 0.453 | 0.185 | 0.226 | 67% | 0.299 | 0.271 | 93% | 28% | 5.8 / 4.0 |
| 9. Combo D — EXP2 + DiCoW + PretrainedSED + PANNs (full stack) (experimental — EXP2 + new models) | 0.229 | 0.233 | 0.442 | 0.174 | 0.223 | 67% | 0.296 | 0.249 | 96% | 30% | 6.4 / 4.0 |
| 10. Combo C — EXP2 + PretrainedSED + PANNs music gate (experimental — EXP2 + new models) | 0.226 | 0.234 | 0.446 | 0.183 | 0.220 | 65% | 0.294 | 0.269 | 94% | 28% | 5.6 / 4.0 |
| 11. Combo B — EXP2 + PretrainedSED strong-label sound events (experimental — EXP2 + new model) | 0.222 | 0.229 | 0.435 | 0.171 | 0.215 | 65% | 0.302 | 0.250 | 94% | 28% | 5.9 / 4.0 |
| 12. Gemini 3 Flash (omni) | 0.212 | 0.217 | 0.450 | 0.172 | 0.209 | 66% | 0.255 | 0.262 | 98% | 33% | 6.6 / 4.0 |
| 13. UAAP EXP3 — pyannote diarization+overlap detection + Nemotron 3.5 (full + overlap-targeted re-ASR) + detailed SFX/music (→ MOSS-8B) (experimental configuration — not the standard pipeline) | 0.206 | 0.212 | 0.429 | 0.126 | 0.202 | 56% | 0.293 | 0.250 | 97% | 40% | 9.2 / 4.0 |
| 14. UAAP pipeline — triple-ASR ensemble (VibeVoice + Parakeet + Qwen3 → MOSS-8B) | 0.196 | 0.200 | 0.388 | 0.145 | 0.189 | 66% | 0.296 | 0.226 | 93% | 32% | 6.5 / 4.0 |
| 15. UAAP EXP1 — VibeVoice diarization + Parakeet/Nemotron 3.5 wording + dual-caption SFX (→ MOSS-8B) (experimental configuration — not the standard pipeline) | 0.190 | 0.195 | 0.383 | 0.147 | 0.183 | 67% | 0.295 | 0.195 | 94% | 29% | 6.1 / 4.0 |
| 16. UAAP pipeline — Sortformer + Nemotron 3.5 ASR (→ MOSS-8B) | 0.153 | 0.162 | 0.331 | 0.112 | 0.150 | 59% | 0.296 | 0.223 | 96% | 35% | 6.5 / 4.0 |
| 17. GPT-Audio 1.5 (omni) | 0.097 | 0.104 | 0.223 | 0.097 | 0.106 | 61% | 0.251 | 0.152 | 98% | 36% | 4.5 / 4.0 |
Ranked by Reward (the headline score, explained above): for speech, the equal-weight mean of IoU×cos and
IoU×(1−WER), so timing, description and transcription accuracy all count. F1 is the stricter all-or-nothing
grade (it also punishes inventing events). WER lower = better. The benchmark is deliberately hard
— overlapping multilingual speech over dense sound — so absolute numbers are low; the ranking and
the per-dimension breakdown are what matter.
SoundScape-Bench · built from MLS /
Emilia / AudioSet-grounded-captions / vocal-bursts / AI-music, answer keys in the UAAP schema. See
PROTOCOL.md for how the data was made.