SoundScape-Bench — "can a model describe everything in an audio clip?"

Comparing the LAION Universal Audio Annotation pipeline (two ASR back-ends) against omni LLMs (Gemini & GPT-Audio), on 200 held-out soundscapes. Beginner-friendly explanations below.

1. What is being tested (in plain words)

We play the model a soundscape — a short clip where several things happen at once: people talking (often in different languages, sometimes over each other), background music, sound effects (a door, a dog, rain), and little vocal noises (a laugh, a sigh). The model must write down everything it hears as a list, and for each item say when it starts, when it ends, what type it is, and a description — for speech also the exact words, which speaker, the language, and the emotion/how it's said.
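
As an illustration, one item of such a list could be held in a small record like the one below. This is only a sketch: the field names and the type labels are hypothetical and simply mirror the fields listed above; the actual UAAP schema is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One annotated item in a soundscape (hypothetical field names)."""
    start: float                      # seconds from the start of the clip
    end: float                        # seconds from the start of the clip
    type: str                         # e.g. "speech", "sfx", "music", "vocal_burst" (assumed labels)
    caption: str                      # description; for speech, the emotion / how-it's-said text
    transcript: Optional[str] = None  # exact words, speech only
    speaker: Optional[str] = None     # arbitrary speaker label, speech only
    language: Optional[str] = None    # language of the speech, speech only

    def __post_init__(self):
        # An event must cover a positive amount of time.
        if self.end <= self.start:
            raise ValueError("end must be greater than start")

    @property
    def duration(self) -> float:
        return self.end - self.start
```

A model's answer for one clip is then just a list of such records, and the answer key is another list in the same shape.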

Because we built every clip by gluing together pieces we already understand, we know the perfect answer (the "answer key"). So grading can be 100% automatic.

2. How we grade (the metrics)

For each real event we ask two questions and multiply the answers (so you need both right): did the timing line up? × did the description match?

★ Headline score — "Reward" (the ranking metric). For every event in the answer key we take its best-matching guess of the same type and reward it for getting both the timing and the content right, then average over all events (a missed event scores 0). The content part depends on the event type:

Speech has two things to get right — the words and how they're said — so:
reward = IoU × ( ½·cos(emotion/style) + ½·(1 − WER) )
i.e. a weighted sum of IoU×cos and IoU×(1−WER) — so the transcription accuracy (1 − Word Error Rate) counts for ASR/speech segments, alongside the "how-it's-said" caption match.
Sound effects, music, vocal bursts (no transcription): reward = IoU × cos(caption).
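
The two formulas above can be sketched in a few lines of Python. The function names and the "speech" label are illustrative, and clamping 1 − WER at 0 (WER can exceed 100% when a guess inserts many extra words) is an assumption, not something stated in this document.

```python
def event_reward(event_type, iou, cos, wer=None):
    """Reward for one answer-key event and its best-matching guess."""
    if event_type == "speech":
        # Assumption: word accuracy is clamped so it never goes negative.
        word_acc = max(0.0, 1.0 - wer)
        return iou * (0.5 * cos + 0.5 * word_acc)
    # Sound effects, music, vocal bursts: no transcription term.
    return iou * cos


def clip_reward(per_event_rewards):
    """Average over all answer-key events; a missed event must appear as 0.0."""
    if not per_event_rewards:
        return 0.0
    return sum(per_event_rewards) / len(per_event_rewards)
```

For example, a speech guess with IoU 0.8, caption cosine 0.5 and WER 25% earns 0.8 × (0.25 + 0.375) = 0.5, while the same timing and caption match on a sound effect would earn 0.8 × 0.5 = 0.4.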

Building blocks (also shown as their own columns):
IoU ∈ [0,1] = time-overlap ÷ time-union of the two events (1 = identical timing).
cos ∈ [0,1] = cosine similarity of the two captions in an embedding space (close meaning → 1).
1−WER = word accuracy (1 minus Word Error Rate; per-character for Chinese).
Multiplying by IoU means a guess only scores if it lands at the right time AND describes the right thing. The plain IoU×cos column ignores WER (so you can see content-match without transcription); the IoU column is timing alone.
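
To make the three building blocks concrete, here is a dependency-free sketch. It is not the benchmark's code: the real scores use jiwer for WER and embedding vectors from the caption model, whereas the edit-distance routine and the plain-list vectors below are stand-ins. Whether negative cosines are clipped to 0 is not stated, so the raw value is returned.

```python
import math


def interval_iou(a_start, a_end, b_start, b_end):
    """Time-overlap divided by time-union of two events."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def error_rate(ref, hyp):
    """Levenshtein distance between token lists, divided by reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref) if ref else float(bool(hyp))


def wer(reference, hypothesis, per_char=False):
    """Word error rate; per_char=True scores character by character."""
    if per_char:
        return error_rate(list(reference.replace(" ", "")),
                          list(hypothesis.replace(" ", "")))
    return error_rate(reference.split(), hypothesis.split())
```

Two events covering 0–4 s and 2–6 s overlap for 2 s out of a 6 s union, so their IoU is 1/3; one wrong word out of three gives a WER of 1/3.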

The other columns break the Reward down so you can see where a model wins or loses:

IoU (timing)	"Intersection over Union": of the time the real event and the guess cover between them, what fraction do they agree on? 1 = perfect overlap, 0 = no overlap.
meaning	How close the descriptions are. For speech: half is word-accuracy (1 − Word Error Rate) and half is how well the emotion/"how" caption matches (sentence-embedding cosine). For sounds/music/bursts: caption cosine similarity.
speaker	For speech/bursts, did it attach the line to the right person? (speaker labels are matched up first, since names are arbitrary).
F1	The stricter companion grade (Reward, not F1, is the headline ranking score). Combines recall ("of all real events, how many did we catch & describe well?") and precision ("of all our guesses, how many were real?") into one number. Low if you miss things OR invent things.
WER	Word Error Rate on the transcription — lower is better (how wrong the words are; Chinese is scored per-character).
how (cos)	Cosine similarity (0–1) between the model's emotion/style caption and the ground-truth one — "did it get how it was said?"
snd (cos)	Caption cosine for sound-effect & music events — "did it describe the sounds/music right?"
spkAcc	Fraction of matched speech events attached to the correct speaker (diarization-with-identity).
halluc	Hallucination rate: fraction of the model's guesses that matched no real event (made-up).
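
The precision, recall, F1 and hallucination columns can be pictured with the sketch below. The exact aggregation is not spelled out in this document, so two things here are assumptions: that F1 is a "soft" F1 in which every matched pair contributes its event score (between 0 and 1) rather than a hard 0/1 hit, and that only pairs with a non-zero score count as matched.

```python
def clip_metrics(matched_scores, n_pred, n_true):
    """Soft precision / recall / F1 and hallucination rate for one clip.

    matched_scores: one score in [0, 1] per matched (real event, guess) pair.
    n_pred: number of events the model produced.
    n_true: number of events in the answer key.
    """
    total = sum(matched_scores)
    recall = total / n_true if n_true else 0.0
    precision = total / n_pred if n_pred else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Guesses left without any real event are counted as made-up.
    halluc = (n_pred - len(matched_scores)) / n_pred if n_pred else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "halluc": halluc}
```

With 3 real events, 4 guesses and two matched pairs scoring 0.5 and 0.25, recall is 0.25, precision is 0.1875 and half of the guesses count as hallucinated.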

Embedding model for all caption similarities: google/embeddinggemma-300m. WER via jiwer. Matching uses the Hungarian algorithm per event type (whitepaper recipe: event score = IoU × meaning × speaker → per-clip F1).
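
The per-type one-to-one matching can be illustrated as follows. Note the substitution: instead of the Hungarian algorithm this sketch tries every assignment by brute force, which finds the same optimal total on the handful of events a clip contains but would be far too slow on large inputs. The function names are illustrative, and score[i][j] is assumed to be a non-negative pair score (for example IoU × meaning × speaker) between real event i and guess j.

```python
from itertools import permutations


def best_assignment(score):
    """One-to-one pairs (true_index, pred_index) maximizing the total score."""
    n_true = len(score)
    n_pred = len(score[0]) if score else 0
    if n_true == 0 or n_pred == 0:
        return []
    if n_true <= n_pred:
        # Give every real event a distinct guess.
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_pred), n_true))
    else:
        # Fewer guesses than real events: give every guess a distinct real event.
        candidates = ([(i, j) for j, i in enumerate(perm)]
                      for perm in permutations(range(n_true), n_pred))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(score[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Pairs with zero score are treated as unmatched.
    return [(i, j) for i, j in best_pairs if score[i][j] > 0]


def match_per_type(true_types, pred_types, score):
    """Run the assignment separately inside each event type."""
    matches = []
    for t in sorted(set(true_types) & set(pred_types)):
        ti = [i for i, x in enumerate(true_types) if x == t]
        pj = [j for j, x in enumerate(pred_types) if x == t]
        sub = [[score[i][j] for j in pj] for i in ti]
        matches += [(ti[a], pj[b]) for a, b in best_assignment(sub)]
    return matches
```

Optimal matching matters because a greedy "take the best pair first" rule can strand the remaining events: with scores [[0.9, 0.8], [0.85, 0.1]] greedy pairing totals 1.0, while the optimal assignment totals 1.65.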

3. Systems compared

UAAP pipeline (triple-ASR ensemble) — the standard Universal Audio Annotation Pipeline: VibeVoice + Parakeet + Qwen3 ASR reconciled, Whisper expert voice tags, SFX-LoRA + vocal-burst detectors, all fused by MOSS-Audio-8B-Thinking into the final annotation.
UAAP pipeline (Sortformer + Nemotron 3.5) — the same pipeline with the three ASR models replaced by Sortformer diarization + Nemotron 3.5 ASR, everything else identical.
UAAP EXP1 (experimental) — a configuration test, not the standard pipeline. It reuses all the same intermediate results but changes how they are combined: timing & speaker diarization are taken from VibeVoice, the words are decided by a Parakeet + Nemotron 3.5 vote (agree → use verbatim; disagree → trust Nemotron 3.5), and every sound-effect window is given two captions (the SFX-LoRA caption plus one from the dedicated sound-effect captioner) which MOSS fuses into a more detailed sound description.
UAAP EXP2 (experimental) — a refinement of EXP1 (also reusing all intermediates). The noisy second SFX caption is removed; the words now come from Nemotron 3.5 alone (Sortformer-diarized, no Parakeet vote); VibeVoice + Sortformer supply diarization/timing; and MOSS is told to write more detailed sound-effect descriptions and to emit a dedicated music segment type with a rich description (genre, instruments, tempo, mood). The standard pipeline never emits a music type at all, so its music events score zero — EXP2 closes that gap.
UAAP EXP3 (experimental) — EXP2 with VibeVoice removed and pyannote (segmentation-3.0) added for speaker diarization and overlap detection. Where overlapping speech is detected, that region (plus surrounding speech) is cut out and re-transcribed by a second, targeted Nemotron 3.5 pass on top of the standard full-clip pass; both transcriptions are given to MOSS so it can recover words the full-clip pass garbles during overlaps. Tests whether targeted overlap re-ASR helps on the ~25% of clips with overlapping speech.
Omni LLMs — Gemini 3.1 Pro / 3.5 Flash / 3 Flash / 2.0 Flash, and GPT-Audio 1.5: the audio is sent directly with one standardized, detailed annotation prompt (identical for every model, with a worked JSON example) and the model returns the annotation in one shot.
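
The overlap step described for EXP3 above can be pictured with the sketch below: given diarization segments, find the stretches where two or more speakers are active, widen each stretch so it includes some surrounding speech, and merge stretches that touch. This is an illustration, not the pipeline's code. pyannote reports overlap directly; the sweep over segment boundaries, the fixed padding in seconds, and the assumption that one speaker's own segments never overlap each other are all choices made for this sketch.

```python
def overlap_regions(segments, pad=0.5, clip_len=None):
    """Padded, merged time regions where at least two speakers talk at once.

    segments: list of (start, end, speaker) tuples in seconds.
    pad: seconds of context added on each side (assumed value).
    clip_len: optional clip duration used to cap the region end.
    """
    # Sweep over segment boundaries; at equal times, ends sort before starts,
    # so two segments that merely touch do not count as overlapping.
    points = []
    for start, end, _speaker in segments:
        points.append((start, 1))
        points.append((end, -1))
    points.sort(key=lambda p: (p[0], p[1]))

    regions, active, open_at = [], 0, None
    for t, delta in points:
        active += delta
        if active >= 2 and open_at is None:
            open_at = t
        elif active < 2 and open_at is not None:
            if t > open_at:
                regions.append((open_at, t))
            open_at = None

    # Add context around each region and merge regions that now touch.
    merged = []
    for start, end in regions:
        start, end = max(0.0, start - pad), end + pad
        if clip_len is not None:
            end = min(end, clip_len)
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Each returned region would then be cut out of the clip and sent through the second, targeted ASR pass.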

4. Results (200-clip subset)

rank · system	Reward ▾	IoU×cos	IoU	F1	recall	WER	how (cos)	snd (cos)	spkAcc	halluc	#pred/#true
1. Gemini 3.1 Pro (omni)	0.297	0.303	0.615	0.270	0.293	72%	0.257	0.385	99%	23%	5.2 / 4.0
2. Gemini 3.5 Flash (omni)	0.256	0.263	0.556	0.233	0.253	67%	0.264	0.310	98%	23%	5.2 / 4.0
3. Gemma-4-12B + DiCoW — fixed prompt (ASR-anchored speech timestamps + empty-speech filter) (current default pipeline (fixed Gemma prompt))	0.255	0.264	0.516	0.147	0.243	61%	0.310	0.271	87%	45%	9.5 / 4.0
4. Gemma-4-12B TEXT-only fusion (EXP2 + DiCoW experts, NO audio) — replaces MOSS-8B (experimental — text-only 12B LLM fusion, no audio)	0.253	0.262	0.515	0.149	0.238	59%	0.313	0.273	82%	43%	9.2 / 4.0
5. Gemma-4-12B TEXT-only fusion (EXP2 experts, NO audio) — replaces MOSS-8B (experimental — text-only 12B LLM fusion, no audio)	0.248	0.259	0.512	0.144	0.234	56%	0.313	0.278	82%	44%	9.4 / 4.0
6. Gemma-4-E4B TEXT-only fusion (EXP2 + DiCoW experts, NO audio) — replaces MOSS-8B (experimental — text-only LLM fusion, no audio)	0.244	0.253	0.490	0.151	0.231	59%	0.319	0.269	86%	44%	8.5 / 4.0
7. Gemma-4-E4B TEXT-only fusion (EXP2 experts, NO audio) — replaces MOSS-8B (experimental — text-only LLM fusion, no audio)	0.237	0.247	0.479	0.146	0.225	59%	0.316	0.266	86%	44%	8.7 / 4.0
8. UAAP EXP2 — VibeVoice+Sortformer diarization + Nemotron 3.5 words + detailed SFX/music captions (→ MOSS-8B) (experimental configuration — not the standard pipeline)	0.236	0.241	0.457	0.191	0.226	65%	0.300	0.287	91%	27%	5.6 / 4.0
9. Combo A — EXP2 + DiCoW overlap-aware ASR (diarization-conditioned Whisper) (experimental — EXP2 + new model)	0.233	0.237	0.453	0.185	0.226	67%	0.299	0.271	93%	28%	5.8 / 4.0
10. Combo D — EXP2 + DiCoW + PretrainedSED + PANNs (full stack) (experimental — EXP2 + new models)	0.229	0.233	0.442	0.174	0.223	67%	0.296	0.249	96%	30%	6.4 / 4.0
11. Combo C — EXP2 + PretrainedSED + PANNs music gate (experimental — EXP2 + new models)	0.226	0.234	0.446	0.183	0.220	65%	0.294	0.269	94%	28%	5.6 / 4.0
12. Combo B — EXP2 + PretrainedSED strong-label sound events (experimental — EXP2 + new model)	0.222	0.229	0.435	0.171	0.215	65%	0.302	0.250	94%	28%	5.9 / 4.0
13. Gemini 3 Flash (omni)	0.212	0.217	0.450	0.172	0.209	66%	0.255	0.262	98%	33%	6.6 / 4.0
14. UAAP EXP3 — pyannote diarization+overlap detection + Nemotron 3.5 (full + overlap-targeted re-ASR) + detailed SFX/music (→ MOSS-8B) (experimental configuration — not the standard pipeline)	0.206	0.212	0.429	0.126	0.202	56%	0.293	0.250	97%	40%	9.2 / 4.0
15. UAAP pipeline — triple-ASR ensemble (VibeVoice + Parakeet + Qwen3 → MOSS-8B)	0.196	0.200	0.388	0.145	0.189	66%	0.296	0.226	93%	32%	6.5 / 4.0
16. UAAP EXP1 — VibeVoice diarization + Parakeet/Nemotron 3.5 wording + dual-caption SFX (→ MOSS-8B) (experimental configuration — not the standard pipeline)	0.190	0.195	0.383	0.147	0.183	67%	0.295	0.195	94%	29%	6.1 / 4.0
17. UAAP pipeline — Sortformer + Nemotron 3.5 ASR (→ MOSS-8B)	0.153	0.162	0.331	0.112	0.150	59%	0.296	0.223	96%	35%	6.5 / 4.0
18. GPT-Audio 1.5 (omni)	0.097	0.104	0.223	0.097	0.106	61%	0.251	0.152	98%	36%	4.5 / 4.0
19. fast_q6_vv	0.062	0.065	0.133	0.051	0.070	55%	0.309	0.271	86%	18%	3.2 / 4.0
20. fast_q6_novv	0.058	0.060	0.123	0.046	0.067	52%	0.302	0.278	95%	19%	3.3 / 4.0
21. fast_q8_novv	0.058	0.060	0.123	0.046	0.067	52%	0.299	0.293	95%	19%	3.4 / 4.0

Ranked by Reward (the headline score, explained above): for speech, an equal-weight sum of IoU×cos and IoU×(1−WER); for sound effects, music and vocal bursts, IoU×cos alone. Timing, description and transcription accuracy therefore all count. F1 is the stricter all-or-nothing grade (it also punishes inventing events). WER: lower is better. The benchmark is deliberately hard (overlapping multilingual speech over dense sound), so absolute numbers are low; the ranking and the per-dimension breakdown are what matter.

SoundScape-Bench · built from MLS / Emilia / AudioSet-grounded-captions / vocal-bursts / AI-music, answer keys in the UAAP schema. See PROTOCOL.md for how the data was made.