SoundScape-Bench — Gemma-12B + DiCoW predictions vs ground truth

20 samples. Each sample shows the audio clip, the exact answer key (ground truth), and the text-only Gemma-4-12B fusion, which receives no audio and instead fuses the Nemotron/VibeVoice/Sortformer/Whisper/DiCoW expert outputs. Clips marked "overlapping speech" showcase DiCoW separating simultaneous speakers.
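Each event entry in the listings packs a time range, an event type, and a description onto one line. The following is a minimal sketch of a parser for such entries, assuming the four event types that appear on this page (music, speech, sound_event, vocal_burst) and tolerating an optional " · " separator between fields. The names `Event`, `parse_event`, and `temporal_iou` are hypothetical and not part of any SoundScape-Bench tooling, and the temporal IoU shown is only one plausible way to compare a predicted span with a ground-truth span, not necessarily how the benchmark is scored.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Event types observed on this page; an assumption, not an official schema.
EVENT_TYPES = ("sound_event", "vocal_burst", "music", "speech")

# "<start>–<end>s", then the event type, then free text.
# Fields may be fused together or separated by " · ".
_EVENT_RE = re.compile(
    r"^(?P<start>\d+(?:\.\d+)?)[–-](?P<end>\d+(?:\.\d+)?)s"
    r"\s*(?:·\s*)?(?P<kind>" + "|".join(EVENT_TYPES) + r")"
    r"\s*(?:·\s*)?(?P<text>.*)$",
    re.DOTALL,
)


@dataclass
class Event:
    start: float  # seconds
    end: float    # seconds
    kind: str     # one of EVENT_TYPES
    text: str     # transcript or description


def parse_event(line: str) -> Optional[Event]:
    """Parse one event entry; return None for non-event lines (e.g. 'moderate')."""
    m = _EVENT_RE.match(line.strip())
    if m is None:
        return None
    return Event(float(m["start"]), float(m["end"]), m["kind"], m["text"])


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


gt = parse_event("0.9–5.6sspeechAnyone can pull a muscle")
pred = parse_event("1.2–7.1s · speech · Anyone can pull a muscle")
print(gt, pred, round(temporal_iou(gt, pred), 3))
```

Loudness lines such as `moderate` and speaker lines beginning with `spk` are not event entries, so `parse_event` returns `None` for them and a caller can attach them to the preceding event.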

clip_000 · 5 GT events · langs: en, zh · overlapping speech
Ground truth
0.0–24.6s · music · The track features distorted electric guitars playing a driving rhythm with a moderate tempo. Drums provide a basic rock beat. **Vocals**: - **Vocal Profile**: Male, slightly raspy voice. - **Timbre & Quality**: Clear, but slightly strained. - **Non-Lyrical Sounds**: None. - **Lyrics**: "Got a fuse that's burning close to the end...".
0.9–5.6s · speech · Anyone can pull a muscle because of overuse, muscle fatigue, or through a fall.
spk speaker_1 · en · A young adult male, likely in his 20s or 30s, delivers a speech exhibiting a complex emotional landscape. Initially, the voice conveys a slightly negative emotional state, characterized by moderate concentration, contemplation, and a hint of distress. The timbre is clear and slightly nasal, with a moderate tempo and stable mid-range pitch. The delivery is natural and spontaneous, with precise articulation and a neutral airflow. The voice possesses a near-neutral tension and a slight roughness, with a balanced resonance between the head and chest. The overall enjoyment is medium, reflecting clear audio but a somewhat monotonous delivery
4.1–9.1s · speech · 因为我礼拜天的时候已经去看过了。我是超人粉丝,我觉得超棒的。
spk speaker_2 · zh · A young adult male, likely in his 20s, delivers a speech exhibiting a blend of contentment, affection, and elation, transitioning to a more subdued and contemplative tone. The voice possesses a soft, smooth timbre, leaning towards a male baritone with a slightly soft and neutral quality, exhibiting a near-neutral-slightly-bright resonance. A subtle breathiness and a hint of nasality are present, contributing to a relaxed vocal production. The voice is chest-mixed, with a near-neutral heavy weight and mild wear, yet remains mostly natural and stable. Art
5.9–10.3s · vocal_burst · A solitary, deep, and prolonged vocal sound, perceived as a moan or groan, is heard. The sound is sustained and resonant, indicating a low-pitched vocalization. The sound suggests a vocal expression of discomfort, fatigue, or perhaps even pleasure, depending on context. The long duration implies a sustained feeling or reaction from the individual, possibly indicating release or relief.
spk speaker_1 ·
9.8–18.8s · sound_event · The audio contains a distinct "meow" sound. There is one instance of a clear, medium-pitched meow. The environment seems quiet otherwise.
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–1.2s · sound_event · The speech is recorded in a close-mic setting within a space that has some noticeable reverberation, giving it a slightly roomy quality. The background is very quiet, with only the room's inherent ambience audible. The timbre is clear but lacks significant low-end warmth. The spatial acoustics suggest a small, untreated room with short, audible reverb tails. There is no perceptible motion or complex editing. The recording is clean, with minimal noise or artifacts. The onset of speech is natural, and the segment ends clearly
moderate
1.2–7.1s · speech · Anyone can pull a muscle because of overuse, muscle fatigue,
spk speaker_0 · en · slightly annoyed and matter-of-fact
4.5–8.5s · music · Loud and energetic rock music: distorted electric guitars playing a driving, riff-based pattern, powerful and steady drum beat with punchy kick and sharp snare, gritty and saturated timbre, high-energy atmosphere
very_loud
7.1–9.8s · speech · 我禮拜天的时候已经去看过了我是超人粉丝我觉得超棒
spk speaker_0 · zh · exclamatory, surprised tone
8.5–24.5s · music · Loud, aggressive rock music: heavily distorted electric guitars, driving drum kit with strong kick and snare pattern, audible bassline, saturated and aggressive timbre, intense and energetic mood
very_loud
8.7–9.8s · sound_event · high-pitched, and sustained electronic tone, resembling a sine wave or a similar synthesized sound
moderate
clip_001 · 4 GT events · langs: fr
Ground truth
0.0–13.4s · music · The audio features a heavy, distorted guitar riff. A driving drum beat with a prominent kick and snare provides the rhythm. A male vocalist sings with a harsh, aggressive style. **Vocals**: - **Vocal Profile**: Male, likely adult, death metal vocalist. - **Timbre & Quality**: Harsh, guttural, and aggressive. - **Non-Lyrical Sounds**: None clearly audible beyond the vocal style. - **Lyrics**: "Lucas baby! Lucas baby!" Repeated with aggressive inflection.
0.5–8.5s · speech · euh, je, je sais pas trop, mais toujours est-il que le résultat, euh, est pas terrible, euh, dans le sens où deux LED au bout de trois semaines en mois à chaise.
spk speaker_1 · fr · A male speaker delivers a French monologue with a generally neutral and informative tone, exhibiting a slightly contemplative and hesitant delivery. The voice possesses a male baritone timbre, characterized by a slightly soft, dark, and somewhat breathy quality with a subtle nasality. The airflow is neutral, and the voice exhibits a near-neutral relaxed tension, occasionally displaying slight roughness. The resonance is chest-mixed, with a near-neutral heavy vocal weight and mild wear, maintaining a mostly natural and stable character. The speaker's pitch is moderate, and the tempo is slow and deliberate, contributing to a sense
2.4–12.0s · sound_event · The audio contains the sound of scraping or sanding on a surface. A rhythmic, repetitive action is heard. The sound is fairly close to the microphone.
moderate
3.3–8.3s · sound_event · The audio contains the sound of a door opening and closing. Sounds suggest a fairly solid door. After closing, no other discernible sounds can be heard.
quiet
Gemma-12B + DiCoW (text-only fusion)
0.0–13.4s · music · High-energy, fast-tempo electronic music: driving four-on-the-floor kick drum, pulsating synth bassline, and layers of bright, melodic synthesizers; heavily compressed, brickwall-limited, wide stereo panning, very loud.
very_loud
1.6–9.0s · speech · Je sais pas trop, mais toujours utile que le résultat soit terrible en train de délai depuis des trois millions à cesse
spk speaker_0 · fr · neutral with a slight hint of contemplation
8.5–9.5s · sound_event · Loud, jarring, and aggressive car horn abruptly sounding and dominating the soundscape
very_loud
13.0–13.4s · sound_event · Two sharp, distinct mechanical clicks, resembling a plastic latch or switch being operated
moderate
clip_002 · 5 GT events · langs: de, es, nl · overlapping speech
Ground truth
0.5–7.1s · speech · Das ist noch nicht so, dass ich da absolut und vollumfänglich jetzt irgendwie zufrieden bin.
spk speaker_1 · de · , a young adult male voice delivers speech in German with a neutral accent. The recording boasts high quality, captured in a quiet, studio-like environment with minimal background noise. Initially, the voice exhibits a slightly hesitant and contemplative tone, conveying mild doubt and a hint of embarrassment. The delivery is natural and spontaneous, characterized by a soft, slightly breathy timbre and a mid-range pitch. The tempo is slow and deliberate, with precise articulation. The overall enjoyment is medium, reflecting the clear audio but subdued emotional expression. Professionalism is high, attributed to the excellent recording quality and articulate speech.
9.6–20.1s · speech · si yo hubiera creído lo que me dijiste yo hubiera excusado esta pesadumbre pero ya está hecho paciencia y escarmentar para desde aquí en adelante
spk speaker_2 · es · This recording features a male speaker delivering a message in Spanish with a neutral accent. The audio quality is excellent, exhibiting no background noise and a clear, studio-like environment. The speaker's voice is a male baritone, described as slightly soft and neutral, with a near-neutral-slightly-bright timbre and a touch of breathiness and slight nasality. There's a subtle tension and roughness present, contributing to a natural, stable vocal quality. The resonance is chest-mixed, with a near-neutral heavy vocal weight and mild wear, but overall the voice sounds healthy and natural.
21.9–33.6s · speech · alle drommels riep van velzen van den overkant wie heeft jou zoo in t goud gezet in t goud sakkerloot wat n mooie ketting
spk speaker_3 · nl · A male speaker delivers a calm and informative message in Mandarin Chinese, exhibiting a neutral and slightly formal tone. The voice possesses a male baritone timbre, characterized by a slightly soft and neutral quality with a near-neutral-slightly-bright resonance. A subtle breathiness and a hint of nasality are present, contributing to a relaxed yet stable vocal delivery. The airflow is neutral, and the articulation is precise, indicative of a narration style. The speaking style is natural and spontaneous, with a moderate tempo and a generally even pitch. The voice exhibits a chest-mixed resonance, with a near-ne
27.2–37.0s · sound_event · The audio primarily features the distinct sounds of a large engine idling or running at low speed. There is a low, constant rumble, and other mechanical sounds are also present, perhaps from gears or other components. The general impression is that of an engine running, near a vehicle.
quiet
33.8–38.6s · vocal_burst · The audio features a series of high-pitched gurgling and cooing vocalizations, characteristic of an infant. These sounds are interspersed with distinct wet bubbling noises and a single, clear "smack" sound, indicative of a kiss. This soundscape suggests the presence of a baby, likely interacting playfully with an adult or making happy vocalizations. The gurgling and cooing, along with the kissing sound, indicate a tender and engaging moment.
spk speaker_3 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–55.6s · sound_event · Very faint, low-frequency electronic hum or room tone barely audible in the background
quiet
2.3–7.1s · speech · Das ist noch nicht so, dass sie da absolut und vollumfänglich jetzt irgendwie zufrieden
spk speaker_0 · de · neutral and informative
9.1–20.1s · speech · bin. Si yo hubiera creído lo que me dijiste, yo hubiera excusado esta pesadumbre, pero ya está hecho paciencia y escarmentar para desde aquí en adelante.
spk speaker_1 · es · neutral and informative
22.4–33.0s · speech · Alle dromos, riep Van Velsen van de overkant. Wie heeft jou zo in het goud gezet? Een rood houtje, zakker lood voor de mooie ketting.
spk speaker_2 · nl · neutral and informative
33.5–38.5s · speech · Peguera veintisiete veintisiete.
spk speaker_3 · es · rapid, insistent, and somewhat agitated
34.1–38.1s · sound_event · High-pitched electronic hum or buzz accompanying a female voice in a foreign language
moderate
38.2–38.3s · sound_event · Single, sharp, bright electronic click
loud
38.3–45.5s · vocal_burst · chuckle
spk speaker_3 · quiet, genuine-sounding male laughter
clip_003 · 6 GT events · langs: en
Ground truth
0.0–10.5s · music · The audio features a male vocalist with rock instrumentation. A driving drum beat, distorted guitars, and a bassline provide the foundation. The tempo is moderate, around 119 BPM. **Vocals**: - **Vocal Profile**: Adult male. - **Timbre & Quality**: Powerful and slightly raspy. - **Non-Lyrical Sounds**: None are clearly discernible. - **Lyrics**: "I'm an accident waiting to happen, yeah, dancing on the edge, without a care."
0.2–10.2s · sound_event · The prominent sound is a rhythmic, whirring noise. It has a distinct, mechanical tone with a moderate to high frequency. It has a rhythmic pulse like it is moving at a constant speed. There's also a faint low-frequency hum.
quiet
0.2–10.2s · sound_event · The predominant sound in the audio is the continuous rushing sound of running water. The sound has a consistent volume and a mid-range frequency, like water filling a sink.
quiet
1.2–10.2s · sound_event · The audio features a gurgling sound. The sound is wet, bubbling, and somewhat low in frequency. It is continuous throughout the duration of the file.
moderate
1.3–7.6s · speech · No, no, no, no, dude, dude. It's all good. Alright, we get it. You know, you, uh, you don't want to disappoint daddy.
spk speaker_1 · en · A young adult male, likely in his 20s or 30s, delivers a highly enjoyable and professional speech recording in English with a neutral American accent. The voice possesses a male baritone timbre, characterized by a slightly soft and neutral quality with a near-neutral-slightly-bright resonance. A subtle breathiness and a hint of nasality are present, accompanied by a touch of tension and slight roughness, creating a natural and stable vocal texture. The voice exhibits a chest-mixed resonance and a near-neutral heavy vocal weight, with mild wear suggesting a lived-in quality. The speaker's delivery
5.6–10.5s · vocal_burst · The audio features a distinct, forceful throat clearing sound, immediately followed by multiple instances of a short, sharp snorting noise. The snorting sounds are percussive and rapid, varying slightly in their intensity and duration. The sequence repeats with additional throat clearing and snorts. The sounds indicate a person clearing their throat, possibly due to congestion or irritation, followed by or accompanying a series of forceful snorting sounds. This could be a physiological response, a tic, or a deliberate action such as mimicking an animal sound.
spk speaker_1 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–10.1s · music · Upbeat, high-tempo pop-rock: driving drum beat with prominent reverberant snare, simple bassline, and electric guitars. Energetic live performance feel with a bright, slightly harsh quality in cymbals and guitars, recorded in a large reverberant hall.
moderate
0.0–10.5s · sound_event · Persistent hiss and room tone from a live recording or broadcast.
quiet
1.7–10.6s · speech · No, no, no dude it's all good. All right, we get it.
spk speaker_0 · en · relaxed and slightly amused
3.7–9.9s · sound_event · A distinct whoosh sound suggesting a transition or rapid movement of an object, such as a vehicle or projectile, following a male voice.
moderate
clip_004 · 5 GT events · langs: fr, zh
Ground truth
1.4–12.2s · speech · 就转移命令咗,即将接触加美拉斯军队嘅穿越号,结果穿越号就俾加美拉斯队落咗,令到岛大街,着咗个家人。就係咁。
spk speaker_1 · zh · A young adult male, likely in his 20s, delivers a clear and engaging speech in Korean with a neutral accent. The recording boasts high quality, captured in a quiet, studio-like environment with absolutely no background noise, resulting in moderately clear speech. Initially, the speaker exhibits strong elation and hope, accompanied by moderate interest and a hint of amusement. The voice possesses a bright, clear timbre, a fast tempo, and a high, dynamic pitch range, coupled with crisp articulation. The delivery is natural and spontaneous, conveying a sense of genuine enthusiasm. The vocal airflow is neutral, with a slight breath
11.0–20.0s · sound_event · This audio clip features a bright, upbeat dance track. There are distinct elements suggesting a brass section or synthesized brass. A female voice is singing in the style of funk or disco. The tempo is moderate to fast, driving the beat forward.
quiet
12.8–17.2s · vocal_burst · A series of distinct burping sounds is heard. Each burp is sudden and loud, originating from the throat, with a slightly gurgling quality preceding or following the expulsion of air. These are human bodily sounds, specifically burps, indicating the release of gas from the stomach.
spk speaker_1 ·
12.9–17.7s · speech · Qui était votre professeur ? Qui était votre professeur ?
spk speaker_2 · fr · A young adult male, likely in his 20s or 30s, delivers a French-language speech with a neutral accent. The recording boasts high quality, captured in a quiet, studio-like environment with minimal background noise. Initially, the voice exhibits a slightly subdued and neutral tone, conveying mild interest and concentration, tinged with emotional numbness. The delivery is slow and deliberate, characterized by a low-pitched, slightly rough timbre and a monotonous, almost robotic quality, suggesting a text-to-speech-like delivery. The overall enjoyment is low due to the unemotional delivery
18.1–22.8s · vocal_burst · A continuous, low-frequency, throbbing vibration sound is audible, characterized by a soft, rhythmic rumble with a gentle cadence. The sound is steady and consistent. This sound is highly characteristic of a cat purring, typically indicating contentment, relaxation, or affection from the animal.
spk speaker_2 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–11.5s · music · Intense cinematic orchestral music: driving strings and heavy percussion, high-stakes action-adventure atmosphere, building from moderately loud to very loud, dense and percussive texture.
very_loud
0.0–23.4s · sound_event · Quiet room-tone
quiet
1.7–11.4s · speech · 早期就刚没到现在问得一个去女口, 一国去女和不着, 被干没那些得。 <zh-CN> 林都都都大概, 这个就够盖
spk speaker_0 · zh · neutral and informative
11.0–22.2s · sound_event · Persistent, low-frequency, guttural snoring sound with a dark, menacing, and ominous texture
moderate
12.6–14.6s · speech · votre, qui était votre professeur.
spk speaker_0 · fr · slightly annoyed with a slightly raised tone
14.0–20.2s · speech · votre, qui était votre professeur.
spk speaker_1 · fr · neutral and stating a fact
19.9–22.7s · sound_event · High-pitched, metallic squealing sound, characteristic of a rusty hinge or mechanical friction, continuous and abrasive
moderate
20.1–20.2s · speech · <fr-FR>
spk speaker_0 · fr · neutral and informative
clip_005 · 7 GT events · langs: de, en, es · overlapping speech
Ground truth
0.0–58.3s · music · The track features a slow, sustained synthesizer pad and a bright, melodic synthesizer arpeggio. The overall sound is clean and spacious, with a sense of depth.
0.5–9.7s · speech · Zu Missverständnissen oder Konflikten führen, wenn jemand was nicht sieht, obwohl man das abgeschickt hat und dann muss immer umsichtig sein und nachsichtig sein.
spk speaker_1 · de · , a young adult female voice delivers a thoughtful and somewhat melancholic monologue. The speaker exhibits a blend of contemplation, mild sadness, and a hint of underlying distress, transitioning to a more neutral and slightly negative emotional state with a touch of interest. The overall tone is calm and measured, with a moderate tempo and precise articulation. The voice possesses a soft, slightly breathy timbre, with a near-neutral-slightly-bright quality and a subtle nasality. The pitch remains stable in a mid-range, and the vocal weight is neutral. The delivery is natural and spontaneous, with a mostly clear
6.0–16.1s · speech · como las aves que vuelan así amparará jehová de los ejércitos á jerusalem amparando librando pasando y salvando
spk speaker_2 · es · A clear, high-quality recording features a female adult voice speaking Spanish with a neutral accent. The voice exhibits a pleasant, natural, and spontaneous delivery, with no background noise. Initially, the speaker conveys a slightly positive and calm tone, expressing moderate interest and concentration, tinged with contemplation and a hint of vulnerability. The delivery is genuine, soft-spoken, and maintains a neutral pitch and volume, with a subtle sense of hope and optimism. The timbre is a warm, middle-aged female mezzo-soprano, slightly soft and neutral, with a near-neutral-slightly-bright
10.5–17.3s · speech · The nerves sense that something is there that is not supposed to be. They shoot the message to the brain.
spk speaker_3 · en · A young adult female voice delivers a clear, high-quality narration in English with a neutral American accent. The recording exhibits absolutely no background noise, contributing to exceptional speech clarity. Initially, the voice conveys a slightly hesitant yet genuine tone, marked by moderate concentration, a hint of doubt, and mild surprise. The delivery is soft and smooth, with a neutral pitch and volume, and a subtle, almost imperceptible hint of vulnerability. The speaking style is natural and spontaneous, with precise articulation and a balanced head-chest resonance. The timbre is a female mezzo, slightly soft and neutral, with a near-neutral
20.0–30.0s · sound_event · The audio contains sounds suggestive of a water vehicle. A motor can be heard running, with accompanying sounds of water splashing and waves lapping against a hull.
loud
28.3–38.2s · sound_event · The audio presents the sound of a large vehicle passing by. The recording begins with a constant, lower-frequency rumbling indicative of an engine with a higher-frequency whining noise as it passes by the recording device.
loud
45.8–55.4s · sound_event · The audio features a heavy engine sound, likely from a large vehicle. The engine is running, with distinct idling noises followed by a brief acceleration or increase in RPM. There is some background noise, possibly traffic or other machinery.
loud
Gemma-12B + DiCoW (text-only fusion)
0.0–15.5s · music · Atmospheric ambient music: clean electric guitar with extensive reverb playing a simple repeating arpeggiated melody, layered over a sustained low-frequency synth pad and a subtle rhythmic electronic pulse, creating a dark, contemplative, and tense mood.
moderate
0.0–58.3s · sound_event · Quiet room-tone / ambient background
quiet
1.1–9.1s · speech · Zu Missverständnissen unter Konflikten führen, wenn jemand, was nicht sieht, obwohl man das abgeschickt hat und dann muss immer offensichtig sein nach
spk speaker_0 · de · slightly frustrated while explaining something
11.4–12.5s · speech · Jerusalem,
spk speaker_1 · en · firm and assertive with a serious, somber tone
14.6–15.1s · speech · pasando y
spk speaker_1 · es · robotic and emotionless
15.2–58.3s · speech · salvando.
spk speaker_2 · es · calm and informative
24.5–33.5s · sound_event · Sound of a car moving on a wet surface, including the prominent whoosh of tires on rain or puddles, with a consistent low-level electronic hum.
moderate
33.5–41.5s · sound_event · A loud, deep, and powerful engine roar from a vehicle passing by close by, followed by the sound fading.
loud
38.5–46.5s · sound_event · Deep, powerful roar of a large vehicle engine passing by on a wet surface at close proximity, panning across the stereo field.
loud
46.5–54.5s · music · Instrumental atmospheric ambient music: clean electric guitar playing a melancholic, repeating melody with long reverb and delay, with a deep, sustained synth pad swelling underneath.
moderate
54.5–58.3s · music · Atmospheric music with a new layer of low, ominous synth drones building tension and increasing in volume and intensity, featuring reverberant electric guitar melody.
moderate
clip_006 · 3 GT events · langs: zh
Ground truth
0.0–27.4s · sound_event · The primary sound event is a clatter. It sounds like multiple small, rigid objects, possibly metal or glass, are colliding or falling onto a hard surface. The clatter is sharp and brief.
quiet
1.3–5.1s · speech · 所以,最后,我再做一个小节,就是,可德性解释法。
spk speaker_1 · zh · A young adult male, likely in his 20s, delivers a speech exhibiting a dynamic shift in emotional expression and speaking style. Initially, the voice conveys strong elation, hope, and optimism, interwoven with moderate interest and a hint of amusement. The delivery is natural, spontaneous, and fluent, characterized by a clear, bright timbre and a moderate tempo. The pitch range is relatively high, and articulation is precise, suggesting a well-controlled vocal technique. The airflow is neutral, and the loudness is normal, with slight dynamic variations. The voice possesses a male baritone quality, slightly soft and
5.8–14.8s · sound_event · A male speaker is talking at a conversational pace. The recording contains a slight echo. His voice is medium-pitched and relatively calm. He states "It can be a heck of a lot of fun. And who has more fun than us? For Andy Brickley."
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–0.0s · sound_event · Consistent low-level background hiss and room tone in a moderate-sized reverberant space
quiet
0.0–1.4s · sound_event · His delivery is paced steadily and clearly. The recording has a noticeable room tone, suggesting a moderate-sized, somewhat reverberant space. The spectral balance is focused on the midrange, typical for speech, with a clear and present quality. The timbre is natural and unprocessed. Spatially, the recording exhibits a distinct room reverberation with a noticeable decay tail on the speech, indicating a space with some reflective surfaces
moderate
1.4–5.3s · speech · 所以最后我再做一个小街, 就是可能性节司法
spk speaker_0 · zh · neutral and informative
5.7–10.8s · sound_event · Motorcycle engine starting with quick metallic cranking noises, settling into a low-idle rumble that fades in
moderate
6.9–8.4s · speech · it can be a heck of a lot of
spk speaker_1 · en · casual and slightly humorous
8.4–8.5s · speech · fun
spk speaker_0 · en · neutral
9.6–11.0s · speech · and who has more fun than
spk speaker_1 · en · casual and slightly humorous
10.8–17.8s · sound_event · Motorcycle engine: loud, throaty rev rising in pitch and volume (V-twin cruiser style), decelerating into a rough, chugging idle
moderate
11.1–11.2s · speech · us
spk speaker_0 · en · neutral
13.0–13.6s · speech · four Andy
spk speaker_1 · en · slightly urgent and commanding
13.7–14.2s · speech · Brickle
spk speaker_0 · en · neutral
14.6–14.7s · speech · now
spk speaker_1 · en · slightly urgent and commanding
15.1–15.2s · speech · come
spk speaker_0 · en · neutral
17.8–23.5s · sound_event · Motorcycle engine idling and then revving and receding in the background; distant, reverberant voice of a third male speaker
moderate
23.5–27.5s · sound_event · Motorcycle engine with loud, aggressive revs and a brief, high-pitched tire squeal as it accelerates away
loud
27.4–27.4s · sound_event · Room tone and background hiss
quiet
clip_007 · 4 GT events · langs: fr · overlapping speech
Ground truth
0.0–30.4s · music · The track opens with a slow, deliberate drum beat, reminiscent of Nordic folk music, at approximately 87 BPM. A female voice sings a melodic phrase, layered with ethereal harmonies. A deep, resonant bass tone underlies the vocals. **Vocals**: - **Vocal Profile**: Adult female. - **Timbre & Quality**: Clear, slightly breathy, and ethereal. The harmonies are layered and wide in the stereo field. - **Non-Lyrical Sounds**: None discernible. - **Lyrics**: "She walks on conquered ground, the earth beneath her feet..."
0.9–9.3s · speech · Donc il y a déjà un petit parallélisme qui s'établit entre samouraïs et mercenaires. Sauf que dans les Sept Samouraïs...
spk speaker_1 · fr · A young adult male, likely in his 20s or 30s, delivers a French monologue with a standard accent. The recording boasts high quality, captured in a quiet environment with minimal background noise, suggesting a professional setup. Initially, the speaker exhibits strong impatience, irritability, anger, and distress, characterized by a fast-paced, loud delivery and a high pitch. The timbre is somewhat harsh and strained, yet maintains a natural, spontaneous quality. The speech is clear, though the emotional intensity creates a medium level of enjoyment. The vocal delivery transitions to a slightly calmer and more neutral tone. The
9.4–13.8s · vocal_burst · Multiple rapid, distinct, and highly localized wet smacking sounds are present. The sounds are sharp, very brief, and occur in quick, continuous succession, implying repeated, light contact. These sounds are strongly suggestive of a series of quick, affectionate kisses or rapid pecks.
spk speaker_1 ·
15.4–25.4s · sound_event · The predominant sound is that of a crackling fire. There are also some lower-frequency rumbling or whooshing sounds, possibly indicative of the fire consuming fuel and the movement of hot air. No voices or other sounds are audible.
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–10.5s · music · Quiet, atmospheric indie-folk: distant male vocalist, gentle acoustic guitar, and subtle percussion, warm and somber mood.
quiet
0.0–9.6s · speech · Qui sera défendu par sept Pokéboys bienveillants. Donc il y a déjà un petit parallélisme qui s'est établi entre samouraï et mercenaire. Sauf que dans les sept samouraïs,
spk speaker_0 · fr · slightly annoyed and impatient
0.0–30.5s · sound_event · Quiet room-tone
quiet
5.0–10.5s · sound_event · Rhythmic, metallic clicking sound, steady tempo similar to a revolver's cylinder being spun
moderate
9.6–15.8s · music · Continuous, gentle indie-folk music: distant, reverberant male vocalist delivering poetic lyrics, sparse acoustic guitar, soft percussion, somber mood
quiet
10.1–13.5s · music · Electronic dance music with a strong beat and a female vocal sample
moderate
10.5–17.5s · sound_event · Sequence of sharp, distinct metallic clicks, consistent with a revolver's cylinder being spun, in a highly reverberant room
loud
15.8–19.9s · speech · Fly me in a hot soul and fun conquer ground.
spk speaker_0 · en · calmness
17.5–24.5s · sound_event · Loud, deep, continuous sound of a large vehicle (truck or bus) passing by with low-frequency rumble and mechanical noises
very_loud
19.9–25.9s · music · Quiet, atmospheric indie-folk music: distant, reverberant male vocalist, acoustic guitar, and subtle percussion
quiet
24.5–30.5s · music · Atmospheric indie-folk music: distant, reverberant male vocalist, acoustic guitar, and subtle percussion, swelling slightly in volume
moderate
25.9–30.4s · speech · Fly me in a hot soul and fun conquer ground.
spk speaker_0 · en · calmness
clip_008 · 7 GT events · langs: de, es · overlapping speech
Ground truth
0.0–43.6s · sound_event · The audio features the distinct sound of rolling train wheels on a track. The sound is a repetitive, rhythmic clatter, suggesting the movement of a rail car or train wagon. The rolling sound appears consistent in pace and intensity, indicating a stable speed.
quiet
1.2–6.3s · speech · Und darüber hinaus, das ganz große, äh, plus ist natürlich, äh,
spk speaker_1 · de · A medium-quality recording features a male speaker, likely middle-aged, delivering speech in English with a non-native accent. The overall tone is neutral, with a slightly hesitant and contemplative quality emerging throughout. The speaker exhibits a moderate degree of doubt and confusion, though the valence remains largely neutral. The voice possesses a male baritone timbre, described as slightly soft and dark, with a noticeable breathiness and a subtle nasality. A slight roughness and chest-mixed resonance are present, contributing to a generally relaxed yet somewhat worn vocal quality. The delivery is characterized by a moderate pace, with a
3.9–15.8s · speech · la embajada que le mandan los griegos a aquiles para que vuelva a ayudarlos en los combates porque desde que él no pelea están ganando los troyanos
spk speaker_2 · es · A male speaker delivers a speech in Spanish, exhibiting a calm and reflective tone initially, transitioning to a more contemplative and slightly melancholic state. The voice is a male baritone, likely belonging to a middle-aged adult, with a slightly soft and dark timbre, possessing a subtle breathiness and nasality. The vocal production is relaxed and stable, with a chest-mixed resonance and a mild wear suggesting natural use. Articulation is precise, and the airflow is neutral, contributing to a generally smooth delivery. The speaking style is natural and spontaneous, with a moderate tempo and a stable mid-range
5.1–14.7s · sound_event · The audio presents a series of distinct sounds. First, a distinct clicking sound, seemingly mechanical, followed by a brief vocalization that sounds like "pigeon" is heard. This is immediately followed by a cat meowing several times with a high-pitched tone. Finally, another distinct "meow" sound can be heard.
moderate
9.4–18.4s · sound_event · The audio features several distinct sounds. First, there is a series of repetitive clicking noises, possibly mechanical. Subsequently, there is a loud, percussive impact noise reminiscent of a gunshot. After that, sounds akin to a rapid gunfire is hear.
loud
15.9–20.3s · vocal_burst · A low-pitched, prolonged vocalization resembling a deep groan or moan is heard. The sound is characterized by a strained, drawn-out quality, conveying a sense of discomfort or distress. The pitch subtly wavers throughout its duration. This vocalization strongly suggests a person experiencing physical pain, significant discomfort, or deep emotional distress. The prolonged and strained nature of the sound indicates an intense feeling of suffering or anguish.
spk speaker_2 ·
19.8–28.8ssound_eventThe audio is dominated by sounds relating to motor vehicle movement. A pronounced engine revving is heard, quickly building in intensity, followed by what sounds like a rapid increase in speed accompanied by loud "vroom" sounds. All the vehicle engine noises are close mic'ed.
loud
Gemma-12B + DiCoW (text-only fusion)
0.0–4.5ssound_eventQuiet but persistent outdoor ambience characterized by a high-pitched buzzing sound, like cicadas or crickets, and a subtle low-frequency rumble suggesting distant traffic or machinery
quiet
1.4–4.5sspeechUnd darüber hinaus das ganz große, äh, äh, plus.
spk speaker_0 · de · somewhat neutral, with a slight hint of interest
4.5–15.6sspeechLa embajada quiere mandar a los griegos a Aquiles, para que vuelva a ayudarlos en los combates, porque ves que si él no pelea, están ganando los troyanos.
spk speaker_1 · es · neutral and is simply stating a fact
7.8–26.5ssound_eventFaint, steady room tone in the background; toward the end, a distant female voice begins to shout briefly
quiet
24.5–37.0ssound_eventSharp mechanical sounds including a metallic click and whirring noise, followed by sustained, loud, rhythmic high-pressure spray like an industrial cleaner; persistent low-frequency mechanical hum and broad, airy hiss
loud
33.8–44.0ssound_eventContinuous, deep mechanical rumble mixed with a higher-pitched metallic whirring, with distinct sharp metallic clanks and a constant hiss-like broadband noise
moderate
43.6–43.6ssound_eventQuiet room-tone
quiet
clip_009 · 3 GT events · langs: en
Ground truth
0.0–39.1smusicThe clip features a female vocalist singing in Latvian. The instrumentation includes drums, bass, acoustic guitar, and potentially keyboards. The tempo is moderate. **Vocals**: - **Vocal Profile**: Female, possibly young adult. - **Timbre & Quality**: Clear, with a slight vibrato. - **Non-Lyrical Sounds**: None immediately discernible. - **Lyrics**: (In Latvian)
1.1–8.8sspeechMaybe I had begun to think that they were gone, destroyed, and I was safe. Now and then I would remember them.
spk speaker_1 · en · A high-quality recording features a female voice, likely in her middle age, speaking English with a neutral American accent. The audio exhibits absolutely no background noise, contributing to exceptional clarity. The speaker's timbre is a female mezzo, slightly soft and neutral, with a near-neutral brightness and a subtle breathiness and slight nasality. The voice is generally clear and healthy, with balanced head-chest resonance and a light vocal weight, sounding perfectly natural and stable. Initially, the voice conveys a slightly melancholic and contemplative tone, expressing moderate longing and sadness, tinged with a hint of
11.4–20.4ssound_eventThe primary sound event is a distinct series of hoofbeats. The sound is clear and rhythmic, suggesting a horse is walking or trotting. The recording has a dry acoustic, with little or no noticeable reverberation or echo.
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–39.1smusicLo-fi hip-hop track featuring a gentle electric piano melody, a simple drum machine beat, and a warm bassline; warm, mid-focused, and relaxed mood with a subtle vinyl crackle and tape saturation.
moderate
0.0–39.1ssound_eventSubtle vinyl crackle or tape saturation audible throughout the track
quiet
1.6–2.0sspeechMaybe I had
spk speaker_0 · en · slightly hesitant and reflective
2.0–2.4sspeechbegun to
spk speaker_1 · en · slightly sarcastic
2.5–5.0sspeechthink that they were gone destroyed and
spk speaker_0 · en · distressed and upset
5.0–5.5sspeechI will
spk speaker_1 · en · neutral
5.6–5.9sspeechsave
spk speaker_0 · en · distressed and upset
7.2–8.2sspeechnow and then I would
spk speaker_0 · en · slightly annoyed and impatient
8.3–8.5sspeechremember
spk speaker_1 · en · neutral
9.0–9.0sspeechthem
spk speaker_0 · en · slightly annoyed and impatient
11.5–12.5ssound_eventProminent, noisy vinyl scratch sound effect featuring rapid, backward-sounding manipulation of tape or vinyl
loud
12.5–12.8ssound_eventBrief, sharp burst of high-frequency static
loud
30.6–32.4sspeechsapel trok for sierā
spk speaker_0 · lv · gentle, expressive
33.7–38.8sspeechtik kajākstumus uzlāk toi lipisja lomas
spk speaker_0 · lv · gentle, expressive
clip_010 · 3 GT events · langs: zh
Ground truth
1.0–12.1sspeech所以这几个地方应该都可以去。是吗?诶,不是。剧情,是刚刚已经过过的剧情。
spk speaker_1 · zh · A low-quality recording features a female speaker, likely elderly, delivering speech in Mandarin Chinese with a neutral accent. The overall listening experience is unpleasant due to the poor audio quality and monotonous delivery. The speaker's voice exhibits a slightly soft, neutral timbre, leaning towards brightness with a subtle breathiness and a hint of nasality. There's a touch of tension and slight roughness, balanced between head and chest resonance, with a neutral weight and mild wear, overall sounding mostly natural and stable. The speaking style is slow and deliberate, with a narrow pitch range and a somewhat monotonous intonation, suggesting a natural
5.8–15.4ssound_eventThe audio features a distinct "click" sound, similar to a button click. The click is fairly sharp and isolated. There's only one click, and then the audio ends abruptly. The recording is clean with very little background noise.
quiet
13.2–19.2ssound_eventThe predominant sound is a passing vehicle, possibly a truck. The engine noise is loud and deep, with a prominent rumble. The sound quickly increases in volume, reaches a peak, and then fades into the distance. The vehicle appears to be accelerating rapidly as it passes.
loud
Gemma-12B + DiCoW (text-only fusion)
0.0–5.5ssound_eventVery quiet, high-frequency room tone in a live, slightly reverberant untreated room
quiet
0.0–5.5sspeech所以这几个地方应该都可以去
spk speaker_0 · zh · calm and somewhat detached tone
5.5–20.6ssound_eventLoud, guttural roar and deep rumble of a large vehicle engine (truck or bus) with significant dynamic variations, transitioning from an enclosed space to an open outdoor environment with a Doppler effect
very_loud
6.5–12.1sspeech对呀,已经不是了,剧情是刚已经过过的剧情。
spk speaker_0 · zh · calm and somewhat neutral tone
12.1–28.4ssound_eventContinuous loud, guttural roar and deep rumble of a large vehicle engine moving away in an open area with hard surfaces
very_loud
20.6–28.4ssound_eventVery quiet, high-frequency room tone
quiet
clip_011 · 6 GT events · langs: fr
Ground truth
0.0–13.7smusicThe clip begins with a slow, deliberate string arrangement, featuring violins, violas, and cellos playing a sustained, mournful melody. The dynamics are soft and subtle, creating a sense of space and atmosphere. The overall feeling is one of sadness and longing. There is a gradual crescendo throughout the clip.
0.5–4.2sspeechNon. La décision a été prise par la direction.
spk speaker_1 · fr · A male speaker delivers a formal and informative message in French, exhibiting a calm and neutral tone throughout. The voice is a male baritone, likely from a middle-aged adult, with a slightly soft and relaxed quality. The timbre is near-neutral, slightly bright, with a touch of breathiness and nasality, and a mild roughness contributing to a natural, stable sound. The vocal weight is near-neutral and heavy, showing mild wear but remaining mostly natural. The speaker maintains a moderate pace with precise articulation, demonstrating a controlled and deliberate delivery. Initially, the tone is serious and emotionally detached,
2.0–12.0ssound_eventA sound event characterized by a low-frequency rumble, followed by a swishing sound that increases in volume then fades away quickly. The sound event has a notable doppler effect.
loud
2.3–11.3ssound_eventThe primary sound event consists of several distinct footsteps, occurring in a regular rhythm. The steps sound like they are made on a hard surface. The rate of the footsteps is moderate, indicating a normal walking pace. There is some faint ambient noise in the background.
loud
2.8–12.4ssound_eventThe dominant sound in this audio is that of a vehicle engine running. The engine exhibits a rhythmic ticking or clicking sound in addition to the normal engine noise. No other prominent sounds are apparent.
quiet
4.6–9.0svocal_burstA series of wet, smacking sounds, combined with softer, prolonged sucking or chewing noises, are present in the audio. These sounds are repetitive and suggest the manipulation of a soft, moist substance in the mouth, accompanied by subtle mouth clicks. This audio likely depicts someone eating a soft, possibly chewy or sticky food item, or sucking on a sweet, such as a lollypop.
spk speaker_1 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–13.3smusicTense, dramatic orchestral music cue with prominent, rhythmic string ostinatos; driving, cinematic quality creating urgency and anticipation; mid-forward tonal balance, smooth layered texture, moderately reverberant.
moderate
2.2–4.2sspeechNon, la décision a été prise par la direction.
spk speaker_0 · fr · neutral and informative
4.3–8.8ssound_eventHeavy, granular object being dragged or scraped across a rough surface; rough, abrasive, gritty, shuffling quality with sharp, aggressive transients
loud
4.8–5.3sspeechNon, la décision a été prise par la direction.
spk speaker_0 · fr · calm and neutral
9.9–10.6sspeechNon, la décision a été prise par la direction.
spk speaker_0 · fr · neutral, conversational
13.3–13.7ssound_eventQuiet room-tone
quiet
clip_012 · 5 GT events · langs: de, es · overlapping speech
Ground truth
0.0–36.6smusicThe song opens with an upbeat rock drum beat and electric guitars. A male vocalist sings with a clear, slightly raspy voice. **Vocals**: - **Vocal Profile**: Adult male. - **Timbre & Quality**: Clear, slightly raspy, and energetic. - **Non-Lyrical Sounds**: None. - **Lyrics**: "You led the charge, you raised the bar, Michael Schumacher, the brightest star. Michael Schumacher, racing through the years, with passion and drive, facing all your fears. From Benettons blue and green to Ferraris red and white."
1.4–13.0sspeechIch weiß nicht, ob ich das Respekt nennen würde, aber so ein bisschen halt auch, es ist halt unsere Kultur und diese Dinge gehören auch so dazu. Oder würdet ihr sagen, ihr würdet diese ganzen Dinge am liebsten abschaffen und ändern?
spk speaker_1 · de · This recording features a young adult male speaker, likely in his 20s or 30s, delivering speech in German with a neutral accent. The audio quality is exceptionally high, captured in a quiet, studio-like environment with absolutely no background noise, resulting in clear and articulate speech. The speaker's voice is a male baritone, characterized by a slightly soft and neutral timbre with a near-neutral-slightly-bright quality. There's a subtle breathiness and a hint of nasality, accompanied by a touch of tension and slight roughness, creating a natural, yet somewhat worn vocal texture. The resonance is chest-
7.7–16.8ssound_eventThe audio consists of a loud, low-frequency rumbling sound. The sound appears to be continuous and includes a slight variation in intensity, as if coming closer and moving away.
loud
8.1–19.9sspeechrecitó con tan apasionado acento algunos versos en los que se mostraba un corazón deshecho o un alma destrozada por el infortunio
spk speaker_2 · es · This recording features a clearly feminine adult voice, likely speaking English with a standard American accent, delivered in a high-quality, studio-like environment with absolutely no background noise. The speaker exhibits a dynamic and expressive vocal performance, transitioning from a state of elation and hope in the first half to a more profound sense of sadness, distress, and helplessness in the second. Initially, the voice conveys strong elation, hope, and a hint of triumph, characterized by a bright, smooth timbre, a fast tempo, and a high, dynamic pitch range. The delivery is natural and spontaneous, with precise articulation and a neutral
13.4–18.0svocal_burstThe audio features a series of distinct, wet, smacking sounds, characteristic of repeated kissing. These brief, percussive noises are immediately followed by a singular, prolonged, and deep exhalation or sighing sound. This sequence of sounds suggests intimate physical interaction involving kissing, possibly accompanied by a feeling of contentment, relaxation, or even mild amusement, indicated by the subsequent sigh.
spk speaker_1 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–16.4smusicEnergetic pop-rock track: prominent acoustic drum beat with driving kick and snare, simple bassline, and clean electric guitar playing rhythmic chords, moderately loud, instrumental underscore.
moderate
0.0–36.6ssound_eventQuiet room-tone / studio background ambience
quiet
1.8–4.2sspeechIch weiß nicht, ob ich das Respekt nennen würde, aber so ein bisschen
spk speaker_0 · de · slightly hesitant and unsure
4.4–6.2sspeechhalt auch. Es ist halt unsere
spk speaker_0 · de · neutral and informative
6.7–12.9sspeechKultur und diese Dinge gehören auch so dazu oder würde sie sagen, wir würde diese ganzen Dinge am nächsten erschaffen und
spk speaker_0 · de · somewhat excited and informative
12.8–20.2sspeechLos que se mostraban, corazón deshecho, un alma destrozada por él.
spk speaker_0 · es · confident, declarative
16.5–24.5smusicOngoing pop-rock track: driving drum rhythm, bass, and electric guitar, transitioning into a vocal showcase.
moderate
24.5–36.7smusicPop-rock music track: steady acoustic drum beat, foundational bassline, layered electric guitars. Features energetic and melodic singing in Spanish (mid-high tenor) followed by English (bright, slightly nasal tenor).
moderate
25.0–29.8sspeechcon tan apasionado acento algunos versos en los que se mostraba un corazón deshecho con un alma destrozada por el incortunio
spk speaker_0 · es · slightly aggressive
33.5–36.6sspeechMichael Schumacher, the brightest star.
spk speaker_0 · en · dramatic and intense
clip_013 · 5 GT events · langs: en
Ground truth
0.0–51.6smusicThe track opens with a distorted electric guitar riff. A driving rock drum beat enters, with a heavy emphasis on the snare drum. The vocals are powerful and slightly raspy. **Vocals**: - **Vocal Profile**: Male, likely young adult. - **Timbre & Quality**: Raspy, powerful, and slightly strained. - **Non-Lyrical Sounds**: None discernible. - **Lyrics**: "Got a fuse that's burning close to the end... I'm a reckless lover, a dangerous friend..."
0.4–4.6sspeechWhat are you talking about? No Mr. Hackett, no screaming kids.
spk speaker_1 · en · A young adult male, likely in his 20s, delivers a highly expressive and dynamic speech performance in English with a neutral American accent. The recording boasts excellent quality, captured in a quiet environment with minimal background noise, suggesting a professional setup. Initially, the speaker conveys strong interest and moderate astonishment, tinged with confusion. His voice possesses a clear, slightly bright timbre, residing in a medium-to-high pitch range, and exhibits a natural, spontaneous conversational style. The delivery is precise, with normal loudness and dynamic variations. The emotional landscape shifts significantly in the second half of the
4.9–9.2svocal_burstA sequence of several short, sharp, and distinct sniffing sounds is heard. Each sniff is brief and abrupt, indicating a rapid intake of air through the nostrils. The repeated sharp sniffing sounds suggest a person attempting to clear their nasal passages, possibly due to congestion or irritation, or indicating a state of being on the verge of crying.
spk speaker_1 ·
28.0–37.0ssound_eventThe audio consists primarily of rhythmic sounds, indicating physical movement. Repeated footstep-like sounds suggest someone is walking or running. The sound has a consistent, moderately fast tempo.
loud
40.0–49.0ssound_eventThe audio features sounds associated with road vehicles. There's the distinct sound of car tire rolling on a paved road surface. This is accompanied by the general rumble associated with motor vehicle movement. Additional road noises are mixed in the background.
quiet
Gemma-12B + DiCoW (text-only fusion)
0.0–5.5smusicPop-rock song with a driving drum beat and electric guitars, bright and slightly distorted character, establishing a rhythmic foundation.
moderate
1.3–2.2sspeechWhat are you talking
spk speaker_0 · en · neutral and slightly curious
2.2–4.5sspeechabout? No, Mister Hackett, no screaming kids
spk speaker_1 · en · casual and slightly playful
3.5–29.4smusicHigh-energy, anthemic pop-rock song featuring a driving drum machine beat, prominent bassline, and layered distorted electric guitars with a male vocalist.
moderate
4.8–5.2svocal_burstsneeze
spk speaker_2 · neutral and conversational
5.4–6.4sspeech
spk speaker_0 · en · energetic and enthusiastic
6.7–8.1sspeech
spk speaker_2 · unclear · angry and agitated
8.2–8.9svocal_burstscream
spk speaker_0 · extreme fear or pain
9.0–13.5sspeech
spk speaker_2 · en · slightly aggressive
12.7–13.3sspeecha warning
spk speaker_2 · en · neutral
13.5–14.1sspeechpain and
spk speaker_0 · en · melancholic
14.2–15.2sspeechred, but I keep
spk speaker_2 · en · melancholic
16.6–17.0sspeech
spk speaker_0 · es · slightly annoyed and impatient
17.0–21.8sspeech
spk speaker_2 · en · melancholic
19.2–21.8sspeechreckless speed every thrills a warning pain
spk speaker_2 · en · melancholic
21.8–22.8sspeechand red. <en-US>
spk speaker_0 · en · melancholic
22.8–24.6sspeech
spk speaker_2 · en · aggressive and boastful
29.4–37.0ssound_eventHis speech is punctuated by several sharp, rapid percussive sounds, consistent with a ball being kicked hard against a surface. In the background throughout this period, an anthemic pop-rock song continues to play, featuring a driving drum beat, electric guitars, and a male vocalist. The tonal balance is dominated by the mid-range of the speech, with the sharp, transient attacks of the kicks in the high-mid frequencies. The spatial acoustics of the speech are dry and close. The music is compressed underneath, slightly masked by the foreground actions
moderate
37.0–51.0ssound_eventA continuous, high-energy pop-rock song with an anthemic feel plays. The arrangement consists of a steady drum machine beat, a rhythmic bassline, and distorted electric guitars providing a powerful chordal backing. distinct foreground events at the very start of the segment. Universal Technical Audio Analysis: The music maintains a moderately loud and consistent dynamic level. The spectral balance is bright and full, with prominent mids and highs from the guitar and drums. The mix has a polished, compressed quality with moderate stereo width and subtle reverb on the vocals. The foreground speech and kicks create a temporary layering, but the musical foundation remains clear
moderate
51.0–51.6ssound_eventnear silence with faint room tone
quiet
clip_015 · 8 GT events · langs: de, fr · overlapping speech
Ground truth
0.0–19.8smusicThe clip begins with a slow, descending synth melody, creating a melancholic and atmospheric feel. The melody is heavily reverbed. A simple, repetitive drum machine pattern enters, consisting of a kick drum on beats 1 and 3, and a snare on beats 2 and 4, giving a slow tempo, around 60 BPM. The overall sound is lo-fi. **Vocals**: - **Vocal Profile**: Female. The language is not English. - **Timbre & Quality**: The vocals are airy and somewhat distant, with a slight echo. - **Non-Lyrical Sounds**: None - **Lyrics**: Indiscernible due to language and effects.
0.1–10.1ssound_eventThe dominant sound is wind. There is a rushing or whooshing sound, varying in intensity. There are some rattling sounds in addition to the wind, which may signify wind interacting with objects.
moderate
0.3–9.8sspeechJ'arrive sur mon AD66. Et là, évidemment, je n'ai plus le droit de sortir du réseau avec un préfixe, celui-là.
spk speaker_1 · fr · iding a detailed audio analysis, the recording features a male speaker delivering a French monologue with a standard accent. The audio quality is excellent, exhibiting no background noise and a clear, resonant timbre. The speaker's voice is a male baritone, described as slightly soft and neutral, with a near-neutral-slightly-bright quality. There's a subtle breathiness and nasality, accompanied by a touch of tension and roughness, creating a natural, stable vocal texture. The resonance is chest-m However, with a near-neutral heavy vocal weight and mild wear, suggesting a mature voice.
2.7–11.7ssound_eventThe audio mainly consists of bird sounds. There are multiple calls, chirps, and songs from different types of birds. The soundscape sounds quite clear and natural, probably recorded in an open outdoor area.
loud
6.6–18.2sspeechLeute, die vielleicht neu eingezogen sind, äh, die man überhaupt nicht kennt und dann schreibt man irgendwie ein Beschwerdezettel, weil das Kind ja immer so laut schreit, anstatt irgendwie mal miteinander zu reden. Und ich glaube, in Deutschland würde sich, äh,
spk speaker_2 · de · This recording features a young adult female speaker delivering speech in German with a neutral accent. The audio quality is excellent, captured in a quiet environment with no discernible background noise, resulting in clear and articulate speech. The speaker's voice possesses a female mezzo timbre, slightly soft and neutral with a near-neutral-slightly-bright quality, exhibiting a touch of breathiness and mild nasality. There's a subtle tension and roughness in the voice, with a center-mixed resonance and neutral vocal weight. The voice is mostly natural and stable, with mild wear suggesting a mature yet healthy vocal quality.
8.3–17.9ssound_eventThe primary sound event is of someone running. Consistent rhythmic footstep sounds are heard on what sounds like pavement. The pace is steady and fairly fast.
loud
10.2–14.7svocal_burstSeveral instances of clear, high-pitched laughter are present, interspersed with distinct giggling sounds. The vocalizations are joyful and originate from a person with a higher-pitched voice, perceived as female, exhibiting cheerfulness. The prominent laughter and giggling suggest a state of intense amusement, joy, or lighthearted fun, indicating the speaker finds something highly entertaining and is expressing elation.
spk speaker_1 ·
15.3–19.7svocal_burstA perceived male individual produces a distinct throat-clearing sound, characterized by a deep, guttural expulsion of air. This is immediately followed by a brief, dry cough. The sounds indicate the speaker is clearing their throat, likely due to irritation, dryness, or in preparation for speaking, followed by a minor cough.
spk speaker_2 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–19.5smusicUpbeat electronic music with a driving rhythm and synth melodies, bright tonal balance, moderate volume, stereo presence.
moderate
0.0–19.5ssound_eventLoud and prominent sound of strong wind or air rushing, creating a diffuse ambient texture with some low-frequency rumble.
loud
0.7–1.3sspeechJ'arrive
spk speaker_0 · fr · neutral, simply stating a fact
2.4–3.4sspeechsur mon an X
spk speaker_0 · fr · neutral and informative
4.6–6.5sspeechet là, évidemment, je n'ai plus le droit de
spk speaker_0 · fr · neutral and informative
6.8–7.0sspeechsortir
spk speaker_0 · fr · neutral, simply stating a fact
9.0–9.0sspeechavec
spk speaker_0 · fr · neutral, simply stating a fact
11.2–18.0sspeechses scherzet, weil das Kind ja immer so laut schreibt, anstatt irgendwie mal miteinander zu reden. <de-DE> Und ich glaube in Deutschland würde sich
spk speaker_0 · de · slightly confused and hesitant tone, thinking out loud
18.5–19.0ssound_eventA single, sharp, and forceful cough from a male speaker, captured at medium proximity.
loud
clip_017 · 6 GT events · langs: fr, zh · overlapping speech
Ground truth
0.0–27.8smusicThe song begins with a simple piano melody. A gentle acoustic guitar joins in. The music is slow and melancholic, around 80 BPM. A female vocal, singing in Mandarin, is present. **Vocals**: - **Vocal Profile**: Young adult female. - **Timbre & Quality**: Soft, clear, and slightly melancholic. The vocals are relatively unprocessed. - **Non-Lyrical Sounds**: None obvious. - **Lyrics**: (Partial - in Mandarin) "... walking outside the window..."
1.0–5.2sspeech中华民国112年3月21日于台北。
spk speaker_1 · zh · A young adult male, likely in his 20s, delivers a highly expressive and engaging speech in Mandarin Chinese with a neutral accent. The recording boasts exceptional quality, captured in a quiet environment with minimal background noise, suggesting a professional setup. The speaker exhibits a bright, clear timbre, leaning towards a female mezzo-soprano quality, with a slightly soft and neutral overall tone, exhibiting a near-neutral brightness and a subtle breathiness. A slight nasal touch is present, contributing to a relaxed and natural vocal quality. The voice is mostly clear, with a head-mixed resonance, light vocal weight,
3.8–11.6sspeechFaut faire des réservations de jeu, mais vous les avez pas le jour J quand il sort. Donc, euh, ça c'est, ça c'est, et c'est génial. Ça c'est vraiment le top.
spk speaker_2 · fr · A male speaker delivers a French monologue with a generally positive and engaging tone. The voice is a male baritone, exhibiting a slightly soft and neutral timbre with a near-neutral-slightly-bright quality. A subtle breathiness and a hint of nasality are present, contributing to a relaxed yet focused delivery. The voice possesses a chest-mixed resonance, with a near-neutral heavy weight and a mild wear, maintaining a mostly natural and stable character. The speaker's speaking style is conversational and fluent, with precise articulation and a moderate tempo. The overall vocal delivery is dynamic, though
5.6–9.8svocal_burstThe audio consists of several brief, soft, and somewhat wet sounds of lips making contact. These sounds are quick and precise, resembling light, individual kisses rather than a continuous smacking noise. These sounds are indicative of light kissing or air kisses, possibly conveying affection, greeting, or a playful gesture. The discrete nature of each sound suggests individual, brief instances of lip contact.
spk speaker_1 ·
12.1–16.7svocal_burstA sequence of distinct, sharp smacking sounds is audible. Each sound is quick and impactful, possessing a clear percussive quality. They occur as individual events, separated by brief intervals. The audio strongly suggests the sound of a kiss or a series of quick kisses, characterized by a swift, smacking action. The brevity and sharp quality imply direct and momentary physical contact, often on the cheek or another person.
spk speaker_2 ·
16.3–25.9ssound_eventThe primary sound is repetitive and rhythmical thumping. It sounds like foot steps. There's also some light breathing.
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–7.5smusicAmbient music with a soft, sparse pad and a female singer delivering melodic phrases in Mandarin with a gentle, airy delivery, heavily processed with noticeable reverb and delay.
moderate
0.0–27.8ssound_eventQuiet room-tone
quiet
1.5–5.2sspeech中华民国一百一十二年三月二 font des réservations
spk speaker_1 · fr · neutral and informative
5.3–6.8sspeechde jeu, mais vous les avez pas le jour vi quand
spk speaker_2 · fr · neutral and informative
5.9–8.2ssound_eventMachinery and industrial sounds, possibly related to a vehicle or mechanical process.
moderate
6.9–8.3sspeechils sort que ça, c'est
spk speaker_0 · fr · neutral and simply stating something
8.8–9.1sspeechça,
spk speaker_2 · fr · neutral and conversational
9.0–11.0sspeechc'est c'est génial, ça
spk speaker_0 · fr · neutral tone
11.0–11.7sspeechc'est vraiment le top
spk speaker_2 · fr · neutral and simply stating a fact
13.8–15.7ssound_eventA low-frequency hum with a distinct, rhythmic pulsing sound, suggesting heavy machinery or a vehicle in motion.
moderate
14.0–19.5ssound_eventA very loud, dense, and sudden burst of crowd cheering and shouting, characteristic of a live event or video game audience.
very_loud
19.0–27.5smusicMelancholic, ethereal melody in Mandarin by a female singer with soft, breathy delivery, heavily layered with reverb and delay, over a very quiet, continuous music bed with a simple, sparse texture.
quiet
clip_018 · 5 GT events · langs: de, es · overlapping speech
Ground truth
0.0–42.8smusicThe song opens with a slow, steady, and slightly tribal-sounding drum beat around 87 BPM. A synthesized string pad creates a spacious, ethereal atmosphere. A female vocalist enters, singing in a clear, operatic style. **Vocals**: - **Vocal Profile**: Adult female with a soprano range. - **Timbre & Quality**: Clear, powerful, and ethereal, with a slight reverb effect. - **Non-Lyrical Sounds**: None discernible in this short clip. - **Lyrics**: "Ride on warrior queen with your banner high, show the world your strength, never let it die."
1.4–6.3sspeechJetzt zeigt sich, ob ich das Talent zur Graffiti-Künstlerin habe.
spk speaker_1 · de · A young adult female voice delivers a German monologue with a neutral accent, exhibiting a generally high level of professionalism and excellent recording quality. The speech is clear, precise, and articulate, with a moderate pitch and a stable, natural delivery. Initially, the voice conveys a hesitant yet genuine tone, marked by moderate concentration, a hint of doubt, and a touch of mild distress. The timbre is soft and slightly breathy, with a mid-range pitch and a slow, deliberate tempo. Subtle room reverberation is present, indicating a quiet environment. The overall enjoyment is medium, reflecting the clear audio but sub
4.5–15.9sspeechkassim trabajaba esa noche hasta las tres de la mañana y su mujer tenía luego nuevas chispas que ella consideraba un instante con los labios apretados
spk speaker_2 · es · A male speaker delivers a contemplative and wistful narrative in Spanish, exhibiting a generally calm and reflective demeanor. The voice possesses a male baritone timbre, characterized by a slightly soft and dark quality with a subtle breathiness and a hint of nasality. The overall vocal production is relaxed and stable, with a near-neutral heavy weight and a mild, natural wear suggesting a middle-aged adult. The speaker's delivery is slow and deliberate, with a moderate tempo and precise articulation, contributing to a sense of thoughtfulness. Initially, the tone conveys a moderate sense of contentment and a hint
16.7–21.6svocal_burstA series of distinct, wet, smacking sounds are present, accompanied by soft sucking noises. The sounds are consistent in their moist quality and occur in quick succession. These sounds are characteristic of kissing, specifically wet or passionate kisses. Alternatively, it could be someone consuming a soft and moist food item, like fruit, or making deliberate lip-smacking sounds.
spk speaker_2 ·
30.0–39.7ssound_eventThe sound features a single, loud honk, followed by a slightly lower-pitched honk. There is a brief pause between the sounds. Both sounds have a short duration and abruptly stops.
quiet
Gemma-12B + DiCoW (text-only fusion)
0.0–27.3s · music · Pop-rock track with a driving drum beat (kick and snare), prominent bassline, and electric guitars providing rhythmic and melodic elements; energetic but slightly reflective mood, moderate tempo, studio-quality production.
moderate
0.0–42.8s · sound_event · Quiet room-tone / background ambience
quiet
2.2–6.0s · speech · Jetzt zeigt sich, ob ich das Talent zum Geographie künste.
spk speaker_0 · de · neutral and simply stating something
9.2–15.3s · speech · Y su mujer tenía luego nuevas chispas que ella consideraba un instante con los labios apretados
spk speaker_0 · es · reflective and slightly nostalgic
17.0–21.4s · sound_event · Distinct sound of a cash register, indicating a transaction or sale
moderate
21.7–30.0s · speech · That the waves adore. Whispers of her name are found on every shore.
spk speaker_0 · en · slightly melancholic tone
27.3–35.5s · music · Instrumental pop-rock section; steady drum beat, melodic bassline, and layered electric guitars; energy builds progressively with a full and bright tonal balance.
loud
27.4–43.0s · speech · ofens arena.
spk speaker_0 · pt · slightly melancholic tone
35.5–42.5s · music · Instrumental pop-rock track with a prominent, high-pitched electronic sound effect; sharp, percussive, horn-like quality playing a rhythmic, repetitive melodic figure over a full band arrangement.
moderate
clip_020 · 5 GT events · langs: fr, zh · overlapping speech
Ground truth
0.7–6.9s · speech · 呃, 比马,死于安乐费布罗赛,产业成绩团的成员,都业,都死于嗨家肉。
spk speaker_1 · zh · A young adult female speaker delivers a clear, high-quality recording in Mandarin Chinese with a neutral accent. The initial portion of the audio showcases strong elation, moderate hope, and a hint of interest, conveyed through a bright, clear, and dynamic voice. The speaking style is fluent, casual, and spontaneous, with a fast tempo and a high, dynamic pitch range. Articulation is precise, and the airflow is normal, resulting in a balanced head-chest resonance. The timbre is a slightly soft, neutral, and near-neutral-slightly-bright female mezzo-soprano, with a
1.9–11.8s · sound_event · The dominant sound in the recording is white noise. It is a consistent, static hiss with no clear variations in volume or texture throughout the duration of the clip.
loud
3.1–11.2s · speech · Donc, vous importez ce petit script là, et puis si vous avez besoin de faire remonter la couche réseau suite à une attaque en Somewhere, vous exécutez, euh, ce script sur la machine concernée.
spk speaker_2 · fr · A male speaker delivers a French monologue with a generally neutral and informative tone, exhibiting a slightly formal delivery. The voice is a male baritone, likely belonging to a middle-aged adult, characterized by a slightly soft and bright timbre with a subtle breathiness and a hint of nasality. The voice possesses a chest-mixed resonance, a near-neutral heavy weight, and a mild wear, remaining mostly natural and stable. Articulation is precise, and the airflow is mostly neutral, with a slight dynamic quality. The speaking style is fluent and casual, with a moderate tempo and a generally consistent pitch
5.4–15.0s · sound_event · The audio contains vocalizations from a canine, specifically a series of barks. The barks vary in pitch and intensity. It sounds like a medium-sized dog.
quiet
11.9–16.1s · vocal_burst · The audio captures a single, distinct smacking sound, created by the rapid separation or coming together of lips. The sound is sharp, brief, and has a slight wet quality. This sound is consistent with a quick lip smack, often a gesture made during thought, anticipation, or as an involuntary habit.
spk speaker_2 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–0.8s · sound_event · Quiet room tone with low-level broadband hiss
quiet
0.8–3.1s · speech · oh yeah
spk speaker_0 · fr · neutral and conversational
3.1–8.4s · speech · donc vous importez ce petit script là et si vous avez besoin de faire remonter la touche preso suite à un attack on somewhere vous exécuter ce script sur la machine concernée
spk speaker_1 · fr · neutral and conversational
8.4–8.6s · sound_event · Faint, high-frequency clicks from a keyboard or mouse
quiet
8.6–9.7s · speech · oh yes yes yes
spk speaker_2 · en · slightly distressed
9.8–11.5s · speech · donc vous importez ce petit script là et si vous avez besoin de faire remonter la touche preso suite à un attack on somewhere vous exécuter ce script sur la machine concernée
spk speaker_1 · fr · neutral tone, formal setting
11.5–11.8s · sound_event · Quiet room tone
quiet
11.8–16.5s · speech · donc vous importez ce petit script là et si vous avez besoin de faire remonter la touche preso suite à un attack on somewhere vous exécuter ce script sur la machine concernée
spk speaker_1 · fr · neutral tone, formal setting
16.5–16.6s · sound_event · Distinct mechanical click
moderate
16.6–16.9s · vocal_burst · sigh
spk speaker_1 · gentle exhalation
16.9–16.9s · sound_event · Silence/Room tone
quiet
clip_027 · 5 GT events · langs: en, zh · overlapping speech
Ground truth
0.0–40.3s · sound_event · The primary sound is that of wind, consisting of continuous whooshing and gusting noises. There are varying degrees of intensity, with some moments of quieter breezes and louder blasts. No other distinct sounds are audible.
quiet
0.6–9.1s · speech · Uh, and making money, augmented reality is going to be the productivity tool that might replace the desktop in your office or something like that. So I think, uh,
spk speaker_1 · en · quality recording features a young adult male, likely in his 20s or 30s, speaking English with a neutral American accent. The audio exhibits excellent clarity with absolutely no background noise, contributing to a high-quality listening experience. The speaker's voice is a male baritone, characterized by a slightly soft, neutral, and near-neutral-slightly-bright timbre with a subtle breathiness and a hint of nasality. There's a slight tension and roughness present, with a chest-mixed resonance and a near-neutral heavy vocal weight, suggesting mild vocal wear but overall natural stability. Init
11.2–14.9s · speech · 还有六张RFID卡来完成这套系统。
spk speaker_2 · zh · A young adult male, likely in his 20s, delivers a speech in English with a neutral accent, exhibiting a shift in emotional state and vocal delivery over time. Initially, the voice conveys strong elation, moderate amusement, and contentment, characterized by a bright, clear timbre and a fast tempo. The pitch range is high and dynamic, with crisp articulation and a natural, spontaneous delivery. The voice possesses a slightly soft, neutral quality with a near-neutral-slightly-bright timbre, a touch of breathiness and nasality, and a relaxed vocal production. The resonance is chest-m
14.2–23.2s · sound_event · The sound of a motor vehicle passing by is prominent. The vehicle is likely traveling at a high rate of speed as the audio reflects Doppler shift, a clear change in pitch. There's tire noise against the road surface.
moderate
15.3–19.6s · vocal_burst · Multiple forceful coughs are audible in rapid succession. Each cough is distinct, sharp, and appears to be a strong expulsion of air from the lungs. This indicates that a person is coughing, likely due to a respiratory irritation, a cold, or an attempt to clear their throat. The repetitive and forceful nature suggests a persistent or significant cough.
spk speaker_2 ·
Gemma-12B + DiCoW (text-only fusion)
0.0–5.5s · speech · Uh, and making money, augmented reality is going to be the productivity tool that might replace the desktop in your office or something like that. So I think
spk speaker_0 · en · optimistic and enthusiastic
0.0–40.3s · sound_event · Quiet room-tone
quiet
4.5–5.5s · sound_event · Very loud and harsh engine rev, characterized by a rising high-frequency whine and distorted roar, quickly swells to dominate the audio and cuts off abruptly
very_loud
7.1–9.3s · speech · So I think, uh.
spk speaker_0 · en · optimistic and enthusiastic
10.9–14.9s · speech · 还有六张FID卡来完成这套系统。
spk speaker_0 · zh · neutral
16.2–26.8s · sound_event · Heavy vehicle, possibly a truck, starting up and driving away; powerful low-frequency engine rumble, mechanical whir, and tire noise with a sense of motion and departure
moderate
16.9–18.9s · vocal_burst · wail
spk speaker_0 · clearly expressing distress, pain, or a strong emotional reaction
25.9–28.0s · speech · You have a comment? Push it.
spk speaker_0 · en · neutral
29.2–31.0s · sound_event · Loud, abrupt sound of a vehicle's wheels grinding on gravel, followed by a short metallic clank
loud
31.0–35.6s · speech · You have a comment? Push it.
spk speaker_0 · en · neutral
clip_029 · 5 GT events · langs: de, es · overlapping speech
Ground truth
0.0–60.0s · music · The track opens with a heavy, distorted synth bass. A fast, complex drum and bass beat enters at approximately 140 BPM. The male vocals are fast-paced and energetic, with a rapping style. **Vocals**: - **Vocal Profile**: Adult male rapper. - **Timbre & Quality**: Energetic, slightly distorted, and aggressive. - **Non-Lyrical Sounds**: None discernible. - **Lyrics**: "Workhard Burkhard, in the fast lane... crunching numbers, feeling no pain..."
0.7–11.0s · speech · Ja, also auch grad sowas dann irgendwie, wir doch dann manchmal so viele Nachrichten auch bei einmal Memes kriegen. Also das sind ja jetzt nicht nur dann
spk speaker_1 · de · A medium-quality recording features a young adult female speaker delivering speech in German with a neutral accent. The audio exhibits absolutely no background noise, contributing to moderately clear speech. The speaker's voice possesses a female mezzo-soprano timbre, characterized by a slightly soft and neutral quality with a near-neutral, slightly bright resonance. A subtle breathiness and a hint of nasality are present, accompanied by a slight tension and roughness in the vocal texture. The voice is head-mixed, with a near-neutral light vocal weight and mild wear, yet remains mostly natural and stable.
4.7–14.8s · speech · á lo ménos de los bajeles turquescos este bajel que aquí veis reducido á pequeño porque lo pide así la pintura
spk speaker_2 · es · A male speaker delivers a Spanish monologue with a calm, informative, and initially neutral tone. The voice is a male baritone, exhibiting a slightly soft and neutral timbre with a near-neutral-slightly-bright quality. A subtle breathiness and slight nasality are present, contributing to a relaxed vocal production. The voice possesses a slight roughness and a chest-mixed resonance, with a near-neutral heavy vocal weight and mild wear, yet remains largely natural and stable. The speaking style is deliberate and precise, with clear articulation and a moderate tempo. The delivery is generally monotonous
15.2–19.9s · vocal_burst · A sequence of clear, distinct sniffing sounds is heard, alternating with brief, low-volume, percussive clicks. The sniffing is sharp and percussive, while the clicks are subtle. The sounds suggest a person is either sniffing, possibly to clear their nasal passages or to detect a scent, or engaging in a repetitive habit involving sharp nasal inhales and mouth or tongue clicks.
spk speaker_2 ·
33.4–43.4s · sound_event · The primary sound is a rushing, continuous flow of water. The sound quality suggests a large volume of water, perhaps a river or a significant waterfall. There are no other distinct sounds present.
moderate
Gemma-12B + DiCoW (text-only fusion)
0.0–8.2s · music · Electronic rap track featuring a male rapper's voice heavily processed with a vocoder effect, layered over a rhythmic beat with a prominent bassline and synthesizer elements; quiet compared to speech.
quiet
0.0–0.1s · sound_event · quiet room-tone
quiet
2.2–6.1s · speech · Ja also auch gerade sowas dann irgendwie wir doch dann manchmal so für Nachrichten, weil einmal
spk speaker_0 · de · slightly hesitant and unsure
4.7–5.0s · sound_event · brief, high-pitched electronic tone, possibly a notification or system alert
moderate
8.2–17.1s · music · Ongoing electronic rap track with vocoded male vocal line and a steady beat; warm with heavy vocal compression.
moderate
9.5–11.2s · speech · also das sind ja nicht nur dann
spk speaker_0 · de · slightly hesitant and unsure
16.2–16.5s · sound_event · low-pitched, distorted male voice speaking in Russian
moderate
16.5–19.8s · sound_event · high-pitched, sustained electronic tone
loud
17.1–27.5s · music · Electronic beat with bassline and synth pattern; dominant male rap vocal in a non-English language with a rhythmic, slightly aggressive flow.
loud
27.5–35.1s · music · Continuous rap track with a male vocalist; mix becomes dense and spectrally complex toward the end.
moderate
33.0–35.1s · sound_event · loud, high-frequency sound effect featuring a distinct 'spin-up' and swirling texture
very_loud
35.1–43.1s · sound_event · loud, continuous sound effect of a large vessel, like a ship, moving through water; deep, resonant low-frequency rumbles and powerful, churning mid-frequency textures
very_loud
43.1–59.6s · music · Electronic rap track with a male voice heavily processed with a vocoder; neutral, robotic, and rhythmic delivery over a simple repeating synthesizer melody and four-on-the-floor drum pattern.
loud
45.8–46.7s · speech · goals that attacked as well
spk speaker_1 · en · neutral
47.9–59.0s · speech · in his ears shows that you turn to his will workmans and funk from an indier riding high or card gas in still crushing goals manufacture as well
spk speaker_1 · en · neutral
59.6–60.0s · sound_event · quiet room-tone
quiet

Pipeline: LAION Universal Audio Annotation Pipeline · experimental text-only fusion backend.