Voice Activity Detection in Python — Detect Speech, Remove Silence, Split Audio by Pauses
Voice Activity Detection (VAD) is a foundational step in audio processing: finding where someone is speaking versus silence/noise in a recording. It's the first stage of almost every speech pipeline — drop silence to save compute, split long recordings into utterances on pauses, and preprocess audio for any downstream ASR (including feeding Whisper clean speech to cut cost and stop it from "hallucinating" text over silent stretches).
FunASR's fsmn-vad is an industrial-grade, lightweight, very fast open-source VAD model. Three lines of Python, and it returns the start/end milliseconds of every speech region. Everything below is real measured output.
VAD in three lines
pip install funasr

from funasr import AutoModel
model = AutoModel(model="fsmn-vad", disable_update=True)
segments = model.generate(input="audio.wav")[0]["value"]
print(segments)
# [[610, 5530]]
# milliseconds: speech from 0.61s to 5.53s; the leading 610ms of silence is excluded
The return value is a list of [start_ms, end_ms] intervals, one per continuous speech region. In this 5.55s clip there is a single region, and the 0.61s of leading silence falls outside it, so slicing the audio by the returned span drops that silence automatically.
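For logging or display it is often handier to have the spans in seconds. The following is a minimal sketch, not part of FunASR: `describe_segments` is a hypothetical helper name, and it only assumes the [start_ms, end_ms] list format described above.

```python
def describe_segments(segments_ms):
    """Convert [[start_ms, end_ms], ...] into dicts with times in seconds.

    Hypothetical helper (not a FunASR API); it relies only on the
    [start_ms, end_ms] format that the VAD call returns.
    """
    described = []
    for start_ms, end_ms in segments_ms:
        described.append({
            "start_s": start_ms / 1000,
            "end_s": end_ms / 1000,
            "duration_s": (end_ms - start_ms) / 1000,
        })
    return described


# The span from the example above, hard-coded so the sketch runs without a model.
for seg in describe_segments([[610, 5530]]):
    print(f"{seg['start_s']:.2f}s -> {seg['end_s']:.2f}s ({seg['duration_s']:.2f}s)")
```

In real use you would pass the `segments` variable from the VAD call instead of the hard-coded list.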
Real example: a 13-second clip split into 2 segments
Concatenating two utterances with a 1.5s pause between them gives a 13.31s recording (simulating a real clip with a pause). Actual VAD output:
| Segment | Span | Duration |
|---|---|---|
| seg0 | 0.61s → 5.45s | 4.84s |
| seg1 | 7.28s → 13.29s | 6.01s |
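VAD cleanly isolates the two speech regions and marks both the pause and the leading silence as non-speech. The gap it reports (5.45s to 7.28s, 1.83s) is the 1.5s inserted pause plus the non-speech at the edges of the two utterances. Speech is 10.85s, about 82% of the 13.31s clip, so about 18% is silence that can be dropped. Detection took just 0.12s on GPU (~110× real-time): VAD is extremely light and adds almost no overhead.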
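The percentages and the real-time factor can be recomputed from the spans. This is a small sketch under stated assumptions: `speech_stats` is a hypothetical helper, and the millisecond values are transcribed from the table above rather than produced by a live model call.

```python
def speech_stats(segments_ms, total_s, processing_s):
    """Summarize VAD output: speech seconds, speech share, real-time factor.

    Hypothetical helper for illustration; total_s is the clip length and
    processing_s is the measured VAD run time, both in seconds.
    """
    speech_s = sum(end - start for start, end in segments_ms) / 1000
    return {
        "speech_s": speech_s,
        "speech_ratio": speech_s / total_s,
        "realtime_factor": total_s / processing_s,
    }


# Spans transcribed from the table (seg0 and seg1), clip length and timing from the text.
stats = speech_stats([[610, 5450], [7280, 13290]], total_s=13.31, processing_s=0.12)
print(f"speech: {stats['speech_s']:.2f}s "
      f"({stats['speech_ratio']:.0%} of clip), "
      f"{stats['realtime_factor']:.0f}x real-time")
```

The same function works on any recording as long as you supply its total duration and your own timing measurement.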
Three practical use cases
1. Remove silence / keep speech only
import soundfile as sf
import numpy as np
audio, sr = sf.read("audio.wav")
segments = model.generate(input="audio.wav")[0]["value"]
# concatenate all speech regions, drop silence
speech = np.concatenate([audio[int(s/1000*sr):int(e/1000*sr)] for s, e in segments])
sf.write("speech_only.wav", speech, sr)
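Cutting exactly on the VAD boundaries can clip soft word onsets or endings, so some pipelines keep a little padding around each span. The sketch below is one possible way to do that, not something fsmn-vad provides: `to_sample_ranges` and its `pad_ms` parameter are hypothetical, and it assumes the spans are sorted by start time.

```python
def to_sample_ranges(segments_ms, sr, n_samples, pad_ms=0):
    """Map [start_ms, end_ms] spans to (start, end) sample indices.

    Hypothetical helper: pads each span by pad_ms on both sides, clamps to
    the audio bounds, and merges spans that overlap after padding.
    Assumes segments_ms is sorted by start time.
    """
    ranges = []
    for start_ms, end_ms in segments_ms:
        start = max(0, (start_ms - pad_ms) * sr // 1000)
        end = min(n_samples, (end_ms + pad_ms) * sr // 1000)
        if ranges and start < ranges[-1][1]:
            # Padding made this span overlap the previous one: merge them.
            ranges[-1] = (ranges[-1][0], max(end, ranges[-1][1]))
        elif end > start:
            ranges.append((start, end))
    return ranges


# 13.31s of 16 kHz audio, spans taken from the two-segment example, 100 ms padding.
print(to_sample_ranges([[610, 5450], [7280, 13290]], sr=16000, n_samples=212960, pad_ms=100))
```

With these ranges, the concatenation step becomes `np.concatenate([audio[a:b] for a, b in ranges])`. Note that `np.concatenate` raises on an empty list, so check that at least one range exists before calling it on recordings that may contain no speech.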
2. Split a long recording into files on pauses (for batch ASR)
for i, (s, e) in enumerate(segments):
    chunk = audio[int(s/1000*sr):int(e/1000*sr)]
    sf.write(f"chunk_{i}.wav", chunk, sr)
# chunk_0.wav, chunk_1.wav, ... each file is one independent utterance
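If a short breath or hesitation splits one sentence into several spans, you may prefer to split only on longer pauses. Here is a minimal sketch of that idea, assuming sorted [start_ms, end_ms] spans; `merge_short_pauses` and the 500 ms default are illustrative choices, not FunASR settings.

```python
def merge_short_pauses(segments_ms, min_pause_ms=500):
    """Merge neighbouring spans whose gap is shorter than min_pause_ms.

    Hypothetical helper; the threshold is an illustrative default.
    Assumes segments_ms is sorted by start time.
    """
    merged = []
    for start, end in segments_ms:
        if merged and start - merged[-1][1] < min_pause_ms:
            # Gap too short to count as a pause: extend the previous span.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


spans = [[610, 5450], [7280, 13290]]          # from the two-segment example
print(merge_short_pauses(spans, 500))          # 1.83s gap is kept as a split point
print(merge_short_pauses(spans, 2000))         # gap below threshold, spans are merged
```

Run the splitting loop over the merged list instead of the raw `segments` to get one file per longer pause.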
3. Preprocess for any ASR (including Whisper)
Running VAD before recognition has two concrete benefits: (1) you never send silence or noise into the model, which means less compute and lower cost; (2) Whisper tends to hallucinate repeated text over long silences, and trimming silence with VAD first measurably reduces those hallucinations. With FunASR's own ASR models (SenseVoice / Paraformer) you can wire VAD in with a single vad_model="fsmn-vad" argument for automatic segmentation.
from funasr import AutoModel
# attach VAD directly to ASR; long audio is auto-segmented
asr = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
result = asr.generate(input="long_audio.wav")
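When VAD is used as a standalone preprocessor (for example, feeding the speech-only file from use case 1 to Whisper or another ASR), any timestamps the recognizer returns refer to the trimmed audio, not the original recording. The sketch below shows one way to map them back. It assumes the trimmed audio was built by concatenating the VAD spans in order, and `to_original_ms` is a hypothetical helper name.

```python
def to_original_ms(t_ms, segments_ms):
    """Map a time in the silence-removed audio back to the original timeline.

    Hypothetical helper. Assumes the trimmed audio is the in-order
    concatenation of the [start_ms, end_ms] spans in segments_ms.
    """
    elapsed = 0  # milliseconds of speech covered by earlier spans
    for start, end in segments_ms:
        length = end - start
        if t_ms < elapsed + length:
            return start + (t_ms - elapsed)
        elapsed += length
    # Past the end of the trimmed audio: clamp to the end of the last span.
    return segments_ms[-1][1] if segments_ms else t_ms


spans = [[610, 5450], [7280, 13290]]   # from the two-segment example
print(to_original_ms(0, spans))        # start of trimmed audio
print(to_original_ms(5000, spans))     # 160 ms into the second span
```

Because the slicing code rounds millisecond boundaries to whole samples, the mapping can be off by a fraction of a millisecond per span, which is usually negligible for subtitles or alignment.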
Why fsmn-vad
| | fsmn-vad (FunASR) |
|---|---|
| Usage | 3 lines of Python, returns millisecond speech spans |
| Speed | measured 0.12s on 13s audio (~110× real-time) |
| Footprint | lightweight FSMN architecture, real-time even on CPU |
| Ecosystem | use standalone, or attach to FunASR ASR in one line for auto-segmentation |
| License | open-source, commercial-friendly |
If you already use FunASR for recognition, VAD is the same toolkit with zero extra dependencies; if you use Whisper or another ASR, fsmn-vad works as a standalone preprocessor. For full pipelines, see transcribing long audio and speaker diarization.
The whole FunASR stack is open-source and ready to use: fsmn-vad, industrial-grade ASR, punctuation, speaker, emotion and event detection, and LLM-ASR. If it helps, a GitHub star really supports the project.