Voice Activity Detection in Python — Detect Speech, Remove Silence, Split Audio by Pauses

Voice Activity Detection (VAD) is a foundational step in audio processing: finding where someone is speaking versus silence/noise in a recording. It's the first stage of almost every speech pipeline — drop silence to save compute, split long recordings into utterances on pauses, and preprocess audio for any downstream ASR (including feeding Whisper clean speech to cut cost and stop it from "hallucinating" text over silent stretches).

FunASR's fsmn-vad is an industrial-grade, lightweight, very fast open-source VAD model. Three lines of Python, and it returns the start/end milliseconds of every speech region. Everything below is real measured output.

VAD in three lines

pip install funasr

from funasr import AutoModel

model = AutoModel(model="fsmn-vad", disable_update=True)
segments = model.generate(input="audio.wav")[0]["value"]

print(segments)
# [[610, 5530]]   # milliseconds — speech from 0.61s to 5.53s; the leading 610ms of silence is dropped

The return value is a list of [start_ms, end_ms] intervals, one per continuous speech region. In this 5.55s clip, the 0.61s of leading silence was removed automatically.

Real example: a 13-second clip split into 2 segments

Concatenating two utterances with a 1.5s pause between them gives a 13.31s recording (simulating a real clip with a pause). Actual VAD output:

SegmentSpanDuration
seg00.61s → 5.45s4.84s
seg17.28s → 13.29s6.01s

VAD cleanly isolates the two speech regions and marks both the 1.5s pause (5.45s–7.28s) and the leading silence as non-speech. Speech is 82% of the 13.31s clip — 18% of silence removed. This took just 0.12s on GPU (~110× real-time): VAD is extremely light and adds almost no overhead.

Three practical use cases

1. Remove silence / keep speech only

import soundfile as sf
import numpy as np

audio, sr = sf.read("audio.wav")
segments = model.generate(input="audio.wav")[0]["value"]

# concatenate all speech regions, drop silence
speech = np.concatenate([audio[int(s/1000*sr):int(e/1000*sr)] for s, e in segments])
sf.write("speech_only.wav", speech, sr)

2. Split a long recording into files on pauses (for batch ASR)

for i, (s, e) in enumerate(segments):
    chunk = audio[int(s/1000*sr):int(e/1000*sr)]
    sf.write(f"chunk_{i}.wav", chunk, sr)
# chunk_0.wav, chunk_1.wav, ...  each file is one independent utterance

3. Preprocess for any ASR (including Whisper)

Running VAD before recognition has two concrete benefits: (1) you never send silence/noise into the model = less compute, lower cost; (2) Whisper tends to hallucinate repeated text over long silences, and trimming silence with VAD first measurably reduces those hallucinations. With FunASR's own ASR models (SenseVoice / Paraformer) you can wire VAD in with a single vad_model="fsmn-vad" argument for automatic segmentation.

from funasr import AutoModel
# attach VAD directly to ASR; long audio is auto-segmented
asr = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
result = asr.generate(input="long_audio.wav")

Why fsmn-vad

fsmn-vad (FunASR)
Usage3 lines of Python, returns millisecond speech spans
Speedmeasured 0.12s on 13s audio (~110× real-time)
Footprintlightweight FSMN architecture, real-time even on CPU
Ecosystemuse standalone, or attach to FunASR ASR in one line for auto-segmentation
Licenseopen-source, commercial-friendly

If you already use FunASR for recognition, VAD is the same toolkit with zero extra dependencies; if you use Whisper or another ASR, fsmn-vad works as a standalone preprocessor. For full pipelines, see transcribing long audio and speaker diarization.

The whole FunASR stack is open-source — fsmn-vad, industrial-grade ASR, punctuation, speaker, emotion & events, LLM-ASR, ready to use. If it helps, a GitHub Star really supports the project 👇

⭐ Star FunASR

Also star:SenseVoice · Fun-ASR · FunClip

Related posts