Transcribe Long Audio with FunASR: Hours in a Single Call

FunASR Blog · 2026-06-17 · Tutorial

Transcribing a 1-hour podcast, lecture or meeting with Whisper is painful: the model takes only 30 seconds of audio at a time, so you have to chunk the file, transcribe each piece, and stitch the results back together, all while handling words cut off at chunk boundaries.
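
To make that bookkeeping concrete, here is a minimal sketch of just the windowing step of a manual pipeline. The function name chunk_bounds and the 2-second overlap are illustrative assumptions (overlap is one common way to soften boundary cuts), not part of Whisper or FunASR:

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 2.0):
    """Return (start, end) windows in seconds that cover [0, total_s].

    Hypothetical helper: consecutive windows overlap by overlap_s so that a
    word cut at one boundary appears whole in the neighbouring window.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


windows = chunk_bounds(3600.0)  # a 1-hour file
print(len(windows), windows[:2], windows[-1])
```

Each of those windows would still need to be cut from the file, transcribed, and merged with duplicate words removed from the overlaps, which is the part that tends to be fiddly.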

FunASR makes this trivial. With built-in VAD (voice activity detection), a single generate() call ingests audio of any length - it segments, batch-decodes and reassembles the full text for you, with zero chunking code.

Install

pip install -U funasr modelscope

Full, runnable code

from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",   # or "paraformer-zh"
    vad_model="fsmn-vad",          # built-in voice activity detection
    vad_kwargs={"max_single_segment_time": 30000},  # cap each VAD segment at 30 s (value in ms)
)

# One call on a full 1-hour file - VAD segments and batches it internally.
res = model.generate(
    input="podcast_1hour.wav",
    batch_size_s=300,   # dynamic batching of VAD segments (throughput)
)
print(res[0]["text"])
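
With SenseVoiceSmall, the raw text field may carry inline tags (language, emotion and similar) in the form <|...|> ahead of the transcript. FunASR ships rich_transcription_postprocess in funasr.utils.postprocess_utils for cleaning these up. The regex below is only a rough illustration of what such tags look like, and both the sample string and the split_tags helper are made up for the example:

```python
import re

# Matches tags of the form <|something|>.
TAG = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str):
    """Separate <|...|> tags from the plain transcript (illustrative helper)."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text


# Hypothetical raw output, not taken from a real run.
sample = "<|en|><|NEUTRAL|><|Speech|><|woitn|>hello world"
tags, text = split_tags(sample)
print(tags)
print(text)
```

For real use, prefer the library's own post-processing function over a hand-rolled regex.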

Tested: 13 minutes transcribed in 4.3s

long clip: 791s = 13.2 min
transcribed 791s in 4.3s -> RTFx=186
output chars: 2104   # full transcript, head to tail

We fed a 13.2-minute (791s) recording into a single generate() call. It finished in 4.3s (about 186x realtime), and the 2104-character output covers the full file head to tail. A 1-hour file goes through the same code path, with no code changes.
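
For reference, RTFx is simply audio duration divided by wall-clock time. The short sketch below reproduces the arithmetic from the log. Note that 791 / 4.3 comes out near 184, so the logged 186 was presumably computed from the unrounded timing. The 1-hour figure is a back-of-the-envelope extrapolation that assumes throughput stays constant; it is not a measurement:

```python
def rtfx(audio_s: float, wall_s: float) -> float:
    """Realtime factor: seconds of audio processed per second of wall time."""
    return audio_s / wall_s


audio_s = 791.0   # clip length from the log above
wall_s = 4.3      # rounded wall-clock time from the log above

print(f"clip length: {audio_s / 60:.1f} min")
print(f"RTFx from rounded numbers: {rtfx(audio_s, wall_s):.0f}")

# Extrapolation only: assumes the logged RTFx of 186 holds for a longer file.
estimated_hour_s = 3600 / 186
print(f"estimated time for 1 hour: {estimated_hour_s:.1f} s")
```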

Why it handles any length

VAD auto-segmentation: fsmn-vad splits the long audio into speech segments at pauses; max_single_segment_time caps the longest segment.
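Dynamic batching: batch_size_s=300 packs segments into batches by total audio duration (up to 300 seconds per batch) for high-throughput decoding.
Bounded memory: because decoding works on batches of segments, peak GPU memory depends on the batch size and the segment cap rather than on total file length, so a 1-hour file uses about as much as a 1-minute one.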
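
The three points above can be sketched in plain Python. This is a simplified illustration, not FunASR's actual implementation: real VAD splits at pauses rather than at fixed offsets, and the helper names cap_segments and pack_by_duration are invented for the example.

```python
def cap_segments(segments_ms, max_ms=30000):
    """Split any (start, end) segment longer than max_ms into shorter pieces."""
    out = []
    for start, end in segments_ms:
        while end - start > max_ms:
            out.append((start, start + max_ms))
            start += max_ms
        out.append((start, end))
    return out


def pack_by_duration(segments_ms, batch_size_s=300):
    """Greedily group segments so each batch holds at most batch_size_s of audio."""
    budget_ms = batch_size_s * 1000
    batches, current, used = [], [], 0
    # Sorting by duration keeps similar-length segments together.
    for seg in sorted(segments_ms, key=lambda s: s[1] - s[0]):
        dur = seg[1] - seg[0]
        if current and used + dur > budget_ms:
            batches.append(current)
            current, used = [], 0
        current.append(seg)
        used += dur
    if current:
        batches.append(current)
    return batches


# Hypothetical speech segments in milliseconds, as a VAD might report them.
speech = [(0, 12000), (15000, 80000), (82000, 90000)]
capped = cap_segments(speech)                      # the 65 s segment gets split
batches = pack_by_duration(capped, batch_size_s=60)
print(capped)
print([sum(e - s for s, e in b) / 1000 for b in batches])  # seconds per batch
```

Because each batch is bounded by the duration budget, the work done per decoding step stays roughly constant however many batches a file produces, which is the intuition behind the bounded-memory point.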

Going further

Add punc_model="ct-punc" for automatic punctuation.
Add spk_model="cam++" for speaker diarization (who spoke when).
Use SenseVoiceSmall for language and emotion detection; for maximum speed, see the benchmark.
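
One way to wire these options in is to assemble the AutoModel arguments conditionally. The sketch below is an assumption-laden example: build_model_kwargs and transcribe are hypothetical helpers, the default model is the paraformer-zh alternative mentioned in the code above, and which add-ons combine with which ASR model is not covered in this post, so check the FunASR docs for your setup.

```python
def build_model_kwargs(punctuation=False, diarization=False, model="paraformer-zh"):
    """Assemble keyword arguments for funasr.AutoModel (illustrative helper)."""
    kwargs = {
        "model": model,
        "vad_model": "fsmn-vad",
        "vad_kwargs": {"max_single_segment_time": 30000},  # milliseconds
    }
    if punctuation:
        kwargs["punc_model"] = "ct-punc"
    if diarization:
        kwargs["spk_model"] = "cam++"
    return kwargs


def transcribe(path, **options):
    """Run one generate() call with the chosen add-ons."""
    from funasr import AutoModel  # imported lazily so the helper above works without funasr

    model = AutoModel(**build_model_kwargs(**options))
    return model.generate(input=path, batch_size_s=300)


kwargs = build_model_kwargs(punctuation=True, diarization=True)
print(kwargs)
```

A call such as transcribe("meeting.wav", punctuation=True) (file name hypothetical) would then return the usual result list; inspect res[0] to see which extra fields your model combination produces.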

Typical use cases

Podcasts / audiobooks: transcribe a whole episode at once.
Meetings / lectures: batch long recordings to text + archive/search.
Call recordings: full-call transcription for QA and mining.

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit - long audio, streaming and multilingual all covered.
