Transcribe Long Audio with FunASR: Hours in a Single Call

Transcribing a 1-hour podcast, lecture or meeting with Whisper is painful: the model processes only a 30-second window at a time, so long audio has to be chunked, transcribed piece by piece and stitched back together - all while handling words cut off at chunk boundaries.

FunASR makes this trivial. With built-in VAD (voice activity detection), a single generate() call ingests audio of any length - it segments, batch-decodes and reassembles the full text for you, with zero chunking code.

Install

pip install -U funasr modelscope

Full, runnable code

from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",   # or "paraformer-zh"
    vad_model="fsmn-vad",          # built-in voice activity detection
    vad_kwargs={"max_single_segment_time": 30000},  # cap each VAD segment at 30 s (value is in ms)
)

# One call on a full 1-hour file - VAD segments and batches it internally.
res = model.generate(
    input="podcast_1hour.wav",
    batch_size_s=300,   # dynamic batching: up to 300 s of audio (summed over VAD segments) per batch
)
print(res[0]["text"])

Tested: 13 minutes transcribed in 4.3s

long clip: 791s = 13.2 min
transcribed 791s in 4.3s -> RTFx=186
output chars: 2104   # full transcript, head to tail

We fed a 13.2-minute (791 s) recording into a single generate() call. It finished in 4.3 s (about 186x real time), and the output covered the full file from start to end. A 1-hour file goes through the same call with no code changes.
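
For readers who want to reproduce the speed figure on their own files, here is a minimal sketch of how RTFx can be computed: seconds of audio divided by wall-clock seconds. The helper names rtfx and timed are hypothetical and are not part of FunASR. Note that the rounded 4.3 s gives roughly 184, so the logged 186 was presumably computed from the unrounded elapsed time.

```python
import time


def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Speed factor: seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Rounded figures from the test log: 791 s of audio, 4.3 s elapsed.
print(round(rtfx(791, 4.3)))  # -> 184

# Hypothetical usage with the model built earlier in this post:
#   res, elapsed = timed(model.generate, input="podcast_1hour.wav", batch_size_s=300)
#   print(f"RTFx={rtfx(791, elapsed):.0f}")
```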

Why it handles any length
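
The introduction describes the pipeline as segment, batch-decode, reassemble. The toy sketch below illustrates that idea in plain Python. It is not FunASR's implementation: cap_segments, batch_by_duration, transcribe_long and the decode_batch callback are all hypothetical names, and the sketch assumes that max_single_segment_time caps each speech segment (in ms) and that batch_size_s is a per-batch budget of total audio seconds.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms) of one speech region


def cap_segments(segments: List[Segment], max_ms: int) -> List[Segment]:
    """Split any speech region longer than max_ms into pieces of at most max_ms."""
    capped = []
    for start, end in segments:
        while end - start > max_ms:
            capped.append((start, start + max_ms))
            start += max_ms
        capped.append((start, end))
    return capped


def batch_by_duration(segments: List[Segment], budget_s: float) -> List[List[Segment]]:
    """Greedily group segments so each batch holds at most budget_s seconds of audio."""
    batches, current, used_ms = [], [], 0
    for seg in segments:
        dur = seg[1] - seg[0]
        if current and used_ms + dur > budget_s * 1000:
            batches.append(current)
            current, used_ms = [], 0
        current.append(seg)
        used_ms += dur
    if current:
        batches.append(current)
    return batches


def transcribe_long(
    segments: List[Segment],
    decode_batch: Callable[[List[Segment]], List[str]],
    max_ms: int = 30_000,
    budget_s: float = 300,
) -> str:
    """Cap, batch, decode, then join the pieces in time order."""
    pieces = []
    for batch in batch_by_duration(cap_segments(segments, max_ms), budget_s):
        pieces.extend(zip(batch, decode_batch(batch)))
    pieces.sort(key=lambda p: p[0][0])  # restore time order by start_ms
    return " ".join(text for _, text in pieces)
```

Under these assumptions, one hour of continuous speech becomes 120 segments of 30 s each, grouped into 12 batches of 300 s. The per-batch work stays bounded no matter how long the file is, which is why file length stops mattering to the caller.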

Going further

Typical use cases

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit - long audio, streaming and multilingual all covered.

