Transcribe Long Audio with FunASR: Hours in a Single Call
Transcribing a 1-hour podcast, lecture or meeting with Whisper is painful: it processes only 30 seconds at a time, so you have to chunk the audio, transcribe each piece, and stitch the results back together - while handling words cut off at chunk boundaries.
FunASR makes this trivial. With built-in VAD (voice activity detection), a single generate() call ingests audio of any length - it segments, batch-decodes and reassembles the full text for you, with zero chunking code.
Install
pip install -U funasr modelscope
Full, runnable code
from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",  # or "paraformer-zh"
    vad_model="fsmn-vad",  # built-in voice activity detection
    vad_kwargs={"max_single_segment_time": 30000},  # cap each VAD segment at 30 s (value is in ms)
)

# One call on a full 1-hour file - VAD segments and batches it internally.
res = model.generate(
    input="podcast_1hour.wav",
    batch_size_s=300,  # dynamic batching: total seconds of audio per batch (throughput)
)
print(res[0]["text"])
Tested: 13 minutes transcribed in 4.3s
long clip: 791s = 13.2 min
transcribed 791s in 4.3s -> RTFx=186
output chars: 2104 # full transcript, head to tail
We fed a 13.2-minute (791s) recording into a single generate() call - done in 4.3s (186x realtime), with output covering the full file head to tail. A 1-hour file works exactly the same way, no code changes.
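The speed figure above is simple arithmetic: RTFx is the audio duration divided by the wall-clock time. The sketch below uses helper names of our own (they are not part of FunASR). With the rounded 4.3s value the ratio comes out near 184, so the logged 186 presumably reflects the unrounded timing. The 1-hour estimate assumes throughput stays constant on longer files, which this test did not measure.

```python
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / wall_seconds


def estimated_wall_seconds(audio_seconds: float, factor: float) -> float:
    """Rough processing-time estimate, assuming throughput stays constant."""
    return audio_seconds / factor


# Values taken from the test log above (791 s of audio, about 4.3 s of compute).
print(f"RTFx from rounded log values: {rtfx(791, 4.3):.0f}")
# Extrapolation only: a 1-hour file at the reported 186x.
print(f"Estimated time for 1 hour: {estimated_wall_seconds(3600, 186):.1f}s")
```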
Why it handles any length
- VAD auto-segmentation: fsmn-vad splits the long audio into speech segments at pauses; max_single_segment_time caps the longest segment.
- Dynamic batching: batch_size_s=300 packs segments by total duration for high-throughput decoding.
- Bounded memory: because it processes segments, GPU memory is independent of total file length - 1 hour uses about as much as 1 minute.
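The segment capping and duration-based batching described above can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not FunASR's actual implementation: the function names, the greedy packing strategy and the sort-by-duration step are all our own assumptions.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms), as a VAD might report them


def cap_segments(segments: List[Segment], max_ms: int) -> List[Segment]:
    """Split any segment longer than max_ms into consecutive pieces."""
    capped: List[Segment] = []
    for start, end in segments:
        while end - start > max_ms:
            capped.append((start, start + max_ms))
            start += max_ms
        capped.append((start, end))
    return capped


def batch_by_duration(segments: List[Segment], budget_s: float) -> List[List[Segment]]:
    """Greedily pack segments so each batch holds at most budget_s seconds of audio."""
    batches: List[List[Segment]] = []
    current: List[Segment] = []
    total = 0.0
    # Sorting by duration groups similar lengths together, which keeps padding small.
    for seg in sorted(segments, key=lambda s: s[1] - s[0]):
        duration = (seg[1] - seg[0]) / 1000.0
        if current and total + duration > budget_s:
            batches.append(current)
            current, total = [], 0.0
        current.append(seg)
        total += duration
    if current:
        batches.append(current)
    return batches


def reassemble(batches: List[List[Segment]]) -> List[Segment]:
    """Restore chronological order after batches were decoded out of order."""
    return sorted(seg for batch in batches for seg in batch)


vad_output = [(0, 45000), (50000, 60000)]  # made-up example segments
capped = cap_segments(vad_output, max_ms=30000)
batches = batch_by_duration(capped, budget_s=40)
print(batches)
print(reassemble(batches))
```

Because each batch is bounded by a duration budget rather than by the file length, the amount of audio on the device at any one time stays roughly constant however long the recording is.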
Going further
- Add punc_model="ct-punc" for automatic punctuation;
- Add spk_model="cam++" for speaker diarization (who spoke when);
- Use SenseVoiceSmall for language & emotion; for max speed see the benchmark.
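As a rough sketch, the optional models listed above are passed as extra keyword arguments to AutoModel. The helpers build_model_kwargs and load_model below are hypothetical conveniences, not FunASR APIs, and the choice of paraformer-zh as the recognition model is an assumption: the article does not say which recognition models can be combined with spk_model, so check that for your setup.

```python
from typing import Any, Dict


def build_model_kwargs(punctuation: bool = True, diarization: bool = False) -> Dict[str, Any]:
    """Assemble AutoModel keyword arguments (hypothetical helper, not a FunASR API)."""
    kwargs: Dict[str, Any] = {
        "model": "paraformer-zh",  # assumption: the alternative model named in the example above
        "vad_model": "fsmn-vad",
        "vad_kwargs": {"max_single_segment_time": 30000},
    }
    if punctuation:
        kwargs["punc_model"] = "ct-punc"  # automatic punctuation
    if diarization:
        kwargs["spk_model"] = "cam++"  # speaker diarization
    return kwargs


def load_model(**options: bool):
    """Create the model; FunASR is imported lazily so the builder runs without it."""
    from funasr import AutoModel

    return AutoModel(**build_model_kwargs(**options))


print(build_model_kwargs(diarization=True))
```

Calling load_model(diarization=True) returns a model that is used exactly like the one in the main example: model.generate(input=..., batch_size_s=300).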
Typical use cases
- Podcasts / audiobooks: transcribe a whole episode at once.
- Meetings / lectures: batch long recordings to text + archive/search.
- Call recordings: full-call transcription for QA and mining.
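For the archive-style use cases above, a small loop over a folder is usually enough. The sketch below is one possible layout, under our own assumptions: transcribe_folder is a hypothetical helper, the recordings are .wav files, and each transcript is written next to its audio file as a .txt. The transcription step is passed in as a callable so the loop itself stays independent of FunASR.

```python
from pathlib import Path
from typing import Callable, List


def transcribe_folder(
    folder: str,
    transcribe: Callable[[str], str],
    pattern: str = "*.wav",
) -> List[Path]:
    """Transcribe every matching file and save the text alongside it as .txt."""
    written: List[Path] = []
    for audio in sorted(Path(folder).glob(pattern)):
        transcript_path = audio.with_suffix(".txt")
        transcript_path.write_text(transcribe(str(audio)), encoding="utf-8")
        written.append(transcript_path)
    return written


# Usage with the model from the main example (requires FunASR to be installed):
#   transcribe_folder(
#       "recordings",
#       lambda path: model.generate(input=path, batch_size_s=300)[0]["text"],
#   )
```

The resulting .txt files can then be fed to whatever search or indexing tool you already use.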
FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit - long audio, streaming and multilingual all covered.