Auto-Generate Subtitles (SRT & VTT) from Audio or Video with FunASR
You don't need a cloud service or a paid API to subtitle a video. FunASR turns speech into timestamped SRT or VTT subtitles locally and for free, and can even label speakers. It's especially strong on Chinese and supports 50+ languages. Every command and snippet below is tested.
Step 1 (video): extract the audio
If your source is a video, pull out 16 kHz mono audio (FunASR's standard input) with ffmpeg:
ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav
For audio files (wav/mp3/m4a, ...) skip this step.
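If you have a whole folder of videos, the same ffmpeg call can be scripted. The sketch below is only an illustration: the helper names build_extract_cmd and extract_audio are made up for this example, the added -y flag (overwrite existing output) is not part of the command above, and ffmpeg is assumed to be on your PATH.

```python
import subprocess
from pathlib import Path


def build_extract_cmd(src, dst=None):
    """Return the ffmpeg argument list for 16 kHz mono WAV extraction.

    Hypothetical helper: mirrors `ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav`,
    plus -y so an existing output file is overwritten without prompting.
    """
    src = Path(src)
    # Default output: same name as the source, with a .wav extension.
    dst = Path(dst) if dst else src.with_suffix(".wav")
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def extract_audio(src, dst=None):
    """Run ffmpeg and return the output path; raises CalledProcessError on failure."""
    cmd = build_extract_cmd(src, dst)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


# Example (assumes a local folder named "videos"):
# for video in sorted(Path("videos").glob("*.mp4")):
#     print(extract_audio(video))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.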
Fastest: one command to SRT
pip install -U funasr
funasr audio.wav -f srt -o ./subs
This writes a standard SRT file to ./subs/audio.srt with real, usable timestamps:
1
00:00:00,000 --> 00:00:05,546
欢迎大家来体验达摩院推出的语音识别模型 (the Chinese sample)
Add --spk to split cues by speaker (great for interviews and meetings):
funasr meeting.wav --spk -f srt -o ./subs
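To sanity-check a generated file, it can help to turn the SRT timestamps back into milliseconds and confirm that every cue ends after it starts. This is a minimal sketch, not part of FunASR: srt_time_to_ms and cue_spans are hypothetical helpers, and the parser assumes that "-->" appears only on timing lines.

```python
import re

# SRT timestamps look like HH:MM:SS,mmm (comma before the milliseconds).
_SRT_TIME = re.compile(r"(\d{2,}):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_ms(stamp):
    """Convert one SRT timestamp string to integer milliseconds."""
    match = _SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError("not an SRT timestamp: %r" % stamp)
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def cue_spans(srt_text):
    """Yield (start_ms, end_ms) for every timing line in the SRT text."""
    for line in srt_text.splitlines():
        if "-->" in line:  # assumption: subtitle text never contains "-->"
            start, end = line.split("-->")
            yield srt_time_to_ms(start), srt_time_to_ms(end)


sample = "1\n00:00:00,000 --> 00:00:05,546\n欢迎大家来体验达摩院推出的语音识别模型\n"
spans = list(cue_spans(sample))
print(spans)  # [(0, 5546)]
assert all(start < end for start, end in spans)
```

For a real file, read it with open("./subs/audio.srt", encoding="utf-8").read() and pass the text to cue_spans.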
Python: emit SRT and VTT together (with speakers)
When you need more control — e.g. a web-friendly VTT as well, or a custom speaker prefix — build subtitles directly from sentence_info in the result:
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# device="cuda" needs a GPU; use device="cpu" if you don't have one
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
res = model.generate(input="audio.wav")
segments = res[0]["sentence_info"]  # each has start/end (ms), spk, sentence

def ts(ms, sep):  # ms -> HH:MM:SS<sep>mmm (SRT uses "," / VTT uses ".")
    h, mm, ss, mmm = ms//3600000, (ms%3600000)//60000, (ms%60000)//1000, ms%1000
    return "%02d:%02d:%02d%s%03d" % (h, mm, ss, sep, mmm)

# SRT: numbered cues, comma before the milliseconds
with open("subs.srt", "w", encoding="utf-8") as f:
    for i, s in enumerate(segments):
        text = rich_transcription_postprocess(s["sentence"])
        f.write("%d\n%s --> %s\nSpeaker %s: %s\n\n" % (i+1, ts(s["start"], ","), ts(s["end"], ","), s["spk"], text))

# VTT: "WEBVTT" header, dot before the milliseconds, <v ...> voice tag for the speaker
with open("subs.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")
    for s in segments:
        text = rich_transcription_postprocess(s["sentence"])
        f.write("%s --> %s\n<v Speaker %s>%s\n\n" % (ts(s["start"], "."), ts(s["end"], "."), s["spk"], text))
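If you want to see the two formats side by side without loading a model, the formatting logic can be exercised on hand-written segments. The sketch below assumes segments shaped like the sentence_info entries used above (start and end in milliseconds, spk, sentence); to_srt and to_vtt are hypothetical helper names, and the second segment is invented purely for illustration.

```python
def ts(ms, sep):
    """Format milliseconds as HH:MM:SS<sep>mmm (sep is "," for SRT, "." for VTT)."""
    h, rem = divmod(int(ms), 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, msec = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{msec:03d}"


def to_srt(segments):
    """Build SRT text: numbered cues separated by a blank line."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{ts(seg['start'], ',')} --> {ts(seg['end'], ',')}"
        blocks.append(f"{i}\n{timing}\nSpeaker {seg['spk']}: {seg['sentence']}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    """Build WebVTT text: header line, then cues that carry a <v> voice tag."""
    blocks = ["WEBVTT\n"]
    for seg in segments:
        timing = f"{ts(seg['start'], '.')} --> {ts(seg['end'], '.')}"
        blocks.append(f"{timing}\n<v Speaker {seg['spk']}>{seg['sentence']}\n")
    return "\n".join(blocks)


# Hand-written stand-ins for res[0]["sentence_info"]; the second one is made up.
demo_segments = [
    {"start": 610, "end": 5530, "spk": 0, "sentence": "欢迎大家来体验达摩院推出的语音识别模型"},
    {"start": 6000, "end": 8250, "spk": 1, "sentence": "(hypothetical second line)"},
]

print(to_srt(demo_segments))
print(to_vtt(demo_segments))
```

Returning strings instead of writing files directly makes the formatting easy to check before anything touches disk; in real use, pass each sentence through rich_transcription_postprocess first.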
Tested output (SRT):
1
00:00:00,610 --> 00:00:05,530
Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型
Always run rich_transcription_postprocess on each sentence to strip SenseVoice tags like <|zh|>; otherwise they leak into your subtitles.
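To make the "leak" concrete, here is a rough illustration of what a raw tagged string looks like and how <|...|> markers can be removed. This regex is an assumption-based stand-in for demonstration only, not the library's implementation; rich_transcription_postprocess may do more than delete tags, so keep using it in real code. The helper name strip_tags and the <|EXAMPLE|> tag are invented for this sketch.

```python
import re

# Matches markers of the form <|...|>, such as the <|zh|> language tag.
_TAG = re.compile(r"<\|[^|<>]*\|>")


def strip_tags(text):
    """Remove <|...|> markers and trim surrounding whitespace (illustration only)."""
    return _TAG.sub("", text).strip()


# <|zh|> comes from the article; <|EXAMPLE|> is a made-up tag for the demo.
raw = "<|zh|><|EXAMPLE|>欢迎大家来体验达摩院推出的语音识别模型"
clean = strip_tags(raw)
print(clean)
assert "<|" not in clean
```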
Last (optional): burn subtitles into the video
ffmpeg -i video.mp4 -vf subtitles=subs.srt output.mp4
Why use FunASR for subtitles
- Local, free, private: no uploads, no API key, no per-minute fees.
- Fast: SenseVoice is non-autoregressive and far faster than Whisper (benchmark), so long videos are subtitled quickly.
- Especially strong on Chinese, with support for 50+ languages; built-in VAD segmentation, speaker labels, and real timestamps.
FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit.