Auto-Generate Subtitles (SRT & VTT) from Audio or Video with FunASR

FunASR Blog · 2026-06-18 · Tutorial

You don't need a cloud service or a paid API to subtitle a video. FunASR turns speech into timestamped SRT or VTT subtitles locally and for free, and can even label speakers. It's especially strong on Chinese and supports 50+ languages. Every command and snippet below is tested.

Step 1 (video): extract the audio

If your source is a video, pull out 16 kHz mono audio (FunASR's standard input) with ffmpeg:

ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav

If your source is already an audio file (wav/mp3/m4a, ...), skip this step.
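
If you have a folder of mixed inputs, the same extraction can be scripted. The following is a minimal sketch, assuming ffmpeg is on your PATH. The helper names (needs_extraction, extract_cmd, to_wav) and the list of video extensions are illustrative and not part of FunASR; the -y flag (overwrite the output file) is an addition to the command shown above.

```python
import subprocess
from pathlib import Path

# Illustrative set of extensions treated as video; adjust to your own files.
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}

def needs_extraction(path):
    """Return True if the file looks like a video (judged by extension only)."""
    return Path(path).suffix.lower() in VIDEO_EXTS

def extract_cmd(src, dst):
    """Build the ffmpeg call from Step 1: 16 kHz (-ar 16000), mono (-ac 1)."""
    # -y overwrites dst if it already exists (an assumption, not in the article).
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]

def to_wav(src):
    """Extract audio next to the source file and return the .wav path."""
    dst = Path(src).with_suffix(".wav")
    subprocess.run(extract_cmd(src, dst), check=True)  # raises if ffmpeg fails
    return dst
```

Calling to_wav("video.mp4") would write video.wav beside the source; files for which needs_extraction returns False can be passed to FunASR unchanged.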

Fastest: one command to SRT

pip install -U funasr
funasr audio.wav -f srt -o ./subs

This writes a standard SRT file to ./subs/audio.srt with real, usable timestamps:

1
00:00:00,000 --> 00:00:05,546
欢迎大家来体验达摩院推出的语音识别模型   (the Chinese sample)

Add --spk to split cues by speaker (great for interviews and meetings):

funasr meeting.wav --spk -f srt -o ./subs

Python: emit SRT and VTT together (with speakers)

When you need more control — e.g. a web-friendly VTT as well, or a custom speaker prefix — build subtitles directly from sentence_info in the result:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# device="cuda" needs a GPU; pass device="cpu" to run without one
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
res = model.generate(input="audio.wav")
segments = res[0]["sentence_info"]   # each has start/end (ms), spk, sentence

def ts(ms, sep):                     # ms -> HH:MM:SS,mmm  (SRT uses "," / VTT uses ".")
    h, mm, ss, mmm = ms//3600000, (ms%3600000)//60000, (ms%60000)//1000, ms%1000
    return "%02d:%02d:%02d%s%03d" % (h, mm, ss, sep, mmm)

with open("subs.srt", "w", encoding="utf-8") as f:
    for i, s in enumerate(segments, start=1):   # SRT cues are numbered from 1
        text = rich_transcription_postprocess(s["sentence"])
        f.write("%d\n%s --> %s\nSpeaker %s: %s\n\n" % (i, ts(s["start"], ","), ts(s["end"], ","), s["spk"], text))

with open("subs.vtt", "w", encoding="utf-8") as f:
    f.write("WEBVTT\n\n")
    for s in segments:
        text = rich_transcription_postprocess(s["sentence"])
        f.write("%s --> %s\n<v Speaker %s>%s\n\n" % (ts(s["start"], "."), ts(s["end"], "."), s["spk"], text))

Tested output (SRT):

1
00:00:00,610 --> 00:00:05,530
Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型
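
To see how the cue above falls out of the millisecond values, the formatting step can be run on its own, without loading a model. This is a sketch, assuming a hand-written cue dictionary in place of a real sentence_info entry (the values are copied from the tested output). It uses divmod rather than the // and % chain, which gives the same result for non-negative integers.

```python
def ts(ms, sep):
    """Format integer milliseconds as HH:MM:SS<sep>mmm."""
    h, rem = divmod(ms, 3_600_000)   # whole hours
    m, rem = divmod(rem, 60_000)     # whole minutes
    s, mmm = divmod(rem, 1_000)      # seconds and leftover milliseconds
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{mmm:03d}"

# Hand-written stand-in for one sentence_info entry (not produced by a model here).
cue = {"start": 610, "end": 5530, "spk": 0, "sentence": "欢迎大家来体验达摩院推出的语音识别模型"}

# SRT: numeric index, comma before the milliseconds, plain-text speaker prefix.
srt_cue = f"1\n{ts(cue['start'], ',')} --> {ts(cue['end'], ',')}\nSpeaker {cue['spk']}: {cue['sentence']}\n"
# VTT: no index needed, dot before the milliseconds, <v ...> voice tag for the speaker.
vtt_cue = f"{ts(cue['start'], '.')} --> {ts(cue['end'], '.')}\n<v Speaker {cue['spk']}>{cue['sentence']}\n"

print(srt_cue)
print(vtt_cue)
```

The only differences between the two formats in this sketch are the millisecond separator, the cue index, and how the speaker is marked; a VTT file additionally starts with a WEBVTT header line.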

Always run rich_transcription_postprocess to strip SenseVoice tags such as <|zh|>; otherwise they leak into your subtitles.
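
To make the problem concrete, here is a rough illustration of what a leaked tag looks like and how removing it changes the text. It is a stand-in only: the regular expression and the strip_tags helper are assumptions for this example, not the library's implementation, and rich_transcription_postprocess may do more than this, so keep using it in real code.

```python
import re

# Matches tokens of the form <|...|>, for example <|zh|>.
TAG = re.compile(r"<\|[^|]*\|>")

def strip_tags(text):
    """Remove <|...|> tokens and trim surrounding whitespace (illustration only)."""
    return TAG.sub("", text).strip()

# Example raw string with a language tag in front of the sentence.
raw = "<|zh|>欢迎大家来体验达摩院推出的语音识别模型"
print(strip_tags(raw))
```

Without this step (or the real postprocess call), the literal <|zh|> would appear at the start of the subtitle line on screen.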

Last (optional): burn subtitles into the video

ffmpeg -i video.mp4 -vf subtitles=subs.srt output.mp4

Why use FunASR for subtitles

Local, free, private: no uploads, no API key, no per-minute fees.
Fast: SenseVoice is non-autoregressive and far faster than Whisper (benchmark), so long videos are subtitled quickly.
Strong on Chinese, with support for 50+ languages; built-in VAD segmentation, speaker labels, and real timestamps.

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit.
