Speech to Text in Python: Transcribe Audio Locally and Free with FunASR

FunASR Blog · 2026-06-18 · Tutorial

You don't need a cloud API or a heavy Whisper setup to transcribe audio in Python. FunASR turns audio into text in a few lines, locally and for free, with timestamps, speaker diarization, and batch processing built in. It's especially strong on Chinese and supports 50+ languages. Every snippet below is tested.

Install

pip install -U torch torchaudio
pip install -U funasr
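
To confirm that the two pip commands above took effect, a small check like the following can help. It is a minimal sketch that relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is made up for illustration and is not part of FunASR.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed by the pip commands above.
for name in ("torch", "torchaudio", "funasr"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```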

Simplest: transcribe a file in a few lines

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# device="cuda" needs a GPU; use device="cpu" on machines without one.
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", device="cuda")
result = model.generate(input="audio.wav")
print(rich_transcription_postprocess(result[0]["text"]))
# -> 欢迎大家来体验达摩院推出的语音识别模型   (Chinese sample)

SenseVoice is the default choice here: it is non-autoregressive and very fast. rich_transcription_postprocess strips SenseVoice tags such as <|zh|> from the raw output to give clean text, so don't skip it.
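
To make the tag stripping concrete, here is a simplified stand-in written with a regular expression. It is not FunASR's implementation: the real rich_transcription_postprocess may treat some tags (for example emotion or event markers) differently instead of just deleting them. The function name `strip_tags` and the sample raw string are illustrative assumptions.

```python
import re

# Matches special tokens of the form <|...|>, for example <|zh|>.
_TAG = re.compile(r"<\|[^|]*\|>")


def strip_tags(raw: str) -> str:
    """Remove every <|...|> token and trim surrounding whitespace."""
    return _TAG.sub("", raw).strip()


# Hand-written sample of what tagged output might look like (assumption).
raw = "<|zh|><|NEUTRAL|><|Speech|>欢迎大家来体验达摩院推出的语音识别模型"
print(strip_tags(raw))
```

In real code, keep calling rich_transcription_postprocess; the sketch only shows why raw output should not be printed directly.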

Transcribe straight from a URL

result = model.generate(
    input="https://example.com/audio.wav"   # local paths, numpy, bytes also work
)

Get timestamps and speakers

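Add spk_model="cam++" when building the model. Each entry in result[0]["sentence_info"] then carries start/end (in milliseconds), spk (the speaker index), and sentence (the recognized text):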

model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="audio.wav")

for seg in result[0]["sentence_info"]:
    start = seg["start"] / 1000
    text = rich_transcription_postprocess(seg["sentence"])
    print("[%.1fs] Speaker %s: %s" % (start, seg["spk"], text))
# -> [0.6s] Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型
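
The same fields can be turned into a subtitle-style listing that shows both the start and the end of each segment. The sketch below works on plain dictionaries with the start, end, spk, and sentence keys described above; the helper names and the sample entry are hypothetical, and the sample is hand-written rather than real model output.

```python
def ms_to_clock(ms: int) -> str:
    """Format a millisecond offset as HH:MM:SS.mmm."""
    seconds, millis = divmod(int(ms), 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_segments(sentence_info):
    """Return one display line per segment: time span, speaker, text."""
    lines = []
    for seg in sentence_info:
        span = f"{ms_to_clock(seg['start'])} --> {ms_to_clock(seg['end'])}"
        lines.append(f"[{span}] Speaker {seg['spk']}: {seg['sentence']}")
    return lines


# Hand-written sample entry (not real model output).
sample = [{"start": 610, "end": 5530, "spk": 0, "sentence": "欢迎大家来体验达摩院推出的语音识别模型"}]
print("\n".join(format_segments(sample)))
```

With a real result, pass result[0]["sentence_info"] and run each sentence through rich_transcription_postprocess first.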

Batch-transcribe multiple files

Pass a list to input and get one result per file:

results = model.generate(input=["a.wav", "b.wav", "c.wav"])
for r in results:
    print(rich_transcription_postprocess(r["text"]))
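
For a whole folder, it can be handy to build the input list automatically and keep each transcript next to its file name. The sketch below is one way to do that. It assumes results come back in the same order as the inputs; the helper names, the extension set, and the "recordings" folder are all hypothetical.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to match your data.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".m4a"}


def collect_audio(folder):
    """Return sorted paths (as strings) of audio files directly inside `folder`."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    )


def pair_with_text(paths, results, clean=lambda text: text):
    """Map each path to its transcript, assuming one result per input, in order."""
    if len(paths) != len(results):
        raise ValueError("expected one result per input file")
    return {p: clean(r["text"]) for p, r in zip(paths, results)}


# Usage with the model from the earlier snippets (left commented out here):
# paths = collect_audio("recordings")
# results = model.generate(input=paths)
# transcripts = pair_with_text(paths, results, rich_transcription_postprocess)
```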

Which model to pick

model=	Best for
`iic/SenseVoiceSmall`	Default; very fast, 50+ languages, emotion/events
`paraformer-zh`	Classic, production-grade Chinese
`FunAudioLLM/Fun-ASR-Nano-2512`	Flagship LLM decoder for zh/en/ja + Chinese dialects/accents
`FunAudioLLM/Fun-ASR-MLT-Nano-2512`	Separate multilingual checkpoint covering 31 languages

Switch models by changing only model= — the rest of the code stays the same.
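
One way to keep that switch in a single place is a small lookup that builds the AutoModel arguments. This is a sketch under the assumption stated above, that only model= differs between checkpoints; the short keys and the `model_kwargs` helper are hypothetical, while the model IDs are copied from the table.

```python
# Model IDs copied from the table above; the short keys are made-up labels.
MODELS = {
    "default": "iic/SenseVoiceSmall",
    "chinese": "paraformer-zh",
    "flagship": "FunAudioLLM/Fun-ASR-Nano-2512",
    "multilingual": "FunAudioLLM/Fun-ASR-MLT-Nano-2512",
}


def model_kwargs(choice="default", device="cuda"):
    """Build keyword arguments for AutoModel; only `model` varies by choice."""
    return {"model": MODELS[choice], "vad_model": "fsmn-vad", "device": device}


print(model_kwargs("chinese"))

# Usage (left commented out so the sketch runs without FunASR installed):
# from funasr import AutoModel
# model = AutoModel(**model_kwargs("chinese"))
```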

Why FunASR

Local, free, private: no API key, no per-minute fees, and no internet connection needed once the model weights are downloaded.
Fast: SenseVoice is non-autoregressive and far faster than Whisper (benchmark), running in real time even on CPU.
Strong on Chinese, with 50+ languages supported; timestamps, speaker labels, hotwords, and batching work out of the box.
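
Since SenseVoice is described as real-time even on CPU, the device does not have to be hard-coded to "cuda". A minimal sketch that falls back to CPU when no GPU is visible to PyTorch; the helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing; FunASR will not run either, but report "cpu".
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# Usage (left commented out so the sketch runs without FunASR installed):
# from funasr import AutoModel
# model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", device=device)
```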

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit.
