Speech to Text in Python: Transcribe Audio Locally and Free with FunASR
You don't need a cloud API or a heavy Whisper setup to transcribe audio in Python. FunASR turns audio into text in a few lines, locally and for free, with timestamps, speaker diarization, and batch processing built in. It's especially strong on Chinese and supports 50+ languages. Every snippet below is tested.
Install
pip install -U torch torchaudio
pip install -U funasr
Simplest: transcribe a file in a few lines
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", device="cuda")  # use device="cpu" if you have no GPU
result = model.generate(input="audio.wav")
print(rich_transcription_postprocess(result[0]["text"]))
# -> 欢迎大家来体验达摩院推出的语音识别模型 (Chinese sample)
This example uses SenseVoice (non-autoregressive, very fast). rich_transcription_postprocess strips SenseVoice tags such as <|zh|> from the raw output to give clean text, so don't skip it.
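To make the tag cleanup concrete, here is a minimal sketch of stripping `<|...|>` markers from a raw SenseVoice-style string. The `strip_tags` helper and the sample string are illustrative assumptions, not FunASR code; the real rich_transcription_postprocess may do more than this (for example, treat emotion and event tags specially), so keep using it in practice.

```python
import re

# Matches SenseVoice-style markers such as <|zh|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_tags(raw: str) -> str:
    """Remove <|...|> markers and trim whitespace (illustrative only)."""
    return TAG_PATTERN.sub("", raw).strip()


# Hypothetical raw output; the exact tags depend on the model and the audio.
raw = "<|zh|><|NEUTRAL|><|Speech|>欢迎大家来体验达摩院推出的语音识别模型"
clean = strip_tags(raw)
print(clean)
```

Printing result[0]["text"] before and after post-processing is a quick way to see which tags your own audio produces.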
Transcribe straight from a URL
result = model.generate(
    input="https://example.com/audio.wav",  # local paths, numpy, bytes also work
)
Get timestamps and speakers
Each entry in sentence_info carries start/end (ms), spk (speaker), and sentence:
model = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad", spk_model="cam++", device="cuda")
result = model.generate(input="audio.wav")
for seg in result[0]["sentence_info"]:
    start = seg["start"] / 1000
    text = rich_transcription_postprocess(seg["sentence"])
    print("[%.1fs] Speaker %s: %s" % (start, seg["spk"], text))
# -> [0.6s] Speaker 0: 欢迎大家来体验达摩院推出的语音识别模型
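Per-sentence segments are often easier to read once consecutive sentences from the same speaker are merged into turns. The sketch below assumes segments shaped like the sentence_info entries described above (start, end, spk, sentence); `merge_speaker_turns`, its `sep` parameter, and the sample data are hypothetical and not part of FunASR.

```python
def merge_speaker_turns(segments, sep=" "):
    """Join consecutive segments from the same speaker into one turn.

    Use sep="" for languages written without spaces, such as Chinese.
    """
    turns = []
    for seg in segments:
        if turns and turns[-1]["spk"] == seg["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["sentence"] += sep + seg["sentence"]
        else:
            # Build a new dict so the input segments are not mutated.
            turns.append({
                "spk": seg["spk"],
                "start": seg["start"],
                "end": seg["end"],
                "sentence": seg["sentence"],
            })
    return turns


# Hypothetical segments; times are in milliseconds.
sample = [
    {"start": 600, "end": 2100, "spk": 0, "sentence": "Hello everyone."},
    {"start": 2300, "end": 4000, "spk": 0, "sentence": "Welcome to the demo."},
    {"start": 4500, "end": 6000, "spk": 1, "sentence": "Thanks for having me."},
]

for turn in merge_speaker_turns(sample):
    print("[%.1fs-%.1fs] Speaker %s: %s" % (
        turn["start"] / 1000, turn["end"] / 1000, turn["spk"], turn["sentence"]))
```

With real output you would first run rich_transcription_postprocess on each sentence, then pass the cleaned segments to the helper.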
Batch-transcribe multiple files
Pass a list to input and get one result per file:
results = model.generate(input=["a.wav", "b.wav", "c.wav"])
for r in results:
    print(rich_transcription_postprocess(r["text"]))
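For a whole folder, it can help to collect the file list and write one text file per recording. The sketch below covers only the file handling around the batch call; `find_audio`, `save_transcripts`, the extension set, and the folder names are hypothetical, and it assumes results come back in the same order as the input list.

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}  # assumed set; adjust to your files


def find_audio(folder):
    """Return audio file paths in a folder, sorted for a stable order."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.suffix.lower() in AUDIO_EXTS
    )


def save_transcripts(paths, texts, out_dir):
    """Write texts[i] to <out_dir>/<stem of paths[i]>.txt; return the files."""
    if len(paths) != len(texts):
        raise ValueError("expected one transcript per input file")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path, text in zip(paths, texts):
        target = out / (Path(path).stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Usage with the model from the earlier snippets (requires funasr):
#   paths = find_audio("recordings")
#   results = model.generate(input=paths)
#   texts = [rich_transcription_postprocess(r["text"]) for r in results]
#   save_transcripts(paths, texts, "transcripts")
```

Keeping the file handling separate from the model call makes it easy to test without loading a model.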
Which model to pick
| model= | Best for |
|---|---|
| iic/SenseVoiceSmall | Default; very fast, 50+ languages, emotion/events |
| paraformer-zh | Classic, production-grade Chinese |
| FunAudioLLM/Fun-ASR-Nano-2512 | LLM decoder, highest accuracy, 31 languages incl. dialects |
Switch models by changing only model= — the rest of the code stays the same.
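One way to keep that switch in a single place is a small lookup that builds the AutoModel arguments. This is a sketch under assumptions: the model names come from the table above, while the use-case keys and the `automodel_kwargs` helper are hypothetical, and some models may need extra arguments that are not shown here.

```python
# Model names are taken from the table above; the keys are hypothetical labels.
MODEL_FOR = {
    "default": "iic/SenseVoiceSmall",
    "chinese": "paraformer-zh",
    "accuracy": "FunAudioLLM/Fun-ASR-Nano-2512",
}


def automodel_kwargs(use_case="default", device="cpu"):
    """Build keyword arguments for AutoModel, varying only the model name."""
    if use_case not in MODEL_FOR:
        raise KeyError("unknown use case: %r" % use_case)
    return {
        "model": MODEL_FOR[use_case],
        "vad_model": "fsmn-vad",
        "device": device,
    }


print(automodel_kwargs("chinese"))

# Usage, assuming funasr is installed:
#   from funasr import AutoModel
#   model = AutoModel(**automodel_kwargs("chinese", device="cuda"))
```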
Why FunASR
- Local, free, private: no API key and no per-minute fees, and no internet connection needed once the model files have been downloaded.
- Fast: SenseVoice is non-autoregressive, far faster than Whisper (benchmark), and real-time even on CPU.
- Stronger on Chinese, plus 50+ languages; timestamps, speakers, hotwords, and batching out of the box.
FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit.