Chinese (Mandarin) Speech Recognition in Python — Fast & Accurate with FunASR
Chinese is hard for general-purpose ASR: OpenAI Whisper's character error rate (CER) on Chinese is around 20%, with frequent homophone and proper-noun mistakes. FunASR (open-sourced by Tongyi Lab) is purpose-built for Chinese, and its models score roughly 8 to 10% CER in the comparison table below. The default recommendation is the flagship Fun-ASR-Nano (an LLM-based recognizer); on CPU, use SenseVoice or Paraformer. Here's the practical how-to, with real measured output.
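For readers new to the metric: CER is conventionally the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The sketch below assumes no text normalization (benchmarks usually strip punctuation and normalize digits first, and how the figures quoted in this post were normalized is not stated). The function name `cer` and the sample strings are illustrative only.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] is the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# A near-homophone slip: 模型 (model) heard as 魔性,
# i.e. two wrong characters out of six.
print(cer("语音识别模型", "语音识别魔性"))
```

Because the denominator is the reference length, a CER of 20% means roughly one character in five needs correcting.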
Chinese ASR in 3 lines (flagship Fun-ASR-Nano, real output)
pip install funasr
from funasr import AutoModel
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", disable_update=True, device="cuda")
res = model.generate(input="audio.wav")
print(res[0]["text"])
# 欢迎大家来体验达摩院推出的语音识别模型。 ("Welcome, everyone, to try the speech recognition model released by DAMO Academy.")
Fun-ASR-Nano is the LLM-based flagship recognizer (a SenseVoice encoder plus a Qwen3 decoder). It covers 31 languages and is more robust on context-dependent passages, hard cases, and proper nouns, which makes it the default for Chinese. For serving at scale, use vLLM (~340× real-time); see the Fun-ASR-Nano guide.
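To put the ~340× figure in perspective, here is the arithmetic. It assumes the number means about 340 seconds of audio processed per second of wall-clock time; the hardware and batch settings behind it are not stated here. The helper name is illustrative.

```python
def processing_seconds(audio_seconds: float, speedup: float = 340.0) -> float:
    """Estimated wall-clock time for a given real-time speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


one_hour = 60 * 60
print(f"{processing_seconds(one_hour):.1f} s per hour of audio")                  # 10.6 s
print(f"{processing_seconds(1000 * one_hour) / 3600:.1f} h per 1000 h of audio")  # 2.9 h
```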
Pick the right Chinese model (by scenario)
| Your situation | Pick | Chinese CER |
|---|---|---|
| Have a GPU / want the best | ⭐ Fun-ASR-Nano | 8.06% |
| CPU / edge / want fastest | SenseVoice | 7.81% |
| CPU + Chinese-only + timestamps/hotwords | Paraformer | 10.18% |
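The table can be read as a small decision rule. In this sketch, `pick_model` and its parameter names are hypothetical; the model identifiers are the ones used in the snippets in this post.

```python
def pick_model(has_gpu: bool,
               chinese_only: bool = False,
               need_timestamps_or_hotwords: bool = False) -> str:
    """Map the scenarios in the table to a FunASR model identifier."""
    if has_gpu:
        return "FunAudioLLM/Fun-ASR-Nano-2512"  # flagship, wants a GPU
    if chinese_only and need_timestamps_or_hotwords:
        return "paraformer-zh"                  # CPU, Chinese-only, timestamps/hotwords
    return "iic/SenseVoiceSmall"                # CPU / edge / fastest


print(pick_model(has_gpu=False, chinese_only=True, need_timestamps_or_hotwords=True))
# paraformer-zh
```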
All three beat Whisper (~20%) by a wide margin. Full guidance in the FunASR model selection guide. The two lightweight CPU options:
# SenseVoice: the CPU choice, non-autoregressive & fast, multilingual + emotion
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
# 欢迎大家来体验达摩院推出的语音识别模型。

# Paraformer: Chinese-only + character timestamps + hotwords
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
res = model.generate(input="audio.wav", batch_size_s=300)
print(res[0]["text"])
# 欢迎大家来体验达摩院推出的语音识别模型。
# res[0]["timestamp"] -> [[880, 1120], [1120, 1360], ...] per-character start/end ms
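One way to use the per-character timestamps is to pair each character with its span. This is a sketch under one assumption that is worth checking against your own output: punctuation inserted by the punctuation model carries no timestamp, so it is skipped before pairing. `align_characters` is a hypothetical helper, and the sample data is made up in the shape of the Paraformer result.

```python
import unicodedata


def align_characters(text, timestamps):
    """Pair each spoken character with its (start_ms, end_ms) span."""
    # Assumption: punctuation and whitespace have no timestamp entry.
    spoken = [ch for ch in text
              if not ch.isspace()
              and not unicodedata.category(ch).startswith("P")]
    if len(spoken) != len(timestamps):
        raise ValueError(
            f"{len(spoken)} characters vs {len(timestamps)} timestamps")
    return [(ch, start, end) for ch, (start, end) in zip(spoken, timestamps)]


# Made-up sample; with a real result, pass res[0]["text"] and res[0]["timestamp"].
sample_text = "欢迎大家。"
sample_timestamps = [[880, 1120], [1120, 1360], [1360, 1600], [1600, 1840]]
for ch, start, end in align_characters(sample_text, sample_timestamps):
    print(f"{ch}\t{start / 1000:.2f}s -> {end / 1000:.2f}s")
```

The length check fails loudly if the assumption does not hold (for example, if the audio contains English words that are timestamped per word rather than per character).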
Why FunASR for Chinese (instead of Whisper)
| Chinese test-set CER (lower is better) | CER |
|---|---|
| FunASR · Fun-ASR-Nano (flagship) | 8.06% |
| FunASR · SenseVoice | 7.81% |
| FunASR · Paraformer | 10.18% |
| Whisper-large-v3 | 20.02% |
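Expressed as relative error reduction against Whisper-large-v3, the table works out as follows. The figures are taken from the table; only the helper name is new.

```python
WHISPER_LARGE_V3_CER = 20.02  # percent, from the table

funasr_cer = {
    "Fun-ASR-Nano": 8.06,
    "SenseVoice": 7.81,
    "Paraformer": 10.18,
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage of the baseline's errors that the candidate removes."""
    return (baseline - candidate) / baseline * 100


for name, value in funasr_cer.items():
    reduction = relative_reduction(WHISPER_LARGE_V3_CER, value)
    print(f"{name}: {reduction:.1f}% fewer character errors")
# Fun-ASR-Nano: 59.7% fewer character errors
# SenseVoice: 61.0% fewer character errors
# Paraformer: 49.2% fewer character errors
```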
Beyond accuracy, FunASR ships a full Chinese toolkit: inverse text normalization (ITN), punctuation restoration, hotword / proper-noun customization, character-level timestamps, and speaker diarization. Full comparisons in the FunASR vs Whisper benchmark and vs faster-whisper.
Common add-ons (each is one line)
- Long audio auto-segmentation: add `vad_model="fsmn-vad"`. See long audio and VAD / silence removal.
- Punctuation: add `punc_model="ct-punc"`.
- Hotwords: use SeACo-Paraformer with `hotword`.
- Who spoke when: add a speaker model; see speaker diarization.
- Cantonese: natively supported by SenseVoice — see Cantonese speech recognition.
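The VAD and punctuation add-ons are plain keyword arguments to `AutoModel`, so they compose. A minimal sketch: `build_model_kwargs` and `transcribe` are hypothetical helper names, and hotwords and speaker labels are left out because this post does not show their exact arguments.

```python
def build_model_kwargs(model="paraformer-zh", long_audio=False, punctuation=False):
    """Collect AutoModel keyword arguments for the selected add-ons."""
    kwargs = {"model": model}
    if long_audio:
        kwargs["vad_model"] = "fsmn-vad"   # auto-segment long recordings
    if punctuation:
        kwargs["punc_model"] = "ct-punc"   # restore punctuation
    return kwargs


def transcribe(path, **options):
    """Load a model with the chosen add-ons and return the recognized text."""
    from funasr import AutoModel  # imported lazily so the helper above has no dependencies
    model = AutoModel(**build_model_kwargs(**options))
    return model.generate(input=path)[0]["text"]


print(build_model_kwargs(long_audio=True, punctuation=True))
# {'model': 'paraformer-zh', 'vad_model': 'fsmn-vad', 'punc_model': 'ct-punc'}
```

With FunASR installed, `transcribe("audio.wav", long_audio=True, punctuation=True)` would load the same configuration as the Paraformer snippet earlier in this post.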
Command line / server
# transcribe from the CLI, export srt / json, with speaker labels
funasr audio.wav --model paraformer-zh -f srt --spk
# OpenAI-compatible transcription server (flagship fun-asr-nano by default)
funasr-server --device cuda
See command-line transcription and the self-hosted cloud-STT alternative.
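To run the CLI over a folder from Python, one option is a thin `subprocess` wrapper. This sketch assumes the `funasr` executable is on your PATH and uses only the flags from the command above; the helper names are hypothetical, and where the CLI writes its output files is not covered here.

```python
import subprocess
from pathlib import Path


def build_cli_args(audio_path, model="paraformer-zh", fmt="srt", speakers=True):
    """Assemble the funasr command as an argument list."""
    args = ["funasr", str(audio_path), "--model", model, "-f", fmt]
    if speakers:
        args.append("--spk")
    return args


def transcribe_folder(folder, pattern="*.wav"):
    """Run the CLI once per matching file; raises on the first failure."""
    for audio_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_cli_args(audio_path), check=True)


print(build_cli_args("audio.wav"))
# ['funasr', 'audio.wav', '--model', 'paraformer-zh', '-f', 'srt', '--spk']
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.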
FunASR is the go-to open-source choice for Chinese ASR (MIT): flagship Fun-ASR-Nano + SenseVoice + Paraformer. If it helps, a GitHub star really supports the project.