Chinese (Mandarin) Speech Recognition in Python — Fast & Accurate with FunASR

Chinese is hard for general-purpose ASR: OpenAI Whisper's character error rate (CER) on Chinese is around 20%, with frequent homophone and proper-noun mistakes. FunASR (open-sourced by Tongyi Lab) is purpose-built for Chinese, and its models reach roughly 8-10% CER in the comparison further down. The default recommendation is the flagship Fun-ASR-Nano (LLM-ASR); on CPU, use SenseVoice or Paraformer. Here's the practical how-to, with real measured output.

Chinese ASR in 3 lines (flagship Fun-ASR-Nano, real output)

pip install funasr

from funasr import AutoModel

model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", disable_update=True, device="cuda")
res = model.generate(input="audio.wav")

print(res[0]["text"])
# 欢迎大家来体验达摩院推出的语音识别模型。  ("Welcome, everyone, to try the speech recognition model released by DAMO Academy.")

Fun-ASR-Nano is the LLM-based flagship recognizer (a SenseVoice encoder plus a Qwen3 decoder). It covers 31 languages and is more robust on context-dependent passages, hard cases, and proper nouns, which makes it the default for Chinese. For high-throughput deployments, use vLLM (~340× real-time); see the Fun-ASR-Nano guide.
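The quick-start above hard-codes device="cuda". Here is a minimal sketch of falling back to the CPU recommendation when no GPU is present. The helper name choose_setup is hypothetical (not part of FunASR), and it assumes PyTorch is what reports GPU availability; the two model IDs are the ones used in this post.

```python
def choose_setup():
    """Return (model_id, device) following this post's GPU/CPU recommendation.

    Hypothetical helper, not a FunASR API.
    """
    try:
        import torch  # assumed backend; treated as optional in this sketch
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False

    if has_gpu:
        return "FunAudioLLM/Fun-ASR-Nano-2512", "cuda"
    return "iic/SenseVoiceSmall", "cpu"


model_id, device = choose_setup()
print(model_id, device)
```

The returned pair can then be passed on as AutoModel(model=model_id, device=device, disable_update=True). If the CPU branch is taken, SenseVoice output is usually cleaned with rich_transcription_postprocess, as in the SenseVoice example in the next section.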

Pick the right Chinese model (by scenario)

Your situation | Pick | Chinese CER
Have a GPU / want the best | Fun-ASR-Nano | 8.06%
CPU / edge / want fastest | SenseVoice | 7.81%
CPU + Chinese-only + timestamps/hotwords | Paraformer | 10.18%

All three beat Whisper (~20%) by a wide margin. Full guidance in the FunASR model selection guide. The two lightweight CPU options:

# SenseVoice — the CPU choice, non-autoregressive & fast, multilingual + emotion
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
# 欢迎大家来体验达摩院推出的语音识别模型。

# Paraformer — Chinese-only + character timestamps + hotwords
model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
res = model.generate(input="audio.wav", batch_size_s=300)
print(res[0]["text"])           # 欢迎大家来体验达摩院推出的语音识别模型。
# res[0]["timestamp"] -> [[880, 1120], [1120, 1360], ...] per-character start/end ms
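To make the timestamp format concrete, here is a rough sketch that pairs each character with its [start_ms, end_ms] entry and formats a subtitle-style time range. It assumes pure Chinese text with exactly one timestamp per non-punctuation character (punctuation added by the punctuation model is assumed to carry no timestamp); check this against your own output. The helper names are hypothetical and the sample values are illustrative, not real model output.

```python
import unicodedata


def align_chars(text, timestamps):
    """Pair each non-punctuation, non-space character with (start_ms, end_ms)."""
    chars = [
        c for c in text
        if not c.isspace() and not unicodedata.category(c).startswith("P")
    ]
    if len(chars) != len(timestamps):
        # The one-timestamp-per-character assumption does not hold for this input.
        raise ValueError(f"{len(chars)} characters but {len(timestamps)} timestamps")
    return [(c, start, end) for c, (start, end) in zip(chars, timestamps)]


def fmt_srt_time(ms):
    """Format milliseconds as an SRT time code: HH:MM:SS,mmm."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


# Illustrative values only (stand-ins for res[0]["text"] and res[0]["timestamp"]).
text = "欢迎大家。"
timestamps = [[880, 1120], [1120, 1360], [1360, 1600], [1600, 1840]]

aligned = align_chars(text, timestamps)
start, end = aligned[0][1], aligned[-1][2]
print(f"{fmt_srt_time(start)} --> {fmt_srt_time(end)}")  # 00:00:00,880 --> 00:00:01,840
print(text)
```

Raising on a length mismatch is deliberate: silently zipping mismatched lists would shift every later character onto the wrong time range.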

Why FunASR for Chinese (instead of Whisper)

Chinese test-set CER (lower is better) | CER
FunASR · Fun-ASR-Nano (flagship) | 8.06%
FunASR · SenseVoice | 7.81%
FunASR · Paraformer | 10.18%
Whisper-large-v3 | 20.02%
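For readers who want to measure CER on their own audio, here is a minimal sketch of the metric: the character-level edit distance between reference and hypothesis, divided by the reference length. This is the textbook definition, not necessarily the exact scoring script behind the numbers above; benchmark scripts typically also normalize punctuation and digits first, which this sketch skips. The strings are made-up examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # drop a reference character
                curr[j - 1] + 1,          # extra character in the hypothesis
                prev[j - 1] + (r != h),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one homophone-style substitution (模 -> 魔) in 13 characters.
ref = "欢迎大家来体验语音识别模型"
hyp = "欢迎大家来体验语音识别魔型"
print(f"{cer(ref, hyp):.2%}")  # 7.69%
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.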

Beyond accuracy, FunASR ships a full Chinese toolkit: inverse text normalization (ITN), punctuation restoration, hotword / proper-noun customization, character-level timestamps, and speaker diarization. Full comparisons are in the FunASR vs Whisper benchmark and the FunASR vs faster-whisper comparison.

Common add-ons (each is one line)

Command line / server

# transcribe from the CLI, export srt / json, with speaker labels
funasr audio.wav --model paraformer-zh -f srt --spk

# OpenAI-compatible transcription server (flagship fun-asr-nano by default)
funasr-server --device cuda

See command-line transcription and the self-hosted cloud-STT alternative.
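Since the server is described as OpenAI-compatible, a client can presumably post audio the way it would to OpenAI's transcription endpoint. The sketch below only builds the request, using the standard library. Everything specific in it is an assumption, not confirmed by this post: the base URL and port, the /v1/audio/transcriptions route (taken from the OpenAI API convention), and the model name "fun-asr-nano" (taken from the comment above). Check the server's own documentation before relying on it.

```python
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes,
                                filename="audio.wav", model="fun-asr-nano"):
    """Build a multipart/form-data POST in the OpenAI transcription style.

    Hypothetical helper; route and field names follow the OpenAI convention.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field: model name.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # File field: header, then the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Assumed address; use whatever host and port your server actually listens on.
req = build_transcription_request("http://localhost:8000", b"fake-audio-bytes")
print(req.get_method(), req.full_url)

# To send it against a running server (OpenAI-style responses carry a "text" field):
#   import json
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["text"])
```

Building the request separately from sending it keeps the sketch checkable without a running server.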

FunASR is the go-to open-source choice for Chinese ASR (MIT-licensed), combining the flagship Fun-ASR-Nano with SenseVoice and Paraformer. If it helps, a GitHub star really supports the project 👇

⭐ Star Fun-ASR

Also star: FunASR · SenseVoice · FunClip
