FunASR vs faster-whisper: Chinese & Cantonese Speech Recognition Compared

2026-06-21 · FunASR Team

faster-whisper is the most popular fast Whisper implementation (built on CTranslate2), and it is strong on English and general multilingual audio. On Chinese, though, and especially on Cantonese and other dialects, it has clear gaps. Below is a sentence-level comparison of faster-whisper (small) and FunASR SenseVoice on the same three clips.

Side by side (same audio, verbatim)

Audio	faster-whisper (small)	FunASR SenseVoice
Mandarin	开放时间早上九点至下午五点。	开放时间早上9点至下午5点。
Cantonese	lang mis-detected as zh; 這幾個字都表達不到我想講的意思	correctly detects `yue`; 呢几个字都表达唔到,我想讲嘅意思
Japanese	うちの中学は弁当性で…	うちの中学は弁当制で…

Cantonese: faster-whisper treats the clip as Mandarin. It detects zh and rewrites spoken Cantonese into standard written Mandarin, dropping the Cantonese-specific characters 呢, 唔到 and 嘅. SenseVoice handles Cantonese natively: it detects yue and keeps the Cantonese form.
Japanese homophone error: faster-whisper hears 弁当性 instead of 弁当制 (both are read bentōsei); SenseVoice is correct.
On the simple Mandarin sentence both are correct, differing only in how the numerals are written (九 / 五 vs 9 / 5). The gap shows on hard samples, dialects, and proper nouns.
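One rough way to check whether a transcript kept its Cantonese form is to count Cantonese-specific characters in the output. The sketch below uses only the three characters named above and the two Cantonese transcripts from the table. The helper name `cantonese_marker_count` and the marker list are illustrative assumptions, not part of either library, and the list is far from exhaustive.

```python
# Hypothetical helper: counts a few Cantonese-specific characters in a transcript.
# The marker set is taken from the example in this post and is not exhaustive.
CANTONESE_MARKERS = ("呢", "唔", "嘅")


def cantonese_marker_count(text: str) -> int:
    """Return how many marker characters occur in `text`."""
    return sum(text.count(ch) for ch in CANTONESE_MARKERS)


# The two Cantonese transcripts from the comparison table.
whisper_out = "這幾個字都表達不到我想講的意思"
sensevoice_out = "呢几个字都表达唔到,我想讲嘅意思"

print(cantonese_marker_count(whisper_out))     # 0
print(cantonese_marker_count(sensevoice_out))  # 3
```

A count of zero on audio known to be Cantonese suggests the model normalized the speech into written Mandarin. It is a heuristic only, since a short Cantonese utterance may legitimately contain none of these characters.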

Overall accuracy (184 Mandarin clips)

Measured on the same machine, on CPU, as character error rate (CER, lower is better): FunASR SenseVoice 8.0% / Paraformer 9.9% / Fun-ASR-Nano 8.3%; Whisper-class models (small/base/large-v3-turbo) ~22–31%. On Chinese, FunASR's CER is ~2.7× lower, thanks to large-scale Mandarin training data and a non-autoregressive architecture (which is also faster). Full methodology is in BENCHMARKS.md.
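For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below is a minimal version of that standard definition, not the benchmark's actual scoring code (which is described in BENCHMARKS.md and may normalize text differently). The function names are illustrative, and treating one of the two Mandarin transcripts from the table as the "reference" is an assumption made purely for the example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumption for illustration: the spelled-out transcript is the reference.
ref = "开放时间早上九点至下午五点。"
hyp = "开放时间早上9点至下午5点。"
print(f"{cer(ref, hyp):.3f}")  # 0.143 (2 substitutions over 14 characters)
```

The example also shows why text normalization matters when comparing systems: two transcripts that a reader would call equally correct still differ by two characters if numerals are not normalized before scoring.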

What else FunASR gives you

Language ID: zh / en / yue (Cantonese) / ja / ko (in the comparison above, faster-whisper small labeled the Cantonese clip as zh)
Emotion + audio events: HAPPY/SAD/ANGRY, BGM/applause/laughter
Non-autoregressive = faster: SenseVoice and Paraformer decode in a single forward pass, ~20× real-time on CPU
On-device: a llama.cpp/GGUF runtime, single binary, zero Python

Which should you use?

Honestly: faster-whisper is still excellent for English, general 99-language coverage, and translation, with a mature ecosystem. But if your audio is Chinese, Cantonese, or another dialect, or if you need emotion, event, or language information, FunASR is more accurate and more complete. Both are open-source and run locally.

Try FunASR in 3 lines

# Shell: install the package
pip install funasr

# Python: load SenseVoice and print the transcript of one file
from funasr import AutoModel
m = AutoModel(model="iic/SenseVoiceSmall")
print(m.generate(input="audio.wav", language="auto", use_itn=True)[0]["text"])
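The language, emotion, and event labels mentioned earlier typically arrive as tags at the start of the returned text. The sketch below assumes tags of the form `<|yue|><|NEUTRAL|><|Speech|><|withitn|>` and separates them from the transcript with a regular expression. The sample string is hypothetical (the tag values were not taken from a real run on the Cantonese clip), and `split_tags` is an illustrative helper rather than a FunASR function. FunASR's own examples use a helper named `rich_transcription_postprocess` for cleanup, which this sketch deliberately avoids so that it runs without the package installed.

```python
import re

# Matches tags written as <|...|>, the form assumed for SenseVoice output.
TAG = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str) -> tuple[list[str], str]:
    """Return (tags, plain_text) for a tagged transcript string."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text


# Hypothetical sample: the tag values here are assumed, not measured.
sample = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思"

tags, text = split_tags(sample)
print(tags)  # ['yue', 'NEUTRAL', 'Speech', 'withitn']
print(text)  # 呢几个字都表达唔到,我想讲嘅意思
```

Under this assumption the first tag would carry the detected language, so a check such as `tags[0] == "yue"` could be used to route Cantonese audio. Verify the tag order against real output before relying on it.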

The whole FunASR stack is open-source — ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR / on-device llama.cpp. A GitHub Star helps 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip