FunASR vs faster-whisper: Chinese & Cantonese Speech Recognition Compared

faster-whisper, built on CTranslate2, is one of the most popular fast Whisper implementations, and it is strong on English and general multilingual audio. On Chinese, however, and especially on Cantonese and other dialects, it has clear gaps. Below is a sentence-level comparison of faster-whisper (small) and FunASR SenseVoice on the same three clips.

Side by side (same audio, verbatim)

| Audio | faster-whisper (small) | FunASR SenseVoice |
|---|---|---|
| Mandarin | 开放时间早上九点至下午五点。 | 开放时间早上9点至下午5点。 |
| Cantonese | language mis-detected as zh; 這幾個字都表達不到我想講的意思 | correctly detects yue; 呢几个字都表达唔到,我想讲嘅意思 |
| Japanese | うちの中学は弁当で… | うちの中学は弁当で… |

Overall accuracy (184 Mandarin clips)

Measured on the same machine, on CPU, by character error rate (CER, lower is better): FunASR SenseVoice 8.0%, Paraformer 9.9%, Fun-ASR-Nano 8.3%; Whisper-class models (small / base / large-v3-turbo) roughly 22–31%. On Chinese, FunASR's CER is about 2.7× lower, due to large-scale Mandarin training data and a non-autoregressive architecture (which also makes it faster). The full methodology is in BENCHMARKS.md.
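For readers unfamiliar with the metric, here is a minimal sketch of how a per-clip character error rate can be computed. It assumes the usual definition (character-level Levenshtein distance divided by the reference length); the exact normalization used for the numbers above is described in BENCHMARKS.md and may differ. The function names and the example strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference character missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra character in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution (气 -> 汽) and one deletion (很)
# against a 6-character reference.
print(f"{cer('今天天气很好', '今天天汽好'):.3f}")
```

A corpus-level figure is usually obtained by summing edit distances over all clips and dividing by the total number of reference characters, rather than averaging per-clip rates. Whether punctuation and numerals are normalized beforehand can shift the result noticeably, for instance when one system writes 九点 and another writes 9点.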

What else FunASR gives you

Which should you use?

Honestly: faster-whisper is still excellent for English, for general 99-language coverage, and for translation, and it has a mature ecosystem. But if your audio is Chinese, Cantonese, or another dialect, or if you need emotion, event, or language information, FunASR is more accurate and more complete. Both are open-source and run locally.

Try FunASR in 3 lines

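pip install funasr

from funasr import AutoModel
# Load the SenseVoiceSmall checkpoint (downloaded on first use).
m = AutoModel(model="iic/SenseVoiceSmall")
# language="auto" lets the model detect the language; use_itn=True enables punctuation and inverse text normalization.
print(m.generate(input="audio.wav", language="auto", use_itn=True)[0]["text"])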
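The text returned by SenseVoice typically begins with markers for language, emotion, and audio event, in a form like `<|zh|><|NEUTRAL|><|Speech|><|withitn|>`. The following is a minimal sketch, assuming that tag format, of how the markers could be separated from the transcript. The helper name `split_tags` and the sample string are hypothetical, not real model output.

```python
import re

# Assumed raw format: zero or more <|tag|> markers followed by the transcript.
TAG = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str) -> tuple[list[str], str]:
    """Return (tags in order of appearance, text with all tags removed)."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text


# Hypothetical raw string, written by hand to illustrate the assumed format.
raw = "<|zh|><|NEUTRAL|><|Speech|><|withitn|>你好。"
tags, text = split_tags(raw)
print(tags)
print(text)
```

FunASR also provides its own helper, `rich_transcription_postprocess` in `funasr.utils.postprocess_utils`, which is probably the better choice when you only need clean display text; the sketch above is useful mainly when you want the individual tags as data.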

The whole FunASR stack is open-source: ASR, VAD, punctuation, speaker, emotion and events, LLM-ASR, and on-device llama.cpp. A GitHub star helps.
