FunASR vs faster-whisper: Chinese & Cantonese Speech Recognition Compared
faster-whisper, built on CTranslate2, is the most popular fast Whisper implementation, and it is strong on English and general multilingual audio. On Chinese, though, and especially on Cantonese and other dialects, it has clear gaps. Below is a sentence-level comparison of faster-whisper (small) and FunASR SenseVoice on the same three clips.
Side by side (same audio, verbatim)
| Audio | faster-whisper (small) | FunASR SenseVoice |
|---|---|---|
| Mandarin | 开放时间早上九点至下午五点。 | 开放时间早上9点至下午5点。 |
| Cantonese | Language mis-detected as zh; 這幾個字都表達不到我想講的意思 | Language correctly detected as yue; 呢几个字都表达唔到,我想讲嘅意思 |
| Japanese | うちの中学は弁当性で… | うちの中学は弁当制で… |
- Cantonese: faster-whisper treats it as Mandarin. It detects zh and rewrites spoken Cantonese into standard written Mandarin, dropping the Cantonese-specific characters 呢 / 唔到 / 嘅. SenseVoice handles Cantonese natively, detects yue, and keeps the Cantonese form.
- Japanese homophone error: faster-whisper hears 弁当性 instead of 弁当制; SenseVoice is correct.
- On a simple Mandarin sentence both are correct — the gap shows on hard samples, dialects, and proper nouns.
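One rough way to check whether a transcript kept its Cantonese written forms is to look for characters that are common in written Cantonese but rare in standard written Mandarin. The sketch below is only a heuristic: the marker set `CANTONESE_MARKERS` and the helper `cantonese_markers` are illustrative assumptions, not part of FunASR or faster-whisper, and the set is far from exhaustive. It deliberately leaves out 呢, which also appears in Mandarin as a sentence particle.

```python
# Illustrative, non-exhaustive set of characters typical of written Cantonese.
# This list is an assumption made for the example, not an official resource.
CANTONESE_MARKERS = set("唔嘅咗喺冇佢哋嘢啲嚟")


def cantonese_markers(text: str) -> list:
    """Return the Cantonese marker characters found in text, in order."""
    return [ch for ch in text if ch in CANTONESE_MARKERS]


# The two Cantonese transcripts from the table above.
whisper_out = "這幾個字都表達不到我想講的意思"
sensevoice_out = "呢几个字都表达唔到,我想讲嘅意思"

print(cantonese_markers(whisper_out))
print(cantonese_markers(sensevoice_out))
```

On the two table rows, the faster-whisper output yields no markers while the SenseVoice output yields 唔 and 嘅, which matches the observation that the former was normalized into written Mandarin.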
Overall accuracy (184 Mandarin clips)
Same machine, CPU only, character error rate (CER, lower is better): FunASR SenseVoice 8.0% / Paraformer 9.9% / Fun-ASR-Nano 8.3%; Whisper-class models (small / base / large-v3-turbo) roughly 22–31%. On Chinese, FunASR's CER is about 2.7× lower, a gap that comes from large-scale Mandarin training data and a non-autoregressive architecture (which is also faster). Full methodology is in BENCHMARKS.md.
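For readers unfamiliar with the metric, character error rate is usually defined as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch of that standard definition; the function names are hypothetical, and it does not reproduce whatever text normalization the benchmark itself applies (see BENCHMARKS.md for that).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# The Japanese row from the table, truncated before the ellipsis:
# one substituted character out of ten.
print(cer("うちの中学は弁当制で", "うちの中学は弁当性で"))
```

Without normalization, the two Mandarin transcripts in the table (九点 / 五点 versus 9点 / 5点) would differ by two characters even though both are correct, which is why numeral and punctuation handling matters when comparing CER across systems.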
What else FunASR gives you
- Language ID: zh / en / yue (Cantonese) / ja / ko (faster-whisper labels Cantonese as zh)
- Emotion + audio events: HAPPY/SAD/ANGRY, BGM/applause/laughter
- Non-autoregressive = faster: SenseVoice/Paraformer single forward pass, ~20× real-time on CPU
- On-device: a llama.cpp/GGUF runtime, single binary, zero Python
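The language, emotion, and event labels are typically surfaced as inline tags at the start of the SenseVoice output text. The sketch below assumes a tag format of the form `<|yue|><|NEUTRAL|><|Speech|>...` followed by the transcript; the sample string is made up for illustration rather than captured from a real run, and `split_tags` is a hypothetical helper. FunASR's own SenseVoice examples use a helper named `rich_transcription_postprocess` for cleaning this output, which is likely the better choice in real code.

```python
import re

# Matches tags of the assumed form <|...|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw: str):
    """Split raw output into (list of tag names, plain transcript text)."""
    tags = TAG_PATTERN.findall(raw)
    text = TAG_PATTERN.sub("", raw).strip()
    return tags, text


# Hypothetical raw output, written by hand for this example.
sample = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思"

tags, text = split_tags(sample)
print(tags)
print(text)
```

Keeping the tags as a list, rather than assuming fixed positions, avoids breaking if a model version emits more or fewer tags than this sample shows.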
Which should you use?
Honestly: faster-whisper is still excellent for English, general 99-language coverage, and translation, and it has a mature ecosystem. But if your audio is Chinese, Cantonese, or another Chinese dialect, or if you need emotion, audio-event, or language-ID output, FunASR is more accurate and more complete. Both are open-source and run locally.
Try FunASR in 3 lines
Install the package:

    pip install funasr

Then transcribe a file:

    from funasr import AutoModel
    m = AutoModel(model="iic/SenseVoiceSmall")
    print(m.generate(input="audio.wav", language="auto", use_itn=True)[0]["text"])
The whole FunASR stack is open-source: ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR / on-device llama.cpp. A GitHub star helps.
Also star: SenseVoice · Fun-ASR · FunClip