Cantonese Speech Recognition in Python — SenseVoice Keeps Real Cantonese (Whisper Turns It Into Mandarin)
Cantonese has around 85 million speakers, yet open-source ASR support for it is weak. Whisper nominally lists Cantonese, but in practice it treats Cantonese as Chinese (zh) and rewrites it into Standard Written Mandarin — 唔→不, 嘅→的, 呢→這 — so the actual Cantonese is lost.
SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) natively supports Cantonese (yue): it auto-detects the language and preserves genuine colloquial Cantonese characters, all in one non-autoregressive pass. Here is a real side-by-side on the same Cantonese clip.
Real comparison: the same Cantonese audio
Both models were run on SenseVoice's bundled Cantonese sample, yue.mp3. These are their actual outputs:
| Model | Detected language | Output |
|---|---|---|
| SenseVoice | yue (Cantonese) | 呢几个字都表达唔到,我想讲嘅意思。 |
| Whisper-small | zh (Chinese) | 這幾個字都表達不到我想講的意思 |
The difference is clear. SenseVoice keeps the hallmark Cantonese words 呢 / 唔 / 嘅. Whisper-small has no separate Cantonese language token (OpenAI added one only in large-v3), so it labels the clip as Chinese and converts it to written Mandarin (呢→這, 唔→不, 嘅→的): roughly the same meaning, but no longer Cantonese.
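The same check can be done programmatically. The sketch below is a minimal illustration, not a linguistic classifier: it only looks for the three marker characters from the comparison above, and `cantonese_markers` is a hypothetical helper name.

```python
# Colloquial Cantonese character -> the written-Mandarin character Whisper
# substituted for it in the comparison above. Only these three pairs are
# covered; a real check would need a much larger list.
CANTONESE_TO_MANDARIN = {"呢": "這", "唔": "不", "嘅": "的"}


def cantonese_markers(text):
    """Return the marker characters found in `text`, in mapping order."""
    return [ch for ch in CANTONESE_TO_MANDARIN if ch in text]


sensevoice_out = "呢几个字都表达唔到,我想讲嘅意思。"
whisper_out = "這幾個字都表達不到我想講的意思"

print(cantonese_markers(sensevoice_out))  # ['呢', '唔', '嘅']
print(cantonese_markers(whisper_out))     # []
```

An empty result on a clip you know to be Cantonese is a quick hint that the transcript has been normalized to written Mandarin.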
Cantonese ASR in a few lines of Python
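```bash
pip install funasr
```

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Load SenseVoiceSmall from the model hub.
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)

# Force Cantonese decoding and apply inverse text normalization.
res = model.generate(input="cantonese.wav", cache={}, language="yue", use_itn=True)

# Strip the leading tags and print plain text.
print(rich_transcription_postprocess(res[0]["text"]))
# 呢几个字都表达唔到,我想讲嘅意思。
```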
Set language to "yue" directly, or use "auto" for automatic language ID — on this clip auto also correctly resolves to yue with identical output.
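To check on your own clips that "auto" and "yue" agree, a minimal sketch along these lines may help. It assumes only the `generate()` call shown above. `compare_language_modes` and `modes_agree` are hypothetical helper names, not part of FunASR, and the model is passed in so nothing is loaded at import time.

```python
def compare_language_modes(model, audio_path, postprocess=lambda text: text):
    """Transcribe one clip twice: with language="yue" and language="auto".

    `model` is expected to expose generate() with the keyword arguments
    used in the snippet above. This helper is hypothetical, not a FunASR API.
    """
    outputs = {}
    for language in ("yue", "auto"):
        res = model.generate(
            input=audio_path, cache={}, language=language, use_itn=True
        )
        # res is a list with one dict per input; "text" holds the transcript.
        outputs[language] = postprocess(res[0]["text"])
    return outputs


def modes_agree(outputs):
    """True when the forced-yue and auto-detected transcripts are identical."""
    return outputs["yue"] == outputs["auto"]
```

With the model loaded as above, `modes_agree(compare_language_modes(model, "cantonese.wav", rich_transcription_postprocess))` should return `True` for any clip where auto-detection resolves to yue, as it did for the sample here. A `False` result is worth inspecting, since it may mean the clip was detected as a different language.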
What else the raw output carries
SenseVoice's raw output begins with a set of tags:
```text
<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思。
```
Meaning: <|yue|> = language is Cantonese, <|NEUTRAL|> = emotion, <|Speech|> = audio event (clean speech), <|withitn|> = inverse text normalization applied. So while transcribing Cantonese you also get language, emotion and audio events for free. One call to rich_transcription_postprocess() strips the tags to plain text.
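If you want the tags as structured fields rather than discarding them, a small parser is enough. The sketch below is an illustration under one assumption: that the output is a single segment starting with four tags in the order shown above (language, emotion, event, ITN flag). `parse_raw_output` is a hypothetical helper, not a FunASR function.

```python
import re

# Matches one tag such as <|yue|> and captures the label between the bars.
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")

# Assumed tag order, taken from the single example above.
FIELD_NAMES = ("language", "emotion", "event", "itn")


def parse_raw_output(raw):
    """Split a raw SenseVoice string into tag fields plus plain text."""
    tags = TAG_PATTERN.findall(raw)
    fields = dict(zip(FIELD_NAMES, tags))
    fields["text"] = TAG_PATTERN.sub("", raw).strip()
    return fields


raw = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思。"
parsed = parse_raw_output(raw)
print(parsed["language"], parsed["emotion"], parsed["event"])  # yue NEUTRAL Speech
print(parsed["text"])  # 呢几个字都表达唔到,我想讲嘅意思。
```

For plain text alone, `rich_transcription_postprocess()` remains the simpler choice. A parser like this is only useful when the language or emotion label feeds into later logic, such as routing Cantonese and Mandarin clips differently.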
Why SenseVoice for Cantonese
| | SenseVoice | Whisper |
|---|---|---|
| Colloquial Cantonese (呢/唔/嘅) | ✅ preserved | ❌ rewritten to Mandarin |
| Automatic language ID | ✅ resolves to yue | collapses to zh |
| Emotion / audio events | ✅ in one pass | ❌ |
| Inverse text normalization | ✅ built-in | partial |
| Speed | non-autoregressive, ~15× faster than Whisper-Large | autoregressive baseline |
| License | open-source, commercial-friendly | open-source (MIT) |
If your use case mixes Mandarin and Cantonese (HK/Macau apps, call centers, media subtitles), one SenseVoice model covers both. For higher-accuracy offline transcription, see Fun-ASR-Nano from the FunASR family; the full Chinese accuracy comparison is in the FunASR vs Whisper benchmark.
The whole FunASR stack is open-source: industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Cantonese out of the box. If it helps, a GitHub star really supports the project.
Related posts
- FunASR vs Whisper Benchmark
- SenseVoice Deployment Guide
- Fun-ASR-Nano Guide
- Speaker Diarization: Who Spoke When
- Emotion & Language Detection
- Real-Time Streaming Speech-to-Text
- Transcribe Long Audio (Hours in One Call)
- Transcribe from the Command Line
- Self-Hosted OpenAI Whisper API Alternative
- Auto-Generate Subtitles (SRT / VTT)
- Speech to Text in Python
- FunASR on llama.cpp (whisper.cpp Alternative)
- FunASR vs faster-whisper (Chinese/Cantonese)