Cantonese Speech Recognition in Python — SenseVoice Keeps Real Cantonese (Whisper Turns It Into Mandarin)

Cantonese has around 85 million speakers, yet open-source ASR support for it is weak. Whisper nominally lists Cantonese, but in practice it treats Cantonese as Chinese (zh) and rewrites it into Standard Written Mandarin, so the actual Cantonese is lost.

SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) natively supports Cantonese (yue): it auto-detects the language and preserves genuine colloquial Cantonese characters, all in one non-autoregressive pass. Here is a real side-by-side on the same Cantonese clip.

Real comparison: the same Cantonese audio

Tested on SenseVoice's bundled Cantonese sample yue.mp3, actual outputs from both models:

| Model | Detected language | Output |
| --- | --- | --- |
| SenseVoice | yue (Cantonese) | 呢几个字都表达唔到,我想讲嘅意思。 |
| Whisper-small | zh (Chinese) | 這幾個字都表達不到我想講的意思 |

The difference is clear: SenseVoice keeps the hallmark Cantonese words 呢 / 唔 / 嘅. Whisper does not handle Cantonese separately: it recognizes the clip as Chinese and converts it to written Mandarin (呢→這, 唔→不, 嘅→的). The meaning is roughly the same, but the text is no longer Cantonese.
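One rough way to check whether a transcript kept colloquial Cantonese is to look for the marker characters named above. The sketch below is only a heuristic: the helper name `cantonese_markers_found` and the three-character marker list are illustrative assumptions, not part of either model, and some of these characters (呢, for instance) also occur in written Mandarin, so a hit is a hint rather than proof.

```python
from typing import List

# Illustrative marker list taken from the comparison above (an assumption,
# not an exhaustive inventory of colloquial Cantonese characters).
CANTONESE_MARKERS = ("呢", "唔", "嘅")


def cantonese_markers_found(text: str) -> List[str]:
    """Return the marker characters that appear in `text`, in list order."""
    return [ch for ch in CANTONESE_MARKERS if ch in text]


sensevoice_text = "呢几个字都表达唔到,我想讲嘅意思。"
# Written-Mandarin rendering after the substitutions 呢→這, 唔→不, 嘅→的.
mandarin_text = "這幾個字都表達不到我想講的意思"

print(cantonese_markers_found(sensevoice_text))  # ['呢', '唔', '嘅']
print(cantonese_markers_found(mandarin_text))    # []
```

An empty result on a clip you know is Cantonese suggests the transcript was normalized into written Mandarin.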

Cantonese ASR in three lines

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
res = model.generate(input="cantonese.wav", cache={}, language="yue", use_itn=True)

print(rich_transcription_postprocess(res[0]["text"]))
# 呢几个字都表达唔到,我想讲嘅意思。

Set language to "yue" directly, or use "auto" for automatic language ID — on this clip auto also correctly resolves to yue with identical output.
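To check on your own audio whether "auto" and "yue" agree, you can run the clip once per setting and compare the raw text. This is a minimal sketch, assuming `model` is the `AutoModel` instance created in the snippet above; the helper names `compare_language_settings` and `settings_agree` are hypothetical, and the `generate` call simply repeats the arguments already shown.

```python
from typing import Dict, Sequence


def compare_language_settings(
    model, audio_path: str, settings: Sequence[str] = ("yue", "auto")
) -> Dict[str, str]:
    """Transcribe the same clip once per language setting.

    `model` is assumed to expose generate() with the arguments used in
    the snippet above; the raw (tagged) text is returned per setting.
    """
    results = {}
    for lang in settings:
        res = model.generate(
            input=audio_path, cache={}, language=lang, use_itn=True
        )
        results[lang] = res[0]["text"]
    return results


def settings_agree(results: Dict[str, str]) -> bool:
    """True when every language setting produced identical raw text."""
    return len(set(results.values())) == 1
```

With the model from the snippet above, `settings_agree(compare_language_settings(model, "cantonese.wav"))` would be expected to return True for a clip like the one tested here; a False result is a sign that automatic language ID picked a different language for that audio.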

What else the raw output carries

SenseVoice's raw output begins with a set of tags:

<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思。

Meaning: <|yue|> = language is Cantonese, <|NEUTRAL|> = emotion (neutral), <|Speech|> = audio event (clean speech), <|withitn|> = inverse text normalization applied. So while transcribing Cantonese you also get language, emotion, and audio-event labels from the same call. One call to rich_transcription_postprocess() strips the tags to plain text.
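If you want to keep the tag values instead of discarding them, the leading tags can be read with a small regular expression. The following is a sketch, assuming the raw string always starts with tags in the order shown above (language, emotion, audio event, ITN flag); `parse_raw_output` and `ParsedOutput` are hypothetical names, not part of funasr.

```python
import re
from typing import NamedTuple, Optional

# Matches one tag of the form <|...|>.
TAG_RE = re.compile(r"<\|([^|<>]+)\|>")


class ParsedOutput(NamedTuple):
    language: Optional[str]
    emotion: Optional[str]
    event: Optional[str]
    itn: Optional[str]
    text: str


def parse_raw_output(raw: str) -> ParsedOutput:
    """Split the leading <|...|> tags from the transcript text.

    Assumes the tag order language, emotion, event, ITN flag, as in the
    example above; missing tags are returned as None.
    """
    tags = []
    pos = 0
    while True:
        match = TAG_RE.match(raw, pos)  # only consume tags at the start
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    tags += [None] * (4 - len(tags))  # pad so indexing below is safe
    return ParsedOutput(tags[0], tags[1], tags[2], tags[3], raw[pos:])


raw = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到,我想讲嘅意思。"
parsed = parse_raw_output(raw)
print(parsed.language, parsed.emotion, parsed.event, parsed.itn)
# yue NEUTRAL Speech withitn
```

Keeping the fields separate makes it easy to, for example, route clips by `parsed.language` or filter on `parsed.event` before storing `parsed.text`.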

Why SenseVoice for Cantonese

| | SenseVoice | Whisper |
| --- | --- | --- |
| Colloquial Cantonese (呢/唔/嘅) | ✅ preserved | ❌ rewritten to Mandarin |
| Automatic language ID | ✅ resolves to yue | collapses to zh |
| Emotion / audio events | ✅ in one pass | not provided |
| Inverse text normalization | ✅ built-in | partial |
| Speed | non-autoregressive, ~15× faster than Whisper-Large | autoregressive baseline |
| License | open-source, commercial-friendly | open-source |

If your use case mixes Mandarin and Cantonese (HK/Macau apps, call centers, media subtitles), one SenseVoice model covers both. For higher-accuracy offline transcription, see the FunASR-family Fun-ASR-Nano, and the full Chinese accuracy comparison in the FunASR vs Whisper benchmark.

The whole FunASR stack is open-source — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Cantonese out of the box. If it helps, a GitHub Star really supports the project 👇

⭐ Star SenseVoice

Also star: FunASR · Fun-ASR · FunClip
