Cantonese Speech Recognition in Python — SenseVoice Keeps Real Cantonese (Whisper Turns It Into Mandarin)

2026-06-21 · FunASR Team

Cantonese has around 85 million speakers, yet open-source ASR support for it is weak. Whisper nominally lists Cantonese, but in practice it treats Cantonese as Chinese (zh) and rewrites it into Standard Written Mandarin — 唔→不, 嘅→的, 呢→這 — so the actual Cantonese is lost.

SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) natively supports Cantonese (yue): it auto-detects the language and preserves genuine colloquial Cantonese characters, all in one non-autoregressive pass. Here is a real side-by-side on the same Cantonese clip.

Real comparison: the same Cantonese audio

Both models were run on yue.mp3, the Cantonese sample bundled with SenseVoice. These are their actual outputs:

Model	Detected language	Output
SenseVoice	`yue` (Cantonese)	呢几个字都表达唔到，我想讲嘅意思。
Whisper-small	`zh` (Chinese)	這幾個字都表達不到我想講的意思

The difference is clear. SenseVoice keeps the hallmark Cantonese words 呢 / 唔 / 嘅. Whisper-small does not treat Cantonese as a separate language here: it labels the clip as Chinese and converts it to written Mandarin (呢→這, 唔→不, 嘅→的). The meaning is roughly the same, but the text is no longer Cantonese.
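One rough way to check this automatically is to look for Cantonese-specific characters in a transcript. The sketch below is a heuristic only and is not part of SenseVoice or Whisper. The marker set and the function name `cantonese_markers` are illustrative assumptions, limited to the three characters discussed above (呢 also occurs in Mandarin as a particle, so a real check would need a larger, more careful list).

```python
# Illustrative marker set: only the three characters discussed in this post.
CANTONESE_MARKERS = ("呢", "唔", "嘅")


def cantonese_markers(text: str) -> list[str]:
    """Return the marker characters present in `text`, in marker-set order."""
    return [ch for ch in CANTONESE_MARKERS if ch in text]


# The two outputs from the comparison table above.
sensevoice_out = "呢几个字都表达唔到，我想讲嘅意思。"
whisper_out = "這幾個字都表達不到我想講的意思"

print(cantonese_markers(sensevoice_out))  # ['呢', '唔', '嘅']
print(cantonese_markers(whisper_out))     # []
```

An empty result on audio known to be Cantonese suggests the transcript was normalized into written Mandarin.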

Cantonese ASR in three lines

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

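# Load SenseVoiceSmall; disable_update=True skips FunASR's update check at startup.
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
# language="yue" forces Cantonese; use_itn=True adds punctuation and inverse text normalization.
res = model.generate(input="cantonese.wav", cache={}, language="yue", use_itn=True)

# Convert the raw tagged output into clean transcript text.
print(rich_transcription_postprocess(res[0]["text"]))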
# 呢几个字都表达唔到，我想讲嘅意思。

Set language to "yue" directly, or use "auto" for automatic language ID — on this clip auto also correctly resolves to yue with identical output.
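If the language code comes from user input or configuration, it may help to validate it before calling the model. The wrapper below is a minimal sketch: the name `transcribe` is hypothetical, and the set of accepted codes is an assumption based on the options listed in the SenseVoice README, so check it against the version you have installed. It takes the model as a parameter and only relies on the `generate` call shown above.

```python
# Language codes assumed from the SenseVoice README; verify against your version.
SUPPORTED_LANGUAGES = {"auto", "zh", "en", "yue", "ja", "ko", "nospeech"}


def transcribe(model, audio_path: str, language: str = "auto") -> str:
    """Run `model.generate` and return the raw tagged text of the first result.

    `model` is any object with a FunASR-style `generate` method, for example
    the AutoModel instance created in the snippet above.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    res = model.generate(input=audio_path, cache={}, language=language, use_itn=True)
    return res[0]["text"]
```

With the model loaded earlier, `transcribe(model, "cantonese.wav")` would use automatic language ID, and `transcribe(model, "cantonese.wav", language="yue")` would force Cantonese.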

What else the raw output carries

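SenseVoice's raw output begins with a set of tags. Here they mark the detected language (yue), the emotion (NEUTRAL), the audio event type (Speech), and that inverse text normalization was applied (withitn):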

<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到，我想讲嘅意思。
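To use these tags programmatically (for example, to route clips by detected language), they can be separated from the text with a small regular expression. This is a sketch that assumes the tags always appear as a contiguous `<|...|>` prefix, as in the sample above. The helper name `split_tags` is hypothetical, and `rich_transcription_postprocess` remains the library's own way to obtain clean text.

```python
import re

# Matches a single tag such as <|yue|> and captures its name.
TAG = re.compile(r"<\|([^|]+)\|>")


def split_tags(raw: str) -> tuple[list[str], str]:
    """Split the leading <|...|> tags from the transcript text."""
    tags = []
    pos = 0
    # Consume tags one at a time from the start of the string.
    while (m := TAG.match(raw, pos)):
        tags.append(m.group(1))
        pos = m.end()
    return tags, raw[pos:]


raw = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到，我想讲嘅意思。"
tags, text = split_tags(raw)
print(tags)  # ['yue', 'NEUTRAL', 'Speech', 'withitn']
print(text)  # 呢几个字都表达唔到，我想讲嘅意思。
```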

Why SenseVoice for Cantonese

	SenseVoice	Whisper
Colloquial Cantonese (呢/唔/嘅)	✅ preserved	❌ rewritten to Mandarin
Automatic language ID	✅ resolves to `yue`	collapses to `zh`
Emotion / audio events	✅ in one pass	❌
Inverse text normalization	✅ built-in	partial
Speed	non-autoregressive, ~15× faster than Whisper-Large	autoregressive baseline
License	open-source, commercial-friendly	open-source

If your use case mixes Mandarin and Cantonese (HK/Macau apps, call centers, media subtitles), one SenseVoice model covers both. For higher-accuracy offline transcription, see the FunASR-family Fun-ASR-Nano, and the full Chinese accuracy comparison in the FunASR vs Whisper benchmark.

The whole FunASR stack is open-source — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Cantonese out of the box. If it helps, a GitHub Star really supports the project 👇

⭐ Star SenseVoice

Also star: FunASR · Fun-ASR · FunClip
