FunASR on llama.cpp — a whisper.cpp Alternative for Chinese ASR (CPU, no Python)

whisper.cpp is the de facto on-device ASR runtime: a single self-contained binary that runs on CPU with zero dependencies. But Whisper is comparatively weak on Chinese. FunASR now has a llama.cpp / GGUF runtime with the same download-and-run experience (one static binary, no Python, built-in VAD, any audio format), and on a Mandarin CPU benchmark its character error rate is about 2.7× lower than whisper.cpp's at comparable model size.

Run it in 3 steps (all verified)

```shell
# 1. Download a prebuilt binary (linux-x64 / linux-arm64 / macos-arm64 / windows-x64)
wget https://github.com/modelscope/FunASR/releases/download/runtime-llamacpp-v0.1.1/funasr-llamacpp-linux-x64.tar.gz
tar xzf funasr-llamacpp-linux-x64.tar.gz

# 2. Download a model with one command (no torch / funasr env)
bash download-funasr-model.sh sensevoice

# 3. Transcribe (prints text directly)
./llama-funasr-sensevoice -m funasr-gguf/sensevoice-small-f16.gguf \
    --vad funasr-gguf/fsmn-vad.gguf -a audio.wav
# -> 开放时间早上九点至下午五点
```
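
To transcribe a whole folder of recordings, the step 3 command can be wrapped in a loop. The following is a minimal sketch, not part of the FunASR release: `transcribe_dir` and the `FUNASR_BIN`, `MODEL`, and `VAD` variables are hypothetical names, it uses only the `-m`, `--vad`, and `-a` flags shown in step 3, and it assumes the binary writes just the transcript to stdout.

```shell
#!/usr/bin/env bash
# Sketch only: batch wrapper around the step 3 command.
# FUNASR_BIN, MODEL, VAD and transcribe_dir are illustrative names, not part of FunASR.
FUNASR_BIN="${FUNASR_BIN:-./llama-funasr-sensevoice}"
MODEL="${MODEL:-funasr-gguf/sensevoice-small-f16.gguf}"
VAD="${VAD:-funasr-gguf/fsmn-vad.gguf}"

# Print "<file name><TAB><transcript>" for every .wav file in a directory.
# Assumes the binary prints only the transcript on stdout.
transcribe_dir() {
    local dir="$1" f
    for f in "$dir"/*.wav; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        printf '%s\t' "$(basename "$f")"
        "$FUNASR_BIN" -m "$MODEL" --vad "$VAD" -a "$f"
    done
}

# Usage: transcribe_dir ./recordings > transcripts.tsv
```

Keeping the binary and model paths in overridable variables makes it easy to swap in another model without editing the loop.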

No Python, no build, no dependency hell — download the binary, download the model, run, get text. Exactly the whisper.cpp feel.

What you get

| Model | Strength | Speed (CPU) |
| --- | --- | --- |
| SenseVoice | multilingual + language / emotion / event tags | ~20× real-time |
| Paraformer | non-autoregressive, fastest | ~22× real-time |
| Fun-ASR-Nano | LLM-ASR, most accurate | LLM decode |

vs whisper.cpp on Chinese (CPU)

184 Mandarin clips, same machine, 8 CPU threads, measured by character error rate (CER, lower is better):

| System | CER ↓ | Size |
| --- | --- | --- |
| FunASR SenseVoice | 8.01 % | 449 MB |
| FunASR Paraformer | 9.85 % | 401 MB |
| FunASR Fun-ASR-Nano | 8.30 % | enc + Qwen3-0.6B |
| whisper.cpp base | 31.33 % | 142 MB |
| whisper.cpp small | 22.12 % | 466 MB |
| whisper.cpp large-v3-turbo | 23.15 % | 1.6 GB |

At comparable size (SenseVoice 449 MB ≈ whisper small 466 MB), FunASR's CER is about 2.7× lower (8.01 % vs 22.12 %), and it runs faster. Full methodology is in the repo's BENCHMARKS.md.
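
The ratio is simply one CER divided by the other. As a quick check of the arithmetic, here is a small sketch using `awk`; the `cer_ratio` helper is a hypothetical name, and the numbers are the CER values reported in this post, with SenseVoice as the reference.

```shell
# Sketch only: cer_ratio is an illustrative helper.
# Usage: cer_ratio <reference CER> <other CER>  -> how many times higher the other CER is
cer_ratio() {
    awk -v ref="$1" -v other="$2" 'BEGIN { printf "%.2f\n", other / ref }'
}

cer_ratio 8.01 22.12   # whisper.cpp small vs SenseVoice          -> 2.76
cer_ratio 8.01 31.33   # whisper.cpp base vs SenseVoice           -> 3.91
cer_ratio 8.01 23.15   # whisper.cpp large-v3-turbo vs SenseVoice -> 2.89
```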

Why it wins on Chinese

(1) Training data: SenseVoice, Paraformer, and Fun-ASR-Nano are trained primarily on large-scale Mandarin data, whereas Whisper is a general multilingual model in which Chinese is a small slice. (2) Architecture: Paraformer is non-autoregressive (CIF, one forward pass) and SenseVoice is encoder + CTC (one forward pass), so both are faster than Whisper's autoregressive per-token decoding.

Model GGUFs

Pre-converted GGUF on Hugging Face: SenseVoiceSmall-GGUF · Paraformer-GGUF · Fun-ASR-Nano-GGUF · fsmn-vad-GGUF. `download-funasr-model.sh` fetches them automatically.
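
After downloading, it can be worth confirming that the files arrived intact rather than as, say, a saved error page. A minimal sketch, relying only on the fact that GGUF files begin with the four ASCII bytes `GGUF`; `is_gguf` and `check_models` are hypothetical helper names, and the paths in the usage line are the ones from the run command earlier in this post.

```shell
# Sketch only: is_gguf and check_models are illustrative helpers, not part of FunASR.

# Succeeds if the file exists and starts with the GGUF magic bytes.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

# Report each file; return non-zero if any is missing or not a GGUF file.
check_models() {
    local f status=0
    for f in "$@"; do
        if is_gguf "$f"; then
            echo "ok: $f"
        else
            echo "missing or not GGUF: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
# check_models funasr-gguf/sensevoice-small-f16.gguf funasr-gguf/fsmn-vad.gguf
```

This checks only the file header, so it catches missing or obviously wrong files but not a truncated download.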

The whole FunASR stack is open source: ASR, VAD, punctuation, speaker, emotion and events, and LLM-ASR, now with an on-device llama.cpp runtime. A GitHub star really helps.
