FunASR on llama.cpp — a whisper.cpp Alternative for Chinese ASR (CPU, no Python)

2026-06-20 · FunASR Team

whisper.cpp is the de facto on-device ASR runtime: a single self-contained binary that runs on CPU with zero dependencies. But Whisper is comparatively weak on Chinese. FunASR now has a llama.cpp / GGUF runtime that offers the same download-and-run experience (one static binary, no Python, built-in VAD, any audio format), and on a Mandarin CPU benchmark its character error rate is about 2.7× lower than that of the similarly sized whisper.cpp small model.

Run it in 3 steps (all verified)

# 1. Download a prebuilt binary (linux-x64 / linux-arm64 / macos-arm64 / windows-x64)
wget https://github.com/modelscope/FunASR/releases/download/runtime-llamacpp-v0.1.1/funasr-llamacpp-linux-x64.tar.gz
tar xzf funasr-llamacpp-linux-x64.tar.gz

# 2. Download a model with one command (no torch / funasr env)
bash download-funasr-model.sh sensevoice

# 3. Transcribe — prints text directly
./llama-funasr-sensevoice -m funasr-gguf/sensevoice-small-f16.gguf \
    --vad funasr-gguf/fsmn-vad.gguf -a audio.wav
# -> 开放时间早上九点至下午五点   (English: "Opening hours are 9 a.m. to 5 p.m.")

No Python, no build, no dependency hell — download the binary, download the model, run, get text. Exactly the whisper.cpp feel.
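To transcribe a whole folder, the step-3 command can be wrapped in a loop. The sketch below is an illustration, not part of the release: the `transcribe_dir` function, the `FUNASR_BIN`, `FUNASR_MODEL`, and `FUNASR_VAD` variables, and the `.txt` output naming are all hypothetical. Only the `-m`, `--vad`, and `-a` flags come from the command above, and the loop assumes the binary prints the transcript to stdout as it does in step 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the step-3 command over every .wav file in a directory.
# Variable names and the .txt naming scheme are assumptions for illustration.
FUNASR_BIN="${FUNASR_BIN:-./llama-funasr-sensevoice}"
FUNASR_MODEL="${FUNASR_MODEL:-funasr-gguf/sensevoice-small-f16.gguf}"
FUNASR_VAD="${FUNASR_VAD:-funasr-gguf/fsmn-vad.gguf}"

transcribe_dir() {
    local dir="$1" wav
    for wav in "$dir"/*.wav; do
        [ -e "$wav" ] || continue   # skip when the glob matches nothing
        # Same flags as step 3; stdout is assumed to carry the transcript.
        "$FUNASR_BIN" -m "$FUNASR_MODEL" --vad "$FUNASR_VAD" -a "$wav" \
            > "${wav%.wav}.txt"
    done
}
```

Calling `transcribe_dir recordings/` would then leave one `.txt` file next to each `.wav` file. Other formats could be handled the same way by changing the glob.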

What you get

Model	Strength	Speed (CPU)
SenseVoice	multilingual + language / emotion / event tags	~20× real-time
Paraformer	non-autoregressive, fastest	~22× real-time
Fun-ASR-Nano	LLM-ASR, most accurate	bound by LLM decoding (no real-time factor given)

Built-in FSMN-VAD: automatic long-audio segmentation (`--vad fsmn-vad.gguf`); the bare binary reaches reference accuracy with no Python front end.
Any audio: wav / mp3 / flac, any sample rate / channels — resampled inside the binary.
Four prebuilt platforms: Linux x64/arm64, macOS arm64, Windows x64 (from the GitHub release); other platforms build with one CMake command.

vs whisper.cpp on Chinese (CPU)

184 Mandarin clips, same machine, 8 CPU threads; the metric is character error rate (CER, lower is better):

System	CER ↓	Size
FunASR SenseVoice	8.01 %	449 MB
FunASR Paraformer	9.85 %	401 MB
FunASR Fun-ASR-Nano	8.30 %	enc + Qwen3-0.6B
whisper.cpp base	31.33 %	142 MB
whisper.cpp small	22.12 %	466 MB
whisper.cpp large-v3-turbo	23.15 %	1.6 GB

At comparable size (SenseVoice at 449 MB vs whisper.cpp small at 466 MB), SenseVoice's CER is about 2.7× lower (8.01 % vs 22.12 %), and it is also faster. The full methodology is in the repo's BENCHMARKS.md.
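The ratio can be checked directly from the table. The snippet below is a small sanity check using only the CER values listed above; the `cer_ratio` helper is a hypothetical name introduced for this example.

```shell
# CER values are copied from the table above (percent).
# cer_ratio is a hypothetical helper: other system's CER / SenseVoice's CER.
cer_ratio() {
    awk -v other="$1" -v ours="$2" 'BEGIN { printf "%.2f\n", other / ours }'
}

cer_ratio 22.12 8.01   # whisper.cpp small          vs SenseVoice -> 2.76
cer_ratio 31.33 8.01   # whisper.cpp base           vs SenseVoice -> 3.91
cer_ratio 23.15 8.01   # whisper.cpp large-v3-turbo vs SenseVoice -> 2.89
```

The first line is the source of the "about 2.7×" figure, which compares the two models of similar size.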

Why it wins on Chinese

Two reasons. (1) Training data: SenseVoice, Paraformer, and Fun-ASR-Nano are trained primarily on large-scale Mandarin data, whereas Whisper is a general multilingual model in which Chinese is a small slice. (2) Architecture: Paraformer is non-autoregressive (CIF, one forward pass) and SenseVoice is encoder + CTC (also one forward pass), so both decode faster than Whisper, which generates its output autoregressively, one token at a time.
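To make the "one forward pass" point concrete: with CTC, the encoder emits one label per audio frame in a single pass, and turning those labels into text is a cheap linear scan rather than a per-token model call. The sketch below shows the generic textbook greedy CTC rule (merge consecutive repeats, then drop blanks). It is not FunASR's code; the `ctc_collapse` name, the `_` blank symbol, and the toy labels are assumptions for illustration.

```shell
# Generic greedy CTC collapse (illustrative, not the FunASR implementation).
# Input: space-separated per-frame labels, with "_" standing in for the CTC blank.
ctc_collapse() {
    printf '%s\n' "$1" | awk '{
        out = ""; prev = ""
        for (i = 1; i <= NF; i++) {
            # Keep a label only if it differs from the previous frame and is not blank.
            if ($i != prev && $i != "_")
                out = out (out == "" ? "" : " ") $i
            prev = $i
        }
        print out
    }'
}

ctc_collapse "_ a a _ a b b _"   # -> a a b
```

The blank between the two runs of `a` is what lets a genuinely repeated character survive the merge step.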

Model GGUFs

Pre-converted GGUF files are available on Hugging Face: SenseVoiceSmall-GGUF · Paraformer-GGUF · Fun-ASR-Nano-GGUF · fsmn-vad-GGUF. `download-funasr-model.sh` fetches them automatically.

The whole FunASR stack is open-source — ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, now with an on-device llama.cpp runtime. A GitHub Star really helps 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip