FunASR on llama.cpp — a whisper.cpp Alternative for Chinese ASR (CPU, no Python)
whisper.cpp is the de facto on-device ASR runtime: a single self-contained binary that runs on CPU with zero dependencies. Whisper, however, is comparatively weak on Chinese. FunASR now has a llama.cpp / GGUF runtime with the same download-and-run experience (one static binary, no Python, built-in VAD, any audio format), and in the CPU benchmark below its character error rate on Mandarin is about 2.7× lower than whisper.cpp's at comparable model size.
Run it in 3 steps (all verified)
```shell
# 1. Download a prebuilt binary (linux-x64 / linux-arm64 / macos-arm64 / windows-x64)
wget https://github.com/modelscope/FunASR/releases/download/runtime-llamacpp-v0.1.1/funasr-llamacpp-linux-x64.tar.gz
tar xzf funasr-llamacpp-linux-x64.tar.gz

# 2. Download a model with one command (no torch / funasr env)
bash download-funasr-model.sh sensevoice

# 3. Transcribe: prints the text directly
./llama-funasr-sensevoice -m funasr-gguf/sensevoice-small-f16.gguf \
  --vad funasr-gguf/fsmn-vad.gguf -a audio.wav
# -> 开放时间早上九点至下午五点   ("Opening hours: 9 a.m. to 5 p.m.")
```
No Python, no build, no dependency hell — download the binary, download the model, run, get text. Exactly the whisper.cpp feel.
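For more than one file, the step-3 command can be wrapped in a loop. The following is a minimal sketch, not part of the release: `transcribe_dir` and the `FUNASR_BIN`, `FUNASR_MODEL` and `FUNASR_VAD` variables are names invented here for illustration, and it assumes the transcript is what the binary writes to stdout.

```shell
#!/usr/bin/env bash
# Batch sketch: run the step-3 command over every audio file in a directory.
# FUNASR_BIN / FUNASR_MODEL / FUNASR_VAD are hypothetical override variables;
# their defaults are the paths used in the three steps above.
transcribe_dir() {
  local dir="$1"
  local bin="${FUNASR_BIN:-./llama-funasr-sensevoice}"
  local model="${FUNASR_MODEL:-funasr-gguf/sensevoice-small-f16.gguf}"
  local vad="${FUNASR_VAD:-funasr-gguf/fsmn-vad.gguf}"
  local f
  for f in "$dir"/*.wav "$dir"/*.mp3 "$dir"/*.flac; do
    [ -e "$f" ] || continue   # skip glob patterns that matched nothing
    # Assumption: stdout carries the transcript; save it next to the audio file.
    "$bin" -m "$model" --vad "$vad" -a "$f" > "${f%.*}.txt"
  done
}
```

Calling `transcribe_dir ./recordings` would leave one `.txt` file beside each recording. If the binary also prints progress or logs to stdout on your build, the redirect will capture those too, so check one output file before running a large batch.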
What you get
| Model | Strength | Speed (CPU) |
|---|---|---|
| SenseVoice | multilingual + language / emotion / event tags | ~20× real-time |
| Paraformer | non-autoregressive, fastest | ~22× real-time |
| Fun-ASR-Nano | LLM-ASR, most accurate | depends on LLM decoding |
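One way to read the speed column, assuming "× real-time" has its usual meaning of audio duration divided by wall-clock processing time (the post does not define it explicitly): the two helpers below are illustrative names, not part of the release, and only do the arithmetic.

```shell
# speedup AUDIO_SECONDS WALL_SECONDS -> audio duration divided by processing time
speedup() {
  awk -v a="$1" -v w="$2" 'BEGIN { printf "%.1f\n", a / w }'
}

# seconds_at AUDIO_SECONDS FACTOR -> expected processing time at FACTOR x real-time
seconds_at() {
  awk -v a="$1" -v x="$2" 'BEGIN { printf "%.1f\n", a / x }'
}

speedup 60 3          # a 60 s clip transcribed in 3 s    -> 20.0
seconds_at 3600 20    # one hour of audio at 20x real-time -> 180.0
```

To get a figure for your own machine, you could time the transcribe command with the shell's `time` keyword and pass the clip length and the elapsed seconds to `speedup`.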
- Built-in FSMN-VAD: automatic long-audio segmentation (`--vad fsmn-vad.gguf`); the bare binary reaches reference accuracy with no Python front end.
- Any audio: wav / mp3 / flac, any sample rate / channels — resampled inside the binary.
- 4 prebuilt platforms: Linux x64/arm64, macOS arm64, Windows x64 (Release); one CMake command to build for others.
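A small sketch for choosing among the four prebuilt platforms from `uname` output. The function name `platform_tag` is made up for this example, and it returns only the platform tag: only the linux-x64 archive name appears in this post, so the file names of the other release assets are not assumed here.

```shell
# platform_tag OS ARCH -> prebuilt platform tag, or non-zero status if there is none.
# OS and ARCH are the strings printed by `uname -s` and `uname -m`.
platform_tag() {
  case "$1/$2" in
    Linux/x86_64)                                   echo linux-x64 ;;
    Linux/aarch64|Linux/arm64)                      echo linux-arm64 ;;
    Darwin/arm64)                                   echo macos-arm64 ;;
    MINGW*/x86_64|MSYS*/x86_64|CYGWIN*/x86_64)      echo windows-x64 ;;
    *) return 1 ;;
  esac
}

# Report the tag for the current machine, or point to the CMake build instead.
platform_tag "$(uname -s)" "$(uname -m)" \
  || echo "no prebuilt binary for this platform; build with CMake" >&2
```

An Intel Mac, for example, falls through to the last branch, because macOS is only listed as arm64.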
vs whisper.cpp: Chinese on CPU
184 Mandarin clips, same machine, CPU 8 threads, character error rate (CER, lower is better):
| System | CER ↓ | Size |
|---|---|---|
| FunASR SenseVoice | 8.01 % | 449 MB |
| FunASR Paraformer | 9.85 % | 401 MB |
| FunASR Fun-ASR-Nano | 8.30 % | enc + Qwen3-0.6B |
| whisper.cpp base | 31.33 % | 142 MB |
| whisper.cpp small | 22.12 % | 466 MB |
| whisper.cpp large-v3-turbo | 23.15 % | 1.6 GB |
At comparable size (SenseVoice 449 MB ≈ whisper small 466 MB), SenseVoice's CER is about 2.7× lower (8.01 % vs 22.12 %), and it is also faster. Full methodology is in the repo's BENCHMARKS.md.
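For reference, CER is conventionally the character-level edit distance divided by the reference length. The definition below is that standard one, not a restatement of the benchmark script; the text normalization actually used is whatever BENCHMARKS.md specifies. The ratios are computed directly from the table.

```latex
% S, D, I: substituted, deleted, and inserted characters; N: characters in the reference
\mathrm{CER} = \frac{S + D + I}{N}

% SenseVoice compared with each whisper.cpp row of the table
\frac{22.12}{8.01} \approx 2.76 \quad (\text{small}), \qquad
\frac{23.15}{8.01} \approx 2.89 \quad (\text{large-v3-turbo}), \qquad
\frac{31.33}{8.01} \approx 3.91 \quad (\text{base})
```

The headline "about 2.7×" is the first ratio, the size-matched comparison against whisper small.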
Why it wins on Chinese
- Training data: SenseVoice, Paraformer, and Fun-ASR-Nano are trained primarily on large-scale Mandarin data, whereas Whisper is a general multilingual model in which Chinese is a small slice.
- Architecture: Paraformer is non-autoregressive (CIF, one forward pass) and SenseVoice is encoder + CTC (also one forward pass), so both decode faster than Whisper, which generates its output autoregressively, one token at a time.
Model GGUFs
Pre-converted GGUF files are on Hugging Face: SenseVoiceSmall-GGUF · Paraformer-GGUF · Fun-ASR-Nano-GGUF · fsmn-vad-GGUF. `download-funasr-model.sh` fetches them automatically.
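A quick sanity check after downloading, sketched under the assumption that the script saves models into `funasr-gguf/` as in the steps above. GGUF files begin with the four ASCII bytes `GGUF`, so an interrupted download or a saved HTML error page is easy to spot. `is_gguf` is a helper name invented for this example.

```shell
# is_gguf FILE -> succeeds when FILE starts with the 4-byte ASCII magic "GGUF".
is_gguf() {
  # tr strips NUL bytes so the shell never sees binary zeros in the substitution
  [ "$(head -c 4 "$1" 2>/dev/null | tr -d '\0')" = "GGUF" ]
}

# Check every model in the download directory (assumed to be funasr-gguf/).
for f in funasr-gguf/*.gguf; do
  [ -e "$f" ] || continue   # nothing downloaded yet
  if is_gguf "$f"; then echo "ok   $f"; else echo "bad  $f"; fi
done
```

This only confirms the header, not that the file is complete; comparing the file size with the one shown on the model page is a reasonable second check.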
The whole FunASR stack is open source: ASR, VAD, punctuation, speaker, emotion and events, and LLM-ASR, now with an on-device llama.cpp runtime. A GitHub star really helps.
Also star: SenseVoice · Fun-ASR · FunClip