Lightweight speech-to-text: Chinese ASR in ~250 MB on CPU (no GPU, no Python)

2026-06-22 · FunASR Team

Most speech recognition stacks assume a GPU, a Python environment, and multi-GB models. If you just need to turn Chinese speech into text on a laptop, a small server, or an edge device, that is a lot of overhead.

FunASR's llama.cpp / GGUF runtime strips it down to the minimum: **one self-contained binary + one quantized model file**, pure CPU, zero Python. The q8 build of SenseVoice is just **254 MB** with virtually unchanged accuracy.
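Why does an 8-bit build lose so little accuracy? Consider the arithmetic of ggml's Q8_0 scheme. This assumes that "q8" here means Q8_0; the post does not say which 8-bit variant is used. Weights in each quantized tensor are grouped into blocks of 32. Each block stores one fp16 scale $d$ and 32 signed 8-bit integers $q_i$:

$$
d = \frac{\max_{i} |x_i|}{127}, \qquad q_i = \operatorname{round}\!\left(\frac{x_i}{d}\right) \in [-127, 127], \qquad \hat{x}_i = d\, q_i
$$

$$
|x_i - \hat{x}_i| \le \frac{d}{2}, \qquad \text{bits per weight} = \frac{16 + 32 \cdot 8}{32} = 8.5
$$

Ignoring the small rounding of $d$ itself to fp16, each weight is therefore off by at most half a quantization step of its own block. Storage drops from 32 bits (fp32) to 8.5 bits per weight, roughly 3.8× smaller.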

Tested: a 254 MB model, 0.16 s on CPU (real output)

Grab a prebuilt binary (Linux/macOS/Windows) and a q8 model, then run:

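```bash
# Prebuilt binaries are on GitHub Releases; then fetch a model:
bash download-funasr-model.sh sensevoice ./gguf
llama-funasr-sensevoice -m ./gguf/sensevoice-small-q8.gguf --vad ./gguf/fsmn-vad.gguf -a audio.wav
# → 欢迎大家来体验达摩院推出的语音识别模型。   (CPU, 0.16 s)
#   ("Welcome, everyone, to try out the speech recognition model released by DAMO Academy.")
```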

The model file is 243–254 MB and the VAD model just 1.7 MB. Detokenization is built into the binary, so no Python is required.
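A small loop can run the same command over a whole folder of recordings. This is a minimal sketch of our own, not part of FunASR. The function name `transcribe_dir`, the `FUNASR_BIN`/`MODEL`/`VAD` overrides, and the one-`.txt`-per-`.wav` layout are hypothetical choices. The sketch also assumes the binary prints the transcript to stdout; if it prints log lines there too, filter them before saving.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): transcribe every .wav in a directory using
# the quickstart command. Assumes llama-funasr-sensevoice is on PATH and the
# models were fetched into ./gguf by download-funasr-model.sh.
set -euo pipefail   # stop at the first failing file

BIN="${FUNASR_BIN:-llama-funasr-sensevoice}"
MODEL="${MODEL:-./gguf/sensevoice-small-q8.gguf}"
VAD="${VAD:-./gguf/fsmn-vad.gguf}"

transcribe_dir() {
  local dir="$1" wav
  for wav in "$dir"/*.wav; do
    # With no .wav files the glob stays literal, so skip it.
    [ -e "$wav" ] || continue
    # Same flags as the quickstart; stdout is saved next to the audio.
    "$BIN" -m "$MODEL" --vad "$VAD" -a "$wav" > "${wav%.wav}.txt"
  done
}
```

For example, `transcribe_dir ./recordings` writes `./recordings/foo.txt` for each `./recordings/foo.wav`.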

Small and accurate: vs whisper.cpp

All four models run on CPU (Chinese test set of 184 files, micro-averaged CER; lower is better):

| Model | Size | Chinese CER ↓ |
|---|---|---|
| FunASR SenseVoice q8 | 254 MB | 7.99% |
| FunASR Paraformer q8 | 237 MB | 9.78% |
| whisper.cpp small | 466 MB | 22.12% |
| whisper.cpp large-v3-turbo | 1.6 GB | 23.15% |
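In the usual sense, "micro-averaged" CER pools errors across the whole test set rather than averaging per-file rates, so long files weigh more than short ones. The sketch below illustrates the difference. The function name `cer_averages` and its input format are hypothetical: one line per file giving the character-level edit-distance error count and the reference character count. Computing those per-file counts is left out.

```bash
#!/usr/bin/env bash
# Sketch: micro- vs macro-averaged CER from per-file counts.
# Hypothetical input, one line per file: "<errors> <ref_chars>", where
# errors = substitutions + deletions + insertions for that file.
cer_averages() {
  LC_ALL=C awk '
    $2 > 0 {
      errs  += $1          # pooled error count (micro numerator)
      chars += $2          # pooled reference length (micro denominator)
      macro += $1 / $2     # sum of per-file CERs (macro numerator)
      n++
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "micro=%.4f macro=%.4f\n", errs / chars, macro / n
    }'
}

# Usage: a short file with 2/10 errors and a long one with 5/100.
printf '2 10\n5 100\n' | cer_averages
# → micro=0.0636 macro=0.1250
```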

FunASR SenseVoice q8 is just over half the size of whisper.cpp small (254 MB vs 466 MB), yet its Chinese CER is roughly one-third as high (7.99% vs 22.12%).
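Here is the size and accuracy comparison between SenseVoice q8 and whisper.cpp small, worked out from the table:

$$
\frac{254\ \text{MB}}{466\ \text{MB}} \approx 0.55, \qquad \frac{\text{CER}_{\text{whisper small}}}{\text{CER}_{\text{SenseVoice q8}}} = \frac{22.12\%}{7.99\%} \approx 2.77
$$

In other words, whisper.cpp small makes about 2.8 times as many character errors. Against Paraformer q8 (9.78%), the ratio is about 2.26.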

Even smaller: the quant matrix

| Model | Quantization | Size | CER |
|---|---|---|---|
| SenseVoice encoder | q8 | 254 MB | 7.99% |
| Paraformer encoder | q8 | 237 MB | 9.78% |
| Fun-ASR-Nano LLM (+ encoder 470 MB) | q4_K_M | 484 MB | 8.35% |
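Suppose the Fun-ASR-Nano row means a 484 MB q4_K_M LLM plus a separate 470 MB encoder (our reading of the "+ encoder 470 MB" note). Its total on-disk footprint then compares with SenseVoice q8 as follows:

$$
484\ \text{MB} + 470\ \text{MB} = 954\ \text{MB}, \qquad \frac{954\ \text{MB}}{254\ \text{MB}} \approx 3.76
$$

Under that reading, SenseVoice q8 has the lowest CER and Paraformer q8 the smallest file. Fun-ASR-Nano needs nearly 4× the disk space of SenseVoice q8 for a CER of 8.35%.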

Prebuilt binaries are available for Linux x64 and arm64, macOS arm64, and Windows x64. The Linux arm64 build suits Raspberry Pi and other edge boxes.
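`uname` is enough to tell which of the four prebuilt targets fits a given machine. The sketch below is ours, not a FunASR script. The function name `pick_platform` and the labels it prints are hypothetical, so match the label by hand against the real asset names on the Releases page. A Raspberry Pi running a 32-bit OS reports `armv7l`, which has no matching build in this list.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): map `uname -s` / `uname -m` to one of the
# four prebuilt targets listed above. Labels are illustrative only.
pick_platform() {
  local os="${1:-$(uname -s)}" arch="${2:-$(uname -m)}"
  case "$os/$arch" in
    Linux/x86_64)              echo "linux-x64" ;;
    Linux/aarch64|Linux/arm64) echo "linux-arm64" ;;  # e.g. Raspberry Pi on a 64-bit OS
    Darwin/arm64)              echo "macos-arm64" ;;  # Apple silicon only
    MINGW64*/x86_64|MSYS*/x86_64|CYGWIN*/x86_64)
                               echo "windows-x64" ;;  # Git Bash / MSYS2 / Cygwin
    *) echo "no prebuilt binary for $os/$arch" >&2; return 1 ;;
  esac
}
```

Running `pick_platform` with no arguments checks the current machine.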

Where to get it

- Prebuilt binaries: GitHub Releases (tag `runtime-llamacpp-v*`)
- One-page quickstart: funasr.com/llama-cpp
- GGUF models: Hugging Face / ModelScope

FunASR is fully open source and commercial-friendly. A GitHub star really helps 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip
