Lightweight speech-to-text: Chinese ASR in ~250 MB on CPU (no GPU, no Python)

Most speech recognition wants a GPU, a Python environment, and multi-GB models. If you just need to turn Chinese speech into text on a laptop, a small server, or an edge device, that's a lot of overhead.

FunASR's llama.cpp / GGUF runtime strips it down to the minimum: **one self-contained binary + one quantized model file**, pure CPU, zero Python. The q8 build of SenseVoice is just **254 MB** with virtually unchanged accuracy.

Tested: a 254 MB model, 0.16 s on CPU (real output)

Grab a prebuilt binary (Linux/macOS/Windows) and a q8 model, then run:

    # binaries are on Releases; then fetch a model
    bash download-funasr-model.sh sensevoice ./gguf
    llama-funasr-sensevoice -m ./gguf/sensevoice-small-q8.gguf --vad ./gguf/fsmn-vad.gguf -a audio.wav
    # → 欢迎大家来体验达摩院推出的语音识别模型。   (CPU, 0.16 s)

The model is 243–254 MB, the VAD just 1.7 MB, and detokenization is built into the binary (no Python).
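The single-file command above extends naturally to a folder of recordings. The following is a minimal sketch, not part of FunASR: it assumes the binary writes the transcript to stdout (as the sample output suggests) and may need filtering if the binary also prints log lines there. `transcribe_dir`, `ASR_BIN`, `MODEL`, and `VAD` are names introduced here for illustration; only the `-m`, `--vad`, and `-a` flags come from the command above.

```shell
#!/usr/bin/env bash
# Batch-transcribe every .wav file in a directory, one .txt per input file.
# Assumption: the transcript is written to stdout.
set -eu

ASR_BIN="${ASR_BIN:-llama-funasr-sensevoice}"      # binary name from the post
MODEL="${MODEL:-./gguf/sensevoice-small-q8.gguf}"  # q8 model fetched earlier
VAD="${VAD:-./gguf/fsmn-vad.gguf}"                 # 1.7 MB VAD model

transcribe_dir() {
  in_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  for wav in "$in_dir"/*.wav; do
    [ -e "$wav" ] || continue          # glob matched nothing: skip
    base="$(basename "$wav" .wav)"
    "$ASR_BIN" -m "$MODEL" --vad "$VAD" -a "$wav" > "$out_dir/$base.txt"
    echo "done: $wav -> $out_dir/$base.txt"
  done
}
```

Call it as `transcribe_dir ./recordings ./transcripts`. Setting `ASR_BIN` or `MODEL` in the environment swaps in a different binary or quantization without editing the script.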

Small and accurate: vs whisper.cpp

Both on CPU (Chinese, 184-file micro-CER, lower is better):

| Model | Size | Chinese CER ↓ |
| --- | --- | --- |
| FunASR SenseVoice q8 | 254 MB | 7.99% |
| FunASR Paraformer q8 | 237 MB | 9.78% |
| whisper.cpp small | 466 MB | 22.12% |
| whisper.cpp large-v3-turbo | 1.6 GB | 23.15% |
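For readers new to the metric: a micro-averaged CER is usually computed by pooling character errors over all files before dividing, so longer recordings carry more weight than short ones. The post does not describe its scoring script, so the formula below is the standard definition rather than a confirmed description of this benchmark. Here $S_i$, $D_i$, and $I_i$ are the substitutions, deletions, and insertions in file $i$, and $N_i$ is the number of characters in its reference transcript.

```latex
\mathrm{CER}_{\text{micro}}
  = \frac{\sum_{i=1}^{184} \left( S_i + D_i + I_i \right)}{\sum_{i=1}^{184} N_i}
```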

FunASR's SenseVoice q8 model is about 45% smaller than whisper.cpp small (254 MB vs 466 MB), yet its Chinese CER is roughly 2.8× lower (7.99% vs 22.12%).
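The size and accuracy comparison can be checked directly from the table figures. This is a small sketch using `awk` for the floating-point division; `ratio` is a helper name introduced here.

```shell
# Divide two numbers and print the result with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

ratio 22.12 7.99   # whisper.cpp small CER / SenseVoice q8 CER            -> 2.77
ratio 23.15 7.99   # whisper.cpp large-v3-turbo CER / SenseVoice q8 CER   -> 2.90
ratio 466 254      # whisper.cpp small size / SenseVoice q8 size (MB)     -> 1.83
```

So on this test set the SenseVoice q8 error rate is a little over a third of whisper.cpp small's, at a bit more than half the file size.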

Even smaller: the quant matrix

| Model | Type | Size | CER |
| --- | --- | --- | --- |
| SenseVoice encoder | q8 | 254 MB | 7.99% |
| Paraformer encoder | q8 | 237 MB | 9.78% |
| Fun-ASR-Nano LLM (+ encoder 470 MB) | q4_K_M | 484 MB | 8.35% |

Prebuilt binaries are available for Linux x64 / arm64, macOS arm64, and Windows x64. The arm64 build suits Raspberry Pi and edge boxes.
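If you script the download, you can pick the matching build from `uname`. The sketch below maps the current machine to one of the four targets listed above. The labels it prints (`linux-x64` and so on) are illustrative placeholders, not the actual release asset names, and `detect_build` is a helper introduced here.

```shell
# Print a label for the prebuilt target matching this machine.
# Optional arguments override `uname -s` and `uname -m` (handy for testing).
# The printed labels are hypothetical, not real release file names.
detect_build() {
  os="${1:-$(uname -s)}"
  arch="${2:-$(uname -m)}"
  case "$os/$arch" in
    Linux/x86_64)                              echo "linux-x64" ;;
    Linux/aarch64|Linux/arm64)                 echo "linux-arm64" ;;
    Darwin/arm64)                              echo "macos-arm64" ;;
    MINGW*/x86_64|MSYS*/x86_64|CYGWIN*/x86_64) echo "windows-x64" ;;
    *) echo "no prebuilt binary listed for $os/$arch" >&2; return 1 ;;
  esac
}
```

Note that a Raspberry Pi running a 32-bit OS reports `armv7l`, which falls through to the error branch; the arm64 build needs a 64-bit OS.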

Where to get it

FunASR is fully open-source & commercial-friendly. A GitHub Star really helps 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip
