FunASR vs Whisper: A Real Benchmark on Chinese Speech Recognition
We benchmarked the FunASR model family against OpenAI Whisper on 184 long-form Chinese audio files (11,539s ≈ 192 min) on a single NVIDIA H100, measuring speed (RTFx) and accuracy (CER). The takeaway: for Chinese, FunASR is both faster and more accurate.
| Model | Device | RTFx (higher = faster) | CER (lower = better) |
|---|---|---|---|
| SenseVoice-Small | GPU | 169.6x | 7.81% |
| Paraformer-Large | GPU | 119.6x | 10.18% |
| Fun-ASR-Nano | GPU | 340x (vLLM) | 8.20% |
| Whisper-large-v3-turbo | GPU | 46.1x | 21.71% |
| Whisper-large-v3 | GPU | 13.4x | 20.02% |
| SenseVoice-Small | CPU | 17.2x | 7.81% |
Speed
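SenseVoice-Small reaches 169.6x realtime on GPU, about 12.7 times the throughput of Whisper-large-v3 (13.4x) and about 3.7 times that of Whisper-large-v3-turbo (46.1x). Even on CPU, SenseVoice-Small hits 17.2x, which is faster than Whisper-large-v3 on GPU, though not faster than the turbo variant. Paraformer-Large does 119.6x, and Fun-ASR-Nano reaches 340x when served with vLLM.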
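The ratios above follow directly from the table. As a quick sanity check, the sketch below recomputes them and converts each RTFx into the wall-clock time it implies for the 11,539s corpus. It assumes RTFx is audio duration divided by inference time; the implied times are derived from the table, not separately measured, and the function names are illustrative.

```python
# RTFx figures copied from the benchmark table above.
AUDIO_SECONDS = 11_539  # total length of the 184-file corpus

RTFX = {
    "SenseVoice-Small (GPU)": 169.6,
    "Paraformer-Large (GPU)": 119.6,
    "Whisper-large-v3-turbo (GPU)": 46.1,
    "Whisper-large-v3 (GPU)": 13.4,
    "SenseVoice-Small (CPU)": 17.2,
}


def speedup(fast: str, slow: str) -> float:
    """Throughput ratio between two table rows."""
    return RTFX[fast] / RTFX[slow]


def implied_seconds(name: str) -> float:
    """Wall-clock time implied by RTFx = audio duration / inference time."""
    return AUDIO_SECONDS / RTFX[name]


for name in RTFX:
    print(f"{name}: ~{implied_seconds(name):.0f}s to transcribe the corpus")

print(f"SenseVoice GPU vs Whisper-large-v3: "
      f"{speedup('SenseVoice-Small (GPU)', 'Whisper-large-v3 (GPU)'):.1f}x")
print(f"SenseVoice CPU vs Whisper-large-v3 GPU: "
      f"{speedup('SenseVoice-Small (CPU)', 'Whisper-large-v3 (GPU)'):.2f}x")
```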
Accuracy
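Chinese CER: SenseVoice 7.81%, Fun-ASR-Nano 8.20%, and Paraformer 10.18%, versus Whisper-large-v3 at 20.02% (turbo 21.71%). Relative to Whisper-large-v3, SenseVoice makes about 39% as many character errors and Paraformer about 51%, so roughly half the error rate or better.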
Why FunASR is faster
SenseVoice / Paraformer are non-autoregressive: one forward pass yields the full transcript, unlike Whisper's token-by-token autoregressive decoding. Combined with training data tuned for Chinese and Asian languages, FunASR wins on both speed and accuracy in Chinese.
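To make the decoding difference concrete, here is a toy sketch that only counts forward passes. `CountingModel` is a hypothetical stand-in rather than anything from FunASR or Whisper: it returns a fixed transcript and ignores audio, encoders, and batching entirely. The point it illustrates is that autoregressive decoding needs one pass per output token (plus one for the end-of-sequence token), while non-autoregressive decoding needs a single pass regardless of transcript length.

```python
from typing import List

EOS = "<eos>"


class CountingModel:
    """Hypothetical decoder stand-in that counts forward passes."""

    def __init__(self, target: List[str]) -> None:
        self.target = target
        self.forward_passes = 0

    def next_token(self, prefix: List[str]) -> str:
        """Autoregressive step: one forward pass yields one token."""
        self.forward_passes += 1
        if len(prefix) < len(self.target):
            return self.target[len(prefix)]
        return EOS

    def all_tokens(self) -> List[str]:
        """Non-autoregressive step: one forward pass yields every token."""
        self.forward_passes += 1
        return list(self.target)


def decode_autoregressive(model: CountingModel, max_len: int = 1000) -> List[str]:
    """Generate token by token, feeding the growing prefix back in."""
    out: List[str] = []
    while len(out) < max_len:
        token = model.next_token(out)
        if token == EOS:
            break
        out.append(token)
    return out


def decode_non_autoregressive(model: CountingModel) -> List[str]:
    """Predict the whole sequence in a single pass."""
    return model.all_tokens()


transcript = list("今天天气很好")
ar_model = CountingModel(transcript)
nar_model = CountingModel(transcript)
ar_out = decode_autoregressive(ar_model)
nar_out = decode_non_autoregressive(nar_model)
print("autoregressive passes:", ar_model.forward_passes)
print("non-autoregressive passes:", nar_model.forward_passes)
```

For a transcript of N tokens the autoregressive loop runs N + 1 passes, which is why the gap widens on long-form audio. Real systems narrow it with key-value caching and batching, so the pass count is an intuition, not a speed prediction.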
Get started with FunASR
FunASR is open-source, commercial-friendly, and runs on both CPU and GPU. If it helps, star it ⭐
FunASR GitHub
Read more: SenseVoice Guide · Fun-ASR-Nano Guide
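For orientation, a minimal transcription sketch with SenseVoice-Small might look like the following. The model ID `iic/SenseVoiceSmall`, the `fsmn-vad` segmenter, and the keyword arguments are assumptions based on the usage pattern in the public SenseVoice README and may differ between FunASR versions. The `transcribe` wrapper is a hypothetical helper, and this is not the harness used for the benchmark.

```python
def transcribe(audio_path: str, device: str = "cpu") -> str:
    """Transcribe one audio file with SenseVoice-Small (sketch, see assumptions).

    funasr is imported lazily so the module loads even where it is not
    installed; install it first with `pip install funasr`.
    """
    from funasr import AutoModel
    from funasr.utils.postprocess_utils import rich_transcription_postprocess

    model = AutoModel(
        model="iic/SenseVoiceSmall",   # assumed ModelScope model ID
        vad_model="fsmn-vad",          # assumed VAD used to split long audio
        vad_kwargs={"max_single_segment_time": 30000},
        device=device,                 # e.g. "cuda:0" for GPU
    )
    result = model.generate(
        input=audio_path,
        cache={},
        language="zh",   # Chinese; "auto" is also commonly used
        use_itn=True,    # add punctuation and inverse text normalization
    )
    # generate() returns a list of dicts; strip SenseVoice's rich tags.
    return rich_transcription_postprocess(result[0]["text"])


# Example call (placeholder path):
# print(transcribe("example_zh.wav", device="cuda:0"))
```

Loading the model inside the function keeps the sketch short. For more than one file, build the `AutoModel` once and reuse it.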
Setup: 184 files / 11,539s of Chinese audio, NVIDIA H100. RTF = inference time / audio duration; Speed (the RTFx column) = 1 / RTF; CER is computed after punctuation removal.
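The metric definitions above can be sketched in a few lines. This is an illustrative implementation, not the benchmark's scoring script. It assumes punctuation means any Unicode punctuation character, that whitespace is also ignored, and that CER is character-level edit distance divided by reference length. Whether the reported CER is pooled over the corpus or averaged per file is not stated in the source.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation (categories P*) and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate after punctuation removal."""
    ref = strip_punctuation(reference)
    hyp = strip_punctuation(hypothesis)
    if not ref:
        raise ValueError("reference is empty after punctuation removal")
    return edit_distance(ref, hyp) / len(ref)


def rtf(infer_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: inference time / audio duration (lower is faster)."""
    return infer_seconds / audio_seconds


def rtfx(infer_seconds: float, audio_seconds: float) -> float:
    """Speed: 1 / RTF (higher is faster)."""
    return audio_seconds / infer_seconds


# One substituted character out of six reference characters.
print(cer("今天天气很好。", "今天天汽很好"))
# 100s of compute for 1,000s of audio.
print(rtf(100.0, 1000.0), rtfx(100.0, 1000.0))
```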