FunASR vs Whisper: A Real Benchmark on Chinese Speech Recognition

We benchmarked the FunASR model family against OpenAI Whisper on 184 long-form Chinese audio files (11,539s ≈ 192 min) on a single NVIDIA H100, measuring speed (RTFx) and accuracy (CER). The takeaway: for Chinese, FunASR is both faster and more accurate.

| Model | Device | RTFx (higher = faster) | CER (lower = better) |
|---|---|---|---|
| SenseVoice-Small | GPU | 169.6x | 7.81% |
| Paraformer-Large | GPU | 119.6x | 10.18% |
| Fun-ASR-Nano | GPU | 340x (vLLM) | 8.20% |
| Whisper-large-v3-turbo | GPU | 46.1x | 21.71% |
| Whisper-large-v3 | GPU | 13.4x | 20.02% |
| SenseVoice-Small | CPU | 17.2x | 7.81% |

Speed

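SenseVoice-Small reaches 169.6x realtime on GPU, about 12.7x the throughput of Whisper-large-v3 (13.4x) and about 3.7x that of Whisper-large-v3-turbo (46.1x). Even on CPU, SenseVoice-Small hits 17.2x, which is faster than Whisper-large-v3 on GPU, though not faster than the turbo variant. Paraformer-Large does 119.6x, and Fun-ASR-Nano served with vLLM tops the table at 340x.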
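To make the RTFx figures concrete, the sketch below converts them into the wall-clock time needed to transcribe the whole 11,539s test set, using time = audio duration / RTFx. The numbers are copied from the results table; the function and variable names are illustrative, and the resulting times are implied by the reported RTFx values rather than separately measured.

```python
# Total audio in the benchmark and the reported RTFx per configuration.
AUDIO_SECONDS = 11_539

RTFX = {
    "Fun-ASR-Nano (GPU, vLLM)": 340.0,
    "SenseVoice-Small (GPU)": 169.6,
    "Paraformer-Large (GPU)": 119.6,
    "Whisper-large-v3-turbo (GPU)": 46.1,
    "Whisper-large-v3 (GPU)": 13.4,
    "SenseVoice-Small (CPU)": 17.2,
}


def wall_clock_seconds(audio_seconds: float, rtfx: float) -> float:
    """Processing time implied by an RTFx value (RTFx = audio / processing time)."""
    return audio_seconds / rtfx


def speedup(rtfx_a: float, rtfx_b: float) -> float:
    """How many times faster configuration A is than configuration B."""
    return rtfx_a / rtfx_b


if __name__ == "__main__":
    for name, value in RTFX.items():
        seconds = wall_clock_seconds(AUDIO_SECONDS, value)
        print(f"{name:32s} {seconds:8.1f} s")
```

By this arithmetic, SenseVoice-Small on GPU would finish the 192 minutes of audio in a little over a minute, while Whisper-large-v3 would need more than 14 minutes.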

Accuracy

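Chinese CER: SenseVoice-Small 7.81%, Fun-ASR-Nano 8.20%, Paraformer-Large 10.18%, versus Whisper-large-v3 at 20.02% (turbo 21.71%). Against Whisper-large-v3, that is roughly half the error rate for Paraformer-Large and about 40% of it for SenseVoice-Small.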

Why FunASR is faster

SenseVoice / Paraformer are non-autoregressive: one forward pass yields the full transcript, unlike Whisper's token-by-token autoregressive decoding. Combined with training data tuned for Chinese and Asian languages, FunASR wins on both speed and accuracy in Chinese.
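The difference in decoding style can be illustrated with a toy sketch that only counts model calls. Nothing here is real FunASR or Whisper code: `toy_step` and `toy_predict_all` are hypothetical stand-ins for a network, and the point is simply that autoregressive decoding needs one sequential call per output token, while a non-autoregressive model emits the whole sequence in one call.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
TARGET = list("今天天气很好")  # toy "transcript" the stand-in models will emit


def autoregressive_decode(
    step: Callable[[List[str]], str], max_len: int = 32
) -> Tuple[List[str], int]:
    """Generate one token per call, feeding the prefix back in each time."""
    tokens: List[str] = []
    calls = 0
    while len(tokens) < max_len:
        next_token = step(tokens)  # each call depends on all previous tokens
        calls += 1
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens, calls


def non_autoregressive_decode(
    predict_all: Callable[[], List[str]]
) -> Tuple[List[str], int]:
    """Produce every token in a single call."""
    return predict_all(), 1


def toy_step(prefix: List[str]) -> str:
    # Hypothetical stand-in for an autoregressive decoder step.
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_predict_all() -> List[str]:
    # Hypothetical stand-in for a non-autoregressive forward pass.
    return list(TARGET)


if __name__ == "__main__":
    ar_tokens, ar_calls = autoregressive_decode(toy_step)
    nar_tokens, nar_calls = non_autoregressive_decode(toy_predict_all)
    print("autoregressive calls:", ar_calls)
    print("non-autoregressive calls:", nar_calls)
```

In the toy, the sequential loop cannot be parallelized across output positions because each step consumes the previous tokens, which is the structural reason the number of calls grows with transcript length.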

Get started with FunASR

FunASR is open-source, commercial-friendly, and runs on CPU or GPU. If it helps, star it on GitHub ⭐
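A minimal sketch of transcribing one Chinese audio file with SenseVoice-Small might look like the following. It assumes the `funasr` package is installed and follows the `AutoModel` usage shown in the project's README as best recalled; the model id `iic/SenseVoiceSmall`, the VAD settings, and the generation arguments are assumptions to check against the installed version, and this is not necessarily the configuration used for the benchmark above.

```python
def transcribe(audio_path: str, device: str = "cuda:0") -> str:
    """Transcribe one audio file with SenseVoice-Small (sketch, see assumptions)."""
    # Imported lazily so this module still loads where funasr is not installed.
    from funasr import AutoModel
    from funasr.utils.postprocess_utils import rich_transcription_postprocess

    model = AutoModel(
        model="iic/SenseVoiceSmall",  # assumed model id
        vad_model="fsmn-vad",         # voice activity detection to split long audio
        vad_kwargs={"max_single_segment_time": 30000},  # milliseconds
        device=device,                # "cpu" to run without a GPU
    )
    result = model.generate(
        input=audio_path,
        cache={},
        language="zh",        # force Chinese instead of auto-detection
        use_itn=True,         # inverse text normalization and punctuation
        batch_size_s=60,
        merge_vad=True,
        merge_length_s=15,
    )
    # Strip SenseVoice's special tags from the raw output text.
    return rich_transcription_postprocess(result[0]["text"])
```

Calling `transcribe("meeting.wav")` (a hypothetical file name) would return the transcript as a string; passing `device="cpu"` runs the same model without a GPU.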


Read more: SenseVoice Guide · Fun-ASR-Nano Guide

Setup: 184 files / 11,539s of Chinese audio on an NVIDIA H100. RTF = inference time / audio duration; RTFx (speed) = 1 / RTF. CER is computed after punctuation removal.
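The two metrics can be sketched in a few lines of Python. This is an illustration of the definitions above, not the benchmark's actual scoring script: it assumes CER is character-level Levenshtein distance divided by reference length, that punctuation and whitespace are stripped from both sides first, and it scores a single utterance (how per-file results were aggregated is not stated here).

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove punctuation (Unicode category P*) and whitespace."""
    return "".join(
        ch
        for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate after punctuation removal."""
    ref = strip_punctuation(reference)
    hyp = strip_punctuation(hypothesis)
    if not ref:
        raise ValueError("reference is empty after punctuation removal")
    return edit_distance(ref, hyp) / len(ref)


def rtfx(infer_seconds: float, audio_seconds: float) -> float:
    """RTFx = 1 / RTF, where RTF = inference time / audio duration."""
    rtf = infer_seconds / audio_seconds
    return 1.0 / rtf


if __name__ == "__main__":
    # One wrong character out of six, punctuation ignored.
    print(f"CER:  {cer('今天天气很好。', '今天天汽很好'):.4f}")
    print(f"RTFx: {rtfx(infer_seconds=10.0, audio_seconds=1696.0):.1f}")
```

Stripping punctuation before scoring matters because models differ in whether and how they punctuate, and those differences would otherwise count as character errors.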
