FunASR vs Whisper: A Real Benchmark on Chinese Speech Recognition

FunASR Blog · 2026-06-16 · Benchmark

We benchmarked the FunASR model family against OpenAI Whisper on 184 long-form Chinese audio files (11,539s ≈ 192 min) on a single NVIDIA H100, measuring speed (RTFx) and accuracy (CER). The takeaway: for Chinese, FunASR is both faster and more accurate.

Model	Device	RTFx (higher = faster)	CER (lower = better)
SenseVoice-Small	GPU	169.6x	7.81%
Paraformer-Large	GPU	119.6x	10.18%
Fun-ASR-Nano	GPU	340x (vLLM)	8.20%
Whisper-large-v3-turbo	GPU	46.1x	21.71%
Whisper-large-v3	GPU	13.4x	20.02%
SenseVoice-Small	CPU	17.2x	7.81%

Speed

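SenseVoice-Small reaches 169.6x realtime on GPU, about 12.7 times the throughput of Whisper-large-v3 (13.4x) and about 3.7 times that of Whisper-large-v3-turbo (46.1x). Even on CPU, SenseVoice-Small hits 17.2x, which beats Whisper-large-v3 on GPU, though not the turbo variant. Paraformer-Large does 119.6x, and Fun-ASR-Nano served through vLLM reaches 340x, the highest figure in the table.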
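To make the RTFx figures concrete, the short sketch below converts them into wall-clock time for the full 11,539s test set and into a speedup relative to Whisper-large-v3. The RTFx values are copied from the table above; the helper names (`wall_clock_seconds`, `speedup`) are ours for illustration, and the derived times are plain arithmetic, not separately measured results.

```python
# Total duration of the 184 test files, as stated in the post.
AUDIO_SECONDS = 11_539

# RTFx values copied from the benchmark table.
RTFX = {
    "SenseVoice-Small (GPU)": 169.6,
    "Paraformer-Large (GPU)": 119.6,
    "Whisper-large-v3-turbo (GPU)": 46.1,
    "Whisper-large-v3 (GPU)": 13.4,
    "SenseVoice-Small (CPU)": 17.2,
}


def wall_clock_seconds(rtfx: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """RTFx = audio duration / inference time, so inference time = audio / RTFx."""
    return audio_seconds / rtfx


def speedup(rtfx_a: float, rtfx_b: float) -> float:
    """How many times faster system A is than system B."""
    return rtfx_a / rtfx_b


baseline = RTFX["Whisper-large-v3 (GPU)"]
for name, rtfx in RTFX.items():
    seconds = wall_clock_seconds(rtfx)
    ratio = speedup(rtfx, baseline)
    print(f"{name:30s} {seconds:7.1f} s   {ratio:5.2f}x vs Whisper-large-v3")
```

Read this way, the gap is easy to picture: at 169.6x the whole test set takes a little over a minute, while at 13.4x it takes more than fourteen minutes.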

Accuracy

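Chinese CER: SenseVoice-Small 7.81%, Fun-ASR-Nano 8.20%, and Paraformer-Large 10.18%, versus 20.02% for Whisper-large-v3 and 21.71% for turbo. Paraformer-Large makes roughly half as many character errors as Whisper-large-v3; SenseVoice-Small and Fun-ASR-Nano make about 40% as many.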
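For readers unfamiliar with the metric, here is a minimal sketch of character error rate: the character-level edit distance between reference and hypothesis, divided by the reference length, computed after punctuation is stripped. This is not the benchmark's own scoring script. The normalization details (dropping Unicode punctuation and whitespace) and the function names are assumptions for illustration, and the post does not say whether CER was averaged per file or pooled over the corpus.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters in a Unicode punctuation category (P*) and all whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length, after normalization."""
    ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character out of six after the full stop is removed.
print(f"{cer('今天天气很好。', '今天天汽很好'):.4f}")
```

Because CER counts characters rather than words, it suits Chinese, where word boundaries are not marked in the text.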

Why FunASR is faster

SenseVoice and Paraformer are non-autoregressive: a single forward pass yields the full transcript, whereas Whisper decodes autoregressively, one token at a time. That design drives the speed advantage, and training data tuned for Chinese and other Asian languages helps on accuracy, so FunASR wins on both counts in Chinese.
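The difference in decoding style can be illustrated with a toy sketch that only counts sequential model calls. Nothing here is FunASR or Whisper code: `toy_step` and `toy_predict_all` are hypothetical stand-ins that replay a fixed transcript, and real throughput also depends on model size, batching, and caching.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"


def autoregressive_decode(step: Callable[[List[str]], str],
                          max_len: int = 100) -> Tuple[List[str], int]:
    """Call the decoder once per output token, feeding back the tokens so far."""
    tokens: List[str] = []
    calls = 0
    while len(tokens) < max_len:
        calls += 1
        nxt = step(tokens)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens, calls


def non_autoregressive_decode(predict_all: Callable[[], List[str]]) -> Tuple[List[str], int]:
    """Call the model once; it returns every output position together."""
    return predict_all(), 1


# Hypothetical stand-in "models" that replay a fixed six-character transcript.
TARGET = list("语音识别测试")


def toy_step(prefix: List[str]) -> str:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_predict_all() -> List[str]:
    return list(TARGET)


ar_tokens, ar_calls = autoregressive_decode(toy_step)
nar_tokens, nar_calls = non_autoregressive_decode(toy_predict_all)
print("autoregressive calls:", ar_calls)
print("non-autoregressive calls:", nar_calls)
```

In the autoregressive loop the number of sequential calls grows with transcript length, which matters most on long-form audio like the files used here.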

Get started with FunASR

FunASR is open source, commercial-friendly, and runs on both CPU and GPU. If it helps, star it ⭐

FunASR GitHub ★

Read more: SenseVoice Guide · Fun-ASR-Nano Guide

Setup: 184 files / 11,539s of Chinese audio, single NVIDIA H100. RTF = inference time / audio duration; RTFx = 1 / RTF. CER is computed after punctuation removal.
