Which FunASR model? SenseVoice vs Paraformer vs Fun-ASR-Nano

2026-06-23 · FunASR Team

FunASR ships three main ASR models. In one line: multilingual + emotion/events and fast → SenseVoice; Chinese production + word timestamps/hotwords → Paraformer; highest accuracy + context/hotwords across 31 languages → Fun-ASR-Nano. Details below.

Pick in one table

Model	Languages	Chinese CER ↓	Arch / speed	Highlights	Best for
SenseVoice	50+ (zh/yue/en/ja/ko…)	7.81%	non-AR CTC, ~170x	emotion + audio events + language ID	multilingual, emotion, real-time/low latency
Paraformer	Chinese (+ English variant)	10.18%	non-AR CIF, ~120x	word timestamps, hotwords (SeACo), streaming	Chinese production, subtitles/timestamps, hotwords
Fun-ASR-Nano	31	8.06%	LLM (Qwen3-0.6B), up to 340x with vLLM	context/hotword prompting, LLM decoding	highest accuracy, context-aware, broad languages

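(Chinese CER is measured on the same 184-file set for all three models, micro-averaged with normalize_zh applied. Speed is the multiple of real time on GPU, i.e. seconds of audio processed per second of compute, so higher is faster.)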
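For readers unfamiliar with the metric, here is a minimal sketch of micro-averaged CER and of the speed multiple. The function names and sample strings are made up for illustration, and the normalize_zh step is not reproduced, so this is not the evaluation script behind the table.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 cost on a match)
            ))
        prev = curr
    return prev[-1]


def micro_cer(pairs):
    """Pool errors and reference characters over all files, then divide once."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / chars


def macro_cer(pairs):
    """Average of per-file CERs, shown only for contrast with micro_cer."""
    return sum(edit_distance(ref, hyp) / len(ref) for ref, hyp in pairs) / len(pairs)


def speed_multiple(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of compute (higher is faster)."""
    return audio_seconds / wall_seconds


# Hypothetical (reference, hypothesis) pairs, not taken from the 184-file set.
pairs = [("今天天气很好", "今天天气很好"), ("你好", "你号")]
print(micro_cer(pairs))  # 1 error over 8 reference characters
print(macro_cer(pairs))  # mean of 0/6 and 1/2
```

Micro-averaging weights every character equally, so one short file with a single mistake does not dominate the score the way it does in the per-file average.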

SenseVoice — the all-rounder, default pick

A single non-autoregressive pass returns the transcript plus language ID, emotion, and audio events. It covers 50+ languages, has the lowest Chinese CER in the table above (7.81%), and runs at about 170x real time. It is the default for most use cases.

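from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
# fsmn-vad splits long audio into segments before recognition
m = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
res = m.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))  # render inline tags as readable text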
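The language, emotion, and event labels can also be pulled out as separate fields. A minimal sketch, assuming the raw text carries them as leading `<|...|>` tags in the order language, emotion, event; that layout, the sample string, and the `parse_rich_text` helper are assumptions for illustration, so check them against real output before relying on this.

```python
import re

# Matches one inline tag such as <|en|> or <|HAPPY|>.
TAG = re.compile(r"<\|([^|]*)\|>")


def parse_rich_text(raw: str) -> dict:
    """Split a tagged transcript into its labels and the plain text."""
    tags = TAG.findall(raw)
    # Assumed order: language, emotion, audio event. Missing tags become None.
    language, emotion, event = (tags + [None] * 3)[:3]
    return {
        "language": language,
        "emotion": emotion,
        "event": event,
        "text": TAG.sub("", raw).strip(),
    }


# Hypothetical raw output, written by hand for this example.
sample = "<|en|><|HAPPY|><|Speech|><|withitn|>Nice to meet you."
parsed = parse_rich_text(sample)
print(parsed["language"], parsed["emotion"], parsed["event"])
print(parsed["text"])
```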

Paraformer — Chinese production + timestamps/hotwords

Industrial Chinese ASR with word-level timestamps (for subtitles), hotword customization (SeACo-Paraformer), and a low-latency streaming variant (paraformer-zh-streaming). Choose it when you need timestamps or hotwords.

from funasr import AutoModel
# fsmn-vad segments long audio; ct-punc restores punctuation
m = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
res = m.generate(input="audio.wav")
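Word timestamps are what make subtitle export possible. The sketch below turns tokens and their time spans into SRT cues. It assumes the timestamps are available as `[start_ms, end_ms]` pairs, one per token; the sample data, the `to_srt` helper, and its thresholds are hypothetical, and how the pairs are read from the model result is not shown here.

```python
def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(tokens, spans, max_tokens=12, max_gap_ms=500):
    """Group tokens into cues, breaking on long pauses or long lines."""
    cues, buf, start, end = [], [], None, None
    for tok, (t0, t1) in zip(tokens, spans):
        # Close the current cue if it is full or a pause precedes this token.
        if buf and (len(buf) >= max_tokens or t0 - end > max_gap_ms):
            cues.append((start, end, "".join(buf)))
            buf = []
        if not buf:
            start = t0
        buf.append(tok)
        end = t1
    if buf:
        cues.append((start, end, "".join(buf)))
    return "\n".join(
        f"{i}\n{ms_to_srt(a)} --> {ms_to_srt(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(cues, start=1)
    )


# Hypothetical tokens and spans: six characters, an 800 ms pause, four more.
tokens = list("今天天气很好我们出发")
spans = [(i * 200, (i + 1) * 200) for i in range(6)]
spans += [(2000 + i * 200, 2200 + i * 200) for i in range(4)]
srt = to_srt(tokens, spans)
print(srt)
```

Breaking on pauses keeps cues aligned with natural phrasing; the token cap only guards against very long unbroken runs.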

Fun-ASR-Nano — LLM-ASR, highest accuracy + context

A Qwen3-0.6B-based LLM-ASR across 31 languages, with context/hotword prompting and strong offline accuracy; vLLM acceleration reaches 340x. Choose it for top quality and context-awareness.

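from funasr import AutoModel
m = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", trust_remote_code=True, hub="hf")
# language="中文" selects Chinese; the hotword 开放时间 means "opening hours"
res = m.generate(input="audio.wav", language="中文", hotwords=["开放时间"])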

Quick decision

Need multilingual / emotion / real-time → SenseVoice
Need word timestamps / hotwords / streaming → Paraformer
Need highest accuracy / context / 31 languages → Fun-ASR-Nano
Need to run on CPU/edge with no Python → all three have a llama.cpp / GGUF build
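The same rules can be written as a small lookup, which is handy if model choice is driven by config. This is only a sketch: the requirement keywords and the `pick_model` helper are made up for illustration and are not part of FunASR.

```python
# Requirement keywords per model, mirroring the first three rules above.
RULES = {
    "SenseVoice": {"multilingual", "emotion", "real-time"},
    "Paraformer": {"word-timestamps", "hotwords", "streaming"},
    "Fun-ASR-Nano": {"highest-accuracy", "context", "31-languages"},
}


def pick_model(needs) -> str:
    """Return the model whose rule matches the most requested needs."""
    needs = set(needs)
    scores = {name: len(keys & needs) for name, keys in RULES.items()}
    # max() keeps the earliest entry on ties, so with no matches at all
    # the result is SenseVoice, the default pick.
    return max(scores, key=scores.get)


print(pick_model({"word-timestamps", "streaming"}))
print(pick_model({"context"}))
print(pick_model(set()))
```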

FunASR is open-source and commercial-friendly.