Which FunASR model? SenseVoice vs Paraformer vs Fun-ASR-Nano
FunASR ships three main ASR models. In one line: multilingual + emotion/events and fast → SenseVoice; Chinese production + word timestamps/hotwords → Paraformer; highest accuracy + context/hotwords across 31 languages → Fun-ASR-Nano. Details below.
Pick in one table
| Model | Languages | Chinese CER ↓ | Arch / speed | Highlights | Best for |
|---|---|---|---|---|---|
| SenseVoice | 50+ (zh/yue/en/ja/ko…) | 7.81% | non-AR CTC, ~170x | emotion + audio events + language ID | multilingual, emotion, real-time/low latency |
| Paraformer | Chinese (+ English variant) | 10.18% | non-AR CIF, ~120x | word timestamps, hotwords (SeACo), streaming | Chinese production, subtitles/timestamps, hotwords |
| Fun-ASR-Nano | 31 | 8.06% | LLM (Qwen3-0.6B), vLLM 340x | context/hotword prompting, LLM decoding | highest accuracy, context-aware, broad languages |
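(Chinese CER is measured on the same 184-file set, micro-averaged with normalize_zh applied; lower is better. Speed is the speed-up over real time on GPU, so ~170x means roughly 170 seconds of audio per second of compute.)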
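To make "micro-average" concrete, here is a minimal sketch of how a micro-averaged CER differs from a per-file (macro) average. The `normalize_zh` below is a hypothetical stand-in that only strips whitespace and punctuation; the normalizer actually used for the table is not shown in this post, and the sentence pairs are made up for illustration.

```python
import re


def normalize_zh(text: str) -> str:
    """Hypothetical stand-in for normalize_zh: drop whitespace and punctuation."""
    return re.sub(r"[\s\W_]+", "", text).lower()


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def micro_cer(pairs):
    """Total character errors divided by total reference characters."""
    errors = 0
    ref_chars = 0
    for ref, hyp in pairs:
        ref_n, hyp_n = normalize_zh(ref), normalize_zh(hyp)
        errors += edit_distance(ref_n, hyp_n)
        ref_chars += len(ref_n)
    return errors / ref_chars if ref_chars else 0.0


def macro_cer(pairs):
    """Mean of per-file CERs, shown only for contrast with micro_cer."""
    rates = []
    for ref, hyp in pairs:
        ref_n, hyp_n = normalize_zh(ref), normalize_zh(hyp)
        if ref_n:
            rates.append(edit_distance(ref_n, hyp_n) / len(ref_n))
    return sum(rates) / len(rates) if rates else 0.0


# Made-up (reference, hypothesis) pairs for illustration only.
example_pairs = [
    ("今天天气很好。", "今天天气很好"),
    ("开放时间是几点", "开放时间几点"),
    ("你好", "您好"),
]
print(f"micro CER: {micro_cer(example_pairs):.4f}")
print(f"macro CER: {macro_cer(example_pairs):.4f}")
```

With micro-averaging, long files weigh more than short ones, so a single error in a two-character clip does not dominate the score the way it does in the macro average.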
SenseVoice — the all-rounder, default pick
One non-autoregressive pass returns the transcript plus language, emotion, and audio events. It covers 50+ languages, has the lowest Chinese CER in the table above (7.81%), and runs fast. It is the default for most use cases.
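
    from funasr import AutoModel
    # Load SenseVoiceSmall, with FSMN-VAD to segment the audio
    m = AutoModel(model="iic/SenseVoiceSmall", vad_model="fsmn-vad")
    # language="auto" turns on language ID; use_itn=True applies inverse text normalization
    res = m.generate(input="audio.wav", language="auto", use_itn=True)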
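The language, emotion, and event labels come back inline with the transcript. The sketch below assumes they appear as `<|...|>` markers in front of the text, which is how SenseVoice output is commonly shown; the sample string is illustrative, not captured from a real run, and `split_tags` is a hypothetical helper name.

```python
import re

# Matches one "<|...|>" marker, e.g. "<|zh|>" or "<|NEUTRAL|>".
TAG_RE = re.compile(r"<\|([^|<>]+)\|>")


def split_tags(raw: str) -> dict:
    """Separate inline markers from the plain transcript text."""
    tags = TAG_RE.findall(raw)
    text = TAG_RE.sub("", raw).strip()
    return {"tags": tags, "text": text}


# Illustrative string only; the exact tag set and order are assumptions.
sample = "<|zh|><|NEUTRAL|><|Speech|><|withitn|>开放时间是早上九点。"
parsed = split_tags(sample)
print(parsed["tags"])
print(parsed["text"])
```

In a real run you would pass the `text` field of each result, for example `split_tags(res[0]["text"])`. FunASR also provides `rich_transcription_postprocess` in `funasr.utils.postprocess_utils` for turning this raw output into display text.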
Paraformer — Chinese production + timestamps/hotwords
Industrial Chinese ASR with word-level timestamps (for subtitles), hotword customization (SeACo-Paraformer), and a low-latency streaming variant (paraformer-zh-streaming). Choose it when you need timestamps or hotwords.
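
    from funasr import AutoModel
    # paraformer-zh with FSMN-VAD segmentation and CT punctuation restoration
    m = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
    res = m.generate(input="audio.wav")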
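For the subtitle use case, the timestamps still need to be turned into cues. The following is a minimal sketch, assuming you already have a list of tokens and a matching list of `[start_ms, end_ms]` pairs; the exact shape of the Paraformer result is not shown in this post, so treat the input format, the fixed-size grouping, and the helper names as assumptions.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timecode, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def tokens_to_cues(tokens, timestamps, max_tokens=12):
    """Group tokens into cues of at most max_tokens tokens each."""
    if len(tokens) != len(timestamps):
        raise ValueError("tokens and timestamps must have the same length")
    cues = []
    for start in range(0, len(tokens), max_tokens):
        chunk = tokens[start:start + max_tokens]
        spans = timestamps[start:start + max_tokens]
        # A cue runs from the first token's start to the last token's end.
        cues.append((spans[0][0], spans[-1][1], "".join(chunk)))
    return cues


def to_srt(cues) -> str:
    """Render (start_ms, end_ms, text) cues as SRT text."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        timing = f"{ms_to_srt_time(start_ms)} --> {ms_to_srt_time(end_ms)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Made-up tokens and timings for illustration only.
demo_tokens = list("开放时间是九点")
demo_times = [[i * 200, (i + 1) * 200] for i in range(len(demo_tokens))]
demo_srt = to_srt(tokens_to_cues(demo_tokens, demo_times, max_tokens=4))
print(demo_srt)
```

Fixed-size grouping keeps the sketch short; a real subtitle pipeline would more likely break cues at punctuation or pauses.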
Fun-ASR-Nano — LLM-ASR, highest accuracy + context
A Qwen3-0.6B-based LLM-ASR model covering 31 languages, with context/hotword prompting and strong offline accuracy; with vLLM acceleration it reaches 340x real time. Choose it for top quality and context-awareness.
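
    from funasr import AutoModel
    # trust_remote_code loads the model's custom code; hub="hf" pulls from Hugging Face
    m = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", trust_remote_code=True, hub="hf")
    # language="中文" selects Chinese; the hotword "开放时间" means "opening hours"
    res = m.generate(input="audio.wav", language="中文", hotwords=["开放时间"])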
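To put the speed figures in practical terms, here is a rough back-of-the-envelope sketch. It assumes "Nx" means N seconds of audio processed per second of GPU compute, takes the approximate multipliers from the table above at face value, and ignores model loading, VAD, and I/O, so real wall-clock times will be longer.

```python
def wall_clock_seconds(audio_seconds: float, speed_multiplier: float) -> float:
    """Estimated compute time for a clip at a given speed-up over real time."""
    if speed_multiplier <= 0:
        raise ValueError("speed_multiplier must be positive")
    return audio_seconds / speed_multiplier


# Approximate multipliers quoted in the comparison table.
SPEEDS = {
    "SenseVoice": 170,
    "Paraformer": 120,
    "Fun-ASR-Nano (vLLM)": 340,
}

one_hour = 3600.0
for name, speed in SPEEDS.items():
    print(f"{name}: ~{wall_clock_seconds(one_hour, speed):.1f} s per hour of audio")
```

At these rates all three models get through an hour of audio in well under a minute of compute, so the choice between them usually comes down to features rather than raw speed.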
Quick decision
- Need multilingual / emotion / real-time → SenseVoice
- Need word timestamps / hotwords / streaming → Paraformer
- Need highest accuracy / context / 31 languages → Fun-ASR-Nano
- Need to run on CPU/edge with no Python → all three have a llama.cpp / GGUF build
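The first three rules above can be written down as a small lookup, which is handy if you want to pick a model from configuration. This is only a sketch of the list as stated; the keyword strings and the `pick_model` helper are hypothetical names, not part of FunASR.

```python
from typing import Iterable, List

# Requirement keywords per model, copied from the quick-decision list.
RULES = {
    "SenseVoice": {"multilingual", "emotion", "real-time"},
    "Paraformer": {"word timestamps", "hotwords", "streaming"},
    "Fun-ASR-Nano": {"highest accuracy", "context", "31 languages"},
}


def pick_model(needs: Iterable[str]) -> List[str]:
    """Return the model(s) matching the most requested needs, or [] if none match."""
    wanted = set(needs)
    scores = {model: len(keywords & wanted) for model, keywords in RULES.items()}
    best = max(scores.values())
    if best == 0:
        return []
    return [model for model, score in scores.items() if score == best]


print(pick_model({"emotion", "real-time"}))
print(pick_model({"word timestamps", "streaming"}))
print(pick_model({"emotion", "context"}))
```

A tie returns more than one model, which signals that the requirements span two models and you need to decide which need matters more.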
FunASR is open-source and commercial-friendly.