FunASR Models

Choose the right model for your use case — from ultra-fast multilingual recognition to the highest Chinese accuracy.

Quick Comparison

Model	Speed	Languages	Params	Best For
Fun-ASR-Nano ⭐	vLLM 340x	zh/en/ja + Chinese dialects/accents	800M	Flagship · LLM-ASR · hardest cases
Fun-ASR-MLT-Nano	vLLM	31	800M	Separate multilingual checkpoint
SenseVoice Small	170x realtime	50+	234M	Fast multilingual, emotion detection
Paraformer-zh	120x realtime	zh, yue	220M	Best Chinese accuracy
cam++	realtime	any	7.2M	Speaker diarization & verification
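The table can be read as a small decision rule. The sketch below is one hypothetical way to encode it: the `pick_model` function, its parameters, and the priority order are illustrative assumptions rather than part of FunASR, and dialect coverage is not modeled. The model IDs are the ones used in the examples on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    name: str
    model_id: str


# Model IDs copied from the usage examples on this page.
FLAGSHIP = ModelChoice("Fun-ASR-Nano", "FunAudioLLM/Fun-ASR-Nano-2512")
MULTILINGUAL = ModelChoice("Fun-ASR-MLT-Nano", "FunAudioLLM/Fun-ASR-MLT-Nano-2512")
SENSEVOICE = ModelChoice("SenseVoice Small", "iic/SenseVoiceSmall")
PARAFORMER = ModelChoice(
    "Paraformer-zh",
    "iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)

# Simplified language sets taken from the comparison table (assumption:
# dialects and accents are ignored here).
FLAGSHIP_LANGS = {"zh", "en", "ja"}
PARAFORMER_LANGS = {"zh", "yue"}


def pick_model(languages, need_timestamps=False, need_emotion=False):
    """Hypothetical heuristic: map requirements to a row of the table."""
    langs = set(languages)
    if need_emotion:
        # Emotion detection is listed only for SenseVoice Small.
        return SENSEVOICE
    if need_timestamps and langs <= PARAFORMER_LANGS:
        # Paraformer is the model this page recommends for timestamps.
        return PARAFORMER
    if langs <= FLAGSHIP_LANGS:
        return FLAGSHIP
    return MULTILINGUAL


print(pick_model(["zh"], need_timestamps=True).name)
print(pick_model(["zh", "en"]).name)
```

Treat the priority order as a starting point. Requirements such as latency budget or CPU-only deployment would need extra branches.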

ASR Models

Fun-ASR-Nano ⭐ Flagship / default

800M params · LLM-based (SenseVoice encoder + Qwen3-0.6B) · GitHub · HuggingFace

Next-generation LLM-based ASR model. Combines SenseVoice's audio encoder with the Qwen3-0.6B language model for superior context understanding. Supports vLLM acceleration for high-throughput batch inference and real-time streaming. The released model.pt checkpoint does not produce reliable character-level timestamps on its own (issue #106).

vLLM accelerated · zh/en/ja + Chinese dialects/accents · Streaming · LLM-quality

When to use

Best for high-throughput batch processing, real-time subtitles, and scenarios where LLM-quality context understanding improves output (e.g., proper nouns and code-switching). For reliable character-level timestamps, use Paraformer.

# With vLLM acceleration
from funasr import AutoModel
model = AutoModel(model="FunAudioLLM/Fun-ASR-Nano-2512", device="cuda", backend="vllm")
result = model.generate(input="audio.wav")
print(result[0]["text"])

Fun-ASR-MLT-Nano 31 languages

800M params · Separate multilingual checkpoint · HuggingFace · ModelScope

A separate checkpoint for broad multilingual recognition across 31 languages. Its model ID and language scope differ from those of the flagship Fun-ASR-Nano.

vLLM accelerated · 31 languages · Separate checkpoint

When to use

Choose MLT-Nano for recognition across 31 languages; choose the flagship Nano for zh/en/ja and Chinese dialects/accents.

from funasr import AutoModel
model = AutoModel(model="FunAudioLLM/Fun-ASR-MLT-Nano-2512", device="cuda")
result = model.generate(input="audio.wav")
print(result[0]["text"])

SenseVoice Small

234M params · Non-autoregressive · GitHub · HuggingFace

Ultra-fast speech recognition with built-in emotion and audio event detection. Supports 50+ languages, including Chinese, English, Japanese, Korean, French, and German. Its non-autoregressive architecture delivers 170x realtime speed on GPU.

170x realtime · 50+ languages · Emotion detection · Audio events · CPU-viable
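To make a figure like "170x realtime" concrete, divide the audio duration by the speed factor. The sketch below only does that arithmetic with the numbers quoted on this page. The helper name is hypothetical, and real throughput will vary with hardware, batch size, and audio length.

```python
def estimated_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Processing time implied by an 'Nx realtime' figure."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


ONE_HOUR = 3600.0
# Speed factors as quoted in the comparison table on this page.
for name, factor in [("SenseVoice Small", 170.0), ("Paraformer-zh", 120.0)]:
    seconds = estimated_seconds(ONE_HOUR, factor)
    print(f"{name}: ~{seconds:.1f} s per hour of audio")
# SenseVoice Small: ~21.2 s per hour of audio
# Paraformer-zh: ~30.0 s per hour of audio
```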

When to use

Best for: multilingual applications, real-time streaming, batch processing of large audio collections, and applications that need emotion or audio event detection.

from funasr import AutoModel
model = AutoModel(model="iic/SenseVoiceSmall")
result = model.generate(input="audio.wav")
print(result[0]["text"])
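The emotion and audio event labels have to be separated from the transcript before either can be used. The sketch below assumes the raw text carries them as leading `<|...|>` tags, for example language, emotion, and event. The sample string is made up for illustration, so check the actual output of your installed version. FunASR also ships a `rich_transcription_postprocess` helper in `funasr.utils.postprocess_utils`, which may be the better choice in practice.

```python
import re

# Matches tags of the assumed form <|something|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw_text: str):
    """Return (list of tag names, transcript with tags removed)."""
    tags = TAG_PATTERN.findall(raw_text)
    text = TAG_PATTERN.sub("", raw_text).strip()
    return tags, text


# Hypothetical raw output, not captured from a real run.
sample = "<|en|><|NEUTRAL|><|Speech|><|woitn|>hello world"
tags, text = split_tags(sample)
print(tags)  # ['en', 'NEUTRAL', 'Speech', 'woitn']
print(text)  # hello world
```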

Paraformer-zh Large

220M params · Non-autoregressive · HuggingFace

Highest-accuracy Chinese speech recognition model. Non-autoregressive with CTC-guided attention, trained on 60,000+ hours of Mandarin speech. Includes built-in punctuation restoration and timestamp prediction.

120x realtime · Chinese + Cantonese · Best accuracy · Timestamps · Punctuation

When to use

Best for: Chinese-only applications where accuracy is the top priority, such as meeting transcription, subtitle generation, voice input, and training data annotation.

from funasr import AutoModel
model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    punc_model="iic/punc_ct-transformer_cn-en-common-vocab471067-large",
)
result = model.generate(input="audio.wav")
print(result[0]["text"])
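For the subtitle use case, timestamps eventually need to be written in a subtitle format such as SRT. The sketch below assumes you have already grouped the model output into `(start_ms, end_ms, text)` triples. How to derive those from the result dictionary is not shown here and depends on the output fields of your FunASR version. The function names are hypothetical.

```python
def srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start_ms, end_ms, text) triples as SRT cue blocks."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_time(start_ms)} --> {srt_time(end_ms)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Illustrative segments, not real model output.
print(to_srt([(0, 1500, "你好"), (1500, 3000, "世界")]))
```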

Supporting Models

cam++ (Speaker Diarization)

7.2M params · HuggingFace

Lightweight speaker embedding model for speaker diarization (who spoke when) and speaker verification (is this the same person). At only 7.2M parameters, it runs on CPU in real time.

7.2M params · Diarization · Verification · CPU realtime
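Speaker verification with an embedding model usually comes down to comparing two embedding vectors. The sketch below uses cosine similarity with a fixed threshold, which is a common generic approach and not necessarily what FunASR does internally. The tiny vectors and the 0.5 threshold are placeholders; real embeddings have many more dimensions, and the threshold needs tuning on your own data.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: accept if similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 2-dimensional embeddings for illustration only.
print(same_speaker([1.0, 0.1], [0.9, 0.2]))
print(same_speaker([1.0, 0.0], [0.0, 1.0]))
```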

FSMN-VAD

Built-in · Voice Activity Detection

Feedforward Sequential Memory Network for Voice Activity Detection. Accurately detects speech segments in audio, handling silence, noise, and music. Typically used as a preprocessing step ahead of the ASR models, for example through the vad_model argument in the Paraformer example.

VAD · Lightweight
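Downstream code often post-processes the detected speech segments before passing them to ASR. The sketch below assumes the segments arrive as `[start_ms, end_ms]` pairs, which is an assumption about the output shape to verify against your version. It merges segments separated by short pauses and sums the speech time. The function names and the 300 ms gap are hypothetical.

```python
def merge_segments(segments, max_gap_ms: int = 300):
    """Merge segments whose gap to the previous one is at most max_gap_ms."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Extend the previous segment instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def speech_ms(segments) -> int:
    """Total duration covered by the segments, in milliseconds."""
    return sum(end - start for start, end in segments)


# Illustrative segments, not real VAD output.
raw = [[0, 1000], [1200, 2000], [5000, 6000]]
merged = merge_segments(raw)
print(merged)             # [[0, 2000], [5000, 6000]]
print(speech_ms(merged))  # 3000
```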

CT-Transformer (Punctuation)

Built-in · Punctuation Restoration

Automatically adds punctuation (commas, periods, question marks, and so on) to ASR output, which makes transcripts considerably easier to read. Supports Chinese and English.

Punctuation · zh + en

OpenAI-Compatible API

All models are available through funasr-server, which exposes an OpenAI-compatible /v1/audio/transcriptions endpoint:

# Start the server
pip install funasr vllm fastapi uvicorn python-multipart
funasr-server --device cuda --port 8000

# Use with any OpenAI-compatible client
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=SenseVoiceSmall

Drop-in replacement: any application that uses OpenAI's Whisper API can switch to FunASR by changing the base URL. No other code changes are needed, because the request format and response structure are the same.
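The curl call above can also be made from Python with only the standard library. The sketch below builds the same multipart request, with a `file` field and a `model` field, and assumes the server answers with a JSON object containing a `text` key, as OpenAI's transcription endpoint does. Both function names are hypothetical, and the base URL matches the port used in the server example.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes,
                                filename="audio.wav", model="SenseVoiceSmall"):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    model_part = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f'{model}\r\n'
    ).encode()
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'
    ).encode()
    closing = f'--{boundary}--\r\n'.encode()
    body = model_part + file_header + audio_bytes + b"\r\n" + closing
    url = base_url.rstrip("/") + "/v1/audio/transcriptions"
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def transcribe(base_url, path, model="SenseVoiceSmall"):
    """Send an audio file to a running server and return the transcript."""
    with open(path, "rb") as audio_file:
        request = build_transcription_request(
            base_url, audio_file.read(), os.path.basename(path), model
        )
    with urllib.request.urlopen(request) as response:
        # Assumption: OpenAI-style response body, {"text": "..."}.
        return json.loads(response.read())["text"]
```

With the server from the example running, `transcribe("http://localhost:8000", "audio.wav")` would return the recognized text.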

Deployment Options

Method	Command	Best For
pip	`pip install funasr && funasr-server`	Development, quick testing
Docker	`docker run -d --gpus all -p 8000:8000 ...`	Production deployment
Python API	`from funasr import AutoModel`	Embedding in applications
ONNX	Via Sherpa-ONNX	Mobile, edge, browser