Fun-ASR-Nano Guide: An 800M End-to-End Speech Recognition LLM for 31 Languages and 7 Chinese Dialects
Fun-ASR-Nano is an end-to-end speech recognition large model from the FunAudioLLM team. It uses an "Audio Encoder + Adaptor + LLM (Qwen)" architecture with ~800M parameters and was trained on tens of millions of hours of real speech. It is currently our recommended flagship ASR model.
Highlights:
- 31 languages; Chinese covers 7 dialects (Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin) and 26 regional accents
- Hotwords, timestamps, and low-latency streaming
- Speaker diarization (VAD + speaker embedding + punctuation)
- Recognition of sung lyrics and rap
1. Install
pip install -U funasr torch torchaudio
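After installing, it can help to confirm which versions the environment actually resolved. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustration and not part of FunASR.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the three packages named in the install command above.
for name in ("funasr", "torch", "torchaudio"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A missing package is reported as "not installed" rather than raising, so the check is safe to run before any of the packages are present.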
2. Basic inference (with hotwords)
A minimal, tested example (downloads the model automatically):
from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    device="cuda:0",
    hub="hf",  # use hub="ms" for ModelScope
)
wav = f"{model.model_path}/example/zh.mp3"  # bundled sample audio
res = model.generate(
    input=[wav], cache={}, batch_size=1,
    hotwords=["开放时间"],  # "opening hours"; boosts recall of domain terms
    language="中文",  # "Chinese"
    itn=True,  # inverse text normalization
)
print(res[0]["text"])
Verified output: 开放时间早上九点至下午五点。 ("Opening hours are 9 a.m. to 5 p.m.") The hotword was recognized correctly.
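When tuning a hotword list, a quick way to see which terms made it into the transcript is a plain substring check. This is a small sketch; `hotword_hits` is a hypothetical helper, and the text is the verified output quoted above rather than a live model call.

```python
from typing import Dict, List


def hotword_hits(text: str, hotwords: List[str]) -> Dict[str, bool]:
    """Map each hotword to whether it appears verbatim in the transcript."""
    return {word: word in text for word in hotwords}


text = "开放时间早上九点至下午五点。"  # output quoted in this guide
print(hotword_hits(text, ["开放时间"]))  # → {'开放时间': True}
```

A verbatim match is a coarse signal: it will miss a hotword that the model rendered with different characters or spacing.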
3. Long audio: add VAD
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},  # max segment length in ms (30 s)
    device="cuda:0", hub="hf",
)
# `wav` is the audio path defined in section 2
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")
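To make the effect of `max_single_segment_time` concrete, the sketch below caps a list of speech segments at 30,000 ms. It only illustrates the idea of a length cap: `cap_segments` is a hypothetical helper, the segment values are made up, and the real fsmn-vad model may choose its split points differently.

```python
from typing import List, Tuple


def cap_segments(
    segments_ms: List[Tuple[int, int]], max_ms: int = 30000
) -> List[Tuple[int, int]]:
    """Split any (start, end) segment longer than max_ms into pieces of at most max_ms."""
    capped = []
    for start, end in segments_ms:
        # Peel off full-length pieces until the remainder fits under the cap.
        while end - start > max_ms:
            capped.append((start, start + max_ms))
            start += max_ms
        if end > start:
            capped.append((start, end))
    return capped


# Made-up segments: one short (12 s) and one long (70 s).
print(cap_segments([(0, 12000), (15000, 85000)]))
# → [(0, 12000), (15000, 45000), (45000, 75000), (75000, 85000)]
```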
4. Speaker diarization
Combine VAD + speaker embedding (cam++) + punctuation for per-sentence speaker labels:
model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++", punc_model="ct-punc",
    device="cuda:0", hub="hf",
)
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")
# Each entry in sentence_info carries a speaker label and the sentence text
for sent in res[0]["sentence_info"]:
    print(sent["spk"], sent["text"])
Diarization requires FunASR installed from source: pip install git+https://github.com/modelscope/FunASR.git.
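Per-sentence labels are often easier to read once consecutive sentences from the same speaker are merged into turns. The sketch below assumes only the `spk` and `text` keys used in the loop above; `merge_turns` is a hypothetical helper and the sample data is made up for illustration.

```python
from typing import Any, Dict, List


def merge_turns(sentence_info: List[Dict[str, Any]], sep: str = "") -> List[Dict[str, Any]]:
    """Merge consecutive sentences with the same speaker label into one turn."""
    turns: List[Dict[str, Any]] = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker as the previous sentence: extend the current turn.
            turns[-1]["text"] += sep + sent["text"]
        else:
            turns.append({"spk": sent["spk"], "text": sent["text"]})
    return turns


# Made-up sample in the same shape as res[0]["sentence_info"].
sample = [
    {"spk": 0, "text": "Hello."},
    {"spk": 0, "text": "How are you?"},
    {"spk": 1, "text": "Fine, thanks."},
]
for turn in merge_turns(sample, sep=" "):
    print(turn["spk"], turn["text"])
```

The default `sep=""` suits Chinese output, where sentences join without spaces; pass `sep=" "` for space-delimited languages.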
5. High-throughput / streaming
| Scenario | Recommended |
|---|---|
| Large-scale offline batch | AutoModelVLLM (vLLM backend) |
| Real-time low latency | FunASRNanoStreamingVLLM (chunk streaming) |
| Single machine / quick try | The AutoModel path above |
The vLLM path is version-sensitive; vLLM 0.12.0 with torch 2.9.0 is recommended. Full examples are in the Fun-ASR repo.
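For readers new to chunk streaming, the general idea is to feed audio to the recognizer in short fixed-length pieces instead of as one file. The sketch below shows only that slicing step on a plain list of samples. It is not the `FunASRNanoStreamingVLLM` API; `iter_chunks`, the 16 kHz sample rate, and the 600 ms chunk size are all illustrative assumptions.

```python
from typing import Iterator, List


def iter_chunks(
    samples: List[float], sample_rate: int = 16000, chunk_ms: int = 600
) -> Iterator[List[float]]:
    """Yield consecutive chunks of roughly chunk_ms milliseconds; the last may be shorter."""
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms and sample_rate must give at least one sample per chunk")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# 1.25 s of silence at the assumed 16 kHz rate.
audio = [0.0] * 20000
print([len(chunk) for chunk in iter_chunks(audio)])  # → [9600, 9600, 800]
```

Smaller chunks lower latency but give the recognizer less context per step, which is the usual trade-off in a streaming setup.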