Fun-ASR-Nano Guide: An 800M ASR LLM for Chinese, English, Japanese and Chinese Dialects

FunASR Blog · Updated 2026-07-15 · Flagship model

Fun-ASR-Nano is an end-to-end large speech recognition model from the FunAudioLLM team. It is built on an "Audio Encoder + Adaptor + LLM (Qwen)" architecture with ~800M parameters and trained on tens of millions of hours of real speech. It is our current flagship and our recommended ASR model.

Highlights:

Supports Chinese, English, and Japanese; Chinese covers 7 dialect groups (Wu, Cantonese, Min, Hakka, Gan, Xiang, Jin) and 26 regional accents
Hotword support and a low-latency streaming service
Speaker diarization through the FunASR pipeline, which combines separate VAD + cam++ + punctuation models (not a native Nano checkpoint output)
Lyric and rap recognition

Timestamp limitation: The released model.pt checkpoint lacks trained CTC alignment weights, so it cannot produce reliable character-level timestamps on its own. In the offline service, VAD returns reliable segment start and end times; use Paraformer when you need accurate character-level timestamps. See issue #106.

For recognition across 31 languages, use the separate Fun-ASR-MLT-Nano-2512 checkpoint. That language coverage belongs to MLT-Nano only and should not be attributed to the flagship Nano.
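
The split between the two checkpoints can be expressed as a small routing helper. This is only a sketch and not part of FunASR: `pick_checkpoint` and the ISO-style language codes are hypothetical, and the MLT hub id is assumed to follow the same `FunAudioLLM/` naming as the flagship id used later in this guide.

```python
# Hypothetical helper: route a language code to a checkpoint id.
NANO = "FunAudioLLM/Fun-ASR-Nano-2512"          # flagship id used in this guide
MLT_NANO = "FunAudioLLM/Fun-ASR-MLT-Nano-2512"  # assumed hub id, verify before use

# Languages covered by the flagship Nano (Chinese includes dialects and accents).
NANO_LANGS = {"zh", "en", "ja"}


def pick_checkpoint(lang_code: str) -> str:
    """Return the flagship Nano id for zh/en/ja, otherwise the MLT checkpoint.

    Routing to MLT-Nano does not guarantee that it covers the language;
    check its supported-language list separately.
    """
    return NANO if lang_code.lower() in NANO_LANGS else MLT_NANO


print(pick_checkpoint("ja"))
print(pick_checkpoint("ko"))
```

Keeping the flagship language set explicit makes it harder to credit Nano with languages that only MLT-Nano handles.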

1. Install

pip install -U funasr torch torchaudio

2. Basic inference (with hotwords)

A minimal, tested example (downloads the model automatically):

from funasr import AutoModel

model = AutoModel(
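    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,          # load the model class from remote_code
    remote_code="./model.py",        # relative path, resolved from the working directory
    device="cuda:0",                 # first CUDA GPU
    hub="hf",                        # use hub="ms" for ModelScope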
)
wav = f"{model.model_path}/example/zh.mp3"   # bundled sample audio
res = model.generate(
    input=[wav], cache={}, batch_size=1,
    hotwords=["开放时间"],   # boosts recall of domain terms
    language="中文",
    itn=True,
)
print(res[0]["text"])

Verified output: 开放时间早上九点至下午五点。 ("Opening hours are 9 a.m. to 5 p.m.") The hotword 开放时间 ("opening hours") was recognized correctly.
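
A quick way to confirm that hotwords made it into a transcript is a plain substring check. The helper below is a hypothetical sketch (`missing_hotwords` is not a FunASR function), and the second hotword is made up for illustration. It runs on the verified output string rather than a live model call.

```python
def missing_hotwords(text: str, hotwords: list[str]) -> list[str]:
    """Return the hotwords that do not appear verbatim in the transcript."""
    return [hw for hw in hotwords if hw not in text]


text = "开放时间早上九点至下午五点。"  # verified output from the example
print(missing_hotwords(text, ["开放时间"]))          # []
print(missing_hotwords(text, ["开放时间", "闭馆"]))  # ['闭馆']
```

Because the match is verbatim, a hotword that `itn=True` rewrites (for example, spelled-out numbers turned into digits) would be reported as missing even when it was recognized.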

3. Long audio: add VAD

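from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad",                            # split long audio with VAD first
    vad_kwargs={"max_single_segment_time": 30000},   # cap each segment at 30000 ms (30 s)
    device="cuda:0", hub="hf",
)
wav = f"{model.model_path}/example/zh.mp3"   # bundled sample; replace with your own long recording
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")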
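
`max_single_segment_time` is expressed in milliseconds, so 30000 caps each VAD segment at 30 seconds. The sketch below shows the unit conversion and a rough lower bound on how many segments a recording produces. Both helper names are hypothetical and not part of FunASR.

```python
import math


def vad_kwargs_for(max_segment_seconds: float) -> dict:
    """Build vad_kwargs with the segment cap converted from seconds to ms."""
    return {"max_single_segment_time": int(max_segment_seconds * 1000)}


def min_segments(audio_seconds: float, max_segment_seconds: float) -> int:
    """Lower bound on segment count; VAD also splits at silences, so expect more."""
    return math.ceil(audio_seconds / max_segment_seconds)


print(vad_kwargs_for(30))      # {'max_single_segment_time': 30000}
print(min_segments(3600, 30))  # 120
```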

4. Speaker diarization

Speaker diarization is not a native Nano checkpoint output. The FunASR pipeline below combines VAD + speaker embedding (cam++) + punctuation for per-sentence speaker labels:

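from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++", punc_model="ct-punc",   # speaker embeddings + punctuation
    device="cuda:0", hub="hf",
)
wav = f"{model.model_path}/example/zh.mp3"   # bundled sample; replace with a multi-speaker recording
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")
# Each entry in sentence_info is one sentence with its speaker label.
for sent in res[0]["sentence_info"]:
    print(sent["spk"], sent["text"])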

Diarization requires FunASR installed from source: pip install git+https://github.com/modelscope/FunASR.git
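
Per-sentence labels are often easier to read once consecutive sentences from the same speaker are merged into turns. The following is a minimal sketch under stated assumptions: `merge_speaker_turns` is a hypothetical helper, the sample data is made up, and only the `spk` and `text` keys from the loop above are relied on (integer speaker ids are an assumption).

```python
from itertools import groupby


def merge_speaker_turns(sentence_info: list[dict]) -> list[tuple]:
    """Join consecutive sentences from the same speaker into one turn."""
    turns = []
    for spk, group in groupby(sentence_info, key=lambda s: s["spk"]):
        turns.append((spk, "".join(s["text"] for s in group)))
    return turns


# Made-up sample shaped like res[0]["sentence_info"]; only "spk" and "text" are used.
sample = [
    {"spk": 0, "text": "你好。"},
    {"spk": 0, "text": "请问几点开门？"},
    {"spk": 1, "text": "早上九点。"},
]
for spk, text in merge_speaker_turns(sample):
    print(f"speaker {spk}: {text}")
```

Sentences are joined without a separator, which suits Chinese and Japanese; for English transcripts, joining with a space would be the more natural choice.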

5. High-throughput / streaming

Scenario	Recommended
Large-scale offline batch	`AutoModelVLLM` (vLLM backend)
Real-time low latency	`FunASRNanoStreamingVLLM` (chunk streaming)
Single machine / quick try	The `AutoModel` path above

The vLLM path is version-sensitive; vLLM 0.12.0 with torch 2.9.0 is the recommended combination. Full examples are in the Fun-ASR repo.
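
Since the vLLM path depends on matching versions, it can help to check the environment before launching a service. The sketch below uses only the standard library's `importlib.metadata`; the helper names are hypothetical, and the pins are the ones recommended in this guide.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins recommended in this guide for the vLLM path.
RECOMMENDED = {"vllm": "0.12.0", "torch": "2.9.0"}


def installed_versions(packages) -> dict:
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(found: dict, wanted: dict = RECOMMENDED) -> dict:
    """Return {package: (found, wanted)} for every package off the recommended pin."""
    out = {}
    for name, want in wanted.items():
        have = found.get(name)
        # Drop local build tags such as "+cu128" before comparing.
        if have is None or have.split("+")[0] != want:
            out[name] = (have, want)
    return out


print(version_mismatches(installed_versions(RECOMMENDED)))
```

An empty dict means both packages match the recommended pins; any other result lists what is installed next to what is expected.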

Get started with Fun-ASR-Nano

Our flagship model for zh/en/ja plus Chinese dialects and accents. Choose MLT-Nano for 31 languages. If it helps, please star it on GitHub ⭐
