Fun-ASR-Nano Guide: An 800M End-to-End Speech Recognition LLM for 31 Languages and 7 Chinese Dialects

Fun-ASR-Nano is an end-to-end large speech recognition model from the FunAudioLLM team. It uses an "Audio Encoder + Adaptor + LLM (Qwen)" architecture with about 800M parameters and was trained on tens of millions of hours of real speech. It is currently our flagship and recommended ASR model.

Highlights:

1. Install

pip install -U funasr torch torchaudio
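After installing, it can help to confirm whether a GPU is actually visible before hard-coding device="cuda:0". The sketch below is not part of FunASR: pick_device is a hypothetical helper that returns the preferred CUDA device when PyTorch reports one and falls back to "cpu" otherwise. It assumes only the standard torch.cuda.is_available() call.

```python
import importlib.util


def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if PyTorch sees a CUDA GPU, otherwise "cpu".

    Hypothetical helper for illustration; not part of FunASR.
    """
    # Avoid an ImportError if the install step has not been run yet.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return preferred if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Using device:", pick_device())
```

The returned string can be passed as the device argument of AutoModel. CPU inference, where it works at all, should be expected to be much slower than GPU inference.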

2. Basic inference (with hotwords)

A minimal, tested example (downloads the model automatically):

from funasr import AutoModel

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True,
    remote_code="./model.py",
    device="cuda:0",
    hub="hf",          # use hub="ms" for ModelScope
)
wav = f"{model.model_path}/example/zh.mp3"   # bundled sample audio
res = model.generate(
    input=[wav], cache={}, batch_size=1,
    hotwords=["开放时间"],   # boosts recall of domain terms
    language="中文",
    itn=True,
)
print(res[0]["text"])

Verified output: 开放时间早上九点至下午五点。 ("Opening hours are 9 a.m. to 5 p.m."). The hotword 开放时间 ("opening hours") was recognized correctly.
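To run the same call over several files, one option is a thin wrapper around model.generate. This is a minimal sketch, not a FunASR API: transcribe_files is a hypothetical helper that reuses the arguments from the example above (input, cache, batch_size, hotwords, language, itn), calls the model once per file, and reads res[0]["text"] exactly as the example does.

```python
from typing import Any, Dict, Iterable, List, Optional


def transcribe_files(
    model: Any,
    paths: Iterable[str],
    hotwords: Optional[List[str]] = None,
    language: str = "中文",
) -> Dict[str, str]:
    """Map each audio path to its transcript (hypothetical helper)."""
    texts: Dict[str, str] = {}
    for path in paths:
        # Same keyword arguments as the single-file example above.
        kwargs = dict(input=[path], cache={}, batch_size=1,
                      language=language, itn=True)
        # Only pass hotwords when the caller supplied some.
        if hotwords:
            kwargs["hotwords"] = hotwords
        res = model.generate(**kwargs)
        texts[path] = res[0]["text"]
    return texts
```

Usage, assuming model was created as above: transcribe_files(model, [wav], hotwords=["开放时间"]). Calling the model one file at a time keeps the sketch close to the tested example; it is not tuned for throughput.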

3. Long audio: add VAD

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0", hub="hf",
)
# wav is the sample audio path defined in section 2
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")
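The max_single_segment_time value caps how long a single VAD segment may be. The sketch below only illustrates the arithmetic of such a cap, assuming the value is in milliseconds (so 30000 means 30 s). cap_segments is a hypothetical function, and it is not how fsmn-vad chooses boundaries: the real model works from detected speech activity, while this sketch simply cuts over-long spans at fixed offsets.

```python
from typing import List, Tuple


def cap_segments(
    segments_ms: List[Tuple[int, int]], max_ms: int = 30000
) -> List[Tuple[int, int]]:
    """Split any (start, end) span longer than max_ms into pieces <= max_ms."""
    capped = []
    for start, end in segments_ms:
        # Peel off full-length pieces until the remainder fits under the cap.
        while end - start > max_ms:
            capped.append((start, start + max_ms))
            start += max_ms
        capped.append((start, end))
    return capped


# Made-up spans: a 12 s segment and a 65 s segment.
print(cap_segments([(0, 12000), (15000, 80000)]))
# [(0, 12000), (15000, 45000), (45000, 75000), (75000, 80000)]
```

The point of the cap is that every piece handed to the recognizer stays short enough to process in one pass, however long the input file is.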

4. Speaker diarization

Combine VAD + speaker embedding (cam++) + punctuation for per-sentence speaker labels:

model = AutoModel(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    trust_remote_code=True, remote_code="./model.py",
    vad_model="fsmn-vad", vad_kwargs={"max_single_segment_time": 30000},
    spk_model="cam++", punc_model="ct-punc",
    device="cuda:0", hub="hf",
)
# wav is the sample audio path defined in section 2
res = model.generate(input=[wav], cache={}, batch_size=1, language="中文")
for sent in res[0]["sentence_info"]:
    print(sent["spk"], sent["text"])

Diarization requires FunASR installed from source:

pip install git+https://github.com/modelscope/FunASR.git
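Per-sentence labels are often easier to read when consecutive sentences from the same speaker are merged into turns. The sketch below assumes only the "spk" and "text" keys used in the loop above; merge_speaker_turns and the sample data are hypothetical and made up for illustration, and real sentence_info entries may carry additional fields that this sketch ignores.

```python
from typing import Dict, List


def merge_speaker_turns(sentence_info: List[Dict], sep: str = "") -> List[Dict]:
    """Merge consecutive sentences that share the same "spk" label."""
    turns: List[Dict] = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += sep + sent["text"]
        else:
            # Speaker changed (or first sentence): start a new turn.
            turns.append({"spk": sent["spk"], "text": sent["text"]})
    return turns


# Made-up sample in the shape printed by the loop above.
sample = [
    {"spk": 0, "text": "Hello."},
    {"spk": 0, "text": "Shall we start?"},
    {"spk": 1, "text": "Yes, go ahead."},
    {"spk": 0, "text": "Thanks."},
]
for turn in merge_speaker_turns(sample, sep=" "):
    print(f"Speaker {turn['spk']}: {turn['text']}")
```

With real output this would be called as merge_speaker_turns(res[0]["sentence_info"]). The default empty separator suits Chinese text; pass sep=" " for space-delimited languages.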

5. High-throughput / streaming

Scenario | Recommended
Large-scale offline batch | AutoModelVLLM (vLLM backend)
Real-time low latency | FunASRNanoStreamingVLLM (chunk streaming)
Single machine / quick try | The AutoModel path above

The vLLM path is version-sensitive; vLLM 0.12.0 with torch 2.9.0 is the recommended pairing. Full examples are in the Fun-ASR repo.
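Because the vLLM path depends on a specific version pairing, a quick check of the installed packages can save debugging time. This sketch uses only the standard library's importlib.metadata; RECOMMENDED, installed_version, and check_versions are hypothetical names, and the exact-match comparison is a simplifying assumption (other versions may also work).

```python
from importlib import metadata
from typing import Dict, Optional

# Pairing recommended in this guide.
RECOMMENDED = {"vllm": "0.12.0", "torch": "2.9.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(recommended: Dict[str, str]) -> Dict[str, bool]:
    """Report, per package, whether the installed version matches exactly."""
    report = {}
    for package, wanted in recommended.items():
        found = installed_version(package)
        # Drop a local build tag (the part after "+") before comparing.
        report[package] = found is not None and found.split("+")[0] == wanted
    return report


if __name__ == "__main__":
    for package, ok in check_versions(RECOMMENDED).items():
        found = installed_version(package)
        print(f"{package}: wanted {RECOMMENDED[package]}, found {found}, match={ok}")
```

A False entry does not necessarily mean the setup is broken, only that it differs from the pairing recommended above.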
