Industrial-Grade
Speech Recognition

Speech recognition, voice activity detection, punctuation restoration, speaker diarization, emotion detection, and audio event recognition — one unified Python API handles it all. 50+ languages, self-hosted, production-ready.

50+ Languages
170x Realtime Speed
1 API Unified Interface
16K+ GitHub Stars
# Start speech recognition service
$ pip install funasr vllm
$ funasr-server --device cuda

# Call with OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8899/v1", api_key="x")
result = client.audio.transcriptions.create(
    model="fun-asr-nano",
    file=open("meeting.wav", "rb")
)
print(result.text)

Core Capabilities

A complete speech understanding pipeline from raw audio to structured output — all in one call

🎙️

Speech Recognition

End-to-end ASR supporting 50+ languages, including 7 Chinese dialects and 26 regional accents, with automatic language detection

📍

Voice Activity Detection

Millisecond-precision VAD with adaptive silence thresholds, accurately segmenting speech from silence

✍️

Punctuation Restoration

Automatically adds punctuation and applies inverse text normalization, producing readable formatted text

👥

Speaker Diarization

Identifies who said what, labeling each sentence with a speaker ID — ideal for meetings and interviews

😊

Emotion Detection

Recognizes emotional states — happy, sad, angry, neutral — for customer service QA and sentiment analysis

🔔

Audio Event Recognition

Detects background music, applause, laughter, crying, and other acoustic events for full scene understanding

How to Use

Three steps: Install → Choose your scenario → Call

$ pip install funasr vllm

Python 3.8+ · GPU 8GB+ · Linux / macOS

File Transcription — Upload audio, get complete results

Ideal for meeting recordings, video subtitles, and batch processing. Automatically includes VAD segmentation, punctuation, timestamps, and speaker labels.

# Start offline transcription service (works after pip install)
$ funasr-server --device cuda --port 8899

# Call (curl)
$ curl -X POST http://localhost:8899/v1/audio/transcriptions \
    -F "file=@meeting.wav" -F "model=fun-asr-nano" -F "response_format=verbose_json"
Output
[00:01.7 → 00:05.5] Speaker 0: Let's discuss the three topics today.
[00:05.8 → 00:08.2] Speaker 1: Sounds good. First one is the Q3 plan.
[00:08.5 → 00:12.1] Speaker 0: Go ahead, we have 30 minutes.
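The timestamped lines above can be rebuilt from a verbose_json response with a few lines of Python. This is a minimal sketch, not the service's own formatter: it assumes the response has the "segments" list with "start", "end", and "text" keys shown in the JSON response example on this page, and the "speaker" key is a hypothetical name, since the exact field that carries the speaker label is not shown here.

```python
from typing import Dict, List


def fmt_time(seconds: float) -> str:
    """Render seconds as MM:SS.d, e.g. 65.3 -> '01:05.3'."""
    tenths = int(round(seconds * 10))
    minutes, rem = divmod(tenths, 600)
    return f"{minutes:02d}:{rem // 10:02d}.{rem % 10}"


def format_segments(response: Dict) -> List[str]:
    """Turn a verbose_json-style dict into '[start → end] Speaker N: text' lines."""
    lines = []
    for seg in response.get("segments", []):
        # "speaker" is an assumed key name; the label is omitted if it is absent.
        speaker = seg.get("speaker")
        label = f"Speaker {speaker}: " if speaker is not None else ""
        lines.append(
            f"[{fmt_time(seg['start'])} → {fmt_time(seg['end'])}] {label}{seg['text']}"
        )
    return lines


if __name__ == "__main__":
    sample = {
        "segments": [
            {"start": 1.7, "end": 5.5, "speaker": 0,
             "text": "Let's discuss the three topics today."},
        ]
    }
    print("\n".join(format_segments(sample)))
```

Keeping the formatting on the client side means the same response can also feed other outputs without another transcription call.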

Real-time Recognition — Speak and see results instantly

For live captions, broadcast transcription, and voice assistants. The WebSocket-based protocol separates confirmed text, which never changes once emitted, from live text, which keeps updating as more audio arrives.

# Streaming requires source code (not in pip yet)
$ git clone https://github.com/modelscope/FunASR.git && cd FunASR
$ python examples/industrial_data_pretraining/fun_asr_nano/serve_realtime_ws.py --port 10095 --language 中文

# Open the built-in browser client
$ open client_mic.html

# Or connect via Python
$ python client_python.py --server ws://localhost:10095 --mic
Real-time output (updates progressively)
[live] Let's discuss the...
[confirmed] Let's discuss the three topics today.
[live] Sounds good first...
[confirmed] Sounds good. First one is the Q3 plan.
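On the client side, the confirmed/live split can be handled by keeping two pieces of state: an append-only list of confirmed sentences and a single live buffer that is overwritten on every update. The sketch below is illustrative only. The wire format of serve_realtime_ws.py is not shown on this page, so the "live" and "confirmed" message kinds simply mirror the labels in the sample output, and the LiveTranscript class is a hypothetical helper rather than part of FunASR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LiveTranscript:
    """Tracks confirmed sentences plus the current, still-changing live text."""
    confirmed: List[str] = field(default_factory=list)
    live: str = ""

    def update(self, kind: str, text: str) -> None:
        # Message kinds are assumptions modeled on the [live]/[confirmed] labels.
        if kind == "confirmed":
            self.confirmed.append(text)  # confirmed text is never rewritten
            self.live = ""               # the live buffer starts over
        elif kind == "live":
            self.live = text             # live text replaces the previous guess
        else:
            raise ValueError(f"unknown message kind: {kind!r}")

    def render(self) -> str:
        """Full caption: all confirmed sentences followed by the live tail."""
        parts = self.confirmed + ([self.live] if self.live else [])
        return " ".join(parts)


if __name__ == "__main__":
    transcript = LiveTranscript()
    messages = [
        ("live", "Let's discuss the..."),
        ("confirmed", "Let's discuss the three topics today."),
        ("live", "Sounds good first..."),
    ]
    for kind, text in messages:
        transcript.update(kind, text)
    print(transcript.render())
```

Because only the live buffer is ever replaced, a caption display built this way never has to redraw text the viewer has already read.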

API Integration — OpenAI-compatible, zero-code changes for AI frameworks

Standard /v1/audio/transcriptions endpoint. LangChain, AutoGen, Dify, and Coze can connect directly without any code modifications.

# Start OpenAI-compatible API
$ funasr-server --device cuda

# Python (identical to OpenAI Whisper API)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8899/v1", api_key="x")

result = client.audio.transcriptions.create(
    model="fun-asr-nano",
    file=open("audio.wav", "rb"),
    response_format="verbose_json"
)
print(result.text)
JSON Response
{"text": "Let's discuss the three topics today.", "segments": [{"start": 1.7, "end": 5.5, "text": "..."}], "duration": 12.1}
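Since each segment in the response carries start and end times, the same JSON can be turned into subtitles. The following is a rough sketch that converts a response of the shape shown above into SubRip (SRT) text. It assumes only the "segments", "start", "end", and "text" keys from that example; srt_time and to_srt are hypothetical helper names, not FunASR functions.

```python
from typing import Dict


def srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(response: Dict) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(response["segments"], start=1):
        cues.append(
            f"{index}\n"
            f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    sample = {
        "text": "Let's discuss the three topics today.",
        "segments": [
            {"start": 1.7, "end": 5.5,
             "text": "Let's discuss the three topics today."},
        ],
        "duration": 12.1,
    }
    print(to_srt(sample))
```

With the OpenAI SDK, the result object would first need converting to a plain dict (for example with its model_dump() method) before being passed to to_srt.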

Explore More

Performance

184 files / 11,541 seconds / Fun-ASR-Nano

Model          Engine            RTFx   CER      Notes
Fun-ASR-Nano   PyTorch           21     8.06%    Baseline
Fun-ASR-Nano   vLLM batch        340    8.20%    16x speedup
Fun-ASR-Nano   Offline service   102    8.14%    Incl. VAD + timestamps
GLM-ASR-Nano   vLLM batch        265    12.93%   Community model

Accuracy stays close to the PyTorch baseline (CER delta < 0.2%), while vLLM batch inference raises throughput from 21x to 340x real time, a 16x speedup. Full report →
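To make the throughput figures concrete, the arithmetic can be worked through for the 11,541-second test set. This sketch assumes RTFx means audio duration divided by processing time (the usual inverse real-time factor), and it reads the Fun-ASR-Nano RTFx values as 21, 340, and 102 from the table; the wall-clock times it prints are derived estimates, not measurements from the report.

```python
AUDIO_SECONDS = 11_541  # total audio in the benchmark set, from the header line

# RTFx values as read from the Fun-ASR-Nano rows of the table (assumed parsing).
RTFX = {"PyTorch": 21, "vLLM batch": 340, "Offline service": 102}


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimated wall-clock time, assuming RTFx = audio duration / processing time."""
    return audio_seconds / rtfx


def speedup(rtfx: float, baseline_rtfx: float) -> float:
    """How many times faster an engine is than the baseline engine."""
    return rtfx / baseline_rtfx


if __name__ == "__main__":
    for engine, rtfx in RTFX.items():
        seconds = processing_seconds(AUDIO_SECONDS, rtfx)
        ratio = speedup(rtfx, RTFX["PyTorch"])
        print(f"{engine}: ~{seconds:.0f} s, {ratio:.1f}x vs PyTorch")
```

Under these assumptions the PyTorch baseline needs roughly nine minutes for the full set and vLLM batch about half a minute, which is where the 16x figure in the table comes from (340 / 21 ≈ 16.2).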

Product Demo

Watch FunASR real-time speech recognition in action