SenseVoice Deployment Guide: Multilingual Speech Recognition 15× Faster Than Whisper
SenseVoice is an open-source multilingual speech understanding model from the FunAudioLLM team. Beyond automatic speech recognition (ASR), a single forward pass also returns spoken-language identification, emotion recognition, and audio-event detection. Its non-autoregressive architecture makes it roughly 15× faster than Whisper-Large, with latency low enough for both real-time and high-throughput batch workloads.
The steps below are copy-paste ready: install, transcribe, and read the output, followed by notes on long audio and hardware choice.
1. Install
pip install -U funasr torch torchaudio
2. Transcribe in three lines
This downloads SenseVoiceSmall automatically and runs recognition:
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")  # use device="cpu" for CPU
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
Model IDs: iic/SenseVoiceSmall on ModelScope, FunAudioLLM/SenseVoiceSmall on HuggingFace. use_itn=True adds punctuation and inverse text normalization.
3. Reading the output: language + emotion + audio event
The raw output carries rich-text tags, for example:
<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model.
<|en|>: language (zh, en, yue, ja, ko)
<|NEUTRAL|>: emotion (HAPPY / SAD / ANGRY / NEUTRAL ...)
<|Speech|>: audio event (Speech / BGM / Applause / Laughter ...)
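Call rich_transcription_postprocess() for readable text with the tags removed (in current FunASR releases, non-neutral emotions and non-speech events appear to be rendered as emoji rather than dropped), or read res[0]["text"] directly to keep the tags for downstream analysis.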
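If you keep the raw tags, a small parser makes them easier to use downstream. The following is a minimal sketch, not part of FunASR: `parse_rich_text` is a hypothetical helper, and the label sets are only the examples listed in this guide, so extend them to match what your model actually emits.

```python
import re

# Matches one rich-text tag such as <|en|> or <|NEUTRAL|>.
TAG = re.compile(r"<\|([^|]*)\|>")

# Label sets copied from the examples in this guide (assumption: the
# model may emit other labels; unknown tags are simply ignored here).
LANGUAGES = {"zh", "en", "yue", "ja", "ko"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL"}
EVENTS = {"Speech", "BGM", "Applause", "Laughter"}


def parse_rich_text(raw: str) -> dict:
    """Split a raw SenseVoice string into its tags and the plain text.

    Hypothetical helper: returns the first tag of each kind, or None
    when no tag from the corresponding label set is present.
    """
    tags = TAG.findall(raw)

    def first(labels):
        return next((t for t in tags if t in labels), None)

    return {
        "language": first(LANGUAGES),
        "emotion": first(EMOTIONS),
        "event": first(EVENTS),
        "text": TAG.sub("", raw).strip(),  # drop every tag, keep the words
    }


raw = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model."
print(parse_rich_text(raw))
```

Note that a string covering several segments (as can happen with long audio) may contain more than one group of tags; this sketch reports only the first tag of each kind.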
4. Long audio: add VAD segmentation
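from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",  # voice activity detection splits long audio into segments
    vad_kwargs={"max_single_segment_time": 30000},  # cap each segment at 30 s (value in ms)
    device="cuda:0",
)
res = model.generate(input="long_audio.wav", language="auto", use_itn=True, batch_size_s=60)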
5. GPU vs CPU
| Scenario | Recommendation |
|---|---|
| High concurrency / batch | GPU (device="cuda:0") |
| Lightweight / edge / offline | CPU (device="cpu") — non-autoregressive, fast on CPU too |
| Production service | FunASR runtime (C++ / Docker, multi-concurrency) |
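
If the same script has to run on machines with and without a GPU, the device string can be chosen at runtime. This is a minimal sketch under that assumption: `pick_device` is a hypothetical helper, not a FunASR function, and it relies only on `torch.cuda.is_available()`.

```python
def pick_device(preferred_gpu: str = "cuda:0") -> str:
    """Return preferred_gpu when CUDA is usable, otherwise "cpu".

    Hypothetical helper for illustration; it falls back to "cpu" when
    PyTorch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return preferred_gpu if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result can then be passed as the device argument, for example AutoModel(model="iic/SenseVoiceSmall", device=device).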
6. At a glance
| Dimension | SenseVoice-Small |
|---|---|
| Speed vs Whisper-Large | ~15× |
| Languages | Chinese / Cantonese / English / Japanese / Korean ... |
| Extra capabilities | Language ID · Emotion · Audio events |
| Architecture | Non-autoregressive (low latency) |
| License | Open source (commercial use allowed) |
Get started with SenseVoice
If this guide helped, please star the project on GitHub ⭐ to support open source.