SenseVoice Deployment Guide: Multilingual Speech Recognition 15× Faster Than Whisper

SenseVoice is an open-source multilingual speech understanding model from the FunAudioLLM team. Beyond automatic speech recognition (ASR), a single forward pass also returns spoken-language identification, emotion recognition, and audio-event detection. Because SenseVoice-Small is non-autoregressive, it runs roughly 15× faster than Whisper-Large, with latency low enough for both real-time and high-throughput batch workloads.

Here is a copy-paste guide to get SenseVoice running: install, transcribe, read the tagged output, then handle long audio and pick a device.

1. Install

pip install -U funasr torch torchaudio
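
Before moving on, it can help to confirm that the three packages are actually importable in the active environment. The sketch below uses only the standard library; the helper name missing_packages is hypothetical and not part of FunASR.

```python
from importlib.util import find_spec


def missing_packages(names=("funasr", "torch", "torchaudio")):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found.")
```

An empty list means the install step succeeded for this interpreter. It does not check versions or GPU support.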

2. Transcribe in three lines

This downloads SenseVoiceSmall automatically and runs recognition:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")  # use device="cpu" for CPU
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))

Model IDs: iic/SenseVoiceSmall on ModelScope, FunAudioLLM/SenseVoiceSmall on HuggingFace. use_itn=True adds punctuation and inverse text normalization.

3. Reading the output: language + emotion + audio event

The raw output carries rich-text tags, for example:

<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model.

Call rich_transcription_postprocess() for clean text, or read res[0]["text"] directly to keep the tags for downstream analysis.
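
If you keep the tags, a small parser makes them easier to use downstream. The following is a minimal sketch, assuming the tags lead the string in the order shown in the example above (language, emotion, audio event, ITN flag). The function name parse_rich_text and the dictionary keys are illustrative choices, not part of FunASR.

```python
import re

# Matches one leading tag such as <|en|> and captures its content.
TAG = re.compile(r"<\|([^|]*)\|>")

# Assumed tag order, taken from the example output above.
KEYS = ("language", "emotion", "event", "itn")


def parse_rich_text(raw):
    """Split the leading <|...|> tags from the transcript text."""
    tags = []
    pos = 0
    while True:
        match = TAG.match(raw, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    result = dict(zip(KEYS, tags))  # fewer tags simply yields fewer keys
    result["text"] = raw[pos:].strip()
    return result


if __name__ == "__main__":
    sample = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model."
    print(parse_rich_text(sample))
```

This only reads tags at the start of the string. If your output contains tags later in the text, inspect res[0]["text"] and adapt the parser accordingly.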

4. Long audio: add VAD segmentation

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)
res = model.generate(input="long_audio.wav", language="auto", use_itn=True, batch_size_s=60)
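
The two numeric settings above use different units, which is easy to get wrong: as I understand the FunASR documentation, max_single_segment_time is in milliseconds (30000 = 30 s per VAD segment), while batch_size_s is the total audio duration in seconds per dynamic batch. The sketch below builds both settings from values in seconds. The helper long_audio_kwargs is hypothetical, and the units should be checked against your installed FunASR version.

```python
def long_audio_kwargs(max_segment_s=30.0, batch_s=60):
    """Build keyword arguments for VAD-segmented transcription from values in seconds."""
    model_kwargs = {
        "vad_model": "fsmn-vad",
        # Assumed unit: milliseconds, so convert from seconds.
        "vad_kwargs": {"max_single_segment_time": int(max_segment_s * 1000)},
    }
    # Assumed unit: seconds of audio per dynamic batch.
    generate_kwargs = {"batch_size_s": batch_s}
    return model_kwargs, generate_kwargs


if __name__ == "__main__":
    model_kwargs, generate_kwargs = long_audio_kwargs()
    print(model_kwargs)
    print(generate_kwargs)
    # Intended use, mirroring the call above (requires funasr to be installed):
    # model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0", **model_kwargs)
    # res = model.generate(input="long_audio.wav", language="auto", use_itn=True, **generate_kwargs)
```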

5. GPU vs CPU

Scenario                      Recommendation
High concurrency / batch      GPU (device="cuda:0")
Lightweight / edge / offline  CPU (device="cpu"); non-autoregressive, so fast on CPU too
Production service            FunASR runtime (C++ / Docker, multi-concurrency)
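
If the same script has to run on both GPU and CPU machines, the device string can be chosen at startup instead of hard-coded. This is a minimal sketch: pick_device is a hypothetical helper, and it falls back to "cpu" when torch is missing or reports no CUDA device.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda:0" when a CUDA GPU is usable, otherwise "cpu"."""
    if prefer_gpu:
        try:
            import torch  # imported lazily so the helper also works without torch
            if torch.cuda.is_available():
                return "cuda:0"
        except ImportError:
            pass
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print("Using device:", device)
    # Intended use (requires funasr to be installed):
    # model = AutoModel(model="iic/SenseVoiceSmall", device=device)
```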

6. At a glance

Dimension               SenseVoice-Small
Speed vs Whisper-Large  ~15×
Languages               Chinese / Cantonese / English / Japanese / Korean ...
Extra capabilities      Language ID · Emotion · Audio events
Architecture            Non-autoregressive (low latency)
License                 Open source (commercial use allowed)

Get started with SenseVoice

If this guide helped, please star the project on GitHub ⭐ to support open source.
