SenseVoice Deployment Guide: Multilingual Speech Recognition 15× Faster Than Whisper

FunASR Blog · 2026-06-16 · Tutorial

SenseVoice is an open-source multilingual speech understanding model from the FunAudioLLM team. Beyond automatic speech recognition (ASR), a single forward pass also returns spoken-language identification, emotion recognition, and audio-event detection. Its non-autoregressive architecture makes it roughly 15× faster than Whisper-Large, with latency low enough for both real-time and high-throughput batch workloads.

The steps below are copy-paste ready: install, transcribe, read the tagged output, then scale to long audio and pick a device.

1. Install

pip install -U funasr torch torchaudio

2. Transcribe in three lines

This downloads SenseVoiceSmall automatically and runs recognition:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")  # use device="cpu" for CPU
res = model.generate(input="audio.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))

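Model IDs: iic/SenseVoiceSmall on ModelScope, FunAudioLLM/SenseVoiceSmall on Hugging Face. ModelScope is the default hub; to download from Hugging Face instead, pass hub="hf" to AutoModel along with the Hugging Face model ID. use_itn=True adds punctuation and inverse text normalization (spoken-form numbers and dates are written out in their usual written form).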

3. Reading the output: language + emotion + audio event

The raw output carries rich-text tags, for example:

<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model.

<|en|> — language (zh, en, yue, ja, ko)
<|NEUTRAL|> — emotion (HAPPY / SAD / ANGRY / NEUTRAL ...)
<|Speech|> — audio event (Speech / BGM / Applause / Laughter ...)

Call rich_transcription_postprocess() for clean text, or read res[0]["text"] directly to keep the tags for downstream analysis.
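If you keep the tags, a small parser can turn them into structured fields. The following is a minimal sketch, assuming the four-tag prefix order shown in the example above (language, emotion, audio event, ITN flag). `parse_tagged` and its field names are hypothetical helpers for illustration, not part of FunASR.

```python
import re

# Matches one leading group of four <|...|> tags followed by the transcript.
# Assumption: tags appear in the order language, emotion, event, ITN flag,
# as in the example output above.
TAGGED = re.compile(
    r"^<\|([^|]+)\|><\|([^|]+)\|><\|([^|]+)\|><\|([^|]+)\|>(.*)$", re.S
)


def parse_tagged(raw: str) -> dict:
    """Split one raw SenseVoice string into its tags and plain text."""
    match = TAGGED.match(raw)
    if match is None:
        # No recognizable tag prefix: return the text unchanged.
        return {"language": None, "emotion": None, "event": None,
                "itn": None, "text": raw}
    language, emotion, event, itn_flag, text = match.groups()
    return {
        "language": language,
        "emotion": emotion,
        "event": event,
        "itn": itn_flag == "withitn",  # True when ITN was applied
        "text": text.strip(),
    }


sample = "<|en|><|NEUTRAL|><|Speech|><|withitn|>Welcome to the speech recognition model."
print(parse_tagged(sample))
```

In real use you would pass res[0]["text"] instead of the hard-coded sample string.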

4. Long audio: add VAD segmentation

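model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",  # voice activity detection splits long audio into segments
    vad_kwargs={"max_single_segment_time": 30000},  # cap each segment at 30 s (value is in ms)
    device="cuda:0",
)
# batch_size_s=60 enables dynamic batching: up to 60 s of audio in total per batch
res = model.generate(input="long_audio.wav", language="auto", use_itn=True, batch_size_s=60)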
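With VAD enabled, one file yields several segments, and the raw text then appears to repeat the tag prefix before each segment. As a sketch under that assumption, the hypothetical helper below (not part of FunASR) splits such a string into per-segment records and tallies the emotions, using only the standard library. The sample string is made up for illustration.

```python
import re
from collections import Counter

# One tag group (language, emotion, event, ITN flag) followed by text that
# runs lazily until the next "<|" tag opener or the end of the string.
SEGMENT = re.compile(
    r"<\|([^|]+)\|><\|([^|]+)\|><\|([^|]+)\|><\|([^|]+)\|>(.*?)(?=<\||\Z)", re.S
)


def split_segments(raw: str) -> list[dict]:
    """Return one record per tagged segment found in the raw text."""
    return [
        {"language": lang, "emotion": emo, "event": event, "text": text.strip()}
        for lang, emo, event, _itn, text in SEGMENT.findall(raw)
    ]


# Illustrative input only; real input would be res[0]["text"].
raw = (
    "<|en|><|NEUTRAL|><|Speech|><|withitn|>Hello there. "
    "<|en|><|HAPPY|><|Laughter|><|withitn|>That was great. "
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>你好。"
)
segments = split_segments(raw)
emotion_counts = Counter(seg["emotion"] for seg in segments)
for seg in segments:
    print(seg)
print(emotion_counts)
```

Keeping per-segment records rather than one flattened string makes it straightforward to spot, for example, where the language or emotion changes within a recording.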

5. GPU vs CPU

Scenario	Recommendation
High concurrency / batch	GPU (`device="cuda:0"`)
Lightweight / edge / offline	CPU (`device="cpu"`) — non-autoregressive, fast on CPU too
Production service	FunASR runtime (C++ / Docker, multi-concurrency)
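The device can also be chosen at runtime instead of being hard-coded. A minimal sketch: `pick_device` is a hypothetical helper that returns "cuda:0" when PyTorch reports a usable CUDA device and falls back to "cpu" otherwise, including when PyTorch is not installed. `torch.cuda.is_available()` is the standard PyTorch check.

```python
def pick_device() -> str:
    """Return "cuda:0" if a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# Usage with the model from step 2 (requires funasr to be installed):
# model = AutoModel(model="iic/SenseVoiceSmall", device=device)
```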

6. At a glance

Dimension	SenseVoice-Small
Speed vs Whisper-Large	~15×
Languages	Chinese / Cantonese / English / Japanese / Korean ...
Extra capabilities	Language ID · Emotion · Audio events
Architecture	Non-autoregressive (low latency)
License	Open source (commercial use allowed)

Get started with SenseVoice

If this guide helped, please star the project on GitHub ⭐ to support open source.
