Beyond Transcription: Detect Language, Emotion & Audio Events with SenseVoice
Most speech recognition models answer a single question: what was said? But many real-world applications also need to know which language is being spoken, what emotion the speaker is expressing, and whether there is laughter, applause or background music.
SenseVoice does all four in a single non-autoregressive forward pass. Alongside the transcript, it tags each segment with language, emotion and acoustic event — no separate emotion model or language detector required.
Install
pip install -U funasr modelscope
Full, runnable code
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
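
model = AutoModel(
    model="iic/SenseVoiceSmall",  # or FunAudioLLM/SenseVoiceSmall + hub="hf"
    vad_model="fsmn-vad",  # voice activity detection splits long audio into segments
    device="cuda:0",  # use "cpu" if no GPU is available
)

res = model.generate(
    input="audio.wav",
    cache={},
    language="auto",  # auto-detect: zh / en / yue / ja / ko ...
    use_itn=True,  # inverse text normalization: digits and punctuation
    batch_size_s=300,
)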
print(res[0]["text"]) # raw, with tags
print(rich_transcription_postprocess(res[0]["text"])) # rendered
What the raw output looks like
<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 <|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。
The angle-bracket tags in front of each segment are the structured signal:
| Tag | Meaning |
|---|---|
| `<\|zh\|>` | Language (zh / en / yue (Cantonese) / ja / ko …) |
| `<\|SAD\|>` | Emotion: NEUTRAL / HAPPY / SAD / ANGRY / FEARFUL / DISGUSTED / SURPRISED |
| `<\|Speech\|>` | Acoustic event: Speech / BGM / Laughter / Applause … |
| `<\|withitn\|>` | Inverse text normalization applied (digits, punctuation) |
rich_transcription_postprocess() renders these tags as emoji (e.g. 😔 for sadness) for display; for structured analysis, just parse the tags out of the raw text with a regex.
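If neither the tags nor the emoji are wanted (for example when feeding plain text to a search index), the tags can also be removed with a regex. The following is a minimal sketch, not part of funasr: `strip_tags` is a hypothetical helper, and it assumes every tag has the `<|...|>` shape shown in the raw output above.

```python
import re

# Matches any single tag of the form <|...|>, e.g. <|zh|> or <|SAD|>.
TAG = re.compile(r"<\|[^|>]+\|>")


def strip_tags(raw: str) -> str:
    """Hypothetical helper: drop every <|...|> tag and keep only the transcript text."""
    return TAG.sub("", raw).strip()


# Sample taken from the raw output shown above.
raw = (
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 "
    "<|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。"
)
plain = strip_tags(raw)
print(plain)
```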
Parse the tags for emotion analysis
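import re

# One match per segment: language, emotion, event, ITN flag, then the text up to the next tag.
# <|woitn|> appears in place of <|withitn|> when use_itn=False.
pat = re.compile(r"<\|(\w+)\|><\|(\w+)\|><\|(\w+)\|><\|(?:withitn|woitn)\|>([^<]*)")
for lang, emo, event, text in pat.findall(res[0]["text"]):
    print(lang, emo, event, text.strip())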
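For anything beyond printing, it may help to keep the parsed fields in small records and aggregate them. The sketch below is one way to do that under stated assumptions: `Segment` and `parse_segments` are hypothetical names, not funasr APIs, and the pattern also accepts `<|woitn|>`, the tag SenseVoice emits when ITN is switched off. It runs on the sample raw output shown earlier, so no model is needed to try it.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

# Tag layout per segment: language, emotion, event, ITN flag, then the text.
SEGMENT = re.compile(
    r"<\|(\w+)\|><\|(\w+)\|><\|(\w+)\|><\|(?:withitn|woitn)\|>([^<]*)"
)


@dataclass
class Segment:
    """Hypothetical container for one tagged segment (not part of funasr)."""
    lang: str
    emotion: str
    event: str
    text: str


def parse_segments(raw: str) -> List[Segment]:
    """Turn a raw SenseVoice string into a list of Segment records."""
    return [
        Segment(lang, emo, event, text.strip())
        for lang, emo, event, text in SEGMENT.findall(raw)
    ]


# Sample taken from the raw output shown earlier.
raw = (
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 "
    "<|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。"
)
segments = parse_segments(raw)

# Count how often each emotion tag occurs across the recording.
emotion_counts = Counter(seg.emotion for seg in segments)
print(emotion_counts)
```

In a real pipeline, `raw` would be `res[0]["text"]` from `model.generate(...)`.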
Typical use cases
- Call-center QA: detect customer emotion (anger, frustration) while transcribing, auto-flag calls to review.
- Multilingual routing: identify the language first, then dispatch to the right downstream pipeline.
- Video / podcast understanding: spot laughter, applause and music for highlight clipping or chaptering.
- Content moderation: combine emotion + event cues to surface anomalous audio.
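As a rough illustration of the first three use cases, the tags can be reduced to a per-recording summary that drives flagging, routing and highlight detection. This is only a sketch: `summarize` and `needs_review` are hypothetical helpers, the grouping of emotions into `NEGATIVE` and the 0.3 threshold are assumptions chosen for the example, and a production system would need tuning on real calls.

```python
import re
from collections import Counter

# Language, emotion, event, ITN flag (ignored), then the segment text.
TAGS = re.compile(r"<\|(\w+)\|><\|(\w+)\|><\|(\w+)\|><\|\w+\|>([^<]*)")

NEGATIVE = {"ANGRY", "SAD", "FEARFUL", "DISGUSTED"}  # assumed grouping
HIGHLIGHT_EVENTS = {"Laughter", "Applause", "BGM"}   # events worth clipping


def summarize(raw: str) -> dict:
    """Hypothetical helper: reduce a raw SenseVoice string to routing signals."""
    rows = TAGS.findall(raw)
    langs = Counter(lang for lang, _, _, _ in rows)
    negative = sum(1 for _, emo, _, _ in rows if emo in NEGATIVE)
    return {
        # Most frequent language tag, used to pick a downstream pipeline.
        "language": langs.most_common(1)[0][0] if rows else None,
        # Share of segments carrying a negative emotion tag.
        "negative_ratio": negative / len(rows) if rows else 0.0,
        # Non-speech events found anywhere in the recording.
        "events": sorted({ev for _, _, ev, _ in rows if ev in HIGHLIGHT_EVENTS}),
    }


def needs_review(summary: dict, threshold: float = 0.3) -> bool:
    """Flag a recording when enough segments sound negative (threshold is an assumption)."""
    return summary["negative_ratio"] >= threshold


# Sample taken from the raw output shown earlier.
raw = (
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 "
    "<|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。"
)
summary = summarize(raw)
print(summary, needs_review(summary))
```

Counting segments treats a two-second outburst the same as a long monologue; weighting by segment duration would be a natural refinement if timestamps are available.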
Why SenseVoice instead of Whisper
- 4-in-1: Whisper returns text and a detected language only; SenseVoice gives text + language + emotion + event in one call.
- Faster: non-autoregressive, with far higher RTFx than Whisper (benchmark) — real-time even on CPU.
- Stronger on Chinese: significantly lower CER than Whisper.
SenseVoice is Tongyi Lab's open-source multi-task speech understanding model — fast and accurate, especially on Chinese.
Star SenseVoice on GitHub ★