Beyond Transcription: Detect Language, Emotion & Audio Events with SenseVoice

Most speech recognition models answer a single question: what was said. Many real-world applications also need to know which language is being spoken, what emotion the speaker is expressing, and whether the audio contains laughter, applause or background music.

SenseVoice does all four in a single non-autoregressive forward pass. Alongside the transcript, it tags each segment with language, emotion and acoustic event, so no separate emotion model or language detector is required.

Install

pip install -U funasr modelscope

Full, runnable code

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

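model = AutoModel(
    model="iic/SenseVoiceSmall",   # or FunAudioLLM/SenseVoiceSmall + hub="hf"
    vad_model="fsmn-vad",          # VAD splits long audio into shorter segments
    device="cuda:0",               # use "cpu" if no GPU is available
)

res = model.generate(
    input="audio.wav",
    cache={},
    language="auto",   # auto-detect, or force one of zh / en / yue / ja / ko ...
    use_itn=True,      # apply inverse text normalization (digits, punctuation)
    batch_size_s=300,  # dynamic batching: total audio seconds per batch
)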

print(res[0]["text"])                              # raw, with tags
print(rich_transcription_postprocess(res[0]["text"]))  # rendered
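
To run the same call over several files, one option is a small wrapper. This is only a sketch: `transcribe_files` is a hypothetical helper (not part of FunASR), it assumes a `model` object like the one built above, and it reuses exactly the `generate()` arguments from the example. It also assumes, as the example does, that the transcript sits at `res[0]["text"]`.

```python
from typing import Any, Dict, Iterable


def transcribe_files(model: Any, paths: Iterable[str], language: str = "auto") -> Dict[str, str]:
    """Map each audio path to its raw, tagged transcript.

    Hypothetical helper: `model` is assumed to behave like the AutoModel
    instance above, i.e. generate() returns a list whose first item has "text".
    """
    raw_by_path = {}
    for path in paths:
        res = model.generate(
            input=path,
            cache={},
            language=language,
            use_itn=True,
            batch_size_s=300,
        )
        raw_by_path[path] = res[0]["text"]
    return raw_by_path
```

Called as `transcribe_files(model, ["a.wav", "b.wav"])`, it returns a dict keyed by path, which keeps the raw tags available for later parsing.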

What the raw output looks like

<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 <|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。

The angle-bracket tags in front of each segment are the structured signal:

Tag            Meaning
<|zh|>         Language: zh / en / yue (Cantonese) / ja / ko …
<|SAD|>        Emotion: NEUTRAL / HAPPY / SAD / ANGRY / FEARFUL / DISGUSTED / SURPRISED
<|Speech|>     Acoustic event: Speech / BGM / Laughter / Applause
<|withitn|>    Inverse text normalization applied (digits, punctuation)

rich_transcription_postprocess() renders these tags as emoji (e.g. 😔 for sadness) for display; for structured analysis, just parse the tags out of the raw text with a regex.
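
If neither emoji nor tags are wanted, for example when feeding the transcript to a search index, the tags can simply be stripped. The following is a minimal sketch using only the standard library; `strip_tags` is a hypothetical helper name, and the sample string is the raw output shown above.

```python
import re

# Matches any <|...|> control tag such as <|zh|> or <|SAD|>.
TAG = re.compile(r"<\|[^|<>]*\|>")


def strip_tags(raw: str) -> str:
    """Remove all <|...|> tags and collapse the leftover whitespace."""
    return " ".join(TAG.sub(" ", raw).split())


raw = (
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 "
    "<|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。"
)
print(strip_tags(raw))
```

This discards the emotion and event information entirely, so it is only appropriate when the plain transcript is all that is needed.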

Parse the tags for emotion analysis

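import re

# One match per segment: language, emotion, event, then the text up to the next tag.
# Accepts <|withitn|> as well as <|woitn|> (emitted when use_itn=False).
pat = re.compile(r"<\|(\w+)\|><\|(\w+)\|><\|(\w+)\|><\|(?:withitn|woitn)\|>([^<]*)")
for lang, emo, event, text in pat.findall(res[0]["text"]):
    print(lang, emo, event, text.strip())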
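
For anything beyond printing, it can help to collect the matches into records and aggregate them. The sketch below builds on the same tag order as the regex above and accepts both `<|withitn|>` and `<|woitn|>`; the `Segment` class and `parse_segments` function are hypothetical names introduced here, and the sample input is the raw output shown earlier.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List

# Language, emotion, event, ITN marker, then the text up to the next tag.
SEGMENT = re.compile(r"<\|(\w+)\|><\|(\w+)\|><\|(\w+)\|><\|(?:withitn|woitn)\|>([^<]*)")


@dataclass
class Segment:
    language: str
    emotion: str
    event: str
    text: str


def parse_segments(raw: str) -> List[Segment]:
    """Turn a raw tagged transcript into one Segment per tag group."""
    return [
        Segment(lang, emo, event, text.strip())
        for lang, emo, event, text in SEGMENT.findall(raw)
    ]


# Sample taken from the raw output shown earlier in this post.
raw = (
    "<|zh|><|NEUTRAL|><|Speech|><|withitn|>关于春秋战国的一大区别。 "
    "<|zh|><|SAD|><|Speech|><|withitn|>在白宫出丑闻还蹦迪,嘿嘿。"
)
segments = parse_segments(raw)
emotion_counts = Counter(s.emotion for s in segments)
print(emotion_counts)  # Counter({'NEUTRAL': 1, 'SAD': 1})
```

From here, filtering is a list comprehension away, for instance `[s for s in segments if s.emotion == "SAD"]` to pull out only the sad segments.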

Typical use cases

Why SenseVoice instead of Whisper

SenseVoice is Tongyi Lab's open-source multi-task speech understanding model. It is fast and accurate, especially on Chinese, and it returns language, emotion and acoustic-event tags in the same pass as the transcript.
