Speech Emotion Recognition in Python — Language ID & Audio Events in One Model (SenseVoice)

2026-06-19 · FunASR Team

Most speech-recognition (ASR) models give you only text. Real applications often need more: What emotion is the speaker expressing? What language is being spoken? Is the audio clean speech, or is there music in the background? Whisper covers transcription and language detection, but for the rest you have to stack more models on top: a separate emotion model plus an audio-event classifier. That stack is slow and hard to maintain.

SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) returns all of it in one non-autoregressive forward pass: the transcript plus spoken-language ID, emotion recognition, audio-event detection, and inverse text normalization (ITN). The project reports that the Small model is roughly 15× faster than Whisper-Large.

What one inference gives you

Capability	Detail
ASR	50+ languages, leading Chinese accuracy
Language ID	auto-detects `zh / en / ja / ko / yue …`
Emotion	HAPPY 😊 / SAD 😔 / ANGRY 😡 / NEUTRAL / FEARFUL / DISGUSTED / SURPRISED
Audio events	Speech / BGM 🎵 / Applause 👏 / Laughter / Cry
Inverse text norm (ITN)	“nine o'clock”→“9:00”, “fifty”→“50”

Three lines of Python

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
res = model.generate(input="audio.wav", cache={}, language="auto", use_itn=True)

print(res[0]["text"])                                    # raw, with tags
print(rich_transcription_postprocess(res[0]["text"]))    # cleaned text
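To run the same call over several files, a minimal sketch could loop over the paths and collect the raw tagged outputs. `transcribe_files` is a hypothetical helper, not part of FunASR. It assumes only what the snippet above shows: that `model.generate(...)` accepts these arguments and returns a list whose first item has a `"text"` key.

```python
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: map each audio path to its raw tagged output.

    Assumes `model` behaves like the AutoModel instance created above,
    i.e. model.generate(...) returns a list whose first item has a "text" key.
    """
    results: Dict[str, str] = {}
    for path in paths:
        res = model.generate(input=path, cache={}, language="auto", use_itn=True)
        # Store an empty string if the model returned nothing for this file.
        results[path] = res[0]["text"] if res else ""
    return results
```

With the model loaded as above, `transcribe_files(model, ["a.wav", "b.wav"])` would return a dict keyed by file path (the file names here are placeholders).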

Real output: 5 languages auto-detected

Here is the raw output for SenseVoice's bundled multilingual samples (zh/en/ja/ko/yue). Each line starts with tags for language, emotion, audio event, and ITN mode:

<|zh|><|NEUTRAL|><|Speech|><|withitn|>开放时间早上9点至下午5点。
<|en|><|NEUTRAL|><|Speech|><|withitn|>The tribal chieftain called for the boy and presented him with 50 pieces of gold.
<|ja|><|NEUTRAL|><|Speech|><|withitn|>うちの中学は弁当制で持っていけない場合は、50 円の学校販売のパンを買う。
<|ko|><|NEUTRAL|><|Speech|><|withitn|>조금만 생각을 하면서 살면 훨씬 편할 거야.
<|yue|><|NEUTRAL|><|Speech|><|withitn|>呢几个字都表达唔到，我想讲嘅意思。

Emotion & audio events on real-world audio

We ran SenseVoice over 60 real-world web clips. 56 of them were correctly tagged BGM (background music present), which is exactly the case where Whisper tends to force a transcription and hallucinate. HAPPY, ANGRY, and NEUTRAL emotions were also detected. For example, one clip was tagged ANGRY:

<|zh|><|ANGRY|><|Speech|><|withitn|>哎,不要看不起那些理想主义者,你脚下的每一步都是他们走出来的。

Full tag sets — emotion: HAPPY / SAD / ANGRY / NEUTRAL / FEARFUL / DISGUSTED / SURPRISED; events: Speech / BGM / Applause / Laughter / Cry.
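A tally like the one above can be sketched with a `Counter` over the raw outputs. This is an illustrative sketch, not the script used for the 60-clip run: `tally_tags` is a hypothetical helper, the tag sets are the ones listed in this post, and the third sample string is made up to stand in for a music-only clip.

```python
import re
from collections import Counter

EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}
EVENTS = {"Speech", "BGM", "Applause", "Laughter", "Cry"}
TAG_RE = re.compile(r"<\|([^|]+)\|>")


def tally_tags(raw_outputs):
    """Count emotion and event tags across a list of raw tagged outputs."""
    emotions, events = Counter(), Counter()
    for raw in raw_outputs:
        tags = TAG_RE.findall(raw)
        emotions.update(t for t in tags if t in EMOTIONS)
        events.update(t for t in tags if t in EVENTS)
    return emotions, events


samples = [
    "<|zh|><|ANGRY|><|Speech|><|withitn|>哎,不要看不起那些理想主义者,你脚下的每一步都是他们走出来的。",
    "<|en|><|NEUTRAL|><|Speech|><|withitn|>The tribal chieftain called for the boy and presented him with 50 pieces of gold.",
    "<|zh|><|NEUTRAL|><|BGM|><|withitn|>",  # made-up example of a music-only clip
]

emotions, events = tally_tags(samples)
print(emotions)
print(events)

# Clips tagged BGM can be set aside before any downstream text processing.
music_clips = [s for s in samples if "BGM" in TAG_RE.findall(s)]
print(len(music_clips))
```

Filtering on the event tag first is one way to avoid treating a music-only clip as if it contained a usable transcript.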

Parsing the tags

import re

EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}
EVENTS = {"Speech", "BGM", "Applause", "Laughter", "Cry"}

raw = res[0]["text"]                                   # raw output from model.generate()
tags = re.findall(r"<\|([^|]+)\|>", raw)               # e.g. ['zh', 'NEUTRAL', 'Speech', 'withitn']
language = tags[0] if tags else None                   # first tag is the language, e.g. 'zh'
emotion = next((t for t in tags if t in EMOTIONS), None)
event = next((t for t in tags if t in EVENTS), None)
text = re.sub(r"<\|[^|]+\|>", "", raw).strip()         # transcript with all tags removed
print(language, emotion, event, text)
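If the parsed fields are needed in more than one place, the same logic can be wrapped in a small function that returns a typed record. The sketch below is one possible shape, not a FunASR API: `ParsedResult` and `parse_raw` are hypothetical names, and like the snippet above it assumes the first tag is the language.

```python
import re
from dataclasses import dataclass
from typing import Optional

EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}
EVENTS = {"Speech", "BGM", "Applause", "Laughter", "Cry"}
TAG_RE = re.compile(r"<\|([^|]+)\|>")


@dataclass
class ParsedResult:
    language: Optional[str]
    emotion: Optional[str]
    event: Optional[str]
    text: str


def parse_raw(raw: str) -> ParsedResult:
    """Split one raw tagged output into its tags and the clean transcript."""
    tags = TAG_RE.findall(raw)
    return ParsedResult(
        language=tags[0] if tags else None,  # assumes the language tag comes first
        emotion=next((t for t in tags if t in EMOTIONS), None),
        event=next((t for t in tags if t in EVENTS), None),
        text=TAG_RE.sub("", raw).strip(),
    )


parsed = parse_raw(
    "<|en|><|NEUTRAL|><|Speech|><|withitn|>"
    "The tribal chieftain called for the boy and presented him with 50 pieces of gold."
)
print(parsed.language, parsed.emotion, parsed.event)
print(parsed.text)
```

Returning `None` for a missing tag, instead of raising, keeps the function usable on outputs that carry no tags at all.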

vs Whisper

Capability	SenseVoice	Whisper
Transcript + language	✅ one call	✅
Emotion recognition	✅ built-in	❌ needs extra model
Audio events (BGM/applause/laughter)	✅ built-in	❌
Inverse text normalization	✅ built-in	partial
Speed	non-autoregressive, ~15× faster	autoregressive baseline

When you need to understand the audio and not just transcribe it, one SenseVoice model replaces a language-detector + emotion-model + event-classifier stack.

The whole FunASR stack is open-source — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR. If it helps, a GitHub Star really supports the project 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip