Japanese Speech Recognition in Python — SenseVoice Does Transcription + Punctuation + Emotion in One Pass (vs Whisper)

For Japanese speech-to-text, most people reach for Whisper. But if you want something that runs on CPU, is fast and accurate, and ships punctuation and emotion out of the box, SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) is an underrated option. It natively supports Japanese (ja), auto-detects the language, and produces punctuated text in a single non-autoregressive pass, with emotion and audio-event tags included in the same output.

Below is a real side-by-side on the same Japanese clip between SenseVoice and the same-tier Whisper-small (both run on our server, nothing cherry-picked).

Real comparison: the same Japanese audio

The test clip is a piece of colloquial Japanese speech. Actual outputs from both models:

| Model | Detected language | Output |
|---|---|---|
| SenseVoice | auto → ja | 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。 |
| Whisper-small | ja | 供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど |

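The key difference is the word 転売: SenseVoice transcribes it correctly, while Whisper-small writes 天売 instead. SenseVoice's output also carries punctuation (、 and 。), which the Whisper-small output lacks.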

To be fair about the setup: this compares SenseVoice against the same-tier Whisper-small (a common CPU-friendly choice). The larger Whisper-large would usually fix this kind of homophone error, but it is far bigger and slower; SenseVoice gets the word right at a small-model footprint and adds punctuation and emotion on top.
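To locate where two transcripts actually disagree, it helps to strip spaces and punctuation first and then diff the remaining characters. The sketch below does this with Python's standard difflib on the two outputs quoted above; the helper names normalize and char_diffs are illustrative, not part of FunASR or Whisper.

```python
import difflib
from typing import List, Tuple

def normalize(text: str) -> str:
    """Drop ASCII/full-width spaces and the Japanese marks 、 and 。."""
    return "".join(ch for ch in text if ch not in " \u3000、。")

def char_diffs(a: str, b: str) -> List[Tuple[str, str]]:
    """Return (segment_in_a, segment_in_b) for every non-matching span."""
    a_n, b_n = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(None, a_n, b_n, autojunk=False)
    return [
        (a_n[i1:i2], b_n[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# The two outputs from the comparison table above.
sensevoice = "供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。"
whisper_small = "供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど"

print(char_diffs(sensevoice, whisper_small))
# [('転', '天')]
```

Once punctuation and spacing are ignored, the two transcripts differ in a single character, which is exactly the 転売 / 天売 substitution.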

Japanese ASR in three lines

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

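# Load SenseVoiceSmall (disable_update=True skips FunASR's version-update check).
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
# language="auto" turns on language ID; use_itn=True adds punctuation and inverse text normalization.
res = model.generate(input="japanese.wav", language="auto", use_itn=True)

# Strip the leading tags and print the plain transcript.
print(rich_transcription_postprocess(res[0]["text"]))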
# 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。

Set language to "ja" explicitly, or use "auto" as above for automatic language identification. On this clip, "auto" correctly resolves to ja and the output is identical to the explicit setting.
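To check on your own audio whether "auto" and "ja" agree, you can run the clip both ways and compare. This is a minimal sketch: transcribe_auto_and_ja is a hypothetical helper name, and it assumes only the model.generate(input=..., language=..., use_itn=...) call and the res[0]["text"] result shape used in the snippet above.

```python
from typing import Any, Dict

def transcribe_auto_and_ja(model: Any, wav_path: str) -> Dict[str, str]:
    """Run one clip with language="auto" and language="ja"; return both raw texts."""
    outputs: Dict[str, str] = {}
    for lang in ("auto", "ja"):
        # Same call as in the three-line example, varying only `language`.
        res = model.generate(input=wav_path, language=lang, use_itn=True)
        outputs[lang] = res[0]["text"]
    return outputs

# Usage with a loaded AutoModel (requires funasr and the audio file):
# out = transcribe_auto_and_ja(model, "japanese.wav")
# print(out["auto"] == out["ja"])
```

If the two raw strings differ, the leading language tag in the "auto" result shows which language was detected.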

What else the raw output carries

SenseVoice's raw output begins with a set of tags:

<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。

Meaning: <|ja|> = language is Japanese, <|NEUTRAL|> = emotion (neutral), <|Speech|> = audio event (clean speech; it can also flag BGM/laughter/applause), <|withitn|> = inverse text normalization applied. So while transcribing Japanese you also get language, emotion and audio events for free. One call to rich_transcription_postprocess() strips the tags to plain text.
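If you want to keep the tags instead of stripping them, the raw string can be split with a small regular expression. The sketch below is not a FunASR API: split_tags is a hypothetical helper, and it assumes the leading tags always come in the four-slot order shown above (language, emotion, event, ITN flag).

```python
import re
from typing import Dict

# One tag of the form <|...|>, e.g. <|ja|> or <|NEUTRAL|>.
_TAG = re.compile(r"<\|([^|]*)\|>")

def split_tags(raw: str) -> Dict[str, str]:
    """Split leading <|...|> tags from the transcript text."""
    tags = []
    pos = 0
    while True:
        m = _TAG.match(raw, pos)  # only accept a tag starting exactly at pos
        if m is None:
            break
        tags.append(m.group(1))
        pos = m.end()
    # Assumption: slots appear in the order shown in the raw output above.
    slots = ("language", "emotion", "event", "itn")
    parsed = dict(zip(slots, tags))
    parsed["text"] = raw[pos:]
    return parsed

raw = "<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。"
info = split_tags(raw)
print(info["language"], info["emotion"], info["event"])
# ja NEUTRAL Speech
```

Keeping the parsed fields next to the text makes it easy to filter segments by emotion or audio event later, for example when only neutral speech segments are needed.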

Why SenseVoice is worth a try for Japanese

| | SenseVoice | Whisper-small |
|---|---|---|
| Japanese homophone (転売) | ✅ correct | ❌ wrong (天売) |
| Automatic punctuation | ✅ built-in | ❌ none |
| Emotion / audio events | ✅ in one pass | |
| Inverse text normalization | ✅ built-in | partial |
| Speed | non-autoregressive, measured RTF≈0.04 (GPU) | autoregressive, slower |
| Multilingual | one model for 50+ languages (ja/ko/zh/yue/en…) | multilingual |
| License | open-source, commercial-friendly | open-source |
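For readers unfamiliar with RTF: the real-time factor is processing time divided by audio duration, so lower is faster and anything below 1 is faster than real time. The sketch below just works through that arithmetic for the RTF≈0.04 figure; the function names are illustrative, and actual timings depend on your hardware and audio.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the definition: expected processing time at a given RTF."""
    return audio_seconds * rtf

# At RTF 0.04, one minute of audio takes roughly 2.4 seconds to transcribe.
print(f"{estimated_processing_seconds(60.0, 0.04):.1f} s")
# 2.4 s
```

To measure RTF yourself, time the model.generate call with time.perf_counter() and divide by the clip length in seconds.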

If your use case mixes Japanese + Chinese + English (cross-border support, subtitles, meetings), one SenseVoice model covers all of them and bundles emotion analysis. For its emotion/event detection, see SenseVoice emotion & audio event detection; to pick a model, see the FunASR model selection guide; and for Chinese accuracy, the FunASR vs Whisper benchmark.

The whole FunASR stack is open-source — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Japanese and 50+ languages out of the box. If it helps, a GitHub Star really supports the project 👇

⭐ Star SenseVoice

Also star: FunASR · Fun-ASR · FunClip
