Japanese Speech Recognition in Python — SenseVoice Does Transcription + Punctuation + Emotion in One Pass (vs Whisper)

2026-06-22 · FunASR Team

For Japanese speech-to-text, most people reach for Whisper. But if you want something that runs on CPU, is fast and accurate, and ships punctuation and emotion out of the box, SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) is an underrated option. It natively supports Japanese (ja), auto-detects the language, and produces punctuated text in a single non-autoregressive pass, with emotion and audio-event tags included in the same output.

Below is a real side-by-side on the same Japanese clip between SenseVoice and the same-tier Whisper-small (both run on our server, nothing cherry-picked).

Real comparison: the same Japanese audio

The test clip is a piece of colloquial Japanese speech. Actual outputs from both models:

Model	Detected language	Output
SenseVoice	auto → `ja`	供給量が減るとある程度は仕方ないんじゃね、転売の価格は論外だけど。
Whisper-small	`ja`	供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど

The key difference is the word 転売:

SenseVoice writes 転売 correctly (tenbai, "resale / scalping") — which is exactly what the sentence is about (talking about resale prices).
Whisper-small produces the homophone 天売 — same reading (tenbai), but not a real word: a classic Japanese homophone/wrong-kanji error.
SenseVoice also adds punctuation (、 。) automatically; the Whisper-small output has none.
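
To locate these differences mechanically rather than by eye, a character-level diff of the two transcripts is enough. The following is a minimal sketch using Python's standard difflib; the names SENSEVOICE, WHISPER_SMALL and transcript_diff are illustrative and not part of FunASR or Whisper.

```python
from difflib import SequenceMatcher

# The two transcripts, copied from the comparison table.
SENSEVOICE = "供給量が減るとある程度は仕方ないんじゃね、転売の価格は論外だけど。"
WHISPER_SMALL = "供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど"


def transcript_diff(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, text_in_a, text_in_b) for every span where a and b differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    # Print only the spans that differ between the two models.
    for op, left, right in transcript_diff(SENSEVOICE, WHISPER_SMALL):
        print(f"{op}: SenseVoice={left!r} Whisper-small={right!r}")
```

The diff isolates the kanji substitution and the missing punctuation, which are the two differences discussed above.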

To be fair about the setup: the baseline here is Whisper-small, a common CPU-friendly choice in the same size tier. The much larger Whisper-large would usually fix this kind of homophone error, but it is far bigger and slower. SenseVoice gets the word right at a small-model footprint, and adds punctuation and emotion on top.

Japanese ASR in three lines

pip install funasr

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

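# Load SenseVoiceSmall; disable_update=True skips FunASR's update check at startup.
model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
# language="auto" lets the model identify the language; use_itn=True enables punctuation and inverse text normalization.
res = model.generate(input="japanese.wav", language="auto", use_itn=True)

# Strip the leading <|...|> tags and print the plain transcript.
print(rich_transcription_postprocess(res[0]["text"]))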
# 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。

Set language to "ja" explicitly, or use "auto" as above for automatic language ID — on this clip auto correctly resolves to ja with output identical to the explicit setting.

What else the raw output carries

SenseVoice's raw output begins with a set of tags:

<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。

Meaning: <|ja|> = language is Japanese, <|NEUTRAL|> = emotion (neutral), <|Speech|> = audio event (clean speech; it can also flag BGM/laughter/applause), <|withitn|> = inverse text normalization applied. So while transcribing Japanese you also get language, emotion and audio events for free. One call to rich_transcription_postprocess() strips the tags to plain text.
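
If you want to keep the tags rather than strip them, they can be pulled apart with a regular expression. This is a minimal sketch, not FunASR's own parser: it assumes a single leading group of four tags in the order shown above (language, emotion, audio event, ITN flag), and the names TAG_PATTERN and split_raw_output are hypothetical.

```python
import re

# Matches one tag such as <|ja|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")


def split_raw_output(raw: str) -> dict:
    """Split a raw SenseVoice string into its leading tags and the transcript."""
    tags = []
    pos = 0
    # Consume consecutive tags from the start of the string only.
    while True:
        match = TAG_PATTERN.match(raw, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    # Assumed tag order, taken from the example above.
    keys = ("language", "emotion", "event", "itn")
    parsed = dict(zip(keys, tags))
    parsed["text"] = raw[pos:]
    return parsed


if __name__ == "__main__":
    raw = "<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。"
    print(split_raw_output(raw))
```

In real use the input would be res[0]["text"] from model.generate(). Output that carries more than one tag group would need a different approach than this sketch.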

Why SenseVoice is worth a try for Japanese

	SenseVoice	Whisper-small
Japanese homophone (転売)	✅ correct	❌ wrong (天売)
Automatic punctuation	✅ built-in	❌ none
Emotion / audio events	✅ in one pass	❌
Inverse text normalization	✅ built-in	partial
Speed	non-autoregressive, measured RTF≈0.04 (GPU)	autoregressive, slower
Multilingual	one model for 50+ languages (ja/ko/zh/yue/en…)	multilingual
License	open-source, commercial-friendly	open-source
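
For readers unfamiliar with RTF (real-time factor): it is processing time divided by audio duration, so lower is faster and anything below 1.0 is faster than real time. The sketch below only does that arithmetic; the function names are illustrative, and the 60-second clip is a made-up example rather than a measurement from this post.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the definition: how long a clip should take at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # At the RTF of about 0.04 quoted in the table, a hypothetical
    # 60-second clip would take roughly 2.4 seconds to transcribe.
    print(estimated_processing_seconds(60.0, 0.04))
```

To measure RTF on your own hardware, one option is to time the model.generate() call with time.perf_counter() and divide by the clip length.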

If your use case mixes Japanese + Chinese + English (cross-border support, subtitles, meetings), one SenseVoice model covers all of them and bundles emotion analysis. For its emotion/event detection, see SenseVoice emotion & audio event detection; to pick a model, see the FunASR model selection guide; and for Chinese accuracy, the FunASR vs Whisper benchmark.

The whole FunASR stack is open-source — industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Japanese and 50+ languages out of the box. If it helps, a GitHub Star really supports the project 👇

⭐ Star SenseVoice

Also star: FunASR · Fun-ASR · FunClip
