Japanese Speech Recognition in Python — SenseVoice Does Transcription + Punctuation + Emotion in One Pass (vs Whisper)
For Japanese speech-to-text, most people reach for Whisper. But if you want something that runs on CPU, is fast and accurate, and ships punctuation and emotion out of the box, SenseVoice (an open-source multilingual speech-understanding model from the FunAudioLLM team) is an underrated option: it natively supports Japanese (ja), auto-detects the language, and produces punctuated text in a single non-autoregressive pass — plus emotion and audio-event tags for free.
Below is a real side-by-side of SenseVoice and the same-tier Whisper-small on the same Japanese clip (both run on our server, nothing cherry-picked).
Real comparison: the same Japanese audio
The test clip is a piece of colloquial Japanese speech. Actual outputs from both models:
| Model | Detected language | Output |
|---|---|---|
| SenseVoice | auto → ja | 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。 |
| Whisper-small | ja | 供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど |
The key difference is the word 転売:
- SenseVoice writes `転売` correctly (tenbai, "resale / scalping"), which is exactly what the sentence is about (talking about resale prices).
- Whisper-small produces the homophone `天売`: same reading (tenbai), but not a real word. This is a classic Japanese homophone/wrong-kanji error.
- SenseVoice also adds punctuation (`、。`) automatically; the Whisper-small output has none.
To be fair about the setup: the comparison is against Whisper-small, a common CPU-friendly choice in the same size tier. A larger Whisper-large would usually fix this kind of homophone, but it is far bigger and slower; SenseVoice gets the word right at a small-model footprint and adds punctuation and emotion on top.
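To locate differences like this one without reading both transcripts by eye, a character-level diff works well for Japanese, since there are no word boundaries to tokenize on. The following is a minimal sketch using only the standard library; the helper names `normalize` and `char_differences` are made up for this illustration, and the normalization (dropping whitespace and punctuation before comparing) is an assumption about what should count as a difference.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Drop whitespace and punctuation so only the spoken content is compared."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def char_differences(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, text_in_a, text_in_b) for every non-matching span."""
    a_norm, b_norm = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(None, a_norm, b_norm, autojunk=False)
    return [
        (tag, a_norm[i1:i2], b_norm[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The two outputs from the comparison table above.
sensevoice = "供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。"
whisper_small = "供給量が減るとある程度は仕方ないんじゃね天売の価格は論外だけど"

print(char_differences(sensevoice, whisper_small))
# [('replace', '転', '天')]
```

Once punctuation and spacing are set aside, the two transcripts differ in exactly one character, which is the homophone discussed above.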
Japanese ASR in three lines
```bash
pip install funasr
```

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", disable_update=True)
res = model.generate(input="japanese.wav", language="auto", use_itn=True)
print(rich_transcription_postprocess(res[0]["text"]))
# 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。
```
Set language to "ja" explicitly, or use "auto" as above for automatic language ID — on this clip auto correctly resolves to ja with output identical to the explicit setting.
What else the raw output carries
SenseVoice's raw output begins with a set of tags:
<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。
Meaning: <|ja|> = language is Japanese, <|NEUTRAL|> = emotion (neutral), <|Speech|> = audio event (clean speech; it can also flag BGM/laughter/applause), <|withitn|> = inverse text normalization applied. So while transcribing Japanese you also get language, emotion and audio events for free. One call to rich_transcription_postprocess() strips the tags to plain text.
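If you want the tags as data rather than stripped away, they can be pulled out with a small regular expression. This is a sketch, not how `rich_transcription_postprocess()` is implemented: the function names `split_tags` and `label_tags` are hypothetical, and the field order (language, emotion, event, ITN) is assumed from the single example above and may not hold for every output.

```python
import re

# Matches one leading tag of the form <|...|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")

# Assumed positional meaning of the leading tags, based on the example above.
FIELD_NAMES = ("language", "emotion", "event", "itn")


def split_tags(raw: str) -> tuple[list[str], str]:
    """Split leading <|...|> tags from the transcript text that follows them."""
    tags = []
    pos = 0
    while True:
        match = TAG_PATTERN.match(raw, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return tags, raw[pos:]


def label_tags(tags: list[str]) -> dict[str, str]:
    """Pair tags with the assumed field names; extra tags are ignored."""
    return dict(zip(FIELD_NAMES, tags))


raw = "<|ja|><|NEUTRAL|><|Speech|><|withitn|>供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。"
tags, text = split_tags(raw)
print(label_tags(tags))
# {'language': 'ja', 'emotion': 'NEUTRAL', 'event': 'Speech', 'itn': 'withitn'}
print(text)
# 供給量が減る と ある程度は仕方ないんじゃね、転売の価格は論外だけど。
```

Only tags at the very start of the string are consumed, so any tag-like text later in the transcript is left untouched.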
Why SenseVoice is worth a try for Japanese
| | SenseVoice | Whisper-small |
|---|---|---|
| Japanese homophone (転売) | ✅ correct | ❌ wrong (天売) |
| Automatic punctuation | ✅ built-in | ❌ none |
| Emotion / audio events | ✅ in one pass | ❌ |
| Inverse text normalization | ✅ built-in | partial |
| Speed | non-autoregressive, measured RTF≈0.04 (GPU) | autoregressive, slower |
| Multilingual | one model for 50+ languages (ja/ko/zh/yue/en…) | multilingual |
| License | open-source, commercial-friendly | open-source |
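For readers unfamiliar with the RTF figure in the table: real-time factor is processing time divided by audio duration, so lower is faster and anything below 1 is faster than real time. Here is a minimal sketch for measuring it on your own hardware. The helper names are hypothetical, `transcribe` stands in for whatever call you want to time, and the 60-second figure in the arithmetic example is illustrative, not a measurement from this post.

```python
import time
from typing import Callable


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


def measure_rtf(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Time a single call to `transcribe` and convert it to an RTF value."""
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# Illustrative arithmetic only: an RTF of 0.04 means a 60-second clip
# would take about 2.4 seconds to process.
print(round(real_time_factor(2.4, 60.0), 2))
# 0.04
```

When timing a real model, it is usually worth running one warm-up call first so that model loading and cache effects are not counted in the measurement.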
If your use case mixes Japanese + Chinese + English (cross-border support, subtitles, meetings), one SenseVoice model covers all of them and bundles emotion analysis. For its emotion/event detection, see SenseVoice emotion & audio event detection; to pick a model, see the FunASR model selection guide; and for Chinese accuracy, the FunASR vs Whisper benchmark.
The whole FunASR stack is open-source: industrial-grade ASR / VAD / punctuation / speaker / emotion & events / LLM-ASR, with Japanese and 50+ languages out of the box.