Speech-to-Text with Word/Character-Level Timestamps in Python — Millisecond-Accurate
Many use cases need more than "what was said" — they need to know when each word was spoken: karaoke-style highlighting, click-a-word-to-seek transcripts, subtitle alignment, video-editing search. Whisper's word timestamps are bolted on (DTW-based, variable accuracy). FunASR's Paraformer emits character-level timestamps natively: every character comes with [start_ms, end_ms] in a single call. Here is real measured output.
Transcribe with timestamps (real output)
pip install funasr

from funasr import AutoModel

model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", disable_update=True)
res = model.generate(input="audio.wav")
text = res[0]["text"]  # 欢 迎 大 家 来 体 验 ...
timestamp = res[0]["timestamp"]  # [[880, 1120], [1120, 1360], ...] per-character start/end ms
timestamp is a list of [start_ms, end_ms] pairs, one per character in the text.
Pair characters with timestamps (real)
chars = text.split()  # Paraformer's Chinese output is space-separated characters
for ch, (start, end) in zip(chars, timestamp):
    print(f"{ch} {start}-{end}ms")
Actual output:
| Char | Span (ms) |
|---|---|
| 欢 | 880 – 1120 |
| 迎 | 1120 – 1360 |
| 大 | 1380 – 1540 |
| 家 | 1540 – 1780 |
| 来 | 1780 – 2020 |
| 体 | 2020 – 2180 |
| 验 | 2180 – 2420 |
The table shows the first seven of the sentence's 19 characters; every one of the 19 gets its own start/end span in milliseconds.
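If the pairs feed a player or an editor, it can help to turn the two parallel lists into records and to check that they line up first. The following is a minimal sketch rather than part of FunASR: `CharSpan` and `to_spans` are hypothetical names, and the sample data is the seven characters from the table above, hardcoded so the snippet runs without loading a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSpan:
    """One recognized character with its time span in milliseconds."""
    char: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def to_spans(text: str, timestamp: list) -> list:
    """Zip space-separated characters with their [start_ms, end_ms] pairs."""
    chars = text.split()
    if len(chars) != len(timestamp):
        # zip() would silently drop the extras, so fail loudly instead.
        raise ValueError(f"{len(chars)} characters but {len(timestamp)} timestamps")
    return [CharSpan(ch, start, end) for ch, (start, end) in zip(chars, timestamp)]


# Sample values copied from the table above (first seven characters only).
sample_text = "欢 迎 大 家 来 体 验"
sample_timestamp = [[880, 1120], [1120, 1360], [1380, 1540], [1540, 1780],
                    [1780, 2020], [2020, 2180], [2180, 2420]]

spans = to_spans(sample_text, sample_timestamp)
for span in spans:
    print(f"{span.char} {span.start_ms / 1000:.2f}s-{span.end_ms / 1000:.2f}s")
```

With real output, `to_spans(res[0]["text"], res[0]["timestamp"])` would be the equivalent call. The length check is a design choice: a mismatch usually means the text and timestamps came from different runs, and a bare `zip` would hide that.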
What you can build
- Word highlighting / karaoke: highlight the character whose span contains the current playback time.
- Click-to-seek transcripts: clicking a character seeks the audio to its start_ms.
- Precise subtitle alignment: split and time captions from the timestamps. For full subtitles, see generating SRT/VTT subtitles.
- Video-edit search: jump to the exact moment a keyword is spoken.
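The four ideas in the list above each reduce to a small lookup over the character spans. The sketch below illustrates one possible shape for them; every function name (`index_at`, `seek_target_ms`, `find_keyword`, `split_captions`) and the `max_gap_ms` parameter are assumptions made for illustration, not FunASR APIs. The data is the seven characters from the earlier table, hardcoded so the snippet is self-contained.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (char, start_ms, end_ms) for the seven characters shown in the table above.
SPANS = [("欢", 880, 1120), ("迎", 1120, 1360), ("大", 1380, 1540), ("家", 1540, 1780),
         ("来", 1780, 2020), ("体", 2020, 2180), ("验", 2180, 2420)]
STARTS = [start for _, start, _ in SPANS]  # already sorted, so bisect applies


def index_at(time_ms: int) -> Optional[int]:
    """Karaoke: index of the character whose span contains time_ms, else None."""
    i = bisect_right(STARTS, time_ms) - 1
    if i >= 0 and SPANS[i][1] <= time_ms < SPANS[i][2]:
        return i
    return None  # before the first character, after the last, or in a gap


def seek_target_ms(index: int) -> int:
    """Click-to-seek: the playback position for a clicked character."""
    return SPANS[index][1]


def find_keyword(keyword: str) -> List[Tuple[int, int]]:
    """Video-edit search: (start_ms, end_ms) of every occurrence of keyword."""
    if not keyword:
        return []
    joined = "".join(ch for ch, _, _ in SPANS)  # one character per span
    hits = []
    pos = joined.find(keyword)
    while pos != -1:
        last = pos + len(keyword) - 1
        hits.append((SPANS[pos][1], SPANS[last][2]))
        pos = joined.find(keyword, pos + 1)
    return hits


def split_captions(max_gap_ms: int) -> List[Tuple[str, int, int]]:
    """Subtitle alignment: start a new caption when the pause exceeds max_gap_ms."""
    captions = []
    text, start, prev_end = "", None, None
    for ch, s, e in SPANS:
        if start is not None and s - prev_end > max_gap_ms:
            captions.append((text, start, prev_end))
            text, start = "", None
        if start is None:
            start = s
        text += ch
        prev_end = e
    if text:
        captions.append((text, start, prev_end))
    return captions


print(index_at(1200))         # inside the span of 迎 (1120-1360)
print(index_at(1370))         # in the 20 ms gap between 迎 and 大
print(find_keyword("大家"))
print(split_captions(10))
```

Binary search keeps the karaoke lookup cheap enough to call on every playback tick. The pause threshold in `split_captions` is deliberately tiny here only because the sample contains a single 20 ms gap; a real caption splitter would likely use a larger value and also cap caption length.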
Why Paraformer for timestamps
| | FunASR Paraformer | Whisper |
|---|---|---|
| Word/char timestamps | ✅ native, single call | bolt-on (--word_timestamps, DTW) |
| Accuracy | non-autoregressive + CIF alignment, stable | implementation-dependent |
| Chinese | character-level, CER 10.18% | CER ~20% |
| License | open-source, commercial-friendly | open-source |
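For readers unfamiliar with the CER figures in the table: character error rate is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below shows that standard definition; it is an assumption that the numbers above were computed this way, since the post does not describe its scoring setup. The function names and the example sentence pair are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring the spaces between characters."""
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: one wrong character out of seven.
print(f"{cer('欢迎大家来体验', '欢迎大加来体验'):.2%}")
```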
For higher Chinese accuracy, default to the flagship Fun-ASR-Nano; for the full Chinese walkthrough see Chinese speech recognition; for long-audio segmentation see VAD.
The whole FunASR stack is open-source (MIT): character timestamps, ASR, VAD, punctuation, speaker, and LLM-ASR (flagship Fun-ASR-Nano), all ready to use.