Speech-to-Text with Word/Character-Level Timestamps in Python — Millisecond-Accurate

2026-06-23 · FunASR Team

Many use cases need more than "what was said": they need to know when each word was spoken. Examples include karaoke-style highlighting, click-a-word-to-seek transcripts, subtitle alignment, and video-editing search. Whisper's word timestamps are bolted on (DTW-based, variable accuracy). FunASR's Paraformer emits character-level timestamps natively: every character comes with a [start_ms, end_ms] pair in a single call. The output shown below is real, measured output.

Transcribe with timestamps (real output)

pip install funasr

from funasr import AutoModel

model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", disable_update=True)
res = model.generate(input="audio.wav")

text = res[0]["text"]            # 欢 迎 大 家 来 体 验 ...
timestamp = res[0]["timestamp"]  # [[880, 1120], [1120, 1360], ...]  per-character start/end ms

timestamp is a list of [start_ms, end_ms] pairs, one per character in the text.
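
Because a mismatch between characters and spans is easy to miss, a quick sanity check can help before building on this structure. The sketch below does not call the model; it reuses the first seven characters and spans from the output shown in this post. The helper name `check_alignment` is hypothetical (not part of FunASR), and it assumes the text is space-separated characters with no punctuation model attached, and that spans arrive in order without overlapping.

```python
# Sample values copied from the output shown in this post (first 7 characters).
text = "欢 迎 大 家 来 体 验"
timestamp = [[880, 1120], [1120, 1360], [1380, 1540], [1540, 1780],
             [1780, 2020], [2020, 2180], [2180, 2420]]


def check_alignment(text, timestamp):
    """Hypothetical helper: verify there is one [start_ms, end_ms] pair per character."""
    chars = text.split()  # assumes space-separated characters
    if len(chars) != len(timestamp):
        raise ValueError(f"{len(chars)} characters but {len(timestamp)} spans")
    prev_end = 0
    for ch, (start, end) in zip(chars, timestamp):
        # Assumption: each span is non-empty and does not start before the previous one ends.
        if not (prev_end <= start < end):
            raise ValueError(f"unexpected span for {ch!r}: {start}-{end}")
        prev_end = end
    return len(chars)


n = check_alignment(text, timestamp)
print(n, "characters aligned")
```

If the counts differ (for example because a punctuation model added marks to the text), the check raises instead of letting a later zip() silently drop the tail.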

Pair characters with timestamps (real)

chars = text.split()          # Paraformer's Chinese output is space-separated characters
for ch, (start, end) in zip(chars, timestamp):
    print(f"{ch}  {start}-{end}ms")

Actual output:

Char	Span (ms)
欢	880 – 1120
迎	1120 – 1360
大	1380 – 1540
家	1540 – 1780
来	1780 – 2020
体	2020 – 2180
验	2180 – 2420

Only the first seven characters are shown here; all 19 characters of the sentence get a millisecond-accurate start/end.
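
For word-level rather than character-level spans, one option is to merge adjacent character spans once the text has been segmented into words. The following is a minimal sketch under stated assumptions: `merge_into_words` is a hypothetical helper, the word segmentation is written by hand here (in practice it would come from a segmenter of your choice), and every entry in `chars` is assumed to be a single character.

```python
def merge_into_words(chars, timestamp, words):
    """Hypothetical helper: merge per-character spans into per-word spans.

    `words` must be a segmentation of the same characters, in the same order.
    """
    if len(chars) != len(timestamp):
        raise ValueError("need exactly one span per character")
    if "".join(words) != "".join(chars):
        raise ValueError("segmentation does not match the characters")
    merged, i = [], 0
    for word in words:
        n = len(word)                     # assumes one list entry per character
        start = timestamp[i][0]           # start of the word's first character
        end = timestamp[i + n - 1][1]     # end of the word's last character
        merged.append((word, start, end))
        i += n
    return merged


# Sample values copied from the table above (first 7 characters).
chars = ["欢", "迎", "大", "家", "来", "体", "验"]
timestamp = [[880, 1120], [1120, 1360], [1380, 1540], [1540, 1780],
             [1780, 2020], [2020, 2180], [2180, 2420]]
words = ["欢迎", "大家", "来", "体验"]  # hand-written segmentation (an assumption)

word_spans = merge_into_words(chars, timestamp, words)
for word, start, end in word_spans:
    print(f"{word}  {start}-{end}ms")
```

With the sample data, 欢迎 spans 880 to 1360 ms and 体验 spans 2020 to 2420 ms. Pauses inside a word are absorbed into the word's span, since only the first start and last end are kept.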

What you can build

Word highlighting / karaoke: highlight the character whose span contains the current playback time.
Click-to-seek transcripts: clicking a character seeks the audio to its start_ms.
Precise subtitle alignment: split and time captions from the timestamps. For full subtitles, see generating SRT/VTT subtitles.
Video-edit search: jump to the exact moment a keyword is spoken.
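
The lookups behind these ideas are small. The sketch below shows one way to do them in plain Python, using the sample spans from this post instead of a live model call. The function names (`char_index_at`, `seek_seconds`, `find_keyword`) are hypothetical and chosen for illustration; the code assumes spans are sorted by start time and that each entry in `chars` is a single character.

```python
from bisect import bisect_right

# Sample values copied from the output shown in this post (first 7 characters).
chars = ["欢", "迎", "大", "家", "来", "体", "验"]
timestamp = [[880, 1120], [1120, 1360], [1380, 1540], [1540, 1780],
             [1780, 2020], [2020, 2180], [2180, 2420]]
starts = [start for start, _ in timestamp]  # assumes spans are sorted by start


def char_index_at(t_ms):
    """Highlighting: index of the character whose span contains t_ms, else None."""
    i = bisect_right(starts, t_ms) - 1      # last span starting at or before t_ms
    if i >= 0 and timestamp[i][0] <= t_ms < timestamp[i][1]:
        return i
    return None                             # before the first span, or in a gap


def seek_seconds(index):
    """Click-to-seek: playback position (seconds) for the clicked character."""
    return timestamp[index][0] / 1000


def find_keyword(keyword):
    """Search: start_ms of every occurrence of `keyword` in the transcript."""
    joined = "".join(chars)                 # string index == list index here
    hits = []
    pos = joined.find(keyword)
    while pos != -1:
        hits.append(timestamp[pos][0])
        pos = joined.find(keyword, pos + 1)
    return hits


print(char_index_at(1000))    # playback at 1000 ms falls inside the first span
print(seek_seconds(2))        # clicking the third character
print(find_keyword("体验"))
```

Binary search keeps the highlight lookup cheap enough to run on every playback tick, and returning None for gaps lets the UI clear the highlight during silence.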

Why Paraformer for timestamps

	FunASR Paraformer	Whisper
Word/char timestamps	✅ native, single call	bolt-on (`--word_timestamps`, DTW)
Accuracy	non-autoregressive + CIF alignment, stable	implementation-dependent
Chinese	character-level, CER 10.18%	CER ~20%
License	open-source, commercial-friendly	open-source

For higher Chinese accuracy, default to the flagship Fun-ASR-Nano. For the full Chinese walkthrough, see Chinese speech recognition; for long-audio segmentation, see VAD.

The whole FunASR stack is open-source (MIT) — character timestamps, ASR, VAD, punctuation, speaker, LLM-ASR (flagship Fun-ASR-Nano), ready to use. If it helps, a GitHub Star supports the project 👇

⭐ Star FunASR

Also star: SenseVoice · Fun-ASR · FunClip
