Real-Time Streaming Speech-to-Text with FunASR
Live captions, voice assistants, and meeting transcription cannot wait for "record, then transcribe". They need text as you speak, with words appearing on screen before the sentence ends. That is streaming ASR.
FunASR's streaming Paraformer decodes audio chunk by chunk and carries state between chunks in a cache, delivering real-time text at roughly 600 ms latency. It also runs on CPU.
Install
pip install -U funasr modelscope
Full, runnable code
from funasr import AutoModel
import soundfile as sf
chunk_size = [0, 10, 5] # 600 ms chunks (10 * 60 ms center)
encoder_chunk_look_back = 4 # encoder look-back chunks
decoder_chunk_look_back = 1 # decoder look-back chunks
model = AutoModel(model="paraformer-zh-streaming")
audio, sr = sf.read("speech.wav", dtype="float32") # 16 kHz mono
chunk_stride = chunk_size[1] * 960 # 960 samples = 60 ms @ 16 kHz, so 9600 samples = 600 ms
cache = {}
n_chunks = (len(audio) - 1) // chunk_stride + 1
for i in range(n_chunks):
    chunk = audio[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == n_chunks - 1
    res = model.generate(
        input=chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True) # emit partial text
The output grows incrementally (example)
Today # ~600 ms after speech starts
Today the
Today the weather
Today the weather is great # is_final=True
For each 600 ms audio chunk, the model returns only the newly recognized text, so printing the results back to back makes the on-screen sentence grow chunk by chunk until the call with is_final=True flushes the remainder. In testing, a 12 s clip was split into 20 chunks (12 s / 0.6 s), each returning an incremental result.
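The chunk arithmetic behind the 12 s example can be checked without loading a model. The following is a minimal sketch, assuming 16 kHz mono audio; `iter_chunks` is a hypothetical helper written for illustration, not part of FunASR. It mirrors the slicing used in the loop above.

```python
SAMPLE_RATE = 16_000      # assumption: 16 kHz mono input
SAMPLES_PER_UNIT = 960    # 60 ms at 16 kHz


def iter_chunks(n_samples, chunk_size=(0, 10, 5)):
    """Yield (start, end, is_final) sample ranges, one per streaming chunk.

    Hypothetical helper for illustration only.
    """
    stride = chunk_size[1] * SAMPLES_PER_UNIT   # 10 * 960 = 9600 samples = 600 ms
    n_chunks = (n_samples - 1) // stride + 1    # ceiling division
    for i in range(n_chunks):
        start = i * stride
        end = min(start + stride, n_samples)    # the last chunk may be shorter
        yield start, end, i == n_chunks - 1


chunks = list(iter_chunks(12 * SAMPLE_RATE))    # a 12 s clip
print(len(chunks))    # 20
print(chunks[-1])     # (182400, 192000, True)
```

Only the last range carries `is_final=True`, which is the call that flushes the remaining text.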
Key parameters
| Parameter | Meaning |
|---|---|
| chunk_size=[0,10,5] | [left, center, right] sizes in 60 ms units. Center 10 → 600 ms per chunk (the latency/accuracy sweet spot); right 5 → 300 ms of lookahead |
| encoder_chunk_look_back=4 | History chunks the encoder attends to (more = more context, slightly more latency) |
| decoder_chunk_look_back=1 | History chunks for the decoder |
| cache={} | State carried across chunks; reuse the same dict across calls |
| is_final | Set True on the last chunk to flush the final decoding |
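To compare chunk_size settings, it can help to convert them to milliseconds. This is a small sketch, assuming one unit is 60 ms (as in the table) and that the third entry is lookahead context; `chunk_timing` is a hypothetical helper, not a FunASR function.

```python
UNIT_MS = 60  # one chunk_size unit is 60 ms


def chunk_timing(chunk_size):
    """Return (emit interval, lookahead) in ms for a [left, center, right] config.

    Hypothetical helper; assumes the third entry is lookahead context.
    """
    _, center, right = chunk_size
    return center * UNIT_MS, right * UNIT_MS


print(chunk_timing([0, 10, 5]))   # (600, 300)
print(chunk_timing([0, 8, 4]))    # (480, 240)
```

A smaller center value emits text more often but gives the model less audio per step, so it is worth measuring accuracy on your own data before lowering it.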
Best practice: 2-pass (streaming + offline)
The streaming model trades some accuracy for low latency, so it is slightly less accurate than an offline model. Production systems often use a 2-pass design:
- Pass 1 (streaming): streaming Paraformer for instant on-screen feedback;
- Pass 2 (offline): once a sentence ends (VAD endpointing), re-transcribe the full segment with SenseVoice / offline Paraformer for a more accurate final result that replaces the provisional text.
FunASR's official WebSocket service (funasr-runtime-sdk-online-cpu) ships this 2-pass streaming protocol ready to deploy.
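The bookkeeping behind a 2-pass design can be sketched in a few lines. This is an illustration only, under stated assumptions: `TwoPassTranscript` and its methods are hypothetical names, the recognizers are stand-in callables rather than FunASR models, and this is not the protocol used by the official service.

```python
class TwoPassTranscript:
    """Tracks finalized sentences plus the provisional text of the current one.

    Hypothetical illustration of the 2-pass idea.
    """

    def __init__(self, offline_asr):
        self.offline_asr = offline_asr  # callable: list of audio chunks -> final text
        self.final = []                 # sentences confirmed by pass 2
        self.partial = ""               # pass-1 text for the sentence in progress
        self.segment = []               # audio chunks of the sentence in progress

    def on_chunk(self, chunk, streaming_text):
        """Pass 1: append the streaming model's incremental text for this chunk."""
        self.segment.append(chunk)
        self.partial += streaming_text

    def on_endpoint(self):
        """Pass 2: at a VAD endpoint, replace provisional text with the offline result."""
        if self.segment:
            self.final.append(self.offline_asr(self.segment))
        self.partial, self.segment = "", []

    def display(self):
        """Text to show on screen: confirmed sentences, then the provisional one."""
        return " ".join(self.final + ([self.partial] if self.partial else []))


# Stand-in offline recognizer so the sketch runs without models (assumption:
# real code would call an offline model on the buffered audio instead).
t = TwoPassTranscript(offline_asr=lambda chunks: "Today the weather is great.")
t.on_chunk("chunk-0", "today the")
t.on_chunk("chunk-1", " whether")
print(t.display())   # today the whether
t.on_endpoint()
print(t.display())   # Today the weather is great.
```

The key design point is that provisional text is never edited in place: it is shown as-is, then discarded and replaced wholesale once the second pass returns.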
Typical use cases
- Live captions: instant subtitles for streams, meetings, classrooms.
- Voice assistants: understand while listening, lowering end-to-end latency.
- Call-center QA: live transcription + real-time keyword alerts.
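For the keyword-alert case, one detail matters with incremental output: a keyword can be split across two updates. Below is a minimal sketch of one way to handle that by keeping a short tail of previous text; `KeywordAlerter` is a hypothetical helper, not part of FunASR, and it uses plain substring matching.

```python
class KeywordAlerter:
    """Flags keywords in incrementally arriving text, including ones split across updates.

    Hypothetical helper for illustration only.
    """

    def __init__(self, keywords):
        self.keywords = list(keywords)
        # Keep just enough trailing text to catch a keyword split across two updates.
        self.keep = max((len(k) for k in self.keywords), default=1) - 1
        self.tail = ""

    def feed(self, new_text):
        """Return the keywords completed by this piece of incremental text."""
        window = self.tail + new_text
        hits = []
        for k in self.keywords:
            idx = window.find(k)
            while idx != -1:
                # Count only matches that reach into the new text, to avoid duplicates.
                if idx + len(k) > len(self.tail):
                    hits.append(k)
                idx = window.find(k, idx + 1)
        self.tail = window[-self.keep:] if self.keep else ""
        return hits


alerter = KeywordAlerter(["refund", "cancel"])
for piece in ["I want a ref", "und or I will can", "cel"]:
    for kw in alerter.feed(piece):
        print("ALERT:", kw)
# ALERT: refund
# ALERT: cancel
```

In a real pipeline, each `piece` would be the text returned for one streaming chunk.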
FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit — offline + streaming, fast and accurate on Chinese.