Real-Time Streaming Speech-to-Text with FunASR

FunASR Blog · 2026-06-17 · Tutorial

Live captions, voice assistants, and meeting transcription can't wait for "record, then transcribe". They need text as you speak, with words scrolling on screen before the sentence even ends. That is streaming ASR.

FunASR's streaming Paraformer decodes audio chunk by chunk and carries state between chunks in a cache, delivering real-time text with roughly 600 ms of latency. It also runs on CPU.

Install

pip install -U funasr modelscope

Full, runnable code

from funasr import AutoModel
import soundfile as sf

chunk_size = [0, 10, 5]          # 600 ms chunks (10 * 60 ms center)
encoder_chunk_look_back = 4      # encoder look-back chunks
decoder_chunk_look_back = 1      # decoder look-back chunks

model = AutoModel(model="paraformer-zh-streaming")

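audio, sr = sf.read("speech.wav", dtype="float32")   # expects 16 kHz mono
assert sr == 16000, "resample the audio to 16 kHz first"
chunk_stride = chunk_size[1] * 960                    # 960 samples = 60 ms @ 16 kHz, so 9600 samples = 600 ms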

cache = {}
n_chunks = (len(audio) - 1) // chunk_stride + 1
for i in range(n_chunks):
    chunk = audio[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == n_chunks - 1
    res = model.generate(
        input=chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)   # emit partial text

The output grows incrementally (example; each line shows the text on screen after one more chunk)

Today          # ~600 ms after speech starts
Today the
Today the weather
Today the weather is great    # is_final=True

For each 600 ms audio chunk, the model returns only the newly recognized text, so the on-screen sentence grows piece by piece until the last chunk is sent with is_final=True. In testing, a 12 s clip was split into 20 chunks (12 s / 0.6 s), each returning an incremental result.
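The chunking arithmetic can be checked without loading a model. The following is a minimal sketch: `iter_chunks` is a hypothetical helper (not part of FunASR), and a list of zeros stands in for the 12 s clip.

```python
SAMPLE_RATE = 16_000      # Hz, matching the 16 kHz mono input
SAMPLES_PER_UNIT = 960    # 60 ms at 16 kHz


def iter_chunks(audio, center_units=10):
    """Yield (chunk, is_final) pairs of at most center_units * 60 ms each."""
    stride = center_units * SAMPLES_PER_UNIT
    n_chunks = (len(audio) - 1) // stride + 1
    for i in range(n_chunks):
        yield audio[i * stride:(i + 1) * stride], i == n_chunks - 1


clip = [0.0] * (12 * SAMPLE_RATE)   # stand-in for a 12 s recording
chunks = list(iter_chunks(clip))
print(len(chunks), len(chunks[0][0]), chunks[-1][1])   # 20 9600 True
```

Only the last pair carries `is_final=True`. A clip whose length is not a multiple of 600 ms simply ends with a shorter final chunk.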

Key parameters

Parameter	Meaning
`chunk_size=[0,10,5]`	[left, center, right] sizes in 60 ms units. center 10 → 600 ms per chunk (the latency/accuracy sweet spot); right 5 → 300 ms of look-ahead
`encoder_chunk_look_back=4`	History chunks the encoder attends to (more = more context, slightly more latency)
`decoder_chunk_look_back=1`	History chunks for the decoder
`cache={}`	State carried across chunks — reuse the same dict across calls
`is_final`	Set `True` on the last chunk to flush the final decoding
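To see what a given `chunk_size` means in practice, the units can be converted to milliseconds and samples. This is a small sketch, assuming the 60 ms unit from the table and 16 kHz audio; `describe_chunk_size` is a hypothetical helper, and `[0, 8, 4]` is shown only as an illustrative lower-latency setting.

```python
FRAME_MS = 60  # one chunk_size unit, per the table above


def describe_chunk_size(chunk_size, sample_rate=16_000):
    """Translate [left, center, right] units into milliseconds and samples."""
    left, center, right = chunk_size
    return {
        "left_ms": left * FRAME_MS,
        "chunk_ms": center * FRAME_MS,
        "right_ms": right * FRAME_MS,
        "samples_per_chunk": center * FRAME_MS * sample_rate // 1000,
    }


print(describe_chunk_size([0, 10, 5]))
# {'left_ms': 0, 'chunk_ms': 600, 'right_ms': 300, 'samples_per_chunk': 9600}
print(describe_chunk_size([0, 8, 4]))
# {'left_ms': 0, 'chunk_ms': 480, 'right_ms': 240, 'samples_per_chunk': 7680}
```

The `samples_per_chunk` value is the stride to use when slicing audio for each `generate` call.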

Best practice: 2-pass (streaming + offline)

To keep latency low, the streaming model decodes with limited context, so it is slightly less accurate than an offline model. Production systems often use a 2-pass design:

Pass 1 (streaming): streaming Paraformer for instant on-screen feedback;
Pass 2 (offline): once a sentence ends (VAD endpointing), re-transcribe the full segment with SenseVoice / offline Paraformer for a more accurate final result that replaces the provisional text.
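The bookkeeping behind the two passes can be sketched independently of any model. In the sketch below, `TwoPassTranscript` and `fake_offline` are hypothetical names used for illustration. In a real system, the partial text would come from the streaming model, `on_sentence_end` would be triggered by VAD endpointing, and `offline_recognize` would wrap an offline model.

```python
from typing import Callable, List


class TwoPassTranscript:
    """Keeps finalized sentences plus the provisional text of the current one."""

    def __init__(self, offline_recognize: Callable[[list], str]):
        self.offline_recognize = offline_recognize   # hypothetical pass-2 callable
        self.final_sentences: List[str] = []
        self.provisional = ""
        self.segment_audio: list = []

    def on_chunk(self, chunk, partial_text: str) -> None:
        """Pass 1: buffer the audio and append the streaming model's new text."""
        self.segment_audio.extend(chunk)
        self.provisional += partial_text

    def on_sentence_end(self) -> None:
        """Pass 2: re-transcribe the buffered segment, replacing provisional text."""
        if self.segment_audio:
            self.final_sentences.append(self.offline_recognize(self.segment_audio))
        self.provisional = ""
        self.segment_audio = []

    def display(self) -> str:
        parts = self.final_sentences + ([self.provisional] if self.provisional else [])
        return " ".join(parts)


def fake_offline(audio):
    # Stand-in for an offline model call; returns a fixed, corrected sentence.
    return "Today the weather is great."


t = TwoPassTranscript(fake_offline)
t.on_chunk([0.0] * 9600, "Today the")
t.on_chunk([0.0] * 9600, " whether is great")
print(t.display())    # Today the whether is great
t.on_sentence_end()
print(t.display())    # Today the weather is great.
```

Keeping the segment audio alongside the provisional text is what lets the second pass overwrite exactly the span the first pass produced.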

FunASR's official WebSocket service (funasr-runtime-sdk-online-cpu) ships this 2-pass streaming protocol ready to deploy.

Typical use cases

Live captions: instant subtitles for streams, meetings, classrooms.
Voice assistants: understand while listening, lowering end-to-end latency.
Call-center QA: live transcription + real-time keyword alerts.
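For the call-center case, a keyword can arrive split across two incremental results, so matching each piece on its own would miss it. The sketch below is a minimal illustration of scanning the accumulated text instead; `KeywordAlerter` is a hypothetical helper and the keywords are made up.

```python
class KeywordAlerter:
    """Reports each keyword occurrence once, even if it spans two pieces."""

    def __init__(self, keywords):
        self.keywords = list(keywords)
        self.text = ""
        self.scanned = 0   # length of text already checked

    def feed(self, piece):
        """Append new text and return (keyword, position) hits not reported before."""
        self.text += piece
        hits = []
        for kw in self.keywords:
            # Back up far enough to catch a keyword straddling the old boundary,
            # but not far enough to re-report one that ended before it.
            start = max(0, self.scanned - len(kw) + 1)
            pos = self.text.find(kw, start)
            while pos != -1:
                hits.append((kw, pos))
                pos = self.text.find(kw, pos + 1)
        self.scanned = len(self.text)
        return sorted(hits, key=lambda h: h[1])


alerter = KeywordAlerter(["refund", "cancel"])
print(alerter.feed("I want a ref"))      # []
print(alerter.feed("und and to can"))    # [('refund', 9)]
print(alerter.feed("cel my plan"))       # [('cancel', 23)]
```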

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit. It covers both offline and streaming recognition and is fast and accurate on Chinese.
