Real-Time Streaming Speech-to-Text with FunASR

Live captions, voice assistants, meeting transcription — these don't want "record then transcribe", they want text as you speak: words scrolling on screen before the sentence even ends. That is streaming ASR.

FunASR's streaming Paraformer decodes with chunks + a cache, delivering real-time text at ~600 ms latency — and it runs on CPU.

Install

pip install -U funasr modelscope

Full, runnable code

from funasr import AutoModel
import soundfile as sf

chunk_size = [0, 10, 5]          # 600 ms chunks (10 * 60 ms center)
encoder_chunk_look_back = 4      # encoder look-back chunks
decoder_chunk_look_back = 1      # decoder look-back chunks

model = AutoModel(model="paraformer-zh-streaming")

audio, sr = sf.read("speech.wav", dtype="float32")   # 16 kHz mono
chunk_stride = chunk_size[1] * 960                    # 600 ms @ 16 kHz

cache = {}
n_chunks = (len(audio) - 1) // chunk_stride + 1
for i in range(n_chunks):
    chunk = audio[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == n_chunks - 1
    res = model.generate(
        input=chunk, cache=cache, is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    if res[0]["text"]:
        print(res[0]["text"], end="", flush=True)   # emit partial text

The output grows incrementally (example)

Today          # ~600 ms after speech starts
Today the
Today the weather
Today the weather is great    # is_final=True

For each 600 ms audio chunk, the model emits the newly recognized text, so the on-screen sentence grows token by token until is_final=True. In testing, a 12 s clip was split into 20 chunks, each returning an incremental result.

Key parameters

ParameterMeaning
chunk_size=[0,10,5][left, center, right] chunk sizes in 60 ms units. center 10 → 600 ms per chunk (the latency/accuracy sweet spot)
encoder_chunk_look_back=4History chunks the encoder attends to (more = more context, slightly more latency)
decoder_chunk_look_back=1History chunks for the decoder
cache={}State carried across chunks — reuse the same dict across calls
is_finalSet True on the last chunk to flush the final decoding

Best practice: 2-pass (streaming + offline)

To keep latency low, the streaming model is slightly less accurate than offline. Production systems often use a 2-pass design:

FunASR's official WebSocket service (funasr-runtime-sdk-online-cpu) ships this 2-pass streaming protocol ready to deploy.

Typical use cases

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit — offline + streaming, fast and accurate on Chinese.

Star FunASR on GitHub ★

Related posts