Speaker Diarization with FunASR: Who Spoke When, in One Call

FunASR Blog · 2026-06-17 · Tutorial

For meeting minutes, call-center QA or interview transcripts, you don't just want to know what was said — you want to know who said each line. That task is called speaker diarization.

The usual approach glues two tools together: pyannote.audio for speaker segmentation and Whisper for transcription, after which you align the two timelines yourself. That means requesting gated model access on HuggingFace, installing a stack of dependencies, and writing your own alignment logic. The combination also isn't especially strong on Chinese.
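
To make "alignment logic" concrete, here is a minimal sketch of the kind of glue code the two-tool approach usually needs: each transcribed segment is given the speaker whose diarization turn overlaps it the most. The tuple layouts, the sample data and the function names (`overlap`, `assign_speakers`) are assumptions for illustration; they are not the actual output formats of pyannote.audio or Whisper.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(turns, segments):
    """Label each transcript segment with the speaker that overlaps it most.

    turns:    list of (start, end, speaker) from a diarizer (hypothetical layout)
    segments: list of (start, end, text) from a transcriber (hypothetical layout)
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            ov = overlap(seg_start, seg_end, turn_start, turn_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        # best_speaker stays None when no diarization turn overlaps the segment
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled


# Made-up data, for illustration only
turns = [(0.0, 3.4, "A"), (3.4, 7.2, "B")]
segments = [
    (0.0, 3.2, "Hello."),
    (3.0, 7.1, "Status update."),
    (8.0, 9.0, "Trailing remark."),
]
labeled = assign_speakers(turns, segments)
for row in labeled:
    print(row)
```

Even this simple version has to decide what to do with segments that straddle two turns or fall outside every turn, which is the bookkeeping a single integrated pipeline spares you.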

With FunASR, a single generate() call is enough. VAD, ASR, punctuation and speaker embedding (CAM++) are chained into one pipeline that returns the speaker id, timestamps and text for every sentence.

Install

pip install -U funasr modelscope

Full code

from funasr import AutoModel

# One model handle = VAD + ASR + punctuation + speaker embedding (CAM++)
model = AutoModel(
    model="paraformer-zh",   # or "iic/SenseVoiceSmall"
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",       # speaker diarization
)

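# batch_size_s turns on dynamic batching: total audio duration per batch, in seconds
res = model.generate(input="meeting.wav", batch_size_s=300)

# start/end are in milliseconds, so divide by 1000 to print seconds
for s in res[0]["sentence_info"]:
    print(f'[{s["start"]/1000:.1f}s-{s["end"]/1000:.1f}s] '
          f'speaker {s["spk"]}: {s["sentence"]}')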

What the output looks like (example)

[0.0s-3.2s] speaker 0: Hi everyone, today's meeting is about the Q3 product plan.
[3.5s-7.1s] speaker 1: Quick status update — the core features are about 80% done.
[7.4s-9.9s] speaker 2: QA may need two more weeks on our side.
[10.2s-13.6s] speaker 0: Got it, let's push the launch back by one week.

Each sentence_info item carries the four fields used above:

Field	Meaning
`spk`	Speaker id, assigned by clustering the CAM++ speaker embeddings; no need to specify the number of speakers
`start` / `end`	Sentence start/end time (milliseconds)
`sentence`	Sentence text (punctuated)
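
As a small illustration of working with these fields, the sketch below totals the speaking time per speaker. It assumes only the `spk`, `start` and `end` keys described in the table; the sample list is hand-written to mirror the example output above and is not real model output, and `talk_time_seconds` is a hypothetical helper name.

```python
from collections import defaultdict


def talk_time_seconds(sentence_info):
    """Return {speaker id: total speaking time in seconds}."""
    totals_ms = defaultdict(int)
    for s in sentence_info:
        # start/end are in milliseconds; sum as integers to avoid float drift
        totals_ms[s["spk"]] += s["end"] - s["start"]
    return {spk: ms / 1000 for spk, ms in totals_ms.items()}


# Hand-written sample mirroring the example output (not real model output)
sample = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 1, "start": 3500, "end": 7100, "sentence": "Quick status update."},
    {"spk": 2, "start": 7400, "end": 9900, "sentence": "QA may need two more weeks."},
    {"spk": 0, "start": 10200, "end": 13600, "sentence": "Got it."},
]

totals = talk_time_seconds(sample)
for spk, seconds in sorted(totals.items()):
    print(f"speaker {spk}: {seconds:.1f}s")
```

With a real recording, you would pass res[0]["sentence_info"] in place of the sample list.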

Tested on a real recording

On a 227-second multi-speaker meeting recording, FunASR automatically split the audio into 77 sentences and distinguished 11 different speakers, without being told the speaker count in advance. The whole pipeline is one generate() call on top of a single pip install.

Why FunASR instead of pyannote + Whisper

One call, no alignment: transcription and speaker labels are aligned by construction.
No HuggingFace gated access: pyannote's speaker models are gated; FunASR models are openly downloadable.
Stronger on Chinese: Paraformer and SenseVoice reach a far lower character error rate (CER) than Whisper on Chinese (benchmark).
Runs on CPU: get results offline without a GPU.

Going further

Swap model="paraformer-zh" for "iic/SenseVoiceSmall" to use SenseVoice (with emotion/event tags).
Iterate over sentence_info to export SRT subtitles or aggregate per speaker into meeting minutes.
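
Both ideas can be sketched in a few lines of plain Python. The helpers below (`ms_to_srt_time`, `to_srt`, `minutes_by_speaker`) are hypothetical names, not part of FunASR; they assume only the `spk`, `start`, `end` and `sentence` fields described earlier, and the sample data is hand-written rather than real model output.

```python
def ms_to_srt_time(ms):
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rest = divmod(int(ms), 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(sentence_info):
    """Render sentence_info items as SRT cues, one cue per sentence."""
    cues = []
    for index, s in enumerate(sentence_info, start=1):
        cues.append(
            f"{index}\n"
            f"{ms_to_srt_time(s['start'])} --> {ms_to_srt_time(s['end'])}\n"
            f"Speaker {s['spk']}: {s['sentence']}\n"
        )
    # Cues are separated by a blank line
    return "\n".join(cues)


def minutes_by_speaker(sentence_info):
    """Group sentence texts by speaker id, preserving spoken order."""
    grouped = {}
    for s in sentence_info:
        grouped.setdefault(s["spk"], []).append(s["sentence"])
    return grouped


# Hand-written sample (not real model output)
sample = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 1, "start": 3500, "end": 7100, "sentence": "Quick status update."},
    {"spk": 0, "start": 10200, "end": 13600, "sentence": "Got it."},
]

srt_text = to_srt(sample)
print(srt_text)
for spk, lines in minutes_by_speaker(sample).items():
    print(f"Speaker {spk}: {' '.join(lines)}")
```

To write a subtitle file from a real run, pass res[0]["sentence_info"] to to_srt() and save the returned string with a .srt extension.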

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit — fast and accurate, especially on Chinese.
