Speaker Diarization with FunASR: Who Spoke When, in One Call
For meeting minutes, call-center QA or interview transcripts, you don't just want to know what was said — you want to know who said each line. That task is called speaker diarization.
The usual approach glues two tools together: pyannote.audio for speaker segmentation + Whisper for transcription, then you align the timelines yourself. That means requesting HuggingFace gated access, installing a stack of dependencies, and writing your own alignment logic — and it isn't especially strong on Chinese.
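To make "align the timelines yourself" concrete, here is a minimal sketch of that step, assuming each transcript segment is given to the speaker whose turn overlaps it the most. The tuple shapes and the `assign_speakers` helper are hypothetical simplifications for illustration, not the actual output formats or APIs of pyannote.audio or Whisper.

```python
from typing import List, Tuple

# Hypothetical, simplified shapes (NOT the real pyannote/Whisper formats):
# a speaker turn is (start_s, end_s, speaker); a segment is (start_s, end_s, text).
SpeakerTurn = Tuple[float, float, str]
Segment = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[SpeakerTurn]) -> List[Tuple[str, str]]:
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


turns = [(0.0, 3.3, "A"), (3.3, 7.2, "B")]
segments = [
    (0.1, 3.0, "Hello everyone."),
    (3.1, 7.0, "Status update."),
    (8.0, 9.0, "Thanks."),  # no speaker turn covers this segment
]
labeled = assign_speakers(segments, turns)
for speaker, text in labeled:
    print(f"{speaker}: {text}")
```

Even this toy version has to decide what to do with segments that no speaker turn covers, or that straddle two turns. That is the kind of glue code a single integrated pipeline spares you.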
With FunASR, a single generate() call is enough. VAD, ASR, punctuation and speaker embedding (CAM++) are chained into one pipeline that returns the speaker id, timestamps and text for every sentence.
Install
pip install -U funasr modelscope
Full code
from funasr import AutoModel

# One model handle = VAD + ASR + punctuation + speaker embedding (CAM++)
model = AutoModel(
    model="paraformer-zh",  # or "iic/SenseVoiceSmall"
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",  # speaker diarization
)

res = model.generate(input="meeting.wav", batch_size_s=300)

# start/end are in milliseconds; convert to seconds for display
for s in res[0]["sentence_info"]:
    print(f'[{s["start"]/1000:.1f}s-{s["end"]/1000:.1f}s] '
          f'speaker {s["spk"]}: {s["sentence"]}')
What the output looks like (example)
[0.0s-3.2s] speaker 0: Hi everyone, today's meeting is about the Q3 product plan.
[3.5s-7.1s] speaker 1: Quick status update — the core features are about 80% done.
[7.4s-9.9s] speaker 2: QA may need two more weeks on our side.
[10.2s-13.6s] speaker 0: Got it, let's push the launch back by one week.
Each sentence_info item has four fields:
| Field | Meaning |
|---|---|
| spk | Speaker id, auto-clustered by CAM++; no need to specify the number of speakers |
| start / end | Sentence start/end time (milliseconds) |
| sentence | Sentence text (punctuated) |
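As a small illustration of working with these fields, the sketch below totals the speaking time per speaker. The `sentence_info` list is hand-written sample data that mirrors the example output above, and `talk_time_seconds` is a hypothetical helper, not part of FunASR.

```python
from collections import defaultdict

# Hand-written sample shaped like the sentence_info items described above;
# times are in milliseconds, as in the field table.
sentence_info = [
    {"spk": 0, "start": 0, "end": 3200,
     "sentence": "Hi everyone, today's meeting is about the Q3 product plan."},
    {"spk": 1, "start": 3500, "end": 7100,
     "sentence": "Quick status update: the core features are about 80% done."},
    {"spk": 2, "start": 7400, "end": 9900,
     "sentence": "QA may need two more weeks on our side."},
    {"spk": 0, "start": 10200, "end": 13600,
     "sentence": "Got it, let's push the launch back by one week."},
]


def talk_time_seconds(items):
    """Sum (end - start) per speaker id and convert milliseconds to seconds."""
    totals_ms = defaultdict(int)
    for item in items:
        totals_ms[item["spk"]] += item["end"] - item["start"]
    return {spk: ms / 1000 for spk, ms in totals_ms.items()}


totals = talk_time_seconds(sentence_info)
for spk, seconds in sorted(totals.items()):
    print(f"speaker {spk}: {seconds:.1f}s")
```

On real output you would pass `res[0]["sentence_info"]` instead of the sample list.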
Tested on a real recording
On a 227-second multi-speaker meeting recording, FunASR automatically split the audio into 77 sentences and distinguished 11 different speakers, without being told the speaker count in advance. The whole pipeline is one generate() call after a single pip install.
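Counts like these can be read straight off sentence_info. The sketch below shows one way to do it; `summarize` is a hypothetical helper and the sample list is made up for illustration, not the recording described above.

```python
from collections import Counter


def summarize(items):
    """Count sentences, distinct speakers, and sentences per speaker."""
    per_speaker = Counter(item["spk"] for item in items)
    return {
        "sentences": len(items),
        "speakers": len(per_speaker),
        "sentences_per_speaker": dict(per_speaker),
    }


# Made-up sample; on real output, pass res[0]["sentence_info"] instead.
sample = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 1, "start": 3500, "end": 7100, "sentence": "Status update."},
    {"spk": 2, "start": 7400, "end": 9900, "sentence": "QA needs two weeks."},
    {"spk": 0, "start": 10200, "end": 13600, "sentence": "Got it."},
]
summary = summarize(sample)
print(summary)
```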
Why FunASR instead of pyannote + Whisper
- One call, no alignment: transcription and speaker labels are aligned by construction.
- No HuggingFace gated access: pyannote's speaker models are gated; FunASR models are openly downloadable.
- Stronger on Chinese: Paraformer / SenseVoice have far lower CER than Whisper on Chinese (benchmark).
- Runs on CPU: get results offline without a GPU.
Going further
- Swap model="paraformer-zh" for "iic/SenseVoiceSmall" to use SenseVoice (with emotion/event tags).
- Iterate over sentence_info to export SRT subtitles or aggregate per speaker into meeting minutes.
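Both post-processing ideas can be sketched in plain Python. The helpers below (`srt_timestamp`, `to_srt`, `merge_turns`) are hypothetical names, not FunASR functions. They assume items carry the spk, start, end and sentence fields described earlier, with times in milliseconds, and that merged sentences should be joined with a space (for Chinese text you would likely join with an empty string).

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, rem = divmod(int(ms), 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(items) -> str:
    """Render sentence items as numbered SRT cues, prefixed with the speaker id."""
    blocks = []
    for index, item in enumerate(items, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(item['start'])} --> {srt_timestamp(item['end'])}\n"
            f"Speaker {item['spk']}: {item['sentence']}\n"
        )
    return "\n".join(blocks)


def merge_turns(items):
    """Merge consecutive sentences from the same speaker into one turn."""
    turns = []
    for item in items:
        if turns and turns[-1]["spk"] == item["spk"]:
            turns[-1]["end"] = item["end"]
            turns[-1]["sentence"] += " " + item["sentence"]
        else:
            turns.append(dict(item))  # copy so the input list is not mutated
    return turns


# Made-up sample; on real output, pass res[0]["sentence_info"] instead.
sample = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 0, "start": 3300, "end": 5000, "sentence": "Let's begin."},
    {"spk": 1, "start": 5400, "end": 7100, "sentence": "Status update."},
]
srt_text = to_srt(sample)
turns = merge_turns(sample)
print(srt_text)
for turn in turns:
    print(f"speaker {turn['spk']}: {turn['sentence']}")
```

Writing `srt_text` to a `.srt` file gives speaker-labelled subtitles, and the merged turns are a reasonable starting point for per-speaker minutes.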
FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit — fast and accurate, especially on Chinese.
Star FunASR on GitHub ★