Speaker Diarization with FunASR: Who Spoke When, in One Call

For meeting minutes, call-center QA or interview transcripts, you don't just want to know what was said — you want to know who said each line. That task is called speaker diarization.

The usual approach glues two tools together: pyannote.audio for speaker segmentation and Whisper for transcription, with the two timelines aligned by hand. That means requesting gated model access on Hugging Face, installing a stack of dependencies, and writing your own alignment logic, and the result isn't especially strong on Chinese.
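To give a sense of what "writing your own alignment logic" involves, here is a minimal sketch of one common strategy: label each transcript segment with the speaker whose turn overlaps it the most. The input shapes (lists of dicts with start, end, speaker and text keys, times in seconds) and the function names are assumptions made for illustration, not the actual output formats of pyannote.audio or Whisper.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in transcript:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        # Segments that overlap no turn keep speaker=None
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Hypothetical inputs, times in seconds
turns = [
    {"start": 0.0, "end": 3.4, "speaker": "A"},
    {"start": 3.4, "end": 7.2, "speaker": "B"},
]
transcript = [
    {"start": 0.2, "end": 3.0, "text": "Hi everyone."},
    {"start": 3.1, "end": 7.0, "text": "Quick status update."},
]

labeled = assign_speakers(transcript, turns)
for seg in labeled:
    print(seg["speaker"], seg["text"])
```

Even this simple version leaves open questions, such as what to do with a segment that straddles two speakers or overlaps none. That is the part you end up maintaining yourself.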

With FunASR, a single generate() call is enough. VAD, ASR, punctuation and speaker embedding (CAM++) are chained into one pipeline that returns the speaker id, timestamps and text for every sentence.

Install

pip install -U funasr modelscope

Full code

from funasr import AutoModel

# One model handle = VAD + ASR + punctuation + speaker embedding (CAM++)
model = AutoModel(
    model="paraformer-zh",   # or "iic/SenseVoiceSmall"
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",       # speaker diarization
)

res = model.generate(input="meeting.wav", batch_size_s=300)

for s in res[0]["sentence_info"]:
    print(f'[{s["start"]/1000:.1f}s-{s["end"]/1000:.1f}s] '
          f'speaker {s["spk"]}: {s["sentence"]}')

What the output looks like (example)

[0.0s-3.2s] speaker 0: Hi everyone, today's meeting is about the Q3 product plan.
[3.5s-7.1s] speaker 1: Quick status update — the core features are about 80% done.
[7.4s-9.9s] speaker 2: QA may need two more weeks on our side.
[10.2s-13.6s] speaker 0: Got it, let's push the launch back by one week.

Each sentence_info item has four fields:

| Field | Meaning |
| --- | --- |
| spk | Speaker id, auto-clustered by CAM++ (no need to specify the number of speakers) |
| start / end | Sentence start/end time (milliseconds) |
| sentence | Sentence text (punctuated) |
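As a small illustration of working with these fields, the sketch below totals the speaking time per speaker. The sentences list is hand-written sample data that mirrors the example output above, and talk_time_ms is a hypothetical helper, not part of FunASR. In real use you would pass res[0]["sentence_info"] instead.

```python
from collections import defaultdict


def talk_time_ms(sentences):
    """Sum (end - start) per speaker id; times stay in milliseconds."""
    totals = defaultdict(int)
    for s in sentences:
        totals[s["spk"]] += s["end"] - s["start"]
    return dict(totals)


# Sample data shaped like sentence_info items (assumed, for illustration)
sentences = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 1, "start": 3500, "end": 7100, "sentence": "Quick status update."},
    {"spk": 2, "start": 7400, "end": 9900, "sentence": "QA may need two more weeks."},
    {"spk": 0, "start": 10200, "end": 13600, "sentence": "Got it."},
]

totals = talk_time_ms(sentences)
# Print the longest talkers first, converting milliseconds to seconds
for spk, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"speaker {spk}: {ms / 1000:.1f}s")
```

Summing in integer milliseconds and converting only when printing avoids accumulating floating-point rounding error.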

Tested on a real recording

On a 227-second multi-speaker meeting recording, FunASR automatically split the audio into 77 sentences and distinguished 11 different speakers without being told the speaker count in advance. The whole pipeline is one generate() call and a single Python dependency.
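With this many short sentences, it is often easier to read the transcript as speaker turns. The sketch below merges consecutive sentences from the same speaker into one turn. It assumes the items carry the spk, start, end and sentence fields described above; merge_turns is a hypothetical helper and the sample list is made up for illustration.

```python
def merge_turns(sentences):
    """Merge consecutive sentences that share a speaker id into single turns."""
    turns = []
    for s in sentences:
        if turns and turns[-1]["spk"] == s["spk"]:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["end"] = s["end"]
            turns[-1]["sentence"] += " " + s["sentence"]
        else:
            # New speaker: start a new turn (copy so the input is not mutated)
            turns.append(dict(s))
    return turns


# Sample data shaped like sentence_info items (assumed, for illustration)
sentences = [
    {"spk": 0, "start": 0, "end": 3200, "sentence": "Hi everyone."},
    {"spk": 0, "start": 3300, "end": 5000, "sentence": "Let's get started."},
    {"spk": 1, "start": 5400, "end": 8000, "sentence": "Quick status update."},
]

turns = merge_turns(sentences)
for t in turns:
    print(f'[{t["start"] / 1000:.1f}s-{t["end"] / 1000:.1f}s] '
          f'speaker {t["spk"]}: {t["sentence"]}')
```

The space used to join sentences suits English text. For Chinese transcripts you would likely join with an empty string instead.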

Why FunASR instead of pyannote + Whisper

Going further

FunASR is Tongyi Lab's open-source, industrial-grade speech recognition toolkit — fast and accurate, especially on Chinese.

