Updated June 24, 2026

pyannote.audio — The Open-Source Speaker Diarization Toolkit, Honestly

By VexaScribe Editorial · Published June 24, 2026

pyannote.audio is the de facto open-source library for speaker diarization in 2026 — PyTorch-based, MIT-licensed, maintained by Hervé Bredin and contributors, hosted on Hugging Face. The current model (pyannote/speaker-diarization-3.1) is the standard for self-hosted diarization and pairs with Whisper for the dominant Whisper + pyannote transcription stack. This page covers what pyannote is, how to install and run it, how accurate it is on standard benchmarks, the operational reality of self-hosting, and when to stay with it vs switch to a hosted service. Disclosure up front: VexaScribe runs Whisper + pyannote in production. The operational tradeoffs described here are ones we work with daily, not theoretical observations from outside the stack.

Key takeaways

  • pyannote.audio is the open-source standard. MIT-licensed code, PyTorch-based, hosted on Hugging Face. Maintained by Hervé Bredin and contributors. The default choice for self-hosted speaker diarization.
  • Current model: speaker-diarization-3.1. Powerset multi-class architecture (Plaquet and Bredin, Interspeech 2023). ~12-14% DER on AMI, ~9-11% on VoxConverse, ~17-19% on DIHARD III. State-of-the-art for open-source in 2026.
  • Whisper + pyannote is the standard stack. Whisper for text, pyannote for “who spoke when.” WhisperX wraps both for the easiest integration. faster-whisper + pyannote for production.
  • Hugging Face access is gated. You must accept the user agreement for both speaker-diarization-3.1 and segmentation-3.0 separately. Common gotcha: missing the segmentation agreement causes 401 errors.
  • GPU strongly recommended for production. ~5-10x real-time on T4 (~$0.04-$0.07/hour of audio). CPU is roughly real-time, fine for dev but not for batch workloads.
  • When to stay with pyannote: data sovereignty, research, custom training, you have GPU infrastructure. When to switch to hosted: you don't want to operate the model, you need an SLA, you want a single API for transcription + diarization, you're billing customers per call.
  • VexaScribe runs this stack in production. We use Whisper Large-v3 + pyannote 3.1. This guide is written from operating the stack daily, not from the outside. If you decide self-hosting isn't worth it, we're a hosted version of the same thing.

What is pyannote.audio?

pyannote.audio is an open-source PyTorch toolkit for speaker diarization: the task of figuring out “who spoke when” in an audio file. Given a multi-speaker recording, pyannote produces a list of speaker turns, each with start and end timestamps and a speaker label. It does not transcribe (no text output) and does not identify named speakers (labels are generic, such as SPEAKER_00 and SPEAKER_01, not “Alice” and “Bob”). Both of those layers come from downstream processing, usually Whisper for transcription and a manual or AI rename step for named labels.

Project basics

  • Maintainer: Hervé Bredin (originally LIMSI/CNRS, later IRIT/CNRS in Toulouse) and a community of contributors
  • License: MIT for the code; models on Hugging Face are gated behind a user agreement (free to accept for research and commercial use as of 2026)
  • Hosting: github.com/pyannote/pyannote-audio for the code; huggingface.co/pyannote for the models
  • Framework: PyTorch
  • Underlying tasks: voice activity detection, speaker segmentation, speaker embedding, clustering — all packaged as a single Pipeline abstraction

Version timeline

Version | Year | Status | Note
1.x | 2019-2021 | Legacy | Original release; deep but verbose pipeline configuration
2.0 | 2022 | Legacy | Major refactor; introduced Pipeline abstraction; widely adopted
2.1 | 2023 | Legacy | Accuracy improvements; ungated models
3.0 | 2024 | Stable | Powerset multi-class architecture (Plaquet and Bredin, Interspeech 2023); gated models on HF
3.1 | 2024-current | Current | Improved short-segment handling; better noisy audio behavior

If you're starting a new project in 2026, use 3.1. The 2.x models still work for legacy projects but are no longer actively maintained, and the Powerset architecture in 3.x meaningfully improves overlap handling.

The current model — pyannote/speaker-diarization-3.1

The 3.x line introduced the Powerset multi-class formulation described in Plaquet and Bredin (Interspeech 2023), “Powerset multi-class cross entropy loss for neural speaker diarization.” The core insight: instead of making an independent binary active/inactive decision for each speaker in each frame (multi-label classification), the segmentation model makes one multi-class prediction per frame over the possible subsets of active speakers. This handles overlapping speech natively (a frame can be labeled “Speaker A + Speaker B both active” rather than forced into one or the other) and removes the detection-threshold tuning that the earlier multi-label segmentation required. Clustering is still needed afterwards to keep speaker labels consistent across windows.
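
To make the Powerset idea concrete, the sketch below enumerates the classes such a head chooses between for a single frame. It assumes at most 3 speakers per window and at most 2 speaking at once, which is our reading of the segmentation-3.0 model card; the function name powerset_classes is illustrative and not part of pyannote's API.

```python
from itertools import combinations

def powerset_classes(num_speakers, max_simultaneous):
    """List every speaker subset the model can predict for one frame."""
    classes = []
    for size in range(max_simultaneous + 1):  # size 0 is the non-speech class
        classes.extend(combinations(range(num_speakers), size))
    return classes

# Assumed configuration: 3 speakers per window, at most 2 overlapping
classes = powerset_classes(num_speakers=3, max_simultaneous=2)
for index, subset in enumerate(classes):
    label = " + ".join(f"speaker {s + 1}" for s in subset) or "non-speech"
    print(index, label)
```

Under these assumptions there are 7 classes: non-speech, three single-speaker classes, and three two-speaker overlap classes. Each frame gets exactly one of them, so overlap is a class of its own and not a combination of separately thresholded per-speaker outputs.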

What changed from 3.0 to 3.1

  • Improved handling of short speaker segments (under 1 second)
  • Better behavior on noisy audio without retuning
  • Same Powerset architecture; incremental rather than fundamental changes

Internal pipeline (high-level)

  1. Audio is segmented into short overlapping windows (10 seconds each for segmentation-3.0)
  2. Each window is encoded into a speaker representation
  3. The Powerset head predicts which subset of speakers is active per frame
  4. Speaker embeddings are clustered to assign consistent labels across windows
  5. Output: list of (start, end, speaker_label) tuples covering the full audio

The 3.1 pipeline depends on pyannote/segmentation-3.0 for the per-window segmentation (steps 1-3 above). Both models are gated separately on Hugging Face, and both agreements must be accepted.

Install and run (5 minutes)

Minimum working example. Assumes Python 3.9+, a Hugging Face account, and either CPU or GPU.

Step 1: Install

pip install pyannote.audio

Step 2: Accept Hugging Face agreements

Visit both model pages in the browser, scroll to the agreement section, and accept:

  • huggingface.co/pyannote/speaker-diarization-3.1
  • huggingface.co/pyannote/segmentation-3.0

Then generate an access token at huggingface.co/settings/tokens (read scope is sufficient).

Step 3: Run a minimal example

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_YOUR_TOKEN_HERE",
)

# Optional: move to GPU for production speeds
# import torch
# pipeline.to(torch.device("cuda"))

diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

Expected output:

0.0s - 4.2s: SPEAKER_00
4.5s - 9.1s: SPEAKER_01
9.3s - 12.8s: SPEAKER_00
...
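
Once the turns are available, simple summaries need no further pyannote calls. A minimal sketch, assuming the turns have already been copied into plain (start, end, speaker) tuples; the values mirror the sample output above and the helper name talk_time is hypothetical.

```python
from collections import defaultdict

# Illustrative turns, e.g. collected from diarization.itertracks(yield_label=True)
turns = [
    (0.0, 4.2, "SPEAKER_00"),
    (4.5, 9.1, "SPEAKER_01"),
    (9.3, 12.8, "SPEAKER_00"),
]

def talk_time(turns):
    """Sum the duration of every turn, grouped by speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)

totals = talk_time(turns)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f}s")
```

The same tuple list is a convenient place to apply a rename step (for example, mapping SPEAKER_00 to a real name) before the labels reach a transcript.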

Common gotchas

  • 401 Unauthorized: Almost always means you accepted the speaker-diarization-3.1 agreement but not segmentation-3.0. Accept both.
  • Slow on CPU: ~real-time processing speed. For batch workloads, move to GPU with pipeline.to(torch.device("cuda")).
  • Long model load time: 5-15 seconds on first call. Cache the pipeline object; don't reload per request.
  • Audio format issues: pyannote expects 16kHz mono WAV ideally; other formats work but converting upfront avoids edge cases. Use torchaudio or ffmpeg to normalize.
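
A quick pre-flight check can catch format problems before a file reaches the pipeline. The sketch below uses only Python's standard wave module to report whether a WAV file is already 16 kHz mono. The helper name is_16k_mono is hypothetical, and the conversion itself is still left to ffmpeg or torchaudio.

```python
import wave

def is_16k_mono(path):
    """Return True if the WAV file at `path` is 16 kHz and single channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Note that the wave module only reads uncompressed PCM WAV files and raises wave.Error on anything else, so treat an exception here as "needs conversion" and not as a hard failure.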

Whisper + pyannote — the standard self-hosted pipeline

pyannote on its own gives you speaker turns but no text. Whisper on its own gives you text but no speaker labels. Together they produce the standard self-hosted transcription + diarization output: text with speaker attribution per segment. This combination is the default for self-hosted production pipelines in 2026.

The pattern

  1. Run Whisper (or faster-whisper) with word-level timestamps
  2. Run pyannote separately to get speaker turns
  3. Align: for each transcribed segment, look up which speaker was active during that time range
  4. Output: a list of (start, end, text, speaker) tuples
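
Step 3 is where most of the custom logic lives. A minimal sketch of one common approach, which assigns each transcript segment to the speaker whose turns overlap it the most. The data shapes and the names overlap and assign_speakers are assumptions for illustration, not the WhisperX or pyannote API.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Attach to each (start, end, text) segment the speaker with most overlap."""
    labeled = []
    for start, end, text in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > 0:
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # None marks a segment that no speaker turn covers
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append((start, end, text, speaker))
    return labeled

# Illustrative inputs: transcript segments and diarization turns
segments = [
    (0.2, 4.0, "Thanks for joining."),
    (4.1, 9.0, "Happy to be here."),
    (20.0, 21.0, "Text outside any turn."),
]
turns = [(0.0, 4.2, "SPEAKER_00"), (4.5, 9.1, "SPEAKER_01")]

labeled = assign_speakers(segments, turns)
for start, end, text, speaker in labeled:
    print(f"[{speaker}] {start:.1f}-{end:.1f}: {text}")
```

Assigning at the segment level is the simplest option. When a segment spans a speaker change, running the same overlap rule per word (using the word-level timestamps from step 1) and then regrouping words by speaker gives cleaner boundaries.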

WhisperX — the easiest integration

WhisperX wraps Whisper + pyannote + word-level alignment in a single library. Minimum working example:

# pip install whisperx
import whisperx

device = "cuda"
audio_file = "meeting.wav"
batch_size = 16

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align for word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], align_model, metadata, audio, device,
)

# 3. Diarize with pyannote
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="hf_YOUR_TOKEN", device=device,
)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Output: result["segments"] with speaker labels attached
for segment in result["segments"]:
    # Segments with no overlapping speaker turn may lack a "speaker" key
    print(f"[{segment.get('speaker', 'UNKNOWN')}] {segment['text']}")

For production (not WhisperX): most production pipelines use faster-whisper (CTranslate2 backend, 2-4x faster than reference Whisper) plus pyannote called directly, with custom alignment logic tailored to the use case. WhisperX is the right starting point but most teams diverge as they scale.

Related: Whisper transcription, speaker labeling concepts, word-level timestamps reference.

Accuracy in 2026 (honest)

Diarization accuracy is measured by DER (Diarization Error Rate): the percentage of time that diarization output disagrees with ground truth, summing missed speech, false alarms, and speaker confusion. Lower is better. The numbers below are from the published pyannote/speaker-diarization-3.1 model card; verify against the current card before relying on them in production planning.
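
As a worked example of the definition, DER is the sum of the three error durations divided by the total duration of reference speech. The numbers below are made up for illustration and are not taken from any benchmark.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a fraction: error durations over reference speech, all in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech

# Illustrative: 600 s of reference speech with 72 s of combined errors
der = diarization_error_rate(
    missed=30.0, false_alarm=18.0, confusion=24.0, total_speech=600.0
)
print(f"DER = {der:.1%}")  # DER = 12.0%
```

Because false alarms are counted against a denominator of reference speech only, DER can exceed 100% on recordings with little speech and a lot of falsely detected speech.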

Dataset | Description | DER | Note
AMI | Meeting recordings, 4 speakers, English | ~12-14% | Standard benchmark for meeting diarization
VoxConverse | YouTube conversations, varied speakers and conditions | ~9-11% | Closer to real-world audio quality variety
DIHARD III | Hard cases: noisy, overlapping, multilingual | ~17-19% | Stress test for diarization in adversarial conditions

Real-world vs benchmark. Benchmarks are useful for comparing models but don't map directly to what you'll see in production. Practical accuracy on real audio:

  • Clean two-speaker studio audio (podcast interview, recorded sales call): 95%+ correct speaker attribution. Boundaries off by 100-300ms at speaker transitions but the labels are right.
  • Meeting-room audio, 3-4 speakers, shared microphone: 75-85% correct attribution. Drops further if heavy overlap or speakers stand at different distances from the mic.
  • Telephony audio (8kHz, narrow band): 70-80% — Whisper and pyannote both prefer 16kHz; resampled 8kHz audio loses information that diarization relies on.
  • Heavy overlap (debates, family arguments, talkative meetings): Drops significantly. Powerset architecture helps but doesn't solve the problem entirely.

Comparison context. Commercial services (Deepgram, AssemblyAI) report 1-3 DER points better on their internal benchmarks, typically through specialized data augmentation and proprietary post-processing. Whether that's worth $0.40-$0.60/hour vs free pyannote depends on your scale and operational tolerance. See our best speaker diarization tools comparison for broader context.

When to use pyannote vs a hosted service

The decision usually comes down to operational tolerance, not raw accuracy. Both pyannote and hosted services are good enough for production; the question is what you want to spend engineering time on.

Use pyannote when

  • Data sovereignty matters. Audio cannot leave your infrastructure (regulated industry, government, legal-sensitive content).
  • Research or academic use. You need to reproduce results, customize the model, or publish.
  • Custom training or fine-tuning. Hosted services don't let you train; pyannote does.
  • You have GPU infrastructure. Existing GPU fleet → marginal cost is near zero.
  • Free is required. Bootstrapped projects, academic research, hobby work.
  • High volume that amortizes ops cost. If you're processing 1000+ hours/month, self-hosted may pay for itself.

Use hosted when

  • You don't want to operate the model. The biggest one. Hosted services handle versions, GPUs, scaling, monitoring.
  • You need an SLA. Pyannote has no SLA; commercial services do.
  • Single API for transcription + diarization. Deepgram, AssemblyAI, and similar services bundle both in one HTTP call.
  • Predictable per-call cost. Self-hosted cost is “GPU + engineer time” which is harder to attribute to per-customer billing.
  • Small scale. If you're processing 100 hours/month or less, hosted is cheaper than running a GPU 24/7 for that volume.
  • Need LLM integrations. AssemblyAI's LeMUR and Deepgram's formatted output features add value beyond raw diarization.

A realistic decision rule: if processing volume is <500 hours/month, hosted almost always wins on total cost (engineering time + ops + accuracy). At 500-5000 hours/month, the comparison gets close. Above 5000 hours/month, self-hosted typically wins on cost if you have engineering capacity to maintain it.
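
The rule of thumb can be reproduced with simple arithmetic. The sketch below uses the hosted price range and per-hour GPU cost quoted in this guide, plus one loudly hypothetical input: a fixed $2,000/month for the engineering and ops time of self-hosting. Substitute your own figure, since it dominates the result.

```python
def monthly_cost_hosted(hours, price_per_hour=0.50):
    """Hosted API cost; 0.50 is the midpoint of the $0.40-$0.60/hour range."""
    return hours * price_per_hour

def monthly_cost_self_hosted(hours, compute_per_hour=0.07, fixed_ops=2000.0):
    """GPU compute plus a fixed ops cost (fixed_ops is a hypothetical figure)."""
    return hours * compute_per_hour + fixed_ops

def break_even_hours(price_per_hour=0.50, compute_per_hour=0.07, fixed_ops=2000.0):
    """Monthly volume at which both options cost the same."""
    return fixed_ops / (price_per_hour - compute_per_hour)

for hours in (100, 500, 5000, 10000):
    hosted = monthly_cost_hosted(hours)
    own = monthly_cost_self_hosted(hours)
    print(f"{hours:>6} h/month: hosted ${hosted:,.0f} vs self-hosted ${own:,.0f}")
print(f"break-even near {break_even_hours():,.0f} h/month")
```

With these inputs the break-even lands a little under 5,000 hours/month, which is why the comparison only tips toward self-hosting at high volume. A lower fixed ops cost (for example, an existing GPU fleet and an on-call team) moves the break-even down sharply.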

Operational reality check

These are the hidden costs of self-hosting that developers tend to underestimate until they have done it. None of them are blockers, but they are real:

GPU costs and provisioning

AWS T4 on-demand: ~$0.35/hour. Spot: ~$0.10/hour. RTX 4090 in a colocated server: ~$1,500 hardware + power. For batch workloads, GPU utilization tends to be bursty — you want to amortize the cost across many jobs, which means a queue and a scheduler. Serverless GPU options (Modal, Replicate, RunPod serverless) help if your traffic is bursty enough that paying for idle GPU isn't worth it.
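
To turn those hourly GPU prices into a cost per hour of audio, divide by the processing speed. The sketch below uses the 5-10x real-time range quoted elsewhere in this guide and assumes the GPU is billed only while it is busy, which is optimistic for an always-on instance.

```python
def processing_minutes(audio_hours, realtime_factor):
    """Wall-clock minutes needed to process audio at N x real-time."""
    return audio_hours * 60.0 / realtime_factor

def compute_cost(audio_hours, realtime_factor, gpu_price_per_hour):
    """GPU spend in dollars, assuming billing only for busy time."""
    return audio_hours / realtime_factor * gpu_price_per_hour

for factor in (5, 10):
    minutes = processing_minutes(1, factor)
    on_demand = compute_cost(1, factor, 0.35)
    spot = compute_cost(1, factor, 0.10)
    print(f"{factor}x real-time: {minutes:.0f} min, "
          f"${on_demand:.3f} on-demand, ${spot:.3f} spot")
```

At on-demand pricing this gives roughly $0.035-$0.07 per hour of audio. Idle time is the missing term: a GPU that sits unused 90% of the time costs about ten times more per processed hour than these figures suggest.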

Model load time (matters for serverless)

First call after process start: 5-15 seconds to load the model into GPU memory. Subsequent calls: instant. For serverless deployments where cold starts are common, this is meaningful. Strategies: keep a warm instance, use serverless platforms with model-loading optimizations (Modal's built-in support for HF models is best-in-class), or batch requests to amortize the load cost.
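
Within a long-lived process, the simplest way to pay the load cost once is to cache the pipeline object. A minimal sketch, assuming the token is available in the HF_TOKEN environment variable and using the same use_auth_token argument as the install example; the helper names get_pipeline and diarize are hypothetical.

```python
import functools
import os

@functools.lru_cache(maxsize=1)
def get_pipeline():
    """Load the diarization pipeline on first use, then reuse the same object."""
    # Imported lazily so that importing this module stays cheap.
    from pyannote.audio import Pipeline

    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ["HF_TOKEN"],  # assumes the token is set
    )

def diarize(path):
    """Run diarization; only the first call in a process pays the load time."""
    return get_pipeline()(path)
```

Calling get_pipeline() once at service startup moves the load out of the first request. This does not help with serverless cold starts, where each new container is a new process and the options listed above still apply.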

HF gated-access version management

Tokens expire, license terms get updated, model versions change. In a production environment with multiple deployments, you need a process for rotating tokens, accepting updated agreements, and pinning model versions to avoid surprises. Most teams pin to a specific commit hash on Hugging Face rather than the rolling tag.

Version churn between major releases

Migration from pyannote 2.x to 3.x required code changes (Pipeline API differences, new HF auth flow, different model identifiers). A similar break is possible for 3.x → 4.x whenever it ships. Plan for periodic migration work; a self-hosted production pipeline typically needs ~2-4 days of engineering attention per year just to stay current.

Memory and throughput tuning

Batch size, sample rate, and audio chunking all affect throughput. The defaults work but are conservative. Production tuning typically improves throughput 2-3x via larger batches, lower precision (fp16), and concurrent request handling. This is engineering work that hosted services have already done for you.

Comparison vs alternatives

Honest snapshot of the diarization landscape as of June 2026. Verify benchmark numbers against each tool's current docs before committing.

Tool | License | Accuracy | Operational cost | When to choose
pyannote.audio 3.1 | MIT (gated models) | SOTA open-source | GPU + version mgmt + HF auth | Self-hosted, free, full control, research
NVIDIA NeMo | Apache 2.0 | Comparable to pyannote | Heavier framework; NVIDIA-stack lock-in | NVIDIA-aligned shops, GPU-rich infrastructure
SpeechBrain | Apache 2.0 | Slightly behind pyannote 3.1 | Lighter, modular, easier to extend | Research, custom architectures, education
Deepgram diarization | Commercial | 1-3 DER better on internal benchmarks | $0.40-$0.60/hour bundled with transcription | Production, simple HTTP API, predictable per-call cost
AssemblyAI diarization | Commercial | Comparable to Deepgram | $0.40-$0.60/hour bundled with transcription | LLM-integrated workflows (LeMUR), English focus
VexaScribe (hosted Whisper + pyannote) | Commercial (SaaS) | Whisper Large-v3 + pyannote 3.1 production stack | $2-$20/month subscription | You want the Whisper + pyannote output without operating the models

Open-source vs commercial split. The open-source side is dominated by pyannote with NeMo and SpeechBrain as alternatives. The commercial side is dominated by Deepgram and AssemblyAI for API-first usage. SaaS pipelines (VexaScribe and similar) abstract the diarization detail entirely — VexaScribe specifically runs Whisper + pyannote internally, so the “hosted” option for end-users is functionally the same stack described on this page, just operated by us instead of by you.

Where VexaScribe fits — we run Whisper + pyannote

Honest framing: VexaScribe's production pipeline is Whisper Large-v3 + pyannote 3.1, the same stack described on this page. We wrote this guide because we operate it daily — the gotchas around HF gated access, GPU provisioning, model load times, and version churn are things we've actually hit and solved, not theoretical. If you decide to operate the stack yourself, this page is what we wish we'd had when we started. If you decide you'd rather use someone else's operated version, we're an option — the user-facing output is what you'd build with Whisper + pyannote (transcription with speaker labels, TXT/DOCX/SRT/VTT/JSON exports), at $2-$20/month subscriptions.

When self-hosted pyannote is still the right answer: data sovereignty requirements (audio cannot leave your infra), research that needs reproducibility or custom training, very high volumes where a dedicated GPU fleet amortizes ops cost, or anything regulated where the chain of custody matters. We run the same stack and recommend self-hosting in those cases, not us.

When a hosted Whisper + pyannote pipeline (ours or similar) makes sense: non-technical colleagues need transcripts and you don't want to build a custom UI; small projects where building the pipeline isn't worth the engineering time; bridge solution while you scope a self-hosted build; you've evaluated pyannote and decided the ops overhead isn't worth it for your volume. For these, our hosted pipeline exists — same models, less maintenance.

More relevant: our honest diarization tools comparison, our transcription API comparison covering Deepgram, AssemblyAI, Speechmatics, and similar dev-focused services.

Frequently asked questions

Is pyannote.audio free?

The code is MIT-licensed and free. The models on Hugging Face (pyannote/speaker-diarization-3.1, pyannote/segmentation-3.0) are gated — you must accept the user agreement and authenticate with a HF token to download them. Acceptance is free for both research and commercial use as of 2026; the gating exists for usage tracking and to enforce the license terms, not to charge for access. The practical cost of running pyannote is the compute (GPU strongly recommended for production) plus your engineering time to integrate, maintain, and handle version updates.

How accurate is pyannote.audio in 2026?

On the standard benchmarks reported in the pyannote/speaker-diarization-3.1 model card, DER (Diarization Error Rate) is approximately 12-14% on AMI, 9-11% on VoxConverse, and 17-19% on DIHARD III. These are state-of-the-art numbers for open-source diarization in 2026. Real-world performance scales with audio quality: clean two-speaker studio audio reaches 95%+ correct speaker attribution; meeting-room audio with three or more speakers and overlap drops to 75-85%. Some commercial services (Deepgram, AssemblyAI) report 1-3 DER points better on internal benchmarks but charge $0.40-$0.60/hour. For most self-hosted use cases, pyannote is the right tradeoff between accuracy and cost.

How do I get access to pyannote/speaker-diarization-3.1 on Hugging Face?

Three steps. (1) Visit huggingface.co/pyannote/speaker-diarization-3.1, scroll to the agreement section, and accept the user agreement (it asks for your name, email, affiliation, and intended use). (2) Visit huggingface.co/pyannote/segmentation-3.0 and accept the agreement there too — speaker-diarization-3.1 depends on the segmentation model under the hood, and you need access to both. (3) Generate a Hugging Face access token at huggingface.co/settings/tokens (read scope is sufficient), then pass it to from_pretrained() via use_auth_token="your_token_here" or via the HF_TOKEN environment variable. Common gotcha: forgetting to accept the segmentation-3.0 agreement causes a 401 even after speaker-diarization-3.1 is approved.

Can I use pyannote with Whisper?

Yes. Whisper + pyannote is the dominant self-hosted transcription + diarization pipeline in 2026. The standard pattern: run Whisper (any size, Large-v3 recommended for accuracy) to get the transcript with word-level timestamps, run pyannote separately to get speaker turn boundaries, then align the two using the timestamps to attach a speaker label to each transcript segment. The easiest implementation is WhisperX, which wraps Whisper + pyannote + word-level alignment in one library: pip install whisperx and you have a working pipeline in about 20 lines of Python. For production, faster-whisper (CTranslate2-based, 2-4x faster) is the common Whisper substitute, paired with pyannote for diarization.

What's the difference between pyannote 2.x and 3.x?

Two main shifts. (1) Architecture: 3.x introduced the Powerset multi-class formulation (Plaquet and Bredin, Interspeech 2023), which replaces the earlier multi-label segmentation model with one that predicts the set of active speakers per frame and so handles overlap natively. The result is better accuracy on overlapping speech and fewer thresholds to tune; clustering across windows is still part of the pipeline. (2) Model hosting: 3.x models are gated on Hugging Face and require explicit agreement acceptance, while 2.x models were openly downloadable. The 2.x models are still available on Hugging Face but no longer actively maintained; new projects should start with 3.x. Migration from 2.x to 3.x typically requires updating the Pipeline.from_pretrained() model identifier and handling the new HF authentication flow, but the high-level API is similar.

Does pyannote.audio require a GPU?

GPU is strongly recommended for production but CPU works for testing and small workloads. On a modern GPU (T4, A10, RTX 3090+), pyannote processes audio at roughly 5-10x real-time (a 1-hour recording takes 6-12 minutes). On CPU (modern x86 with AVX-512), processing is approximately real-time or slightly slower — a 1-hour recording takes 60-90 minutes. For interactive use cases (you're diarizing one file at a time during development) CPU is fine. For batch processing or any production workload, the cost savings of GPU outweigh the complexity. AWS T4 instances cost roughly $0.35/hour on-demand, $0.10/hour spot, and process audio at 5-10x real-time — a 1-hour recording costs $0.04-$0.07 of compute.

How does pyannote compare to Deepgram or AssemblyAI diarization?

Three different products, three different tradeoffs. pyannote.audio: free, open-source, runs on your infrastructure, state-of-the-art for open-source, but you operate the model (GPU, version management, gated access flow). Deepgram diarization: commercial API, 1-3 DER points better on Deepgram's internal benchmarks, $0.40-$0.60/hour bundled with transcription, simple HTTP integration. AssemblyAI diarization: commercial API, comparable accuracy to Deepgram, $0.40-$0.60/hour, particularly strong for English with their LeMUR LLM integrations on top. The decision: pyannote if you need self-hosted (data sovereignty, research, custom training), hosted if you want to skip operations work. For most production pipelines, the hosted services pay for themselves in saved engineering time within the first month.

Is there a non-gated pyannote model?

Older 2.x models (pyannote/speaker-diarization, pyannote/segmentation) were ungated and remain accessible without authentication. They're not actively maintained but work for testing and projects that can't navigate the gated-access flow (some restrictive enterprise environments block HF authentication). For production work in 2026, the 3.x models are recommended despite the gating; the accuracy gain from the Powerset architecture is meaningful, especially on overlapping speech. If gating is a hard blocker, alternatives include NVIDIA NeMo's diarization (Apache 2.0, ungated) and SpeechBrain (Apache 2.0, ungated). NeMo is roughly comparable to pyannote 3.x on benchmarks and SpeechBrain is slightly behind, but both are ungated and free to use commercially.

Methodology & disclosure

Sources: pyannote.audio repository at github.com/pyannote/pyannote-audio (MIT license; maintained by Hervé Bredin et al.). Current model card at huggingface.co/pyannote/speaker-diarization-3.1. Powerset architecture from Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization” (arXiv:2210.13513). Whisper paper at arXiv:2212.04356. WhisperX at github.com/m-bain/whisperX. faster-whisper at github.com/SYSTRAN/faster-whisper. Deepgram and AssemblyAI pricing from their respective public pricing pages as of June 2026.

Benchmark caveats: DER numbers reported here are from the published speaker-diarization-3.1 model card and may shift slightly as the model is updated. For production planning, verify against the current model card and consider running your own benchmark on representative audio from your workload. Commercial vendor benchmark claims (Deepgram, AssemblyAI “1-3 DER better”) are based on each vendor's own published benchmarks; we have not independently reproduced these and recommend treating cross-vendor benchmark comparisons with appropriate skepticism.

Disclosure: This page is published by VexaScribe, a hosted transcription SaaS. Our production pipeline is Whisper Large-v3 + pyannote 3.1 — we operate the exact stack described on this page. That gives the guide credibility (the gotchas are ones we've actually hit) but it also gives us a commercial interest worth surfacing: if you decide running pyannote yourself isn't worth it, we'd like you to consider our hosted version. We've tried to keep the page honest on when self-hosting is genuinely the right answer (data sovereignty, research, very high volume) — those readers should stay with pyannote and not us. The recommendation rules in the “when to use pyannote vs hosted” section are written from that honest tradeoff, not from a goal of converting every reader.

Editorial standards: See our editorial standards.
