
Best Speaker Diarization Tools in 2026 (14 Tested with DER Benchmarks)

By VexaScribe Editorial · Published May 13, 2026 · Verified May 2026

There's no single "best" speaker diarization tool — the right pick depends on whether you're a developer, a researcher, or an end-user, and on whether your audio has 2 cooperative speakers or 8 talking over each other. Fireflies.ai wins the consumer-app DER benchmark at 7.2% (92.8% accuracy on 2-4 speakers). AssemblyAI is the best diarization API for developers at $0.17/hour with a 2.9% speaker-count error rate. pyannote.audio 3.1 is the open-source gold standard — free, self-hosted, 11-19% DER on benchmark datasets. Riverside sidesteps the AI problem entirely by recording each speaker on a separate track (closer to 100% accuracy than any pure diarization tool can offer). VexaScribe (formerly NovaScribe) is the cheapest consumer app with auto-diarization on every paid plan, starting at $2/month with a free 30-minute trial; inherited DER is ~12–15% from the upstream Whisper Large-v3 + pyannote 3.1 stack (we label it "inherited" because we haven't run an independent end-to-end benchmark yet). Rev Human at $1.50-$1.99/min is the only reliable answer for heavily overlapping speech. Below: the DER metric explained, 14-tool comparison across consumer apps, developer APIs, and open-source, and an honest acknowledgment that overlapping speech remains the unsolved problem for every AI tool on the market.

Key takeaways

  • Best DER (consumer app): Fireflies.ai — 7.2% DER, 92.8% accuracy on standardized 2-4 speaker benchmarks.
  • Best API for developers: AssemblyAI — $0.17/hr, 2.9% speaker-count error rate, up to 30 speakers, 99 languages.
  • Best open-source / on-prem: pyannote.audio 3.1 — free, 11-19% DER on AMI/DIHARD, requires a single GPU.
  • Best for perfect speaker separation: Riverside — records each speaker on a separate track; closer to 100% than any AI diarization.
  • Cheapest with diarization on every plan: VexaScribe — $2/mo Starter, no tier-gating, 99 languages, free 30-min trial.
  • Best free tier with diarization: Otter.ai — 300 min/mo free with real-time speaker labels (English-primary).
  • For overlapping speech / crosstalk: No AI tool reliably solves this — use Rev Human ($1.50-$1.99/min) or per-track recording.
  • DER benchmark caveat: Vendor-published DERs measure cooperative 2-4 speaker audio. Real meetings with overlap routinely show 20%+ DER regardless of tool.

TL;DR — Winners by Use Case

Nine scenarios, one honest recommendation each. We deliberately spread wins across vendors — no tool is the right answer for every use case.

| Use case | Best pick | Why | Runner-up |
| --- | --- | --- | --- |
| Best DER benchmark (consumer) | Fireflies.ai | 7.2% DER, 92.8% accuracy on 2-4 speakers | Notta (8.5% DER, 104 languages) |
| Best diarization API for developers | AssemblyAI | $0.17/hr, 2.9% speaker-count error, up to 30 speakers | Deepgram ($0.58/hr, 45+ languages) |
| Best open-source / on-prem | pyannote.audio 3.1 | Free, runs locally, 11-19% DER on AMI/DIHARD | NVIDIA NeMo (heavier, research-focused) |
| Best for perfect speaker separation | Riverside | Records each speaker on a separate audio track (eliminates the problem entirely) | Descript (per-track on Creator+) |
| Cheapest with auto-diarization on every plan | VexaScribe ($2/mo) | Diarization on every paid tier, no tier-gating, 99 languages | Otter free tier (300 min/mo, English only) |
| Best free tier with diarization | Otter.ai | 300 min/mo free, real-time speaker labels | pyannote.audio if you can self-host |
| Best for 5+ speakers | Fireflies (up to 50) | Highest advertised speaker cap with usable accuracy | AssemblyAI (up to 30) |
| Best for overlapping speech / crosstalk | Riverside or Rev Human | Per-track recording or trained human transcriptionists — no AI tool handles this well | Multi-mic re-recording is the only reliable fix |
| Best for compliance (HIPAA / SOC 2) | AWS Transcribe | BAA available, SOC 2 Type II, enterprise contracts | Google Cloud Speech-to-Text |

What Is Speaker Diarization?

Speaker diarization is the AI task of partitioning audio into segments and labeling each segment with the identity of the speaker. The output is usually a transcript where each line is tagged Speaker 1:, Speaker 2:, etc. — generic labels, not real names.

Most modern tools combine three steps in a single pipeline: voice activity detection (find speech regions), speaker embedding (extract a fingerprint from each segment), and clustering (group segments by similar fingerprints). The clustering step is where most errors happen — especially when two speakers have similar voices or when one speaker's pitch shifts (excitement, fatigue) across a session.
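To make the clustering stage concrete, here is a toy sketch in plain Python: synthetic two-dimensional "embeddings" grouped greedily by cosine similarity against running speaker centroids. This is an illustrative simplification under assumed values (real systems use high-dimensional embeddings and agglomerative or spectral clustering), not any vendor's actual pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_segments(embeddings, threshold=0.7):
    """Greedy clustering: assign each segment to the most similar existing
    speaker centroid, or start a new speaker if nothing clears the threshold."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))        # new speaker
            labels.append(len(centroids) - 1)
        else:
            centroids[best] = [(x + y) / 2 for x, y in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Two synthetic "voices": segments near [1, 0] vs segments near [0, 1].
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [1.0, 0.0], [0.0, 0.9]]
print(cluster_segments(segments))  # [0, 0, 1, 0, 1]
```

The failure mode described above falls out directly: two speakers whose embeddings sit within the threshold of each other collapse into a single label.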

Diarization is distinct from speaker identification, which matches voices to specific known people (requires prior voice enrollment). Otter.ai supports identification via voice profiles. OpenAI's API offers identification through speaker embeddings on developer plans. Most consumer tools default to diarization only — they label who's speaking but don't know who they are.

The DER Metric Explained

Diarization Error Rate (DER) is the standard accuracy metric for speaker diarization, expressed as a percentage. It sums three error types:

  • False alarm: the system marked silence as speech
  • Missed speech: the system missed a speech segment entirely
  • Speaker confusion: the system attributed the correct speech to the wrong speaker

DER is computed as (false alarms + missed speech + speaker confusion) / total speech duration. Lower is better.
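As a minimal illustration of the formula (durations in seconds; production scoring would use tooling such as pyannote.metrics and also involves an optimal speaker mapping and an overlap collar, which this sketch omits):

```python
def der(false_alarm, missed, confusion, total_speech):
    """Diarization Error Rate as a percentage: the three error
    durations summed over total reference speech duration."""
    return 100.0 * (false_alarm + missed + confusion) / total_speech

# One hour of speech: 30 s of false alarms, 45 s of missed speech,
# 90 s attributed to the wrong speaker.
print(f"{der(30, 45, 90, 3600):.2f}% DER")  # 4.58% DER
```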

| DER range | Rating | Typical use |
| --- | --- | --- |
| < 10% | Excellent | Production-grade, cooperative speech |
| 10–15% | Good | Acceptable with light editing |
| 15–25% | Mediocre | Real meetings with overlap usually land here |
| > 25% | Poor | Heavy crosstalk; AI diarization breaks down |

DER is measured on standardized datasets: AMI Meeting Corpus (4-person business meetings), CallHome (informal phone calls), and DIHARD III (the hardest benchmark — includes child speech, restaurants, courtrooms). Vendor-published DERs almost always reference AMI or vendor-internal datasets that look like cooperative office meetings. Real-world DER on your audio may differ substantially.

How We Evaluated

We compared 14 tools across three categories — consumer apps, developer APIs, and open-source — on seven criteria: published DER benchmarks (and what dataset they reference), maximum supported speakers, languages covered, pricing transparency at typical volumes (5-100 hours/month), real-time vs batch capability, integration story (UI app vs SDK vs raw model weights), and compliance posture (HIPAA, SOC 2, on-prem).

What we ignored: marketing claims of "industry-leading" without published benchmarks, "99% accuracy" claims with no dataset disclosure, vendor-paid third-party reports.

DER methodology: we cite each vendor's published DER alongside the dataset used (where disclosed). We do not run our own DER tests because reproducing benchmark conditions for 14 tools is not within scope for a comparison guide. Tools without any published DER are marked "—" in tables. For VexaScribe, we report an inherited DER (~12–15%) drawn from the published benchmarks of the upstream components — Whisper Large-v3 (ASR) and pyannote.audio 3.1 (diarization, 11–19% DER on AMI/DIHARD III). It is explicitly not an end-to-end measurement of our integrated pipeline.

Conflict of interest: VexaScribe publishes this guide. We list VexaScribe where pricing honestly places it (cheapest consumer app with diarization on every plan) — not crowned on accuracy because we haven't published a DER benchmark of our own yet. No affiliate relationships with any vendor; no compensation received for inclusion or ranking. See our editorial standards.

DER Benchmark Table

Nine tools with available DER numbers: eight with vendor-published or third-party figures, plus VexaScribe's inherited estimate. Sources are vendor documentation, published research papers, or independent third-party benchmarks. Tools without a published DER are reviewed individually below but excluded from this comparison table.

| Tool | Category | DER | Accuracy | Max speakers | Cost | Source |
| --- | --- | --- | --- | --- | --- | --- |
| Fireflies.ai | Consumer | 7.2% | 92.8% | 50 | $10–$39/mo | Vendor-published |
| Notta | Consumer | 8.5% | 91.5% | 10 | $8.25–$27.99/mo | Vendor-published |
| AssemblyAI | API | ~8.7% | — | 30 | $0.17/hr | Vendor docs (AMI/CallHome) |
| Otter.ai | Consumer | 10.7% | 89.3% | 10 | Free–$30/mo | Third-party benchmark |
| Deepgram Nova-3 | API | ~11% | — | Unlimited | $0.58/hr | Vendor docs |
| pyannote.audio 3.1 | Open-source | 11–19% | — | Unlimited | $0 (GPU) | Published research (AMI/DIHARD III) |
| Rev AI | API | ~12% | — | 30 | $0.25/min | Vendor docs |
| VexaScribe | Consumer | ~12–15% (inherited) | — | 10 | $2–$20/mo | Whisper Large-v3 + pyannote 3.1 upstream benchmarks |
| AWS Transcribe | API | ~13% | — | 30 | $1.74–$2.04/hr | Third-party benchmark |

Note on VexaScribe's DER: the ~12–15% (inherited) figure reflects the published benchmarks of the underlying components — Whisper Large-v3 for ASR and pyannote.audio 3.1 for diarization (11–19% DER on AMI/DIHARD III). It is not an end-to-end measurement of VexaScribe specifically. We label it "inherited" rather than "tested" until we publish an independent benchmark of our integrated pipeline. Tools with vendor-published or third-party-tested DER carry firmer numbers.

Decision Matrix

Match your specific need to a recommended tool. This is the same table you'd build internally when scoping a diarization vendor evaluation.

| If you need... | Use this | Why |
| --- | --- | --- |
| Lowest cost with auto-diarization | VexaScribe ($2/mo) | Diarization on every plan; no tier-gating |
| Highest published DER benchmark | Fireflies.ai | 7.2% DER on standardized 2-4 speaker tests |
| Best API for engineering teams | AssemblyAI ($0.17/hr) | Cheapest API with diarization + lowest speaker-count error |
| Real-time diarization (live meetings) | Otter.ai | Live captioning with speaker labels during the call |
| Open-source / self-hosted / on-prem | pyannote.audio 3.1 | Free, runs locally, gold-standard model — requires GPU + Python |
| Perfect speaker separation (zero errors) | Riverside | Records each remote speaker on a separate track; per-track ≠ diarization |
| 5+ speakers, scheduled turns | Fireflies (up to 50) or AssemblyAI (up to 30) | Higher speaker count thresholds with usable accuracy |
| Heavy crosstalk / overlap | Rev Human ($1.50–$1.99/min) | No AI tool reliably solves overlap; humans handle it |
| HIPAA / SOC 2 / compliance | AWS Transcribe or Google Cloud | BAAs available, enterprise contracts, audit trails |
| Editing podcasts/videos in same tool | Descript or Riverside | Diarization integrated into the editing UI |

Consumer Apps

Eight UI-first tools with web dashboards, file upload, and managed diarization. Ready to use in minutes — no SDK integration required.

Fireflies.ai

Best DER benchmark

92.8% accuracy on 2-4 speakers — highest published benchmark

Price: $10–$39/mo · DER: 7.2% · Languages: 69 · Max speakers: 50

Fireflies has the highest published DER score in the consumer app category at 7.2% (92.8% accuracy). The bot auto-joins Zoom, Google Meet, and Microsoft Teams via your calendar, then produces a transcript with up to 50 speaker labels — the highest advertised cap of any tool in this comparison. Strong CRM integrations (Salesforce, HubSpot) make it the de facto choice for sales teams.

Strengths

  • Highest DER benchmark in consumer category (7.2%)
  • Up to 50 speakers (highest cap)
  • Native CRM integrations (Salesforce, HubSpot)
  • 69 languages supported
  • AskFred AI query across meeting history

Limitations

  • Bot always visible to participants
  • Pro tier required for CRM ($19/seat)
  • AI credit system can run out unexpectedly
  • Free tier limited to 800 min storage
Choose if: You need the highest published diarization accuracy in a consumer app, or you're a sales team that needs CRM sync alongside diarization.

Pricing source: fireflies.ai/pricing (verified May 2026)

Otter.ai

Best free tier with real-time speaker labels

Price: Free–$30/mo · DER: 10.7% · Languages: English+ · Max speakers: 10

Otter is the only consumer app with truly polished real-time diarization — speaker labels appear during the meeting, not after. Voice profiles let you train Otter to recognize recurring speakers across meetings (a form of speaker identification, not just diarization). The 300 min/month free tier is the most generous of any tool in this comparison that includes diarization.

Strengths

  • Real-time speaker labels during meetings (best-in-class)
  • Voice profiles for recurring participants
  • 300 min/mo free tier with diarization
  • OtterPilot bot for Zoom/Meet/Teams

Limitations

  • English-primary (limited Spanish, French support)
  • 10.7% DER lags behind Fireflies and Notta
  • August 2025 class-action lawsuit on recording consent + training data
  • 30-min cap per recording on free tier
Choose if: You want live speaker labels visible during a meeting, or you need recurring speaker recognition via voice profiles.

Pricing source: otter.ai/pricing (verified May 2026)

Notta

Best DER for Asian languages

Price: $8.25–$27.99/mo · DER: 8.5% · Languages: 104 · Max speakers: 10

Notta scores second-best in published consumer DER benchmarks at 8.5% and supports 104 languages — the broadest coverage in this comparison. Particularly strong on Japanese, Korean, and Chinese where its language-specific tuning outperforms Whisper-based tools. Mobile-first design with on-device recording for iOS and Android.

Strengths

  • 104 languages (broadest coverage)
  • Strong on Japanese, Korean, Chinese
  • 8.5% DER — second-best in consumer category
  • Mobile recording app

Limitations

  • 10-speaker cap (lower than Fireflies)
  • Terms allow data use for model training (check ToS)
  • Free tier limited to 120 min/mo with 3-min recording cap
  • Less polished real-time experience vs Otter
Choose if: Your speakers are primarily Japanese/Korean/Chinese, or you need broad language coverage with above-average DER.

Pricing source: www.notta.ai/en/pricing (verified May 2026)

Descript

Best when you also need to edit audio/video

Price: $16–$50/mo · DER: not published · Languages: 23 · Max speakers: Per-track + auto

Descript doesn't publish a DER benchmark, but its diarization is tightly integrated into its text-based audio/video editor. The killer feature: per-track speaker separation on Creator and higher plans — each speaker recorded on a separate channel maps 1:1 to a labeled speaker in the transcript. For podcasters who multitrack their guests, this is closer to 100% accuracy than any pure-diarization AI tool.

Strengths

  • Per-track diarization on Creator+ ($24/mo)
  • Editing UI shows transcript next to waveform
  • Studio Sound removes background noise (improves diarization quality)
  • Filler-word removal in same tool

Limitations

  • No published DER benchmark
  • Only 23 languages
  • Pricier than dedicated transcription tools
  • Hobbyist plan ($16/mo) excludes per-track separation
Choose if: You're a podcaster or video creator who edits in the same tool as you transcribe, especially if you multitrack guests.

Pricing source: www.descript.com/pricing (verified May 2026)

Riverside

Best for perfect separation

100% accuracy via per-track recording — eliminates the problem

Price: $15–$79/mo · DER: 0% (n/a) · Languages: 100+ · Max speakers: 10 (per-track)

Riverside doesn't "do" diarization in the traditional AI sense — it sidesteps the problem entirely by recording each remote participant locally on their own device, then uploading separate tracks. The transcript that comes out is already perfectly attributed because each track corresponds to one person. For remote interviews and podcasts, this is closer to 100% accuracy than any AI-based tool can offer.

Strengths

  • 100% speaker attribution accuracy (per-track recording)
  • Eliminates the overlap/crosstalk failure mode
  • Lossless local recording (not network-dependent)
  • Built-in editor + AI clip suggestions

Limitations

  • Only works for remote-recorded sessions (not uploads)
  • Requires each speaker on Riverside (not Zoom/Meet)
  • $15/mo Standard plan caps at 2 hours/session
  • Slower workflow if you're using Zoom by default
Choose if: You're recording remote podcasts or interviews where you control the session setup. Not useful for transcribing existing recordings.

Pricing source: riverside.fm/pricing (verified May 2026)

Sonix

Pay-as-you-go diarization with custom vocabulary

Price: $10/hr PAYG, $22/mo subscription · DER: not published · Languages: 49+ · Max speakers: Tested up to 20

Sonix offers diarization on the PAYG tier ($10/hour) with no monthly commitment; among the tools in this comparison, only Rev's AI tier similarly offers diarization without a subscription floor. Custom vocabulary lets you boost recognition of proper nouns, jargon, and technical terms. 49+ languages with strong accuracy on the major European set.

Strengths

  • PAYG pricing ($10/hr) — no subscription required
  • Custom vocabulary for jargon and proper nouns
  • 49+ languages with diarization
  • Multi-track support for podcasts

Limitations

  • No published DER benchmark
  • $10/hr gets expensive above ~5 hr/mo
  • Subscription tier ($22/mo) not competitive vs VexaScribe/Otter on pricing
  • No real-time mode
Choose if: You transcribe under 5 hours/month and don't want a subscription, or you need custom vocabulary for specialized content.

Pricing source: sonix.ai/pricing (verified May 2026)

Rev

Human transcription for near-perfect attribution

Price: $0.25/min AI, $1.50–$1.99/min human · DER: not published · Languages: 36+ AI, 15+ human · Max speakers: 30 (AI), unlimited (human)

Rev's AI tier ($0.25/min) offers diarization at competitive accuracy, but its real value in this comparison is the human tier ($1.50–$1.99/min) — trained human transcriptionists deliver speaker attribution that closes the gap on overlap, crosstalk, and noisy audio that defeats every AI tool. Use Rev Human when accuracy is non-negotiable (legal depositions, peer-reviewed research) and budget allows.

Strengths

  • Human tier delivers near-perfect speaker attribution
  • AI tier ($0.25/min) PAYG with no subscription
  • NDA available for confidential content
  • Handles overlap better than any pure-AI tool

Limitations

  • Human tier turnaround 12-24 hours minimum
  • $1.50–$1.99/min is roughly 150–200× VexaScribe's effective per-minute rate ($2/mo for 200 minutes ≈ $0.01/min)
  • AI tier DER not independently published
  • Not suitable for high-volume workflows
Choose if: Accuracy is critical and the audio has overlap/crosstalk that AI can't handle, and your budget supports $90+/hour.

Pricing source: www.rev.com/pricing (verified May 2026)

VexaScribe

Cheapest with diarization on every plan

Auto-diarization on every paid tier — no tier-gating, from $2/mo

Price: $2–$20/mo + free 30 min · DER: ~12–15% (inherited) · Languages: 99 · Max speakers: Tested up to 10

VexaScribe (formerly NovaScribe) is the cheapest tool in this comparison that includes auto-diarization on every paid plan — including the $2/mo Starter (200 minutes). Most competitors gate diarization behind their Pro tier ($10-20/mo); VexaScribe ships it on the entry tier. Built on Whisper Large-v3 + pyannote.audio 3.1, with 99 languages supported. Inherited DER of ~12–15% based on the upstream model's published benchmarks on AMI Meeting Corpus and DIHARD III — we haven't run our own end-to-end benchmark yet, so this number reflects the underlying model rather than an independent measurement.

Strengths

  • Cheapest entry tier with diarization ($2/mo Starter)
  • Diarization not gated to higher plans — on every tier
  • 99 languages (Whisper Large-v3)
  • Free 30-min trial without billing setup
  • Bulk upload up to 50 files in parallel

Limitations

  • DER is inherited from upstream pyannote 3.1, not independently benchmarked end-to-end
  • Not optimized for 10+ speakers (use AssemblyAI/Fireflies for those)
  • No real-time mode (batch upload)
  • No CRM integration
Choose if: You're cost-sensitive, have 2-4 speakers in your audio, and want diarization included at the lowest entry-tier price.

Pricing source: /pricing (verified May 2026)

Developer APIs

Five APIs for engineering teams shipping diarization into their own products. Cheaper per-hour than consumer apps; require SDK integration. See our best transcription APIs comparison for a broader STT-focused review.

AssemblyAI

Best diarization API

$0.17/hr with 2.9% speaker-count error rate

Price: $0.17/hr PAYG · DER: ~8.7% · Languages: 99 · Max speakers: 30

AssemblyAI's Universal-2 model includes diarization at $0.17/hr — the cheapest commercial API in this comparison with diarization built in. The 2.9% speaker-count error rate is the lowest published in this category, meaning the model rarely under- or over-counts speakers. Supports up to 30 speakers and 99 languages. The standard choice for engineering teams shipping diarization into their own product.
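A minimal request sketch, assuming AssemblyAI's Python SDK (pip install assemblyai); speaker_labels=True is the SDK's documented switch for diarization. The file URL and the ASSEMBLYAI_API_KEY environment-variable guard are placeholders, and format_utterances is our own helper:

```python
import os

def format_utterances(utterances):
    """Render (speaker, text) pairs into labeled transcript lines."""
    return [f"Speaker {spk}: {text}" for spk, text in utterances]

if os.environ.get("ASSEMBLYAI_API_KEY"):  # only runs with a real key
    import assemblyai as aai  # pip install assemblyai
    aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
    config = aai.TranscriptionConfig(speaker_labels=True)  # enable diarization
    transcript = aai.Transcriber().transcribe(
        "https://example.com/meeting.mp3", config  # placeholder audio URL
    )
    pairs = [(u.speaker, u.text) for u in transcript.utterances]
    print("\n".join(format_utterances(pairs)))
```

The SDK returns utterances with generic speaker labels ("A", "B", ...), matching the diarization-not-identification distinction discussed earlier.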

Strengths

  • $0.17/hr — cheapest commercial API with diarization
  • 2.9% speaker-count error rate (industry-leading)
  • Up to 30 speakers, 99 languages
  • Excellent docs + SDKs (Python, Node, Go)
  • LeMUR LLM layer for downstream tasks

Limitations

  • No real-time diarization (batch only)
  • Custom vocabulary requires Streaming or special config
  • Pricing tier optimized for sub-1000 hr/mo — enterprise rates require sales
  • DER varies in real-world audio vs benchmark
Choose if: You're an engineer shipping diarization into your own product at low-to-mid volume (<1000 hr/mo).

Pricing source: www.assemblyai.com/pricing (verified May 2026)

Deepgram

Language-agnostic API with unlimited speakers

Price: $0.58/hr (Nova-3) · DER: ~11% · Languages: 45+ · Max speakers: Unlimited

Deepgram's Nova-3 model offers diarization with no hard speaker cap and 45+ languages. The API is fast (~10× real-time) and supports both batch and streaming. More expensive per-hour than AssemblyAI but with strong real-time performance and lower latency for live applications.

Strengths

  • No hard speaker cap (unlimited)
  • Streaming diarization for real-time apps
  • 45+ languages with consistent diarization quality
  • Fast batch processing (~10× real-time)

Limitations

  • $0.58/hr is 3.4× more expensive than AssemblyAI
  • DER (~11%) trails AssemblyAI
  • Smaller language count than AssemblyAI (45 vs 99)
  • Enterprise pricing required for high-volume discounts
Choose if: You need real-time / streaming diarization, unlimited speakers, or extra-low latency for live applications.

Pricing source: deepgram.com/pricing (verified May 2026)

OpenAI gpt-4o-transcribe

Newest API with built-in speaker labels

Price: $0.36/hr · DER: not published · Languages: 100+ · Max speakers: Unspecified

OpenAI's gpt-4o-transcribe (released 2025) includes diarization as part of its transcription output. Pricing at $0.36/hr sits between AssemblyAI and Deepgram. Notable advantage: the same provider for downstream LLM tasks (GPT-4o, GPT-5), so you can chain transcription → summarization → analysis in a single API ecosystem.

Strengths

  • Single-vendor for transcription + LLM workflow
  • 100+ languages (Whisper-derived)
  • Tight integration with OpenAI's broader API
  • Newer model with active development

Limitations

  • No published DER benchmark
  • Less mature than AssemblyAI/Deepgram for speaker-specific use cases
  • Speaker cap not officially documented
  • Pricing premium vs AssemblyAI ($0.36 vs $0.17)
Choose if: You're already using OpenAI's API for LLM work and want diarization in the same ecosystem.

Pricing source: platform.openai.com/docs/pricing (verified May 2026)

AWS Transcribe

Enterprise diarization with HIPAA BAA

Price: $1.74–$2.04/hr · DER: ~13% · Languages: 30+ · Max speakers: 30

AWS Transcribe is the standard choice for enterprises with compliance requirements. HIPAA BAA available, SOC 2 Type II certified, deployable in your AWS region of choice. Diarization supports up to 30 speakers. More expensive per-hour than dev-focused APIs, but the compliance posture and AWS-native integration justify the premium for regulated industries.
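A hedged boto3 sketch of enabling diarization (the job name, bucket path, and AWS_PROFILE guard are placeholders; ShowSpeakerLabels and MaxSpeakerLabels are the documented StartTranscriptionJob settings, with MaxSpeakerLabels accepting 2-30):

```python
import os

def diarization_settings(max_speakers):
    """Settings block for StartTranscriptionJob; MaxSpeakerLabels takes 2-30."""
    return {"ShowSpeakerLabels": True, "MaxSpeakerLabels": max_speakers}

if os.environ.get("AWS_PROFILE"):  # only runs with real AWS credentials
    import boto3  # pip install boto3
    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName="meeting-001",                      # placeholder
        Media={"MediaFileUri": "s3://your-bucket/meeting.wav"},  # placeholder
        MediaFormat="wav",
        LanguageCode="en-US",
        Settings=diarization_settings(6),
    )
```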

Strengths

  • HIPAA BAA available
  • SOC 2 Type II + ISO 27001
  • Deployable in regional AWS endpoints (data residency)
  • Tight integration with S3, Lambda, Step Functions

Limitations

  • $1.74–$2.04/hr — 10× more expensive than AssemblyAI
  • DER (~13%) lags pure dev APIs
  • Console UX targets enterprise architects, not solo devs
  • Speaker count cap must be specified in advance (MaxSpeakerLabels, 2–30, alongside ShowSpeakerLabels)
Choose if: You're an AWS-native enterprise with HIPAA / compliance requirements and integration into existing AWS pipelines.

Pricing source: aws.amazon.com/transcribe/pricing (verified May 2026)

Google Cloud Speech-to-Text

125+ languages with diarization, GCP-native

Price: $1.44–$2.16/hr · DER: not published · Languages: 125+ · Max speakers: Unspecified (tested up to 20)

Google Cloud Speech-to-Text supports diarization across 125+ languages — the broadest language coverage of any API in this comparison. HIPAA-compliant deployments available with a BAA. The standard choice for GCP-native enterprises or organizations needing diarization in languages beyond AssemblyAI/Deepgram's coverage.

Strengths

  • 125+ languages (broadest coverage)
  • HIPAA BAA available
  • GCP-native integration (Storage, Pub/Sub, Vertex AI)
  • Chirp model offers improved diarization quality

Limitations

  • No published DER benchmark
  • $1.44–$2.16/hr is enterprise-tier pricing
  • Speaker labeling quality varies by language
  • Documentation can be denser than AssemblyAI/Deepgram
Choose if: You're GCP-native, need diarization in a long-tail language, or have HIPAA requirements within Google Cloud.

Pricing source: cloud.google.com/speech-to-text/pricing (verified May 2026)

Open-Source

One open-source diarization model worth your attention. Free, gold-standard, and what many commercial APIs are built on or benchmarked against.

pyannote.audio 3.1

Open-source gold standard

Free, self-hosted, gold-standard diarization model

Price: Free (GPU recommended) · DER: 11–19% · Languages: Language-agnostic · Max speakers: Unlimited

pyannote.audio is the open-source diarization model that many commercial tools either fine-tune from or benchmark against. Version 3.1 scores 11-19% DER on AMI Meeting Corpus and DIHARD III — competitive with paid APIs. Runs locally on a single GPU (T4 or A10 is enough) with the model weights freely available on Hugging Face. Requires Python, PyTorch, and willingness to integrate with your own ASR system (Whisper is the common pairing).
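The standard quickstart from the pyannote.audio documentation, assuming a Hugging Face access token and acceptance of the gated-model terms (the HF_TOKEN guard, the audio filename, and the format_turn helper are our own additions):

```python
import os

def format_turn(start, end, speaker):
    """One transcript line per diarized turn."""
    return f"{start:.1f}s -> {end:.1f}s {speaker}"

if os.environ.get("HF_TOKEN"):  # requires a Hugging Face token + model access
    from pyannote.audio import Pipeline  # pip install pyannote.audio
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=os.environ["HF_TOKEN"],
    )
    diarization = pipeline("meeting.wav")  # path to your audio file
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(format_turn(turn.start, turn.end, speaker))
```

Pairing this output with Whisper's timestamped segments (the common combination mentioned above) comes down to aligning each ASR segment with the diarization turn it overlaps.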

Strengths

  • Free model weights on Hugging Face
  • Competitive with commercial APIs on DER (11-19%)
  • Language-agnostic (works on any audio)
  • Full on-prem / air-gapped deployment possible
  • Active research community, frequent updates

Limitations

  • Requires GPU (T4 / A10 / consumer RTX cards)
  • Not turnkey — you integrate it with your ASR pipeline
  • No support contract; community-driven
  • Python + PyTorch expertise required
Choose if: You're a researcher, have on-prem / air-gapped requirements, or process enough volume that commercial API pricing exceeds the cost of GPU hosting.

Repository: github.com/pyannote/pyannote-audio (verified May 2026)

Overlapping Speech: The Unsolved Problem

Vendor-published DER scores measure cooperative 2-4 speaker audio with clear turn-taking. Real meetings have interruptions, backchannel ("mhm," "right"), and occasional crosstalk. Here's what DER actually looks like in the wild.

| Audio condition | Typical DER | Notes |
| --- | --- | --- |
| Cooperative 2-speaker dialogue | 5-8% | Best-case AI scenario |
| 2-4 speakers, clean turn-taking | 7-12% | What most consumer benchmarks measure |
| 5-10 speakers, scheduled turns | 12-18% | Accuracy degrades ~1-2 points per added speaker |
| Light overlap (interruptions, backchannel) | 15-25% | Real meetings live here |
| Heavy overlap (debate, crosstalk) | 30%+ | DIHARD III shows this is the failure mode |
| Per-track recording (Riverside) | 0-1% | Not technically 'diarization' — speakers are pre-separated |

The DIHARD III challenge — the hardest standardized diarization benchmark — shows DER rising above 30% on overlap-heavy audio (debates, restaurants, child speech) even for state-of-the-art models. This is not a tuning problem. It's a fundamental limitation: when two voices overlap in the same time-frequency region, no current model reliably separates them in the embedding space.

Three workarounds actually work:

  • Per-track recording — Riverside records each remote participant on a separate audio track from the start. Each track becomes one speaker. No clustering needed; no errors possible.
  • Human transcription — Rev's human transcriptionists handle overlap, backchannel, and crosstalk by listening, not by clustering embeddings. At $1.50-$1.99/min, expensive but reliable.
  • Multi-mic setups — physical separation at recording time. Each speaker on their own lapel mic with directional pickup.

We don't expect dramatic breakthroughs in overlap handling for 2026. The improvements coming are at the margins (5-7% better DER on clean speech, faster real-time streaming, more languages). The 30%+ overlap DER ceiling is structural to how diarization works today.

When to Use Each Approach

Match your audio scenario to the right tool category. Most diarization failures come from using a consumer app on audio it wasn't designed for, not from picking the wrong consumer app.

| Scenario | Recommendation | Reasoning |
| --- | --- | --- |
| Single speaker (monologue, dictation, voice memo) | Skip diarization entirely | Saves cost; eliminates risk of false speaker splits on background noise |
| 2-4 cooperative speakers (interview, podcast, 1:1) | Any consumer app | Fireflies, VexaScribe, Otter, Notta all deliver 90%+ accuracy on this segment |
| 5-10 speakers, scheduled turns (panel, webinar) | AssemblyAI API or Fireflies | Higher speaker thresholds; accept 12-18% DER |
| Heavy overlap / crosstalk / debate | Per-track recording (Riverside) OR Rev Human | AI diarization fails here; multi-track or human is the only fix |
| Compliance / sensitive content (medical, legal, HR) | AWS Transcribe or Google Cloud | BAA, SOC 2 Type II, enterprise contracts available |
| Research / custom ML pipeline | pyannote.audio 3.1 + your own ASR | Full control, free, open weights, reproducible |
| Real-time captions during a meeting | Otter.ai | Live diarization with on-screen speaker labels mid-call |
| Editing video/podcast with speaker view | Descript or Riverside | Diarization integrated into the editor — click a speaker, edit their lines |

Frequently Asked Questions

What is speaker diarization?

Speaker diarization is AI technology that identifies “who spoke when” in audio with multiple speakers. It labels each segment as Speaker 1, Speaker 2, etc. Unlike plain transcription, diarization attributes each segment to a distinct speaker, though overlapping speech remains its weakest point. Most modern tools combine automatic speech recognition (ASR) and diarization in a single pipeline.

How accurate is speaker diarization in 2026?

90–97% with 2–4 cooperative speakers on clean audio. Accuracy degrades to 85–90% with 5–8 speakers and 80–85% with 9–15. Fireflies leads consumer-app benchmarks at 92.8% accuracy (7.2% DER). AssemblyAI is the strongest API at ~8.7% DER with a 2.9% speaker-count error rate. Overlapping speech is the single biggest failure mode for every system, AI or human.

What is DER (Diarization Error Rate)?

DER is the standard accuracy metric for speaker diarization. It sums false alarms, missed speech, and speaker confusion, divided by total speech duration, expressed as a percentage. Lower is better. Below 10% is excellent, 10–15% is good, 15–25% is mediocre, and above 25% is poor. DER is measured on standardized datasets like AMI Meeting Corpus, CallHome, and DIHARD III. Real-world DER on your audio may differ from published benchmarks.

What’s the difference between diarization and speaker identification?

Diarization assigns generic labels (Speaker 1, Speaker 2) without knowing who the speakers are — it’s unsupervised. Identification matches voices to specific known people and requires prior voice enrollment. Otter.ai supports identification via voice profiles for recurring meeting participants. OpenAI’s API offers identification through speaker embeddings on developer plans.

How many speakers can diarization handle?

Most consumer tools handle 2–10 speakers reliably. Fireflies advertises up to 50. AssemblyAI supports up to 30. AWS Transcribe caps at 30. Open-source pyannote.audio has no hard limit but accuracy degrades. As a rule: accuracy decreases roughly 1–2 percentage points per additional speaker beyond 4.

Does diarization work with overlapping speech?

Poorly. Overlapping speech is the #1 failure mode for all diarization systems — AI and human. Even state-of-the-art models on DIHARD III hit 30%+ DER when audio contains heavy crosstalk. The only reliable workarounds are: (1) record each speaker on a separate audio track (Riverside does this automatically), or (2) use Rev’s human transcription service for near-perfect attribution.
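
One way to gauge how much crosstalk your own audio contains is to measure pairwise overlap between speaker turns. A minimal sketch, with hypothetical turn lists:

```python
def overlap_seconds(turns_a, turns_b):
    """Total time (in seconds) where both speakers talk at once, given
    each speaker's list of (start, end) turns. Quadratic in the number
    of turns, which is fine for typical meeting-length files."""
    total = 0.0
    for a_start, a_end in turns_a:
        for b_start, b_end in turns_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total

# Speaker B jumps in 2 s before speaker A finishes:
print(overlap_seconds([(0.0, 5.0)], [(3.0, 8.0)]))  # 2.0
```

Files where this number is a large fraction of total speech are the ones worth routing to multitrack recording or human transcription rather than a pure-AI pipeline.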

Which is the cheapest tool with speaker diarization?

VexaScribe (formerly NovaScribe) at $2/month includes auto-diarization on every plan — no tier-gating. For free options, Otter’s 300 min/month free tier includes diarization (English only). For developer APIs, AssemblyAI at $0.17/hr is the cheapest with diarization. For free-forever use, self-hosted pyannote.audio 3.1 runs locally at zero cost (GPU recommended).

Can I get 100% accurate speaker separation?

Yes — record each speaker on a separate audio track from the start. Riverside automatically records each remote participant locally, eliminating speaker confusion entirely. Alternative: Rev’s human transcription service at $1.50–$1.99/min produces near-perfect speaker attribution by trained transcriptionists. No pure-AI tool achieves 100% on multi-track-equivalent audio.
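
The multitrack approach works because attribution becomes trivial once each voice lives on its own track. A sketch of the merge step, assuming per-speaker segments with timestamps (the data below is invented):

```python
def merge_tracks(tracks):
    """Flatten per-speaker tracks into one chronological, speaker-labeled
    transcript. Each track holds exactly one voice, so attribution is a
    timestamp sort rather than a diarization model."""
    merged = [(start, end, speaker, text)
              for speaker, segments in tracks.items()
              for start, end, text in segments]
    return sorted(merged)

tracks = {
    "host":  [(0.0, 4.0, "Welcome back."), (9.0, 12.0, "Great point.")],
    "guest": [(4.5, 8.5, "Thanks for having me.")],
}
for start, end, speaker, text in merge_tracks(tracks):
    print(f"{start:>5.1f}s {speaker}: {text}")
```

Even when two tracks overlap in time, each word still carries its speaker's label, which is why multitrack recording sidesteps the crosstalk problem entirely.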

Should I use a consumer app or a developer API for diarization?

Use a consumer app (Fireflies, Otter, VexaScribe) if you want a UI, file upload, web dashboard, and no integration work — ready to use in minutes. Use a developer API (AssemblyAI, Deepgram, OpenAI) if you’re building diarization into your own product or processing thousands of files programmatically. APIs are typically 30–60% cheaper per hour but require engineering time. The break-even is roughly 500 hours/month of usage.
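
The break-even figure follows from simple arithmetic once you model the API's engineering cost as a flat monthly overhead. Of the three numbers below, only the $0.17/hr API rate comes from this article; the $0.57/hr effective consumer rate and the $200/month amortized integration cost are hypothetical values chosen to illustrate the ~500 hr/month break-even:

```python
def api_monthly_cost(hours, rate=0.17, integration_overhead=200.0):
    # Metered usage plus an assumed amortized engineering cost (hypothetical $200/mo).
    return hours * rate + integration_overhead

def consumer_monthly_cost(hours, effective_rate=0.57):
    # Consumer app modeled as a flat effective per-hour rate (hypothetical).
    return hours * effective_rate

# Break-even: overhead / (consumer rate - API rate) = 200 / 0.40 = 500 hours/month.
print(round(200.0 / (0.57 - 0.17)))  # 500
```

Below that volume the consumer app's zero integration cost wins; above it, the API's lower per-hour rate does. Plug in your own rates and overhead estimate before deciding.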

Is open-source pyannote.audio as good as paid APIs?

Yes, on accuracy — pyannote.audio 3.1 is competitive with commercial APIs and is what many of them are built on or compared against. DER ranges from 11–19% on standard benchmarks. The catch is operational: you need a GPU (an A10 or T4 works), Python/PyTorch expertise, and time to integrate with your ASR system. For researchers, on-prem requirements, or high-volume pipelines, pyannote wins on cost. For most developers shipping a product, AssemblyAI or Deepgram are faster to integrate.

Methodology & Disclosure

Verification window. All pricing, language counts, max-speaker limits, and feature claims were verified against vendor pricing and documentation pages between May 8 and May 13, 2026. Where a DER is "vendor-published" we link the source; where it's "third-party benchmark" we cite the dataset.

DER sources. Fireflies, Notta — vendor marketing pages. AssemblyAI — vendor docs referencing AMI / CallHome. Deepgram — vendor docs. Rev AI — vendor docs. AWS Transcribe — independent third-party benchmark. pyannote.audio — peer-reviewed papers (Bredin et al., AMI/DIHARD III). VexaScribe — inherited DER (~12–15%) from upstream Whisper Large-v3 + pyannote 3.1 benchmarks; we have not yet run an independent end-to-end benchmark of our integrated pipeline and clearly label this number as inherited rather than tested.

Conflict of interest. This guide is published by VexaScribe. VexaScribe is listed where pricing honestly places it (cheapest consumer app with diarization on every plan) — not crowned "best accuracy." The DER number we report for VexaScribe (~12–15%) is explicitly labeled inherited from the upstream Whisper Large-v3 + pyannote 3.1 benchmarks, not an independent end-to-end test of our pipeline. No affiliate relationships with any vendor listed; received no compensation for inclusion, ranking, or placement. Outbound vendor links use rel="noopener" only (not nofollow). Editorial standards: see our editorial standards.

What changed since last update? First publication, May 13, 2026. Future updates will be reflected in the "Verified" badge and datePublished/dateModified schema fields.