Best Transcription API for Developers (May 2026)
By VexaScribe Editorial · Published May 15, 2026 · Verified May 2026
The cheapest production-grade transcription API in May 2026 is OpenAI's gpt-4o-mini-transcribe at $0.003/minute ($0.18/hour); the most expensive of the major managed services is Amazon Transcribe Medical at $0.075/minute ($4.50/hour) — a 25× spread for what looks, on paper, like the same product. Between those poles sit Deepgram Nova-3 ($0.0043 batch, $0.0077 streaming), AssemblyAI Universal-2 ($0.0025/min, 99 languages), ElevenLabs Scribe v2 (March 2026 launch, $0.22/hr batch), Gladia Solaria-1 (100 languages, sub-103 ms partial latency), Google Cloud Speech-to-Text V2 with Chirp 2 ($0.016/min), AWS Transcribe ($0.024/min, tiered down to $0.0078 above 5M minutes), Azure AI Speech ($0.017/min real-time), Speechmatics ($0.80/hour batch), Rev AI Reverb ($0.003/min), and Soniox (~$0.10/hour async). Free tiers range from $0 (Speechmatics: 480 min/month) to $200 in credits (Deepgram). Maximum file size on the OpenAI endpoint stays at 25 MB — a hard limit you'll hit before your second podcast episode. This guide compares all twelve on price, real-time latency, batch throughput, language breadth, accuracy on the Open ASR Leaderboard, SDK quality, and HIPAA/SOC 2 posture.
TL;DR — Winners by Use Case
If you only read one section, read this. Twelve APIs, eight developer scenarios, one recommendation each — with a runner-up so you have a fallback.
| Use case | Best pick | Why | Runner-up |
|---|---|---|---|
| Cheapest pay-as-you-go for batch jobs | OpenAI gpt-4o-mini-transcribe | $0.003/min, no minimum, OpenAI SDK already in stack | AssemblyAI Universal-2 ($0.0025/min) |
| Lowest streaming latency | ElevenLabs Scribe v2 Realtime | 150 ms first-word on FLEURS benchmark | Gladia Solaria-1 (sub-103 ms partial) |
| Most languages supported | Google Cloud STT V2 (Chirp 2) | 125+ languages, broadest coverage | Gladia Solaria-1 (100, includes long-tail) |
| HIPAA-compliant production workload | AWS Transcribe Medical | BAA signed, US infrastructure, IAM-native | Azure AI Speech (BAA + Microsoft 365 alignment) |
| Best free tier for evaluation | Deepgram ($200 credit, no card) | ~46,500 batch minutes free at $0.0043/min | AssemblyAI ($50 credit, ~333 hours) |
| Diarization quality | ElevenLabs Scribe v2 | 98% speaker label accuracy (highest published) | AssemblyAI Universal-2 |
| Self-host instead of API | OpenAI Whisper Large-v3 (open weights) | MIT license, mature ecosystem, ~6.4% Open ASR Leaderboard WER | NVIDIA Parakeet (fastest open-source on Open ASR Leaderboard) |
| Real-world general use | AssemblyAI Universal-2 | $0.0025/min, 99 languages, full feature set bundled | Deepgram Nova-3 (better latency, fewer languages) |
How We Evaluated
We benchmarked each provider on six axes that matter to engineers building on top of a transcription API, not the marketing-page axes vendors prefer.
The six axes:
- All-in price per minute at four scale tiers (10, 100, 1,000, 10,000 hours per month)
- First-word latency for streaming use cases
- WER on the Hugging Face Open ASR Leaderboard English short-form track plus the multilingual FLEURS track
- Language coverage with native-speaker validation cited from vendor blogs
- Compliance posture verified against vendor trust pages
- SDK ergonomics — lines of code from `npm install` to first transcript
Audio classes tested: a 90-second English mono call-center clip (clear), a 4-hour podcast at 44.1 kHz stereo (long-form), a 6-minute Spanish/English code-switched conversation (multilingual), a 30-second clip with 12 dB SNR pink noise (degraded).
What we ignored: aggregate "accuracy %" claims without dataset disclosure, marketing claims of "best in class," and vendor-paid third-party reports.
How we priced: pay-as-you-go list price for the smallest available billing increment, no negotiated discount, USD, May 2026 snapshot.
The Provider Landscape in 2026
Four things changed the speech-to-text market between 2023 and 2026.
Whisper went open-weights, then OpenAI moved upstream.
OpenAI released Whisper under MIT in 2022. By 2024 the lift was over: every cloud has a hosted Whisper. OpenAI responded in March 2025 by shipping gpt-4o-transcribe and gpt-4o-mini-transcribe, both built on the GPT-4o family and priced at $0.006 and $0.003 per minute respectively. whisper-1 remains in the API at $0.006/min but is now legacy.
Real-time stopped being a premium tier.
ElevenLabs Scribe v2 Realtime hit 150 ms latency on FLEURS in March 2026. AssemblyAI shipped Universal-Streaming-Multilingual covering English, Spanish, French, German, Italian, and Portuguese. Gladia's Solaria-1 publishes partial transcripts in under 103 ms. Streaming used to cost 1.5–3× batch — now it's ~1.8× and shrinking.
The Open ASR Leaderboard became the default citation.
The Hugging Face leaderboard tracks dozens of models using a standardized eval harness across ESPNet, NeMo, SpeechBrain, and Transformers. Vendor-reported WER is no longer credible without a leaderboard cross-reference.
Hallucination is a known, measurable failure mode.
Koenecke et al. (FAccT 2024) found ~1% of Whisper outputs contained entirely fabricated phrases, and 38% of those hallucinations carried explicit harms (violence, fabricated authority, false medical claims). Production systems on top of any of these APIs need silence-detection guards.
At-a-Glance Comparison of 12 Transcription APIs
Pricing is pay-as-you-go USD list rate, smallest billing increment. Verified May 2026. Alphabetical to avoid ranking implications.
| Provider | Model | Price/min | Free tier | Languages | RT | Batch | Max file |
|---|---|---|---|---|---|---|---|
| AssemblyAI | Universal-2 | $0.0025/min | $50 credit (~333 hrs) | 99 | Yes | Yes | 5 GB / 10 hrs |
| AWS Transcribe | Standard | $0.024/min → $0.0078 at scale | 60 min/mo, 12 mo | 31+ | Yes | Yes | 2 GB / 4 hrs |
| Azure AI Speech | Neural 2025 | $0.017 RT / $0.003 batch | 5 hrs/mo (F0) | 100+ | Yes | Yes (Fast/Batch) | 1 GB / 4 hrs |
| Deepgram | Nova-3 | $0.0043 batch / $0.0077 stream | $200 credit, no card | 36+ | Yes | Yes | 2 GB / 10 hrs |
| ElevenLabs | Scribe v2 | $0.0037 batch / $0.0065 stream | 10 hrs/mo (Free) | 99 / 57 RT | Yes | Yes | 3 GB / 4.5 hrs |
| Gladia | Solaria-1 | $0.0102 / $0.0033 (Growth) | 10 hrs/mo | 100 | Yes (sub-103ms) | Yes | 5 GB / 8 hrs |
| Google Cloud STT V2 | Chirp 2 | $0.016 / $0.004 (>2M min) | 60 min/mo | 125+ | Yes | Yes | 10 MB sync, 1 GB async |
| OpenAI Whisper API | whisper-1 | $0.006/min | None | 99 | No | Yes | 25 MB |
| OpenAI gpt-4o-transcribe | gpt-4o + mini | $0.006 / $0.003 | None | 50+ optimized | Yes (Realtime API) | Yes | 25 MB (REST) |
| Rev AI | Reverb ASR | $0.003 / $0.005 (Foreign) | 5 hrs free trial | 58+ async / 9 stream | Yes | Yes | 4 GB / 17 hrs |
| Soniox | Soniox v2 | ~$0.0017 async / ~$0.002 stream | $200 credit | 60+ | Yes (code-switching) | Yes | 5 GB |
| Speechmatics | Ursa-2 | $0.013 batch / $0.017 RT | 480 min/mo | 50+ | Yes | Yes | No hard cap |
Per-minute prices for hourly-priced vendors derived as (hourly rate ÷ 60); rounded to 4 decimals. Each vendor's pricing page is linked in the detailed reviews below.
Detailed Reviews of Each API
Each review covers what the API does well, where it falls short, and the developer profile that should pick it. Reviews are alphabetical to avoid ranking implications. We do not recommend a single "winner."
AssemblyAI Universal-2
General-purpose ASR shop founded 2017, NYC-based. Universal-2 is the current production model (released August 2024), with Universal-Streaming for real-time launched in 2025.
Pricing: $0.0025/min ($0.15/hour) — the lowest list rate among English-speaking-market leaders. $50 free credit covers ~333 hours.
Strengths
- 99-language coverage in batch
- LeMUR (LLM layer over transcripts) bundled
- Diarization, PII redaction, summarization, sentiment, content moderation included
Weaknesses
- Streaming language coverage smaller than batch
- Streaming WER on Open ASR Leaderboard not best-in-class
Best for: Teams that want one vendor for batch + real-time + post-processing without juggling SKUs. AssemblyAI docs →
AWS Transcribe
AWS's first-party STT service; available in standard, Medical, and Call Analytics flavors. Tightly coupled to S3, Lambda, and Kinesis.
Pricing: $0.024/min standard (Tier 1, ≤250k min/mo), tiering to $0.015 / $0.0102 / $0.0078 at higher volumes. Medical is a flat $0.075/min with no volume discount.
Strengths
- AWS-native auth (IAM, KMS, VPC endpoints)
- HIPAA-eligible with BAA
- Tightly integrated with Comprehend Medical and Bedrock
Weaknesses
- Highest entry-tier price among major clouds
- Multilingual coverage (~31 languages) trails competitors
Best for: Teams already deep in AWS who want IAM-native STT. AWS Transcribe docs →
Azure AI Speech
Microsoft Cognitive Services Speech, now branded under Azure AI Foundry. Three flavors: real-time, fast transcription, and batch transcription.
Pricing: $1.00/audio hour real-time ($0.017/min); $0.36/hour fast transcription; $0.18/hour batch ($0.003/min). Custom acoustic/language models add ~20%.
Strengths
- 100+ languages
- Pronunciation assessment endpoint (unique)
- Strong custom-model story; SOC 2, HIPAA, ISO 27001
Weaknesses
- Real-time premium is steep at low volume
- Documentation fragmented across namespaces
- SDK feature parity uneven across languages
Best for: Teams in Microsoft 365 ecosystems; education products needing pronunciation scoring. Azure Speech docs →
Deepgram Nova-3
Speech-first AI company, in-house transformer architecture. Nova-3 (February 2025) is the third-generation general model; Nova-3 Medical exists for clinical workflows.
Pricing: $0.0043/min batch, $0.0077/min streaming on Pay-As-You-Go. Growth plan drops streaming to $0.0065/min. $200 free credit, no card required.
Strengths
- Sub-300 ms streaming latency
- Strong out-of-box diarization
- Real-second-level billing
- Custom vocabulary up to 1000 keywords
Weaknesses
- ~7% on Open ASR Leaderboard — solid but not class-leading
- Language count (36+) trails Google/AssemblyAI/Gladia
Best for: Voice-agent and call-center workloads. Deepgram docs →
ElevenLabs Scribe v2
ElevenLabs (best known for TTS) shipped Scribe v1 in 2024 and Scribe v2 in March 2026. The Realtime variant launched in the same release.
Pricing: $0.22/hour ($0.0037/min) batch; $0.39/hour ($0.0065/min) Scribe v2 Realtime.
Strengths
- 99 languages batch / 57 realtime
- 98% speaker-label accuracy
- 150 ms first-word latency on FLEURS
- SOC 2, HIPAA, GDPR, EU residency
Weaknesses
- Newer player — fewer production case studies
- SDK surface narrower than AWS/Google
- Pricing changed twice in 18 months
Best for: Apps needing fast streaming + strong diarization. ElevenLabs Scribe docs →
Gladia Solaria-1
Paris-based vendor, originally a Whisper-as-a-service host, now ships in-house Solaria-1 (2025) alongside Whisper endpoints.
Pricing: $0.0102/min async / $0.0125/min real-time (Starter); drops to $0.0033 / $0.0042 on Growth with volume commit. 10 hrs/mo free.
Strengths
- 100 languages — including long-tail not offered elsewhere
- Sub-103 ms partial transcripts
- Audio-intelligence features bundled
Weaknesses
- SDK quality more polished in Python than Node/Go
- Documentation thinner on edge cases
- Cheaper Growth tier requires upfront commitment
Best for: Multilingual products serving long-tail languages. Gladia docs →
Google Cloud Speech-to-Text V2 (Chirp 2)
V2 is the new control plane; Chirp 2 is Google's 2024 multilingual foundation model trained on millions of hours.
Pricing: $0.016/min standard (V2); high-volume tier drops batch to $0.004/min (>2M min/mo). Billed in 15-second chunks rounded up.
Strengths
- 125+ languages (broadest coverage)
- Strong African and South Asian language performance
- Adaptation API for domain vocabulary
- Tight integration with Vertex AI
Weaknesses
- 15-second billing increment costs you on short clips
- V1 vs V2 vs legacy pricing pages confuse evaluation
- Chirp 2 not available in all regions
Best for: Multilingual products with global users. Google Cloud STT docs →
OpenAI Whisper API (whisper-1)
The original Whisper hosted endpoint, available since March 2023. Now considered legacy; OpenAI recommends gpt-4o-transcribe for new projects.
Pricing: $0.006/min flat. No free tier; requires existing OpenAI billing account.
Strengths
- Open-weights model — same architecture self-hostable
- Mature openai SDK in 9 languages
- 99 languages
Weaknesses
- 25 MB hard file size limit
- No real-time/streaming on this endpoint
- Hallucination rate ~1% (Koenecke et al., FAccT 2024)
Best for: Quick integration for short files; teams wanting open-weights fallback. OpenAI Whisper docs →
OpenAI gpt-4o-transcribe / gpt-4o-mini-transcribe
Built on the GPT-4o family, shipped March 2025 as the recommended successor to whisper-1. Available via REST and the OpenAI Realtime API for streaming.
Pricing: $0.006/min (gpt-4o-transcribe), $0.003/min (gpt-4o-mini-transcribe). Same 25 MB REST limit applies.
Strengths
- Best price/quality ratio in OpenAI ecosystem
- Streaming via Realtime API
- GPT-4o noise-robustness research
- Prompt parameter for context priming
Weaknesses
- Same 25 MB ceiling on REST endpoint
- Unclear long-tail language behavior
- No leaderboard submission as of May 2026
Best for: Teams already on OpenAI for chat/embeddings. OpenAI speech-to-text docs →
Rev AI
Long-running transcription company; Rev AI is the API-first arm. Reverb is their proprietary 2024 ASR model; Foreign Language is a separate SKU.
Pricing: $0.003/min Reverb ASR, $0.005/min Foreign Language, $1.99/min human transcription. Tiered enterprise pricing available.
Strengths
- HIPAA + SOC 2 + GDPR + PCI
- BAA available on enterprise
- 99.99% uptime SLA
- Mature human-fallback workflow
Weaknesses
- Streaming covers only 9 languages
- SDKs less polished than Big-Tech clouds
- Some docs reference legacy Rev.ai v1 endpoints
Best for: Healthcare, legal, regulated industries. Rev AI docs →
Soniox
Independent vendor with a token-based pricing model (rare in STT). Specializes in code-switching and real-time scenarios.
Pricing: Token-based, equating to ~$0.10/hour async ($0.0017/min) and ~$0.12/hour streaming ($0.002/min). Audio tokens billed at $1.50–$2.00 per million.
Strengths
- 60+ languages with native-speaker accuracy
- Mid-utterance language switching
- Price/min near the bottom of the market
Weaknesses
- Token model harder to predict at integration time
- Smaller community / fewer Stack Overflow answers
- No published Open ASR Leaderboard submission
Best for: Multilingual call centers, language-tutoring apps. Soniox docs →
Speechmatics
UK-based (Cambridge), founded 2006. Known historically for accent and dialect robustness; Ursa-2 model (2024) is current.
Pricing: $0.80/hour batch Standard, $1.04/hour batch Enhanced; $1.04/hour real-time Standard, $1.35/hour real-time Enhanced. 480 min/mo free.
Strengths
- Top-tier accent coverage (UK, Irish, Scottish, Indian, Caribbean)
- Deployable as on-premise container
- GDPR-native (UK/EU jurisdiction)
Weaknesses
- Higher list price than AI-first vendors
- Language count (50+) middle-of-pack
- SDKs lean on REST/WebSocket primitives
Best for: Accent-heavy use cases and on-premise/air-gapped deployments. Speechmatics docs →
Pricing at Volume: 10 / 100 / 1,000 / 10,000 Hours per Month
List prices flatten the picture. Real production workloads tier. We computed monthly cost at four volume points using each vendor's published tiers as of May 2026, no negotiation, no commit discounts.
| Provider | 10 hrs/mo | 100 hrs/mo | 1,000 hrs/mo | 10,000 hrs/mo |
|---|---|---|---|---|
| AssemblyAI Universal-2 | $1.50 | $15.00 | $150.00 | $1,500.00 |
| AWS Transcribe (standard) | $14.40 | $144.00 | $1,440.00 | $9,000.00¹ |
| Azure AI Speech (real-time) | $10.00 | $100.00 | $1,000.00 | $10,000.00² |
| Azure AI Speech (batch) | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Deepgram Nova-3 (batch) | $2.58 | $25.80 | $258.00 | $2,580.00 |
| Deepgram Nova-3 (streaming) | $4.62 | $46.20 | $462.00 | $4,620.00 |
| ElevenLabs Scribe v2 (batch) | $2.20 | $22.00 | $220.00 | $2,200.00 |
| Gladia Solaria-1 (Starter) | $6.10 | $61.00 | $610.00 | $6,100.00³ |
| Google Cloud STT V2 | $9.60 | $96.00 | $960.00 | $9,600.00⁴ |
| OpenAI gpt-4o-transcribe | $3.60 | $36.00 | $360.00 | $3,600.00 |
| OpenAI gpt-4o-mini-transcribe | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Rev AI Reverb | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Soniox (async) | $1.00 | $10.00 | $100.00 | $1,000.00 |
| Speechmatics (batch Standard) | $8.00 | $80.00 | $800.00 | $6,400.00⁵ |
¹ AWS Tier 2 ($0.015/min) kicks in at >250k min/mo (4,167 hrs).
² Azure commitment tier drops to $0.50/hour at 50k hrs/mo.
³ Gladia Growth tier ($0.20/hour async) requires upfront annual commit.
⁴ GCP high-volume tier drops batch to $0.004/min at >2M min/mo.
⁵ Speechmatics auto-applies 20% off above 500 hrs/mo.
Three patterns matter for budget planning. First, the cheapest vendor at 10 hours (Soniox at $1.00) stays the cheapest at 10,000 hours. Second, the AWS / Google / Azure trio look expensive at low volume but compete aggressively at scale. Third, the "real-time premium" is real: Deepgram streaming costs 1.79× batch, and Azure real-time costs 5.6× batch. If your product can defer transcription by even 30 seconds, batch endpoints save 40–80%.
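The arithmetic behind these patterns is easy to script for your own projected volumes. A minimal sketch using the list rates quoted in this guide; verify current numbers against each vendor's pricing page before budgeting on it, and note it ignores volume tiers:

```python
# Illustrative per-minute list rates from the table above (May 2026 snapshot).
RATES_PER_MIN = {
    "assemblyai_batch": 0.0025,
    "deepgram_batch": 0.0043,
    "deepgram_streaming": 0.0077,
    "openai_mini_transcribe": 0.003,
}

def monthly_cost(rate_per_min: float, hours_per_month: float) -> float:
    """Monthly USD spend at a flat per-minute rate (no volume tiers applied)."""
    return round(rate_per_min * hours_per_month * 60, 2)

def streaming_premium(batch_rate: float, streaming_rate: float) -> float:
    """How many times more expensive streaming is than batch."""
    return round(streaming_rate / batch_rate, 2)
```

`monthly_cost(RATES_PER_MIN["deepgram_batch"], 1000)` reproduces the $258 Deepgram batch cell above, and `streaming_premium` on the two Deepgram rates gives the 1.79× figure cited in the text.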
Latency Comparison for Streaming Use Cases
Streaming latency is the time from speech onset to the first transcript token ("first-word latency"). For voice agents and live captioning, you want sub-500 ms — anything above that is perceptible delay.
| Provider | Streaming | First-word | RTF | Price |
|---|---|---|---|---|
| ElevenLabs Scribe v2 Realtime | Yes | ~150 ms (FLEURS) | Sub-real-time | $0.39/hr |
| Gladia Solaria-1 | Yes | Sub-103 ms (partial) | Sub-real-time | $0.0125/min |
| Soniox | Yes | ~200 ms | Sub-real-time | ~$0.002/min |
| Deepgram Nova-3 streaming | Yes | <300 ms | Sub-real-time | $0.0077/min |
| AssemblyAI Universal-Streaming | Yes | ~400 ms | Sub-real-time | Bundled with PAYG |
| Azure AI Speech Real-time | Yes | ~400-600 ms | Sub-real-time | $0.017/min |
| AWS Transcribe Streaming | Yes | ~500-800 ms | Sub-real-time | $0.024/min |
| OpenAI gpt-4o-transcribe (Realtime API) | Yes | ~500-700 ms | Sub-real-time | $0.006/min |
| OpenAI Whisper API (whisper-1) | No | n/a | Batch only | $0.006/min |
| Speechmatics Real-time | Yes | ~700 ms | Sub-real-time | $1.04/hr Standard |
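Vendor latency numbers come from different measurement setups, so measure first-word latency on your own network path. A small generic helper, assuming your streaming client exposes partial transcripts as an iterable of strings (adapt it to whatever callback or event shape your SDK actually uses):

```python
import time
from typing import Iterable, Optional

def first_word_latency(partials: Iterable[str]) -> Optional[float]:
    """Seconds from starting to consume the stream until the first
    non-empty partial transcript arrives."""
    start = time.perf_counter()
    for text in partials:
        if text.strip():
            return time.perf_counter() - start
    return None  # stream ended without any recognized speech
```

Feed it the partial-transcript iterator from your WebSocket wrapper while playing a fixed test clip, and compare vendors under identical conditions rather than trusting published benchmarks.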
Accuracy / WER Comparison
We separate vendor-claimed WER from independently verified WER. The Hugging Face Open ASR Leaderboard is the closest thing to a neutral cross-vendor benchmark.
| Provider | Self-reported | Open ASR Leaderboard | Notes |
|---|---|---|---|
| AssemblyAI Universal-2 | <7% on clean English (vendor) | ~7% (vendor leaderboard submission) | Strong real-world results; published noise benchmarks |
| AWS Transcribe Standard | Not published | Not on leaderboard | Marketing-only WER claims; verify on your data |
| Azure AI Speech Neural | ~5% English (vendor demos) | Not on leaderboard | Custom models can improve domain WER significantly |
| Deepgram Nova-3 | <6% English (vendor) | Submitted, ~7% on Earnings22 | Publishes noise-robustness numbers |
| ElevenLabs Scribe v2 | Not on Open ASR Leaderboard yet | Pending | Strong demos; independent benchmarks pending |
| Gladia Solaria-1 | <8% English (vendor) | Submitted, mid-pack | Originally Whisper-based; in-house model since 2024 |
| Google Cloud Chirp 2 | Not directly published | Not on leaderboard (proprietary) | Strong on multilingual; English mid-pack |
| OpenAI Whisper Large-v3 | 2.0% LibriSpeech, 7.44% mean Open ASR | ~6.4% (open-source leaderboard) | Most-cited reference model |
| OpenAI gpt-4o-transcribe | Better than Whisper on noisy audio (vendor) | Not submitted | Hosted-only; cannot independently benchmark |
| Rev AI Reverb | <8% English (vendor) | Submitted, mid-pack | Human fallback option for compliance workflows |
| Soniox | Code-switching specialist | Not submitted | Token-based pricing; verify on multilingual data |
| Speechmatics Ursa-2 | <6% English (vendor) | Submitted, competitive | Strongest accent coverage in our tests |
WER on a benchmark is not WER on your data. Always run a small eval on representative samples before committing. Also evaluate hallucination rate separately (see Koenecke et al., FAccT 2024) — a model with low WER but high hallucination rate is dangerous in clinical, legal, or journalism workflows.
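Running that small eval needs only a WER function. A self-contained sketch using word-level Levenshtein distance; normalization here is just lowercasing and whitespace splitting, while production evals usually also strip punctuation and normalize numbers:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("the cat sat", "the bat sat")` returns 1/3: one substitution over three reference words.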
Compliance & Enterprise Features
If you're shipping to healthcare, legal, finance, or government, compliance is the gating criterion. Verify directly against each vendor's trust center before signing.
| Provider | HIPAA | SOC 2 | GDPR | On-prem | BAA |
|---|---|---|---|---|---|
| AssemblyAI | Yes (BAA) | Yes | Yes | No | Yes |
| AWS Transcribe | Yes (Medical, BAA) | Yes | Yes | No | Yes |
| Azure AI Speech | Yes (BAA) | Yes | Yes | Yes (Container) | Yes |
| Deepgram | Yes (BAA) | Yes | Yes | Yes (Self-hosted) | Yes |
| ElevenLabs Scribe v2 | Yes (BAA) | Yes | Yes (EU residency) | No | Yes |
| Gladia | Limited | Yes | Yes (EU-based) | No | Enterprise |
| Google Cloud STT | Yes (BAA) | Yes | Yes | No | Yes |
| OpenAI | Enterprise tier only | Yes | Yes | No | Enterprise only |
| Rev AI | Yes (HIPAA Plus) | Yes | Yes | No | Yes |
| Soniox | Limited | Yes | Yes | Enterprise | Enterprise |
| Speechmatics | Limited | Yes | Yes (UK/EU) | Yes (Container) | Enterprise |
Decision Framework: Pick One API in Under Five Minutes
Eight branches, one recommendation each. Read top-to-bottom; first match wins.
- If you need HIPAA + a signed BAA today → AWS Transcribe Medical (or Deepgram Nova-3 Medical with BAA). Skip OpenAI unless you're already on the Enterprise tier.
- If you need on-premise / air-gapped deployment → Speechmatics Container, Deepgram Self-Hosted, or Azure Speech Container.
- If you serve users in 50+ languages including long-tail → Gladia Solaria-1 (100 languages) or Google Cloud STT V2 with Chirp 2 (125+ languages).
- If you need real-time captioning under 200 ms → ElevenLabs Scribe v2 Realtime (150 ms) or Soniox (~200 ms).
- If you're already on OpenAI and the file is under 25 MB → gpt-4o-mini-transcribe for cost, gpt-4o-transcribe for accuracy.
- If you're on AWS / GCP / Azure and want IAM-native auth → use the cloud's first-party service.
- If you can self-host and your team has GPU MLOps experience → Whisper Large-v3 (MIT) or NVIDIA's open-source models.
- None of the above → AssemblyAI Universal-2 ($0.0025/min, 99 languages, $50 free credit) is the safest default.
Eight Common Pitfalls When Integrating a Transcription API
What we wish we'd known before shipping our first STT integration.
1. The 25 MB OpenAI file size limit
Hard limit on the REST endpoint. That's roughly 13 minutes of 16-bit mono 16 kHz WAV (about 1.9 MB per minute); compressed formats like MP3 or Opus fit considerably more. Chunk on silence boundaries (pydub or ffmpeg silencedetect), not fixed time. Add 0.5 s overlap and dedupe at boundaries.
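pydub's `split_on_silence` handles this for you, but the underlying logic is worth seeing. A sketch that operates on per-frame RMS values rather than a real audio object; the threshold and frame counts are illustrative, not tuned:

```python
def silence_split_points(frame_rms, threshold=200, min_silence_frames=15):
    """Frame indices at the middle of each silence run long enough to
    split on. frame_rms: one RMS value per fixed-size frame."""
    points, run_start = [], None
    for i, rms in enumerate(frame_rms):
        if rms < threshold:
            if run_start is None:
                run_start = i  # silence run begins
        else:
            if run_start is not None and i - run_start >= min_silence_frames:
                points.append((run_start + i) // 2)  # cut mid-silence
            run_start = None
    return points

def chunks_with_overlap(n_frames, split_points, overlap=15):
    """(start, end) frame ranges padded by `overlap` frames on each side,
    so words straddling a cut appear in both chunks for deduping."""
    bounds = [0] + split_points + [n_frames]
    return [(max(0, s - overlap), min(n_frames, e + overlap))
            for s, e in zip(bounds[:-1], bounds[1:])]
```

With 33 ms frames, `min_silence_frames=15` approximates the 500 ms minimum-silence setting and `overlap=15` approximates the 0.5 s boundary overlap suggested above.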
2. Hallucinations on silence
Whisper-family models output "Thank you for watching" or "Subscribe!" on long silent segments — training-data leakage. Run VAD before transcription. Per Koenecke et al., ~1% of clips hallucinate, 38% of those carry harm vectors.
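A proper VAD (webrtcvad, Silero) is the right tool here, but even a crude energy gate catches the all-silence clips most prone to hallucination. A sketch for 16-bit mono little-endian PCM; the 300 RMS threshold and 5% speech floor are illustrative guesses to tune on your own audio:

```python
import math
import struct

def speech_fraction(pcm: bytes, frame_ms: int = 30,
                    sample_rate: int = 16000,
                    rms_threshold: float = 300.0) -> float:
    """Fraction of frames whose RMS energy exceeds a crude threshold,
    for 16-bit mono little-endian PCM. Not a real VAD."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    per_frame = sample_rate * frame_ms // 1000
    frames = [samples[i:i + per_frame]
              for i in range(0, len(samples), per_frame)]
    voiced = sum(
        1 for f in frames
        if f and math.sqrt(sum(s * s for s in f) / len(f)) > rms_threshold
    )
    return voiced / max(len(frames), 1)

def should_transcribe(pcm: bytes, min_speech: float = 0.05) -> bool:
    """Gate: skip the API call entirely when almost nothing is voiced."""
    return speech_fraction(pcm) >= min_speech
```

Call `should_transcribe()` before spending API credits; clips that fail the gate can be skipped or flagged rather than transcribed.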
3. Rate limits hit during traffic spikes
Every vendor has per-minute concurrency caps. Implement token-bucket queuing client-side. For real-time, request quota increases before launch — approval takes 2–5 business days.
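A client-side token bucket is a few lines of stdlib Python. This sketch throttles request starts only; a real deployment also needs retry-with-backoff on 429 responses, which it does not handle:

```python
import time

class TokenBucket:
    """Client-side throttle: up to `capacity` requests in a burst,
    refilled at `rate` tokens per second. Call acquire() before
    each API request."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self, block: bool = True) -> bool:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            if not block:
                return False
            time.sleep((1 - self.tokens) / self.rate)
```

Size `rate` and `capacity` from your vendor's documented concurrency caps, leaving headroom for retries.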
4. Cost spikes from misconfigured batch jobs
Re-transcribing for a code change loops you through full vendor pricing. Cache transcripts keyed by audio file hash. Set per-user spend caps server-side. Alert when daily cost exceeds 2× rolling 7-day average.
5. Language-detection failure modes
Auto-detect works on clean monolingual audio. It fails on code-switching, short clips (<10 s), and accented English. When users provide language context, pass it explicitly — don't rely on detection.
6. Diarization labels are not stable across calls
"Speaker 0" in one transcript is not "Speaker 0" in the next. Re-identifying speakers across sessions requires speaker embeddings. Don't store "Speaker 0" as a user-facing label.
7. Streaming reconnects lose context
WebSocket disconnects (mobile network changes, Wi-Fi handoff, server-side timeouts after 4 hours) drop in-progress utterances. Buffer client-side audio for 2–3 seconds; on reconnect, replay the buffer.
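The replay buffer is just a bounded deque of timestamped chunks. A sketch with a 3-second window; real clients also need to dedupe the transcript overlap that replaying produces on the server side:

```python
import time
from collections import deque
from typing import List, Optional

class ReplayBuffer:
    """Keeps the last `window_s` seconds of sent audio chunks so a
    reconnecting WebSocket client can replay what the server may
    have dropped mid-utterance."""
    def __init__(self, window_s: float = 3.0):
        self.window_s = window_s
        self.chunks = deque()  # (monotonic_timestamp, chunk_bytes)

    def record(self, chunk: bytes, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.chunks.append((now, chunk))
        # Evict anything older than the replay window.
        while self.chunks and now - self.chunks[0][0] > self.window_s:
            self.chunks.popleft()

    def replay(self) -> List[bytes]:
        """Chunks to resend, oldest first, after reconnecting."""
        return [c for _, c in self.chunks]
```

Call `record()` on every chunk you send; on reconnect, send `replay()` before resuming the live stream.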
8. Custom vocabulary doesn't fix proper nouns automatically
"Boost" parameters help product names but won't fix names of people. For proper nouns, use the vendor's prompt or context field. Keyword lists over ~200 entries silently degrade accuracy.
Frequently Asked Questions
What's the cheapest transcription API?
As of May 2026, OpenAI's gpt-4o-mini-transcribe at $0.003/minute ($0.18/hour) is the cheapest list-price API from a major vendor. Soniox is effectively cheaper at ~$0.0017/minute on its token-based async pricing, and Rev AI Reverb matches OpenAI at $0.003/min. At very high volumes (>5M minutes/month) AWS Transcribe drops to $0.0078/min. Below Soniox's rate, no managed option gets cheaper without committing to volume tiers.
Which transcription API has the lowest latency?
ElevenLabs Scribe v2 Realtime publishes 150 ms first-word latency on the FLEURS benchmark — the only number we found measured against a public dataset. Gladia Solaria-1 publishes sub-103 ms partial-token latency (a different metric). Soniox reports ~200 ms. Deepgram Nova-3 reports under 300 ms. For voice-agent use cases (LLM completion budget around 800 ms), any of these leaves headroom.
Which transcription API supports the most languages?
Google Cloud Speech-to-Text V2 with Chirp 2 lists 125+ languages — the broadest. Gladia Solaria-1 lists 100, including languages not offered by other commercial APIs (long-tail African, Central Asian, and Pacific). AssemblyAI Universal-2 and ElevenLabs Scribe v2 each support 99. Real-time language coverage is always smaller than batch — verify your specific languages are streaming-supported before architecting on real-time.
Is OpenAI Whisper API the best for general use?
Not anymore. whisper-1 is now considered legacy; OpenAI itself recommends gpt-4o-transcribe for new projects (better accuracy, same $0.006/min) or gpt-4o-mini-transcribe for cost (half the price). Whisper Large-v3 open-source weights — which you self-host — score around 6.4% WER on the Hugging Face Open ASR Leaderboard. Strong, but a few open-source models now beat it on that benchmark.
What's the difference between batch and real-time transcription APIs?
Batch (also called async or prerecorded) accepts a complete audio file and returns the full transcript after processing — typically 5–30 seconds for short files, sub-real-time for long files. Real-time (also called streaming) processes audio chunks as they arrive, returning partial transcripts within 150–600 ms. Real-time costs 1.5–5× batch on most vendors. Choose batch unless your use case truly needs live captions.
Which APIs support speaker diarization out of the box?
Diarization (separating speakers without prior voice samples) is included on AssemblyAI Universal-2, Deepgram Nova-3, ElevenLabs Scribe v2, Google Cloud STT V2, AWS Transcribe, Azure AI Speech, Speechmatics, Gladia, Rev AI, and Soniox. OpenAI does not provide diarization on the transcription endpoint — you need a separate library (Pyannote, NVIDIA NeMo) post-hoc.
Are there HIPAA-compliant transcription APIs?
Yes. AWS Transcribe Medical, Azure AI Speech, Deepgram Nova-3, AssemblyAI, Rev AI (HIPAA Plus subscription), Google Cloud STT, and ElevenLabs Scribe v2 all sign Business Associate Agreements. OpenAI signs BAAs only on its Enterprise tier — standard API accounts cannot ship clinical workflows. Treat any vendor that won't sign a BAA as not HIPAA-eligible regardless of marketing language.
How do I handle files larger than 25 MB on the OpenAI API?
Chunk on silence boundaries, never on fixed time intervals. In Python: pydub.silence.split_on_silence(audio, min_silence_len=500, silence_thresh=-40). From the CLI: ffmpeg -af silencedetect=n=-30dB:d=0.5. Add 0.5-second overlap on each chunk and dedupe overlapping words at boundaries. Better: switch to a vendor without the 25 MB cap (Deepgram, AssemblyAI, Gladia all accept multi-GB uploads).
Can I self-host instead of using an API?
Yes. OpenAI Whisper is open-weights (MIT) since 2022. NVIDIA Canary and Parakeet variants are open-source and currently lead the Open ASR Leaderboard. Self-hosting costs roughly $0.001–$0.003/min on a reserved A10G/L4 GPU at 50–80% utilization, beating most APIs at scale, but adds MLOps overhead, eval pipelines, and on-call burden. For most teams, the developer-time-versus-cost tradeoff favors a managed API up to several thousand hours per month.
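You can sanity-check that tradeoff with napkin math. A sketch in which every default is an illustrative assumption (roughly an on-demand L4/A10G rate and Whisper-class throughput), not a measured number:

```python
def self_host_breakeven(api_rate_per_min: float,
                        gpu_cost_per_hour: float = 1.00,
                        realtime_factor: float = 30.0,
                        utilization: float = 0.6) -> dict:
    """Compare an always-on GPU against a per-minute API. Returns the
    monthly audio-hour volume where costs cross, and the most audio one
    GPU can transcribe per month at the given utilization."""
    monthly_gpu = gpu_cost_per_hour * 24 * 30          # always-on GPU bill
    breakeven = monthly_gpu / (api_rate_per_min * 60)  # hours where API bill matches
    capacity = realtime_factor * 24 * 30 * utilization
    return {"breakeven_hours": round(breakeven),
            "gpu_capacity_hours": round(capacity)}
```

At gpt-4o-transcribe's $0.006/min, an always-on $1/hour GPU breaks even around 2,000 audio hours per month, consistent with the "several thousand hours" rule of thumb. Remember this omits the MLOps and on-call costs, which usually dominate at small scale.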
What's the difference between Whisper and gpt-4o-transcribe?
Both are OpenAI hosted endpoints. whisper-1 is the original Whisper Large model from 2022, $0.006/min, REST-only (no streaming). gpt-4o-transcribe (March 2025) is built on the GPT-4o family — better accuracy on noisy audio, supports streaming via the OpenAI Realtime API, accepts a prompt parameter for context priming, and costs the same $0.006/min. For new projects OpenAI recommends gpt-4o-transcribe. whisper-1 stays available for backward compatibility.
Are there free transcription APIs for development?
Every vendor in this guide has a free tier or free credit. Most generous: Deepgram ($200 credit, no card, ~46,500 batch minutes at $0.0043/min), AssemblyAI ($50 credit, ~333 hours of Universal-2), ElevenLabs (10 hrs/mo Scribe v2), Gladia (10 hrs/mo), Speechmatics (480 min/mo). Google Cloud and AWS offer 60 minutes/month for 12 months. OpenAI has no free tier — you pay from minute one.
How accurate are these APIs in noisy real-world audio?
Word Error Rate degrades 2–4× on noisy audio (background music, reverb, low SNR) versus clean studio audio. Open ASR Leaderboard numbers (5–15% WER on benchmark sets) become 15–35% on phone-quality recordings. Deepgram Nova-3 and AssemblyAI Universal-2 publish noise-robustness benchmarks; most vendors do not. Beyond WER, the Koenecke et al. FAccT 2024 paper documented that ~1% of Whisper outputs hallucinate (entirely fabricated phrases) on real-world audio — independent of WER. Run voice-activity detection pre-processing and evaluate hallucination rate separately on your own data before shipping.
Methodology, Verification, and Conflict-of-Interest Disclosure
Verification window. All pricing, model names, language counts, free-tier values, and compliance attestations were verified against vendor pricing pages, vendor documentation, and (where applicable) vendor trust centers between May 1 and May 14, 2026. Where a number is "vendor-claimed" we say so explicitly. Where a number is independently reproduced (Open ASR Leaderboard, FLEURS), we cite the dataset.
Methodology. We tested each API on four audio classes: clean call-center English (90 s mono), long-form podcast (4 hr stereo 44.1 kHz), Spanish/English code-switched conversation (6 min), and degraded audio (30 s, 12 dB SNR pink noise). Pricing was modeled at four monthly volume points (10, 100, 1,000, 10,000 hours) using each vendor's published list rate with no commit discount applied. WER numbers come from the Hugging Face Open ASR Leaderboard where available; vendor-reported numbers are flagged as "self-reported."
Conflict of interest. This guide is published by VexaScribe. VexaScribe does not sell a transcription API. VexaScribe has no affiliate relationships with any vendor listed and received no compensation, free credits, or other consideration for inclusion or ranking. Outbound vendor links use rel="noopener" (not nofollow) because OpenAI, Google, AWS, and the others are authoritative entities. Editorial standards: see our editorial standards.