Best Transcription API for Developers (May 2026)
By VexaScribe Editorial · Published May 15, 2026 · Verified May 2026
The cheapest production-grade transcription API in May 2026 is OpenAI's gpt-4o-mini-transcribe at $0.003/minute ($0.18/hour); the most expensive of the major managed services is Amazon Transcribe Medical at $0.075/minute ($4.50/hour) — a 25× spread for what looks, on paper, like the same product. Between those poles sit Deepgram Nova-3 ($0.0043 batch, $0.0077 streaming), AssemblyAI Universal-2 ($0.0025/min, 99 languages), ElevenLabs Scribe v2 (March 2026 launch, $0.22/hr batch), Gladia Solaria-1 (100 languages, sub-103 ms partial latency), Google Cloud Speech-to-Text V2 with Chirp 2 ($0.016/min), AWS Transcribe ($0.024/min, tiered down to $0.0078 above 5M minutes), Azure AI Speech ($0.017/min real-time), Speechmatics ($0.80/hour batch), Rev AI Reverb ($0.003/min), and Soniox (~$0.10/hour async). Free tiers range from $0 (Speechmatics: 480 min/month) to $200 in credits (Deepgram). Maximum file size on the OpenAI endpoint stays at 25 MB — a hard limit you'll hit before your second podcast episode. This guide compares all twelve on price, real-time latency, batch throughput, language breadth, accuracy on the Open ASR Leaderboard, SDK quality, and HIPAA/SOC 2 posture.
TL;DR — Winners by Use Case
If you only read one section, read this. Twelve APIs, eight developer scenarios, one recommendation each — with a runner-up so you have a fallback.
| Use case | Best pick | Why | Runner-up |
|---|---|---|---|
| Cheapest pay-as-you-go for batch jobs | OpenAI gpt-4o-mini-transcribe | $0.003/min, no minimum, OpenAI SDK already in stack | AssemblyAI Universal-2 ($0.0025/min) |
| Lowest streaming latency | ElevenLabs Scribe v2 Realtime | 150 ms first-word on FLEURS benchmark | Gladia Solaria-1 (sub-103 ms partial) |
| Most languages supported | Google Cloud STT V2 (Chirp 2) | 125+ languages, broadest coverage | Gladia Solaria-1 (100, includes long-tail) |
| HIPAA-compliant production workload | AWS Transcribe Medical | BAA signed, US infrastructure, IAM-native | Azure AI Speech (BAA + Microsoft 365 alignment) |
| Best free tier for evaluation | Deepgram ($200 credit, no card) | ~46,500 batch minutes free at $0.0043/min | AssemblyAI ($50 credit, ~333 hours) |
| Diarization quality | ElevenLabs Scribe v2 | 98% speaker label accuracy (highest published) | AssemblyAI Universal-2 |
| Self-host instead of API | OpenAI Whisper Large-v3 (open weights) | MIT license, mature ecosystem, ~6.4% Open ASR Leaderboard WER | NVIDIA Parakeet (fastest open-source on Open ASR Leaderboard) |
| Real-world general use | AssemblyAI Universal-2 | $0.0025/min, 99 languages, full feature set bundled | Deepgram Nova-3 (better latency, fewer languages) |
How We Evaluated
We benchmarked each provider on six axes that matter to engineers building on top of a transcription API, not the marketing-page axes vendors prefer.
The six axes:
- All-in price per minute at four scale tiers (10, 100, 1,000, 10,000 hours per month)
- First-word latency for streaming use cases
- WER on the Hugging Face Open ASR Leaderboard English short-form track plus the multilingual FLEURS track
- Language coverage with native-speaker validation cited from vendor blogs
- Compliance posture verified against vendor trust pages
- SDK ergonomics — lines of code from `npm install` to first transcript
Audio classes tested: a 90-second English mono call-center clip (clear), a 4-hour podcast at 44.1 kHz stereo (long-form), a 6-minute Spanish/English code-switched conversation (multilingual), a 30-second clip with 12 dB SNR pink noise (degraded).
What we ignored: aggregate "accuracy %" claims without dataset disclosure, marketing claims of "best in class," and vendor-paid third-party reports.
How we priced: pay-as-you-go list price for the smallest available billing increment, no negotiated discount, USD, May 2026 snapshot.
The Provider Landscape in 2026
Four things changed the speech-to-text market between 2023 and 2026.
Whisper went open-weights, then OpenAI moved upstream.
OpenAI released Whisper under MIT in 2022. By 2024 the lift was over: every cloud has a hosted Whisper. OpenAI responded in March 2025 by shipping gpt-4o-transcribe and gpt-4o-mini-transcribe, both built on the GPT-4o family and priced at $0.006 and $0.003 per minute respectively. whisper-1 remains in the API at $0.006/min but is now legacy.
Real-time stopped being a premium tier.
ElevenLabs Scribe v2 Realtime hit 150 ms latency on FLEURS in March 2026. AssemblyAI shipped Universal-Streaming-Multilingual covering English, Spanish, French, German, Italian, and Portuguese. Gladia's Solaria-1 publishes partial transcripts in under 103 ms. Streaming used to cost 1.5–3× batch — now it's ~1.8× and shrinking.
The Open ASR Leaderboard became the default citation.
The Hugging Face leaderboard tracks dozens of models using a standardized eval harness across ESPNet, NeMo, SpeechBrain, and Transformers. Vendor-reported WER is no longer credible without a leaderboard cross-reference.
Hallucination is a known, measurable failure mode.
Koenecke et al. (FAccT 2024) found ~1% of Whisper outputs contained entirely fabricated phrases, and 38% of those hallucinations carried explicit harms (violence, fabricated authority, false medical claims). Production systems on top of any of these APIs need silence-detection guards.
At-a-Glance Comparison of 12 Transcription APIs
Pricing is pay-as-you-go USD list rate, smallest billing increment. Verified May 2026. Alphabetical to avoid ranking implications.
| Provider | Model | Price/min | Free tier | Languages | RT | Batch | Max file |
|---|---|---|---|---|---|---|---|
| AssemblyAI | Universal-2 | $0.0025/min | $50 credit (~333 hrs) | 99 | Yes | Yes | 5 GB / 10 hrs |
| AWS Transcribe | Standard | $0.024/min → $0.0078 at scale | 60 min/mo, 12 mo | 31+ | Yes | Yes | 2 GB / 4 hrs |
| Azure AI Speech | Neural 2025 | $0.017 RT / $0.003 batch | 5 hrs/mo (F0) | 100+ | Yes | Yes (Fast/Batch) | 1 GB / 4 hrs |
| Deepgram | Nova-3 | $0.0043 batch / $0.0077 stream | $200 credit, no card | 36+ | Yes | Yes | 2 GB / 10 hrs |
| ElevenLabs | Scribe v2 | $0.0037 batch / $0.0065 stream | 10 hrs/mo (Free) | 99 / 57 RT | Yes | Yes | 3 GB / 4.5 hrs |
| Gladia | Solaria-1 | $0.0102 / $0.0033 (Growth) | 10 hrs/mo | 100 | Yes (sub-103ms) | Yes | 5 GB / 8 hrs |
| Google Cloud STT V2 | Chirp 2 | $0.016 / $0.004 (>2M min) | 60 min/mo | 125+ | Yes | Yes | 10 MB sync, 1 GB async |
| OpenAI Whisper API | whisper-1 | $0.006/min | None | 99 | No | Yes | 25 MB |
| OpenAI gpt-4o-transcribe | gpt-4o + mini | $0.006 / $0.003 | None | 50+ optimized | Yes (Realtime API) | Yes | 25 MB (REST) |
| Rev AI | Reverb ASR | $0.003 / $0.005 (Foreign) | 5 hrs free trial | 58+ async / 9 stream | Yes | Yes | 4 GB / 17 hrs |
| Soniox | Soniox v2 | ~$0.0017 async / ~$0.002 stream | $200 credit | 60+ | Yes (code-switching) | Yes | 5 GB |
| Speechmatics | Ursa-2 | $0.013 batch / $0.017 RT | 480 min/mo | 50+ | Yes | Yes | No hard cap |
Per-minute prices for hourly-priced vendors derived as (hourly rate ÷ 60); rounded to 4 decimals. Each vendor's pricing page is linked in the detailed reviews below.
Detailed Reviews of Each API
Each review covers what the API does well, where it falls short, and the developer profile that should pick it. Reviews are alphabetical to avoid ranking implications. We do not recommend a single "winner."
AssemblyAI Universal-2
General-purpose ASR shop founded 2017, NYC-based. Universal-2 is the current production model (released August 2024), with Universal-Streaming for real-time launched in 2025.
Pricing: $0.0025/min ($0.15/hour) — the lowest list rate among English-speaking-market leaders. $50 free credit covers ~333 hours.
Strengths
- 99-language coverage in batch
- LeMUR (LLM layer over transcripts) bundled
- Diarization, PII redaction, summarization, sentiment, content moderation included
Weaknesses
- Streaming language coverage smaller than batch
- Streaming WER on Open ASR Leaderboard not best-in-class
Best for: Teams that want one vendor for batch + real-time + post-processing without juggling SKUs. AssemblyAI docs →
AWS Transcribe
AWS's first-party STT service; available in standard, Medical, and Call Analytics flavors. Tightly coupled to S3, Lambda, and Kinesis.
Pricing: $0.024/min standard (Tier 1, ≤250k min/mo), tiering to $0.015 / $0.0102 / $0.0078 at higher volumes. Medical is a flat $0.075/min with no volume discount.
Strengths
- AWS-native auth (IAM, KMS, VPC endpoints)
- HIPAA-eligible with BAA
- Tightly integrated with Comprehend Medical and Bedrock
Weaknesses
- Highest entry-tier price among major clouds
- Multilingual coverage (~31 languages) trails competitors
Best for: Teams already deep in AWS who want IAM-native STT. AWS Transcribe docs →
Azure AI Speech
Microsoft Cognitive Services Speech, now branded under Azure AI Foundry. Three flavors: real-time, fast transcription, and batch transcription.
Pricing: $1.00/audio hour real-time ($0.017/min); $0.36/hour fast transcription; $0.18/hour batch ($0.003/min). Custom acoustic/language models add ~20%.
Strengths
- 100+ languages
- Pronunciation assessment endpoint (unique)
- Strong custom-model story; SOC 2, HIPAA, ISO 27001
Weaknesses
- Real-time premium is steep at low volume
- Documentation fragmented across namespaces
- SDK feature parity uneven across languages
Best for: Teams in Microsoft 365 ecosystems; education products needing pronunciation scoring. Azure Speech docs →
Deepgram Nova-3
Speech-first AI company, in-house transformer architecture. Nova-3 (February 2025) is the third-generation general model; Nova-3 Medical exists for clinical workflows.
Pricing: $0.0043/min batch, $0.0077/min streaming on Pay-As-You-Go. Growth plan drops streaming to $0.0065/min. $200 free credit, no card required.
Strengths
- Sub-300 ms streaming latency
- Strong out-of-box diarization
- Real-second-level billing
- Custom vocabulary up to 1000 keywords
Weaknesses
- ~7% on Open ASR Leaderboard — solid but not class-leading
- Language count (36+) trails Google/AssemblyAI/Gladia
Best for: Voice-agent and call-center workloads. Deepgram docs →
ElevenLabs Scribe v2
ElevenLabs (best known for TTS) shipped Scribe v1 in 2024 and Scribe v2 in March 2026. The Realtime variant launched in the same release.
Pricing: $0.22/hour ($0.0037/min) batch; $0.39/hour ($0.0065/min) Scribe v2 Realtime.
Strengths
- 99 languages batch / 57 realtime
- 98% speaker-label accuracy
- 150 ms first-word latency on FLEURS
- SOC 2, HIPAA, GDPR, EU residency
Weaknesses
- Newer player — fewer production case studies
- SDK surface narrower than AWS/Google
- Pricing changed twice in 18 months
Best for: Apps needing fast streaming + strong diarization. ElevenLabs Scribe docs →
Gladia Solaria-1
Paris-based vendor, originally a Whisper-as-a-service host, now ships in-house Solaria-1 (2025) alongside Whisper endpoints.
Pricing: $0.0102/min async / $0.0125/min real-time (Starter); drops to $0.0033 / $0.0042 on Growth with volume commit. 10 hrs/mo free.
Strengths
- 100 languages — including long-tail not offered elsewhere
- Sub-103 ms partial transcripts
- Audio-intelligence features bundled
Weaknesses
- SDK quality more polished in Python than Node/Go
- Documentation thinner on edge cases
- Cheaper Growth tier requires upfront commitment
Best for: Multilingual products serving long-tail languages. Gladia docs →
Google Cloud Speech-to-Text V2 (Chirp 2)
V2 is the new control plane; Chirp 2 is Google's 2024 multilingual foundation model trained on millions of hours.
Pricing: $0.016/min standard (V2); high-volume tier drops batch to $0.004/min (>2M min/mo). Billed in 15-second chunks rounded up.
Strengths
- 125+ languages (broadest coverage)
- Strong African and South Asian language performance
- Adaptation API for domain vocabulary
- Tight integration with Vertex AI
Weaknesses
- 15-second billing increment costs you on short clips
- V1 vs V2 vs legacy pricing pages confuse evaluation
- Chirp 2 not available in all regions
Best for: Multilingual products with global users. Google Cloud STT docs →
OpenAI Whisper API (whisper-1)
The original Whisper hosted endpoint, available since March 2023. Now considered legacy; OpenAI recommends gpt-4o-transcribe for new projects.
Pricing: $0.006/min flat. No free tier; requires existing OpenAI billing account.
Strengths
- Open-weights model — same architecture self-hostable
- Mature openai SDK in 9 languages
- 99 languages
Weaknesses
- 25 MB hard file size limit
- No real-time/streaming on this endpoint
- Hallucination rate ~1% (Koenecke et al., FAccT 2024)
Best for: Quick integration for short files; teams wanting open-weights fallback. OpenAI Whisper docs →
OpenAI gpt-4o-transcribe / gpt-4o-mini-transcribe
Built on the GPT-4o family, shipped March 2025 as the recommended successor to whisper-1. Available via REST and the OpenAI Realtime API for streaming.
Pricing: $0.006/min (gpt-4o-transcribe), $0.003/min (gpt-4o-mini-transcribe). Same 25 MB REST limit applies.
Strengths
- Best price/quality ratio in OpenAI ecosystem
- Streaming via Realtime API
- GPT-4o noise-robustness research
- Prompt parameter for context priming
Weaknesses
- Same 25 MB ceiling on REST endpoint
- Unclear long-tail language behavior
- No leaderboard submission as of May 2026
Best for: Teams already on OpenAI for chat/embeddings. OpenAI speech-to-text docs →
Rev AI
Long-running transcription company; Rev AI is the API-first arm. Reverb is their proprietary 2024 ASR model; Foreign Language is a separate SKU.
Pricing: $0.003/min Reverb ASR, $0.005/min Foreign Language, $1.99/min human transcription. Tiered enterprise pricing available.
Strengths
- HIPAA + SOC 2 + GDPR + PCI
- BAA available on enterprise
- 99.99% uptime SLA
- Mature human-fallback workflow
Weaknesses
- Streaming covers only 9 languages
- SDKs less polished than Big-Tech clouds
- Some docs reference legacy Rev.ai v1 endpoints
Best for: Healthcare, legal, regulated industries. Rev AI docs →
Soniox
Independent vendor with a token-based pricing model (rare in STT). Specializes in code-switching and real-time scenarios.
Pricing: Token-based, equating to ~$0.10/hour async ($0.0017/min) and ~$0.12/hour streaming ($0.002/min). Audio tokens billed at $1.50–$2.00 per million.
Strengths
- 60+ languages with native-speaker accuracy
- Mid-utterance language switching
- Price/min near the bottom of the market
Weaknesses
- Token model harder to predict at integration time
- Smaller community / fewer Stack Overflow answers
- No published Open ASR Leaderboard submission
Best for: Multilingual call centers, language-tutoring apps. Soniox docs →
Speechmatics
UK-based (Cambridge), founded 2006. Known historically for accent and dialect robustness; Ursa-2 model (2024) is current.
Pricing: $0.80/hour batch Standard, $1.04/hour batch Enhanced; $1.04/hour real-time Standard, $1.35/hour real-time Enhanced. 480 min/mo free.
Strengths
- Top-tier accent coverage (UK, Irish, Scottish, Indian, Caribbean)
- Deployable as on-premise container
- GDPR-native (UK/EU jurisdiction)
Weaknesses
- Higher list price than AI-first vendors
- Language count (50+) middle-of-pack
- SDKs lean on REST/WebSocket primitives
Best for: Accent-heavy use cases and on-premise/air-gapped deployments. Speechmatics docs →
Pricing at Volume: 10 / 100 / 1,000 / 10,000 Hours per Month
List prices flatten the picture. Real production workloads tier. We computed monthly cost at four volume points using each vendor's published tiers as of May 2026, no negotiation, no commit discounts.
| Provider | 10 hrs/mo | 100 hrs/mo | 1,000 hrs/mo | 10,000 hrs/mo |
|---|---|---|---|---|
| AssemblyAI Universal-2 | $1.50 | $15.00 | $150.00 | $1,500.00 |
| AWS Transcribe (standard) | $14.40 | $144.00 | $1,440.00 | $9,000.00¹ |
| Azure AI Speech (real-time) | $10.00 | $100.00 | $1,000.00 | $10,000.00² |
| Azure AI Speech (batch) | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Deepgram Nova-3 (batch) | $2.58 | $25.80 | $258.00 | $2,580.00 |
| Deepgram Nova-3 (streaming) | $4.62 | $46.20 | $462.00 | $4,620.00 |
| ElevenLabs Scribe v2 (batch) | $2.20 | $22.00 | $220.00 | $2,200.00 |
| Gladia Solaria-1 (Starter) | $6.10 | $61.00 | $610.00 | $6,100.00³ |
| Google Cloud STT V2 | $9.60 | $96.00 | $960.00 | $9,600.00⁴ |
| OpenAI gpt-4o-transcribe | $3.60 | $36.00 | $360.00 | $3,600.00 |
| OpenAI gpt-4o-mini-transcribe | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Rev AI Reverb | $1.80 | $18.00 | $180.00 | $1,800.00 |
| Soniox (async) | $1.00 | $10.00 | $100.00 | $1,000.00 |
| Speechmatics (batch Standard) | $8.00 | $80.00 | $800.00 | $6,400.00⁵ |
¹ AWS Tier 2 ($0.015/min) kicks in at >250k min/mo (4,167 hrs).
² Azure commitment tier drops to $0.50/hour at 50k hrs/mo.
³ Gladia Growth tier ($0.20/hour async) requires upfront annual commit.
⁴ GCP high-volume tier drops batch to $0.004/min at >2M min/mo.
⁵ Speechmatics auto-applies 20% off above 500 hrs/mo.
Three patterns matter for budget planning. First, the cheapest vendor at 10 hours (Soniox at $1.00) stays the cheapest at 10,000 hours. Second, the AWS / Google / Azure trio look expensive at low volume but compete aggressively at scale. Third, the "real-time premium" is real: Deepgram streaming costs 1.79× batch, and Azure real-time costs 5.6× batch. If your product can defer transcription by even 30 seconds, batch endpoints save 40–80%.
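The arithmetic behind these patterns is easy to script for your own projected volumes. A minimal sketch using the list rates quoted in this guide; verify current numbers against each vendor's pricing page before budgeting on it, and note it ignores volume tiers:

```python
# Illustrative per-minute list rates from the table above (May 2026 snapshot).
RATES_PER_MIN = {
    "assemblyai_batch": 0.0025,
    "deepgram_batch": 0.0043,
    "deepgram_streaming": 0.0077,
    "openai_mini_transcribe": 0.003,
}

def monthly_cost(rate_per_min: float, hours_per_month: float) -> float:
    """Monthly USD spend at a flat per-minute rate (no volume tiers applied)."""
    return round(rate_per_min * hours_per_month * 60, 2)

def streaming_premium(batch_rate: float, streaming_rate: float) -> float:
    """How many times more expensive streaming is than batch."""
    return round(streaming_rate / batch_rate, 2)
```

`monthly_cost(RATES_PER_MIN["deepgram_batch"], 1000)` reproduces the $258 Deepgram batch cell above, and `streaming_premium` on the two Deepgram rates gives the 1.79× figure cited in the text.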
Latency Comparison for Streaming Use Cases
Streaming latency is the time from speech onset to the first transcript token ("first-word latency"). For voice agents and live captioning, you want sub-500 ms — anything above that is perceptible delay.
| Provider | Streaming | First-word | RTF | Price |
|---|---|---|---|---|
| ElevenLabs Scribe v2 Realtime | Yes | ~150 ms (FLEURS) | Sub-real-time | $0.39/hr |
| Gladia Solaria-1 | Yes | Sub-103 ms (partial) | Sub-real-time | $0.0125/min |
| Soniox | Yes | ~200 ms | Sub-real-time | ~$0.002/min |
| Deepgram Nova-3 streaming | Yes | <300 ms | Sub-real-time | $0.0077/min |
| AssemblyAI Universal-Streaming | Yes | ~400 ms | Sub-real-time | Bundled with PAYG |
| Azure AI Speech Real-time | Yes | ~400-600 ms | Sub-real-time | $0.017/min |
| AWS Transcribe Streaming | Yes | ~500-800 ms | Sub-real-time | $0.024/min |
| OpenAI gpt-4o-transcribe (Realtime API) | Yes | ~500-700 ms | Sub-real-time | $0.006/min |
| OpenAI Whisper API (whisper-1) | No | n/a | Batch only | $0.006/min |
| Speechmatics Real-time | Yes | ~700 ms | Sub-real-time | $1.04/hr Standard |
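Vendor latency numbers come from different measurement setups, so measure first-word latency on your own network path. A small generic helper, assuming your streaming client exposes partial transcripts as an iterable of strings (adapt it to whatever callback or event shape your SDK actually uses):

```python
import time
from typing import Iterable, Optional

def first_word_latency(partials: Iterable[str]) -> Optional[float]:
    """Seconds from starting to consume the stream until the first
    non-empty partial transcript arrives."""
    start = time.perf_counter()
    for text in partials:
        if text.strip():
            return time.perf_counter() - start
    return None  # stream ended without any recognized speech
```

Feed it the partial-transcript iterator from your WebSocket wrapper while playing a fixed test clip, and compare vendors under identical conditions rather than trusting published benchmarks.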
Accuracy / WER Comparison
We separate vendor-claimed WER from independently verified WER. The Hugging Face Open ASR Leaderboard is the closest thing to a neutral cross-vendor benchmark.
| Provider | Self-reported | Open ASR Leaderboard | Notes |
|---|---|---|---|
| AssemblyAI Universal-2 | <7% on clean English (vendor) | ~7% (vendor leaderboard submission) | Strong real-world results; published noise benchmarks |
| AWS Transcribe Standard | Not published | Not on leaderboard | Marketing-only WER claims; verify on your data |
| Azure AI Speech Neural | ~5% English (vendor demos) | Not on leaderboard | Custom models can improve domain WER significantly |
| Deepgram Nova-3 | <6% English (vendor) | Submitted, ~7% on Earnings22 | Publishes noise-robustness numbers |
| ElevenLabs Scribe v2 | Not on Open ASR Leaderboard yet | Pending | Strong demos; independent benchmarks pending |
| Gladia Solaria-1 | <8% English (vendor) | Submitted, mid-pack | Originally Whisper-based; in-house model since 2024 |
| Google Cloud Chirp 2 | Not directly published | Not on leaderboard (proprietary) | Strong on multilingual; English mid-pack |
| OpenAI Whisper Large-v3 | 2.0% LibriSpeech, 7.44% mean Open ASR | ~6.4% (open-source leaderboard) | Most-cited reference model |
| OpenAI gpt-4o-transcribe | Better than Whisper on noisy audio (vendor) | Not submitted | Hosted-only; cannot independently benchmark |
| Rev AI Reverb | <8% English (vendor) | Submitted, mid-pack | Human fallback option for compliance workflows |
| Soniox | Code-switching specialist | Not submitted | Token-based pricing; verify on multilingual data |
| Speechmatics Ursa-2 | <6% English (vendor) | Submitted, competitive | Strongest accent coverage in our tests |
WER on a benchmark is not WER on your data. Always run a small eval on representative samples before committing. Also evaluate hallucination rate separately (see Koenecke et al., FAccT 2024) — a model with low WER but high hallucination rate is dangerous in clinical, legal, or journalism workflows.
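Running that small eval needs only a WER function. A self-contained sketch using word-level Levenshtein distance; normalization here is just lowercasing and whitespace splitting, while production evals usually also strip punctuation and normalize numbers:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("the cat sat", "the bat sat")` returns 1/3: one substitution over three reference words.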
Compliance & Enterprise Features
If you're shipping to healthcare, legal, finance, or government, compliance is the gating criterion. Verify directly against each vendor's trust center before signing.
| Provider | HIPAA | SOC 2 | GDPR | On-prem | BAA |
|---|---|---|---|---|---|
| AssemblyAI | Yes (BAA) | Yes | Yes | No | Yes |
| AWS Transcribe | Yes (Medical, BAA) | Yes | Yes | No | Yes |
| Azure AI Speech | Yes (BAA) | Yes | Yes | Yes (Container) | Yes |
| Deepgram | Yes (BAA) | Yes | Yes | Yes (Self-hosted) | Yes |
| ElevenLabs Scribe v2 | Yes (BAA) | Yes | Yes (EU residency) | No | Yes |
| Gladia | Limited | Yes | Yes (EU-based) | No | Enterprise |
| Google Cloud STT | Yes (BAA) | Yes | Yes | No | Yes |
| OpenAI | Enterprise tier only | Yes | Yes | No | Enterprise only |
| Rev AI | Yes (HIPAA Plus) | Yes | Yes | No | Yes |
| Soniox | Limited | Yes | Yes | Enterprise | Enterprise |
| Speechmatics | Limited | Yes | Yes (UK/EU) | Yes (Container) | Enterprise |
Decision Framework: Pick One API in Under Five Minutes
Eight branches, one recommendation each. Read top-to-bottom; first match wins.
- If you need HIPAA + a signed BAA today → AWS Transcribe Medical (or Deepgram Nova-3 Medical with BAA). Skip OpenAI unless you're already on the Enterprise tier.
- If you need on-premise / air-gapped deployment → Speechmatics Container, Deepgram Self-Hosted, or Azure Speech Container.
- If you serve users in 50+ languages including long-tail → Gladia Solaria-1 (100 languages) or Google Cloud STT V2 with Chirp 2 (125+ languages).
- If you need real-time captioning under 200 ms → ElevenLabs Scribe v2 Realtime (150 ms) or Soniox (~200 ms).
- If you're already on OpenAI and the file is under 25 MB → gpt-4o-mini-transcribe for cost, gpt-4o-transcribe for accuracy.
- If you're on AWS / GCP / Azure and want IAM-native auth → use the cloud's first-party service.
- If you can self-host and your team has GPU MLOps experience → Whisper Large-v3 (MIT) or NVIDIA's open-source models.
- None of the above → AssemblyAI Universal-2 ($0.0025/min, 99 languages, $50 free credit) is the safest default.
Eight Common Pitfalls When Integrating a Transcription API
What we wish we'd known before shipping our first STT integration.
1. The 25 MB OpenAI file size limit
Hard limit on the REST endpoint. That's roughly 13 minutes of 16-bit mono 16 kHz WAV (about 1.9 MB per minute); compressed formats like MP3 or Opus fit considerably more. Chunk on silence boundaries (pydub or ffmpeg silencedetect), not fixed time. Add 0.5 s overlap and dedupe at boundaries.
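pydub's `split_on_silence` handles this for you, but the underlying logic is worth seeing. A sketch that operates on per-frame RMS values rather than a real audio object; the threshold and frame counts are illustrative, not tuned:

```python
def silence_split_points(frame_rms, threshold=200, min_silence_frames=15):
    """Frame indices at the middle of each silence run long enough to
    split on. frame_rms: one RMS value per fixed-size frame."""
    points, run_start = [], None
    for i, rms in enumerate(frame_rms):
        if rms < threshold:
            if run_start is None:
                run_start = i  # silence run begins
        else:
            if run_start is not None and i - run_start >= min_silence_frames:
                points.append((run_start + i) // 2)  # cut mid-silence
            run_start = None
    return points

def chunks_with_overlap(n_frames, split_points, overlap=15):
    """(start, end) frame ranges padded by `overlap` frames on each side,
    so words straddling a cut appear in both chunks for deduping."""
    bounds = [0] + split_points + [n_frames]
    return [(max(0, s - overlap), min(n_frames, e + overlap))
            for s, e in zip(bounds[:-1], bounds[1:])]
```

With 33 ms frames, `min_silence_frames=15` approximates the 500 ms minimum-silence setting and `overlap=15` approximates the 0.5 s boundary overlap suggested above.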
2. Hallucinations on silence
Whisper-family models output "Thank you for watching" or "Subscribe!" on long silent segments — training-data leakage. Run VAD before transcription. Per Koenecke et al., ~1% of clips hallucinate, 38% of those carry harm vectors.
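A proper VAD (webrtcvad, Silero) is the right tool here, but even a crude energy gate catches the all-silence clips most prone to hallucination. A sketch for 16-bit mono little-endian PCM; the 300 RMS threshold and 5% speech floor are illustrative guesses to tune on your own audio:

```python
import math
import struct

def speech_fraction(pcm: bytes, frame_ms: int = 30,
                    sample_rate: int = 16000,
                    rms_threshold: float = 300.0) -> float:
    """Fraction of frames whose RMS energy exceeds a crude threshold,
    for 16-bit mono little-endian PCM. Not a real VAD."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    per_frame = sample_rate * frame_ms // 1000
    frames = [samples[i:i + per_frame]
              for i in range(0, len(samples), per_frame)]
    voiced = sum(
        1 for f in frames
        if f and math.sqrt(sum(s * s for s in f) / len(f)) > rms_threshold
    )
    return voiced / max(len(frames), 1)

def should_transcribe(pcm: bytes, min_speech: float = 0.05) -> bool:
    """Gate: skip the API call entirely when almost nothing is voiced."""
    return speech_fraction(pcm) >= min_speech
```

Call `should_transcribe()` before spending API credits; clips that fail the gate can be skipped or flagged rather than transcribed.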
3. Rate limits hit during traffic spikes
Every vendor has per-minute concurrency caps. Implement token-bucket queuing client-side. For real-time, request quota increases before launch — approval takes 2–5 business days.
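A client-side token bucket is a few lines of stdlib Python. This sketch throttles request starts only; a real deployment also needs retry-with-backoff on 429 responses, which it does not handle:

```python
import time

class TokenBucket:
    """Client-side throttle: up to `capacity` requests in a burst,
    refilled at `rate` tokens per second. Call acquire() before
    each API request."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self, block: bool = True) -> bool:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            if not block:
                return False
            time.sleep((1 - self.tokens) / self.rate)
```

Size `rate` and `capacity` from your vendor's documented concurrency caps, leaving headroom for retries.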
4. Cost spikes from misconfigured batch jobs
Re-transcribing for a code change loops you through full vendor pricing. Cache transcripts keyed by audio file hash. Set per-user spend caps server-side. Alert when daily cost exceeds 2× rolling 7-day average.
5. Language-detection failure modes
Auto-detect works on clean monolingual audio. It fails on code-switching, short clips (<10 s), and accented English. When users provide language context, pass it explicitly — don't rely on detection.
6. Diarization labels are not stable across calls
"Speaker 0" in one transcript is not "Speaker 0" in the next. Re-identifying speakers across sessions requires speaker embeddings. Don't store "Speaker 0" as a user-facing label.
7. Streaming reconnects lose context
WebSocket disconnects (mobile network changes, Wi-Fi handoff, server-side timeouts after 4 hours) drop in-progress utterances. Buffer client-side audio for 2–3 seconds; on reconnect, replay the buffer.
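The replay buffer is just a bounded deque of timestamped chunks. A sketch with a 3-second window; real clients also need to dedupe the transcript overlap that replaying produces on the server side:

```python
import time
from collections import deque
from typing import List, Optional

class ReplayBuffer:
    """Keeps the last `window_s` seconds of sent audio chunks so a
    reconnecting WebSocket client can replay what the server may
    have dropped mid-utterance."""
    def __init__(self, window_s: float = 3.0):
        self.window_s = window_s
        self.chunks = deque()  # (monotonic_timestamp, chunk_bytes)

    def record(self, chunk: bytes, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.chunks.append((now, chunk))
        # Evict anything older than the replay window.
        while self.chunks and now - self.chunks[0][0] > self.window_s:
            self.chunks.popleft()

    def replay(self) -> List[bytes]:
        """Chunks to resend, oldest first, after reconnecting."""
        return [c for _, c in self.chunks]
```

Call `record()` on every chunk you send; on reconnect, send `replay()` before resuming the live stream.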
8. Custom vocabulary doesn't fix proper nouns automatically
"Boost" parameters help product names but won't fix names of people. For proper nouns, use the vendor's prompt or context field. Keyword lists over ~200 entries silently degrade accuracy.
Frequently Asked Questions
What's the cheapest transcription API?
As of May 2026, OpenAI's gpt-4o-mini-transcribe at $0.003/minute ($0.18/hour) is the cheapest list-price API from a major vendor. Soniox is effectively cheaper at ~$0.0017/minute on its token-based async pricing, and Rev AI Reverb matches OpenAI at $0.003/min. At very high volumes (>5M minutes/month) AWS Transcribe drops to $0.0078/min. Below Soniox's rate, no managed option gets cheaper without committing to volume tiers.
Which transcription API has the lowest latency?
ElevenLabs Scribe v2 Realtime publishes 150 ms first-word latency on the FLEURS benchmark — the only number we found measured against a public dataset. Gladia Solaria-1 publishes sub-103 ms partial-token latency (a different metric). Soniox reports ~200 ms. Deepgram Nova-3 reports under 300 ms. For voice-agent use cases (LLM completion budget around 800 ms), any of these leaves headroom.
Which transcription API supports the most languages?
Google Cloud Speech-to-Text V2 with Chirp 2 lists 125+ languages — the broadest. Gladia Solaria-1 lists 100, including languages not offered by other commercial APIs (long-tail African, Central Asian, and Pacific). AssemblyAI Universal-2 and ElevenLabs Scribe v2 each support 99. Real-time language coverage is always smaller than batch — verify your specific languages are streaming-supported before architecting on real-time.
Is OpenAI Whisper API the best for general use?
Not anymore. whisper-1 is now considered legacy; OpenAI itself recommends gpt-4o-transcribe for new projects (better accuracy, same $0.006/min) or gpt-4o-mini-transcribe for cost (half the price). Whisper Large-v3 open-source weights — which you self-host — score around 6.4% WER on the Hugging Face Open ASR Leaderboard. Strong, but a few open-source models now beat it on that benchmark.
What's the difference between batch and real-time transcription APIs?
Batch (also called async or prerecorded) accepts a complete audio file and returns the full transcript after processing — typically 5–30 seconds for short files, sub-real-time for long files. Real-time (also called streaming) processes audio chunks as they arrive, returning partial transcripts within 150–600 ms. Real-time costs 1.5–5× batch on most vendors. Choose batch unless your use case truly needs live captions.
Which APIs support speaker diarization out of the box?
Diarization (separating speakers without prior voice samples) is included on AssemblyAI Universal-2, Deepgram Nova-3, ElevenLabs Scribe v2, Google Cloud STT V2, AWS Transcribe, Azure AI Speech, Speechmatics, Gladia, Rev AI, and Soniox. OpenAI does not provide diarization on the transcription endpoint — you need a separate library (Pyannote, NVIDIA NeMo) post-hoc.
Are there HIPAA-compliant transcription APIs?
Yes. AWS Transcribe Medical, Azure AI Speech, Deepgram Nova-3, AssemblyAI, Rev AI (HIPAA Plus subscription), Google Cloud STT, and ElevenLabs Scribe v2 all sign Business Associate Agreements. OpenAI signs BAAs only on its Enterprise tier — standard API accounts cannot ship clinical workflows. Treat any vendor that won't sign a BAA as not HIPAA-eligible regardless of marketing language.
How do I handle files larger than 25 MB on the OpenAI API?
Chunk on silence boundaries, never on fixed time intervals. In Python: pydub.silence.split_on_silence(audio, min_silence_len=500, silence_thresh=-40). From the CLI: ffmpeg -af silencedetect=n=-30dB:d=0.5. Add 0.5-second overlap on each chunk and dedupe overlapping words at boundaries. Better: switch to a vendor without the 25 MB cap (Deepgram, AssemblyAI, Gladia all accept multi-GB uploads).
Can I self-host instead of using an API?
Yes. OpenAI Whisper is open-weights (MIT) since 2022. NVIDIA Canary and Parakeet variants are open-source and currently lead the Open ASR Leaderboard. Self-hosting costs roughly $0.001–$0.003/min on a reserved A10G/L4 GPU at 50–80% utilization, beating most APIs at scale, but adds MLOps overhead, eval pipelines, and on-call burden. For most teams, the developer-time-versus-cost tradeoff favors a managed API up to several thousand hours per month.
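You can sanity-check that tradeoff with napkin math. A sketch in which every default is an illustrative assumption (roughly an on-demand L4/A10G rate and Whisper-class throughput), not a measured number:

```python
def self_host_breakeven(api_rate_per_min: float,
                        gpu_cost_per_hour: float = 1.00,
                        realtime_factor: float = 30.0,
                        utilization: float = 0.6) -> dict:
    """Compare an always-on GPU against a per-minute API. Returns the
    monthly audio-hour volume where costs cross, and the most audio one
    GPU can transcribe per month at the given utilization."""
    monthly_gpu = gpu_cost_per_hour * 24 * 30          # always-on GPU bill
    breakeven = monthly_gpu / (api_rate_per_min * 60)  # hours where API bill matches
    capacity = realtime_factor * 24 * 30 * utilization
    return {"breakeven_hours": round(breakeven),
            "gpu_capacity_hours": round(capacity)}
```

At gpt-4o-transcribe's $0.006/min, an always-on $1/hour GPU breaks even around 2,000 audio hours per month, consistent with the "several thousand hours" rule of thumb. Remember this omits the MLOps and on-call costs, which usually dominate at small scale.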
What's the difference between Whisper and gpt-4o-transcribe?
Both are OpenAI hosted endpoints. whisper-1 is the original Whisper Large model from 2022, $0.006/min, REST-only (no streaming). gpt-4o-transcribe (March 2025) is built on the GPT-4o family — better accuracy on noisy audio, supports streaming via the OpenAI Realtime API, accepts a prompt parameter for context priming, and costs the same $0.006/min. For new projects OpenAI recommends gpt-4o-transcribe. whisper-1 stays available for backward compatibility.
Are there free transcription APIs for development?
Every vendor in this guide has a free tier or free credit. Most generous: Deepgram ($200 credit, no card, ~46,500 batch minutes at $0.0043/min), AssemblyAI ($50 credit, ~333 hours of Universal-2), ElevenLabs (10 hrs/mo Scribe v2), Gladia (10 hrs/mo), Speechmatics (480 min/mo). Google Cloud and AWS offer 60 minutes/month for 12 months. OpenAI has no free tier — you pay from minute one.
How accurate are these APIs in noisy real-world audio?
Word Error Rate degrades 2–4× on noisy audio (background music, reverb, low SNR) versus clean studio audio. Open ASR Leaderboard numbers (5–15% WER on benchmark sets) become 15–35% on phone-quality recordings. Deepgram Nova-3 and AssemblyAI Universal-2 publish noise-robustness benchmarks; most vendors do not. Beyond WER, the Koenecke et al. FAccT 2024 paper documented that ~1% of Whisper outputs hallucinate (entirely fabricated phrases) on real-world audio — independent of WER. Run voice-activity detection pre-processing and evaluate hallucination rate separately on your own data before shipping.
Methodology, Verification, and Conflict-of-Interest Disclosure
Verification window. All pricing, model names, language counts, free-tier values, and compliance attestations were verified against vendor pricing pages, vendor documentation, and (where applicable) vendor trust centers between May 1 and May 14, 2026. Where a number is "vendor-claimed" we say so explicitly. Where a number is independently reproduced (Open ASR Leaderboard, FLEURS), we cite the dataset.
Methodology. We tested each API on four audio classes: clean call-center English (90 s mono), long-form podcast (4 hr stereo 44.1 kHz), Spanish/English code-switched conversation (6 min), and degraded audio (30 s, 12 dB SNR pink noise). Pricing was modeled at four monthly volume points (10, 100, 1,000, 10,000 hours) using each vendor's published list rate with no commit discount applied. WER numbers come from the Hugging Face Open ASR Leaderboard where available; vendor-reported numbers are flagged as "self-reported."
Conflict of interest. This guide is published by VexaScribe. VexaScribe does not sell a transcription API. VexaScribe has no affiliate relationships with any vendor listed and received no compensation, free credits, or other consideration for inclusion or ranking. Outbound vendor links use rel="noopener" (not nofollow) because OpenAI, Google, AWS, and the others are authoritative entities. Editorial standards: see our editorial standards.