Detailed alternatives
1. AssemblyAI Universal-2 — best for accuracy + built-in LLM features
Universal-2 (released 2024) sits within 1-2% WER of Deepgram Nova-3 on the Hugging Face Open ASR Leaderboard for English. The defining differentiator is LeMUR — AssemblyAI's LLM-on-transcript layer that adds summarization, sentiment analysis, custom topic extraction, and PII redaction in the same API call as transcription. That eliminates the need to orchestrate Deepgram + OpenAI/Anthropic for downstream NLP.
Pricing: $0.006/min async ($0.36/hr), $0.0085/min real-time. ~40% more expensive than Deepgram per minute. $50 in free credits at signup. Real-time streaming via Universal-Streaming.
Best when: You need transcription PLUS structured downstream insights (summarization, custom topics, PII redaction) and want one API call instead of orchestrating multiple services. The engineering time savings often exceed the price premium.
Avoid when: You only need transcription (Deepgram is cheaper and equally accurate); or you need coverage beyond its 17 supported languages (Whisper covers 99).
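To make the "one API call" point concrete, here is a minimal sketch of a single transcript request that also asks for summarization, sentiment analysis, and PII redaction. It only builds the request and does not send it. The endpoint and JSON field names are assumptions based on AssemblyAI's public v2 REST API and should be checked against the current documentation; the audio URL and API key are placeholders.

```python
import json
import urllib.request

# Assumed endpoint for AssemblyAI's v2 transcript API; verify against current docs.
ASSEMBLYAI_TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) one request for transcription plus downstream analysis."""
    payload = {
        "audio_url": audio_url,
        # The flags below are assumed field names for the add-on features.
        "summarization": True,
        "summary_model": "informative",
        "summary_type": "bullets",
        "sentiment_analysis": True,
        "redact_pii": True,
        "redact_pii_policies": ["person_name", "phone_number", "email_address"],
    }
    return urllib.request.Request(
        ASSEMBLYAI_TRANSCRIPT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_transcript_request("https://example.com/call.mp3", "YOUR_API_KEY")
    print(req.get_method(), req.full_url)
    print(json.loads(req.data))
```

In a real integration you would send this request, then poll the returned transcript ID until processing finishes. The point of the sketch is that the analysis options travel with the transcription job instead of needing a second service.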
2. OpenAI Whisper API — best when already using OpenAI
Hosted Whisper via OpenAI's API, exposed as the whisper-1 model. OpenAI's model documentation describes whisper-1 as the open-source large-v2 checkpoint rather than Large-v3, so it is still an open-source release with no proprietary fork. The pitch is consolidation: if your stack already uses OpenAI for LLM features, you get transcription under the same billing, same SDK, and same dashboard. The downside is that there is no streaming: the Whisper API is async-only, which makes it the wrong choice for real-time voice agents or live captions.
Pricing: $0.006/min ($0.36/hr). Same per-minute price as AssemblyAI, ~40% more than Deepgram. No dedicated free credits — you draw from your OpenAI account balance.
Best when: Your stack is OpenAI-native and you value billing consolidation. Or you specifically need 99-language coverage with Tier 1 accuracy on ~20 languages (English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Japanese, Chinese, Korean, etc.).
Avoid when: You need streaming. Or your volume is high enough that hosting Whisper Large-v3 yourself becomes cheaper (see break-even math).
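The async-only shape described above looks roughly like this with the official OpenAI Python SDK: you upload a complete file and wait for the full transcript. This is a sketch, not a production client. The client is passed in as a parameter (expected to be an openai.OpenAI() instance) so the function does not depend on how credentials are set up, and the 25 MB upload cap is the limit OpenAI documented at the time of writing, so treat it as an assumption.

```python
from pathlib import Path
from typing import Any

# Upload limit documented by OpenAI at the time of writing; treat as an assumption.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def transcribe_file(client: Any, path: str) -> str:
    """Upload one complete audio file and return the transcript text.

    `client` is expected to be an `openai.OpenAI()` instance.
    """
    audio_path = Path(path)
    if audio_path.stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{audio_path.name} exceeds the upload limit; split it first")
    with audio_path.open("rb") as audio_file:
        # The call blocks until the whole file has been transcribed:
        # there is no partial result while audio is still arriving.
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text
```

Usage would be along the lines of `from openai import OpenAI` followed by `transcribe_file(OpenAI(), "interview.mp3")`. Because the request only returns once the entire file is processed, there is no natural place to feed in live audio, which is the limitation for voice agents and captions.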
3. Self-hosted Whisper Large-v3 — best at scale, on-prem, and for compliance
OpenAI Whisper Large-v3 is MIT-licensed and runs on consumer GPUs. With faster-whisper (CTranslate2-based optimized inference), an RTX 4090 transcribes 4-8× real-time depending on settings. At GPU rental rates of $0.30-$0.50/hour from Vast.ai or RunPod, your effective per-minute cost drops to $0.001-$0.002 — below Deepgram's $0.0043/min.
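The per-minute figure follows from simple arithmetic: a GPU running at N× real-time processes 60 × N audio minutes per rented hour. A minimal sketch, assuming the GPU is busy for the whole hour and counting hardware cost only (the function name is illustrative):

```python
def cost_per_audio_minute(gpu_usd_per_hour: float, realtime_factor: float) -> float:
    """Cost of transcribing one minute of audio on a fully busy rented GPU.

    A GPU running at `realtime_factor`x processes 60 * realtime_factor
    audio minutes per wall-clock hour.
    """
    if gpu_usd_per_hour < 0 or realtime_factor <= 0:
        raise ValueError("rate must be >= 0 and realtime_factor must be > 0")
    return gpu_usd_per_hour / (60.0 * realtime_factor)


if __name__ == "__main__":
    deepgram = 0.0043  # USD per audio minute, as quoted above
    for rate in (0.30, 0.50):
        for speed in (4, 8):
            cost = cost_per_audio_minute(rate, speed)
            print(f"${rate:.2f}/hr at {speed}x -> ${cost:.4f}/min "
                  f"({cost / deepgram:.0%} of Deepgram)")
```

With the rental rates and speeds quoted above, the four corner cases run from roughly $0.0006 to $0.0021 per audio minute, which brackets the $0.001-$0.002 estimate. Engineering time and idle hours are not included.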
The honest tradeoff: you take on real ops work, including serving (faster-whisper, vLLM, or a similar inference server), autoscaling for traffic spikes, monitoring, GPU spot-instance handling, and model updates. For an engineering team of two with other priorities, the Deepgram premium is usually worth paying to avoid building this. For a team already operating production ML infrastructure, self-hosting is a clean win above 300-500 hrs/mo.
Best when: Volume exceeds ~500 hrs/mo of steady workload; OR you need on-premise/air-gapped deployment for compliance; OR you need to fine-tune the model on domain audio; OR you want to bundle WhisperX (word-timestamps) and pyannote (diarization) in a single self-hosted pipeline.
Avoid when: Your eng team doesn't have ML ops capacity; or your volume is sporadic (idle GPU cost destroys the economics); or you need sub-second streaming latency (Whisper transcribes fixed 30-second audio windows rather than a live stream, so it is a poor architectural fit).
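The idle-GPU point can be quantified. If you pay for a GPU around the clock but only transcribe for part of that time, the effective cost per audio minute rises in inverse proportion to utilization. The sketch below computes the utilization at which an always-on GPU merely matches an API price. The $0.40/hour and 6× figures in the example are assumed midpoints of the ranges quoted earlier, and engineering time is excluded.

```python
def effective_cost_per_audio_minute(
    gpu_usd_per_hour: float, realtime_factor: float, utilization: float
) -> float:
    """Hardware cost per audio minute when the GPU is busy only part of the time.

    `utilization` is the fraction (0 < u <= 1) of paid GPU time spent transcribing.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_usd_per_hour / (60.0 * realtime_factor * utilization)


def break_even_utilization(
    gpu_usd_per_hour: float, realtime_factor: float, api_usd_per_minute: float
) -> float:
    """Utilization at which the self-hosted hardware cost equals the API price."""
    full_load_cost = gpu_usd_per_hour / (60.0 * realtime_factor)
    return full_load_cost / api_usd_per_minute


if __name__ == "__main__":
    # Assumed midpoints: $0.40/hr rental, 6x real-time, Deepgram at $0.0043/min.
    u = break_even_utilization(0.40, 6, 0.0043)
    print(f"break-even utilization: {u:.0%}")
```

Under those assumptions the break-even sits at about 26% utilization: an always-on GPU that is busy less than roughly a quarter of the time costs more per audio minute than Deepgram before any ops work is counted.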
4. Speechmatics Ursa — best for accent and dialect coverage
Speechmatics' Ursa model is specifically tuned for robustness on accented English, regional dialects, and code-switching audio. Speechmatics is a UK-based vendor with substantial engineering investment in non-US English. If your product processes phone calls from globally distributed users, accented business audio, or non-standard speech, Speechmatics tends to outperform US-centric models like Deepgram and Whisper on the same audio.
Pricing: ~$0.025/min ($1.50/hr) — about 6× Deepgram. Higher than every other API on this list except Rev human. Justified only if accent/dialect performance is load-bearing.
Best when: Accent/dialect robustness is the dominant requirement and you've measured Deepgram or Whisper underperforming on your actual audio.
5. Rev AI — best when verbatim accuracy is required
Rev's AI API ($0.02/min, $1.20/hr consumer rate) is competitive but not category-leading on its own — its real value is as a draft feeding Rev's human transcription service ($1.99/min, $119/hr). For court depositions, broadcast captioning, medical dictation, or any deliverable where 95% AI accuracy is unacceptable, Rev human + AI is the only credible path to 98-99% verbatim accuracy at scale.
Honest framing: nobody picks Rev API alone over Deepgram on price or accuracy. You pick Rev because you need the human review pipeline for legal, regulatory, or broadcast compliance — and the AI API is the integration point that feeds into it.
Best when: Your deliverable requires >95% accuracy (legal, broadcast, medical) AND you need the human review pipeline integrated, not just “the most accurate AI.”
6. AWS Transcribe — best when already in AWS
Standard pricing ~$0.024/min ($1.44/hr) on list — about 5-6× Deepgram, which sounds untenable. The catch: most companies large enough to use AWS Transcribe at meaningful volume have committed-spend discount agreements that bring the effective price below list. AWS Transcribe also offers domain-specific variants (Medical, Call Analytics) and native integration with the AWS ML stack.
Pricing: $0.024/min standard, $0.0125/min batch tier above 250k minutes/month. AWS Free Tier includes 60 minutes/month for 12 months. Real pricing usually negotiated.
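To see what a volume tier does to the blended rate, here is a small sketch that bills each slice of monthly usage at its own rate. It takes the two rates quoted above at face value and assumes the lower rate applies only to minutes beyond the 250k threshold; the tier table is hypothetical and AWS's actual tier structure may differ.

```python
from typing import Optional, Sequence, Tuple

# (upper bound in minutes, or None for "no limit", USD per minute)
Tier = Tuple[Optional[float], float]

# Hypothetical tier table built from the two rates quoted above.
ASSUMED_TIERS: Sequence[Tier] = ((250_000, 0.024), (None, 0.0125))


def tiered_monthly_cost(minutes: float, tiers: Sequence[Tier] = ASSUMED_TIERS) -> float:
    """Bill each slice of usage at the rate of the tier it falls into."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    total, lower = 0.0, 0.0
    for upper, rate in tiers:
        if minutes <= lower:
            break
        cap = minutes if upper is None else min(minutes, upper)
        total += (cap - lower) * rate
        lower = cap
    return total


if __name__ == "__main__":
    usage = 400_000
    cost = tiered_monthly_cost(usage)
    print(f"{usage:,} min -> ${cost:,.2f} (${cost / usage:.4f}/min blended)")
```

Under these assumptions, 400,000 minutes in a month costs about $7,875, a blended rate just under $0.02/min. That is still several times Deepgram's list price, which is why negotiated discounts matter more than the public tiers.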
Best when: Your infrastructure is AWS-native and you have an existing AWS committed-spend agreement that subsidizes transcription. Also: when AWS Transcribe Medical (PHI-compliant medical vocabulary) or Call Analytics (call center insights) matches your specific vertical.
7. Google Cloud Speech-to-Text — best when already in GCP
Standard pricing ~$0.024/min ($1.44/hr). Chirp 2 model offers enhanced accuracy at similar rates. Like AWS, the list price is high but real-world pricing under GCP committed-use discounts is usually competitive with Deepgram. Native integration with Vertex AI, Document AI, and the rest of Google's ML stack.
Pricing: $0.024/min standard. The first 60 min/month are free per project, then the standard tier kicks in. Chirp 2 is priced similarly.
Best when: You're building on Google Cloud with Vertex AI for downstream ML, you have GCP committed-spend, or you need Chirp 2's strong multi-language performance integrated with the rest of GCP's services.
8. Azure AI Speech — best when already in Azure
Standard speech-to-text ~$1.00/audio-hour ($0.0167/min). About 30% cheaper than AWS Transcribe and GCP STT on list ($1.00 vs. $1.44 per hour). Custom Speech allows training domain-adapted models. Conversation Transcription Service adds diarization with audio fingerprinting for known speakers.
Pricing: $1/audio-hour standard pay-as-you-go. Commitment tiers reduce this. Free 5 audio hours/month included.
Best when: You're Microsoft-stack-native (Azure infrastructure, Microsoft 365 integration), need Azure OpenAI for downstream LLM workflows, or specifically need Custom Speech for a domain-trained model. Less compelling than Deepgram for greenfield projects without Azure commitment.
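The "N× Deepgram" comparisons used throughout these sections can be reproduced from the list prices quoted above. The sketch below converts each per-minute price to a per-hour price and a multiple of Deepgram's $0.0043/min. It uses list prices only, so committed-spend discounts, free tiers, and volume tiers are ignored.

```python
from typing import Dict, Tuple

DEEPGRAM_USD_PER_MIN = 0.0043  # baseline quoted in this article

# List prices per audio minute, as quoted in the sections above.
LIST_PRICES_USD_PER_MIN: Dict[str, float] = {
    "AssemblyAI Universal-2 (async)": 0.006,
    "OpenAI Whisper API": 0.006,
    "Speechmatics Ursa": 0.025,
    "Rev AI": 0.02,
    "AWS Transcribe": 0.024,
    "Google Cloud Speech-to-Text": 0.024,
    "Azure AI Speech": 1.00 / 60,  # quoted per audio hour
}


def compare_to_baseline(
    prices: Dict[str, float], baseline: float
) -> Dict[str, Tuple[float, float]]:
    """Map each vendor to (USD per audio hour, multiple of the baseline price)."""
    return {name: (price * 60, price / baseline) for name, price in prices.items()}


if __name__ == "__main__":
    table = compare_to_baseline(LIST_PRICES_USD_PER_MIN, DEEPGRAM_USD_PER_MIN)
    for name, (per_hour, multiple) in sorted(table.items(), key=lambda kv: kv[1][1]):
        print(f"{name:32s} ${per_hour:5.2f}/hr  {multiple:4.1f}x Deepgram")
```

On these inputs AssemblyAI and the Whisper API come out at about 1.4× Deepgram, Azure at about 3.9×, AWS and Google at about 5.6×, and Speechmatics at about 5.8×, which matches the rounded multiples used in the text.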
A note on VexaScribe
If you're a developer building with ASR, ignore us — Deepgram, AssemblyAI, OpenAI Whisper API, or self-hosted Whisper are your real options. We're an end-user web app, not a developer API. You can't POST audio to us and get a JSON response back the way you can with Deepgram.
If you landed here because you searched “Deepgram” but you're actually trying to transcribe a podcast, a meeting, or an interview without writing code, then VexaScribe's upload tool is the right shape — 30 minutes free at signup, no card, full export to TXT/DOCX/SRT/VTT/JSON. But that's a different category from what this page is about.