Formerly NovaScribe — same team, same product, refreshed name. Read the announcement →
Whisper Transcription Online — Hosted OpenAI Whisper in 99 Languages
Run Whisper Large-v3 in your browser. No Python, no CUDA, no per-minute API math. Files up to 5 GB, 99 languages, SRT/VTT export.
Whisper transcription converts speech to text using OpenAI's Whisper, an automatic speech recognition model trained on 680,000 hours of audio across 99 languages. VexaScribe (formerly NovaScribe) runs Whisper Large-v3 on its own GPUs so you can transcribe files up to 5 GB without installing Python, managing CUDA, or paying per-minute API fees. Whisper ships in eight checkpoints from tiny (39M parameters) up to large-v3 (1,550M parameters); our default is Large-v3, which scores 2.0% Word Error Rate on LibriSpeech test-clean and 7.44% mean WER across the eight-dataset Open ASR Leaderboard. We are an independent service and not affiliated with OpenAI; "Whisper" was released by OpenAI in September 2022 under the MIT license. Drop in any MP3, WAV, M4A, FLAC, MP4, or MOV and get back SRT, VTT, plain text, or JSON with word-level timestamps in roughly one-tenth of real time.
What Is Whisper?
Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in September 2022. It is a Transformer encoder-decoder trained on 680,000 hours of weakly supervised multilingual audio scraped from the public web, distributed under the MIT license. Roughly 117,000 hours of that training data are non-English, which is why a single set of weights can transcribe 99 languages and translate any of them into English.
Unlike commercial ASR systems that wrap proprietary acoustic and language models, Whisper is a single sequence-to-sequence Transformer. It performs language identification, voice activity detection, punctuation, and timestamp generation as part of the same forward pass. Because the model and weights are public, anyone can run Whisper locally on a GPU, fine-tune it for a specific domain, or — like VexaScribe — host it as a managed service.
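Because the weights are public, the fastest way to see what Whisper actually does is to run it yourself. Here is a minimal local-run sketch, assuming the open-source `whisper` package (`pip install openai-whisper`) and FFmpeg on your PATH; the file name is illustrative:

```python
# Minimal local Whisper run. large-v3 downloads ~3.1 GB of weights on
# first use and realistically needs a GPU with roughly 10 GB of VRAM.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("interview.mp3")  # language auto-detected

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the full transcript as one string
```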
The lineage matters when you read claims online. Whisper Large-v3 (November 2023) uses 128 mel-frequency bins and adds a Cantonese token. Whisper Turbo (officially large-v3-turbo, October 2024) has only 4 decoder layers — down from 32 — making it 4–8× faster at the cost of a small accuracy drop. As of 2026, OpenAI also offers two newer hosted-only models, gpt-4o-transcribe and gpt-4o-mini-transcribe, both released March 20, 2025; they are not the same as Whisper and have different licensing. See our deep-dive on Whisper accuracy.
Whisper Model Family: Every Checkpoint Compared
Whisper ships in eight checkpoints from tiny (39M parameters, ~75 MB on disk) up to large-v3 (1,550M parameters, ~3.1 GB). The October 2024 large-v3-turbo is an 809M-parameter distillation that runs roughly 4–8× faster than Large-v3 with only a small accuracy penalty.
| Model | Parameters | Disk | Recommended use | English-only? | WER |
|---|---|---|---|---|---|
| tiny | 39M | ~75 MB | Quick draft on a CPU laptop | Yes (tiny.en) | ~7.6% |
| base | 74M | ~142 MB | Edge/mobile, voice notes | Yes (base.en) | ~5.0% |
| small | 244M | ~466 MB | Balanced quality on a single consumer GPU | Yes (small.en) | ~3.4% |
| medium | 769M | ~1.5 GB | Production English on a 5 GB GPU | Yes (medium.en) | ~2.9% |
| large (v1) | 1,550M | ~3.0 GB | Original 2022 multilingual flagship | No | ~2.7% |
| large-v2 | 1,550M | ~3.0 GB | Dec 2022 retraining; better non-English | No | ~2.4% |
| large-v3 ★ | 1,550M | ~3.1 GB | VexaScribe default; best overall accuracy | No | ~2.0% |
| large-v3-turbo | 809M | ~1.6 GB | When latency matters, ~4–8× faster | No | ~2.1% |
WER values on LibriSpeech test-clean. Large-v3 and large-v3-turbo numbers are from the official Hugging Face model cards; smaller-checkpoint WERs are from the OpenAI Whisper paper Table 9.
The English-only .en variants outperform their multilingual counterparts on English audio at the same parameter count, especially for tiny.en and base.en. Whisper Turbo is not trained for translation tasks — if you need to convert non-English speech into English subtitles, use multilingual large-v3.
Whisper Accuracy: WER by Language and Condition
Whisper Large-v3 averages 7.44% Word Error Rate across the eight-dataset Open ASR Leaderboard, but real-world WER ranges from ~2% on clean read English to over 20% on accented telephony or non-English low-resource speech.
The single biggest variable is not the model, it is the audio: read speech in a quiet room can show roughly five times lower WER than the same speaker on a noisy Zoom call.
| Language tier | Examples | Clean speech WER | Noisy speech WER |
|---|---|---|---|
| Tier 1 — high-resource Western | English, Spanish, French, German, Portuguese, Italian | 3–8% | 10–18% |
| Tier 2 — high-resource non-Western | Mandarin, Japanese, Russian, Korean, Arabic | 6–12% | 14–22% |
| Tier 3 — medium-resource | Dutch, Polish, Turkish, Vietnamese, Thai | 8–15% | 18–28% |
| Tier 4 — low-resource | Swahili, Bengali, Tamil, Welsh, Marathi | 15–30% | 30–55% |
| Reference — LibriSpeech test-clean (English read) | American English audiobooks, single speaker | ~2.0% | n/a |
These ranges combine OpenAI's own FLEURS evaluation in the Whisper paper with reproductions on Common Voice and the Open ASR Leaderboard. For a deeper breakdown of WER methodology, see our Whisper accuracy guide.
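If you want to reproduce WER figures yourself, the arithmetic is simple: WER is (substitutions + deletions + insertions) divided by the number of reference words. A toy sketch using the open-source `jiwer` package (`pip install jiwer`); the sentences are made up for illustration:

```python
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions ("jumped", "a") against 9 reference words:
print(jiwer.wer(reference, hypothesis))  # ~0.222, i.e. 22.2% WER
```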
How VexaScribe Uses Whisper Under the Hood
VexaScribe is a hosted Whisper service. Your file is resampled to 16 kHz mono, fed to Whisper Large-v3 on our GPUs, optionally diarized for speaker labels, then returned as SRT, VTT, plain text, or JSON.
1. **Ingest.** Upload over TLS 1.2+, with files up to 5 GB. Audio is extracted from video files automatically.
2. **Decode and normalize.** FFmpeg decodes any of 17 supported formats and resamples to 16 kHz mono PCM, the input format Whisper expects (steps 2–4 are sketched in the code after this list).
3. **Voice-activity detection.** Silence and non-speech regions are trimmed before the model sees them, reducing hallucination on quiet sections.
4. **Whisper Large-v3 inference.** Default model. A 60-minute file completes in 5–10 minutes on our GPU pipeline.
5. **Speaker diarization (optional).** Speaker turns are identified and aligned with Whisper word timestamps to produce Speaker 1 / Speaker 2 output.
6. **Export.** SRT, VTT, plain text, DOCX, or JSON with word-level timestamps and confidence scores.
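None of this machinery is secret. Here is a condensed sketch of steps 2–4 as you might reproduce them yourself, using the open-source `faster-whisper` package and an `ffmpeg` binary on your PATH. Our production pipeline is not public, so the function and file names below are illustrative:

```python
# Steps 2-4 of a Whisper pipeline: normalize audio, trim silence with
# VAD, run large-v3 inference. Requires: pip install faster-whisper
import subprocess
from faster_whisper import WhisperModel

def normalize(src: str, dst: str = "normalized.wav") -> str:
    """Step 2: decode any container, drop video, resample to 16 kHz mono PCM."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1",
         "-ar", "16000", "-c:a", "pcm_s16le", dst],
        check=True,
    )
    return dst

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Steps 3-4: vad_filter trims silent regions before inference, which
# is what reduces hallucination on quiet sections.
segments, info = model.transcribe(
    normalize("meeting.mp4"), vad_filter=True, word_timestamps=True
)
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")
```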
Hosted Whisper vs OpenAI API vs Self-Hosted
If you transcribe under ~10 hours per month and you can write code, the OpenAI Whisper API is competitive at $0.006/minute. If you have an idle GPU, self-hosting is free per minute. VexaScribe is the right choice when you want speaker labels, files larger than 25 MB, no per-minute math, and a UI.
| Criterion | VexaScribe | OpenAI direct | Self-hosted |
|---|---|---|---|
| Cost per audio hour | $0 free tier; $0.20–$0.40/hr at volume on paid plans | $0.36/hr ($0.006/min, no minimum) | $0 in fees, but GPU electricity + amortized hardware (~$0.05–$0.30/hr) |
| Max file size | 5 GB per file (up to 10 hours) | 25 MB per request — chunk longer files yourself | Limited only by your disk |
| Languages | 99 (Whisper Large-v3) | 99 (whisper-1) | 99 (any checkpoint) |
| Speaker diarization | Included automatically | Not in whisper-1; separate model needed | Not included — install Pyannote and align manually |
| File formats supported | 17 formats (MP3, WAV, M4A, FLAC, OGG, MP4, MOV, WebM, MKV, AAC, AIFF, WMA, AMR, OPUS, AVI, FLV, WMV) | M4A, MP3, MP4, MPEG, MPGA, WAV, WEBM (≤25 MB) | Whatever FFmpeg can decode |
| Setup time | Under 30 seconds (sign in, drag a file) | ~10 minutes (API key + curl/Python) | Hours to days (CUDA, PyTorch, model download, chunking, VAD) |
| GPU required | No — we run the GPUs | No — OpenAI runs them | Yes — ~10 GB VRAM for Large-v3, ~6 GB for Turbo |
| Extra features included | Diarization, AI summaries, translation (133 languages), SRT/VTT export, word-level timestamps | Word-level timestamps; translation to English. No UI, no diarization | Whatever you build yourself |
Read this table honestly: if your monthly volume is small and you are comfortable with code, the OpenAI API direct can be cheaper than VexaScribe per minute — that is just true. VexaScribe wins when (a) your files exceed 25 MB and you do not want to chunk them, (b) you need speaker labels without writing diarization code, (c) you want SRT/VTT formatted to broadcast standards, or (d) you want a flat monthly bill instead of per-minute math.
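For reference, this is roughly what the "OpenAI direct" column looks like in practice, using the official `openai` Python SDK; the file name is illustrative, and the request must stay under the 25 MB cap:

```python
# Minimal OpenAI-direct transcription. Requires: pip install openai
# Reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

with open("podcast.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="srt",  # or "json", "text", "verbose_json", "vtt"
    )

print(transcript)
```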
What Whisper Does Not Do Well
Whisper hallucinates entire sentences on silent or near-silent audio, performs poorly on song lyrics and overlapping speech, has no built-in speaker diarization, and cannot stream audio in real time without external chunking logic.
1. Hallucination on silent or non-speech audio
The 2024 ACM FAccT paper "Careless Whisper" (Koenecke et al.) found that roughly 1% of Whisper transcriptions contained entire hallucinated phrases, and that 38% of those hallucinations carried explicit harms — invented violence, false medical claims, or fabricated authority. Hallucinations cluster on segments with longer non-vocal pauses.
2. Music, song lyrics, and overlapping speech
Whisper was trained mostly on monologue and dialogue, not lyrics. Singing, autotuned vocals, and dense backing tracks produce garbled or invented transcripts. Two people speaking at once degrades quality sharply: the decoder picks one voice and mostly ignores the other.
3. No built-in speaker diarization
Stock Whisper outputs a single stream of text. To produce Speaker 1 / Speaker 2 output, you must run a separate diarization model and align its segment boundaries with Whisper's word timestamps. VexaScribe does this for you automatically.
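If you do want to wire this up yourself, the usual recipe pairs `pyannote.audio` with Whisper's word timestamps. An illustrative sketch, assuming pyannote.audio 3.x (which requires accepting the model's terms and supplying a Hugging Face token) and Whisper words carrying start/end times; the helper name is ours:

```python
from pyannote.audio import Pipeline

# Requires a Hugging Face access token with the model terms accepted.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."
)
diarization = pipeline("meeting.wav")

# Flatten the diarization output into (start, end, speaker) turns.
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

def speaker_for(word_start: float, word_end: float) -> str:
    """Assign a Whisper word to the diarization turn containing its midpoint."""
    mid = (word_start + word_end) / 2
    for start, end, speaker in turns:
        if start <= mid <= end:
            return speaker
    return "UNKNOWN"
```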
4. No real-time streaming in standard Whisper
Whisper is a sequence-to-sequence model that processes 30-second windows, so the canonical implementation is not streaming-capable. Real-time variants exist (faster-whisper + WebRTC, OpenAI's gpt-4o-transcribe Realtime API), but they require additional engineering.
How to Use Whisper Transcription on VexaScribe
Drop in any audio or video file up to 5 GB, choose Whisper Large-v3 plus your language, and download the result as SRT, VTT, plain text, or JSON.
1. **Upload your file.** Drag and drop any MP3, WAV, M4A, FLAC, OGG, MP4, MOV, MKV, or WebM file up to 5 GB. Files travel over TLS 1.2+ and are stored encrypted at rest.
2. **Pick Large-v3 and language.** Whisper Large-v3 is the default and the most accurate choice. Choose one of 99 languages or leave auto-detect on. Toggle speaker diarization on if you need labels.
3. **Export and share.** Download as SRT, VTT, plain text, DOCX, or JSON with word-level timestamps. Send the transcript straight to the AI summary tool.
Whisper-Supported Languages: All 99, with Quality Tiers
Whisper supports 99 languages out of a single Large-v3 checkpoint, with English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Mandarin, and Japanese all in the high-quality tier (sub-10% WER on clean read speech). Cantonese was added as a separate token in Large-v3 in November 2023.
Tier 1 — Production-grade (3–8% WER on clean speech)
English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Catalan, Russian, Japanese, Mandarin, Korean.
Tier 2 — Solid (6–12% WER)
Arabic, Turkish, Vietnamese, Thai, Indonesian, Hebrew, Hindi, Czech, Greek, Hungarian, Finnish, Swedish, Norwegian, Danish, Romanian, Bulgarian, Ukrainian, Cantonese.
Tier 3 — Usable but proofread (8–15% WER)
Tagalog, Swahili, Bengali, Tamil, Telugu, Urdu, Persian, Malay, Welsh, Slovak, Slovenian, Croatian, Serbian, Lithuanian, Latvian, Estonian.
Tier 4 — Experimental (15%+ WER)
Yoruba, Maori, Lao, Khmer, Burmese, Pashto, Sindhi, Tatar, Sundanese, Lingala, Luxembourgish, Faroese, Maltese, Hausa.
For non-English audio, Whisper can also translate directly into English (the translate task) without a separate translation step. If you need to translate into a language other than English, see transcribe and translate audio.
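In the local `whisper` package the translate task is a single argument. A minimal sketch (the Spanish file name is made up):

```python
import whisper

model = whisper.load_model("large-v3")
# task="translate" makes the decoder emit English text directly,
# whatever the source language of the audio.
result = model.transcribe("entrevista_es.mp3", task="translate")
print(result["text"])
```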
Privacy and Data Handling
Your audio is encrypted in transit (TLS 1.2+) and at rest in AWS eu-west-2. We do not use your files to train Whisper or any other model, and you can delete any file with one click.
- Encryption: TLS 1.2+ in transit; encrypted at rest in AWS eu-west-2.
- No training on your data: Your audio and transcripts are never used to train or fine-tune any model. Whisper Large-v3 is a frozen open-source checkpoint we run as-is.
- Self-serve deletion: Delete any file at any time from your dashboard. Account deletion purges all transcripts and audio.
- Whisper inference on our infrastructure: The transcription step runs on our own GPU pipeline — your audio does not leave our environment for inference.
See our full privacy policy and editorial standards.
Pricing: VexaScribe vs OpenAI Whisper API
The OpenAI Whisper API is $0.006/minute ($0.36/hour) with no minimums or subscriptions, billed per second. VexaScribe paid plans bundle storage, diarization, summaries, and a UI into a flat monthly fee. The breakeven where VexaScribe becomes cheaper is around 50 hours of audio per month.
| Workload | VexaScribe | OpenAI direct | Verdict |
|---|---|---|---|
| 10 hrs/month | Basic $5/mo (1,000 min ≈ 16 hrs); Starter $2/mo (200 min) covers only ~3 hrs | 10 × 60 × $0.006 = $3.60 | OpenAI direct is cheaper at this volume |
| 50 hrs/month | Studio $20/mo (6,000 min ≈ 100 hrs); Pro $10/mo (2,500 min ≈ 41 hrs) falls short | 50 × 60 × $0.006 = $18.00 | Roughly breakeven; Studio costs $2 more but includes summaries + diarization |
| 100 hrs/month | Studio $20/mo (6,000 min — 100 hrs) | 100 × 60 × $0.006 = $36.00 | VexaScribe saves $16/mo and includes UI |
| 200 hrs/month | Studio $20/mo with overage at $0.0033/min — about $40/mo total | 200 × 60 × $0.006 = $72.00 | VexaScribe saves ~$32/mo at this volume |
VexaScribe pricing reflects published Starter / Basic / Pro / Studio plans on /pricing. OpenAI rates from their official API pricing page (whisper-1, $0.006/minute, verified May 2026).
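The table's arithmetic, written out; the only inputs are numbers cited above (the Studio quota, the $0.0033/min overage rate, and OpenAI's $0.006/min):

```python
def openai_cost(hours: float, rate_per_min: float = 0.006) -> float:
    return hours * 60 * rate_per_min

def studio_cost(hours: float, base: float = 20.0,
                included_min: int = 6_000,
                overage_per_min: float = 0.0033) -> float:
    extra_min = max(0.0, hours * 60 - included_min)
    return base + extra_min * overage_per_min

for hrs in (50, 100, 200):
    print(f"{hrs} hrs: OpenAI ${openai_cost(hrs):.2f} "
          f"vs Studio ${studio_cost(hrs):.2f}")
# 50 hrs:  OpenAI $18.00 vs Studio $20.00  (roughly breakeven)
# 100 hrs: OpenAI $36.00 vs Studio $20.00
# 200 hrs: OpenAI $72.00 vs Studio $39.80
```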
Whisper Transcription — Frequently Asked Questions
What is Whisper?
Whisper is an open-source automatic speech recognition (ASR) model released by OpenAI in September 2022 under the MIT license. It is a Transformer encoder-decoder trained on 680,000 hours of multilingual, multitask audio. Whisper ships in eight checkpoints from tiny (39M parameters) to large-v3 (1,550M parameters) and transcribes 99 languages, with the ability to translate any of them into English.
How accurate is Whisper transcription?
Whisper Large-v3 scores around 2.0% Word Error Rate on LibriSpeech test-clean (clean read English) and a 7.44% mean WER across the eight-dataset Hugging Face Open ASR Leaderboard. Real-world accuracy is lower, meaning higher WER: typically 8–15% on Zoom calls and 15–25% on phone calls or in noisy environments. Accuracy is highest for English, Spanish, French, German, and Mandarin; lowest for low-resource languages like Yoruba, Lao, or Pashto. See our Whisper accuracy benchmarks for full per-language numbers.
What languages does Whisper support?
99 languages, all from the same Whisper Large-v3 checkpoint. The high-quality tier (sub-10% WER on clean speech) includes English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Mandarin, Japanese, and Korean. Cantonese was added as a separate token in Large-v3 in November 2023. Whisper can also translate any of the 99 source languages into English in a single pass — no separate translation step.
Do I need a GPU to use Whisper?
Not on VexaScribe — we run the GPUs. If you self-host, the tiny and base models run on a CPU at near real-time speed, but Whisper Large-v3 effectively requires a CUDA GPU with at least 10 GB VRAM, and Large-v3 Turbo needs about 6 GB. On a CPU-only laptop, Large-v3 takes 10–30× the audio duration to transcribe, which is rarely practical.
How does VexaScribe differ from running Whisper locally?
Three things: (1) you skip the Python, CUDA, FFmpeg, and Pyannote setup plus the GPU itself, (2) we add speaker diarization, silence-trimming voice-activity detection, and SRT/VTT subtitle formatting on top of the raw Whisper output, and (3) files can be up to 5 GB and 10 hours per upload versus the OpenAI API's 25 MB cap. If you are happy doing per-minute math and writing code, the OpenAI API is a good fit; if you want flat-fee plans and built-in features, VexaScribe is.
Is VexaScribe affiliated with OpenAI?
No. VexaScribe (formerly NovaScribe) is an independent service and not affiliated with, endorsed by, or sponsored by OpenAI. We use OpenAI's open-source Whisper model under its MIT license. "Whisper" and "OpenAI" are trademarks of OpenAI.
What's the difference between Whisper and gpt-4o-transcribe?
Whisper is OpenAI's open-source ASR model from 2022; the weights are public under MIT. gpt-4o-transcribe and gpt-4o-mini-transcribe are newer hosted-only transcription models OpenAI released on March 20, 2025. They are not open source, you cannot run them locally, and they use a different architecture built on top of GPT-4o. Pricing is similar — gpt-4o-transcribe matches Whisper at $0.006/min; gpt-4o-mini-transcribe is $0.003/min. VexaScribe runs Whisper Large-v3 because it is open and reproducible.
What's the difference between Whisper Large-v3 and Whisper Turbo?
Whisper Large-v3 (November 2023) has 1,550M parameters and 32 decoder layers. Whisper Large-v3 Turbo (October 2024) has 809M parameters and only 4 decoder layers. Turbo runs roughly 4–8× faster than Large-v3 depending on hardware, and needs about 6 GB VRAM instead of 10 GB. Accuracy is slightly worse — around 7.83% mean WER on the Open ASR Leaderboard versus 7.44% for Large-v3. Turbo is also not trained for translation, only transcription.
How does VexaScribe handle long files (over the OpenAI API 25 MB limit)?
The OpenAI Whisper API rejects requests larger than 25 MB, so longer files must be chunked client-side. VexaScribe accepts files up to 5 GB in a single upload — we chunk on our side using FFmpeg with silence-aware boundaries, run Whisper on each segment, and stitch the timestamps back together so the output is a single continuous transcript with consistent speaker labels.
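Conceptually the chunking step looks like this. Our production pipeline uses FFmpeg, but the open-source `pydub` package shows the silence-aware idea in a few lines; the thresholds and file name here are illustrative:

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("ten_hour_hearing.mp3")
chunks = split_on_silence(
    audio,
    min_silence_len=700,   # ms of silence that counts as a boundary
    silence_thresh=-40,    # dBFS below which audio is "silent"
    keep_silence=200,      # keep a little padding on each side
)
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
# Each chunk is transcribed separately; segment timestamps are then
# offset by the chunk's start time and concatenated.
```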
Does Whisper hallucinate?
Yes. The 2024 ACM FAccT paper "Careless Whisper" by Koenecke et al. found that about 1% of Whisper transcriptions contained entirely fabricated phrases that were not in the underlying audio, and that 38% of those contained explicit harms — invented violence, false authority claims, or fabricated medical content. Hallucinations cluster on segments with long silences. VexaScribe runs voice-activity detection before Whisper to trim those silences, which reduces but does not eliminate hallucinations — proofread anything safety-critical.
Can I export the timestamps from Whisper?
Yes. VexaScribe exports word-level timestamps in JSON (each word with start, end, and confidence), segment-level timestamps in SRT and VTT subtitle formats, and plain text without timestamps. SRT and VTT exports are formatted to broadcast-friendly defaults so you can drop them straight into a video editor or upload to YouTube.
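If you would rather build subtitles from the JSON export yourself, SRT is simple enough to emit by hand. A sketch assuming a list of segment dicts with `start`, `end`, and `text` keys, as in the JSON export described above:

```python
def srt_time(seconds: float) -> str:
    """Format seconds as the SRT timecode HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[dict]) -> str:
    """Render numbered SRT blocks separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello and welcome."}]))
```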
Is my audio used to train Whisper?
No. Whisper Large-v3 is a frozen open-source checkpoint released by OpenAI under MIT — we run it as-is and do not retrain it. We do not use your audio or transcripts to train any other model either. Audio files transit over TLS 1.2+ and are stored encrypted at rest in AWS eu-west-2. You can delete any file at any time from your dashboard.
Related VexaScribe tools
Transcribe audio to text
All formats, 99 languages, 95% accuracy
MP3 to text
Bitrate, codec, and edge-case troubleshooting
SRT subtitle generator
Whisper transcripts as broadcast-ready subtitles
Transcribe and translate
133-language translation in one pass
How accurate is Whisper?
WER benchmarks across LibriSpeech, FLEURS
Transcript summary
Turn any transcript into a structured AI summary