Formerly NovaScribe: same team, same product, refreshed name.
MP3 to Text Converter — Free, Accurate, AI-Powered
Convert any MP3 to a finished .txt, .docx, or .srt file in seconds. Whisper Large-v3 accuracy, 99 languages, 5 export formats.
VexaScribe (formerly NovaScribe) converts MP3 audio to text using OpenAI's Whisper Large-v3 model — trained on 680,000 hours of multilingual speech — and reaches around 97% practical accuracy on clean MP3s recorded at 128 kbps or higher, across 99 languages. Files up to 5 GB work on paid tiers; the free preview accepts up to 30 minutes of audio with a quick signup. A 60-minute episode typically completes in 110-140 seconds. The system auto-detects language, resamples to 16 kHz mono internally, and outputs 5 export formats including timestamped subtitles.
How It Works: 3 Steps from MP3 to Text
Converting an MP3 to text on VexaScribe takes three steps: upload the file, wait roughly 2 minutes per hour of audio, and download your transcript.
1. Upload your MP3
Drag and drop or browse. Accepts MP3, M4A, WAV, FLAC, OGG, and WebM files up to 5 GB. Files are auto-resampled to 16 kHz mono, so there is no need to convert beforehand.
2. Whisper transcribes the audio
The GPU pipeline auto-detects the language across 99 supported languages and runs Whisper Large-v3. A 1-hour MP3 takes about 110-140 seconds.
3. Edit and export
Review the transcript in the synced editor (speaker labels included), correct any words, and download as TXT, SRT, VTT, DOCX, or JSON.
Accuracy: What to Expect on Real MP3s
On clean studio MP3s at 128 kbps or higher, Whisper Large-v3 achieves around 97% word accuracy — 2.7% Word Error Rate on the LibriSpeech test-clean benchmark per OpenAI's Whisper paper. On noisy real-world recordings, expect 88-95%.
| Audio condition | Whisper Large-v3 WER | Practical accuracy | Editing needed |
|---|---|---|---|
| Studio podcast, single speaker, 192 kbps MP3 | 2.7% | ~97% | Light proofread |
| Zoom meeting, 2-4 speakers, 128 kbps MP3 | 5.2% | ~95% | Moderate cleanup, speaker labels |
| Phone-call recording, 64 kbps MP3 | ~9% | ~91% | Heavy edit on cross-talk |
| Field interview with background noise | ~12% | ~88% | Manual correction of proper nouns |
| Music with sung vocals (lyrics) | 25%+ | <75% | Not recommended — see FAQ |
These are realistic ranges, not marketing claims. Whisper Large-v3's reported 2.7% WER comes from the LibriSpeech test-clean benchmark — a clean read-speech corpus. Your real-world MP3 of a noisy cafe interview will score worse, and that is normal. We surface the model's confidence per segment so you can spot risky sections fast. For a deeper dive on accuracy across audio conditions, see how accurate is Whisper.
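The "practical accuracy" column is simply one minus the word error rate. If you want to score your own transcripts against a hand-corrected reference, the sketch below shows the usual WER calculation (word-level edit distance divided by the reference length). It assumes plain lowercased whitespace tokenization, which is a simplification of the text normalization that published benchmarks apply, so expect slightly different numbers from official figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance computed over words, not characters.
    # At the start of iteration i, prev[j] is the distance between
    # ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def practical_accuracy(wer: float) -> float:
    """Accuracy as used in the table: 1 - WER, floored at zero."""
    return max(0.0, 1.0 - wer)


if __name__ == "__main__":
    ref = "ask not what your country can do for you"
    hyp = "ask not what your county can do for you"
    wer = word_error_rate(ref, hyp)
    print(f"WER {wer:.1%}, accuracy {practical_accuracy(wer):.1%}")
```

One wrong word in a nine-word sentence is already an 11% WER, which is why short clips can look far worse than the averages in the table.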
MP3 Bitrate & Codec Guide: Will My File Work?
Any MP3 from 64 kbps up will transcribe accurately. Below 32 kbps, accuracy degrades sharply. Whisper handles MP3 better than most engines, but FLAC or WAV always edges out compressed audio in noisy conditions.
Accuracy by MP3 bitrate
| Bitrate | Use case | Expected accuracy | Recommendation |
|---|---|---|---|
| 320 kbps CBR | Music masters, studio podcast | ~97% | Ideal — re-encode unnecessary |
| 192 kbps CBR/VBR | Most podcasts, YouTube rips | ~96% | Excellent — upload as-is |
| 128 kbps CBR | Default export from most DAWs | ~95% | Good — upload as-is |
| 64 kbps mono | Phone calls, voicemail, AM radio | ~91% | Acceptable — expect proper-noun errors |
| ≤32 kbps | Heavily compressed legacy files | <85% | Re-record if possible |
MP3 vs WAV vs FLAC vs Opus
| Format | Compression | Whisper accuracy | File size (1 hr speech) |
|---|---|---|---|
| WAV | None (PCM) | Baseline (~97%) | ~600 MB |
| FLAC | Lossless | Baseline (~97%) | ~300 MB |
| MP3 192 kbps | Lossy | ~96% (≈10% relative WER increase in noisy conditions) | ~85 MB |
| OGG Opus 64 kbps | Lossy (modern) | ~96% (≈2% relative WER increase) | ~30 MB |
IBM Watson research found MP3 compression introduces roughly a 10% relative WER increase in noisy conditions versus uncompressed WAV, while OGG Opus only adds about 2%. A 2011 SciTePress study confirms MP3 does not significantly distort speech recognition down to about 24 kbps. Translation: don't waste effort upgrading a 128 kbps podcast to FLAC — Whisper barely notices. See all supported audio formats.
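The 10% and 2% figures are relative increases, so they scale whatever error rate the recording already has rather than adding percentage points. A small worked example, assuming baseline WERs of roughly 3% (clean speech) and 12% (noisy field interview) taken from the tables on this page; the function name is illustrative only:

```python
def degraded_wer(baseline_wer: float, relative_increase: float) -> float:
    """Apply a relative WER increase (0.10 means 10% worse) to a baseline WER."""
    return baseline_wer * (1.0 + relative_increase)


if __name__ == "__main__":
    for label, baseline in [("clean speech", 0.03), ("noisy interview", 0.12)]:
        for codec, increase in [("MP3", 0.10), ("OGG Opus", 0.02)]:
            wer = degraded_wer(baseline, increase)
            print(f"{label}, {codec}: WER {baseline:.1%} -> {wer:.2%}")
```

Under those assumptions a 3% baseline becomes about 3.3% with MP3, which is why re-encoding a clean podcast to FLAC changes so little.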
Speed & File Size Limits
VexaScribe transcribes a 1-hour MP3 in roughly 110-140 seconds on the cloud and accepts files up to 5 GB on paid tiers. That's about 25-30× real-time on our GPU pipeline.
Realistic processing times
- 10-minute MP3 → ~25 seconds
- 60-minute podcast → ~2 minutes
- 3-hour university lecture → ~6 minutes
- 8-hour all-day recording → ~16 minutes
CPU-only processing (running Whisper locally on a typical laptop) runs roughly 0.5× to 2× real-time, so a 1-hour MP3 takes 30 minutes to 2 hours on a laptop without a GPU. The MLCommons Whisper Inference v5.1 benchmark (September 2025) confirms these orders of magnitude on standard cloud hardware. See paid plan limits for full per-tier quotas.
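To estimate a job before you upload, divide the audio length by the real-time factor. The sketch below uses the 25-30× cloud range and the 0.5-2× CPU range quoted above; actual times will vary with queue load and audio content, and the function name is only illustrative.

```python
def processing_seconds(audio_minutes: float, realtime_factor: float) -> float:
    """Seconds of compute for a clip, given a speed multiple of real time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_minutes * 60.0 / realtime_factor


if __name__ == "__main__":
    jobs = [("10-minute MP3", 10), ("60-minute podcast", 60),
            ("3-hour lecture", 180), ("8-hour recording", 480)]
    for label, minutes in jobs:
        fast = processing_seconds(minutes, 30)  # upper end of the cloud range
        slow = processing_seconds(minutes, 25)  # lower end of the cloud range
        print(f"{label}: {fast:.0f}-{slow:.0f} s on the cloud")

    # CPU-only laptop range quoted above: 0.5x to 2x real time.
    best = processing_seconds(60, 2) / 60
    worst = processing_seconds(60, 0.5) / 60
    print(f"1 hour on a CPU-only laptop: {best:.0f}-{worst:.0f} min")
```

Read the other way, 110-140 seconds per hour of audio works out to roughly 26-33× real-time, consistent with the rounded 25-30× figure.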
Privacy & Data Handling
Your MP3 is uploaded over HTTPS, transcribed on isolated workers, and stored encrypted at rest. We don't train AI models on your audio.
- Encryption in transit: TLS 1.2+ between your browser and our API.
- Encryption at rest: Audio blobs and transcript records encrypted in AWS eu-west-2.
- Self-serve deletion: Delete files and transcripts at any time from your dashboard. Account deletion purges all data.
- No model training: We do not use customer audio to train Whisper or any other model.
- No third-party transcription APIs: Whisper runs on our own GPU infrastructure — your audio doesn't leave our environment.
Top Use Cases for MP3 to Text
MP3 is still the default export from podcast hosts, voice recorders, and call-recording apps — which makes MP3-to-text the most common transcription task.
Podcast show notes & SEO
Turn each episode into a searchable web page. Transcripts double episode discoverability and improve YouTube ranking when you also upload as captions.
Journalism interviews
Interview MP3s become quote-ready text with timestamps to cite. Multi-speaker recordings get speaker labels automatically.
Lecture & meeting notes
University recordings and Zoom MP3 exports become searchable study material. Generate AI summaries from the transcript.
Voice memo cleanup
iPhone Voice Memos export as M4A or MP3 — convert to text for note-taking, journaling, or integrating into Notion/Obsidian.
Subtitle generation
Generate SRT/VTT subtitles for video editors who recorded audio separately. Word-level timestamps for accurate sync.
Multilingual translation prep
Transcribe in source language, then translate the transcript to English (or 132 other languages) for international subtitles.
Export Formats Explained: TXT, DOCX, SRT, VTT, JSON
VexaScribe exports your MP3 transcript in 5 formats: plain TXT for reading, DOCX for editing in Word, SRT and VTT for subtitles, and JSON for developers.
| Format | Best for | Timestamps | Speaker labels |
|---|---|---|---|
| TXT | Reading, copy-paste, summaries | No | Optional |
| DOCX | Editing in Word, sharing with non-technical reviewers | Optional | Yes |
| SRT | YouTube, Premiere, Final Cut subtitles | Yes (HH:MM:SS,ms) | Optional |
| VTT | HTML5 video, web players | Yes (HH:MM:SS.ms) | Optional |
| JSON | Developer pipelines, custom UIs, search indexing | Yes (per-word) | Yes |
SRT export sample
1
00:00:00,000 --> 00:00:06,840
[Speaker 1]: And so, my fellow Americans, ask not what your country can do for you,

2
00:00:06,840 --> 00:00:11,200
ask what you can do for your country.
JSON export sample (excerpt)
{
  "language": "en",
  "duration_sec": 47.32,
  "segments": [{
    "start": 0.0,
    "end": 6.84,
    "speaker": "Speaker 1",
    "text": "And so, my fellow Americans..."
  }]
}
Need just subtitles? Use the dedicated SRT generator. Need a summary instead of a full transcript? See transcript to summary.
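If you work from the JSON export, you can rebuild subtitles yourself. The sketch below turns the `segments` list into SRT cues. The field names (`start`, `end`, `speaker`, `text`) come from the excerpt above; the full export schema is not documented on this page, so treat anything beyond those fields as an assumption.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(export: dict) -> str:
    """Build SRT text from the 'segments' list of a JSON export."""
    cues = []
    for index, seg in enumerate(export["segments"], start=1):
        speaker = seg.get("speaker")
        text = f"[{speaker}]: {seg['text']}" if speaker else seg["text"]
        cues.append(f"{index}\n{srt_timestamp(seg['start'])} --> "
                    f"{srt_timestamp(seg['end'])}\n{text}\n")
    # Joining with a newline leaves the blank line SRT requires between cues.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = json.loads("""
    {"language": "en", "duration_sec": 47.32,
     "segments": [{"start": 0.0, "end": 6.84, "speaker": "Speaker 1",
                   "text": "And so, my fellow Americans..."}]}
    """)
    print(segments_to_srt(sample))
```

For VTT, the same approach applies with a period instead of a comma in the timestamp and a `WEBVTT` line at the top of the file.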
Troubleshooting: Corrupt, DRM, M4P, and Silent MP3s
Most MP3 upload failures fall into 4 categories. Each has a specific fix.
1. Corrupt MP3 header ("invalid frame sync")
Symptom: Upload finishes but transcription fails immediately.
Cause: Incomplete download or aborted DAW export left a malformed MP3 header.
Fix: Run `ffmpeg -i broken.mp3 -c:a libmp3lame -b:a 192k fixed.mp3` to re-encode the audio into a fresh MP3 with clean headers, then retry the upload.
2. DRM-protected M4P (Apple Music, audiobook DRM)
Symptom: Upload returns "unsupported format" error.
Cause: M4P contains FairPlay DRM; the file cannot be decoded without removing the protection.
Fix: VexaScribe cannot and will not strip DRM. Re-record from the original source if you have legal rights, or purchase a DRM-free version.
3. Silent or truncated MP3
Symptom: Transcript returns empty or only the first 1-2 words.
Cause: File is silent past the first second, or audio is on a channel the decoder didn't pick up.
Fix: Verify with VLC. Re-export with `ffmpeg -i input.mp3 -ac 1 -ar 16000 mono.mp3` to force mono 16 kHz.
4. VBR with broken Xing/LAME header
Symptom: Duration displays incorrectly; transcript stops early.
Cause: Variable-bitrate file is missing the Xing/LAME header that signals duration to decoders.
Fix: Re-encode to constant bitrate: `ffmpeg -i vbr.mp3 -c:a libmp3lame -b:a 192k cbr.mp3`
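Before uploading, a quick local check can flag the first failure mode (a file that does not start with a valid MP3 frame). The sketch below is a heuristic only: it skips a leading ID3v2 tag if present and looks for the 11-bit MPEG frame sync. It does not validate the rest of the stream, and files with junk bytes or a very large tag (for example embedded cover art over 1 MiB) ahead of the first frame will be reported as suspicious even when ffmpeg can read them. Function names are illustrative.

```python
def id3v2_tag_length(data: bytes) -> int:
    """Length in bytes of a leading ID3v2 tag, or 0 when there is none."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    # The tag size is a 4-byte "syncsafe" integer: 7 useful bits per byte.
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    footer = 10 if data[5] & 0x10 else 0  # optional ID3v2.4 footer flag
    return 10 + size + footer


def starts_with_mp3_frame(data: bytes) -> bool:
    """True if an MPEG audio frame sync follows any leading ID3v2 tag."""
    offset = id3v2_tag_length(data)
    frame = data[offset:offset + 2]
    # Frame sync is 11 set bits: 0xFF, then a byte whose top 3 bits are 1.
    return len(frame) == 2 and frame[0] == 0xFF and (frame[1] & 0xE0) == 0xE0


def check_file(path: str, probe_bytes: int = 1 << 20) -> bool:
    """Read the first part of a file and run the heuristic on it."""
    with open(path, "rb") as handle:
        return starts_with_mp3_frame(handle.read(probe_bytes))
```

If `check_file("episode.mp3")` returns `False`, try the re-encode commands above before uploading.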
Power-User: Run Whisper Locally for Sensitive Files
For files you cannot legally upload, such as sealed legal recordings, classified research, or attorney-client material, you can run Whisper Large-v3 locally after a single pip install (ffmpeg must also be installed on the machine). It's slower (CPU 0.5-2× real-time), and once the model weights have been downloaded on first run, your audio never touches a network.
# Install Whisper locally (Python 3.9+)
pip install -U openai-whisper
# Transcribe an MP3 with the same model VexaScribe uses
whisper audio.mp3 --model large-v3 --language en --output_format srt
# For CPU-only laptops, use the smaller (faster) model:
whisper audio.mp3 --model medium --language en --output_format txt
When to use which:
- VexaScribe cloud (default): 25-30× real-time, polished editor, speaker diarization, 5 export formats.
- Local Whisper: Sensitive material that cannot leave the device. Best for full local control.
- OpenAI API: If you have OpenAI credits and 25 MB chunks are workable for your use case.
- Hybrid: Transcribe locally for sensitive files, then paste the TXT into transcript to summary for the summarization step.
See the official Whisper README for full installation instructions and supported model sizes.
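If you would rather script local transcription than use the command line, the same package exposes a small Python API. The sketch below is a minimal example, assuming `openai-whisper` and ffmpeg are installed; the helper names are ours, and the first call downloads the model weights, so run it once while online.

```python
def join_segments(segments: list[dict]) -> str:
    """Join Whisper segment texts into one transcript, one segment per line."""
    return "\n".join(seg["text"].strip() for seg in segments)


def transcribe_locally(path: str, model_name: str = "large-v3",
                       language: str = "en") -> str:
    """Transcribe a file with the openai-whisper package on this machine."""
    import whisper  # imported lazily so join_segments works without it

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(path, language=language)
    return join_segments(result["segments"])
```

Calling `transcribe_locally("audio.mp3", model_name="medium")` mirrors the CPU-friendly command above. Each entry in `result["segments"]` also carries `start` and `end` times if you need timestamps.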
Convert MP3s for Cents Per Minute
Pay-per-minute pricing means you pay only for what you transcribe. No per-export fees.
| Plan | Included minutes | Notes |
|---|---|---|
| Free trial | 30 min total | No credit card |
| Starter | 200 min/month | Solo creators |
| Pro | 2,500 min/month | Active publishers |
MP3 to Text — Frequently Asked Questions
Is MP3 to text really free on VexaScribe?
Yes, the free preview transcribes up to 30 minutes of audio with no credit card required. Paid plans start at $2/month for 200 minutes (Starter), $5/month for 1,000 minutes (Basic), $10/month for 2,500 minutes (Pro), and $20/month for 6,000 minutes (Studio). All paid tiers include all 5 export formats — TXT, DOCX, SRT, VTT, and JSON.
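To compare tiers, divide the monthly price by the included minutes. The sketch below uses the four plan prices listed in this answer and assumes you use every included minute; lighter usage raises the effective rate.

```python
PLANS = {  # plan name: (monthly price in USD, included minutes)
    "Starter": (2, 200),
    "Basic": (5, 1000),
    "Pro": (10, 2500),
    "Studio": (20, 6000),
}


def cents_per_minute(price_usd: float, minutes: int) -> float:
    """Effective cost in US cents per included minute."""
    return price_usd * 100 / minutes


if __name__ == "__main__":
    for name, (price, minutes) in PLANS.items():
        print(f"{name}: {cents_per_minute(price, minutes):.2f} cents/min")
```

At full usage, Starter comes to 1 cent per minute and Studio to about a third of a cent.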
How accurate is MP3-to-text transcription?
On clean MP3s recorded at 128 kbps or higher, OpenAI's Whisper Large-v3 model — which powers VexaScribe — reaches around 97% word accuracy (2.7% Word Error Rate on the LibriSpeech test-clean benchmark per the official Whisper paper). Real-world accuracy varies: studio podcasts 95-97%, multi-speaker Zoom recordings 90-95%, phone-call recordings 85-92%, and audio with heavy background noise 80-90%.
Is my MP3 file private?
Audio files transit over TLS 1.2+ encryption and are stored encrypted at rest in AWS eu-west-2. We do not train AI models on your audio. We do not sell user data. You can delete files at any time from your dashboard, and account deletion is self-serve.
Can VexaScribe identify multiple speakers in an MP3?
Yes, automatic speaker diarization is included on every transcript. Multiple speakers are labeled Speaker 1, Speaker 2, Speaker 3, and so on. You can rename speakers in the built-in editor (e.g., "Host", "Guest", actual names) before exporting.
What languages does the MP3 transcriber support?
VexaScribe transcribes MP3 files in 99 languages with automatic language detection — English, Spanish, German, Portuguese, French, Italian, Dutch, Polish, Russian, Mandarin, Japanese, Korean, Arabic, Hindi, Vietnamese, Thai, Turkish, and many more. No need to select the language manually.
Can I edit the transcript after it's generated?
Yes, every transcript opens in a synced editor where you can correct words, merge or split segments, rename speakers, and adjust timestamps. Edits save automatically. When you're ready, export to TXT, DOCX, SRT, VTT, or JSON.
How long can my MP3 be?
Free preview accepts files up to 30 minutes. Paid plans support files up to 5 GB and 10 hours per file. A 1-hour MP3 at 128 kbps is roughly 58 MB, well within both limits.
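You can check whether a file fits with simple bitrate arithmetic. The sketch below assumes a constant bitrate and ignores tags and container overhead, so treat the result as an estimate; the function names are illustrative.

```python
def mp3_size_mb(duration_hours: float, bitrate_kbps: float) -> float:
    """Approximate audio payload in MB (10^6 bytes) at a constant bitrate."""
    bytes_total = bitrate_kbps * 1000 / 8 * duration_hours * 3600
    return bytes_total / 1_000_000


def hours_that_fit(limit_mb: float, bitrate_kbps: float) -> float:
    """Hours of constant-bitrate audio that fit under a size limit in MB."""
    return limit_mb / mp3_size_mb(1, bitrate_kbps)


if __name__ == "__main__":
    print(f"1 h at 128 kbps: {mp3_size_mb(1, 128):.1f} MB")
    print(f"10 h at 128 kbps: {mp3_size_mb(10, 128):.0f} MB")
    print(f"5 GB holds {hours_that_fit(5000, 128):.0f} h at 128 kbps")
```

At typical speech bitrates the 10-hour cap is reached long before the 5 GB cap, so the size limit mostly matters for WAV or very high-bitrate uploads.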
Can I transcribe an MP3 offline without uploading?
Yes, you can run OpenAI's Whisper model locally on your own machine using `pip install openai-whisper`. CPU-only processing runs at roughly 0.5–2× real-time, so a 1-hour MP3 takes 30 minutes to 2 hours on a laptop. This is the most private option for sensitive files where you want full local control. VexaScribe's cloud is faster (25-30× real-time) and adds speaker diarization, a polished editor, and 5 export formats.
Does VexaScribe transcribe music lyrics from MP3s?
Not reliably. Whisper's Word Error Rate on music with sung vocals exceeds 25% — the model is trained for speech, not lyrics. For studio-quality spoken-word audio (podcasts, interviews, lectures), accuracy is excellent. For music transcription, use a dedicated lyrics-recognition service.
How long does processing take?
Roughly 110-140 seconds per hour of audio on the cloud, or about 25-30× real-time. A 10-minute MP3 takes ~25 seconds. A 1-hour podcast typically completes in about 2 minutes.
What's the maximum MP3 file size?
Free preview: 30 minutes per file. Paid plans: up to 5 GB and 10 hours per file. At 128 kbps, a 10-hour MP3 is only about 576 MB, so the duration limit is reached long before the size limit. The OpenAI Whisper API has a much smaller 25 MB hard cap by comparison; VexaScribe processes large files in a single pass without you needing to chunk them.
Which MP3 versions and codecs are supported?
All MPEG-1 Layer III variants are supported — Constant Bitrate (CBR), Variable Bitrate (VBR), and Average Bitrate (ABR) — at bitrates from 32 kbps to 320 kbps, mono and stereo. Whisper internally resamples to 16 kHz mono per the official README, so there's no need to convert your file before uploading.
Can I transcribe a DRM-protected file such as an M4P?
No. M4P FairPlay-protected files (Apple Music DRM) and any DRM-encrypted audio cannot be transcribed — VexaScribe cannot strip DRM. If you have legal rights to the audio, re-record from the original source or purchase a DRM-free version.
Why does my MP3 show different accuracy than expected?
MP3 compression introduces roughly a 10% relative WER increase versus uncompressed WAV in noisy conditions, per IBM Watson research. OGG Opus is more efficient, with only ~2% degradation. For best accuracy on long-form material, record in WAV or FLAC and transcribe from that original, converting to MP3 only for storage. For short clips, MP3 at 128 kbps or higher is indistinguishable from WAV in clean environments.
Learn more
- WAV to text: The lossless sibling, with an honest WAV vs MP3 comparison inside
- Transcribe audio to text: All formats, languages, accuracy
- Whisper transcription: Hosted Whisper Large-v3 with diarization
- SRT subtitle generator: Generate subtitles from MP3 audio
- Video to SRT: Upload video directly, with auto audio extraction + SRT
- MP4 to text: Video container to transcript (TXT, DOCX, JSON, SRT)
- M4A to text: iPhone Voice Memos and Apple ecosystem audio
- OGG to text: WhatsApp voice notes, Discord recordings, Android audio
- Transcribe Spanish audio: Spanish MP3 podcasts + dialect-specific accuracy notes
- Converter MP3 para texto (Português): Brazilian Portuguese guide covering bitrate, M4A vs MP3, LGPD, BRL pricing
- How to add subtitles to a video: Step-by-step for YouTube, Premiere, CapCut, iPhone
- Podcast transcription: Show notes, speaker labels, video subtitles
- Transcript to summary: AI summaries from any transcript
- How accurate is Whisper?: WER benchmarks & bitrate impact