MP3 to Text — Transcribe .mp3 Audio Files to Text (99 Languages)

Convert any .mp3 audio file to .txt, .docx, .srt, .vtt, or .json in seconds. Whisper Large-v3, 99 languages, files up to 5 GB.

VexaScribe (formerly NovaScribe) converts MP3 audio to text using OpenAI's Whisper Large-v3 model. The original Whisper release was trained on 680,000 hours of multilingual speech, and Large-v3 reaches around 97% practical accuracy on clean MP3s recorded at 128 kbps or higher. It supports 99 languages. Files up to 5 GB work on paid tiers; the free preview accepts up to 30 minutes of audio with a quick signup. A 60-minute episode typically completes in 110-140 seconds. The system auto-detects language, resamples to 16 kHz mono internally, and outputs 5 export formats including timestamped subtitles.

30 minutes free · No credit card · 99 languages · 5 GB per file · URL support

Paste an MP3 URL — no download needed

New for July 2026: instead of downloading and re-uploading, paste a link directly into the dashboard. This works with direct podcast episode MP3 URLs (the .mp3 link inside an RSS feed), Google Drive public share links (the confirmation token for files over 25 MB is handled automatically), Dropbox shares, S3 URLs, or any HTTPS URL that directly serves an MP3 file. The minute cost is the same as for a file upload; the download step just happens on our side.

Also supports YouTube, TikTok, and Instagram video posts (audio track extracted automatically). See all supported URL sources.
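The "direct .mp3 link inside an RSS feed" is usually the `url` attribute of an item's `<enclosure>` element. The sketch below shows one way to pull those links out of a feed with Python's standard library so they can be pasted into the dashboard. The feed content and the `mp3_enclosures` helper are hypothetical and only for illustration; this is not part of VexaScribe.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/audio/ep1.mp3" length="57600000" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2 (video)</title>
      <enclosure url="https://example.com/video/ep2.mp4" length="104857600" type="video/mp4"/>
    </item>
  </channel>
</rss>"""


def mp3_enclosures(feed_xml: str) -> list[tuple[str, str]]:
    """Return (episode title, enclosure URL) pairs for MP3 enclosures."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        for enclosure in item.findall("enclosure"):
            url = enclosure.get("url", "")
            mime = enclosure.get("type", "")
            # Accept either the MP3 MIME type or a path ending in .mp3.
            path = url.lower().split("?")[0]
            if mime == "audio/mpeg" or path.endswith(".mp3"):
                results.append((title, url))
    return results


for title, url in mp3_enclosures(SAMPLE_FEED):
    print(f"{title}: {url}")
```

For a live feed, the XML would first need to be fetched (for example with `urllib.request`), and feeds that use XML namespaces or non-standard enclosure types may need extra handling.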

File size cap comparison

MP3 files are compact but not tiny — a 2-hour podcast at 128 kbps is roughly 115 MB. Long lectures, multi-hour interviews, or archived Zoom recordings can easily hit 500 MB or more. Free browser tools reject them; we accept up to 5 GB per file. Verified against vendor pricing pages on July 3, 2026.

Tool	File size cap	What fits
VexaScribe	5 GB per file	~90 hours of MP3 at 128 kbps — no practical limit for podcasts, lectures, or archived recordings
OpenAI Whisper API (raw)	25 MB per file	~26 minutes of MP3 at 128 kbps — requires manual chunking for longer files
Zamzar	100 MB (free) / 400 MB (paid)	~1.7 hours at 128 kbps on the free tier (~6.9 hours paid); rejects long podcasts, multi-hour interviews, or Zoom archives
audioconverter.ai (free)	Undocumented (small)	Fine for short clips only — no size guarantee for long-form audio
audiototext.com (free)	~25 MB typical browser upload cap	Short clips; browser rejects large files

The 5 GB cap matters in three practical scenarios: multi-hour Zoom or Google Meet recordings exported as MP3, full-length podcast episodes at higher bitrates (192-320 kbps), and archived recordings you did not compress heavily. Under 100 MB, any tool works; over 100 MB, most free tools reject the file.
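The size figures in this section follow from simple bitrate arithmetic: size is roughly bitrate times duration. The sketch below reproduces them, assuming constant bitrate, 1 MB = 1,000,000 bytes, and no metadata overhead. The function names are illustrative.

```python
def mp3_size_mb(duration_hours: float, bitrate_kbps: int) -> float:
    """Approximate MP3 size in MB (1 MB = 1,000,000 bytes), ignoring tags."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return duration_hours * 3600 * bytes_per_second / 1_000_000


def hours_that_fit(cap_mb: float, bitrate_kbps: int) -> float:
    """Hours of audio that fit under a size cap at a constant bitrate."""
    return cap_mb / mp3_size_mb(1.0, bitrate_kbps)


# A 2-hour podcast at 128 kbps
print(round(mp3_size_mb(2, 128), 1), "MB")

# How much 128 kbps audio fits under each cap from the table
for label, cap_mb in [("5 GB", 5000), ("100 MB", 100), ("25 MB", 25)]:
    hours = hours_that_fit(cap_mb, 128)
    print(f"{label}: {hours:.1f} h ({hours * 60:.0f} min)")
```

At 128 kbps one hour is 57.6 MB, so 5 GB works out to roughly 87 hours, 100 MB to about 1.7 hours, and 25 MB to about 26 minutes. Variable-bitrate files will deviate from these estimates.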

How It Works: 3 Steps from MP3 to Text

Converting an MP3 to text on VexaScribe takes three steps: upload the file, wait roughly 2 minutes per hour of audio, and download your transcript.

1. Upload your MP3
Drag and drop or browse. Accepts MP3, M4A, WAV, FLAC, OGG, and WebM up to 5 GB. Files are auto-resampled to 16 kHz mono, so there is no need to convert beforehand.
2. Whisper transcribes the audio
The GPU pipeline auto-detects language across 99 supported languages and runs Whisper Large-v3. A 1-hour MP3 takes about 110-140 seconds.
3. Edit and export
Review the transcript in the synced editor (speaker labels included), correct any words, and download as TXT, SRT, VTT, DOCX, or JSON.

Accuracy: What to Expect on Real MP3s

On clean studio MP3s at 128 kbps or higher, Whisper Large-v3 achieves around 97% word accuracy — 2.7% Word Error Rate on the LibriSpeech test-clean benchmark per OpenAI's Whisper paper. On noisy real-world recordings, expect 88-95%.

Audio condition	Whisper Large-v3 WER	Practical accuracy	Editing needed
Studio podcast, single speaker, 192 kbps MP3	2.7%	~97%	Light proofread
Zoom meeting, 2-4 speakers, 128 kbps MP3	5.2%	~95%	Moderate cleanup, speaker labels
Phone-call recording, 64 kbps MP3	~9%	~91%	Heavy edit on cross-talk
Field interview with background noise	~12%	~88%	Manual correction of proper nouns
Music with sung vocals (lyrics)	25%+	<75%	Not recommended — see FAQ

These are realistic ranges, not marketing claims. Whisper Large-v3's reported 2.7% WER comes from the LibriSpeech test-clean benchmark — a clean read-speech corpus. Your real-world MP3 of a noisy cafe interview will score worse, and that is normal. We surface the model's confidence per segment so you can spot risky sections fast. For a deeper dive on accuracy across audio conditions, see how accurate is Whisper.
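Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words; "accuracy" in the table is roughly 1 minus WER. Below is a minimal sketch of that calculation using a word-level edit distance. It only lowercases and splits on whitespace, whereas published benchmarks apply more careful text normalization, so its numbers will not match benchmark figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "ask not what your country can do for you"
hypothesis = "ask not what your county can do for you"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}, accuracy: {1 - wer:.1%}")
```

One wrong word out of nine gives a WER of about 11%, which shows why short clips can score far from the averages quoted above.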

MP3 Bitrate & Codec Guide: Will My File Work?

Any MP3 from 64 kbps up will transcribe accurately. Below 32 kbps, accuracy degrades sharply. Whisper handles MP3 better than most engines, but FLAC or WAV always edges out compressed audio in noisy conditions.

Accuracy by MP3 bitrate

Bitrate	Use case	Expected accuracy	Recommendation
320 kbps CBR	Music masters, studio podcast	~97%	Ideal — re-encode unnecessary
192 kbps CBR/VBR	Most podcasts, YouTube rips	~96%	Excellent — upload as-is
128 kbps CBR	Default export from most DAWs	~95%	Good — upload as-is
64 kbps mono	Phone calls, voicemail, AM radio	~91%	Acceptable — expect proper-noun errors
≤32 kbps	Heavily compressed legacy files	<85%	Re-record if possible

MP3 vs WAV vs FLAC vs Opus

Format	Compression	Whisper accuracy	File size (1 hr speech)
WAV	None (PCM)	Baseline (~97%)	~600 MB
FLAC	Lossless	Baseline (~97%)	~300 MB
MP3 192 kbps	Lossy	~96% (≈10% relative WER increase in noisy conditions)	~85 MB
OGG Opus 64 kbps	Lossy (modern)	~96% (≈2% relative WER increase)	~30 MB

IBM Watson research found MP3 compression introduces roughly a 10% relative WER increase in noisy conditions versus uncompressed WAV, while OGG Opus only adds about 2%. A 2011 SciTePress study confirms MP3 does not significantly distort speech recognition down to about 24 kbps. Translation: don't waste effort upgrading a 128 kbps podcast to FLAC — Whisper barely notices. See all supported audio formats.
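A relative WER increase is easy to misread as percentage points. The short sketch below applies the 10% and 2% relative figures quoted above to a noisy-recording baseline of 12% WER (the field-interview row in the accuracy table). The baseline choice and the function name are illustrative assumptions.

```python
def apply_relative_wer_increase(baseline_wer: float, relative_increase: float) -> float:
    """Scale a baseline WER by a relative increase (0.10 means 10% worse)."""
    return baseline_wer * (1 + relative_increase)


baseline = 0.12  # ~12% WER, noisy field interview
for codec, relative in [("MP3", 0.10), ("OGG Opus", 0.02)]:
    wer = apply_relative_wer_increase(baseline, relative)
    print(f"{codec}: {wer:.2%} WER ({wer - baseline:.2%} absolute increase)")
```

Under these assumptions MP3 moves a 12% WER to about 13.2%, a change of roughly 1.2 percentage points, not 10.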

Speed & File Size Limits

VexaScribe transcribes a 1-hour MP3 in roughly 110-140 seconds on the cloud and accepts files up to 5 GB on paid tiers. That's about 25-30× real-time on our GPU pipeline.

Realistic processing times

• 10-minute MP3 → ~25 seconds
• 60-minute podcast → ~2 minutes
• 3-hour university lecture → ~6 minutes
• 8-hour all-day recording → ~16 minutes

CPU-only processing (running Whisper locally on a typical laptop) runs roughly 0.5× to 2× real-time, so a 1-hour MP3 takes 30 minutes to 2 hours on a laptop without a GPU. The MLCommons Whisper Inference v5.1 benchmark (September 2025) confirms these orders of magnitude on standard cloud hardware. See paid plan limits for full per-tier quotas.
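The processing times above are audio duration divided by a speed factor relative to real time. A minimal sketch of that arithmetic follows, with the speed factors taken from this section. Real jobs also include upload and queue time, which this ignores.

```python
def processing_seconds(audio_minutes: float, speed_factor: float) -> float:
    """Wall-clock seconds to process audio at `speed_factor` times real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_minutes * 60 / speed_factor


# Cloud estimates at 30x real time
for label, minutes in [("10-minute MP3", 10), ("60-minute podcast", 60),
                       ("3-hour lecture", 180), ("8-hour recording", 480)]:
    print(f"{label}: ~{processing_seconds(minutes, 30) / 60:.1f} min")

# CPU-only laptop range for a 1-hour file (0.5x to 2x real time)
slow = processing_seconds(60, 0.5) / 60
fast = processing_seconds(60, 2.0) / 60
print(f"Laptop CPU: {fast:.0f} to {slow:.0f} minutes")
```

Read in reverse, the quoted 110-140 seconds per hour of audio corresponds to roughly 26× to 33× real time (3600/140 and 3600/110).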

Privacy & Data Handling

Your MP3 is uploaded over HTTPS, transcribed on isolated workers, and stored encrypted at rest. We don't train AI models on your audio.

Encryption in transit: TLS 1.2+ between your browser and our API.
Encryption at rest: Audio blobs and transcript records encrypted in AWS eu-west-2.
Self-serve deletion: Delete files and transcripts at any time from your dashboard. Account deletion purges all data.
No model training: We do not use customer audio to train Whisper or any other model.
No third-party transcription APIs: Whisper runs on our own GPU infrastructure — your audio doesn't leave our environment.

Sensitive material? If you're handling legally sensitive recordings — attorney-client communications, internal HR investigations, classified research — see the local Whisper alternative below. For most podcasts, lectures, and meetings, our cloud is the faster path with full encryption.

Top Use Cases for MP3 to Text

MP3 is still the default export from podcast hosts, voice recorders, and call-recording apps — which makes MP3-to-text the most common transcription task.

Podcast show notes & SEO

Turn each episode into a searchable web page. Transcripts make episodes discoverable through text search and can help YouTube ranking when you also upload them as captions.

Journalism interviews

Interview MP3s become quote-ready text with timestamps to cite. Multi-speaker recordings get speaker labels automatically.

Lecture & meeting notes

University recordings and Zoom MP3 exports become searchable study material. Generate AI summaries from the transcript.

Voice memo cleanup

iPhone Voice Memos export as M4A or MP3 — convert to text for note-taking, journaling, or integrating into Notion/Obsidian.

Subtitle generation

Generate SRT/VTT subtitles for video editors who recorded audio separately. Word-level timestamps for accurate sync.

Multilingual translation prep

Transcribe in source language, then translate the transcript to English (or 132 other languages) for international subtitles.

After transcribing your MP3: use AI Chat to ask the transcript natural-language questions — “what were the main points?”, “did they mention the deadline?” — and get answers with clickable timestamps that jump the audio player to the exact moment. Available on paid plans.

Export Formats Explained: TXT, DOCX, SRT, VTT, JSON

VexaScribe exports your MP3 transcript in 5 formats: plain TXT for reading, DOCX for editing in Word, SRT and VTT for subtitles, and JSON for developers.

Format	Best for	Timestamps	Speaker labels
TXT	Reading, copy-paste, summaries	No	Optional
DOCX	Editing in Word, sharing with non-technical reviewers	Optional	Yes
SRT	YouTube, Premiere, Final Cut subtitles	Yes (HH:MM:SS,ms)	Optional
VTT	HTML5 video, web players	Yes (HH:MM:SS.ms)	Optional
JSON	Developer pipelines, custom UIs, search indexing	Yes (per-word)	Yes

SRT export sample

```
1
00:00:00,000 --> 00:00:06,840
[Speaker 1]: And so, my fellow Americans,
ask not what your country can do for you,

2
00:00:06,840 --> 00:00:11,200
ask what you can do for your country.
```

JSON export sample (excerpt)

```json
{
  "language": "en",
  "duration_sec": 47.32,
  "segments": [{
    "start": 0.0,
    "end": 6.84,
    "speaker": "Speaker 1",
    "text": "And so, my fellow Americans..."
  }]
}
```
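For developers working from the JSON export, the sketch below shows how segments shaped like the excerpt above could be turned into SRT cues. It assumes only the fields visible in the excerpt (`start`, `end`, `speaker`, `text`); the full export may contain more, and the helper names are hypothetical.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(export: dict) -> str:
    """Build SRT text from an export dict with a 'segments' list."""
    cues = []
    for index, seg in enumerate(export["segments"], start=1):
        # Prefix the speaker label when the segment carries one.
        label = f"[{seg['speaker']}]: " if seg.get("speaker") else ""
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{label}{seg['text']}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


export = json.loads("""
{
  "language": "en",
  "duration_sec": 47.32,
  "segments": [{
    "start": 0.0,
    "end": 6.84,
    "speaker": "Speaker 1",
    "text": "And so, my fellow Americans..."
  }]
}
""")
print(segments_to_srt(export))
```

For WebVTT the same approach applies, with a `WEBVTT` header line and a period instead of a comma before the milliseconds.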

Need just subtitles? Use the dedicated SRT generator. Need a summary instead of a full transcript? See transcript to summary.

Troubleshooting: Corrupt, DRM, M4P, and Silent MP3s

Most MP3 upload failures fall into 4 categories. Each has a specific fix.

1. Corrupt MP3 header ("invalid frame sync")

Symptom: Upload finishes but transcription fails immediately.

Cause: Incomplete download or aborted DAW export left a malformed MP3 header.

Fix: Run `ffmpeg -i broken.mp3 -c:a libmp3lame -b:a 192k fixed.mp3` to re-encode the file into a clean MP3 stream, then retry the upload.

2. DRM-protected M4P (Apple Music, audiobook DRM)

Symptom: Upload returns "unsupported format" error.

Cause: M4P contains FairPlay DRM; the file cannot be decoded without removing the protection.

Fix: VexaScribe cannot and will not strip DRM. Re-record from the original source if you have legal rights, or purchase a DRM-free version.

3. Silent or truncated MP3

Symptom: Transcript returns empty or only the first 1-2 words.

Cause: File is silent past the first second, or audio is on a channel the decoder didn't pick up.

Fix: Verify with VLC. Re-export with `ffmpeg -i input.mp3 -ac 1 -ar 16000 mono.mp3` to force mono 16 kHz.

4. VBR with broken Xing/LAME header

Symptom: Duration displays incorrectly; transcript stops early.

Cause: Variable-bitrate file is missing the Xing/LAME header that signals duration to decoders.

Fix: Re-encode to constant bitrate: `ffmpeg -i vbr.mp3 -c:a libmp3lame -b:a 192k cbr.mp3`
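Before uploading, a quick local check can tell a real MP3 apart from a mislabeled file, such as an M4P or WAV renamed to .mp3. The sketch below is a heuristic only: it looks for an ID3v2 tag or an MPEG frame sync (eleven set bits) at the start of the data. A file that passes can still be damaged further in, and this is not how VexaScribe validates uploads.

```python
def looks_like_mp3(first_bytes: bytes) -> bool:
    """Heuristic: True if data starts with an ID3v2 tag or an MPEG frame sync."""
    if first_bytes[:3] == b"ID3":
        return True
    # An MPEG audio frame header begins with 11 set bits:
    # 0xFF followed by a byte whose top three bits are all 1.
    return (
        len(first_bytes) >= 2
        and first_bytes[0] == 0xFF
        and (first_bytes[1] & 0xE0) == 0xE0
    )


def file_looks_like_mp3(path: str) -> bool:
    """Read the first few bytes of a file and apply the heuristic."""
    with open(path, "rb") as handle:
        return looks_like_mp3(handle.read(10))


print(looks_like_mp3(b"ID3\x04\x00\x00"))     # tagged MP3
print(looks_like_mp3(b"\xff\xfb\x90\x00"))    # bare MPEG frame header
print(looks_like_mp3(b"RIFF\x00\x00\x00\x00WAVE"))  # WAV container
```

For a thorough check, decoding the file with ffmpeg remains the more reliable test.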

Power-User: Run Whisper Locally for Sensitive Files

For files you cannot legally upload (sealed legal recordings, classified research, attorney-client material), you can run Whisper Large-v3 locally. Installation is a single pip install, plus ffmpeg, which Whisper uses to decode audio. It's slower (CPU 0.5-2× real-time), but after the one-time model download your audio never leaves the machine.

```bash
# Install Whisper locally (Python 3.9+)
pip install -U openai-whisper

# Transcribe an MP3 with the same model VexaScribe uses
whisper audio.mp3 --model large-v3 --language en --output_format srt

# For CPU-only laptops, use the smaller (faster) model:
whisper audio.mp3 --model medium --language en --output_format txt
```

When to use which:

VexaScribe cloud (default): 25-30× real-time, polished editor, speaker diarization, 5 export formats.
Local Whisper: Sensitive material that cannot leave the device. Best for full local control.
OpenAI API: If you have OpenAI credits and 25 MB chunks are workable for your use case.
Hybrid: Transcribe locally for sensitive files, then paste the TXT into transcript to summary for the summarization step.

See the official Whisper README for full installation instructions and supported model sizes.
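When scripting local transcription over many files, it can help to build the `whisper` command in code. The sketch below assembles the same CLI invocation shown above and picks the model by hardware, following this section's suggestion of `medium` for CPU-only machines. The `build_whisper_command` helper and its `has_gpu` parameter are hypothetical; only the CLI flags come from the Whisper command-line tool.

```python
import shlex


def build_whisper_command(
    audio_path: str,
    has_gpu: bool,
    language: str = "en",
    output_format: str = "srt",
    output_dir: str = ".",
) -> list[str]:
    """Return the argv list for the local `whisper` CLI."""
    # large-v3 on GPU machines, the smaller medium model on CPU-only laptops.
    model = "large-v3" if has_gpu else "medium"
    return [
        "whisper", audio_path,
        "--model", model,
        "--language", language,
        "--output_format", output_format,
        "--output_dir", output_dir,
    ]


command = build_whisper_command("audio.mp3", has_gpu=False, output_format="txt")
print(shlex.join(command))
# To actually run it (requires openai-whisper and ffmpeg to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.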

Convert MP3s for Cents Per Minute

Pay-per-minute pricing means you pay only for what you transcribe. No per-export fees.

Plan	Price	Included minutes	Notes
Free trial	Free	30 min total	No credit card
Starter	$2/month	200 min/month	Solo creators
Pro	$10/month	2,500 min/month	Active publishers

See all plans, including Basic and Studio →

MP3 to Text — Frequently Asked Questions

Is MP3 to text really free on VexaScribe?

Yes, the free preview transcribes up to 30 minutes of audio with no credit card required. Paid plans start at $2/month for 200 minutes (Starter), $5/month for 1,000 minutes (Basic), $10/month for 2,500 minutes (Pro), and $20/month for 6,000 minutes (Studio). All paid tiers include all 5 export formats — TXT, DOCX, SRT, VTT, and JSON.
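To compare tiers, the effective per-minute price can be worked out from the plan figures above. The sketch assumes the full monthly allowance is used; unused minutes raise the effective rate.

```python
# (monthly price in USD, included minutes) from the plan list above
PLANS = {
    "Starter": (2, 200),
    "Basic": (5, 1000),
    "Pro": (10, 2500),
    "Studio": (20, 6000),
}


def cents_per_minute(price_usd: float, minutes: int) -> float:
    """Effective cost in cents per transcribed minute at full usage."""
    return price_usd * 100 / minutes


for name, (price, minutes) in PLANS.items():
    print(f"{name}: {cents_per_minute(price, minutes):.2f} cents/min")
```

At full usage this works out to 1 cent per minute on Starter, falling to about a third of a cent on Studio.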

How accurate is MP3-to-text transcription?

On clean MP3s recorded at 128 kbps or higher, OpenAI's Whisper Large-v3 model — which powers VexaScribe — reaches around 97% word accuracy (2.7% Word Error Rate on the LibriSpeech test-clean benchmark per the official Whisper paper). Real-world accuracy varies: studio podcasts 95-97%, multi-speaker Zoom recordings 90-95%, phone-call recordings 85-92%, and audio with heavy background noise 80-90%.

Is my MP3 file private?

Audio files transit over TLS 1.2+ encryption and are stored encrypted at rest in AWS eu-west-2. We do not train AI models on your audio. We do not sell user data. You can delete files at any time from your dashboard, and account deletion is self-serve.

Can VexaScribe identify multiple speakers in an MP3?

Yes, automatic speaker diarization is included on every transcript. Multiple speakers are labeled Speaker 1, Speaker 2, Speaker 3, and so on. You can rename speakers in the built-in editor (e.g., "Host", "Guest", actual names) before exporting.

What languages does the MP3 transcriber support?

VexaScribe transcribes MP3 files in 99 languages with automatic language detection — English, Spanish, German, Portuguese, French, Italian, Dutch, Polish, Russian, Mandarin, Japanese, Korean, Arabic, Hindi, Vietnamese, Thai, Turkish, and many more. No need to select the language manually.

Can I edit the transcript after it's generated?

Yes, every transcript opens in a synced editor where you can correct words, merge or split segments, rename speakers, and adjust timestamps. Edits save automatically. When you're ready, export to TXT, DOCX, SRT, VTT, or JSON.

How long can my MP3 be?

Free preview accepts files up to 30 minutes. Paid plans support files up to 5 GB and 10 hours per file. A 1-hour MP3 at 128 kbps is roughly 58 MB, well within both limits.

Can I transcribe an MP3 offline without uploading?

Yes, you can run OpenAI's Whisper model locally on your own machine using `pip install openai-whisper`. CPU-only processing runs at roughly 0.5–2× real-time, so a 1-hour MP3 takes 30 minutes to 2 hours on a laptop. This is the most private option for sensitive files where you want full local control. VexaScribe's cloud is faster (25-30× real-time) and adds speaker diarization, a polished editor, and 5 export formats.

Does VexaScribe transcribe music lyrics from MP3s?

Not reliably. Whisper's Word Error Rate on music with sung vocals exceeds 25% — the model is trained for speech, not lyrics. For studio-quality spoken-word audio (podcasts, interviews, lectures), accuracy is excellent. For music transcription, use a dedicated lyrics-recognition service.

How long does processing take?

Roughly 110-140 seconds per hour of audio on the cloud, or about 25-30× real-time. A 10-minute MP3 takes ~25 seconds. A 1-hour podcast typically completes in about 2 minutes.

What's the maximum MP3 file size?

Free preview: 30 minutes per file. Paid plans: up to 5 GB and 10 hours per file. At 128 kbps, 10 hours of MP3 is only about 580 MB, so the duration limit is reached long before the size cap. By comparison, the OpenAI Whisper API has a much smaller 25 MB hard cap; VexaScribe processes large files in a single pass without you needing to chunk them.

Which MP3 versions and codecs are supported?

All MPEG-1 Layer III variants are supported — Constant Bitrate (CBR), Variable Bitrate (VBR), and Average Bitrate (ABR) — at bitrates from 32 kbps to 320 kbps, mono and stereo. Whisper internally resamples to 16 kHz mono per the official README, so there's no need to convert your file before uploading.

Can I transcribe a DRM-protected or M4P MP3?

No. M4P FairPlay-protected files (Apple Music DRM) and any DRM-encrypted audio cannot be transcribed — VexaScribe cannot strip DRM. If you have legal rights to the audio, re-record from the original source or purchase a DRM-free version.

Why does my MP3 show different accuracy than expected?

MP3 compression introduces roughly 10% relative WER increase versus uncompressed WAV in noisy conditions, per IBM Watson research. OGG Opus is more efficient — only ~2% degradation. For best accuracy on long-form material, record in WAV or FLAC and convert to MP3 for storage. For short clips, MP3 at 128 kbps or higher is indistinguishable from WAV in clean environments.

Can I convert an .mp3 to a .txt file (plain text)?

Yes. VexaScribe exports every transcript as .txt by default — plain text, no formatting, one paragraph per speaker. Choose .txt in the export dropdown after transcription completes. Also available: .docx (Word with speaker headers), .srt and .vtt (subtitle files with timestamps), and .json (word-level structured data for developers).

Can I convert an MP3 to PDF?

Not directly — VexaScribe exports .txt, .docx, .srt, .vtt, and .json. For a PDF, export to .docx and use Word, Google Docs, or Pages: File → Save As / Download as → PDF. This preserves speaker labels and timestamps and gives you full control over margins, fonts, and headers.

Convert Your First MP3 in 30 Seconds

30 minutes of free transcription, no credit card required. Upload an MP3 and download a finished transcript in minutes.

Learn more

Podcast transcription

Transcript to summary

iPhone Voice Memo transcription

WAV to text

Transcribe audio to text

Whisper transcription

SRT subtitle generator

Voicemail to text

Video to SRT

Voicemail transcription

MP4 to text

M4A to text

OGG to text

WhatsApp voice message transcription

Transcribe Spanish audio

Converter MP3 para texto (Português)

How to add subtitles to a video

How accurate is Whisper?

Transcribe song lyrics