WAV to Text Converter

Transcribe uncompressed PCM audio with Whisper Large-v3 — all variants supported, honest answers on when WAV actually helps.

VexaScribe (formerly NovaScribe) transcribes WAV audio files using OpenAI's Whisper Large-v3 model. We accept every common WAV variant (PCM 8/16/24/32-bit, IEEE 32-bit float, ADPCM, and μ-law/A-law) at any sample rate from 8 kHz to 192 kHz. Files can be up to 5 GB per upload, enough for ~8 hours of 44.1 kHz stereo 16-bit WAV.

Here's the honest part most pages won't tell you: Whisper resamples everything to 16 kHz mono internally, so on clean podcast or Zoom audio, a WAV transcript is indistinguishable from a 192 kbps MP3 transcript. WAV does genuinely help in three specific cases (noisy field recordings, dense multi-speaker crosstalk, and heavily accented speakers), where MP3 codec artifacts cost you 1-4 percentage points of accuracy. The rest of this page explains exactly when WAV pays for its file size and when it doesn't.

All PCM variants · Up to 5 GB / 10 hr · 99 languages · No credit card

What is a WAV file?

A WAV file (.wav) is a Microsoft RIFF container holding uncompressed PCM audio samples — co-developed by Microsoft and IBM in 1991, now the de facto format for studio masters, broadcast deliverables, and any workflow where preserving the original signal matters more than file size.
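To make the container idea concrete, here is a minimal sketch that reads the RIFF header and the 'fmt ' chunk of a WAV file and reports how its samples are encoded. The function names (`describe_wav`, `make_test_wav`) are illustrative, not part of any VexaScribe tooling, and the tag table covers only the encodings discussed on this page.

```python
import io
import struct
import wave

# Format tags from the RIFF/WAVE registry (subset relevant to this page).
FORMAT_TAGS = {
    0x0001: "Linear PCM",
    0x0002: "Microsoft ADPCM",
    0x0003: "IEEE float",
    0x0006: "A-law (G.711)",
    0x0007: "mu-law (G.711)",
    0xFFFE: "WAVE_FORMAT_EXTENSIBLE",
}


def describe_wav(data: bytes) -> dict:
    """Parse the RIFF header and 'fmt ' chunk of a WAV file held in memory."""
    riff, _size, wave_id = struct.unpack_from("<4sI4s", data, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, offset)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, offset + 8
            )
            return {
                "format": FORMAT_TAGS.get(tag, f"unknown (0x{tag:04X})"),
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Chunks are word-aligned: an odd size is followed by one pad byte.
        offset += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no 'fmt ' chunk found")


def make_test_wav() -> bytes:
    """Build a tiny 16-bit mono 16 kHz PCM WAV in memory for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 160)
    return buf.getvalue()


info = describe_wav(make_test_wav())
print(info)
```

To inspect a real file, pass its bytes instead, for example `describe_wav(open("input.wav", "rb").read())`.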

"Uncompressed" is the most common case but not the only one. The WAV container also supports compressed variants — ADPCM (a 4-bit Microsoft compression scheme used in older Windows voice recorders) and the telephony codecs μ-law and A-law (G.711, 8 kHz / 8-bit logarithmic). When people say "WAV file," they almost always mean linear PCM, which is what this page focuses on.

For audio engineers, WAV is what comes out of a DAW (Pro Tools, Logic, Reaper, Ableton) when you bounce a mix. For broadcast producers, 48 kHz 24-bit WAV is the BBC/EBU delivery standard. For forensic and legal work, WAV is the only format where the recording's admissibility-as-evidence isn't complicated by lossy compression. For everyone else uploading audio for transcription, the question is whether WAV's 10-15× file-size penalty actually buys you a better transcript — see the next section.

WAV vs MP3 for speech recognition: the honest answer

For typical podcast, Zoom, or voice-memo speech, WAV does not give meaningfully better transcripts than 128 kbps+ MP3. Whisper resamples every input to 16 kHz mono internally — so a 96 kHz / 24-bit / stereo WAV and a 128 kbps mono MP3 both produce the same 16 kHz mono samples by the time the recognizer sees them. WAV genuinely wins on three edge cases below.

| Audio condition | WAV WER | MP3 128 kbps WER | Real difference |
| --- | --- | --- | --- |
| Studio podcast, single host | 3–5% | 3–5% | Negligible |
| Zoom 2-speaker (headsets) | 7–10% | 7–11% | 0.1–0.5 pp |
| Zoom 4+ speakers, mixed mics | 10–16% | 11–18% | 1–2 pp (WAV wins) |
| Phone call quality (8 kHz source) | 13–20% | 14–22% | 1–2 pp (WAV wins) |
| Noisy field recording (cafe, outdoors) | 15–22% | 17–26% | 2–4 pp (WAV wins) |
| Heavily accented English | 12–18% | 13–20% | 1–2 pp (WAV wins) |
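For readers who want to reproduce this kind of comparison on their own audio, the sketch below computes word error rate (WER) as word-level edit distance divided by the number of reference words. It is a minimal illustration, not the evaluation pipeline behind the table; the function name and the simple lowercase-and-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the cat sat on the mat"
wav_wer = word_error_rate(reference, "the cat sat on the mat")
mp3_wer = word_error_rate(reference, "the cat sit on mat")
# Difference expressed in percentage points (pp), as in the table.
print(f"WAV {wav_wer:.1%}, MP3 {mp3_wer:.1%}, gap {100 * (mp3_wer - wav_wer):.1f} pp")
```

Transcribe the same recording from a WAV and from an MP3, score both against a hand-corrected reference, and compare the gap in percentage points.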

Where WAV genuinely helps: noisy field recordings (the codec discards subtle frequency information the recognizer needs to separate speech from background), dense multi-speaker crosstalk (lossy compression smears spectral details that distinguish voices), and heavily accented speech (phonemes near codec compression boundaries get garbled).

Where WAV does not help: studio-clean single-speaker audio, headset Zoom calls, podcast interviews recorded in treated rooms, any audio originally captured at 16 kHz or lower (the codec savings happen above Whisper's 16 kHz internal sample rate). For these, MP3 at 128-192 kbps is functionally identical and uploads ~10× faster.

The original Whisper paper (Radford et al., 2022) describes training on 680,000 hours of diverse audio collected from the internet, which is consistent with the model's robustness to lossy compression at common bitrates. For format-specific tradeoffs in the other direction, see our MP3 to text guide.

WAV file size: what to expect

Uncompressed PCM file size scales linearly: bytes = sample_rate × bit_depth × channels × duration_seconds / 8. A 1-hour 44.1 kHz stereo 16-bit WAV is ~606 MB; a 1-hour 96 kHz stereo 24-bit master is ~2 GB.

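Sizes below use binary megabytes (1 MB = 1,048,576 bytes).

| Sample rate | Bit depth | Channels | 1 minute | 1 hour |
| --- | --- | --- | --- | --- |
| 8 kHz | 16-bit | Mono | 938 KB | 55 MB |
| 16 kHz (Whisper internal) | 16-bit | Mono | 1.8 MB | 110 MB |
| 44.1 kHz | 16-bit | Mono | 5.0 MB | 303 MB |
| 44.1 kHz (CD quality) | 16-bit | Stereo | 10.1 MB | 606 MB |
| 44.1 kHz | 24-bit | Stereo | 15.1 MB | 908 MB |
| 48 kHz (broadcast) | 24-bit | Stereo | 16.5 MB | 989 MB |
| 96 kHz (hi-res) | 24-bit | Stereo | 33.0 MB | 1,978 MB |
| 192 kHz (master) | 24-bit | Stereo | 65.9 MB | 3,955 MB |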
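The size formula is easy to check in a few lines. This sketch applies it to some of the rows above and prints each one-hour size in both binary (MiB) and decimal (MB) units; the function name is illustrative, and the small WAV header is ignored.

```python
MIB = 1024 ** 2


def pcm_size_bytes(sample_rate_hz, bit_depth, channels, duration_s):
    """bytes = sample_rate x bit_depth x channels x duration_seconds / 8."""
    return sample_rate_hz * bit_depth * channels * duration_s / 8


rows = [
    ("16 kHz 16-bit mono", 16_000, 16, 1),
    ("44.1 kHz 16-bit stereo", 44_100, 16, 2),
    ("96 kHz 24-bit stereo", 96_000, 24, 2),
]
for label, rate, bits, channels in rows:
    size = pcm_size_bytes(rate, bits, channels, 3600)
    print(f"{label}: {size / MIB:,.0f} MiB per hour ({size / 1e6:,.0f} MB decimal)")
```

The two unit systems differ by about 5%, which explains why the same file can be quoted as roughly 606 MB or 635 MB.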

Practical implication: VexaScribe accepts files up to 5 GB, which covers everything except multi-hour 96/192 kHz 24-bit stereo masters. If you're close to the limit or your upload bandwidth is the bottleneck, downsample to 16 kHz mono before uploading — it's exactly the format Whisper consumes internally, so you lose nothing.

FFmpeg one-liner for 16 kHz mono PCM:

ffmpeg -i input.wav -ar 16000 -ac 1 -sample_fmt s16 output.wav

This produces a 110 MB/hour file with no Whisper-relevant information loss. For maximum upload speed, convert to MP3 192 kbps mono instead (~85 MB/hour): same transcript quality, somewhat smaller than the 16 kHz WAV and far smaller than a stereo master.
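If you convert files from a script rather than by hand, the same FFmpeg invocations can be wrapped in Python. This is a sketch under two assumptions: `ffmpeg` is installed and on your PATH, and your build includes the `libmp3lame` encoder. The helper names are illustrative.

```python
import shutil
import subprocess


def downsample_cmd(src: str, dst: str) -> list[str]:
    """FFmpeg arguments for 16 kHz mono 16-bit PCM (same flags as the one-liner)."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", "-sample_fmt", "s16", dst]


def mp3_cmd(src: str, dst: str) -> list[str]:
    """FFmpeg arguments for a 192 kbps mono MP3."""
    return ["ffmpeg", "-i", src, "-c:a", "libmp3lame", "-b:a", "192k", "-ac", "1", dst]


def run(cmd: list[str]) -> None:
    """Run the command, failing early with a clear error if ffmpeg is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("ffmpeg not found on PATH")
    subprocess.run(cmd, check=True)


# Example (not executed here): run(downsample_cmd("input.wav", "output.wav"))
print(" ".join(downsample_cmd("input.wav", "output.wav")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.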

WAV technical variants

The WAV container supports a small zoo of sub-formats. VexaScribe accepts all of them and resamples internally. The dynamic range column shows the theoretical figure of about 6.02 dB per bit (20 × log10(2^N)): ~96 dB for 16-bit, ~144 dB for 24-bit. The standard quantization-noise SNR formula for a full-scale sine wave, 6.02 × N + 1.76 dB, gives values 1.76 dB higher (~98 dB and ~146 dB).

| Variant | Sample depth | Dynamic range | Common source | Supported? |
| --- | --- | --- | --- | --- |
| PCM 8-bit | 8-bit | ~48 dB | Legacy systems, voicemail | Yes |
| PCM 16-bit | 16-bit | ~96 dB | CD audio, most consumer recordings | Yes (most common) |
| PCM 24-bit | 24-bit | ~144 dB | Broadcast, professional recording | Yes |
| PCM 32-bit | 32-bit | ~192 dB | Some DAWs, scientific | Yes |
| IEEE float 32-bit | 32-bit float | Effectively unlimited | Pro Tools, Logic, Reaper bounces | Yes |
| ADPCM (Microsoft 4-bit) | 4-bit compressed | Limited | Older Windows recorders, voicemail | Yes (narrowband quality) |
| μ-law / A-law (G.711) | 8-bit logarithmic | Telephone-grade | Legacy telephony, VoIP archives | Yes (phone-quality) |
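The integer-PCM figures in the dynamic range column can be checked directly. The sketch below computes both the plain 20 × log10(2^N) value, which matches the table, and the 6.02 × N + 1.76 dB quantization SNR for a full-scale sine wave, which comes out 1.76 dB higher. Function names are illustrative.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step: 20 * log10(2^N)."""
    return 20 * math.log10(2 ** bits)


def sine_sqnr_db(bits: int) -> float:
    """Signal-to-quantization-noise ratio for a full-scale sine wave."""
    return 6.02 * bits + 1.76


for bits in (8, 16, 24, 32):
    print(f"{bits}-bit: {dynamic_range_db(bits):.1f} dB range, "
          f"{sine_sqnr_db(bits):.1f} dB sine SQNR")
```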

For transcription, all linear PCM and float variants of the same recording produce effectively identical output because Whisper resamples to 16 kHz mono regardless. Bit depth matters only when you also need the WAV for audio production downstream (mixing, mastering, sample manipulation). The narrowband formats (ADPCM, μ-law, A-law) start with telephony-quality source audio, so expect 80-88% accuracy similar to phone recordings, not the 95-97% you'd get from clean studio WAV.

How to convert WAV to text in 3 steps

Three steps. The upload usually takes longer than the transcription itself for large WAV files.

  1. Upload your WAV file

     Drag-drop any PCM (8/16/24/32-bit), IEEE float, ADPCM, or μ-law WAV up to 5 GB. No conversion is needed before upload, but if your file is multi-GB and your connection is slow, downsample to 16 kHz mono first.

  2. AI transcribes the audio

     The file is resampled to 16 kHz mono and run through Whisper Large-v3 at ~25-30× real-time. A 1-hour WAV finishes in ~2 minutes after upload completes.

  3. Edit and export

     Review in the synced editor, label speakers if needed, then download as TXT, DOCX, SRT, VTT, or JSON. Word-level timestamps are included.

For multi-speaker WAV files (interviews recorded in a DAW with each speaker on a separate channel), enable speaker diarization at upload — VexaScribe uses channel separation to improve speaker labels. To generate subtitle files from the same WAV, see the SRT generator.

When to use WAV (and when not to)

WAV is the right answer when you need to preserve the original signal — for editing, broadcast, or evidence. It's the wrong answer when upload speed or storage cost dominates and the transcript is the only output you care about.

Studio podcast masters

Yes — use WAV

Keep the master as WAV for re-edits and re-bouncing. Transcribe directly without converting; uploads are slow but transcripts are identical to MP3.

Broadcast deliverables

Yes — use WAV

Broadcast specs (BBC, NPR, EBU) typically require 48 kHz 24-bit WAV. Transcribe from the same file you deliver.

Forensic / legal recordings

Yes — use WAV

Chain-of-custody integrity. The recording's admissibility depends on preserving the original signal. AI transcript is a draft; pair with human review for evidence.

Voice memos to share

No — use M4A or MP3

iPhone Voice Memos default to M4A. Don't convert to WAV — you'll just make the file 10× bigger without improving accuracy.

Long uploads on slow connections

No — convert first

If you're on hotel Wi-Fi or mobile data, convert WAV to 192 kbps MP3 before uploading. Same transcript, roughly 7-12× faster upload depending on the source WAV's sample rate and bit depth.

Long-term archives

FLAC, not WAV

FLAC is lossless like WAV but typically 40-60% smaller. If you don't need real-time DAW playback, archive in FLAC and upload either format to VexaScribe.

Privacy and data handling

WAV uploads are encrypted in transit (TLS 1.2+) and at rest. We do not use customer audio to train Whisper or any other model.

  • Encryption: TLS 1.2+ in transit; AES-256 at rest in AWS eu-west-2.
  • No training: Customer audio is never used to train, fine-tune, or evaluate any model. Whisper runs in inference mode only.
  • Self-serve deletion: Delete any file or your full account from the dashboard.

For forensic or legal workflows where the audio cannot leave your environment, hosted services are the wrong fit — use self-hosted Whisper instead. See our full privacy policy and editorial standards.

Export formats

Five export formats from your WAV transcript. JSON preserves word-level timestamps; SRT and VTT are ready for video captioning.

| Format | Best for | Timestamps | Speakers |
| --- | --- | --- | --- |
| TXT | Plain reading, LLM pipelines | No | If enabled |
| DOCX | Word-processor editing | Optional | Yes |
| SRT | Video captions (YouTube, Vimeo) | Yes (segment-level) | Yes |
| VTT | HTML5 video, web players | Yes (segment + cue metadata) | Yes |
| JSON | Programmatic processing, custom UIs | Yes (word + segment) | Yes |
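To show how segment-level timestamps map onto a caption file, here is a minimal sketch that turns a list of segments into SRT text. The segment layout (`start`, `end`, `text`, optional `speaker`) is a hypothetical structure chosen for illustration; it is not VexaScribe's documented JSON schema. The `HH:MM:SS,mmm` timestamp form is the standard SRT convention.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"{seg['speaker']}: {text}"
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{text}\n"
        )
    return "\n".join(cues)


# Hypothetical segments, for illustration only.
example = [
    {"start": 0.0, "end": 2.5, "speaker": "Host", "text": "Welcome back."},
    {"start": 2.5, "end": 5.0, "speaker": "Guest", "text": "Thanks for having me."},
]
srt_text = segments_to_srt(example)
print(srt_text)
```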

WAV to Text — Frequently Asked Questions

Does WAV give better transcription accuracy than MP3?

Usually not. Whisper Large-v3 — the model VexaScribe (formerly NovaScribe) runs — resamples every input to 16 kHz mono before transcription. At MP3 bitrates of 128 kbps or higher, the perceptible speech information you lose to lossy compression is far below what affects WER. Real-world testing shows MP3 vs WAV differences of 0.1-0.5 percentage points on clean podcast audio. WAV does win on three specific edge cases: noisy field recordings, heavily accented speakers, and dense multi-speaker crosstalk — places where the codec discards information the recognizer needs.

What sample rate does Whisper use internally? Should I downsample WAV first?

Whisper resamples everything to 16 kHz mono internally. Uploading a 96 kHz / 24-bit / stereo WAV doesn't give you better transcripts; it just gives you a file 18× larger than the same audio at 16 kHz / 16-bit / mono. If your bandwidth is the bottleneck, downsample first with FFmpeg: ffmpeg -i input.wav -ar 16000 -ac 1 -sample_fmt s16 output.wav. This produces the sample rate and channel layout Whisper consumes, at ~110 MB/hour instead of ~2 GB/hour.

Can I transcribe a 24-bit or 32-bit WAV file?

Yes. VexaScribe accepts PCM 8/16/24/32-bit WAV plus IEEE 32-bit float WAV. The transcription pipeline reads any of these and resamples internally. 24-bit WAV files are common from broadcast and recording studios; 32-bit float files are typical from DAWs (Pro Tools, Logic, Reaper). All produce identical transcripts since the downstream model sees the same 16 kHz mono samples.

Does mono vs stereo matter for transcription accuracy?

For single-speaker audio: no difference. Whisper mixes stereo to mono before processing. For multi-speaker recordings where each speaker is on a separate channel (e.g., a podcast where host and guest were recorded into different channels), stereo is genuinely better — VexaScribe can use channel separation to improve speaker diarization. For shared-mic recordings where both speakers hit the same channel, mono and stereo produce identical results.

What's the maximum WAV file size?

5 GB per upload, enough for roughly 8 hours of 44.1 kHz stereo 16-bit WAV or about 2.4 hours of 96 kHz stereo 24-bit. For longer recordings, downsample to 16 kHz mono first (the format Whisper resamples to anyway), which fits ~43 hours into 5 GB. Files larger than 5 GB need to be split before upload.
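A quick way to see how many hours of a given format fit in one upload is to divide the limit by the per-hour PCM size. The sketch assumes the 5 GB limit means 5 × 10^9 bytes (the page does not say whether it is decimal or binary) and ignores the WAV header.

```python
UPLOAD_LIMIT_BYTES = 5 * 10**9  # assumption: decimal gigabytes


def hours_that_fit(sample_rate_hz: int, bit_depth: int, channels: int,
                   limit_bytes: int = UPLOAD_LIMIT_BYTES) -> float:
    """Hours of uncompressed PCM that fit under the upload limit."""
    bytes_per_hour = sample_rate_hz * bit_depth * channels * 3600 / 8
    return limit_bytes / bytes_per_hour


for label, rate, bits, channels in [
    ("44.1 kHz 16-bit stereo", 44_100, 16, 2),
    ("96 kHz 24-bit stereo", 96_000, 24, 2),
    ("16 kHz 16-bit mono", 16_000, 16, 1),
]:
    print(f"{label}: {hours_that_fit(rate, bits, channels):.1f} hours")
```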

How long does it take to upload a 600 MB WAV file?

Upload time depends on your connection's upload speed, not VexaScribe. On a 50 Mbps uplink, 600 MB takes about 100 seconds; on a 10 Mbps home connection, ~8 minutes; on a hotel Wi-Fi capped at 2 Mbps, ~40 minutes. Transcription itself is fast (a 1-hour file in ~2 minutes), so upload is usually the slow part. For mobile uploads or slow connections, convert to MP3 192 kbps first: speech sounds identical, a 44.1 kHz stereo 16-bit file shrinks about 7×, and the resulting transcript is indistinguishable.
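The upload estimates above follow from one line of arithmetic: megabytes times 8 gives megabits, divided by the uplink speed in Mbps. A minimal sketch, ignoring protocol overhead and assuming the link sustains its rated speed:

```python
def upload_seconds(size_mb: float, uplink_mbps: float) -> float:
    """Ideal transfer time: size in megabits divided by uplink speed."""
    return size_mb * 8 / uplink_mbps


for mbps in (50, 10, 2):
    seconds = upload_seconds(600, mbps)
    print(f"600 MB at {mbps} Mbps: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Real uploads are usually somewhat slower than this ideal figure, so treat the result as a lower bound.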

Can I transcribe a WAV file with multiple speakers from a DAW?

Yes. If your DAW exported each speaker to a separate channel, stereo WAV uploads will use that channel information to label speakers more accurately. If you bounced everything to a single mixed stereo file, VexaScribe falls back to audio-based diarization (analyzing voice characteristics), which is still good for 2-4 speakers but weaker on crosstalk. The cleanest workflow: keep each speaker on their own channel and export a single multi-channel WAV rather than a pre-mixed stereo bounce.
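If your DAW can only export one mono WAV per speaker, the tracks can be combined into a single multi-channel file before upload. The sketch below interleaves 16-bit mono tracks with Python's standard `wave` and `array` modules; it assumes every track is 16-bit linear PCM at the same sample rate, pads shorter tracks with silence, and uses illustrative helper names.

```python
import array
import io
import wave


def interleave_mono_tracks(tracks: list[bytes]) -> bytes:
    """Combine 16-bit mono WAV files (as bytes) into one multi-channel WAV."""
    if not tracks:
        raise ValueError("need at least one track")
    channels, rate = [], None
    for data in tracks:
        with wave.open(io.BytesIO(data), "rb") as w:
            if w.getnchannels() != 1 or w.getsampwidth() != 2:
                raise ValueError("expected 16-bit mono WAV input")
            if rate is None:
                rate = w.getframerate()
            elif w.getframerate() != rate:
                raise ValueError("all tracks must share one sample rate")
            samples = array.array("h")
            samples.frombytes(w.readframes(w.getnframes()))
            channels.append(samples)
    n = len(channels)
    length = max(len(c) for c in channels)
    # Start from silence so shorter tracks are zero-padded at the end.
    out = array.array("h", [0]) * (length * n)
    for idx, samples in enumerate(channels):
        # Channel idx occupies every n-th sample, starting at offset idx.
        out[idx:len(samples) * n:n] = samples
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(n)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(out.tobytes())
    return buf.getvalue()


def mono_wav(samples: list[int], rate: int = 16000) -> bytes:
    """Build a small 16-bit mono WAV in memory (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(array.array("h", samples).tobytes())
    return buf.getvalue()


stereo = interleave_mono_tracks([mono_wav([1, 2, 3]), mono_wav([10, 20])])
```

With real files, read each track's bytes from disk and write the returned bytes to a new `.wav` file.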

Is uncompressed WAV better for forensic or legal transcription?

Yes, but for evidence-preservation reasons, not accuracy. WAV preserves the original signal so the audio file itself can stand as evidence with chain-of-custody integrity. MP3's lossy compression can complicate the file's admissibility as a primary recording in some jurisdictions. For the transcription quality itself, the same caveats apply: Whisper resamples both formats to 16 kHz mono before recognition. Pair AI transcription with human review for any transcript that may enter legal proceedings.

How do I convert WAV to MP3 to upload faster?

Use FFmpeg: ffmpeg -i input.wav -c:a libmp3lame -b:a 192k -ac 1 output.mp3. This produces mono MP3 at 192 kbps, about 1/7th the size of a 44.1 kHz stereo 16-bit WAV (1,411 kbps). Speech intelligibility and Whisper accuracy are indistinguishable from the original WAV at this bitrate. If you don't have FFmpeg, online converters work too; just verify the output is at least 128 kbps and ideally mono for speech.

Do ADPCM, μ-law, or A-law WAV files work?

Yes, with caveats. ADPCM (typically 4-bit, used in some Windows recorders and older voicemail systems) and G.711 μ-law/A-law (8 kHz, 8-bit, used in legacy telephony) are accepted. Both are narrowband formats — the audio quality is fundamentally limited, so expect 80-88% accuracy similar to phone recordings rather than the 95-97% you'd get from clean studio WAV. Modern recorders almost never produce these formats; if you have one, it's typically from a voicemail dump or a forensic archive.

Transcribe your first WAV free

30 minutes free, no card required. All PCM variants supported. Files up to 5 GB. Whisper Large-v3 accuracy.