Formerly NovaScribe: same team, same product, refreshed name.

Audio to Text Converter

Convert audio to text online in 99 languages. Upload any audio or video file — get accurate transcripts with speaker labels, timestamps, and AI summaries in minutes.

VexaScribe (formerly NovaScribe) is a free online audio-to-text converter that transcribes audio and video files into accurate, timestamped text using OpenAI's Whisper Large-v3 model. Upload MP3, WAV, M4A, MP4, MOV, FLAC, and 11 other formats, in files up to 5 GB. A one-hour file is typically transcribed in 5–10 minutes, with about 95% accuracy on clear English audio and automatic language detection across 99 languages. The free tier includes 30 minutes; paid plans start at $2/month for 200 minutes.

Use VexaScribe to transcribe recordings of interviews, podcasts, voice memos, lectures, Zoom calls, and dictation. Every transcript includes speaker diarization (Speaker 1, Speaker 2…), word-level timestamps, and an editable transcript view, so the text is ready to paste, share, or export to TXT, DOCX, SRT, VTT, or JSON.

30 minutes free · No credit card · 99 languages · Speaker labels

How to Transcribe Audio to Text

Three steps from upload to finished transcript — audio-to-text transcription with no setup and no software to install.

  1. Upload your file

    Drag and drop or browse for an audio or video file. We accept MP3, WAV, M4A, MP4, MOV, FLAC, OGG, AAC, AIFF, WMA, AVI, MKV, WebM, and 4 more formats. Up to 5 GB and 10 hours per file.

  2. AI transcribes in minutes

    VexaScribe runs OpenAI's Whisper Large-v3 model on your audio. A 60-minute recording typically completes in 5–10 minutes. Close the tab and come back — we'll keep processing.

  3. Edit, export, share

    Review the transcript in our built-in editor. Rename speakers, fix any errors, then export to TXT, DOCX, SRT, VTT, or JSON. Share via link or download.

Supported Audio and Video Formats

17 formats covering virtually every recording device and tool. Files up to 5 GB and 10 hours per upload.

Audio Formats

  • MP3: Most common
  • WAV: Lossless
  • M4A: iPhone default
  • FLAC: Lossless
  • OGG: Open format
  • AAC: Apple/streaming
  • AIFF: Pro audio
  • WMA: Windows
  • AMR: Mobile
  • OPUS: Modern web

Video Formats

  • MP4: Most common
  • MOV: Apple/QuickTime
  • AVI: Windows legacy
  • MKV: High-quality
  • WebM: Web video
  • FLV: Flash legacy
  • WMV: Windows

Audio is extracted automatically from video files. Video itself is not retained after transcription.

File limits: 5 GB per file, 10 hours per file. No monthly upload limit beyond your plan's included minutes.
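The limits above can be checked locally before an upload. The following is a minimal sketch, not VexaScribe's own validation: the function name is hypothetical, the extension sets simply mirror the two format lists above, and 5 GB is assumed to mean binary gigabytes.

```python
from pathlib import Path

# Extensions mirror the format lists on this page (10 audio + 7 video = 17).
AUDIO_EXTS = {"mp3", "wav", "m4a", "flac", "ogg", "aac", "aiff", "wma", "amr", "opus"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "flv", "wmv"}

MAX_BYTES = 5 * 1024**3     # 5 GB per file (assumption: binary gigabytes)
MAX_SECONDS = 10 * 60 * 60  # 10 hours per file


def check_upload(filename: str, size_bytes: int, duration_seconds: float) -> list[str]:
    """Return a list of problems; an empty list means the file fits the stated limits."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in AUDIO_EXTS | VIDEO_EXTS:
        problems.append(f"unsupported extension: .{ext}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 5 GB")
    if duration_seconds > MAX_SECONDS:
        problems.append("recording is longer than 10 hours")
    return problems
```

For example, `check_upload("lecture.m4a", 250_000_000, 5400)` returns an empty list, while a `.txt` file is reported as an unsupported extension.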

Format-specific deep dives: MP3 to text · SRT generator (audio → subtitles) · transcript to summary

What Can You Transcribe?

If it has audio, VexaScribe can transcribe it. Common use cases:

Podcast episodes

Show notes, blog posts, SEO content, searchable archives. Solo and multi-host shows supported with speaker labels.

Interviews

Journalism, qualitative research, HR. Multi-speaker diarization separates interviewer from subject automatically.

Lectures and classes

Students capturing lectures for review. Teachers generating written course notes from recorded sessions.

Meetings

Zoom, Google Meet, Microsoft Teams calls. Upload the recording or send VexaScribe's meeting bot to join.

Phone calls

Sales calls, customer interviews, support recordings. Record on any device, upload, get a transcript with speakers.

Video content

YouTube videos, training videos, course content. Generate SRT/VTT subtitles with word-level timestamps.

Transcribe in 99 Languages — With Automatic Detection

No need to select a language manually. VexaScribe auto-detects the spoken language from the audio. Accuracy varies by language tier:

Tier 1

~5% Word Error Rate (highest accuracy)

English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Japanese

Tier 2

~8–12% Word Error Rate

Arabic, Chinese, Korean, Hindi, Turkish, Vietnamese, Thai, Indonesian, Hebrew, Czech, Swedish, Norwegian, Danish, Finnish, Greek, Ukrainian

+ 73 more languages

Including Welsh, Swahili, Filipino, Bengali, Punjabi, Tamil, Telugu, Marathi, Urdu, Persian, Romanian, Hungarian, Bulgarian, Croatian, and many more. Accuracy varies by language and audio quality.

What You Get With Every Transcript

Every transcription includes these features at no extra cost on every paid plan.

Speaker diarization

Automatic speaker detection and labeling. Multiple speakers appear as Speaker 1, Speaker 2, Speaker 3, and so on. Rename them in the editor (e.g., "Host", "Guest", actual names).

Word-level timestamps

Every word is timestamped to the millisecond. Click any word in the transcript editor to jump to that moment in the audio. Essential for video subtitles and quote verification.

Multiple export formats

TXT (plain text), DOCX (Word document), SRT (video subtitles), VTT (web subtitles), and JSON (developers). All formats available on every paid plan with no upgrade required.
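To show how word-level timestamps relate to subtitle files, here is a rough sketch that groups timestamped words into SRT cues. The input shape (a list of dicts with `word`, `start`, and `end` in seconds) is an assumption made for illustration and is not VexaScribe's documented JSON schema; the seven-words-per-cue grouping is likewise arbitrary.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word dicts ({'word', 'start', 'end'}) into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        start = srt_time(chunk[0]["start"])  # cue starts with its first word
        end = srt_time(chunk[-1]["end"])     # cue ends with its last word
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Each cue is a sequence number, a `start --> end` line, and the cue text, separated from the next cue by a blank line. A production subtitle export would usually also split on sentence boundaries and cap line length.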

AI summaries

Optional AI-generated summary with key points, decisions, action items, and chapter markers. Available on all paid plans. Useful for meeting notes, podcast show notes, and lecture review.

Audio File to Text — Every Recording Type We Handle

Whatever the source of your recording, VexaScribe converts it to text in the same upload-and-go workflow. Free transcription starts with 30 minutes on signup, with no credit card required.

Voice memos & dictation

iPhone Voice Memos, Android voice recorder, Otter live capture, hardware dictaphones — drop the .m4a, .mp3, or .wav file straight into VexaScribe. The transcription of audio to text preserves punctuation and paragraph breaks so your dictation reads like prose, not a wall of words.

Recorded interviews

Whether your interview was recorded on a Zoom call, a field recorder, a smartphone, or a USB lavalier mic, VexaScribe transcribes audio recordings to text with speaker labels. Two-, three-, and four-person conversations get separated automatically — rename Speaker 1 to the interviewee in the editor and export.

Podcast episodes & raw RSS

Upload the final mix or the raw multitrack stem. Audio-to-text transcription for a 60-minute episode finishes in about 7 minutes with timestamps you can drop into show notes, chapter markers, or YouTube descriptions.

Lecture & meeting recordings

Long-form recordings from classrooms, university lectures, board meetings, town halls, and webinars. Files up to 10 hours and 5 GB work in a single upload — no chunking needed. Transcription from audio to text comes back with auto-generated AI summaries on paid plans.

Phone calls & voicemails

Compressed phone audio (8 kHz mono telephony) is supported. Accuracy lands at 85–92% on clear calls and degrades on heavily compressed VoIP — review noisy stretches in the editor before exporting the audio to text transcript.

WhatsApp & Telegram voice notes

OPUS-encoded voice notes from messaging apps transcribe cleanly. Forward the .ogg or .opus file to yourself, save it, then drag it into VexaScribe to convert audio to text in under a minute for most short messages.

Field recordings & research interviews

Qualitative researchers, journalists, and ethnographers upload long-form .wav or .flac field recordings. The transcript is timestamped so you can re-listen to any quote, and JSON export ships clean data into NVivo, ATLAS.ti, or MAXQDA.

Court hearings & legal depositions

Multi-speaker legal recordings (depositions, witness statements, hearing audio) work well with VexaScribe's speaker diarization. Outputs preserve verbatim timing — useful when you need to cite a moment in evidence. Always have a certified human verify before filing.

Need a specific format guide? See MP3 to text, WAV to text, M4A to text, or OGG to text.

How Accurate Is VexaScribe Transcription?

VexaScribe (formerly NovaScribe) achieves 95% accuracy (5% Word Error Rate) on clear English audio with a single speaker.

Real-world accuracy varies by audio condition:

  • Clear podcast audio: 3–6% WER (94–97% accurate)
  • Noisy interviews, background music: 8–15% WER (85–92% accurate)
  • Strong accents, technical jargon, multiple overlapping speakers: 10–20% WER (80–90% accurate)

We recommend reviewing transcripts before publishing critical content — no AI tool achieves the 99%+ accuracy of human transcription, but VexaScribe is 20–100× cheaper than human services like Rev ($1.50/min). For a deeper breakdown of Whisper accuracy by model size, language, and audio condition, see How Accurate Is Whisper in 2026?

Methodology: Word Error Rate (WER) is calculated with the industry-standard formula, (Substitutions + Insertions + Deletions) / Total Words in the reference transcript. See our editorial standards for the full testing methodology.
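The WER formula can be sketched as a word-level edit distance divided by the number of words in the reference transcript. This is a minimal illustration, not VexaScribe's testing harness: the function name is hypothetical, and the normalization used here (lowercasing and whitespace splitting, with no punctuation handling) is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a four-word reference gives a WER of 0.25, which corresponds to 75% accuracy. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.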

Simple, Transparent Pricing

Pay for what you use. No per-seat fees, no hidden charges. Cancel anytime.

Starter: $2/month · 200 min/month · Solo creators

Basic: $5/month · 1,000 min/month · Regular podcasters

Pro: $10/month · 2,500 min/month · Heavy use

Frequently Asked Questions

How does VexaScribe transcribe audio to text?

VexaScribe (formerly NovaScribe) uses OpenAI's Whisper Large-v3 model to convert speech to text. Upload an audio or video file, and the AI processes the entire recording, adding speaker labels, word-level timestamps, and optional AI summaries. A 60-minute file typically completes in 5–10 minutes.

What audio and video formats can I transcribe?

VexaScribe accepts MP3, WAV, M4A, FLAC, OGG, AAC, AIFF, WMA, AMR, OPUS for audio, and MP4, MOV, AVI, MKV, WebM, FLV, WMV for video. Files can be up to 5 GB and 10 hours long. For video files, we extract the audio track automatically.

How long does it take to transcribe a 1-hour audio file?

Most 1-hour files complete in 5–10 minutes. Processing speed depends on audio quality, current load, and file format. You can close the browser tab and return; the transcript will be waiting in your dashboard when it's ready.

Is VexaScribe free to use?

Yes, you get 30 minutes of transcription free with no credit card required. After the free tier, paid plans are $2/month for 200 minutes (Starter), $5/month for 1,000 minutes (Basic), $10/month for 2,500 minutes (Pro), and $20/month for 6,000 minutes (Studio). Cancel anytime.

How accurate is VexaScribe transcription?

VexaScribe achieves around 95% accuracy (5% Word Error Rate) on clear English audio with a single speaker. Real-world accuracy varies: clear podcast audio averages 3–6% WER, noisy interviews 8–15% WER, and audio with strong accents or technical jargon 10–20% WER. We recommend reviewing transcripts before publishing critical content.

What languages are supported?

99 languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Japanese, Chinese, Korean, Arabic, Turkish, Hindi, Vietnamese, Thai, and many more. Language is detected automatically — no need to select it manually before each upload.

Can I transcribe video files?

Yes. Upload MP4, MOV, AVI, MKV, WebM, FLV, or WMV files and we extract the audio track automatically. The transcript includes timestamps so you can sync with your video editing tool, generate subtitles (SRT/VTT export), or repurpose video content into blog posts.

Does VexaScribe identify multiple speakers?

Yes, automatic speaker diarization is included on every transcript. Multiple speakers are labeled Speaker 1, Speaker 2, Speaker 3, and so on. You can rename speakers in the built-in editor (e.g., "Host", "Guest", actual names) for clarity in the final transcript.

Is my audio data private and secure?

Audio files are encrypted in transit with TLS 1.2 or later and encrypted at rest in AWS eu-west-2. We do not train AI models on your audio. We do not sell user data. You can delete files at any time from your dashboard, and account deletion is self-serve.

How do I export the transcript?

VexaScribe exports to TXT (plain text), DOCX (Word document), SRT (video subtitles), VTT (web subtitles), and JSON (structured data for developers). All formats are available on every paid plan. SRT and VTT include word-level timestamps for video editors.

How do I transcribe audio recordings to text?

Sign up at VexaScribe (30 minutes free, no credit card), drag your recording onto the upload area, and wait. Transcription typically completes in roughly 8–17% of the file's duration, so a 60-minute recording is ready in 5–10 minutes. The output includes speaker labels, word-level timestamps, and an editable transcript view, and you can export the transcript as TXT, DOCX, SRT, VTT, or JSON. Works for voice memos, interviews, podcasts, lectures, Zoom calls, and phone recordings.
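As a rough planning aid, the "5–10 minutes for a 60-minute recording" figure can be scaled to other file lengths. The sketch below assumes processing time grows linearly with duration, which this page does not guarantee, since speed also depends on audio quality, current load, and file format. The function name is hypothetical.

```python
def estimated_wait_minutes(duration_minutes: float) -> tuple[float, float]:
    """Scale the stated 5-10 minutes per 60-minute file linearly (an assumption)."""
    low = duration_minutes * 5 / 60
    high = duration_minutes * 10 / 60
    return low, high
```

Under that assumption, a 3-hour recording (`estimated_wait_minutes(180)`) would take somewhere between 15 and 30 minutes.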

Is there a free audio transcription to text option?

Yes. Every new account gets 30 minutes of free audio transcription to text with all features enabled — speaker diarization, 99-language support, timestamps, and export to TXT, DOCX, SRT, VTT, and JSON. No credit card required to start. After the free tier, paid plans start at $2/month for 200 minutes (about $0.01 per minute), which is significantly cheaper than typical pay-per-minute transcription services charging $0.10–$0.25 per minute.

What's the difference between an audio file to text converter and a transcription service?

A bare audio-to-text converter usually returns a wall of raw text with no speakers, no timestamps, and no editor, so you have to clean it up yourself. A transcription service like VexaScribe returns a structured transcript: speakers are labeled (Speaker 1, Speaker 2…), every word is timestamped, the text is broken into paragraphs for readability, you can edit and re-export in-browser, and AI summaries with action items are available on paid plans. Same upload, much more usable output.

Start Transcribing in 30 Seconds

30 minutes of free transcription, no credit card required. Upload any audio file and see the result yourself.

Learn more

MP3 to text

The most common consumer audio format — bitrate guide inside

WAV to text

Uncompressed PCM — when WAV actually beats MP3 for accuracy

Whisper transcription

Hosted Whisper Large-v3 — 99 languages, no GPU

SRT generator

Generate .srt subtitle files with timestamps

Video to SRT

4-step workflow — upload video, get .srt subtitles

Video to text

Plain-text transcripts from any video file — MP4, MOV, MKV, WebM

MP4 to text

MP4 video to TXT, DOCX, JSON transcript

M4A to text

iPhone Voice Memos to transcript — 30-min free trial

OGG to text

WhatsApp voice notes, Discord recordings, Linux audio

Transcribe Spanish audio

All regional dialects + Spanish-to-English translation

Transcrever áudio em texto (Português)

Brazilian Portuguese guide — Whisper Tier 1, LGPD-friendly, BRL pricing

Transcription for qualitative research

Methodology, IRB, CAQDAS — for academic researchers

How to add subtitles to a video

Step-by-step guide: YouTube, Premiere, CapCut, iPhone

Interview transcription

For researchers, journalists, podcasters & HR — workflow + cost math

Lecture transcription

For students, MOOC learners & academics — AI study guides + 99 languages

Speaker labels — how they work

Pipeline mechanics, DER benchmarks, SRT/VTT/DOCX format examples

YouTube transcript downloader

Paste a YouTube URL, download SRT/VTT/TXT in seconds

TikTok transcript generator

Paste a TikTok URL, get transcript in 6 formats

Instagram transcript generator

Reels, Posts, IGTV — get transcript in 6 formats

Transcribe and translate

Translate transcripts in 133 languages

How accurate is Whisper?

WER benchmarks across LibriSpeech & FLEURS

AI transcription — full guide

How it works, accuracy, tools landscape, pricing models

13 Best transcription software 2026

13 tools tested — Otter, VexaScribe, Rev, Descript, Granola, AssemblyAI, Deepgram

Pricing

All plans, side-by-side

Editorial standards

How we test and disclose