Video to Transcript Generator — Generate a Transcript from Any Video

Key takeaways

● File upload or URL paste: both work. Nine video file formats (MP4, MOV, WebM, MKV, AVI, M4V, MPEG, FLV, WMV) plus five URL platforms (TikTok, Instagram, Vimeo, Loom, direct video links). Audio is extracted server-side; no ffmpeg on your end.
● Free preview before signup. The first 200 characters of any transcript render immediately with the same timestamps and speaker labels the full output uses. See exactly what you'll get before creating an account.
● 20–40 seconds for a typical 3-minute video. A 30-minute podcast video finishes in 4–7 minutes. AI runs at 4–10× real-time depending on load.
● 95% accuracy on clean audio, 80–95% on real-world video. Studio and podcast content lands at the top of that range; vlogs and TikTok with music underneath speech land at the bottom.
● Language auto-detects from your browser. Turkish browser → Turkish pre-selected, Spanish browser → Spanish, and so on. Fixes the "why is my English video's transcript in Spanish" bug that happens when tools blindly pick the first available caption track.
● Speaker labels on every transcript. Automatic diarization for up to eight distinct speakers. Rename in the editor and the change applies across the whole transcript.
● YouTube goes to a different tool. /tools/youtube-transcript is faster and cheaper for YouTube specifically because we pull YouTube's caption track when it exists. This page is for everything else.

How it works (3 steps)

The workflow is the same for a 15-second TikTok and a 3-hour lecture recording. Paste or drop, preview, sign up to unlock.

1. Paste a URL or drop a file
Drop any supported video file into the uploader, or paste a public URL from TikTok, Instagram, Vimeo, Loom, or any direct video link. Language auto-detects from your browser; override in the dropdown if the video is in a different language than your locale.
2. Preview the first 200 characters
The server fetches the video (URL) or reads the file (upload), extracts the audio track, and runs Whisper Large-v3. In 20–40 seconds you see the first 200 characters of the transcript with timestamps and speaker labels, enough to confirm the language, quality, and speaker structure are what you expect.
3. Sign up to unlock and export
A free account (email only, no credit card) reveals the full transcript, the editor with search-and-replace, and export to TXT, DOCX, SRT, VTT, and JSON. Translate to any of 133 languages from the same editor. Free tier includes 30 minutes/month; paid from $2/month.

Supported formats and URLs

Two ways to hand VexaScribe a video: upload a file, or paste a URL. Anything on the two lists below works without pre-processing.

File formats

Format	Common source
.mp4	The universal default — H.264 / H.265 codec, used by TikTok, Instagram, most cameras, screen recorders, and iPhone (newer)
.mov	QuickTime container — older iPhones, Final Cut Pro / DaVinci Resolve ProRes exports
.webm	Web-native (VP9/VP8 + Opus) — Chrome and OBS screen recorders, browser captures, Loom exports
.mkv	Matroska — open container, common for desktop screen recorders and downloaded video archives
.avi	Legacy Microsoft — older screen recorders, archived footage
.m4v	iTunes and iOS video — same H.264 payload as MP4, different container header
.mpeg	Older camcorders and broadcast archives
.flv	Legacy Flash video — older lecture libraries and corporate training archives
.wmv	Windows Media Video — Windows Movie Maker and older corporate recordings

Video URLs

Platform	URL pattern	Notes
TikTok	tiktok.com/@handle/video/…, vt.tiktok.com/…	Public videos. Music-only reels transcribe the vocals; instrumental-only reels return an empty transcript.
Instagram Reels	instagram.com/reel/…	Public reels. Requires no login.
Instagram posts / IGTV	instagram.com/p/…, instagram.com/tv/…	Public video posts. Photo-only posts return an error.
Vimeo	vimeo.com/…	Public and unlisted-with-URL videos. Password-protected videos return an error — download the video and drop the file instead.
Loom	loom.com/share/…	Publicly-shared Loom videos. Workspace-private Looms require login and don't work by URL.
Direct video URL	https://…/anything.mp4 (or .mov, .webm)	Any public direct video link. If the URL 404s from an incognito browser, it won't work here.
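
As a rough illustration of how the URL patterns in the table could be told apart, here is a minimal Python sketch. The hostname map, the `classify_video_url` name, and the returned labels are assumptions made for this example, not VexaScribe's actual routing code.

```python
from urllib.parse import urlparse

# Hypothetical hostname-to-platform map based on the URL patterns listed above.
PLATFORM_HOSTS = {
    "tiktok.com": "tiktok",
    "vt.tiktok.com": "tiktok",
    "instagram.com": "instagram",
    "vimeo.com": "vimeo",
    "loom.com": "loom",
    "youtube.com": "youtube",  # handled by the separate YouTube tool
    "youtu.be": "youtube",
}
DIRECT_EXTENSIONS = (".mp4", ".mov", ".webm")


def classify_video_url(url: str) -> str:
    """Return a platform label, 'direct' for direct video links, or 'unsupported'."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in PLATFORM_HOSTS:
        return PLATFORM_HOSTS[host]
    # The query string is not part of parsed.path, so "?x=1" does not hide the extension.
    if parsed.path.lower().endswith(DIRECT_EXTENSIONS):
        return "direct"
    return "unsupported"


print(classify_video_url("https://www.tiktok.com/@handle/video/123"))
print(classify_video_url("https://example.com/files/talk.mp4"))
```

A label of "youtube" is where a front end would point the user to /tools/youtube-transcript instead of fetching the video.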

Not supported by this tool

Source	Why / what to do instead
YouTube	Handled by the dedicated tool at /tools/youtube-transcript — same Whisper engine, but optimized for YouTube's caption API where captions already exist (faster and cheaper on our end).
Twitch VODs	Not currently supported — the VOD URL structure requires a session cookie our fetcher doesn't handle. Download the VOD and upload the file.
Live streams	This is a file-based tool. For live transcription, use /meeting-transcription.
Zoom cloud recordings behind a password	The share link requires the recording password to fetch. Download the .mp4 and drop it in the uploader — that always works.
Private / login-required videos	Any video that requires authentication (private Vimeo, private YouTube, workspace-only Loom) can't be fetched by our server. Download and upload the file.

Audio-only files also work: MP3, M4A, WAV, AAC, FLAC, OGG, OPUS. For dedicated audio guides see MP3 to text, M4A to text, WAV to text, and OGG to text.

What your transcript looks like

Every transcript includes segment-level timestamps and speaker labels by default. Word-level timestamps are available on paid plans. Below is a real (anonymized) excerpt of what you'll see in the editor and in the exported TXT/DOCX.

[00:00:03] Speaker 1: So the thing about video transcription is
[00:00:06] Speaker 1: most people don't realize it's just audio underneath.
[00:00:11] Speaker 2: Right, the container format doesn't really matter.
[00:00:14] Speaker 2: MP4, MOV, MKV — they're all wrappers.
[00:00:19] Speaker 1: Exactly. Whisper only sees the audio track.

Prefer plain text without timestamps or speaker labels? Toggle them off before export. The DOCX and TXT outputs can be paragraph-only if that's what you need for a blog post or article draft. SRT and VTT preserve timing by definition (they're subtitle formats). JSON keeps the raw timing array (word-level on paid plans) for developers building custom video editors or search interfaces.
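
For developers who want to post-process the TXT export themselves, the sketch below parses lines in the format shown in the excerpt and turns them into SRT blocks. It assumes each line looks like `[HH:MM:SS] Name: text`. Because the TXT export carries only start times, each segment is assumed to end when the next one starts, and the last segment gets an arbitrary 3-second duration. The function names are illustrative.

```python
import re

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] ([^:]+): (.*)$")


def parse_transcript(text):
    """Parse '[HH:MM:SS] Speaker: text' lines into a list of segment dicts."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        hours, minutes, seconds, speaker, body = match.groups()
        start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        segments.append({"start": start, "speaker": speaker, "text": body})
    return segments


def srt_time(total_seconds):
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},000"


def to_srt(segments, last_duration=3):
    """Build SRT text; each cue ends where the next begins (an assumption)."""
    blocks = []
    for index, seg in enumerate(segments):
        if index + 1 < len(segments):
            end = segments[index + 1]["start"]
        else:
            end = seg["start"] + last_duration
        blocks.append(
            f"{index + 1}\n{srt_time(seg['start'])} --> {srt_time(end)}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)


sample = (
    "[00:00:03] Speaker 1: So the thing about video transcription is\n"
    "[00:00:06] Speaker 1: most people don't realize it's just audio underneath.\n"
    "[00:00:19] Speaker 1: Exactly. Whisper only sees the audio track."
)
print(to_srt(parse_transcript(sample)))
```

The built-in SRT export is the simpler route. A script like this is mainly useful when the transcript has already been edited outside the editor.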

Language auto-detection

The language dropdown pre-selects based on your browser locale. A Turkish speaker on a Turkish browser sees Turkish selected by default; a Portuguese speaker sees Portuguese; a Chinese speaker sees Mandarin. This is deliberate, not cosmetic — it fixes a bug that affects most third-party transcript tools built on Supadata's caption API.

The bug: when a video has multiple auto-generated caption tracks (TikTok and Instagram often have three or four — Vietnamese, Indonesian, English, plus the original language), a tool that asks the caption API for "auto" can get a random language back. That's where the notorious "why is my English video's transcript in Spanish" complaint comes from. Pre-selecting the user's actual browser language cuts the miss rate significantly.

All 99 Whisper Large-v3 languages are supported. Non-Latin scripts (Chinese, Japanese, Arabic, Hindi, Thai) render cleanly in the editor and every export format. Right-to-left scripts (Arabic, Hebrew, Persian) preserve reading direction in DOCX and TXT. Auto-detection is less reliable on very short clips that contain only a few seconds of speech; for a 5-second Reel, override manually if needed.
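
One plausible way to implement the locale-based pre-selection is to read the browser's Accept-Language header and pick the highest-ranked language the engine supports. The sketch below assumes that approach. The page does not say whether VexaScribe uses the header or a client-side value such as `navigator.language`, and the function name and supported-language set are illustrative.

```python
def preferred_language(accept_language, supported, default="en"):
    """Pick the highest-quality supported language from an Accept-Language header."""
    candidates = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # per HTTP semantics, a missing q-value means 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        primary = tag.strip().split("-")[0].lower()  # "tr-TR" -> "tr"
        candidates.append((-quality, position, primary))
    # Sort by descending quality, then by original order in the header.
    for _, _, primary in sorted(candidates):
        if primary in supported:
            return primary
    return default


supported_languages = {"en", "tr", "es", "pt", "zh"}  # illustrative subset
print(preferred_language("tr-TR,tr;q=0.9,en-US;q=0.8", supported_languages))
```

The manual override in the dropdown still matters, because the browser's language says nothing about the language spoken in the video.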

Accuracy by content type

Accuracy varies with audio quality, not video quality. A 4K studio production with laptop-mic audio transcribes worse than a 480p phone recording with a lavalier close to the speaker. Numbers below are typical ranges from our internal testing across ~800 videos and match the general findings on the Hugging Face Open ASR Leaderboard for Whisper Large-v3.

Content type	Accuracy	Editing time
Studio explainer / single-speaker tutorial (treated room, good mic)	95-97%	5-10 min/hr
Podcast-style video (two speakers, dedicated mics)	93-96%	10-15 min/hr
Zoom / Google Meet recording (built-in laptop mics)	91-95%	10-15 min/hr
Lecture / classroom recording (ceiling or wireless mic)	89-94%	10-20 min/hr
TikTok / Instagram Reel (in-app recording with music underneath)	82-90%	15-25 min/hr
Vlog (outdoor, handheld, ambient noise)	80-88%	20-30 min/hr
Phone-recorded interview (compressed audio, near-field)	85-92%	15-20 min/hr
Multi-language / heavy accents	82-90%	15-25 min/hr

Proper nouns — brand names, product SKUs, technical jargon, foreign names — miss more often than regular vocabulary (20–30% error rate even on otherwise clean audio). The editor's search-and-replace fixes all instances in one pass; that's where most of the editing time goes.
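
The same one-pass correction can be scripted on an exported transcript. This is a minimal sketch of a whole-word, case-insensitive search-and-replace. The `fix_proper_nouns` name and the corrections table are made up for the example and say nothing about how the editor is implemented.

```python
import re


def fix_proper_nouns(text, corrections):
    """Replace every whole-word occurrence of each misheard term; return (text, count)."""
    total = 0
    for wrong, right in corrections.items():
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", re.IGNORECASE)
        # A function replacement avoids backslash escapes in `right` being interpreted.
        text, replaced = pattern.subn(lambda _match, value=right: value, text)
        total += replaced
    return text, total


transcript = "Vexa scribe exports SRT. I use vexa Scribe daily."
fixed, count = fix_proper_nouns(transcript, {"vexa scribe": "VexaScribe"})
print(fixed, count)
```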

Limits, pricing, and the free preview

Honest limits, stated upfront so nothing surprises you at signup:

● Free preview: first 200 characters of any transcript, no signup, no card. Enough to confirm language, quality, and speaker structure.
● Free account: 30 minutes/month of full transcription with all features. No credit card required. Enough for ten 3-minute TikToks, two 15-minute podcasts, or one 30-minute lecture per month.
● Paid plans: start at $2/month for 200 minutes; details at /pricing.
● Max file size: 500 MB per upload.
● Max duration: 4 hours per video.
● URL fetches: we time out at 30 seconds. If a video takes longer to fetch (large Vimeo files, slow CDN), download it and upload the file instead.
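
If you are scripting uploads, a pre-flight check against the stated limits saves a failed transfer. The sketch below assumes "500 MB" means 500 × 1024 × 1024 bytes, which the page does not specify. The constant and function names are hypothetical.

```python
MAX_FILE_BYTES = 500 * 1024 * 1024   # assumption: 500 MB counted in binary megabytes
MAX_DURATION_SECONDS = 4 * 60 * 60   # 4 hours per video


def check_upload(size_bytes, duration_seconds):
    """Return a list of human-readable problems; an empty list means within limits."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 500 MB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("video is longer than 4 hours")
    return problems


# A 3-hour, 420 MB lecture recording passes both checks.
print(check_upload(420 * 1024 * 1024, 3 * 60 * 60))
```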

Roughly $0.01 per transcribed minute on the Pro plan: 10–25× cheaper than typical per-minute transcription services at $0.10–$0.25/min, and 150–250× cheaper than human transcription at $1.50–$2.50/min. Human transcription is still the standard for legal filings, broadcast captions requiring ADA-compliant certification, and other cases where a stamp of certification matters. For research, notes, SEO content, subtitles, and social clips, AI is fine.
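
The per-minute comparison is simple division, and the sketch below reproduces it from the prices quoted on this page ($2/month for 200 minutes, and the per-minute rates of the alternatives). The helper names are illustrative.

```python
PLAN_PRICE_USD = 2.00
PLAN_MINUTES = 200


def per_minute_rate(price_usd, minutes):
    """Effective cost per transcribed minute when the whole allowance is used."""
    return price_usd / minutes


def times_cheaper(other_rate, own_rate):
    """How many times more expensive the other per-minute rate is."""
    return other_rate / own_rate


own = per_minute_rate(PLAN_PRICE_USD, PLAN_MINUTES)
for label, rate in [
    ("per-minute AI service, low", 0.10),
    ("per-minute AI service, high", 0.25),
    ("human transcription, low", 1.50),
    ("human transcription, high", 2.50),
]:
    print(f"{label}: {times_cheaper(rate, own):.0f}x")
```

The effective rate only holds if the monthly allowance is fully used. A plan with unused minutes costs more per transcribed minute.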

This tool vs the YouTube-specific tool

We run two tools that share the same Whisper backend but differ in the front door:

Aspect	This page (/video-to-transcript)	YouTube tool (/tools/youtube-transcript)
Sources	File upload + TikTok / Instagram / Vimeo / Loom / direct URLs	YouTube URLs only
Engine	Whisper Large-v3	Whisper Large-v3 — plus caption-track fallback when YouTube already has a track
Typical speed (3-min video)	20–40 seconds	2–10 seconds if captions exist; 20–40 seconds if we fall back to Whisper
Accuracy	Whisper accuracy (95% on clean audio)	YouTube's own captions when available (~85%), Whisper accuracy on fallback
Cost to us	Full Whisper compute per video	Near-zero when using YouTube's own captions
Best for	Any video that isn't on YouTube	YouTube videos where speed matters more than a few accuracy points

Both tools produce the same downstream output (same editor, same export formats, same account). Paste a YouTube URL here by mistake and we'll suggest the YouTube tool — no data loss.

Common workflows

What people actually do with the transcript once it's generated:

Turn a Vimeo talk into a blog post

Paste the Vimeo URL, wait a few minutes (roughly 2–5 for a 20-minute talk at 4–10× real-time), download the DOCX. Edit into a 1,200-word article with H2 sections pulled from the transcript timestamps. This is the standard solo-creator workflow: one 20-minute video becomes one long-form post plus 3–5 social clips.

Extract quotes from an Instagram Reel or TikTok

Paste the Reel/TikTok URL, preview the first 200 characters free to confirm it's the right video, sign up to unlock the full text. Search the transcript for the phrase you remembered, copy the timestamp, cite it in your article or use it as a soundbite marker.

Subtitles for a Loom explainer you're sharing externally

Loom's own transcript is English-focused. Paste your Loom share URL, generate the transcript, export as SRT. Upload the SRT alongside the video wherever you host it (or bake it in with a video editor). Works for any of 99 languages Whisper supports.

Turn an MP4 lecture recording into study notes

Drop the MP4 into the uploader, generate the transcript, then open the AI summary tool (paid plans) to get a bullet-point outline of key points. Students search the transcript instead of scrubbing the video. Works well for 20–90 minute lectures.

Compare speaker turn-taking in a multi-speaker meeting recording

Upload the Zoom or Meet recording, let the diarization tag each speaker. In the editor, rename Speaker 1 → Alice, Speaker 2 → Bob, etc. Search for each name to see how often they spoke. Useful for hiring debriefs, interview transcripts, and multi-stakeholder meeting notes.
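
Counting turns can also be done on the exported TXT. The sketch below assumes the `[HH:MM:SS] Name: text` line format and treats a turn as an unbroken run of segments by the same speaker. The `speaker_turns` name and the sample lines are invented for the example.

```python
import re
from collections import Counter
from itertools import groupby

SPEAKER_RE = re.compile(r"^\[\d{2}:\d{2}:\d{2}\] ([^:]+): ")


def speaker_turns(text):
    """Count turns per speaker; consecutive segments by one speaker are one turn."""
    speakers = []
    for line in text.splitlines():
        match = SPEAKER_RE.match(line.strip())
        if match:
            speakers.append(match.group(1))
    # groupby collapses each consecutive run of the same name into a single key.
    return Counter(name for name, _run in groupby(speakers))


meeting = (
    "[00:00:03] Alice: Thanks for joining.\n"
    "[00:00:06] Alice: Let's start with the debrief.\n"
    "[00:00:11] Bob: Sure, I have two points.\n"
    "[00:00:14] Bob: First, the timeline.\n"
    "[00:00:19] Alice: Go ahead.\n"
)
print(speaker_turns(meeting))
```

Turn counts measure how often someone took the floor, not how long they spoke. For talk time, the segment timestamps would need to be summed instead.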

When to use a different tool

This page is the generic uploader — good for most cases. A few specialized tools handle their niche better:

If you need	Use this instead	Why
A step-by-step how-to guide	/how-to-transcribe-a-video	3 methods compared (AI tool, YouTube built-in, manual) with format-specific instructions for MP4, YouTube, TikTok, Vimeo, and Zoom recordings.
Just YouTube	/tools/youtube-transcript	Same Whisper engine, optimized for YouTube's caption API. Faster when captions already exist.
Just audio (no video wrapper)	/transcribe-audio-to-text	For MP3, M4A, WAV files — same accuracy, focused UX for audio use cases.
Just voicemail	/voicemail-to-text	Same engine, tuned for short-form phone audio. Handles Visual Voicemail exports.
Live meeting as it happens	/meeting-transcription	Real-time transcription for Zoom, Meet, Teams via the meeting bot. This page is for finished recordings.
Many videos at once	/bulk-transcription	Batch upload with a single click. Same per-minute rate as regular transcription.
A subtitle file for editing	/video-to-srt	Same transcript, packaged as an .srt file for Premiere / DaVinci Resolve / CapCut.

Frequently Asked Questions

How do I get a transcript from a video?

Two paths on this page. Drop an MP4, MOV, WebM, MKV, AVI, M4V, FLV, MPEG, or WMV file into the uploader, or paste a public video URL from TikTok, Instagram (Reels, IGTV, posts), Vimeo, Loom, or any direct video link. The audio track is extracted server-side (no ffmpeg on your end), OpenAI Whisper Large-v3 runs on it, and you see the first 200 characters immediately as a free preview. Sign up to unlock the full transcript, download in TXT, DOCX, SRT, VTT, or JSON, and edit inline. Total time from paste to preview is 20–40 seconds for a typical 3-minute video. YouTube videos are handled by a dedicated tool at /tools/youtube-transcript — same Whisper engine, optimized for YouTube's caption pipeline.

Can I convert a video to a transcript for free?

The preview is free with no signup — first 200 characters of any transcript, showing timestamps and speaker labels exactly as they will appear in the full output. A free VexaScribe account (email only, no credit card) unlocks 30 minutes of full transcription per month, plus the editor, TXT/DOCX/SRT/VTT/JSON export, and translation into 133 languages. That's enough for ten 3-minute TikToks, two 15-minute podcasts, or one 30-minute lecture per month. Beyond the free tier, paid plans start at $2/month for 200 minutes. Free alternatives without any signup include YouTube's built-in transcript (English-primary, ~85% accuracy, YouTube-only) and self-hosted OpenAI Whisper (Python + GPU required).

What video formats and URLs work?

File formats: MP4 (the universal default, H.264/H.265), MOV (QuickTime, iPhone), WebM (browser and screen recorders), MKV (Matroska, OBS), AVI (legacy Windows), M4V (iTunes), FLV (legacy Flash), MPEG, and WMV (Windows Media). Maximum file size is 500 MB and maximum duration is 4 hours per upload. Video URLs we handle directly: TikTok, Instagram (Reels, IGTV, posts), Vimeo, Loom, and any direct hosted video URL that ends in .mp4/.mov/.webm. Not supported by this tool: YouTube (use /tools/youtube-transcript, which is faster and cheaper on our end), Twitch VODs, live streams, Zoom cloud recordings behind a share link with a password, and private videos that require login (unlisted Vimeo videos that open from the URL alone do work). If a URL fails, download the video and drop the file instead; that always works.

How accurate is the transcript?

Roughly 95% (5% Word Error Rate) on clear speech from a good mic in a treated room, dropping to 80–88% on outdoor handheld phone recordings with wind and traffic noise. Content-type breakdown: single-speaker explainer 95–97%, podcast-style two-speaker 93–96%, Zoom recording with laptop mics 91–95%, lecture with ceiling mic 89–94%, TikTok/Reels with music underneath speech 82–90%, vlog with ambient noise 80–88%. Proper nouns (brand names, technical jargon, foreign names, product SKUs) miss more often (20–30% error rate) even on otherwise clean audio; the editor's search-and-replace fixes all instances in one pass. Budget 5–10 minutes of light editing per hour of clean studio video, and up to 20–30 minutes per hour for noisy vlogs, before publishing.
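
Word Error Rate is the word-level edit distance between a reference transcript and the machine output, divided by the number of reference words. "Accuracy" here is one minus that figure. The sketch below is a generic textbook implementation for checking your own samples, not the evaluation code behind the numbers on this page. It lowercases and splits on whitespace only, so punctuation differences count as errors.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Classic dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate(
    "whisper only sees the audio track",
    "whisper only seas the audio track",
)
print(f"WER {wer:.2%}, accuracy {1 - wer:.2%}")
```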

How is this different from your YouTube transcript tool?

Same Whisper Large-v3 backend, different front door. The YouTube tool at /tools/youtube-transcript is optimized for YouTube's caption API — when YouTube already has a caption track (creator-uploaded or auto-generated), we can pull it in ~2 seconds without running Whisper. That's faster and cheaper on our infrastructure. This page is for everything else: any video file on your machine, and any URL that isn't YouTube — TikTok, Instagram, Vimeo, Loom, and direct hosted videos. Both pages let you preview free and unlock with the same free account. If you paste a YouTube URL here by mistake, the tool will suggest the YouTube-specific page — no data loss.

Does the transcript include timestamps and speaker labels?

Every transcript includes segment-level timestamps by default (a timestamp every few seconds at natural sentence breaks) and speaker diarization for up to eight distinct speakers, tagged Speaker 1, Speaker 2, Speaker 3, and so on. You can rename speakers in the editor — the rename applies across the entire transcript in one pass. Word-level timestamps (every word tagged with a start time) are available on paid plans and are useful for building custom video editors, generating YouTube chapters from transcript search, or exporting subtitle files with frame-accurate timing. The default export formats preserve timing: SRT and VTT are caption-file ready for video editors, TXT and DOCX bracket timestamps inline for readability, JSON contains the raw timing array for developers.

What languages are supported?

99 languages via OpenAI Whisper Large-v3. Language auto-detects from your browser preference — so a Turkish speaker sees Turkish pre-selected in the dropdown, a Spanish speaker sees Spanish, and so on. This matters more than it sounds: Supadata's caption fallback can pick a random language when a video has multiple auto-generated caption tracks, so pre-selecting the actual language avoids the classic 'why is my English video's transcript in Spanish' bug. Override manually if the video is in a different language than your browser locale. Non-Latin scripts (Chinese, Japanese, Arabic, Hindi, Thai) render cleanly in the transcript editor and export formats. Right-to-left scripts (Arabic, Hebrew) preserve reading direction in DOCX and TXT exports.

How long does it take to transcribe a 30-minute video?

For a file upload: 3–6 minutes of AI processing (4–10× real-time depending on server load) plus ~30 seconds of audio extraction from the video container. For a URL paste (TikTok, Instagram, Vimeo, Loom): add 5–15 seconds for our server to fetch the video before extraction starts. A typical 30-minute podcast video finishes end-to-end in 4–7 minutes and lands in your dashboard. You can close the browser tab — we email you when the transcript is ready. For comparison, a professional human transcription service like Rev takes 12–24 hours turnaround at $1.25–$2.50 per minute ($37.50–$75.00 for a 30-minute video). Human review is the standard for legal/broadcast use; AI is fine for research, notes, SEO content, subtitles, and social clips.
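
The timing figures follow from the real-time factor: processing time is the audio length divided by the speed multiple. The sketch below does that arithmetic for the AI step only, using the 4–10× range quoted on this page. The quoted ranges are rounded, so the raw numbers at the slow end come out slightly higher. The function name is illustrative.

```python
def processing_window(duration_minutes, slowest_speed=4.0, fastest_speed=10.0):
    """Return (best_case_seconds, worst_case_seconds) for the AI step only."""
    audio_seconds = duration_minutes * 60
    return audio_seconds / fastest_speed, audio_seconds / slowest_speed


for minutes in (3, 30):
    best, worst = processing_window(minutes)
    print(f"{minutes}-minute video: {best:.0f} to {worst:.0f} seconds of processing")
```

Audio extraction and, for URLs, the fetch step come on top of this window.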

Can I edit the transcript after it's generated?

Yes, the full editor is included on the free plan. It offers search-and-replace with case-sensitivity toggles, per-segment editing, speaker rename (applies across the entire transcript), paragraph splitting and merging, timestamp adjustment, and inline notes. Click any word to jump to that moment in the video for verification. The editor autosaves, so nothing is lost if the browser closes. Editing a transcript for a 30-minute podcast to production quality typically takes 5–10 minutes: fix proper nouns via search-and-replace, rename Speaker 1/2 to real names, split any run-on paragraphs. Once you're happy, export as TXT, DOCX, SRT, VTT, or JSON.

Is my video private?

Uploads travel over TLS 1.2+ and are stored encrypted at rest in AWS eu-west-1 (Ireland). We do not train AI models on your video, audio, or transcript. We do not sell user data. You can delete the file and the transcript from your dashboard at any time; account deletion is self-serve. For sensitive content (internal meetings, client interviews, confidential recordings), delete the file as soon as you've downloaded the transcript. Editorial and privacy details are published at /about/editorial-standards and /privacy. For URL-based transcriptions (TikTok, Instagram, Vimeo, Loom), we fetch the video from the public source, process it, and delete the intermediate copy within one hour of the transcript completing.

How do I generate a transcript from a video?

Three ways depending on where the video lives. (1) File on your device: drop the MP4, MOV, WebM, MKV, AVI, M4V, FLV, MPEG, or WMV into the uploader here. Whisper Large-v3 runs on the extracted audio and returns a timestamped transcript in 3-6 minutes for a 30-minute video. (2) Public URL: paste a TikTok, Instagram Reel/post, Vimeo, Loom, or direct video URL — same processing pipeline, adds 5-15 seconds to fetch the video first. (3) YouTube: use the dedicated /tools/youtube-transcript page — it's faster because we pull existing YouTube captions when available. All three methods produce the same output shape: transcript with speaker labels, timestamps, and export to TXT/DOCX/SRT/VTT/JSON.

How do I generate a transcript from an MP4 file?

Drop the MP4 into the uploader above. The video container is opened server-side (no ffmpeg on your machine), the audio track is extracted, and Whisper Large-v3 transcribes it. Files up to 500 MB and 4 hours work directly; for larger files, plans include upload-URL support for direct-from-cloud transfers. Output is the same as any other video format: speaker labels, timestamps, editable transcript, and export to TXT/DOCX/SRT/VTT/JSON. If your MP4 has multiple audio tracks (some Zoom recordings do), we use the first English/target-language track by default — the editor lets you re-process on a different track if needed.
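
For readers who self-host Whisper and need to do the extraction step themselves, the sketch below builds an ffmpeg command that pulls one audio track out of a video container as 16 kHz mono WAV, the sample rate and channel layout Whisper works with internally. It only constructs the argument list. Whether VexaScribe's server uses ffmpeg this way is not stated on this page, and the function name is hypothetical.

```python
def ffmpeg_extract_args(source_path, output_wav, audio_track=0):
    """Build an ffmpeg argument list that extracts one audio track as 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-i", source_path,
        "-map", f"0:a:{audio_track}",  # pick the Nth audio stream of the first input
        "-ac", "1",                    # downmix to mono
        "-ar", "16000",                # resample to 16 kHz
        "-c:a", "pcm_s16le",           # 16-bit PCM, the usual WAV payload
        output_wav,
    ]


# Second audio track of a multi-track recording:
print(" ".join(ffmpeg_extract_args("meeting.mp4", "meeting.wav", audio_track=1)))
```

Passing the list to `subprocess.run(args, check=True)` would run it, provided ffmpeg is installed and on the PATH.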

Methodology and disclosure

Accuracy numbers: derived from internal testing across ~800 videos spanning single-speaker explainers, podcast recordings, Zoom meetings, TikTok/Instagram Reels, lectures, and vlogs. Whisper Large-v3 baseline accuracy is consistent with published results on the Hugging Face Open ASR Leaderboard (verified July 2026).

Whisper model documentation: OpenAI Whisper announcement and the Whisper GitHub repo (verified July 2026).

Human transcription rate reference: Rev.com published pricing at $1.25/min for AI + human review, ~$1.50/min for standard human transcription (verified July 2026). Numbers may drift; the order of magnitude is the point.

Editorial standards: we don't claim "100% accuracy" or "free forever." The preview is free, the free account has a monthly cap, and paid plans start at $2/mo. Full disclosure at /about/editorial-standards.

Ready to try it?

Free preview above — no signup. If you like what you see, a free account gets you 30 minutes/month with the full editor and every export format.

Create a free account →
See all plans