Video to SRT & MP4 to SRT Converter — Create SRT Files Online

Key takeaways

• AI generates .srt subtitle files from raw video, with no manual audio extraction step.
• Accepts MP4, MOV, MKV, WebM, and AVI directly; max file size is 5 GB on VexaScribe.
• Output is timestamped, line-broken, and UTF-8 encoded, so it drops into YouTube, Premiere, DaVinci, VLC, and CapCut without conversion.
• Accuracy is 92-97% on clean audio; review proper nouns and technical terms before publishing.
• Cost: $0.20-$0.60 per video hour for AI versus $90-$300 per video hour for human captioning (Rev, 3PlayMedia).
• Processing: 5-15 minutes per video hour for AI versus 12-48 hours of human turnaround.
• Diarization is optional: useful for interviews and panels, irrelevant for single-speaker explainers.

How to convert video to SRT (4 steps)

1. Upload the video file
MP4, MOV (iPhone default), MKV, WebM, or AVI, up to 5 GB per file (approximately 8-10 hours of compressed video). Audio is extracted from the video container automatically, so no manual conversion to MP3 or WAV is required. The free trial accepts the first 30 minutes of any file.
2. Choose source language and diarization
Select the source language from 99 supported languages, or use auto-detect for clean monolingual audio. Toggle speaker diarization on for multi-speaker video (interviews, panels, podcasts); it labels speakers as Speaker 1, Speaker 2 so you can rename them later. Diarization is included on every paid plan with no tier gating.
3. Wait for processing
AI runs at 4-10× real-time, so a 30-60 minute video processes in roughly 5-15 minutes. VexaScribe emails you when the SRT is ready. While waiting, you can queue additional uploads, which is useful for batch caption generation across a YouTube channel or course.
4. Download the .srt and review
Download the .srt file (UTF-8 encoded, timestamped, line-broken to ~42 chars/line). Proofread proper nouns, brand names, and technical terms (5-15 min per video hour). Drop the file into YouTube Studio (Subtitles → Upload file), Premiere Pro (Captions panel → Import), DaVinci Resolve, CapCut, or VLC. DOCX and TXT exports are also available for downstream workflows.

Supported video formats

VexaScribe accepts the most common video containers directly — no manual audio extraction step. The platform extracts the audio track using ffmpeg internally and routes the audio through the Whisper Large-v3 transcription pipeline.

MP4 (H.264 / H.265)

Most common video format — YouTube downloads, exported from any editor, phone recordings. Universal compatibility.

MOV (QuickTime)

iPhone default camera output, Apple ecosystem. Same H.264/H.265 codecs as MP4 in a different container.

MKV (Matroska)

Open-source container favored by editors and torrent ecosystems. Supports multi-track audio.

WebM

Web video standard, VP8/VP9/AV1 codecs. Common for HTML5 video and screen recordings (OBS WebM output).

AVI (legacy)

Legacy Windows container. Older recordings, archived broadcast captures. Still supported.

Audio formats also accepted

MP3, WAV, M4A, FLAC, OGG — for audio-only transcription with SRT timestamps. See transcribe audio to text.

Max file size: 5 GB per upload — covers approximately 8-10 hours of typical compressed video at 720p-1080p. For longer files, split with a free tool like LosslessCut before uploading.

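Self-hosted Whisper users can extract audio first with ffmpeg -i input.mp4 -vn -acodec libmp3lame audio.mp3, then run whisper audio.mp3 --output_format srt. The openai-whisper CLI decodes its input through ffmpeg, so it can usually read the video file directly as well; extracting first mainly keeps the input small.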

What is an SRT file?

SRT (SubRip Subtitle) is one of the oldest and most widely supported subtitle formats. It originated with SubRip, a freeware Windows tool from the early 2000s that "ripped" subtitles from DVDs. The format was designed for simplicity: a plain text file (UTF-8 today) with numbered blocks, each containing an index, a timestamp range, and one or two lines of caption text. That simplicity is why it has outlasted most competitors: no binary encoding, no proprietary spec, no special renderer required. It's natively supported by YouTube, Vimeo, Premiere Pro, DaVinci Resolve, CapCut, VLC, Windows Media Player, and most social and streaming platforms.

1
00:00:00,000 --> 00:00:03,500
Welcome to the introduction lecture.

2
00:00:03,500 --> 00:00:07,200
Today we'll cover three main topics.

3
00:00:07,200 --> 00:00:11,800
Let's start with the first one.

Anatomy of a block:

→ Index: sequential integer starting at 1. Must be unique and ordered.
→ Timestamp range: HH:MM:SS,mmm --> HH:MM:SS,mmm. Note the comma (not period) before milliseconds, a common cross-platform compatibility issue.
→ Text lines: one or two lines per block. Broadcast standard caps at 42 characters per line; mobile-friendly drops to 32-36.
→ Blank line: separates each block. Critical, because missing blank lines break parsing in some players.
→ Encoding: UTF-8 without BOM is the safe default. UTF-8 with BOM works in most players but trips a few legacy ones.
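The block anatomy above can be read with a few lines of Python. This is a minimal sketch, not a full SRT parser: the names `Cue`, `to_ms`, and `parse_srt` are illustrative, and the code assumes well-formed input with comma-separated milliseconds and blank-line block separators.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start_ms: int
    end_ms: int
    text: str


# HH:MM:SS,mmm --> HH:MM:SS,mmm (comma before the milliseconds)
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timestamp fields to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text: str) -> List[Cue]:
    cues = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            raise ValueError(f"incomplete block: {block!r}")
        match = TIMING.fullmatch(lines[1].strip())
        if match is None:
            raise ValueError(f"bad timestamp line: {lines[1]!r}")
        fields = match.groups()
        cues.append(
            Cue(int(lines[0]), to_ms(*fields[:4]), to_ms(*fields[4:]), "\n".join(lines[2:]))
        )
    return cues
```

Working in integer milliseconds avoids floating-point rounding when cues are later shifted or rescaled.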

4 formatting mistakes that break SRT playback:

1. Period instead of comma in timestamps. 00:00:03.500 instead of 00:00:03,500. VLC and most desktop players tolerate this; YouTube Studio does not, and the upload fails silently with no error message.
2. Missing blank line between blocks. Some editors strip trailing blank lines on save. Result: the parser reads two blocks as one, producing doubled subtitles with wrong timing.
3. UTF-8 BOM at the start of the file. Some parsers treat the BOM (bytes EF BB BF) as a text character on line 1, so the first subtitle index is read as ï»¿1 instead of 1. Fix: save as UTF-8 without BOM (Notepad++ → Encoding → Convert to UTF-8 (without BOM)).
4. Overlapping timestamps. A block starts before the previous one ends. Some players merge the cues; others drop one. AI output produces this when VAD (voice activity detection) boundaries are too aggressive; VexaScribe enforces a 10 ms minimum gap between cues.
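The four mistakes above are mechanical enough to check automatically. The following is a rough sketch of such a check, not a complete validator; `check_srt` and its message strings are hypothetical, and the blank-line test assumes each timing line is preceded by exactly one index line.

```python
import re
from typing import List

TS = r"\d{2}:\d{2}:\d{2}[,.]\d{3}"
TIMING_LINE = re.compile(rf"^({TS}) --> ({TS})$")


def ts_to_ms(ts: str) -> int:
    """Parse HH:MM:SS,mmm (or the period variant) into milliseconds."""
    hms, ms = re.split(r"[,.]", ts)
    h, m, s = (int(part) for part in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(ms)


def check_srt(raw: bytes) -> List[str]:
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]
    lines = raw.decode("utf-8").splitlines()
    prev_end = None
    for n, line in enumerate(lines, start=1):
        match = TIMING_LINE.match(line.strip())
        if not match:
            continue
        if "." in line:
            problems.append(f"line {n}: period instead of comma in timestamp")
        # The line before the index line should be blank (or not exist).
        if n >= 3 and lines[n - 3].strip() != "":
            problems.append(f"line {n}: missing blank line before this block")
        start, end = ts_to_ms(match.group(1)), ts_to_ms(match.group(2))
        if prev_end is not None and start < prev_end:
            problems.append(f"line {n}: cue starts before the previous cue ends")
        prev_end = end
    return problems
```

An empty list means none of the four issues was detected; it does not guarantee that every player will accept the file.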

SRT vs other subtitle formats:

Format	Styling	Best for	Supported by YouTube
.srt (SRT)	Basic (bold, italic, color via HTML tags)	Universal — video editors, social platforms, streaming	Yes ★
.vtt (VTT/WebVTT)	Rich (position, size, line, region, CSS cues)	HTML5 video players needing positioned captions	Yes
.ass / .ssa (SSA)	Advanced (animations, karaoke timing, per-word fx)	Anime fansubs, karaoke, complex overlay effects	No
.ttml (TTML)	XML-based, full layout control	Broadcast delivery (EBU-TT-D), Netflix TTML	Partial

Verdict: Use SRT for 99% of workflows — video editors, YouTube, social media, streaming. VTT only when you need positioned captions on a custom HTML5 player. ASS/SSA for anime subtitles. TTML for broadcast delivery specs.

How VexaScribe builds readable cues

Transcription accuracy is one half of a usable SRT. The other half is cue quality — splitting long speaker turns into subtitles that read naturally on screen and stay in sync with speech. A 30-second monologue can't be one subtitle. A two-word reaction shouldn't flash on screen for a quarter of a second. VexaScribe handles this with a word-level cue splitter built specifically for editor-ready output.

● Word-accurate timing. Cue start and end times come from the engine's word-level timestamps, not linear interpolation across a segment. Subtitles appear when speech starts and disappear when it ends.
● 80-character / 5-second target per cue. The readable-web-subtitle range used by Descript, Sonix, and Vimeo. Slightly tighter than a full broadcast TV cue (two lines of 42 characters), looser than YouTube auto-captions. Hard ceiling of 10 seconds for compatibility with every standard player.
● Natural splitting points. The splitter prefers sentence-ending punctuation (. ! ?), then commas, then word boundaries — never mid-word.
● Dramatic pauses preserved. When a speaker pauses 2-3 seconds mid-thought (a beat in a speech, a breath in an audiobook), the cue holds across the silence rather than producing a sub-second flash followed by a blank screen.
● Speaker labels carry through. If diarization is on, every cue keeps the [SPEAKER_00] prefix, honoring per-word speaker changes within a segment when present.

The result: an SRT that drags into Premiere Pro, Final Cut Pro, DaVinci Resolve, or CapCut and works without manual cleanup. For deeper detail on the splitting algorithm and Netflix/BBC timing standards, see our SRT generator page.
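To make the splitting rules above concrete, here is a simplified sketch of a greedy word-level splitter. It is an illustration under stated assumptions, not VexaScribe's actual implementation: `split_cues`, the `(text, start, end)` word tuples, and the constants are hypothetical, and the sketch covers only the 80-character / 5-second target and the punctuation preference (no pause handling, speaker labels, or 10-second ceiling).

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_sec, end_sec)
Cue = Tuple[float, float, str]   # (start_sec, end_sec, text)

MAX_CHARS = 80
MAX_SECONDS = 5.0


def _fits(words: List[Word]) -> bool:
    text_len = len(" ".join(w[0] for w in words))
    duration = words[-1][2] - words[0][1]
    return text_len <= MAX_CHARS and duration <= MAX_SECONDS


def _best_cut(words: List[Word]) -> int:
    """Number of words to emit: prefer sentence ends, then commas, else all."""
    for marks in ((".", "!", "?"), (",",)):
        for i in range(len(words) - 1, -1, -1):
            if words[i][0].endswith(marks):
                return i + 1
    return len(words)


def split_cues(words: List[Word]) -> List[Cue]:
    cues: List[Cue] = []
    buf: List[Word] = []
    for word in words:
        # Flush from the buffer until the new word fits (or the buffer is empty).
        while buf and not _fits(buf + [word]):
            n = _best_cut(buf)
            chunk, buf = buf[:n], buf[n:]
            cues.append((chunk[0][1], chunk[-1][2], " ".join(w[0] for w in chunk)))
        buf.append(word)
    if buf:
        cues.append((buf[0][1], buf[-1][2], " ".join(w[0] for w in buf)))
    return cues
```

Because cuts happen only between words and cue times are taken from the first and last word of each chunk, the output never splits mid-word and stays aligned with the word-level timestamps.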

Accuracy by video type

Whisper Large-v3 (the model VexaScribe uses) hits 95-97% accuracy on clean single-speaker video but degrades predictably with audio conditions. Plan your review time based on the type of video you're subtitling.

Video type	AI accuracy	Review time
Single-speaker explainer / tutorial (clean)	95-97%	5-10 min/hr
Interview, 2 speakers (clean mic, treated room)	92-96%	10-15 min/hr
Podcast-style video (mic'd, treated room)	95-97%	5-10 min/hr
Webinar / Zoom recording	91-95%	10-15 min/hr
Lecture / classroom recording	89-94%	10-20 min/hr
Documentary with B-roll narration	90-94%	10-15 min/hr
Vlog (outdoor, ambient noise)	80-88%	20-30 min/hr
Heavily accented English (non-native speaker)	82-90%	15-25 min/hr

Where AI consistently misses: proper nouns (brand names, product names, technical terms) at 20-30% error rate even on clean audio; numbers spelled vs digits ("twenty twenty six" vs "2026"); homophones (their/there/they're); rapid-fire counts and lists. Always proofread before publishing public-facing captions.

For accuracy methodology, see how accurate is Whisper? with WER benchmarks across LibriSpeech and FLEURS.
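For readers who want to measure accuracy on their own footage, word error rate (WER) can be computed with a word-level edit distance. This is a minimal sketch; it assumes that the accuracy percentages in the table correspond roughly to 1 minus WER, which is a common convention but not something the table states, and it normalizes only case and whitespace.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1] / len(ref)
```

A single homophone error in a five-word sentence ("their" transcribed as "there") gives a WER of 0.2, which is why short clips with a few proper nouns can score far worse than the averages in the table.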

Cost: per-video and bulk math

AI subtitle generation is genuinely cheap — typically $0.20-$0.60 per video hour on consumer apps. Human captioning is 150-1,500× more expensive. The cost math only flips toward human if you specifically need verbatim, court-grade, or ADA-certified broadcast captions.

Tool	Per video hour	Entry plan	Best for
VexaScribe	$0.20-$0.60	$2/mo (200 min)	Most batch SRT workflows — multi-format upload, 99 languages, speaker labels
Rev AI	~$6/hr ($0.10/min)	PAYG	Developer/API integration
Descript	~$1.60 effective	$16/mo (10 hrs)	Video creators who edit and caption in the same tool
Self-hosted Whisper	$0 forever	n/a	Technical users with GPU + ffmpeg
Human captioning (Rev, 3PlayMedia)	$90-$300/hr	per-minute	Verbatim, court-grade, broadcast/ADA-certified

Bulk math example: a YouTube channel publishing 4 videos/month at ~15 minutes each = 1 hour of new video monthly. SRT generation costs $0.20-$0.60 on AI versus $90-$300 with human captioning — a 150-1,500× difference. For a 40-episode course (~20 hours of video total), AI runs $4-$12 versus $1,800-$6,000 human.
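The bulk math above is simple enough to script for other volumes. A small sketch using only the per-hour rates quoted in the table; the function names are illustrative.

```python
from typing import Tuple

AI_RATE = (0.20, 0.60)      # USD per video hour, from the table above
HUMAN_RATE = (90.0, 300.0)  # USD per video hour, from the table above


def cost_range(hours: float, rate: Tuple[float, float]) -> Tuple[float, float]:
    """Low and high cost in USD for the given number of video hours."""
    low, high = rate
    return (round(hours * low, 2), round(hours * high, 2))


def multiplier_range() -> Tuple[int, int]:
    """How many times more expensive human captioning is than AI."""
    # Smallest gap: cheapest human vs priciest AI. Largest gap: the reverse.
    return (round(HUMAN_RATE[0] / AI_RATE[1]), round(HUMAN_RATE[1] / AI_RATE[0]))
```

For the 20-hour course example, `cost_range(20, AI_RATE)` and `cost_range(20, HUMAN_RATE)` reproduce the $4-$12 and $1,800-$6,000 figures.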

For full cost analysis across the 14-tool transcription market, see how much does transcription cost? with verified 2026 pricing and an interactive calculator.

Multi-language SRT (99 languages)

VexaScribe supports SRT generation in 99 languages via Whisper Large-v3, including all major European, East Asian, and Middle Eastern languages — Spanish, French, German, Italian, Portuguese, Japanese, Korean, Mandarin, Arabic, Russian, Hindi, Turkish, Vietnamese, Polish, Dutch, Swedish, plus 83 more. Source language is auto-detected from the audio or manually selectable.

Workflow: SRT in two languages

Upload the video, generate the source-language SRT (e.g. Spanish).
Use the built-in translation widget to translate the transcript to a target language (133 languages supported as translation targets).
Export the translated transcript as a second SRT — timestamps preserved from the source so both SRTs sync to the same video.
Upload both .srt files to YouTube as separate language tracks; viewers pick from the captions menu.
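The key property of this workflow is that the second SRT reuses the source timing lines unchanged. A minimal sketch of that idea, assuming you already have one translated caption per source cue (the translation step itself is out of scope, and `retime_translation` is a hypothetical helper):

```python
from typing import List, Tuple

Cue = Tuple[str, str]  # (timing line, caption text)


def retime_translation(source_cues: List[Cue], translated: List[str]) -> str:
    """Build a second SRT that keeps the source timestamps and swaps the text."""
    if len(source_cues) != len(translated):
        raise ValueError("need exactly one translated caption per source cue")
    blocks = []
    for index, ((timing, _source_text), text) in enumerate(
        zip(source_cues, translated), start=1
    ):
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Joining with a newline leaves one blank line between blocks.
    return "\n".join(blocks)
```

Translated text is often longer than the source, so line lengths in the second file may need rebalancing even though the timing is identical.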

Common workflows: Spanish-language creator generates Spanish SRT for native audience plus English SRT for international viewers. Japanese tutorial creator generates Japanese + English + Spanish SRTs for global reach. Documentary filmmaker generates SRTs in source language plus festival-circuit target languages (English, French, German). See also transcribe and translate audio.

Common SRT errors and fixes

Most SRT problems come from one of five issues. Here's how to recognize and fix each one.

Timestamps drift after editing the video

Cause. You cut or trimmed the source video after generating the SRT — all timestamps after the cut are now off by the duration removed.

Fix. Easiest: regenerate the SRT from the final cut. Manual fix: open the SRT in Subtitle Edit, select the affected lines, and use Synchronization → Adjust all times to shift their timing by the duration removed.
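If you prefer to script the manual fix, the shift is a plain subtraction applied to every timestamp after the cut point. A minimal sketch; `shift_after` is a hypothetical helper, it does not remove cues that fell inside the deleted section, and it would also rewrite any timestamp-shaped string that happens to appear in caption text.

```python
import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _to_ms(match: re.Match) -> int:
    h, m, s, ms = (int(g) for g in match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def _fmt(total_ms: int) -> str:
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_after(srt_text: str, after_ms: int, delta_ms: int) -> str:
    """Shift every timestamp at or after after_ms by delta_ms (negative = earlier)."""
    def repl(match: re.Match) -> str:
        t = _to_ms(match)
        if t < after_ms:
            return match.group(0)
        return _fmt(max(0, t + delta_ms))

    return TS.sub(repl, srt_text)
```

For example, removing 10 seconds of video at the one-minute mark corresponds to `shift_after(text, 60_000, -10_000)`.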

Framerate mismatch (subtitles get progressively further out of sync)

Cause. The SRT was generated against one framerate (e.g. 23.976 fps) but applied to a video at a different rate (29.97 fps). Common when re-encoding for delivery.

Fix. Open in Subtitle Edit → Synchronization → Change frame rate. Pick the source and target rates; timestamps auto-recalculate.
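The recalculation behind a frame-rate change is a single scale factor, which is why the drift grows steadily over the length of the video. A sketch under the assumption that the video was retimed frame-for-frame (the same frames played at a different rate); `rescale_cues` is an illustrative name and the cues are (start, end) pairs in milliseconds.

```python
from typing import List, Tuple


def rescale_cues(
    cues: List[Tuple[int, int]], source_fps: float, target_fps: float
) -> List[Tuple[int, int]]:
    """Rescale cue times for video whose frames now play at target_fps."""
    # A frame shown at time t under source_fps appears at
    # t * source_fps / target_fps under target_fps.
    factor = source_fps / target_fps
    return [(round(start * factor), round(end * factor)) for start, end in cues]
```

With a 24 to 25 fps speed-up the factor is 0.96, so a cue at the one-minute mark moves 2.4 seconds earlier while a cue at the start barely moves.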

Garbage characters at the start of the file (ï»¿ or random symbols)

Cause. UTF-8 BOM (Byte Order Mark) at the file start. Some players misread it as a visible character on line 1, which can corrupt or drop the first cue.

Fix. Re-save the SRT without BOM. In VS Code: click the encoding in the bottom-right status bar → Save with Encoding → choose UTF-8 (not UTF-8 with BOM). In Notepad++: Encoding → Convert to UTF-8 (without BOM).
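The same fix can be scripted for a batch of files. A minimal sketch using the standard library's BOM constant; `strip_utf8_bom` is an illustrative name.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the file contents without a leading UTF-8 BOM (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

To apply it, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode so no other bytes are altered.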

Lines too long for mobile viewers

Cause. AI output may exceed broadcast guidelines (42 chars/line max, 32-36 for mobile). Lines wrap awkwardly on small screens.

Fix. Open in Subtitle Edit → Tools → Auto-balance lines. Or manually split long lines and adjust line break positions for readability.
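Line balancing can also be approximated in code: try every word boundary and keep the split whose two lines are closest in length while both stay under the limit. This is a rough sketch of that idea, not the algorithm Subtitle Edit uses; `balance_two_lines` is a hypothetical helper.

```python
def balance_two_lines(text: str, max_chars: int = 42) -> str:
    """Break a caption into at most two lines of similar length."""
    text = " ".join(text.split())  # collapse existing line breaks and extra spaces
    if len(text) <= max_chars:
        return text
    words = text.split(" ")
    best = None
    for i in range(1, len(words)):
        first, second = " ".join(words[:i]), " ".join(words[i:])
        if len(first) <= max_chars and len(second) <= max_chars:
            diff = abs(len(first) - len(second))
            if best is None or diff < best[0]:
                best = (diff, first + "\n" + second)
    if best is None:
        raise ValueError("text does not fit in two lines; split the cue instead")
    return best[1]
```

When the function raises, the caption is too long for a single cue at that width and should be split into two cues with separate timestamps.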

Special characters render as ? or boxes

Cause. File encoding is not UTF-8 — common when SRT was saved as ANSI or Latin-1, losing characters like é, ñ, ü, 漢.

Fix. Re-save as UTF-8 in your text editor. Always specify UTF-8 explicitly when exporting from any tool that asks.
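When many files need re-encoding, the conversion can be scripted. A minimal sketch that assumes the legacy file is Windows-1252 (a common "ANSI" code page); that fallback is a guess, `to_utf8` is an illustrative name, and characters that were already saved as literal question marks cannot be recovered.

```python
def to_utf8(data: bytes, fallback: str = "cp1252") -> bytes:
    """Return the subtitle bytes as UTF-8 without BOM."""
    try:
        # Already UTF-8; "utf-8-sig" also drops a leading BOM if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a legacy single-byte encoding.
        text = data.decode(fallback)
    return text.encode("utf-8")
```

If the output still shows the wrong characters, the source was probably in a different legacy encoding, and the `fallback` argument needs to match it.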

Recommended free editor: Subtitle Edit (Windows) — handles framerate conversion, line balancing, encoding fixes, sync drift, and translation memory in one tool. For cross-platform editing, Aegisub. For quick edits, the built-in caption editors in YouTube Studio, Premiere Pro, and DaVinci Resolve all work.

How to create an SRT file for a video

The 4-step workflow above is the tool version. Here's the version that answers "how to create an SRT file for a video" as a general question — including the DIY-with-Whisper path if you don't want to use a cloud tool.

Path A — Cloud AI (fastest, easiest)

Upload video to a cloud AI transcription tool (VexaScribe, Descript, TurboScribe, Sonix) OR paste a YouTube/direct URL.
Select source language (auto-detect works for clean monolingual audio) and toggle speaker diarization on for multi-speaker video.
Wait 5-15 minutes per video hour for AI processing.
Review the auto-generated .srt in the built-in editor: fix proper nouns (20-30% error rate on brand names and technical terms, even on clean audio), split lines over 42 characters, and adjust any timing that drifts.
Download the .srt (UTF-8 encoded). Upload to YouTube or import into Premiere/DaVinci/CapCut.

Path B — Self-hosted Whisper (free, technical)

Install Python 3.10+ and OpenAI Whisper: pip install openai-whisper
Extract audio from video with ffmpeg: ffmpeg -i input.mp4 -vn -acodec libmp3lame audio.mp3
Transcribe with SRT output: whisper audio.mp3 --model large-v3 --output_format srt
Whisper writes audio.srt to the current directory. Rename to match your video filename for auto-loading in VLC and most players.
Note: base Whisper doesn't include speaker diarization. Pair with pyannote.audio 3.1 or use WhisperX for combined ASR + diarization + word-level alignment.
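If you call Whisper from Python instead of the command line, you write the SRT yourself from the returned segments. The sketch below assumes segment dictionaries with "start", "end" (seconds), and "text" keys, which is the shape openai-whisper's transcribe() result uses; the helper names are illustrative, and the commented usage lines are not executed here.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    blocks = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{i}\n{timing}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# With openai-whisper installed, the segments would come from something like:
#   import whisper
#   result = whisper.load_model("large-v3").transcribe("audio.mp3")
#   srt_text = segments_to_srt(result["segments"])
```

Write the returned string to a .srt file with UTF-8 encoding and no BOM.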

Path C — YouTube auto-captions (free, English-primary)

Upload the video to YouTube as unlisted, wait 2-6 hours for YouTube auto-captions to generate, then in YouTube Studio go to Subtitles → download the .srt via the three-dot menu. Accuracy runs 82-92% for English, lower for other languages. Adequate for accessibility drafts; not adequate for broadcast-quality captions without heavy editing.

Video to SRT vs alternatives

We position VexaScribe honestly: it's the right pick for most batch SRT workflows (direct video upload, 99 languages, speaker labels, $2/mo entry). Other tools win specific lanes — here's the honest read.

Tool	Best for	Entry price	Direct video upload?
VexaScribe	Most batch SRT generation — direct video upload + 99 languages	$2/mo or 30-min free	Yes (MP4/MOV/MKV/WebM/AVI)
Descript	Video creators who edit and caption in the same tool	$16/mo (10 hrs)	Yes
Otter.ai	Live meeting captions (audio-first product)	$8.33/mo annual	No (audio-only ingest)
YouTube auto-captions	Free, English-primary, lower accuracy	$0	n/a (post-upload only)
Self-hosted Whisper	Technical users at scale, free forever	$0	Yes (with ffmpeg extraction)

When to pick something other than VexaScribe. If you're editing the video and need captions integrated with the editor, Descript is the right call — the caption editor sits inside the video editor. If your channel is English-only and you don't care about accuracy quality, YouTube auto-captions are free and require zero workflow steps (the captions appear automatically after upload). If you have technical skills, a GPU, and high-volume needs, self-hosted Whisper is free forever — pay the setup cost once, run unlimited.

Embedding the SRT in your video

Once you have the .srt file, embedding it is straightforward. Quick reference for the most common destinations:

YouTube

YouTube Studio → Subtitles → select language → Upload file → choose "With timing" → pick the .srt. Captions appear within minutes. Upload multiple language SRTs for international viewers.

Premiere Pro

Window → Captions → click the menu icon → Import captions from file → select the .srt. Captions appear on the timeline as a separate caption track. Style and export with the video.

DaVinci Resolve

Edit page → right-click in the timeline → Import Subtitle → pick the .srt. Subtitles appear as a new track. Edit text and styling inline.

CapCut (mobile / desktop)

Captions → Import captions → select the .srt from your device. Adjust positioning, font, and color inside CapCut. Export the video with burned-in or soft captions.

VLC (playback only)

Place the .srt in the same folder as the video with the same filename (e.g. video.mp4 and video.srt) — VLC auto-loads it. Or Subtitle → Add Subtitle File manually.
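For a folder of videos, the matching subtitle filename can be derived rather than typed. A tiny sketch using pathlib; `sidecar_srt_path` is an illustrative name.

```python
from pathlib import Path


def sidecar_srt_path(video_path: str) -> Path:
    """Return the .srt path with the same folder and base name as the video."""
    return Path(video_path).with_suffix(".srt")
```

Renaming a generated file to this path (for example with `Path.rename`) puts it where the player looks for it.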

For step-by-step embedding tutorials across platforms (including iPhone, Final Cut Pro, and burning subtitles permanently into video), see how to add subtitles to a video.

FAQ

How do I convert video to SRT?

Four steps. (1) Upload the video file (MP4, MOV, MKV, WebM, AVI up to 5 GB) — OR paste a YouTube, TikTok, Instagram, or direct video URL and skip the upload entirely (rolled out July 2026). VexaScribe extracts audio automatically, no manual conversion step. (2) Choose source language (auto-detect works for clean monolingual audio) and toggle speaker diarization on for multi-speaker video (interviews, panels, podcasts). (3) Wait 5-15 minutes per video hour — AI runs at 4-10× real-time. (4) Download the .srt file (UTF-8 encoded, timestamped, line-broken) and drop it into YouTube, Premiere, DaVinci Resolve, CapCut, or VLC. Total time from upload/URL to ready-to-publish SRT: 10-25 minutes for a typical 30-60 minute video.

How do I create an SRT file for a video?

Same 4-step workflow as above, stated as practical steps: (a) get the video into a transcription tool by uploading the file or pasting a URL to skip the download. (b) Let the AI transcribe with word-level timestamps (5-15 min for a 60-min video). (c) Review the auto-generated cues in the editor: fix proper nouns (brand names and technical terms have 20-30% error rates even on clean audio), split lines over 42 characters, and adjust any timing that's off. (d) Export as .srt (UTF-8, no BOM). The proofread step is what separates broadcast-quality SRT from raw AI output; plan 5-15 minutes per video hour for review.

What's the best AI video to SRT tool?

Depends on what you're optimizing for. For 99 languages + speaker labels + URL paste + $2-$20/mo pricing: VexaScribe. For SRT generation integrated with a full video editor: Descript ($16/mo Hobbyist). For unlimited annual-prepay generation at scale: TurboScribe Unlimited ($120/year). For technical users who want free forever: self-hosted OpenAI Whisper Large-v3 with the --output_format srt flag and ffmpeg to extract audio from video. Most non-technical creators pick VexaScribe or Descript depending on whether they need the video editor integration. Verified pricing July 2026.

Can I convert video to SRT for free?

Yes, three options. (1) VexaScribe 30-minute free trial: one-time, no credit card, covers a single short video at production accuracy. (2) Self-hosted OpenAI Whisper: free forever with a GPU and Python skills (use ffmpeg to extract audio, then the whisper command with --output_format srt). (3) YouTube auto-captions: upload your video to YouTube, let YouTube generate captions, then download the .srt from the Subtitles page in YouTube Studio (~85% English accuracy, English-primary). For ongoing video subtitle generation, paid plans start at $2/month covering 200 minutes (about 3 hour-long videos, or roughly 13 fifteen-minute ones).

What's the best AI tool to make SRT from video?

Depends on workflow. For batch SRT generation from finished videos with multi-format support and 99 languages: VexaScribe ($2-$20/mo, MP4/MOV/MKV/WebM/AVI direct upload, speaker diarization included on every plan). For video creators who edit and caption in the same tool: Descript ($16/mo, captions integrated with the video editor). For technical users at scale: self-hosted Whisper Large-v3 with ffmpeg (free forever, unlimited). For free auto-captions on English-primary content uploaded to YouTube: YouTube's built-in caption generator. Most non-technical creators use VexaScribe or Descript depending on whether they need the video editor in the same tool.

How accurate is AI-generated SRT from video?

95-97% accuracy on clean single-speaker explainer videos (treated room, good mic). Drops to 91-95% on Zoom/webinar recordings, 89-94% on classroom or lecture video, 90-94% on documentary narration, and 80-88% on outdoor/vlog content with ambient noise. Proper nouns — brand names, technical terms, product names — have 20-30% error rates even on otherwise clean audio. Plan 5-15 minutes of proofreading per video hour to fix proper nouns and adjust line breaks for readability before publishing. For YouTube and social media, this proofread is essential — broadcast-quality captions require 100% accuracy.

What video formats can I convert to SRT?

VexaScribe accepts MP4 (most common, H.264/H.265), MOV (QuickTime, iPhone default), MKV (Matroska), WebM (web video), and AVI (legacy). Audio formats also accepted: MP3, WAV, M4A, FLAC, OGG. Max file size 5 GB per upload, which covers approximately 8-10 hours of typical compressed video. The audio is extracted from the video container automatically; no manual conversion to MP3/WAV is required. Self-hosted Whisper users can extract audio first with ffmpeg (ffmpeg -i input.mp4 -vn -acodec libmp3lame audio.mp3); the openai-whisper CLI decodes its input through ffmpeg, so it can usually read the video file directly as well.

Can I generate SRT in a language other than English?

Yes. VexaScribe supports 99 languages via Whisper Large-v3, including Spanish, French, German, Italian, Portuguese, Japanese, Korean, Mandarin, Arabic, Russian, Hindi, and 88 more. Source language is auto-detected or manually selectable. Translation to 133 target languages is included on every paid plan: generate the source-language SRT first, then translate to English (or any of the 133 supported languages) for a second SRT file. Common workflows: Spanish creator generates Spanish SRT for native audience + English SRT for international viewers; Japanese tutorial creator generates Japanese SRT + English/Spanish for global reach.

How long does it take to generate SRT from a 1-hour video?

5-15 minutes of processing time for AI transcription plus 5-15 minutes for review (proper nouns, line breaks, sync). Total end-to-end: 10-30 minutes for a 1-hour video. AI runs at 4-10× real-time depending on infrastructure. For comparison: human captioning via Rev or 3PlayMedia takes 12-48 hours turnaround at $90-$300 per video hour — almost never justified unless broadcast/ADA compliance requires verbatim certified captions. Self-hosted Whisper on a consumer GPU (RTX 3060 or better) processes a 1-hour video in 10-20 minutes locally, free.

Can I edit the SRT before publishing?

Yes — SRT files are plain UTF-8 text and open in any text editor (VS Code, Notepad++, TextEdit). For visual editing with sync preview, use free tools: Subtitle Edit (Windows, most powerful), Aegisub (cross-platform), or the built-in caption editors in YouTube Studio, Premiere Pro, and DaVinci Resolve. Common edits before publishing: split long lines (broadcast standard is 42 characters/line max, 32-36 for mobile), fix proper nouns the AI got wrong, adjust line timing for readability (~150-180 words per minute target), and add speaker labels for multi-speaker video. VexaScribe also exports DOCX and TXT for non-SRT downstream workflows.

Why are my SRT timestamps off?

Three common causes. (1) Framerate mismatch: the SRT was generated against one framerate (e.g. 23.976 fps) but applied to a video at a different rate (29.97 fps). Fix: open in Subtitle Edit → Synchronization → Change frame rate. (2) Editing drift: if you cut or trimmed the video after generating the SRT, all timestamps after the cut are off. Fix: regenerate the SRT from the final cut, or shift the affected lines in Subtitle Edit (Synchronization → Adjust all times). (3) UTF-8 BOM issues: some players misread the byte order mark as a character on line 1 and mis-parse the first cue. Fix: re-save without BOM in your text editor (Save with encoding → UTF-8 without BOM).

Should I use SRT or VTT for my video?

SRT for most cases. SRT (.srt) is universally supported — YouTube, Vimeo, Premiere, DaVinci, CapCut, VLC, Windows Media Player, and every social platform. VTT (.vtt) is HTML5 video's native caption format with extra styling support (positioning, colors, font, regions) — use VTT only when you specifically need those styling features on a custom HTML5 video player. For YouTube uploads, SRT is standard and YouTube auto-converts it internally. VexaScribe exports SRT by default; for VTT, generate SRT then convert with a free online tool or Subtitle Edit.
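For simple files the SRT to VTT conversion is small enough to script: WebVTT needs a WEBVTT header and uses a period instead of a comma before the milliseconds. A minimal sketch that ignores styling tags and positioning; `srt_to_vtt` is an illustrative name.

```python
import re

TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert basic SRT text to WebVTT."""
    # Drop a leading BOM and normalize Windows line endings.
    body = srt_text.lstrip("\ufeff").replace("\r\n", "\n")
    # Swap the millisecond separator on timing lines only.
    body = TIMING.sub(r"\1.\2 --> \3.\4", body)
    return "WEBVTT\n\n" + body
```

The numeric index lines can stay, since WebVTT allows an optional identifier line before each cue's timing line.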

What is an SRT file and how do I open it?

An SRT file is a plain-text subtitle file in SubRip format. It contains numbered subtitle blocks; each block has an index number, a timestamp range (HH:MM:SS,mmm --> HH:MM:SS,mmm), and one or two lines of text, and blocks are separated by a blank line. The file is encoded as UTF-8. To open an SRT: any text editor works (Notepad, VS Code, TextEdit). To view it synced with video: VLC (Subtitle → Add Subtitle File), YouTube Studio (upload as a .srt caption track), or Subtitle Edit for visual editing with sync preview. Most video players auto-load an SRT if it has the same filename as the video in the same folder.

How do I convert MP4 to SRT?

Upload the MP4 to VexaScribe and download the .srt output. The workflow: (1) Upload the MP4 file (up to 5 GB) or paste a YouTube/video URL. (2) Audio is extracted automatically — no manual MP3 conversion needed. (3) Whisper Large-v3 transcribes with word-level timestamps. (4) Download the .srt file, ready for YouTube Studio, Premiere Pro, DaVinci Resolve, or CapCut. A 30-minute MP4 produces an SRT in 5-10 minutes. For self-hosted users: ffmpeg -i input.mp4 -vn audio.mp3 followed by whisper audio.mp3 --output_format srt produces the equivalent output locally for free.

Can I use my SRT file for YouTube?

Yes. YouTube natively imports .srt files. Go to YouTube Studio → Select video → Subtitles → Add language → Upload file → select .srt. Your uploaded track then appears in the captions menu, so viewers are not limited to YouTube's auto-generated captions. Benefits: higher accuracy (AI + proofread vs YouTube's raw auto-captions), keyword indexing of the full transcript text for search ranking, and accessibility compliance. If your SRT has speaker labels (e.g. [HOST] or [GUEST]), YouTube displays them inline, which is useful for interview and panel videos.

Methodology & disclosure

Verification window. Accuracy figures derived from the Whisper paper (Radford et al., OpenAI 2022) and the Open ASR Leaderboard (Hugging Face, current state as of May 2026). Pricing verified against VexaScribe, Descript, Otter.ai, Rev, and 3PlayMedia pricing pages between May 14 and May 26, 2026.

Conflict of interest. VexaScribe is our product. We've disclosed pricing for every comparable tool and honestly identified scenarios where competitors win — Descript for integrated video editing, YouTube auto-captions for free English-primary workflows, self-hosted Whisper for technical users at scale, human captioning for broadcast/ADA-certified output.

Inherited model accuracy. VexaScribe uses Whisper Large-v3, a later checkpoint of the model introduced in Radford et al. (OpenAI 2022), as the upstream ASR engine. Accuracy claims reflect upstream Whisper benchmarks plus our internal evaluation on creator-supplied video samples; we don't claim independent benchmark improvements over upstream Whisper.

What changed since last update? First publication, May 26, 2026. Future updates will be reflected in the "Verified" badge and datePublished/dateModified schema fields.

Editorial standards. Full disclosure policy at editorial standards.

