Key takeaways
- AI generates .srt subtitle files from raw video, with no manual audio extraction step.
- Accepts MP4, MOV, MKV, WebM, and AVI directly; max file size is 5 GB on VexaScribe.
- Output is timestamped, line-broken, and UTF-8 encoded, so it drops into YouTube, Premiere, DaVinci, VLC, and CapCut without conversion.
- Accuracy: 92-97% on clean audio; review proper nouns and technical terms before publishing.
- Cost: $0.20-$0.60 per video hour with AI; $90-$300 per video hour with human captioning (Rev, 3PlayMedia).
- Processing: 5-15 minutes per video hour with AI; 12-48 hours turnaround with human captioning.
- Diarization is optional: useful for interviews and panels, irrelevant for single-speaker explainers.
How to convert video to SRT (4 steps)
1. Upload the video file
MP4, MOV (iPhone default), MKV, WebM, or AVI, up to 5 GB per file (approximately 8-10 hours of compressed video). Audio is extracted from the video container automatically, so no manual conversion to MP3 or WAV is required. The free trial accepts the first 30 minutes of any file.
2. Choose source language and diarization
Select the source language from 99 supported languages, or use auto-detect for clean monolingual audio. Toggle speaker diarization on for multi-speaker video (interviews, panels, podcasts); it labels speakers as Speaker 1, Speaker 2, and so on, for renaming later. Diarization is included on every paid plan with no tier gating.
3. Wait for processing
AI runs at 4-10× real-time. A 30-60 minute video processes in 5-15 minutes. VexaScribe emails you when the SRT is ready. While waiting, you can queue additional video uploads, which is useful for batch caption generation across a YouTube channel or course.
4. Download the .srt and review
Download the .srt file (UTF-8 encoded, timestamped, line-broken to ~42 chars/line). Proofread proper nouns, brand names, and technical terms (5-15 minutes per video hour). Drop the file into YouTube Studio (Subtitles → Upload file), Premiere Pro (Captions panel → Import), DaVinci Resolve, CapCut, or VLC. DOCX and TXT exports are also available for downstream workflows.
Supported video formats
VexaScribe accepts the most common video containers directly — no manual audio extraction step. The platform extracts the audio track using ffmpeg internally and routes the audio through the Whisper Large-v3 transcription pipeline.
MP4 (H.264 / H.265)
Most common video format — YouTube downloads, exported from any editor, phone recordings. Universal compatibility.
MOV (QuickTime)
iPhone default camera output, Apple ecosystem. Same H.264/H.265 codecs as MP4 in a different container.
MKV (Matroska)
Open-source container favored by editors and torrent ecosystems. Supports multi-track audio.
WebM
Web video standard, VP8/VP9/AV1 codecs. Common for HTML5 video and screen recordings (OBS WebM output).
AVI (legacy)
Legacy Windows container. Older recordings, archived broadcast captures. Still supported.
Audio formats also accepted
MP3, WAV, M4A, FLAC, OGG — for audio-only transcription with SRT timestamps. See transcribe audio to text.
Max file size: 5 GB per upload — covers approximately 8-10 hours of typical compressed video at 720p-1080p. For longer files, split with a free tool like LosslessCut before uploading.
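Self-hosted Whisper users handle the audio step themselves. A common approach is to extract the audio first with ffmpeg -i input.mp4 -vn -acodec libmp3lame audio.mp3, then run whisper audio.mp3 --output_format srt. The open-source whisper CLI decodes its input through ffmpeg, so it can usually read common video containers directly as well; extracting first mainly keeps the working file small.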
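The two self-hosted commands can be chained from a short script. This is a minimal sketch, assuming ffmpeg and the open-source whisper CLI are both installed and on PATH; the helper names (`ffmpeg_extract_cmd`, `whisper_srt_cmd`, `video_to_srt`) are illustrative and not part of either tool.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_cmd(video: Path, audio: Path) -> list[str]:
    """Build the ffmpeg command from the text: drop the video stream (-vn), encode MP3."""
    return ["ffmpeg", "-i", str(video), "-vn", "-acodec", "libmp3lame", str(audio)]


def whisper_srt_cmd(audio: Path) -> list[str]:
    """Build the whisper CLI command that requests SRT output."""
    return ["whisper", str(audio), "--output_format", "srt"]


def video_to_srt(video: Path) -> None:
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    audio = video.with_suffix(".mp3")  # hypothetical naming choice: same stem as the video
    subprocess.run(ffmpeg_extract_cmd(video, audio), check=True)
    subprocess.run(whisper_srt_cmd(audio), check=True)
```

Keeping the command builders separate from the runner makes the exact arguments easy to inspect or log before anything is executed.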
What is an SRT file?
SRT (SubRip Subtitle) is the universal subtitle format — a plain UTF-8 text file with numbered subtitle blocks. Each block contains an index, a timestamp range, and one or more lines of text. Universally supported: YouTube, Vimeo, Premiere Pro, DaVinci Resolve, CapCut, VLC, Windows Media Player, every social platform, and every streaming service.
    1
    00:00:00,000 --> 00:00:03,500
    Welcome to the introduction lecture.

    2
    00:00:03,500 --> 00:00:07,200
    Today we'll cover three main topics.

    3
    00:00:07,200 --> 00:00:11,800
    Let's start with the first one.
Anatomy of a block:
- Index: sequential integer starting at 1. Must be unique and ordered.
- Timestamp range: HH:MM:SS,mmm --> HH:MM:SS,mmm. Note the comma (not a period) before the milliseconds, a common cross-platform compatibility issue.
- Text lines: one or two lines per block. Broadcast standard caps at 42 characters per line; mobile-friendly drops to 32-36.
- Blank line: separates each block. Critical: missing blank lines break parsing in some players.
- Encoding: UTF-8 without BOM is the safe default. UTF-8 with BOM works in most players but trips a few legacy ones.
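The block anatomy above can be expressed as a small generator. This is a sketch only: the cue tuples `(start_ms, end_ms, text)` and the function names are assumptions made for illustration, not a format any particular tool emits.

```python
def format_timestamp(ms: int) -> str:
    """Render milliseconds as HH:MM:SS,mmm (comma before the milliseconds)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(cues: list[tuple[int, int, str]]) -> str:
    """Turn (start_ms, end_ms, text) cues into SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):  # index starts at 1
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)  # the join adds the blank line between blocks
```

Writing the result with `open(path, "w", encoding="utf-8")` produces UTF-8 without a BOM, which matches the safe default described in the list.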
SRT vs VTT? SRT is the default for nearly every workflow. VTT (.vtt, WebVTT) is HTML5 video's native caption format with extra styling support (positioning, colors, font, regions) — use VTT only when those styling features matter for a custom HTML5 video player. For YouTube uploads, SRT is standard; YouTube auto-converts internally.
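For the cases where VTT is needed, the conversion from SRT is mostly mechanical: add the `WEBVTT` header and use a period instead of a comma before the milliseconds. A minimal sketch, assuming well-formed SRT input; it does not add any of VTT's styling features, and the function name is illustrative.

```python
import re

# Matches an SRT timing line such as "00:00:03,500 --> 00:00:07,200".
_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Add the WEBVTT header and switch the millisecond separator to a period."""
    cleaned = srt_text.lstrip("\ufeff")  # drop a leading BOM if one is present
    body = _TIMING.sub(r"\1.\2 --> \3.\4", cleaned)
    return "WEBVTT\n\n" + body
```

The numeric SRT indexes are left in place; WebVTT treats a line before the timing line as an optional cue identifier.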
Accuracy by video type
Whisper Large-v3 (the model VexaScribe uses) hits 95-97% accuracy on clean single-speaker video but degrades predictably with audio conditions. Plan your review time based on the type of video you're subtitling.
| Video type | AI accuracy | Review time |
|---|---|---|
| Single-speaker explainer / tutorial (clean) | 95-97% | 5-10 min/hr |
| Interview, 2 speakers (clean mic, treated room) | 92-96% | 10-15 min/hr |
| Podcast-style video (mic'd, treated room) | 95-97% | 5-10 min/hr |
| Webinar / Zoom recording | 91-95% | 10-15 min/hr |
| Lecture / classroom recording | 89-94% | 10-20 min/hr |
| Documentary with B-roll narration | 90-94% | 10-15 min/hr |
| Vlog (outdoor, ambient noise) | 80-88% | 20-30 min/hr |
| Heavily accented English (non-native speaker) | 82-90% | 15-25 min/hr |
Where AI consistently misses: proper nouns (brand names, product names, technical terms) at 20-30% error rate even on clean audio; numbers spelled vs digits ("twenty twenty six" vs "2026"); homophones (their/there/they're); rapid-fire counts and lists. Always proofread before publishing public-facing captions.
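To spot-check accuracy on your own footage, you can compare a hand-corrected transcript against the AI output. The sketch below computes word error rate (WER) with a word-level edit distance. The text does not define how its accuracy percentages are calculated, so reading accuracy as 1 minus WER is an assumption made here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

One wrong homophone in a five-word sentence gives a WER of 0.2, which illustrates why short clips with a few proper-noun misses can score far below the hour-level averages in the table.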
For accuracy methodology, see how accurate is Whisper? with WER benchmarks across LibriSpeech and FLEURS.
Cost: per-video and bulk math
AI subtitle generation is genuinely cheap — typically $0.20-$0.60 per video hour on consumer apps. Human captioning is 150-1,500× more expensive. The cost math only flips toward human if you specifically need verbatim, court-grade, or ADA-certified broadcast captions.
| Tool | Per video hour | Entry plan | Best for |
|---|---|---|---|
| VexaScribe | $0.20-$0.60 | $2/mo (200 min) | Most batch SRT workflows — multi-format upload, 99 languages, speaker labels |
| Rev AI | ~$6/hr ($0.10/min) | PAYG | Developer/API integration |
| Descript | ~$1.60 effective | $16/mo (10 hrs) | Video creators who edit and caption in the same tool |
| Self-hosted Whisper | $0 forever | n/a | Technical users with GPU + ffmpeg |
| Human captioning (Rev, 3PlayMedia) | $90-$300/hr | per-minute | Verbatim, court-grade, broadcast/ADA-certified |
Bulk math example: a YouTube channel publishing 4 videos/month at ~15 minutes each = 1 hour of new video monthly. SRT generation costs $0.20-$0.60 on AI versus $90-$300 with human captioning — a 150-1,500× difference. For a 40-episode course (~20 hours of video total), AI runs $4-$12 versus $1,800-$6,000 human.
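The bulk math can be reproduced with a few lines of arithmetic. The rates below are the per-hour ranges quoted in this section; the function names are illustrative.

```python
AI_RATE = (0.20, 0.60)       # dollars per video hour, from the table above
HUMAN_RATE = (90.0, 300.0)   # dollars per video hour, from the table above


def monthly_hours(videos: int, minutes_each: float) -> float:
    """Total video hours for a batch of equally long videos."""
    return videos * minutes_each / 60


def cost_range(hours: float, low_rate: float, high_rate: float) -> tuple[float, float]:
    """Low and high cost for the given hours, rounded to cents."""
    return (round(hours * low_rate, 2), round(hours * high_rate, 2))


channel = monthly_hours(4, 15)           # 1.0 hour per month
print(cost_range(channel, *AI_RATE))     # (0.2, 0.6)
print(cost_range(20, *AI_RATE))          # (4.0, 12.0) for the 20-hour course
print(cost_range(20, *HUMAN_RATE))       # (1800.0, 6000.0)
```

The 150-1,500× spread comes from the extremes of the two ranges: 90 / 0.60 at the narrow end and 300 / 0.20 at the wide end.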
For full cost analysis across the 14-tool transcription market, see how much does transcription cost? with verified 2026 pricing and an interactive calculator.
Multi-language SRT (99 languages)
VexaScribe supports SRT generation in 99 languages via Whisper Large-v3, including all major European, East Asian, and Middle Eastern languages — Spanish, French, German, Italian, Portuguese, Japanese, Korean, Mandarin, Arabic, Russian, Hindi, Turkish, Vietnamese, Polish, Dutch, Swedish, plus 83 more. Source language is auto-detected from the audio or manually selectable.
Workflow: SRT in two languages
- Upload the video, generate the source-language SRT (e.g. Spanish).
- Use the built-in translation widget to translate the transcript to a target language (133 languages supported as translation targets).
- Export the translated transcript as a second SRT — timestamps preserved from the source so both SRTs sync to the same video.
- Upload both .srt files to YouTube as separate language tracks; viewers pick from the captions menu.
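The reason both files stay in sync is that only the text lines change between them. The sketch below shows that idea outside any particular tool: it keeps each block's index and timing line and swaps in translated text. The `translated` list is a hypothetical input (one string per cue, however you obtained the translation), and this is not a description of how VexaScribe implements the step.

```python
import re


def retext_srt(source_srt: str, translated: list[str]) -> str:
    """Keep each block's index and timing lines; replace only the text lines."""
    normalized = source_srt.replace("\r\n", "\n").strip()
    blocks = [b for b in re.split(r"\n\s*\n", normalized) if b]
    if len(blocks) != len(translated):
        raise ValueError("need exactly one translated text per cue")
    out = []
    for block, text in zip(blocks, translated):
        index, timing = block.split("\n")[:2]  # first two lines of every block
        out.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(out)
```

Because translated sentences are often longer than the source, the second file may still need its line breaks rebalanced after the swap.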
Common workflows: Spanish-language creator generates Spanish SRT for native audience plus English SRT for international viewers. Japanese tutorial creator generates Japanese + English + Spanish SRTs for global reach. Documentary filmmaker generates SRTs in source language plus festival-circuit target languages (English, French, German). See also transcribe and translate audio.
Common SRT errors and fixes
Most SRT problems come from one of five issues. Here's how to recognize and fix each one.
Timestamps drift after editing the video
Cause. You cut or trimmed the source video after generating the SRT — all timestamps after the cut are now off by the duration removed.
Fix. Easiest: regenerate the SRT from the final cut. Manual fix: open the SRT in Subtitle Edit, select the affected section, use Adjust → Synchronization to shift timing.
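If regenerating is not an option, the manual shift can also be scripted. A minimal sketch, assuming a single removed section: every timestamp at or after the end of that section (measured on the original timeline) moves earlier by the removed duration. Cues that fall inside the removed section are left untouched and still need to be deleted by hand; the parameter names are illustrative.

```python
import re

_STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def _to_ms(match: re.Match) -> int:
    h, m, s, ms = (int(g) for g in match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def _fmt(total_ms: int) -> str:
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_after_cut(srt_text: str, cut_end_ms: int, removed_ms: int) -> str:
    """Move timestamps at or after cut_end_ms earlier by removed_ms."""
    def fix(match: re.Match) -> str:
        ms = _to_ms(match)
        return _fmt(ms - removed_ms) if ms >= cut_end_ms else match.group(0)

    # Only rewrite timing lines, so timestamps quoted in subtitle text stay as written.
    lines = [_STAMP.sub(fix, line) if "-->" in line else line
             for line in srt_text.split("\n")]
    return "\n".join(lines)
```

For example, removing the ten seconds between 0:50 and 1:00 corresponds to `shift_after_cut(text, cut_end_ms=60_000, removed_ms=10_000)`.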
Framerate mismatch (subtitles get progressively further out of sync)
Cause. The SRT was generated against one framerate (e.g. 23.976 fps) but applied to a video at a different rate (29.97 fps). Common when re-encoding for delivery.
Fix. Open in Subtitle Edit → Tools → Change frame rate. Pick the source and target rates; timestamps auto-recalculate.
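The recalculation behind a frame rate change is a single scale factor. The sketch below assumes the drift comes from a speed change, meaning the same frames now play at a different rate, so a moment that sat at `frame / source_fps` lands at `frame / target_fps`. That assumption and the function name are mine; check the result against your video before trusting it.

```python
def rescale_ms(ms: int, source_fps: float, target_fps: float) -> int:
    """Map a timestamp from the source frame rate's timeline onto the target's."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Same frame number, different playback rate: time scales by source / target.
    return round(ms * source_fps / target_fps)
```

Because the error is proportional to elapsed time, a cue near the start barely moves while a cue an hour in can shift by minutes, which matches the progressive drift described above.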
Garbage characters at the start of the file (ï»¿ or random symbols)
Cause. UTF-8 BOM (Byte Order Mark) at the file start. Some players misread it as visible characters on line 1, which can break parsing of the first cue.
Fix. Re-save the SRT without BOM. In VS Code: click the encoding indicator at the bottom right → Save with Encoding → UTF-8 (not UTF-8 with BOM). In Notepad++: Encoding → Convert to UTF-8 (the entry without BOM).
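The same fix can be applied in bulk from a script. A minimal sketch that removes the three BOM bytes (EF BB BF) if they are present and otherwise leaves the file alone; the function names are illustrative.

```python
import codecs
from pathlib import Path


def without_bom(raw: bytes) -> bytes:
    """Return the bytes unchanged unless they start with the UTF-8 BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def strip_bom_in_place(path: Path) -> None:
    """Rewrite one .srt file without a leading BOM."""
    path.write_bytes(without_bom(path.read_bytes()))
```

Working on raw bytes avoids re-encoding the rest of the file, so the subtitle text itself is not altered.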
Lines too long for mobile viewers
Cause. AI output may exceed broadcast guidelines (42 chars/line max, 32-36 for mobile). Lines wrap awkwardly on small screens.
Fix. Open in Subtitle Edit → Tools → Auto-balance lines. Or manually split long lines and adjust line break positions for readability.
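For a quick scripted pass, Python's standard `textwrap` module can re-wrap a cue to the 42-character guideline. This is a greedy wrap, not the balancing algorithm Subtitle Edit uses, and the function name is illustrative.

```python
import textwrap


def wrap_cue(text: str, max_chars: int = 42) -> list[str]:
    """Re-wrap one cue's text so that no line exceeds max_chars."""
    flattened = " ".join(text.split())  # collapse existing line breaks and extra spaces
    return textwrap.wrap(flattened, width=max_chars)
```

If the result has more than two lines, the cue is too long for one block and is better split into two cues with their own timestamps. Pass `max_chars=36` for the mobile-friendly limit.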
Special characters render as ? or boxes
Cause. File encoding is not UTF-8 — common when SRT was saved as ANSI or Latin-1, losing characters like é, ñ, ü, 漢.
Fix. Re-save as UTF-8 in your text editor. Always specify UTF-8 explicitly when exporting from any tool that asks.
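When the file was saved in a legacy encoding, it can be converted in a script as long as the original bytes are intact. A minimal sketch: valid UTF-8 is returned unchanged, anything else is decoded with a fallback encoding. The default of `cp1252` is a guess that fits many Windows "ANSI" files; a wrong guess produces wrong characters, and text that was already replaced by ? cannot be recovered.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return UTF-8 bytes: keep valid UTF-8 as-is, otherwise decode with the fallback."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")
```

Read the file with `Path.read_bytes()`, pass the result through `to_utf8`, and write it back with `Path.write_bytes()` to convert in place.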
Recommended free editor: Subtitle Edit (Windows) — handles framerate conversion, line balancing, encoding fixes, sync drift, and translation memory in one tool. For cross-platform editing, Aegisub. For quick edits, the built-in caption editors in YouTube Studio, Premiere Pro, and DaVinci Resolve all work.
Video to SRT vs alternatives
Our honest read: VexaScribe is the right pick for most batch SRT workflows (direct video upload, 99 languages, speaker labels, $2/mo entry). Other tools win specific lanes.
| Tool | Best for | Entry price | Direct video upload? |
|---|---|---|---|
| VexaScribe | Most batch SRT generation — direct video upload + 99 languages | $2/mo or 30-min free | Yes (MP4/MOV/MKV/WebM/AVI) |
| Descript | Video creators who edit and caption in the same tool | $16/mo (10 hrs) | Yes |
| Otter.ai | Live meeting captions (audio-first product) | $8.33/mo annual | No (audio-only ingest) |
| YouTube auto-captions | Free, English-primary, lower accuracy | $0 | n/a (post-upload only) |
| Self-hosted Whisper | Technical users at scale, free forever | $0 | Yes (with ffmpeg extraction) |
When to pick something other than VexaScribe. If you're editing the video and need captions integrated with the editor, Descript is the right call — the caption editor sits inside the video editor. If your channel is English-only and you don't care about accuracy quality, YouTube auto-captions are free and require zero workflow steps (the captions appear automatically after upload). If you have technical skills, a GPU, and high-volume needs, self-hosted Whisper is free forever — pay the setup cost once, run unlimited.
See also SRT generator (format deep dive) and transcription tool alternatives.
Embedding the SRT in your video
Once you have the .srt file, embedding it is straightforward. Quick reference for the most common destinations:
YouTube
YouTube Studio → Subtitles → select language → Upload file → choose "With timing" → pick the .srt. Captions appear within minutes. Upload multiple language SRTs for international viewers.
Premiere Pro
Window → Captions → click the menu icon → Import captions from file → select the .srt. Captions appear on the timeline as a separate caption track. Style and export with the video.
DaVinci Resolve
Edit page → right-click in the timeline → Import Subtitle → pick the .srt. Subtitles appear as a new track. Edit text and styling inline.
CapCut (mobile / desktop)
Captions → Import captions → select the .srt from your device. Adjust positioning, font, and color inside CapCut. Export the video with burned-in or soft captions.
VLC (playback only)
Place the .srt in the same folder as the video with the same filename (e.g. video.mp4 and video.srt) — VLC auto-loads it. Or Subtitle → Add Subtitle File manually.
For step-by-step embedding tutorials across platforms (including iPhone, Final Cut Pro, and burning subtitles permanently into video), see how to add subtitles to a video.
Frequently Asked Questions
How do I convert video to SRT?
Four steps. (1) Upload the video file (MP4, MOV, MKV, WebM, AVI up to 5 GB) to an AI transcription tool — VexaScribe extracts audio automatically, no manual conversion step. (2) Choose source language (auto-detect works for clean monolingual audio) and toggle speaker diarization on for multi-speaker video (interviews, panels, podcasts). (3) Wait 5-15 minutes per video hour — AI runs at 4-10× real-time. (4) Download the .srt file (UTF-8 encoded, timestamped, line-broken) and drop it into YouTube, Premiere, DaVinci Resolve, CapCut, or VLC. Total time from upload to ready-to-publish SRT: 10-25 minutes for a typical 30-60 minute video.
Can I convert video to SRT for free?
Yes, three options. (1) VexaScribe 30-minute free trial: one-time, no credit card, covers a single short video at production accuracy. (2) Self-hosted OpenAI Whisper: free forever with a GPU and Python skills (use ffmpeg to extract audio, then run the whisper command with --output_format srt). (3) YouTube auto-captions: upload your video to YouTube, let YouTube generate captions, then download them as SRT via Subtitle Edit or a browser extension (~85% English accuracy, English-primary). For ongoing video subtitle generation, paid plans start at $2/month covering 200 minutes (about 13 videos of 15 minutes each, or three to four hour-long videos).
What's the best AI tool to make SRT from video?
Depends on workflow. For batch SRT generation from finished videos with multi-format support and 99 languages: VexaScribe ($2-$20/mo, MP4/MOV/MKV/WebM/AVI direct upload, speaker diarization included on every plan). For video creators who edit and caption in the same tool: Descript ($16/mo, captions integrated with the video editor). For technical users at scale: self-hosted Whisper Large-v3 with ffmpeg (free forever, unlimited). For free auto-captions on English-primary content uploaded to YouTube: YouTube's built-in caption generator. Most non-technical creators use VexaScribe or Descript depending on whether they need the video editor in the same tool.
How accurate is AI-generated SRT from video?
95-97% accuracy on clean single-speaker explainer videos (treated room, good mic). Drops to 91-95% on Zoom/webinar recordings, 89-94% on classroom or lecture video, 90-94% on documentary narration, and 80-88% on outdoor/vlog content with ambient noise. Proper nouns — brand names, technical terms, product names — have 20-30% error rates even on otherwise clean audio. Plan 5-15 minutes of proofreading per video hour to fix proper nouns and adjust line breaks for readability before publishing. For YouTube and social media, this proofread is essential — broadcast-quality captions require 100% accuracy.
What video formats can I convert to SRT?
VexaScribe accepts MP4 (most common, H.264/H.265), MOV (QuickTime, iPhone default), MKV (Matroska), WebM (web video), and AVI (legacy). Audio formats are also accepted: MP3, WAV, M4A, FLAC, OGG. Max file size is 5 GB per upload, which covers approximately 8-10 hours of typical compressed video. The audio is extracted from the video container automatically; no manual conversion to MP3/WAV is required. Self-hosted Whisper users typically extract audio first with ffmpeg (ffmpeg -i input.mp4 -vn -acodec libmp3lame audio.mp3); the open-source whisper CLI also decodes its input through ffmpeg, so it can usually read common video containers directly.
Can I generate SRT in a language other than English?
Yes. VexaScribe supports 99 languages via Whisper Large-v3, including Spanish, French, German, Italian, Portuguese, Japanese, Korean, Mandarin, Arabic, Russian, Hindi, and 88 more. Source language is auto-detected or manually selectable. Translation to 133 target languages is included on every paid plan: generate the source-language SRT first, then translate to English (or any of the 133 supported languages) for a second SRT file. Common workflows: a Spanish creator generates a Spanish SRT for the native audience plus an English SRT for international viewers; a Japanese tutorial creator generates a Japanese SRT plus English and Spanish SRTs for global reach.
How long does it take to generate SRT from a 1-hour video?
5-15 minutes of processing time for AI transcription plus 5-15 minutes for review (proper nouns, line breaks, sync). Total end-to-end: 10-30 minutes for a 1-hour video. AI runs at 4-10× real-time depending on infrastructure. For comparison: human captioning via Rev or 3PlayMedia takes 12-48 hours turnaround at $90-$300 per video hour — almost never justified unless broadcast/ADA compliance requires verbatim certified captions. Self-hosted Whisper on a consumer GPU (RTX 3060 or better) processes a 1-hour video in 10-20 minutes locally, free.
Can I edit the SRT before publishing?
Yes — SRT files are plain UTF-8 text and open in any text editor (VS Code, Notepad++, TextEdit). For visual editing with sync preview, use free tools: Subtitle Edit (Windows, most powerful), Aegisub (cross-platform), or the built-in caption editors in YouTube Studio, Premiere Pro, and DaVinci Resolve. Common edits before publishing: split long lines (broadcast standard is 42 characters/line max, 32-36 for mobile), fix proper nouns the AI got wrong, adjust line timing for readability (~150-180 words per minute target), and add speaker labels for multi-speaker video. VexaScribe also exports DOCX and TXT for non-SRT downstream workflows.
Why are my SRT timestamps off?
Three common causes. (1) Framerate mismatch: the SRT was generated against one framerate (e.g. 23.976 fps) but applied to a video at a different rate (29.97 fps). Fix: open in Subtitle Edit → Tools → Change frame rate. (2) Editing drift: if you cut or trimmed the video after generating the SRT, all timestamps after the cut are off. Fix: regenerate the SRT from the final cut, or manually shift sections in Subtitle Edit (Adjust → Synchronization). (3) UTF-8 BOM issues: some players misread the byte order mark as visible characters on line 1, which can keep the first cue from parsing so captions appear to start late. Fix: re-save without BOM in your text editor (Save with Encoding → UTF-8, without BOM).
Should I use SRT or VTT for my video?
SRT for most cases. SRT (.srt) is universally supported — YouTube, Vimeo, Premiere, DaVinci, CapCut, VLC, Windows Media Player, and every social platform. VTT (.vtt) is HTML5 video's native caption format with extra styling support (positioning, colors, font, regions) — use VTT only when you specifically need those styling features on a custom HTML5 video player. For YouTube uploads, SRT is standard and YouTube auto-converts it internally. VexaScribe exports SRT by default; for VTT, generate SRT then convert with a free online tool or Subtitle Edit.
Methodology & disclosure
Verification window. Accuracy figures are derived from the original Whisper paper (Radford et al., OpenAI 2022) and the Open ASR Leaderboard (Hugging Face, current state as of May 2026). Pricing was verified against the VexaScribe, Descript, Otter.ai, Rev, and 3PlayMedia pricing pages between May 14 and May 26, 2026.
Conflict of interest. VexaScribe is our product. We've disclosed pricing for every comparable tool and honestly identified scenarios where competitors win — Descript for integrated video editing, YouTube auto-captions for free English-primary workflows, self-hosted Whisper for technical users at scale, human captioning for broadcast/ADA-certified output.
Inherited model accuracy. VexaScribe uses Whisper Large-v3 (Radford et al., OpenAI 2022) as the upstream ASR engine. Accuracy claims reflect upstream Whisper benchmarks plus our internal evaluation on creator-supplied video samples; we don't claim independent benchmark improvements over upstream Whisper.
What changed since last update? First publication, May 26, 2026. Future updates will be reflected in the "Verified" badge and datePublished/dateModified schema fields.
Editorial standards. Full disclosure policy at editorial standards.