Bulk Transcription — Flat-Rate AI for Agencies, Labs & Podcast Catalogs

The essentials

● Up to 50 files per batch, mixed audio and video formats. No daily cap beyond your plan's monthly minutes.
● Whisper Large-v3 on specialized ML inference infrastructure: 92–95% accuracy on Tier 1 languages, 99 languages total.
● ZIP delivery with original filenames preserved + CSV manifest of per-file metadata (duration, language, speaker count).
● Flat-rate $2–$20/month. At 100h/month: $20 vs $1,500 Rev AI vs $11,940 Rev Human. For pure-API consumers, AssemblyAI batch still wins at $15.
● Storage in AWS eu-west-2 (London). No AI training on user data. Inference partner disclosed in DPA. Full data residency and CLOUD Act caveat in the compliance section below.
● Honest scope: UI-first today. Drag-and-drop, dashboard, ZIP download. No programmatic API yet. If you need one, AssemblyAI or Deepgram are stronger fits.
● DPA + NDA available for research labs (IRB), legal firms (privileged data), and corporate procurement.

Who actually needs bulk transcription

“Bulk transcription” is a B2B-leaning query; almost no consumer searches for it. The six personas below cover ~95% of real bulk customers. Each has a distinct pain that shapes which tool wins.

Content / marketing agency

5–30 client podcasts or videos per week

Per-minute pricing kills margin; manual one-by-one uploads burn the producer's day; each client needs branded DOCX + SRT deliverables.

Academic / UX research team

Just finished a study with 50–200 interviews

Needs verbatim transcripts, consistent speaker labels, NVivo / ATLAS.ti / MAXQDA import, IRB-compliant data residency, one-shot ZIP delivery.

Legal / paralegal team

Deposition or discovery batches

Confidentiality (no AI training), retention controls, accurate timestamps, DPA + NDA on file before any data moves.

Media archive / newsroom

Years of interviews, broadcasts, podcasts to digitize

Wants to make a back catalog searchable, EU residency for European partnerships, predictable monthly cost rather than per-minute archive billing.

B2B product builder

Building an app where transcription is one feature

Per-minute API math, webhook reliability, predictable monthly cap. Honest note: if you need a true bulk API today, AssemblyAI ($0.15/hr) or Deepgram Nova-3 ($0.46/hr before diarization) are stronger fits. VexaScribe bulk is UI-first.

Educator / course producer

Library of lectures, webinars, or training sessions

Accuracy on technical jargon, batch SRT export for WCAG/EAA accessibility compliance, multiple language tracks for international cohorts.

Across all six personas, the underlying pain is the same, just phrased differently: per-minute pricing destroys budget predictability as volume grows, one-by-one uploads waste hours, compliance documentation is needed for procurement, and consistent output matters for downstream automation. Bulk transcription is not a discount feature; it's a workflow feature.

How VexaScribe bulk works (3 steps)

1. Drag a folder
Select up to 50 files at once from your desktop, or drop a folder onto the upload area. Mixed formats are fine: MP3 + M4A + MP4 + WAV in the same batch. Audio is extracted from video automatically.
2. Pick languages, formats, options
Auto-detect language per file (recommended) or force a single language. Choose one or several output formats: TXT, DOCX, SRT, VTT, PDF, JSON. Toggle diarization on/off.
3. Walk away, return to a ZIP
Processing happens in parallel. Status dashboard shows per-file progress. When complete, download a single ZIP with original filenames preserved plus a CSV manifest. Partial-batch download is supported if some files fail.

Typical processing speed: 5–10 minutes per hour of audio when GPU capacity is available. A 50-file batch of one-hour interviews completes in roughly 60–90 minutes when parallelized. Processed one file at a time at the same speed, those 50 hours of audio would take roughly 4–8 hours.
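To make the parallelism arithmetic concrete, here is a minimal sketch that estimates wall-clock time for a batch. The number of parallel slots (`workers=5`) is purely an assumption for illustration; the page does not state how many files are processed at once, and the greedy scheduling here is a simplification, not a description of the actual queue.

```python
import heapq

def estimate_batch_minutes(audio_hours, minutes_per_audio_hour, workers):
    """Estimate wall-clock minutes for a batch spread over `workers` parallel slots."""
    # Each heap entry is the minute at which a slot becomes free.
    slots = [0.0] * workers
    heapq.heapify(slots)
    # Longest files first: a common greedy heuristic for balancing slots.
    for hours in sorted(audio_hours, reverse=True):
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + hours * minutes_per_audio_hour)
    return max(slots)

batch = [1.0] * 50  # fifty one-hour interviews

# 5 and 10 minutes per audio hour are the speeds quoted above;
# workers=5 is a hypothetical concurrency level.
print(estimate_batch_minutes(batch, 5, workers=5))    # 50.0
print(estimate_batch_minutes(batch, 10, workers=5))   # 100.0
print(estimate_batch_minutes(batch, 10, workers=1))   # 500.0 (one file at a time)
```

With five assumed slots the estimate lands at 50–100 minutes, in the same range as the 60–90 minutes quoted above.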

The math: flat-rate vs per-minute (100-hour project)

The load-bearing question for buyers. We picked 100 hours because it's the realistic size of one research study, one media archive sprint, or one quarter of a marketing agency's podcast workload. Numbers verified June 2026.

Provider	Calculation	100h total	Note
Rev AI	100h × $0.25/min × 60	$1,500	Per-minute AI; human is 8× more
Rev Human	100h × $1.99/min × 60	$11,940	Verbatim certified for legal/broadcast
Otter Business	$30/user/mo, capped 6,000 min/user	$30 (fits 100h)	No Portuguese support
AssemblyAI batch	100h × $0.0025/min × 60 + add-ons	~$15 base, ~$54 loaded	API only, no UI
Trint BulkScribe	Custom enterprise contract from ~$0.20/h	~$20+, contract required	Enterprise sales cycle
VexaScribe Studio★	Flat monthly subscription	$20 / month	UI-first, ZIP delivery, AWS eu-west-2 (London)
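The per-minute rows in the table can be reproduced with a few lines of arithmetic. This is a sketch using only the rates quoted in the table; the dictionary names and the `project_cost` helper are illustrative, and the flat plans are assumed to cover the full 6,000 minutes within their caps.

```python
HOURS = 100
MINUTES = HOURS * 60  # 6,000 minutes

# Rates copied from the table above; confirm on vendor sites before relying on them.
per_minute_rates = {
    "Rev AI": 0.25,
    "Rev Human": 1.99,
    "AssemblyAI batch (base)": 0.0025,
}
# Flat plans: only valid while usage stays within each plan's minute cap.
flat_monthly = {"VexaScribe Studio": 20.00, "Otter Business (one seat)": 30.00}

def project_cost(minutes):
    """Return a {provider: cost in USD} mapping for a project of `minutes`."""
    costs = {name: round(rate * minutes, 2) for name, rate in per_minute_rates.items()}
    costs.update(flat_monthly)
    return costs

for name, cost in sorted(project_cost(MINUTES).items(), key=lambda kv: kv[1]):
    print(f"{name:28s} ${cost:>10,.2f}")
```

Changing `HOURS` shows how quickly the per-minute rows scale while the flat rows stay put, as long as the volume stays under the plan cap.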

Honest read of this table

Flat-rate dominates per-minute pricing once you exceed ~5 hours/month, and dominates seat-based pricing once you exceed one user's needs. For pure-API consumers (developers building a transcription feature into their own product), AssemblyAI's $15 base is genuinely cheaper than VexaScribe Studio's $20 — we don't compete on that segment, and we'll tell you that honestly rather than pretend our UI matters to a developer building a SaaS. For everyone else — agencies, researchers, legal teams, media archives, educators — the managed UI plus flat-rate plus London storage with procurement-ready disclosure is the better deal.

Batch mechanics: queues, parallelism, failure handling

The boring details that matter when you're processing 50 files at once and one of them is corrupted.

● Mixed-format support. MP3, M4A, WAV, FLAC, OGG, OPUS, MP4, MOV, MKV, WEBM, AVI in one batch. Audio is extracted from video automatically, so no pre-conversion is needed.
● Parallel processing. Multiple files transcribe simultaneously, not sequentially. A 50-file batch of one-hour interviews (50 hours of audio) finishes in 60–90 minutes total.
● Per-file status dashboard. Each file shows: queued, uploading, processing, complete, or failed-with-reason. Refresh anytime; nothing is lost if you close the tab.
● Independent failure handling. If 3 of 50 files fail (corrupted header, no audio track, exceeds 10-hour cap), the other 47 finish normally. Each failed file gets one free retry within 24 hours; after that, fix the source and re-upload manually.
● Speaker labels: honest limit. Diarization is consistent within each file (Speaker 1 through Speaker 10) but NOT auto-matched across files. The bulk-rename UI applies consistent labels across all files in a batch in one operation. A typical research study with one interviewer and many participants takes ~30 seconds total to relabel.
● Partial-batch download. Don't wait for stragglers: download a ZIP of completed files anytime. Failed files can finish later and be downloaded separately.
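Most of the failure reasons above can be caught before uploading. The following is a minimal pre-flight sketch, not part of VexaScribe itself: it checks the supported extensions listed above, the 50-file batch limit, and the 5 GB per-file cap stated in the FAQ. The `preflight` helper is hypothetical, the 5 GB cap is assumed to mean binary gigabytes, and the 10-hour duration cap is not checked because that would require probing each file.

```python
from pathlib import Path

SUPPORTED = {".mp3", ".m4a", ".wav", ".flac", ".ogg", ".opus",
             ".mp4", ".mov", ".mkv", ".webm", ".avi"}
MAX_FILES = 50
MAX_BYTES = 5 * 1024**3  # 5 GB per file (binary GB is an assumption)

def preflight(files):
    """files: iterable of (name, size_bytes). Returns (ok_names, problems)."""
    ok, problems = [], {}
    for name, size in files:
        if Path(name).suffix.lower() not in SUPPORTED:
            problems[name] = "unsupported extension"
        elif size > MAX_BYTES:
            problems[name] = "exceeds 5 GB"
        else:
            ok.append(name)
    # Anything past the first 50 acceptable files belongs in a later batch.
    for name in ok[MAX_FILES:]:
        problems[name] = "over the 50-file batch limit"
    return ok[:MAX_FILES], problems

ok, problems = preflight([
    ("interview-01.mp3", 80_000_000),
    ("notes.txt", 1_200),
    ("workshop.MP4", 6 * 1024**3),
])
print(ok)        # ['interview-01.mp3']
print(problems)  # {'notes.txt': 'unsupported extension', 'workshop.MP4': 'exceeds 5 GB'}
```

For a real folder, the input could be built with `(p.name, p.stat().st_size)` for each file in a `Path(...).iterdir()` listing.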

Output formats and the ZIP manifest

Pick one or several formats per batch — every file in the batch gets every format you selected. The ZIP delivery preserves your original filenames, with extension matched to the output (interview.mp3 → interview.docx, interview.srt, etc).

Format	Contains	Typically used by
TXT	Plain text, raw transcript	Quick read, copy-paste, LLM prompt input
DOCX	Word document with speakers, timestamps	Researchers (NVivo/ATLAS.ti import), journalists, legal teams
SRT	Subtitle file with timing	Video creators, YouTube, Premiere, DaVinci, CapCut
VTT	Web subtitle (HTML5 video)	Web players, browser-native captions
PDF	Formatted, print-ready transcript	Client deliverables, legal exhibits, archives
JSON	Structured with word-level timestamps	Developers, search indexing, custom downstream tools

Every ZIP includes a CSV manifest with one row per file: original filename, duration, detected language, number of speakers, processing timestamp, word count. The manifest is the bridge for downstream automation — pipe it into a script that uploads to your CAQDAS tool, CMS, or shared drive without manual mapping.
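As an illustration of what "pipe it into a script" might look like, here is a small sketch that reads a manifest and summarizes it. The column names (`filename`, `duration_seconds`, `language`, `speakers`, `processed_at`, `word_count`) and the sample rows are assumptions made for this example; check the header row of a real manifest before reusing it.

```python
import csv
import io

# Hypothetical manifest contents; real column names and units may differ.
SAMPLE = """filename,duration_seconds,language,speakers,processed_at,word_count
interview-01.mp3,3600,en,2,2026-06-10T12:00:00Z,9200
interview-02.m4a,1800,pt,2,2026-06-10T12:04:00Z,4100
"""

def summarize(manifest_file):
    """Return (total_hours, {language: [filenames]}) from an open manifest."""
    rows = list(csv.DictReader(manifest_file))
    total_hours = sum(float(r["duration_seconds"]) for r in rows) / 3600
    by_language = {}
    for r in rows:
        by_language.setdefault(r["language"], []).append(r["filename"])
    return total_hours, by_language

total_hours, by_language = summarize(io.StringIO(SAMPLE))
print(f"{total_hours:.1f} hours")  # 1.5 hours
print(by_language)                 # {'en': ['interview-01.mp3'], 'pt': ['interview-02.m4a']}
```

For a downloaded batch, replace `io.StringIO(SAMPLE)` with `open("manifest.csv", newline="")` (the filename is also an assumption).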

Compliance and security (the procurement section)

Everything procurement, legal, and IT typically asks about. Depth matters here — buyers need to forward this section to a security review.

Storage residency

AWS eu-west-2 (London, UK). Audio, transcripts, and account data stored under UK-GDPR (post-Brexit EU-GDPR equivalent).

Transcription processing

Whisper Large-v3 runs on specialized ML inference infrastructure. Inference partner is US-domiciled — full name and role disclosed in our DPA. For end-to-end EU-only processing, ask about our dedicated infrastructure option.

LGPD compliance

Lei 13.709/2018. London storage satisfies cross-border requirements for Brazilian clients via EU/UK adequacy posture. DPO contact provided on request.

No AI training

Contractual commitment in our Terms and DPA. Your audio and transcripts never train any model — ours or our inference partner's, per their published Terms of Service.

Encryption

TLS 1.2+ in transit (upload, inference call, download). AES-256 at rest in S3. Per-bucket encryption keys. Audit logs available on Studio + Enterprise plans.

Retention

30-day default post-transcription retention. Auto-delete on transcript download available on Pro and Studio. On-request immediate deletion always honored.

DPA + NDA

Standard GDPR Article 28 DPA on request, typically 1–2 business days. Project-specific NDAs for sensitive batches (legal investigations, media archives) on 2–5 day turnaround.

Sub-processors: AWS (primary storage and application compute, eu-west-2 London) and our ML inference partner (Whisper Large-v3 and pyannote.audio inference). Both are named with their specific role and location in our DPA. We do not use the OpenAI API, AssemblyAI, or Deepgram. Honest disclosure for procurement: because our inference partner is US-domiciled, the transcription step is within the legal reach of the US CLOUD Act even though storage is in London. For adversarial legal contexts — sources under US legal pressure, sealed legal evidence with US-government adversaries — get in touch about our dedicated infrastructure option.

Languages: 99 supported, Tier 1 PT-BR

Whisper Large-v3 supports 99 languages with accuracy that tiers by training data volume. Bulk batches can mix languages — language is auto-detected per file from the first 30 seconds.

Tier 1 (92–95%)

English, Spanish, French, German, Italian, Dutch, Russian, Polish, Portuguese (PT and BR), Japanese, Mandarin, Korean.

Tier 2 (88–92%)

Arabic, Turkish, Hindi, Vietnamese, Thai, Indonesian, Ukrainian, Czech, Hungarian, Romanian, Swedish, Danish, Finnish.

Tier 3 (75–88%)

Swahili, Bengali, Punjabi, Tamil, Telugu, Welsh, and other lower-resource languages. Sample test with your audio recommended before committing a large batch.

Notable PT-BR differentiator: Otter.ai does NOT support Portuguese in 2026 — its official supported languages are English, French, and Spanish only. For Brazilian agencies, Portuguese-language researchers, and LATAM media operations, VexaScribe is the practical choice over Otter for bulk Portuguese workloads. We cover this in depth in our PT-BR transcription guide.

Frequently asked questions

How many files can I upload at once in a bulk batch?

Up to 50 files per batch on every paid plan. The 50-file limit is generous for most workflows: a research lab with 30 hour-long interviews fits in one batch; a podcast agency with weekly client deliverables runs one batch per client. If you need more, run consecutive batches — there's no per-day or per-account cap beyond your plan's monthly minutes. The 50-file ceiling exists because larger batches degrade UI responsiveness; processing 200+ files via API is on the roadmap for developer use cases.

What's the maximum file size and total batch size limit?

Per file: 5 GB or 10 hours, whichever limit is reached first. Per batch: 50 files. There is no hard total-batch-size limit beyond the per-file cap × 50, so a theoretical maximum batch is 50 files × 5 GB = 250 GB, though we recommend keeping batches under ~25 GB in practice for upload reliability over typical office internet. Long files (3–10 hours) work fine; they are common with full-day depositions, half-day workshops, or oral-history projects. For files over 10 hours, split before upload using a free tool like ffmpeg or Audacity.
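For the ffmpeg route, one possible approach is ffmpeg's segment muxer, which cuts a file into fixed-length parts without re-encoding. The sketch below only builds the command; the 9-hour segment length, the `_partNNN` naming, and the helper names are choices made for this example, not VexaScribe requirements. With `-c copy`, video is cut at keyframes, so part lengths are approximate.

```python
import subprocess
from pathlib import Path

def ffmpeg_split_command(source, segment_hours=9.0):
    """Build an ffmpeg command that splits `source` into parts under the 10-hour cap."""
    src = Path(source)
    # %03d is filled in by ffmpeg: name_part000.ext, name_part001.ext, ...
    pattern = src.with_name(f"{src.stem}_part%03d{src.suffix}")
    return [
        "ffmpeg", "-i", str(src),
        "-f", "segment",                                   # segment muxer
        "-segment_time", str(int(segment_hours * 3600)),   # part length in seconds
        "-c", "copy",                                      # no re-encoding
        "-reset_timestamps", "1",                          # each part starts at 0
        str(pattern),
    ]

def split(source):
    """Run the split; requires ffmpeg to be installed and on PATH."""
    subprocess.run(ffmpeg_split_command(source), check=True)

print(" ".join(ffmpeg_split_command("deposition-day1.wav")))
```

Timestamps in each part's transcript will restart at zero, so keep a note of the part order if you need to map them back to the original recording.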

Do you support mixed formats (MP3, M4A, MP4, WAV) in one batch?

Yes. Mix any combination of MP3, M4A, WAV, FLAC, OGG, OPUS (audio) plus MP4, MOV, MKV, WEBM, AVI (video) in a single batch. Audio is extracted from video automatically, so there is no need to pre-convert. Each file is detected, processed, and transcribed independently; the full ZIP is delivered once every file has finished, and a partial ZIP of completed files can be downloaded at any time. Mixed-format support matters for real workflows: an agency receives MP3s from one client, M4As from iPhones, MP4s from Zoom recordings, and shouldn't have to pre-process everything just to transcribe.

What happens if one file fails mid-batch — do I lose the others?

No. Each file is processed independently. If 3 of 50 files fail (corrupted audio, unsupported codec, exceeds duration cap), the other 47 finish and you can download a partial ZIP containing the successful files. Failed files appear in the dashboard with the specific error reason — corrupted header, no audio track detected, file exceeds 10 hours, etc. You can retry failed files individually (free retry within 24 hours) or fix the source and re-upload. The batch never silently swallows failures, and successful work is never blocked by a single bad file.

Can I download the whole batch as a ZIP with original filenames?

Yes. The ZIP preserves your original filenames — “interview-maria-2026-06-10.mp3” becomes “interview-maria-2026-06-10.docx” (or .srt, .txt, etc., depending on the format you selected). The ZIP also includes a CSV manifest with per-file metadata: original filename, duration, detected language, number of speakers, processing timestamp, word count. The manifest makes downstream automation easy — pipe it into a script that uploads to your CAQDAS tool (NVivo, ATLAS.ti), CMS, or shared drive without manual mapping. Multiple output formats per file are supported in the same batch (one ZIP with both .docx and .srt for every file).

Are speaker labels consistent across files in the same batch?

Speaker labels are consistent WITHIN each file (Speaker 1, Speaker 2... up to 10) but NOT auto-matched ACROSS files. This is an honest technical limitation of diarization in 2026: cross-file speaker identification requires voice-print enrollment, which adds complexity and privacy implications we've chosen not to ship. Workaround: use the bulk-rename UI in the dashboard to apply consistent labels (Speaker 1 → “Interviewer”, Speaker 2 → “Participant”) across all files in a batch in one operation. For research studies where the same interviewer appears in all 50 files, this takes ~30 seconds total. We're transparent about this because over-promising cross-file matching is a common industry trap.
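If you prefer to relabel downloaded TXT transcripts yourself rather than use the bulk-rename UI, the same idea can be sketched in a few lines. This assumes speaker turns appear in the text as "Speaker N", which is an assumption about the output format; the `relabel` helper is hypothetical.

```python
import re

# Matches a whole label such as "Speaker 1" or "Speaker 10".
SPEAKER = re.compile(r"\bSpeaker (\d+)\b")

def relabel(text, mapping):
    """Replace generic speaker labels using `mapping`; unmapped labels are left as-is."""
    def swap(match):
        label = match.group(0)
        return mapping.get(label, label)
    return SPEAKER.sub(swap, text)

transcript = "Speaker 1: How did that go?\nSpeaker 2: Fine.\nSpeaker 10: Agreed."
mapping = {"Speaker 1": "Interviewer", "Speaker 2": "Participant"}
print(relabel(transcript, mapping))
# Interviewer: How did that go?
# Participant: Fine.
# Speaker 10: Agreed.
```

Matching the full number before looking it up keeps "Speaker 1" from being replaced inside "Speaker 10". Because labels are not matched across files, the mapping has to be checked per file: Speaker 1 is not guaranteed to be the interviewer in every recording.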

Do you train your AI models on my files?

No. We contractually commit to never training models on user audio or transcripts — verifiable in our Terms and DPA. We use OpenAI's Whisper Large-v3 (open-source, MIT license) for transcription and pyannote.audio for diarization. Inference runs on specialized ML compute infrastructure (our inference partner is disclosed by name in the DPA). Our inference partner is contractually committed to not training models on inference data per their published Terms of Service. This is a deliberate differentiator vs. providers like Otter.ai, which trains on user audio by default with manual opt-out.

Where is my audio stored and processed?

Storage: AWS eu-west-2 (London, UK). All audio, transcripts, and account data live in London under UK-GDPR (the post-Brexit equivalent to EU-GDPR), with AES-256 encryption at rest and TLS 1.2+ in transit. Processing: during transcription, audio is sent from our London infrastructure to an ML inference partner that runs Whisper Large-v3 on specialized GPU infrastructure, then results return to London for storage and delivery. Honest disclosure: our inference partner is US-domiciled, so the transcription step is within reach of the US CLOUD Act even though all storage is in London. For workloads where end-to-end EU residency is non-negotiable — adversarial legal contexts, sensitive journalistic sources under US legal pressure — get in touch about our dedicated infrastructure option. We retain audio for 30 days post-transcription by default; auto-delete on transcript download is available on Pro and Studio plans, and on-request immediate deletion is always supported.

Is there a bulk API for S3 / webhook workflows?

Not yet. As of July 2026, bulk transcription is UI-first — drag-and-drop folder upload, dashboard status, ZIP download. Programmatic API access (POST a list of S3 URLs, receive webhooks per file completion) is on the roadmap but not shipped. If your workflow strictly requires API/webhook integration with S3 sync, we honestly recommend AssemblyAI ($0.15/hr Universal with diarization included), Deepgram Nova-3 ($0.46/hr + $0.12/hr diarization = $0.58/hr), or Azure Speech Fast Transcription batch ($0.18/hr with diarization free) — all have mature batch APIs. For everyone else (agencies, researchers, legal teams, podcasters), the UI workflow is faster than wrangling API code: drag a folder, walk away, return to a ZIP. We'll announce API access here when it ships.

How does flat-rate pricing compare to Rev, Otter, AssemblyAI, or Deepgram for 100 hours?

For 100 hours (6,000 minutes) in a single month, verified June 2026: Rev AI at $0.25/min = $1,500. Rev Human at $1.99/min = $11,940. Otter Business at $30/user/month includes 6,000 min/user, so 100 hours just fits on one seat for $30; Otter Pro was cut from 6,000 to 1,200 min/mo in 2025 without a price drop, so Pro no longer covers this volume. AssemblyAI Universal at $0.15/hour with diarization included = $15 (genuinely cheaper if you only need API access and can wire up your own workflow). Deepgram Nova-3 at $0.46/hour base + $0.12/hour diarization = $58 total for 100 hours. Azure Speech Fast Transcription (batch) at $0.18/hour with diarization free = $18. Trint BulkScribe starts around $0.20/hour but requires enterprise contract negotiation. VexaScribe Studio is $20/month flat for 6,000 minutes = 100 hours, with a UI-first workflow (drag folder → walk away → download ZIP) instead of API code. The pattern: flat-rate dominates per-minute once you exceed ~5 hours/month, and dominates seat-based once you exceed one user's needs. For pure API consumers wiring their own S3 + webhook pipeline, AssemblyAI or Azure batch may still win on raw cost; we don't compete on that segment.

Is bulk transcription supported in Portuguese, Spanish, and other languages?

Yes. All 99 Whisper Large-v3 languages are supported in bulk. Tier 1 languages (92–95% accuracy on clean audio): English, Spanish, French, German, Italian, Portuguese (PT and BR), Dutch, Russian, Polish, Japanese, Mandarin, Korean. Tier 2 (88–92%): Arabic, Turkish, Hindi, Vietnamese, Thai, Indonesian, Ukrainian, Czech, Hungarian, Romanian, Swedish, Danish, Finnish. Tier 3 (75–88%): Swahili, Bengali, Punjabi, Tamil, Telugu, Welsh, and other lower-resource languages. A single batch can contain mixed languages; language is auto-detected per file from the first 30 seconds. Notable: Otter.ai does NOT support Portuguese in 2026 (English/French/Spanish only), making VexaScribe a practical choice for Brazilian agencies, Portuguese researchers, and LATAM media operations.

Can I bulk-transcribe an entire podcast catalog?

Yes. This is one of the most common bulk use cases in 2026. Typical workflow: export MP3s from your host (Buzzsprout, Transistor, Libsyn, Spotify for Podcasters all support bulk download), drop up to 50 episodes into a single batch, walk away, come back to a ZIP of .docx + .srt for every episode plus a CSV manifest with episode duration, speaker count, and word count. For catalogs over 50 episodes, run sequential batches; a 200-episode back catalog is typically 4 batches over an afternoon, subject to your plan's monthly minutes. Format tip: request .srt output alongside .docx if you plan to add captions to YouTube uploads or generate audiograms; the CSV manifest makes it trivial to pipe transcripts into a search index or a Notion/Airtable database for episode search. Cost comparison for a 100-episode catalog at ~45 min/episode (75 hours, or 4,500 minutes, total): Rev AI $1,125, Otter Business $30 for one seat (4,500 minutes fits within the 6,000 min/user cap), VexaScribe Studio $20 flat for the month. The pattern: flat-rate wins hard against per-minute pricing the moment you have a real back catalog rather than a single episode.
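A quick way to plan a catalog run is to compute the number of 50-file batches and check the total against the plan's monthly minutes. The sketch below uses the 50-file batch limit and the 6,000-minute Studio allowance quoted on this page; the `catalog_plan` helper and the 45-minute average are illustrative.

```python
import math

def catalog_plan(episodes, avg_minutes, batch_size=50, monthly_minutes=6000):
    """Return batch count, total audio hours, and whether one month's minutes cover it."""
    total_minutes = episodes * avg_minutes
    return {
        "batches": math.ceil(episodes / batch_size),
        "total_hours": total_minutes / 60,
        "fits_monthly_minutes": total_minutes <= monthly_minutes,
    }

print(catalog_plan(100, 45))
# {'batches': 2, 'total_hours': 75.0, 'fits_monthly_minutes': True}
print(catalog_plan(200, 45))
# {'batches': 4, 'total_hours': 150.0, 'fits_monthly_minutes': False}
```

The second call shows why a larger back catalog may need to be spread over more than one billing month even though the batches themselves run quickly.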

Can I get a DPA or sign an NDA for a legal or research batch?

Yes. We provide a standard Data Processing Agreement (DPA) on request to any paid account — typical for research labs needing IRB documentation, legal firms with privileged client data, and corporate buyers requiring procurement review. NDAs for specific projects (large media archives, sensitive corporate audio, legal investigations) are signed on a case-by-case basis. Email legal@vexascribe.com with your batch details (estimated volume, sensitivity level, retention requirements). The standard DPA covers GDPR Article 28 processor obligations, full sub-processor disclosure (AWS for storage and our ML inference partner — both named, with their roles and locations), London storage residency, and the no-AI-training clause. Turnaround is typically 1-2 business days for DPA, 2-5 days for custom NDA.

Methodology and sources

● Pricing verified June 2026 on vendor sites: Rev.com, Otter.ai, AssemblyAI, Deepgram, Trint, OpenAI Whisper API. Pricing changes — always confirm on the source.
● Whisper Large-v3: OpenAI, November 2023. MIT license. Paper: Radford et al. “Robust Speech Recognition via Large-Scale Weak Supervision” (2022).
● Diarization: pyannote.audio 3.1 (MIT license). Bredin et al., Université du Mans / IRIT.
● GDPR: Regulation (EU) 2016/679. UK-GDPR: Data Protection Act 2018 + UK-GDPR (post-Brexit).
● LGPD: Lei 13.709/2018 (Brazil). ANPD guidance 2024-2025 on cross-border transfers.
● US CLOUD Act: 18 U.S.C. § 2713 (2018).
● Otter.ai PT-BR support: verified on otter.ai/languages June 2026 — English, French, Spanish only.

