Updated June 23, 2026

Transcription Timestamps: Word-Level, Sentence-Level, and How to Add or Remove Them

By VexaScribe Editorial · Published June 23, 2026

Transcription timestamps are markers that tie text to specific moments in the original audio. They're free (every modern transcription tool includes them), but the granularity, format, and export options vary widely. This page is a reference for both directions of the problem: how to get timestamps into a transcript and how to remove them when you want clean prose. It covers the word-level vs sentence-level distinction (the most important one), the differing syntax of SRT, VTT, JSON, and TXT bracket notation, and the precision that eight professional use cases need. Updated June 2026, with current tool-by-tool support.

Key takeaways

  • Word-level vs sentence-level is the distinction that matters. Word-level: one timestamp per word (~10-50ms precision). Sentence-level: one per sentence (~3-7 seconds). Phrase-level: ~3-5 word chunks. Most consumer tools default to sentence-level; Whisper-based tools and developer speech APIs (VexaScribe, OpenAI Whisper API, AssemblyAI, Deepgram) give word-level.
  • Format syntax differs by file type. SRT uses commas (00:00:01,500), VTT uses periods (00:00:01.500), JSON uses fractional seconds, TXT uses bracket notation. The SRT/VTT decimal separator difference catches developers off guard regularly.
  • Timestamps cost nothing extra in 2026. Every modern transcription tool includes them. The difference is granularity (word vs sentence) and which export formats they appear in (some tools include timestamps in SRT but not in TXT).
  • To remove timestamps, re-export without them. Most tools have a TXT or DOCX export option that omits timestamps. This is cleaner than find-replace in Word, which works but breaks on edge cases (commas in dialogue, bracket characters in text).
  • Word-level matters for subtitles. Cue boundaries on real word starts and ends sync precisely with speech. Interpolated sentence-level timing causes visible subtitle drift. See our SRT cue splitter for how this is applied in practice.
  • You generally can't add timestamps to text alone. Timestamps come from aligning text to audio during transcription. If you only have text, the practical answer is re-transcribe the audio with timestamps enabled.

What is a timestamp in transcription?

A timestamp is a marker that ties a piece of transcribed text to a specific moment in the original audio. It typically takes the form of a time offset from the start of the recording — 00:01:23,450 means “1 minute, 23 seconds, 450 milliseconds from the start.” In a transcript file, timestamps appear in different places depending on the format: at the beginning of each subtitle cue (SRT, VTT), as numeric properties on each word (JSON), or in brackets before a paragraph (TXT).

Disambiguation. “Timestamp” in transcription is different from a few related concepts. A timecode in video editing usually refers to SMPTE timecode (HH:MM:SS:FF, frames-based) used for video synchronization. An offset is the raw seconds-from-start number (e.g., 83.45) used internally by transcription engines. A cue is the subtitle/caption unit that contains a timestamp range (start and end), not a single moment. For transcription work, “timestamp” usually means “an HH:MM:SS,mmm formatted marker tying text to audio time.”
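
To make the offset-versus-timestamp relationship concrete, here is a minimal Python sketch that renders a raw seconds-from-start offset as an HH:MM:SS,mmm marker. The function name format_timestamp and its decimal_sep parameter are illustrative assumptions, not part of any tool's API.

```python
def format_timestamp(seconds: float, decimal_sep: str = ",") -> str:
    """Render a seconds-from-start offset as HH:MM:SS<sep>mmm.

    Hypothetical helper for illustration; decimal_sep is "," for
    SRT-style output and "." for WebVTT-style output.
    """
    # Work in integer milliseconds so float error cannot leak into the digits.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_sep}{ms:03d}"


print(format_timestamp(83.45))       # SRT-style marker for the 83.45 s offset
print(format_timestamp(83.45, "."))  # WebVTT-style marker for the same offset
```

Converting to integer milliseconds before splitting into fields is a deliberate choice: formatting the float directly can round the seconds field up to 60 near a minute boundary.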

Modern transcription engines produce timestamps automatically as a byproduct of the speech recognition process — the model aligns each detected word against the audio waveform as part of generating the text. There's no separate “timestamping” step in 2026; if you have a transcript from a current tool, the timing data exists, even if the export you're looking at doesn't show it.

Word-level vs sentence-level vs phrase-level

The most important distinction on this page. Granularity is how often a timestamp appears in the transcript — once per word, once per sentence, once per phrase, or once per paragraph. The right level depends on what you're going to do with the transcript.

Granularity | Example | Precision | Used by
Word-level | { "hello": [1.50, 1.89], "world": [1.91, 2.34] } | ~10-50ms per word | VexaScribe, OpenAI Whisper API, AssemblyAI, Deepgram, self-hosted Whisper
Sentence-level | [00:01.50] Hello, world. How are you? | One timestamp per sentence (~3-7 seconds) | Otter, Trint, Sonix, Rev (default exports)
Phrase-level | [00:01.50 - 00:04.20] Hello, world. How are you? | One timestamp per 3-5 word chunk | YouTube auto-captions, some legacy services
Paragraph-level | [00:01.50] Hello, world. How are you? I'm doing great... | One timestamp per paragraph (~15-60 seconds) | Manual transcriptions, podcast show notes

When word-level is worth it. Three cases: (1) subtitle generation, where cue boundaries need to land on real word starts and ends to sync with speech; (2) interactive transcripts (word-by-word highlighting as audio plays, jump-to-word UX); (3) video editing where you're cutting on specific words. For these, sentence-level timing interpolated across a sentence is noticeably wrong.

When sentence-level is fine. Reading the transcript as prose. Qualitative research coding in NVivo/ATLAS.ti/MAXQDA. Show notes with chapter markers at structural breakpoints. Meeting summaries. Most note-taking workflows. Adding word-level data here adds noise without analytical benefit.

The honest catch. Most consumer transcription tools default to sentence-level in their UI even when the engine produces word-level timing data. Otter, Sonix, Trint, and Rev all use Whisper-class engines (or proprietary equivalents) that produce word-level timing internally, but the default exports show sentence-level. If you need word-level, check your tool's export options — the data usually exists; it's a checkbox away.

Timestamp format syntax — SRT, VTT, JSON, TXT

Reference table for each format. The decimal separator difference between SRT and VTT is the most common cause of conversion bugs.

Format | Syntax | Decimal | Granularity | Use for
SRT | 00:00:01,500 --> 00:00:04,200 | Comma | Sentence or phrase | Universal subtitles (YouTube, video editors)
WebVTT | 00:00:01.500 --> 00:00:04.200 | Period | Sentence or phrase | HTML5 video, styled web captions
JSON | { "word": "hello", "start": 1.5, "end": 1.89 } | Period | Word-level (typical) | Developer pipelines, search indexing
TXT (bracket) | [00:01.50] Hello, welcome to the... | Period or comma | Sentence or paragraph | Human-readable transcripts, notes
SCC (broadcast) | 01:00:00;00 9420 9420 ... | Semicolon (drop-frame) | Caption-level | US broadcast TV (CEA-608)

SRT format example

1
00:00:01,500 --> 00:00:04,200
Hello, welcome to the show. Today we're going

2
00:00:04,201 --> 00:00:07,850
to talk about something genuinely useful.

SRT (SubRip) was developed in the late 1990s. Each cue has a sequential number, a timestamp range, and one or more lines of text. The comma decimal separator is the historical default. SRT is the universal subtitle format — every video editor, YouTube, and standalone player accepts it.

VTT (WebVTT) format example

WEBVTT

00:00:01.500 --> 00:00:04.200
Hello, welcome to the show. Today we're going

00:00:04.201 --> 00:00:07.850
to talk about something genuinely useful.

WebVTT is the W3C standard for HTML5 video. Notice the WEBVTT header line, no sequential cue number (optional in VTT), and the period decimal separator. VTT also supports cue styling (positioning, fonts, voice tags) that SRT doesn't. For HTML5 <track> elements, VTT is required.

JSON format example (word-level)

{
  "text": "Hello, welcome to the show.",
  "words": [
    { "word": "Hello,", "start": 1.500, "end": 1.890 },
    { "word": "welcome", "start": 1.920, "end": 2.380 },
    { "word": "to", "start": 2.410, "end": 2.530 },
    { "word": "the", "start": 2.560, "end": 2.680 },
    { "word": "show.", "start": 2.710, "end": 3.150 }
  ]
}

JSON output preserves the most data — typically word-level timestamps as fractional seconds. This is the format developers use for search indexing, interactive transcripts, clip extraction, and word-by-word highlighting. Each transcription tool's JSON schema differs slightly (Whisper, AssemblyAI, Deepgram all have their own structures), but the pattern is consistent: a words array with start and end times.
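
As an illustration of what the words array makes possible, the following Python sketch looks up which word is being spoken at a given playback time, the core of word-by-word highlighting. It assumes the generic schema shown above (word, start, end, sorted by time); real vendor schemas differ, and word_at is a hypothetical helper name.

```python
import bisect
import json

# Same shape as the example above; real tools use their own field names.
payload = json.loads("""
{
  "text": "Hello, welcome to the show.",
  "words": [
    {"word": "Hello,",  "start": 1.500, "end": 1.890},
    {"word": "welcome", "start": 1.920, "end": 2.380},
    {"word": "to",      "start": 2.410, "end": 2.530},
    {"word": "the",     "start": 2.560, "end": 2.680},
    {"word": "show.",   "start": 2.710, "end": 3.150}
  ]
}
""")

words = payload["words"]
starts = [w["start"] for w in words]  # assumed to be sorted by start time


def word_at(t: float):
    """Return the word spoken at time t (seconds), or None during a gap."""
    i = bisect.bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t <= words[i]["end"]:
        return words[i]["word"]
    return None


print(word_at(2.0))  # falls inside "welcome"
print(word_at(1.9))  # falls in the gap between "Hello," and "welcome"
```

A binary search over start times keeps the lookup fast enough to run on every playback tick, even for transcripts with tens of thousands of words.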

TXT bracket notation

[00:01.50] Hello, welcome to the show. Today we're going to
talk about something genuinely useful.

[00:07.85] Let's get started with the first topic.

Plain text with bracket-style timestamps is the most human-readable format. Typically used for note-taking, podcast show notes, and qualitative research transcripts. There's no formal standard — different tools use slightly different bracket conventions ([HH:MM:SS], [HH:MM:SS.mmm], [HH:MM]) — but the format is robust enough to import into Microsoft Word, Notion, or research software with minimal cleanup.
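
Bracket notation is simple enough to generate from word-level data. The Python sketch below groups words into sentences at terminal punctuation and prefixes each with a [MM:SS.cc] stamp, matching the example above. Because there is no formal standard, the bracket layout and the sentence-splitting rule are assumptions, and both function names are hypothetical.

```python
def bracket_stamp(seconds: float) -> str:
    """Format an offset as [MM:SS.cc] (centisecond precision)."""
    cs = round(seconds * 100)         # integer centiseconds
    minutes, cs = divmod(cs, 6000)
    secs, cs = divmod(cs, 100)
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


def to_bracket_text(words) -> str:
    """Group word-level entries into sentences, one stamped line each.

    Assumes each entry has "word" and "start" keys and that a sentence
    ends at a word whose text ends in '.', '?' or '!'.
    """
    lines, current = [], []
    for w in words:
        current.append(w)
        if w["word"].endswith((".", "?", "!")):
            lines.append(current)
            current = []
    if current:                       # trailing words without end punctuation
        lines.append(current)
    return "\n".join(
        f"{bracket_stamp(s[0]['start'])} " + " ".join(w["word"] for w in s)
        for s in lines
    )


sample_words = [
    {"word": "Hello,", "start": 1.50, "end": 1.89},
    {"word": "welcome", "start": 1.92, "end": 2.38},
    {"word": "to", "start": 2.41, "end": 2.53},
    {"word": "the", "start": 2.56, "end": 2.68},
    {"word": "show.", "start": 2.71, "end": 3.15},
    {"word": "Let's", "start": 7.85, "end": 8.10},
    {"word": "go.", "start": 8.15, "end": 8.40},
]
print(to_bracket_text(sample_words))
```

Splitting on punctuation is a rough heuristic: abbreviations such as "Dr." will end a sentence early, so a production splitter would need something smarter.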

Why SRT and VTT use different decimal separators

Historical and regional. SRT was developed by a French author in the late 1990s, and the comma is the decimal separator in French and most European number conventions. WebVTT was standardized by the W3C between 2010 and 2014 and chose the period to align with JavaScript, JSON, and broader web conventions. Both are valid in their respective ecosystems, but converting between them requires replacing the comma with a period in every timestamp line (and vice versa), and SRT-to-VTT conversion also needs the WEBVTT header added. Tools that export both formats handle this automatically; a blanket find-replace in a text editor is fragile because commas in dialogue (“Hello, world”) will be caught and broken.
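
The safe way to convert is to touch the separator only on lines that match the timing pattern. Here is a minimal Python sketch of an SRT-to-VTT conversion along those lines; srt_to_vtt is a hypothetical helper, and it assumes well-formed HH:MM:SS,mmm timing lines.

```python
import re

# Matches an SRT timing line; group 5 keeps anything trailing the end time.
TIMING_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT, changing commas only on timing lines."""
    out = ["WEBVTT", ""]              # required header plus a blank line
    for line in srt_text.splitlines():
        m = TIMING_LINE.match(line.strip())
        if m:
            line = f"{m.group(1)}.{m.group(2)} --> {m.group(3)}.{m.group(4)}{m.group(5)}"
        out.append(line)              # dialogue and cue numbers pass through untouched
    return "\n".join(out) + "\n"


sample_srt = """1
00:00:01,500 --> 00:00:04,200
Hello, welcome to the show. Today we're going

2
00:00:04,201 --> 00:00:07,850
to talk about something genuinely useful.
"""
print(srt_to_vtt(sample_srt))
```

The SRT cue numbers are left in place because WebVTT treats a line before the timing line as an optional cue identifier. Commas in the dialogue survive because only lines matching the full timing pattern are rewritten.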

How to add timestamps to a transcript

In modern transcription tools, timestamps are generated automatically as part of the transcription — you don't add them separately. The question is usually whether they appear in your specific export. Here's the practical workflow:

  1. Upload your audio or video to a transcription tool (VexaScribe, Otter, Sonix, Rev, etc.) and wait for processing.
  2. When exporting, check the timestamp options. Most tools default to including timestamps in SRT and VTT exports automatically. For TXT and DOCX exports, the “include timestamps” checkbox is often unchecked by default — toggle it on.
  3. Choose the right granularity if your tool supports it. Word-level is best for subtitles and developer use; sentence-level for prose reading and research coding.
  4. For JSON output (developer use): export to JSON; word-level timing is standard.

What if you only have the text (no audio)? You generally can't add real timestamps to text alone — timing data comes from aligning the text against the audio waveform during transcription, and that step requires the audio file. Two workarounds:

  • Re-transcribe the audio with timestamps enabled. This is the practical answer for most workflows. If the original audio is available, run it through your transcription tool again with timestamp output configured. You can then use the new transcript's timestamps with your edited text.
  • Forced alignment — upload existing text plus the audio, get word-by-word timing matched. Uncommon in consumer tools but available in some research-focused platforms (Montreal Forced Aligner, FAVE, some academic Whisper wrappers). Setup is non-trivial; the re-transcribe path is usually faster.

How to remove timestamps from a transcript

This is the reverse problem. You have a transcript with bracket-style timestamps or SRT timing lines, and you want clean prose for publishing, blog posts, or quote extraction. Three methods, ranked from cleanest to most fragile:

Method 1: Re-export without timestamps (cleanest)

Open the original transcription in your tool's editor (VexaScribe, Otter, Sonix, Rev, Trint) and export again — this time as TXT or DOCX with the “include timestamps” option unchecked. Most tools produce clean prose paragraphs in this mode, with paragraph breaks at speaker turns. No regex required. This is the right path if you have access to the original transcription.

Method 2: Find & replace in Microsoft Word with wildcards

Use this when you only have the file (not access to the original transcription tool). Open the document in Word, open Find and Replace (Ctrl + H on Windows), click More, and check Use wildcards. Common patterns:

  • Bracket-style: search \[[0-9]@:[0-9]@\] or \[[0-9:.,]@\], replace with nothing
  • SRT timing line: search [0-9]@:[0-9]@:[0-9]@,[0-9]@ --\> [0-9]@:[0-9]@:[0-9]@,[0-9]@, replace with nothing
  • Cue numbers (SRT): search ^13[0-9]@^13, replace with ^p

Word's wildcards use @ for “one or more” (not +) and are not full PCRE regular expressions. Two quirks matter here: > is a special character (end of word) and has to be escaped as \>, and with wildcards on, the Find box takes ^13 rather than ^p for a paragraph mark. The patterns above work for most timestamp formats but break on edge cases, so verify on a copy first.

Method 3: Command line with sed (TXT files at scale)

For developers or anyone comfortable with a terminal. On macOS or Linux:

# Remove bracket-style timestamps
sed -E 's/\[[0-9:.,]+\]//g' transcript.txt > clean.txt

# Remove SRT timing lines and cue numbers
sed -E '/^[0-9]+$/d; /^[0-9:.,]+ --> [0-9:.,]+$/d' subtitles.srt > prose.txt

The sed approach is the most reliable for batch processing. Test on a copy first; sed's in-place editing (-i) is platform-dependent and can lose data if interrupted.
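
Where sed is unavailable, such as on Windows without a Unix shell, the same cleanup can be sketched in Python. This is an illustrative equivalent rather than a drop-in replacement: strip_timestamps is a hypothetical name, and the bracket pattern will also remove any other bracketed run of digits and punctuation, such as a "[12]" footnote marker.

```python
import re

BRACKET_STAMP = re.compile(r"\[[0-9:.,]+\]\s*")          # e.g. "[00:01.50] "
SRT_TIMING = re.compile(r"^[0-9:.,]+ --> [0-9:.,]+")     # e.g. "00:00:01,500 --> ..."


def strip_timestamps(text: str) -> str:
    """Remove bracket stamps, SRT timing lines, and SRT cue numbers."""
    lines = text.splitlines()          # also normalises CRLF line endings
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if SRT_TIMING.match(stripped):
            continue
        # Treat a digits-only line as a cue number only when a timing line
        # follows it, so numeric dialogue such as "42" is preserved.
        next_is_timing = i + 1 < len(lines) and SRT_TIMING.match(lines[i + 1].strip())
        if stripped.isdigit() and next_is_timing:
            continue
        kept.append(BRACKET_STAMP.sub("", line))
    return "\n".join(kept)


print(strip_timestamps("[00:01.50] Hello, welcome to the show."))
print(strip_timestamps("1\n00:00:01,500 --> 00:00:04,200\nHello, welcome to the show."))
```

Checking the following line before dropping a cue number is the main difference from the sed one-liner, which deletes every digits-only line, including numeric dialogue.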

When you want timestamps removed

Publishing a transcript as a blog post or article. Extracting quotes for citation. Importing into a CMS or word processor where the timestamps are noise. Sharing a clean transcript with a non-technical collaborator. Note: if you might want timestamps back later, keep the original timestamped version. Removing timestamps is lossy.

Use cases by profession

Different audiences need different timestamp granularity. The pattern below is what we see in practice from the workflows VexaScribe users describe.

Journalists

Granularity: Word-level preferred

Why: Click any timestamp to jump to the exact word and verify before quoting. Sentence-level gets you within a few seconds of the quote; word-level gets you exact.

Video editors

Granularity: Word-level required

Why: Subtitle generation needs cue boundaries on real word starts and ends. Sentence-level timing interpolated across a long sentence causes visible subtitle drift relative to speech.

Qualitative researchers (NVivo, ATLAS.ti, MAXQDA)

Granularity: Sentence-level sufficient

Why: Coding happens at the segment level, not the word level. Word-level adds noise without analytical benefit. Most CAQDAS tools import sentence-level cleanly.

Podcasters

Granularity: Sentence or paragraph

Why: Show notes with chapter markers use timestamps at structural breakpoints (00:05:30 "Topic 2 starts here"), not at every word. Sentence-level export to TXT with bracket timestamps is the typical workflow.

Lawyers (discovery review)

Granularity: Word-level preferred

Why: Quote-checking a witness statement or recorded call against the original audio requires word-accurate timing. AI transcripts are drafts, not court records — but word-level timing makes the draft review-able.

Developers building products

Granularity: Word-level (JSON)

Why: Search indexing, clip extraction, interactive transcripts, and word-by-word highlighting all require word-level timing data in JSON. Sentence-level is too coarse for product UX.

Translators and subtitle creators

Granularity: Word-level for cue-splitting, sentence-level for translation work

Why: Translation happens at the sentence level (you translate ideas, not words). But the cue boundaries downstream must be word-accurate for subtitle sync.

Accessibility (WCAG compliance)

Granularity: Sentence-level minimum

Why: WCAG 1.2.2 requires synchronized captions for prerecorded video. Sentence-level timing typically satisfies the standard; word-level is best practice for read-aloud-style highlighting.

Why word-level timestamps matter for subtitles

Subtitle generation is the use case where word-level vs sentence-level granularity has the most visible impact. Here's why.

The drift problem. When a transcription tool only has sentence-level timing (one timestamp per sentence), it must interpolate across the sentence to figure out where to break subtitle cues. The interpolation is linear: if a 6-second sentence is split into two cues at the comma, the first cue gets ~3 seconds and the second cue gets ~3 seconds, regardless of where the comma actually falls in the speech. If the speaker said the first half quickly and paused before the second half, the subtitle for the first half stays on screen too long (~3 seconds vs the actual ~1.5 seconds of speech) and the second half appears late. Viewers see subtitles drift visibly relative to speech.

The word-level fix. When each word has its own start and end time, cue boundaries can land on real word starts and ends. If the comma falls after “world,” at 2.4 seconds, the first cue ends at 2.4 seconds — not at the interpolated halfway point of 3 seconds. The subtitle appears and disappears in sync with what's being said.
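
The difference can be shown with a few lines of Python. The sketch below splits one 6-second sentence into two cues, first by interpolating the boundary from character counts (one plausible form of linear interpolation, assumed here for illustration) and then by reading the boundary from the word timings. The timings are invented sample data.

```python
# Invented sample: a quick first half, a pause, then the second half.
words = [
    {"word": "Hello",  "start": 0.0, "end": 1.0},
    {"word": "world,", "start": 1.2, "end": 2.4},
    {"word": "how",    "start": 4.2, "end": 4.8},
    {"word": "are",    "start": 4.9, "end": 5.3},
    {"word": "you?",   "start": 5.4, "end": 6.0},
]
split_after = 1                        # break the cue after "world,"

first, second = words[: split_after + 1], words[split_after + 1 :]
first_text = " ".join(w["word"] for w in first)
second_text = " ".join(w["word"] for w in second)

# Sentence-level only: the tool knows just the sentence start and end,
# so it places the boundary in proportion to the text length.
sentence_start, sentence_end = words[0]["start"], words[-1]["end"]
share = len(first_text) / (len(first_text) + len(second_text))
interpolated_boundary = sentence_start + (sentence_end - sentence_start) * share

# Word-level: the boundary comes straight from the real word timings.
first_cue_end = first[-1]["end"]
second_cue_start = second[0]["start"]

print(f"interpolated boundary: {interpolated_boundary:.1f}s")
print(f"word-level: first cue ends {first_cue_end:.1f}s, second starts {second_cue_start:.1f}s")
```

With these numbers the interpolated boundary lands at 3.0 seconds, so the first cue lingers 0.6 seconds after the speaker has stopped and the second cue appears more than a second before its first word.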

Dramatic pauses. Word-level timing also handles dramatic pauses correctly. In a motivational speech with 2-3 seconds of silence between sentences, sentence-level interpolation produces a sub-second subtitle flash followed by a blank screen. Word-level lets a cue stay on screen across the silence (the cue ends at the real end of the last word and the next cue starts at the real start of the next word — closer to professional captioner behavior).

See our SRT cue splitter for the full algorithm — 80-char/5-second target, sentence-end preference, dramatic pause handling, trailing merge — and the before/after examples on real audio.

Which transcription tools give which granularity

Honest snapshot of where the major tools stand as of June 2026. Verify with each vendor before committing — defaults change.

Tool | Default granularity | Export formats | Note
VexaScribe | Word-level | TXT, DOCX, SRT, VTT, JSON | Word-level timing from Whisper Large-v3 used for SRT cue splitting
OpenAI Whisper API | Word-level | JSON (with timestamp_granularities=word) | Same Whisper model, dev-API access
AssemblyAI Universal-2 | Word-level | JSON, SRT, VTT, TXT | Word-level + LLM features (LeMUR) for downstream tasks
Deepgram Nova-3 | Word-level | JSON, SRT, VTT | Word-level + lowest streaming latency in the category
Self-hosted Whisper (Large-v3) | Word-level (with WhisperX or faster-whisper) | JSON, SRT, VTT | Free + open source; word-level requires WhisperX or faster-whisper wrappers
Otter.ai | Sentence-level (default), word-level (paid) | TXT, DOCX, SRT, VTT, MP3, PDF | Sentence-level standard; word-level available in some plans
Sonix | Sentence-level (default) | TXT, DOCX, SRT, VTT, JSON | Sentence-level standard; structured editor for cue editing
Rev (AI tier) | Sentence-level | TXT, DOCX, SRT, VTT, JSON, PDF | Sentence-level on AI tier; human tier uses time-coded markers per speaker turn
Trint | Sentence-level | TXT, DOCX, SRT, VTT | Sentence-level standard; sentence-level editing UI
YouTube auto-captions | Phrase-level (3-5 words) | SRT, VTT (via download) | Coarser than sentence-level; designed for live caption display

The honest summary: word-level timestamps are not unique to any single tool in 2026. Whisper-based tools (VexaScribe, Whisper API, self-hosted Whisper) and the proprietary engines behind AssemblyAI and Deepgram all expose them. The differentiator is how the word-level data is used downstream: specifically, whether the SRT/VTT exports use word-level timing for cue splitting or fall back to sentence-level interpolation. That's where subtitle quality genuinely varies.

Editing timestamps when they're off

Sometimes a transcript's timestamps are close but not quite right — drift, a missed cue boundary, a translated subtitle that needs shifting. Three tiers of editing tools:

Plain text editors (any platform)

SRT, VTT, and JSON are all plain text. Open in Notepad, VS Code, or Sublime Text and edit timestamp values directly. Use for one-off corrections of specific cues. Caveat: changing one timestamp doesn't cascade to neighbors — you can end up with overlapping cues if you're not careful.
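
After hand edits, a quick overlap check is cheap insurance. The Python sketch below scans an SRT or VTT file's timing lines and reports consecutive cues where one ends after the next begins. find_overlaps is a hypothetical helper, and it assumes cues appear in chronological order.

```python
import re

# Accepts both SRT (comma) and VTT (period) timing lines.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3}) --> (\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_ms(h, m, s, ms) -> int:
    """Convert timestamp fields (as strings) to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def find_overlaps(subtitle_text: str):
    """Return (n, n+1) pairs of 1-based cue positions that overlap in time."""
    spans = []
    for match in TIMING.finditer(subtitle_text):
        g = match.groups()
        spans.append((to_ms(*g[:4]), to_ms(*g[4:])))
    return [
        (i + 1, i + 2)
        for i in range(len(spans) - 1)
        if spans[i][1] > spans[i + 1][0]   # this cue ends after the next starts
    ]


edited = """1
00:00:01,500 --> 00:00:04,500
Hello, welcome to the show. Today we're going

2
00:00:04,201 --> 00:00:07,850
to talk about something genuinely useful.
"""
print(find_overlaps(edited))  # cue 1 was extended past the start of cue 2
```

The positions returned count timing lines in file order, which matches the SRT cue numbers only when the file is numbered sequentially from 1.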

Dedicated subtitle editors (free)

Subtitle Edit (Windows, free) and Aegisub (cross-platform, free) are purpose-built for SRT/VTT work. They show timing visually against a waveform, let you shift entire blocks (“move all cues forward by 500ms”), scale timing (“cues are 1.04× too slow, stretch them”), and resync against the audio. For more than a handful of corrections, use one of these.
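
For scripted workflows, the shift and scale operations described above can be approximated in a few lines of Python. This is a sketch under simple assumptions (HH:MM:SS timestamps with a comma or period separator, timing lines identified by the "-->" arrow), not a description of how either editor implements them; retime and its parameters are hypothetical names.

```python
import re

STAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})([,.])(\d{3})")


def retime(subtitle_text: str, offset_ms: int = 0, scale: float = 1.0) -> str:
    """Scale, then shift, every timestamp on timing lines of SRT/VTT text."""

    def fix(m):
        h, mi, s, sep, ms = m.groups()
        total = ((int(h) * 60 + int(mi)) * 60 + int(s)) * 1000 + int(ms)
        total = max(0, round(total * scale) + offset_ms)  # clamp at zero
        h2, rem = divmod(total, 3_600_000)
        mi2, rem = divmod(rem, 60_000)
        s2, ms2 = divmod(rem, 1000)
        return f"{h2:02d}:{mi2:02d}:{s2:02d}{sep}{ms2:03d}"  # keep the original separator

    out = []
    for line in subtitle_text.splitlines():
        if "-->" in line:             # only rewrite timing lines, never dialogue
            line = STAMP.sub(fix, line)
        out.append(line)
    return "\n".join(out)


cue = "1\n00:00:01,500 --> 00:00:04,200\nHello, welcome to the show."
print(retime(cue, offset_ms=500))     # move the cue forward by 500 ms
```

Because every timestamp moves by the same rule, a uniform shift or scale cannot introduce new overlaps between cues, which is what makes bulk operations safer than hand-editing individual values.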

In-app editor in the transcription tool

Most transcription tools have a built-in editor where you can edit text and timestamps together. VexaScribe, Otter, Sonix, Rev, Trint all support this. The advantage: timestamps stay synced as you edit text (deletions and insertions don't break timing). The disadvantage: limited to the tool's editing UX; advanced operations (bulk shifts, regex transforms) require export to a dedicated subtitle editor.

Frequently asked questions

How do I remove timestamps from a transcript?

Three options depending on where your transcript is. (1) Re-export from the transcription tool without timestamps. Most platforms (VexaScribe, Otter, Sonix, Rev) have a TXT or DOCX export option that lets you toggle timestamps off. This is the cleanest path, with no manual cleanup needed. (2) Find and replace in Microsoft Word using wildcards. Open Find > Advanced Find > More > Use wildcards, then search for patterns like \[[0-9:.,]@\] for bracket-style timestamps or [0-9]@:[0-9]@:[0-9]@,[0-9]@ --\> [0-9]@:[0-9]@:[0-9]@,[0-9]@ for SRT timing lines (Word wildcards use @ rather than + for “one or more”, and > must be escaped). Replace with nothing. (3) For TXT files at scale: use sed on macOS/Linux (sed -E 's/\[[0-9:.,]+\]//g' transcript.txt) or PowerShell on Windows. Re-exporting is the cleanest route when it's available; Word's wildcard syntax is finicky and breaks on edge cases.

What's the standard timestamp format for transcription?

There isn't one universal standard — the format depends on the file type. SRT (SubRip) uses HH:MM:SS,mmm with a comma as the decimal separator: 00:00:01,500 --> 00:00:04,200. WebVTT (W3C standard for HTML5 video) uses HH:MM:SS.mmm with a period: 00:00:01.500 --> 00:00:04.200. JSON typically uses fractional seconds as a number: { "word": "hello", "start": 1.500, "end": 1.890 }. Plain text often uses bracket notation: [00:01:30] or [00:01.50]. The SRT comma versus VTT period distinction is a real source of bugs — converting between them requires sed/find-replace, not a one-click tool in many editors.

Are word-level timestamps worth it over sentence-level?

It depends on what you're doing. For reading a transcript as prose, sentence-level is fine — you don't need to know exactly when each word was spoken. For subtitle generation, word-level is meaningfully better because cue boundaries land on real word starts and ends instead of being interpolated across a sentence (subtitles drift visibly when sentence-level timing is interpolated). For video editing, word-level matters when you're cutting on specific words. For quote-checking in journalism, word-level lets you jump to the exact word; sentence-level gets you within a few seconds. For research coding in NVivo/ATLAS.ti/MAXQDA, sentence-level is usually sufficient. The rule of thumb: if downstream use involves syncing to specific moments (subtitles, video edits, exact-word quotes), word-level. Otherwise sentence-level is fine.

Can I add timestamps to a transcript I already have?

Generally no — at least not accurately. Timestamps come from aligning the transcript text against the original audio during transcription; the timing data is generated at that step. If you only have the text (no audio), there's no way to recover real timestamps. If you have the original audio plus the text, you can re-run the audio through a transcription tool and use the new transcript's timestamps. Some tools also offer forced alignment — uploading an existing transcript plus the audio and getting timing matched word-by-word — but this is uncommon in consumer transcription tools. For most workflows, the practical answer is: re-transcribe the audio with timestamps enabled, then use that as your new master file.

Why do SRT and VTT use different decimal separators?

Historical and regional. SRT (SubRip) was developed in the late 1990s by a French developer; the comma is the decimal separator in French and most European number conventions. WebVTT was standardized by the W3C in 2010-2014 and chose the period to match JavaScript/JSON and broader web conventions. Both formats are valid in their respective ecosystems. The practical implication: converting an SRT file to VTT requires replacing all commas in timestamp lines with periods (and adding a WEBVTT header), and vice versa. Tools that export both formats handle this automatically; manual conversion in Word or Notepad is fragile because find-replace tends to catch commas in subtitle text too.

Do timestamps cost extra in transcription services?

No. Timestamps are universally included with transcription at no extra cost in 2026. They're a property of the transcription engine output, not a separate billable feature. Where pricing varies is granularity: sentence-level timestamps are standard everywhere (Otter, Rev, Sonix, Trint), but word-level timestamps were historically a premium feature on some platforms. As of 2026, word-level is also standard on Whisper-based tools and developer speech APIs (VexaScribe, OpenAI Whisper API, Deepgram, AssemblyAI, self-hosted Whisper). The remaining differentiation is in export formats: some tools include timestamps in SRT/VTT but not in plain TXT exports, requiring you to re-export to get the timestamps you want.

Which transcription tools give word-level versus sentence-level timestamps?

Word-level: VexaScribe (Whisper Large-v3), OpenAI Whisper API, AssemblyAI Universal-2, Deepgram Nova-3, self-hosted Whisper, WhisperX, faster-whisper. Sentence-level by default (with word-level available in some plans): Otter, Trint, Sonix, Rev AI. Phrase-level (groups of 3-5 words): YouTube auto-captions, some legacy services. The word-level vs sentence-level distinction has narrowed in 2026 because Whisper-based tools all expose word-level timing — but the consumer UI on many tools still defaults to sentence-level in exports. If you need word-level, check the export options in your transcription tool's UI; the data usually exists, it's just a setting away.

Can I edit timestamps manually if they're slightly off?

Yes, in any text editor or specialized subtitle tool. SRT and VTT are plain text formats — you can open them in Notepad, VS Code, or Notepad++ and edit the HH:MM:SS,mmm values directly. For larger adjustments, dedicated subtitle editors (Subtitle Edit on Windows, Aegisub cross-platform, both free) let you shift entire blocks, scale timing, or resync against the audio. JSON output is similarly editable in any text editor. The main caveat: changing one timestamp doesn't automatically cascade to neighbors. If you shift cue 5 by 200ms, cue 6 still starts at its original time — you might end up with overlap. Most subtitle editors detect this and warn you.

Methodology & disclosure

Sources: SRT format conventions verified against the original SubRip specification (1996) and the Matroska container wiki. WebVTT format verified against the W3C WebVTT 1.0 specification. Word-level timestamp capabilities verified against vendor API documentation: OpenAI Whisper API docs, AssemblyAI docs, Deepgram docs, Whisper paper (arXiv:2212.04356). Tool-by-tool granularity defaults verified against each vendor's help documentation as of June 2026.

Disclosure: This page is published by VexaScribe. Word-level timestamps are not unique to us — every Whisper-based tool exposes them. The differentiator we describe (using word-level data for SRT cue splitting) is documented on our SRT generator page; the same approach is feasible for any tool built on Whisper or equivalent engines.

Editorial standards: See our editorial standards.
