Transcribe Video for Captions
The transcript is step one of every captioning workflow. Get a clean, timestamped transcript first — then convert to SRT or VTT for accessibility-compliant captions.
Transcribing for captions: the accessibility team's workflow
If you're on an accessibility, video production, or compliance team, captioning is never just "add subtitles." It's a four-step workflow: transcribe the audio, time-align each phrase, format as SRT or VTT, then quality-check against the WCAG and ADA requirements your organization follows. The transcript is the foundation: every captioning decision downstream depends on having an accurate, verbatim record of spoken audio plus any meaningful non-speech audio (music cues, laughter, sound effects).
Skipping straight to auto-captions from YouTube or Instagram is fine for casual viewing, but unedited auto-captions generally do not meet WCAG 2.1 Level AA, Section 508, or the closed-captioning requirements of the 21st Century Communications and Video Accessibility Act (CVAA). Auto-captions routinely miss speaker identification, mis-punctuate sentences, drop proper nouns, and skip non-speech audio entirely. Dropped dialogue, missing speaker identification, and missing meaningful non-speech audio all fall short of what WCAG Success Criterion 1.2.2 (Captions, Prerecorded) requires.
A clean transcript lets your captioner correct errors before time-aligning, which is dramatically faster than fixing baked-in caption files. Captioning teams that work from a verified transcript first produce ADA-compliant SRT/VTT files in roughly half the time of teams that try to edit auto-captions in place.
This page explains how to get from a video URL to a transcript to a compliant caption file, what tools to use at each step (Aegisub, Subtitle Edit, the SRT/VTT export built into this site), and the most common compliance mistakes accessibility teams make.
Captions vs subtitles vs SDH — the accessibility distinction matters
The three terms are often used interchangeably outside the accessibility field, but they describe three different things, and conflating them is the most common compliance mistake we see in accessibility audits.
Subtitles
Subtitles assume the viewer can hear the audio. They translate spoken dialogue into another language (Spanish subtitles on an English film) or render speech as text for viewers who don't speak the audio language fluently. Subtitles typically omit non-speech audio because the assumption is that the viewer hears it — they only need help with the words.
Captions (closed captions)
Captions assume the viewer cannot hear the audio. They render all meaningful audio as text — dialogue plus speaker identification, sound effects, music cues, off-screen voices, and tonal information ([sarcastic], [whispered], [laughing]). Captions are the format required by the ADA, Section 508, the CVAA, and WCAG 2.1 SC 1.2.2. If your organization is required to caption video for accessibility, you need captions — not subtitles.
SDH (subtitles for the deaf and hard of hearing)
SDH is a streaming-era hybrid: it includes the non-speech audio of captions but uses subtitle styling (bottom of screen, no positioning) instead of traditional caption styling (positioned near the speaker, sometimes with speaker color coding). Netflix, Disney+, and Apple TV use SDH on most originals. SDH meets WCAG and ADA requirements in most cases, but some compliance reviewers prefer true CEA-608/CEA-708 closed captions for broadcast.
For a transcript-first workflow, the distinction matters because your transcript needs to capture different things depending on the output format. Building captions means transcribing every meaningful non-speech sound and identifying speakers; building subtitles means dialogue only. Build the transcript for the format you'll publish.
From transcript to SRT/VTT: the conversion workflow
Once you have a clean transcript with timestamps, conversion to a caption file is mechanical. Both SRT and VTT are plain-text formats that any captioning tool can import. The TranscribeVideo.ai output gives you timestamps in the transcript — your captioner uses them as anchor points and adjusts the in/out times for reading speed.
SRT format (SubRip)
The oldest and most widely supported caption format. Used by YouTube, Vimeo, Facebook, LinkedIn, and most video editing software. Each cue is numbered, with start and end timestamps in HH:MM:SS,mmm format and the caption text on the next line(s).
1
00:00:00,000 --> 00:00:03,200
[upbeat music playing]

2
00:00:03,500 --> 00:00:06,800
SARAH: Welcome to the show.
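As a rough illustration of the cue layout described above, the Python sketch below renders a list of cues as SRT text. The function names `format_srt_timestamp` and `render_srt`, and the `(start, end, text)` tuple shape, are assumptions made for this example; they are not the API of any tool mentioned on this page.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, caption_text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        # Each cue: sequence number, timing line, caption text.
        blocks.append(
            f"{index}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text}\n"
        )
    # Joining with a newline leaves one blank line between cues.
    return "\n".join(blocks)


cues = [
    (0.0, 3.2, "[upbeat music playing]"),
    (3.5, 6.8, "SARAH: Welcome to the show."),
]
print(render_srt(cues))
```

The blank line between cues matters: players use it to tell where one cue ends and the next begins.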
VTT format (WebVTT)
The modern standard, designed for HTML5 video. Supports positioning, styling via CSS, and metadata. Used by HBO, Netflix, and any HTML5 player using the <track> element. Functionally similar to SRT but timestamps use HH:MM:SS.mmm (with a period instead of a comma) and the file starts with a WEBVTT header.
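The two differences named above (the WEBVTT header and the period in timestamps) are small enough to sketch in a few lines of Python. This is a minimal sketch for simple dialogue-only files, assuming well-formed SRT input; `srt_to_vtt` is a hypothetical helper name, and the sketch ignores VTT features such as positioning and styling.

```python
import re

# Matches an SRT timing line such as "00:00:03,500 --> 00:00:06,800".
TIMING_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, swap ',' for '.' in timings."""
    out_lines = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            start, start_ms, end, end_ms = match.groups()
            out_lines.append(f"{start}.{start_ms} --> {end}.{end_ms}")
        else:
            # Cue numbers and caption text pass through unchanged.
            out_lines.append(line)
    return "\n".join(out_lines) + "\n"


srt = (
    "1\n00:00:00,000 --> 00:00:03,200\n[upbeat music playing]\n\n"
    "2\n00:00:03,500 --> 00:00:06,800\nSARAH: Welcome, everyone.\n"
)
print(srt_to_vtt(srt))
```

Only timing lines are rewritten, so commas inside caption text are left alone. The SRT cue numbers are kept because WebVTT allows an optional identifier line before each timing line.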
Reading speed limits
A widely used maximum reading speed for captions is 160 words per minute (about 17 characters per second). This figure comes from captioning style guides; WCAG itself does not set a numeric reading-speed limit. Captioning teams that go faster, which is common when they paste a transcript directly into SRT, produce captions viewers cannot read in time. The captioning editors we recommend below flag cues that exceed this limit.
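The 160 words-per-minute and 17 characters-per-second figures above can be checked per cue with simple arithmetic. The Python sketch below shows one way to do it; `reading_speed` and `exceeds_limit` are hypothetical names, and counting every character (including spaces and speaker labels) is an assumption, since captioning tools differ on what they count.

```python
MAX_WPM = 160  # words per minute, the limit cited on this page
MAX_CPS = 17   # characters per second, the limit cited on this page


def reading_speed(text: str, start: float, end: float) -> tuple[float, float]:
    """Return (words per minute, characters per second) for one cue."""
    duration = end - start
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    words = len(text.split())
    chars = len(text)  # assumption: spaces and speaker labels count
    return words * 60 / duration, chars / duration


def exceeds_limit(text: str, start: float, end: float) -> bool:
    """True when the cue is faster than either limit."""
    wpm, cps = reading_speed(text, start, end)
    return wpm > MAX_WPM or cps > MAX_CPS


# Five words shown for 1.5 seconds is 200 wpm, so this cue gets flagged.
print(exceeds_limit("one two three four five", 0.0, 1.5))
```

A flagged cue is usually fixed by extending its out time, splitting it in two, or trimming filler words where verbatim text is not required.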
Tools for the conversion step
- Subtitle Edit (free, Windows): The standard for captioning teams on a budget. Imports transcripts, auto-segments by sentence, and flags reading-speed violations. Exports SRT, VTT, SCC, STL, and more.
- Aegisub (free, cross-platform): Originally built for fan-subtitled anime, now the most-used free SRT/VTT tool on macOS and Linux. Excellent waveform view for precise timing.
- TranscribeVideo.ai SRT/VTT export: For straightforward dialogue-only captions, our exporter generates an SRT file directly from the transcript with auto-segmentation at sentence boundaries. Best for content that doesn't require speaker IDs or sound effect annotations.
- Adobe Premiere Pro caption workflow: Imports a transcript, lets you adjust timing on the timeline, and exports embedded or sidecar captions. Standard in broadcast and agency workflows.
- 3Play Media, Rev: Outsourced human-verified captioning when your compliance bar requires a human in the loop (Section 508 federal contracts often require this).
WCAG 2.1, ADA, Section 508, CVAA — what your captions actually need to meet
Captioning compliance is governed by overlapping regulations, and which one applies depends on who publishes the video, where it's distributed, and who the audience is. Most accessibility teams operate against the strictest applicable standard — WCAG 2.1 Level AA — because meeting it also satisfies the others.
WCAG 2.1 Success Criterion 1.2.2 (Level A): Captions, Prerecorded
Captions must be provided for all prerecorded audio in synchronized media. This is the baseline for any video on a website that needs to meet WCAG. Captions must be synchronized with the audio and include identification of speakers and non-speech sound when needed for understanding.
WCAG 2.1 SC 1.2.4 (Level AA): Captions, Live
Captions for live content (live-streamed events, webinars). Most teams use real-time auto-captioning from Zoom, Teams, or a CART writer.
ADA Title III (private businesses) and Title II (public entities)
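The ADA's statutory text doesn't name WCAG. For Title II (state and local government), a 2024 DOJ rule adopts WCAG 2.1 Level AA as the technical standard for web content. For Title III (private businesses) there is no equivalent regulation, but DOJ guidance, settlements, and litigation such as Robles v. Domino's have treated WCAG Level AA as the practical benchmark for web accessibility. Treat public-facing video on a US business website as subject to the ADA.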
Section 508 (US federal agencies and contractors)
The 2017 refresh aligned Section 508 with WCAG 2.0 AA. Federal websites and federal contractor deliverables require captioning that meets WCAG criteria. For federal video, captions usually need to be CEA-608 or CEA-708 (broadcast) in addition to WCAG-compliant.
CVAA (21st Century Communications and Video Accessibility Act)
Requires captioning for video programming previously broadcast on US TV when re-distributed online. Streaming services and any site re-publishing previously-broadcast content fall under this.
Common compliance mistakes we see in audits
- Auto-captions submitted as compliance: YouTube's auto-captions are not WCAG-compliant. They miss speakers, drop punctuation, and skip non-speech audio. Using them as-is is a documented Section 508 failure.
- Missing speaker identification: When multiple speakers are visible or audible, captions must identify which speaker said what. Off-screen speakers need explicit identification.
- Missing non-speech audio: [door slams], [crowd cheering], [phone ringing] — if it's meaningful to understanding the scene, it must be captioned.
- Reading speed too fast: Captions that exceed ~160 wpm are hard for viewers to read in time. SC 1.2.2 sets no numeric speed limit, but captions nobody can follow defeat the purpose of the criterion.
- Caption file not synced: SRT timestamps drift if the source video is re-rendered. Always verify timing on the final published video.
- No transcript provided separately: WCAG SC 1.2.3 (Audio Description or Media Alternative, Prerecorded) requires audio description or a full text alternative for prerecorded video. Captions alone don't satisfy this for media that conveys information visually.
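For the sync item in the list above, one simple case is a re-render that adds or removes a fixed amount of time at the start of the video (for example, a new intro card). The Python sketch below shifts every SRT timestamp by a constant offset. It assumes the offset is constant; it does not correct gradual drift caused by a frame-rate change, which needs rescaling rather than shifting. `shift_srt` is a hypothetical helper name.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def shift_srt(srt_text: str, offset_ms: int) -> str:
    """Shift every timestamp on SRT timing lines by offset_ms (may be negative)."""

    def shift(match: re.Match) -> str:
        h, m, s, ms = (int(part) for part in match.groups())
        total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
        total = max(total, 0)  # clamp so no cue starts before 00:00:00,000
        h, rem = divmod(total, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    shifted = []
    for line in srt_text.splitlines():
        # Only touch timing lines so digits inside caption text are left alone.
        shifted.append(TIMESTAMP.sub(shift, line) if "-->" in line else line)
    return "\n".join(shifted)


original = "1\n00:00:03,500 --> 00:00:06,800\nSARAH: Welcome to the show."
print(shift_srt(original, 1500))  # the re-render added a 1.5-second intro
```

After shifting, the captions still need a spot check against the published video, as the list item says.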
Workflow checklist: video URL to compliant caption file
Here is the step-by-step accessibility-team workflow we recommend for caption deliverables. Use it as a checklist for QA before sign-off.
- Identify the compliance bar. WCAG 2.1 AA, Section 508, CVAA — whichever is strictest determines what the captions must contain.
- Get the verbatim transcript. Paste the video URL into TranscribeVideo.ai. The transcript is the source of truth for everything downstream. Verify accuracy before time-aligning — fixing words in a transcript is 5× faster than fixing them in an SRT.
- Add non-speech audio and speaker identification. Walk through the transcript and annotate [music], [laughter], [door slam], speaker names. This is the step auto-captions skip.
- Import into Subtitle Edit, Aegisub, or your team's tool. Use the transcript timestamps as anchor points and adjust in/out times for reading speed.
- Segment for readability. Maximum two lines per caption, maximum 42 characters per line, maximum ~7 seconds per cue. Break at natural sentence boundaries — never mid-clause.
- QA reading speed. Run the reading-speed checker. Anything above 160wpm must be split or shortened.
- Export SRT or VTT. SRT for legacy platforms (YouTube, Vimeo, Facebook), VTT for HTML5 video and Apple/Netflix-style platforms.
- Embed or sidecar. Embedded captions are part of the video file (CEA-608/708); sidecar files are loaded separately by the player. Most web video uses sidecars.
- Sign off against WCAG 1.2.2. Document the compliance review: who QA'd, what bar (AA), date, and any exceptions noted.
This is a roughly 1-to-3 ratio: a 10-minute video takes about 30 minutes to caption from scratch using the transcript-first workflow. Trying to fix auto-captions in place runs 2-3× longer with more errors.
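The line-length and line-count limits from the "Segment for readability" step can be enforced mechanically. The Python sketch below uses the standard library's `textwrap` to wrap text at 42 characters and group the result two lines per cue. It is only a length check under those two limits: it breaks at whitespace, not at sentence or clause boundaries, so a captioner still needs to move breaks by hand. `split_into_cues` is a hypothetical helper name.

```python
import textwrap

MAX_CHARS_PER_LINE = 42  # from the checklist above
MAX_LINES_PER_CUE = 2    # from the checklist above


def split_into_cues(text: str) -> list[str]:
    """Wrap text to 42-character lines and group them two lines per cue."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    return [
        "\n".join(lines[i : i + MAX_LINES_PER_CUE])
        for i in range(0, len(lines), MAX_LINES_PER_CUE)
    ]


text = (
    "Welcome to the show. Today we are talking about captions, "
    "transcripts, and why reading speed matters for every viewer."
)
for cue in split_into_cues(text):
    print(cue)
    print("---")
```

Each returned string is the text of one cue; timing still has to be assigned separately, keeping each cue within the ~7-second maximum and the reading-speed limit.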
How It Works
1. Paste the TikTok, YouTube, Shorts, or Instagram Reel URL into the transcription tool. The transcriber pulls audio from the public source, with no download to your machine and no upload step.
2. Generate the verbatim transcript with timestamps in under 30 seconds. Review for accuracy before captioning: proper nouns, technical terms, and speaker turns are the most common AI errors and the cheapest to fix in plain text.
3. Annotate non-speech audio and speaker identification in the transcript. This is the step auto-captions skip and the reason WCAG-compliant captioning has to start from a clean transcript rather than a baked-in caption file.
4. Import the transcript into Subtitle Edit, Aegisub, or use the built-in SRT/VTT export. Time-align cues, enforce reading-speed limits (max ~160 wpm), and segment at natural sentence boundaries.
5. Export as SRT for legacy platforms (YouTube, Vimeo, Facebook) or VTT for HTML5 video and Apple/Netflix workflows. Upload alongside the video file or embed as CEA-608/708 for broadcast deliverables.
Why Use This Tool?
- ✓ Captioning starts with a clean transcript: fixing errors in plain text is 5× faster than fixing them in a baked-in SRT/VTT file, which is why every professional captioning workflow goes transcript-first.
- ✓ Auto-captions from YouTube and Instagram are not WCAG 2.1 AA compliant; they miss speakers, mis-punctuate, and drop non-speech audio. A clean transcript lets your team produce compliant captions in roughly half the time.
- ✓ Timestamps in the transcript become the in/out anchor points for your SRT/VTT file. Subtitle Edit and Aegisub both import this format directly, so the conversion step is mechanical.
- ✓ Verbatim accuracy matters for legal accessibility deliverables: Section 508 and ADA compliance both require captions that reflect what was actually said. AI plus a human transcript review is the standard workflow.
- ✓ Free for up to 2 videos at a time with no login. Accessibility teams can QA a workflow before committing to a paid tier or outsourced captioning vendor.
Use Cases
- Higher-ed accessibility office captioning a lecture series: get the transcript, add speaker IDs, export SRT, upload to the LMS alongside the video for WCAG 2.1 AA compliance.
- Corporate L&D team building captions for internal training videos: transcript first, then human-reviewed SRT for the LMS, complying with ADA Title I if the videos are part of onboarding.
- Federal agency contractor delivering a video to a Section 508 audit: transcript-first workflow, CEA-608 captions embedded in MP4, separate SRT sidecar and plain-text transcript for the deliverable.
- Marketing video producer adding captions for LinkedIn and Instagram autoplay: fast SRT export from a clean transcript, no manual typing of every line.
- Podcast network captioning back-catalog episodes for ADA-compliant transcripts on the website: bulk transcribe, post-edit, export each episode's transcript as a downloadable file plus VTT captions for the embedded audio player.
- Documentary editor preparing SDH for a streaming submission: verbatim dialogue plus non-speech audio cues in VTT, meeting the platform's accessibility specs.
Frequently Asked Questions
Can I use this transcript directly as captions?
Not directly. A transcript is the source text; captions are time-aligned, segmented cues with speaker IDs and non-speech audio annotations. Use the transcript as input to Subtitle Edit or Aegisub (free), or use our built-in SRT export for dialogue-only captions that don't require speaker IDs or sound effects.
Are auto-captions ADA or WCAG compliant?
No. Unedited YouTube and Instagram auto-captions do not meet WCAG 2.1 AA: they miss punctuation, speaker identification, and non-speech audio. Courts and DOJ enforcement (for example, Robles v. Domino's) have treated WCAG Level AA as the practical benchmark for ADA web accessibility, so relying on auto-captions for a public-facing US business video is a compliance risk.
What's the difference between SRT and VTT?
SRT is the older, broader-compatibility format used by YouTube, Vimeo, Facebook, and most video editors. VTT is the HTML5 standard with optional styling and positioning, used by Apple, Netflix, and HTML5 players via the <track> element. Functionally they're nearly identical; pick based on your player.
Do I need captions or subtitles for accessibility?
Captions. Subtitles assume the viewer can hear the audio and translate dialogue only. Captions assume the viewer cannot hear and include speaker IDs, non-speech audio, and tonal cues. ADA, Section 508, CVAA, and WCAG 2.1 all require captions, not subtitles, for accessibility.
How fast should captions read?
A common maximum is ~160 words per minute (about 17 characters per second). The figure comes from captioning style guides rather than from WCAG, which sets no numeric limit, but faster captions are hard for viewers to read in time. Subtitle Edit and Aegisub both flag reading-speed violations during QA.
Can I export SRT or VTT directly from this tool?
Yes. We provide an SRT/VTT export from the transcript that auto-segments at sentence boundaries and respects reading-speed limits. For captions that require speaker IDs and non-speech audio annotations, use Subtitle Edit or Aegisub after exporting the base SRT.
What about Section 508 captioning for federal deliverables?
Section 508 (2017 refresh) aligns with WCAG 2.0 AA. Federal deliverables typically require CEA-608 or CEA-708 broadcast captions in addition to a WCAG-compliant SRT/VTT sidecar. Most federal contractors use a human-reviewed captioning vendor (3Play, Rev, Verbit) for the final deliverable — but the transcript-first workflow we describe here is still the source of truth.
Is the tool free for accessibility teams?
Yes for individual use — free for up to 2 videos at once with no account. For bulk captioning of a back catalog or a course library, the $10/month Pro plan covers 10 videos per session and batch processing.