How to Transcribe an Interview: 5 Methods + Transcript Examples

Transcribing an interview used to take 4-6 hours per hour of audio. In 2026, AI does it in minutes — but with trade-offs that matter for research, journalism, and legal work. Here are five real methods, with cost and accuracy benchmarks for each.

By TranscribeVideo.ai Editorial Team · January 28, 2026 · Updated July 28, 2026

Before you start: pick the right transcription style

Three transcription styles exist. Picking the wrong one wastes time at the end when you have to re-edit.

Strict verbatim — every "um," every false start, every cough. Required for legal, qualitative research, linguistics, forensic work.
Intelligent verbatim — fillers and false starts removed, but speaker's vocabulary preserved exactly. Best for journalism, business, podcasts, and most general use.
Clean read — grammar fixed, lightly paraphrased for readability. Best for repurposing into prose (blog posts, books, marketing copy).

If you're not sure, default to intelligent verbatim. It's the most readable, the most useful for analysis, and the easiest to convert to clean read later. Full reference on transcription styles.

Method 1: AI URL-based transcription (fastest, ~$0)

If your interview is already hosted on YouTube or Vimeo, or shared as a Zoom recording link, pasting the URL into an AI tool is the fastest path. No install, and no signup for casual use; if the recording is already online, no upload either.

Workflow

Upload the interview to YouTube as an unlisted video, or use the existing public URL if it's already published. (A private video is not reachable by URL, so a transcription tool cannot fetch it.)
Open TranscribeVideo.ai.
Paste the URL.
Click Transcribe — get the transcript in 10-30 seconds.
Download as plain text or SRT.
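
If you have not worked with SRT before, the following minimal Python sketch shows how timed segments map onto that format. The segment structure (start and end in seconds, plus text) and the function names are assumptions made for illustration; this is not TranscribeVideo.ai's export code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues: index, time range, text, blank line between cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_range}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"


# Hypothetical segments, for illustration only.
example = [
    {"start": 0.0, "end": 3.5, "text": "Can you walk me through that morning?"},
    {"start": 3.5, "end": 7.25, "text": "I got in about quarter past eight."},
]
print(to_srt(example))
```

Plain text is simply the same segments without the index and time-range lines, which is why it is easier to read but cannot be synced back to the video.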

Accuracy

90-95% on clear single-speaker audio. Drops to 80-90% on overlapping speech, heavy accents, or noisy environments. AI returns intelligent verbatim by default — you'll need to add filler words manually if you need strict verbatim.

Cost and time

Free for the first 2 videos per session, $10/mo Pro for batch use. Time: 30 seconds per hour of audio.

Best for

Interviews already published on YouTube, Vimeo, or other public platforms
Casual interviews where intelligent verbatim is sufficient
Quick draft transcripts you'll review and refine yourself
Content creators who interviewed someone for their show

Method 2: AI file-upload transcription (Otter, Rev Auto, Whisper)

If your interview audio is a local file (.mp3, .wav, .m4a, .mp4), you'll need a tool that accepts file uploads. The dominant options:

Otter.ai

Upload a file in the browser or app. Otter transcribes in roughly half the audio's length (a 1-hour interview takes ~30 minutes). Free tier: 300 minutes/month. Pro: $10/mo. Strong real-time captions during live interviews. Excellent speaker diarization for 2-3 speaker interviews.

Rev.com auto-transcription

$0.25 per audio minute for AI-only transcription. Faster than Otter — typically delivers in 5-10 minutes for an hour of audio. Cleaner output formatting for easy export to Word/PDF.

Whisper (OpenAI, free, open-source)

The most flexible option for technical users. Install on your machine: pip install openai-whisper. Run: whisper interview.mp3 --model large-v3 --output_format srt. Free, runs locally, no data leaves your machine. Highest accuracy of any AI option in 2026 — particularly on accented speech. Trade-off: requires command-line comfort and a decent computer (8GB+ RAM, ideally a GPU for speed).

For a one-line install via Homebrew on Mac: brew install ffmpeg openai-whisper and you're ready to go.
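
The same model can also be driven from Python rather than the command line. The sketch below uses the openai-whisper package's `load_model` and `transcribe` calls; the wrapper names `transcribe_file` and `segments_to_text` are hypothetical, and `interview.mp3` is the example file name used above.

```python
def segments_to_text(segments: list[dict]) -> str:
    """Join Whisper-style segments into one line of text per segment."""
    return "\n".join(seg["text"].strip() for seg in segments)


def transcribe_file(path: str, model_name: str = "large-v3") -> str:
    """Transcribe a local audio file with the openai-whisper package."""
    # Imported here so the helper above still works where whisper is not installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    # result["segments"] is a list of dicts with "start", "end" and "text" keys.
    return segments_to_text(result["segments"])


# Usage (needs openai-whisper and ffmpeg installed):
# print(transcribe_file("interview.mp3"))
```

Working with the segment list instead of the finished file is useful when you want to add your own speaker labels or timestamps before exporting.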

Cost and time comparison

Tool	Cost (1 hr audio)	Speed	Accuracy
Otter.ai	Free up to 300 min/mo, $10/mo Pro	30 min	92-95%
Rev Auto	$15	5-10 min	93-96%
Whisper local (large-v3)	Free	10-30 min on GPU	94-97%
OpenAI Whisper API	$0.36 ($0.006/min)	2-5 min	94-97%
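
To rerun the per-hour figures in the table for an interview of a different length, a small calculation is enough. This sketch uses only the per-minute prices quoted in this article (flat subscriptions such as Otter Pro are left out); the function name is an assumption, and prices should be checked against each vendor before budgeting.

```python
from decimal import Decimal

# Per-minute prices as quoted in this article; verify current pricing before relying on them.
RATES_PER_MINUTE = {
    "Rev Auto": Decimal("0.25"),
    "OpenAI Whisper API": Decimal("0.006"),
    "Whisper local": Decimal("0"),
}


def cost_for(tool: str, audio_minutes: int) -> Decimal:
    """Cost in USD for a pay-per-minute tool, rounded to cents."""
    return (RATES_PER_MINUTE[tool] * audio_minutes).quantize(Decimal("0.01"))


for tool in RATES_PER_MINUTE:
    print(f"{tool}: ${cost_for(tool, 60)} per hour of audio")
```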

Method 3: Human transcription service (slowest, most accurate)

Human transcription is still the gold standard for high-stakes interviews. Use when:

The interview is for legal proceedings or court submission
The interview is for academic research subject to IRB requirements
You need certified accuracy for a published quote or claim
Audio quality is too poor for AI to handle reliably
The content is technical/medical/scientific with vocabulary AI struggles with

Major services

Rev.com human transcription: $1.50/audio min for standard, $2.50/min for verbatim. Returns in 24-48 hours typically. 99%+ accuracy.
GoTranscript: $0.84-2.20/min depending on speed and accuracy tier. Cheaper than Rev for non-urgent work.
3PlayMedia: Premium service, $2.50-4/min. Handles complex multi-speaker, technical, and broadcast-grade work.
SpeakWrite: $1.95-2.75/min depending on turnaround.
Scribie: Hybrid model with AI draft + human polish. $0.80-2/min.

Cost example

A 1-hour interview at Rev.com standard ($1.50/min): $90. At GoTranscript's cheapest tier ($0.84/min): about $50. At 3PlayMedia ($2.50-4/min): $150-240. AI alternatives: free to $15. The human cost premium reflects accuracy, speaker diarization quality, and proper noun handling that AI still struggles with.

Method 4: Hybrid workflow (recommended for most serious work)

The best practical approach in 2026: AI does the first pass, human refines. Combines AI speed with human accuracy.

Workflow

Run the interview through Whisper, Otter, or TranscribeVideo.ai for an initial transcript.
Open the transcript in Descript, oTranscribe, or any text editor with audio playback synced.
Play the audio at 1.5-2× speed while reading along.
Pause and correct any errors — proper nouns, technical terms, mishears, missing fillers.
Add speaker IDs and timestamps as needed for your use case.
Export to your delivery format.

Time per hour of audio

AI transcription: 5-30 minutes. Human review pass: 15-30 minutes for intelligent verbatim, 30-60 minutes for strict verbatim. Total: typically 20-90 minutes per hour of audio. Compared with pure manual transcription (4-6 hours per hour of audio), the hybrid workflow is typically 5-10× faster while preserving accuracy.

Cost per hour of audio

$0-15 for AI + your time for review. If your time is worth $50/hr, an hour of audio takes ~30 minutes to review at $25 of effective cost — still cheaper than human services and almost as accurate.
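
The arithmetic behind these estimates is easy to adapt to your own hourly rate and review speed. A minimal sketch, with function names chosen for illustration:

```python
def hybrid_cost(ai_cost: float, review_minutes: float, hourly_rate: float) -> float:
    """Effective cost of one hour of audio: AI fee plus the value of your review time."""
    return ai_cost + (review_minutes / 60) * hourly_rate


def speedup(manual_hours: float, hybrid_minutes: float) -> float:
    """How many times faster the hybrid workflow is than typing everything by hand."""
    return (manual_hours * 60) / hybrid_minutes


# The example above: free AI pass, 30 minutes of review, time valued at $50/hr.
print(hybrid_cost(ai_cost=0, review_minutes=30, hourly_rate=50))

# The same review after a $15 paid AI pass.
print(hybrid_cost(ai_cost=15, review_minutes=30, hourly_rate=50))

# 4 hours of manual work versus 45 minutes of hybrid work.
print(round(speedup(manual_hours=4, hybrid_minutes=45), 1))
```

Strict verbatim review pushes `review_minutes` toward 60, which is where the hybrid cost starts to approach the cheaper human services.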

Method 5: Manual transcription (slowest, full control)

Old school. Type everything yourself while listening. Used to be the default; in 2026 it's mostly for:

Languages AI doesn't support well (specific dialects, low-resource languages)
Audio so poor AI can't extract anything coherent
Highly technical content where errors compound (specialised medical terminology, niche scientific vocabulary)
Confidentiality requirements where no audio can leave your machine

Tools that help

oTranscribe — free web app that combines audio playback controls with a text editor. Hotkeys for play/pause, speed control, and timestamp insertion.
Express Scribe — desktop app with foot pedal support for professional transcribers.
InqScribe — paid Mac app with frame-accurate timestamps for video.

Time per hour of audio

4-6 hours for an experienced transcriber. 8-10 hours for someone new. Reserve for rare cases where the alternatives don't work.

Conventions to follow when formatting an interview transcript

Different fields have different conventions. Match yours.

Journalism

Speaker names in normal case followed by colon: "Sarah:"
Intelligent verbatim — clean fillers
Inline timestamps every 30-60 seconds for quote citation
[brackets] for clarifications added by the journalist

Qualitative research

Pseudonymous speaker IDs: "P1:", "P2:" or "Subject:", "Interviewer:"
Strict verbatim including fillers, false starts, pauses
[laughter], [sigh], [cough] for non-speech sounds
(...) or [pause - 3 sec] for pauses
Specific conventions vary by field — check your discipline's standards

Legal

Names in ALL CAPS followed by colon: "JOHN SMITH:"
Strict verbatim required
Page and line numbering
Certified by court reporter

Podcast / video content

Names in normal case
Clean read or intelligent verbatim depending on use
Timestamp markers for navigation: "[00:14:25]"
Headers for major topics if used as show notes
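
Several of these conventions call for timestamp markers at a regular interval. If your AI tool returns timed segments, the markers can be added automatically rather than typed. The sketch below assumes segments shaped as dicts with a `start` time in seconds and a `text` field; the function names are hypothetical.

```python
def timestamp_marker(seconds: float) -> str:
    """Format a position in the audio as a [HH:MM:SS] marker."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def add_markers(segments: list[dict], every_seconds: float = 60) -> list[str]:
    """Prefix a marker to the first segment, then to the first segment that
    starts at least `every_seconds` after the previous marker."""
    lines = []
    next_mark = 0.0
    for seg in segments:
        text = seg["text"].strip()
        if seg["start"] >= next_mark:
            text = f"{timestamp_marker(seg['start'])} {text}"
            next_mark = seg["start"] + every_seconds
        lines.append(text)
    return lines


print(timestamp_marker(865))
```

Set `every_seconds` to 30-60 for journalism-style quote citation, or to a few minutes for podcast navigation.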

What an interview transcript actually looks like: a worked example

Conventions are easier to judge against a real passage than in the abstract. Below is the same thirty seconds of a research interview, transcribed three ways. The audio is identical; only the transcription style changes.

True verbatim — every sound, exactly as spoken

[00:04:12]
INTERVIEWER: So, um, can you walk me through — walk me through what happened
             that morning?
PARTICIPANT: Yeah. Yeah, so I, uh... (pause) I got in about, like, quarter past
             eight? And the, the system was already down. [laughs] Again.
INTERVIEWER: [overlapping] Again—
PARTICIPANT: —again, yeah. Third time that month.

This is the strict verbatim style described at the start of this guide. Use it when how something was said carries meaning: conversation analysis, discourse analysis, clinical or forensic work, and anything where hesitation, self-correction or laughter is data. It is the slowest style to produce and the hardest to read.

Intelligent verbatim — filler removed, wording untouched

[00:04:12]
INTERVIEWER: Can you walk me through what happened that morning?
PARTICIPANT: I got in about quarter past eight, and the system was already down.
             [laughs] Again. Third time that month.

This is the default for most qualitative research, journalism and UX work. Filled pauses, stutters and false starts go; the participant’s actual words, grammar and register stay. Note that non-verbal events that change the meaning — the laugh here — are still kept.
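
Part of the step from a fuller transcript to intelligent verbatim is mechanical and can be roughed out in code before a human pass. The sketch below is a crude first pass only: the filler list is an assumption, it removes filled pauses and immediately repeated words, and it deliberately leaves bracketed events and discourse markers such as "like" alone. It will also collapse legitimate repetitions ("had had"), so the result still needs to be read against the audio.

```python
import re

# Filled pauses to drop. This word list is an assumption; extend it for your speakers.
FILLED_PAUSE = re.compile(r"\b(?:um+|uh+|erm?)\b[,.]*\s*", re.IGNORECASE)
# An immediately repeated word, such as "the, the" (a common false start).
REPEATED_WORD = re.compile(r"\b(\w+)(?:,?\s+\1\b)+", re.IGNORECASE)


def rough_intelligent_verbatim(text: str) -> str:
    """Strip filled pauses and stuttered repeats; leave wording and [events] intact."""
    text = FILLED_PAUSE.sub("", text)
    text = REPEATED_WORD.sub(r"\1", text)
    # Tidy any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


print(rough_intelligent_verbatim("So, um, can you walk me through what happened?"))
print(rough_intelligent_verbatim("And the, the system was already down. [laughs] Again."))
```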

Clean read — edited for the page

INTERVIEWER: Can you walk me through what happened that morning?
PARTICIPANT: I arrived at about 8:15 and the system was already down — the third
             time that month.

Grammar tidied, sentences smoothed. Appropriate for a published Q&A, a website or marketing material. Not appropriate for research data or anything where the participant’s exact phrasing may later be quoted or challenged, because the editing is no longer reversible.

Whichever style you pick, say so in writing. A single line at the top of the document — “Transcribed in intelligent verbatim” — tells a supervisor, co-author or editor how to read everything that follows.

Interview transcript format: a template you can copy

Most transcripts fail review not because the words are wrong but because the document has no header, so nobody can tell which interview it is six months later. This structure covers what an examiner, editor or co-researcher will look for:

INTERVIEW TRANSCRIPT

Project:          [Study or article title]
Interview ID:     P07
Date:             3 April 2026
Duration:         00:47:19
Location / Mode:  Video call (recorded)
Interviewer:      [Name or initials]
Participant:      P07 - [role/pseudonym, no identifying details]
Transcription:    Intelligent verbatim
Transcribed by:   [Name or tool] on [date]
Notes:            [Consent obtained; audio quality; anything unusual]

---

[00:00:04]
INTERVIEWER: ...

[00:00:21]
P07: ...
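
If you transcribe many interviews, generating the header from structured fields keeps every file consistent. A minimal sketch, assuming the field names below (they mirror the template but are otherwise hypothetical) and an 18-character label column to match its alignment:

```python
from dataclasses import dataclass


@dataclass
class TranscriptHeader:
    project: str
    interview_id: str
    date: str
    duration: str
    mode: str
    interviewer: str
    participant: str
    style: str
    transcribed_by: str
    notes: str = ""

    def render(self) -> str:
        """Return the header block with labels padded to a fixed column."""
        rows = [
            ("Project:", self.project),
            ("Interview ID:", self.interview_id),
            ("Date:", self.date),
            ("Duration:", self.duration),
            ("Location / Mode:", self.mode),
            ("Interviewer:", self.interviewer),
            ("Participant:", self.participant),
            ("Transcription:", self.style),
            ("Transcribed by:", self.transcribed_by),
            ("Notes:", self.notes),
        ]
        lines = ["INTERVIEW TRANSCRIPT", ""]
        lines += [f"{label:<18}{value}" for label, value in rows]
        lines += ["", "---"]
        return "\n".join(lines)


# Placeholder values for illustration only.
header = TranscriptHeader(
    project="Example study", interview_id="P07", date="3 April 2026",
    duration="00:47:19", mode="Video call (recorded)", interviewer="AB",
    participant="P07", style="Intelligent verbatim",
    transcribed_by="AB on 5 April 2026", notes="Consent obtained",
)
print(header.render())
```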

Conventions worth applying consistently inside the body:

Speaker labels — pick one form (full caps, initials, or pseudonym) and never switch. Anonymised IDs such as P07 are standard in research and save a redaction pass later.
Timestamps — every few minutes, or at each new question. Enough to find a passage in the audio; not so many that the page becomes unreadable.
Inaudible speech — mark it with the time, e.g. [inaudible 00:12:44]. Never guess a word silently; a marked gap is honest, an invented word is a fabricated quote.
Uncertain words — [sounds like: Kaufmann] flags a best guess as a guess.
Non-verbal events — [laughs], [long pause], [phone rings] in square brackets, so they can never be confused with speech.
Overlapping speech — mark it ([overlapping], or a dash at the interruption point). This is the single most common place automatic tools go wrong.
Redaction — replace names, employers and places at transcription time, not later: [EMPLOYER], [CITY].
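
Redaction at transcription time is easier to apply consistently with a fixed term list than by eye. The sketch below assumes you maintain a mapping from identifying terms to placeholders; the names in it are invented for illustration. It only catches exact terms, so misspellings and nicknames still need a manual check.

```python
import re

# Hypothetical mapping for one study; build yours from the anonymisation plan.
REDACTIONS = {
    "Acme Logistics": "[EMPLOYER]",
    "Rotterdam": "[CITY]",
}


def redact(text: str, redactions: dict[str, str]) -> str:
    """Replace each identifying term with its placeholder, longest terms first
    so that a long name is not partly replaced by a shorter one it contains."""
    for term in sorted(redactions, key=len, reverse=True):
        placeholder = redactions[term]
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        text = pattern.sub(lambda _match: placeholder, text)
    return text


print(redact("I joined Acme Logistics in Rotterdam.", REDACTIONS))
```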

Citing an interview transcript in APA 7

For an interview you conducted and have not published, APA 7 treats the exchange as a personal communication: it is cited in the text only and does not appear in the reference list, because there is no recoverable source a reader could retrieve.

(P07, personal communication, April 3, 2026)
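
When a report cites many participants, building this string from the participant ID and interview date avoids inconsistent date formats. A small sketch, with the function name as an assumption; confirm the exact form against your own style guide.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def personal_communication(source: str, when: date) -> str:
    """Build an in-text citation in the personal-communication form shown above."""
    formatted = f"{MONTHS[when.month - 1]} {when.day}, {when.year}"
    return f"({source}, personal communication, {formatted})"


print(personal_communication("P07", date(2026, 4, 3)))
```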

Research data is different from a personal communication. If the interviews are your dataset, the transcripts normally live in an appendix (or a repository) and are referred to by participant ID, with the anonymisation scheme described in your methods section. An interview that has been published (a magazine Q&A, a broadcast, a podcast episode) is cited as that published work, not as a personal communication.

Requirements differ between institutions and between APA, MLA, Chicago and Harvard, and departments often add their own appendix rules. Check your own handbook before formatting a dissertation appendix — treat the above as the general shape, not as your university’s specific rule.

Common interview transcription problems and how to fix them

Speaker labels keep getting mixed up. AI diarization fails on similar voices, overlapping speech, and quiet speakers. Manual review is required for any interview where speaker identity matters. Listen and correct.
Proper nouns are wrong. AI hallucinates plausible names. Compile a list of names, places, brands, and technical terms before transcription, then find-and-replace any errors.
Background noise creates phantom text. AI sometimes invents text from background sounds. Spot check unusual passages against the audio.
Long pauses get ignored. AI typically doesn't preserve pauses unless explicitly instructed. For research transcripts, manually insert pause markers based on the audio.
Multiple languages in one interview. Switch the tool to multilingual mode (Whisper handles 99 languages; its language parameter takes one language per run, so set the interview's main language explicitly for best results).
Audio quality is poor. Run audio through Adobe Enhance Speech (free for short clips) or NVIDIA Broadcast for noise removal first. Then transcribe.
Speakers mumble or speak quickly. Slow the playback to 0.7× speed in Descript or oTranscribe and listen-correct in the hybrid workflow.
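
For the missing-pause problem above, segment timings from the AI pass can suggest where pause markers belong before you confirm them by ear. This sketch assumes segments with `start`, `end` and `text` fields and a hypothetical two-second threshold; AI segment boundaries are approximate, so treat the output as candidates rather than measurements.

```python
def insert_pause_markers(segments: list[dict], min_gap: float = 2.0) -> list[str]:
    """Return transcript lines with a [pause - N sec] marker wherever the silence
    between one segment's end and the next segment's start reaches min_gap."""
    lines = []
    previous_end = None
    for seg in segments:
        if previous_end is not None:
            gap = seg["start"] - previous_end
            if gap >= min_gap:
                lines.append(f"[pause - {round(gap)} sec]")
        lines.append(seg["text"].strip())
        previous_end = seg["end"]
    return lines


# Hypothetical timings, for illustration only.
timed = [
    {"start": 0.0, "end": 4.0, "text": "I got in about quarter past eight."},
    {"start": 7.1, "end": 9.0, "text": "And the system was already down."},
    {"start": 9.3, "end": 10.0, "text": "Again."},
]
for line in insert_pause_markers(timed):
    print(line)
```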

Decision matrix — which method should you use?

Your situation	Best method
Interview on YouTube/Vimeo, casual journalism use	Method 1: URL-based AI
Local audio file, casual use	Method 2: Otter / Whisper
PhD research, qualitative interview	Method 4: Hybrid (Whisper + manual review)
Legal proceedings, court submission	Method 3: Human service (Rev verbatim)
Published article quotes for major publication	Method 4: Hybrid
Confidential interview, no cloud allowed	Method 2: Whisper local
Forensic / linguistic analysis	Method 3 or 5: Human
Multilingual interview	Method 2: Whisper (best multilingual AI)
Podcast show notes	Method 1 or 2 + light edit
Book content from interviews	Method 4: Hybrid + clean read pass
