How to Transcribe an Interview (5 Methods, 2026)
Transcribing an interview used to take 4-6 hours per hour of audio. In 2026, AI does it in minutes — but with trade-offs that matter for research, journalism, and legal work. Here are five real methods, with cost and accuracy benchmarks for each.
Before you start: pick the right transcription style
Three transcription styles exist. Picking the wrong one wastes time at the end when you have to re-edit.
- Strict verbatim — every "um," every false start, every cough. Required for legal, qualitative research, linguistics, forensic work.
- Intelligent verbatim — fillers and false starts removed, but speaker's vocabulary preserved exactly. Best for journalism, business, podcasts, and most general use.
- Clean read — grammar fixed, lightly paraphrased for readability. Best for repurposing into prose (blog posts, books, marketing copy).
If you're not sure, default to intelligent verbatim. It's the most readable, the most useful for analysis, and the easiest to convert to clean read later.
Method 1: AI URL-based transcription (fastest, ~$0)
If your interview is already hosted on YouTube, Vimeo, or as a Zoom cloud recording, pasting the URL into an AI tool is the fastest path. There is no file to upload to the transcription tool, nothing to install, and no signup for casual use.
Workflow
- If the interview isn't online yet, upload it to YouTube as an unlisted video (a private video can't be fetched by URL); otherwise use the existing public URL.
- Open TranscribeVideo.ai.
- Paste the URL.
- Click Transcribe — get the transcript in 10-30 seconds.
- Download as plain text or SRT.
Accuracy
90-95% on clear single-speaker audio. Drops to 80-90% on overlapping speech, heavy accents, or noisy environments. AI returns intelligent verbatim by default — you'll need to add filler words manually if you need strict verbatim.
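Percentages like these are easier to compare once you know how they are measured. The article does not say, so the sketch below assumes the common convention that accuracy is 1 minus word error rate (WER): the word-level edit distance between a trusted reference transcript and the AI output, divided by the reference length. The function names are illustrative, and punctuation is left attached to words for simplicity.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(reference: str, hypothesis: str) -> float:
    """Accuracy as 100 * (1 - WER), floored at 0 (assumed convention)."""
    return max(0.0, 100.0 * (1.0 - word_error_rate(reference, hypothesis)))
```

To spot-check a tool, hand-correct a one- or two-minute excerpt and pass it as `reference`, with the tool's raw output for the same excerpt as `hypothesis`. One wrong word in a ten-word sentence comes out at about 90%.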
Cost and time
Free for the first 2 videos per session, $10/mo Pro for batch use. Time: 30 seconds per hour of audio.
Best for
- Interviews already published on YouTube, Vimeo, or other public platforms
- Casual interviews where intelligent verbatim is sufficient
- Quick draft transcripts you'll review and refine yourself
- Content creators who interviewed someone for their show
Method 2: AI file-upload transcription (Otter, Rev Auto, Whisper)
If your interview audio is a local file (.mp3, .wav, .m4a, .mp4), you'll need a tool that accepts file uploads. The dominant options:
Otter.ai
Upload a file in the browser or app. Otter transcribes in roughly half the audio's length (a 1-hour interview takes ~30 minutes). Free tier: 300 minutes/month. Pro: $10/mo. Strong real-time captions during live interviews. Excellent speaker diarization for 2-3 speaker interviews.
Rev.com auto-transcription
$0.25 per audio minute for AI-only transcription. Faster than Otter — typically delivers in 5-10 minutes for an hour of audio. Cleaner output formatting for easy export to Word/PDF.
Whisper (OpenAI, free, open-source)
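The most flexible option for technical users. Install it on your machine with `pip install openai-whisper`; Whisper also needs `ffmpeg` available on your PATH to decode audio. Run: `whisper interview.mp3 --model large-v3 --output_format srt`. It is free and runs locally, so no data leaves your machine. Among the AI options compared here it has the highest accuracy, particularly on accented speech. Trade-off: it requires command-line comfort and a decent computer (8GB+ RAM, ideally a GPU for speed). The Whisper README lists roughly 10 GB of VRAM for the large models, so on a lighter machine a smaller model such as `medium` or `small` is the practical choice.
On a Mac, Homebrew installs both pieces in one line: `brew install ffmpeg openai-whisper`.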
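The same run can also be scripted from Python, which helps when you have a folder of interviews. This is a minimal sketch, assuming the `openai-whisper` package is installed: `whisper.load_model` and `model.transcribe` are the package's own calls, while `srt_timestamp`, `to_srt`, `transcribe_to_srt`, and the file names are illustrative helpers, not part of Whisper.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from Whisper-style segments (dicts with start, end, text)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, srt_path: str, model_name: str = "large-v3") -> None:
    """Transcribe a local file with openai-whisper and write an SRT file."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    with open(srt_path, "w", encoding="utf-8") as handle:
        handle.write(to_srt(result["segments"]))
```

A call such as `transcribe_to_srt("interview.mp3", "interview.srt")` does roughly what the command above does. Keeping the segment list in Python also makes later steps, such as adding speaker labels or pause markers, easier to automate.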
Cost and time comparison
| Tool | Cost (1 hr audio) | Speed | Accuracy |
|---|---|---|---|
| Otter.ai | Free up to 300 min/mo, $10/mo Pro | 30 min | 92-95% |
| Rev Auto | $15 | 5-10 min | 93-96% |
| Whisper local (large-v3) | Free | 10-30 min on GPU | 94-97% |
| OpenAI Whisper API | $0.36 ($0.006/min) | 2-5 min | 94-97% |
Method 3: Human transcription service (slowest, most accurate)
Human transcription is still the gold standard for high-stakes interviews. Use when:
- The interview is for legal proceedings or court submission
- The interview is for academic research subject to IRB requirements
- You need certified accuracy for a published quote or claim
- Audio quality is too poor for AI to handle reliably
- The content is technical/medical/scientific with vocabulary AI struggles with
Major services
- Rev.com human transcription: $1.50/audio min for standard, $2.50/min for verbatim. Returns in 24-48 hours typically. 99%+ accuracy.
- GoTranscript: $0.84-2.20/min depending on speed and accuracy tier. Cheaper than Rev for non-urgent work.
- 3PlayMedia: Premium service, $2.50-4/min. Handles complex multi-speaker, technical, and broadcast-grade work.
- SpeakWrite: $1.95-2.75/min depending on turnaround.
- Scribie: Hybrid model with AI draft + human polish. $0.80-2/min.
Cost example
A 1-hour interview at Rev.com standard: $90. At GoTranscript economy: about $50. At 3PlayMedia: $150-240 at the per-minute rates listed above. AI alternatives: free to $15. The premium for human work buys accuracy, speaker diarization quality, and proper noun handling that AI still struggles with.
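The arithmetic behind these figures is per-minute rate times audio length, so it is easy to rerun for your own recording. A small sketch using only rates quoted in this article (they may have changed since); the dictionary keys and function name are illustrative.

```python
# Per-audio-minute rates as quoted in this article, in US dollars.
RATES_PER_MINUTE = {
    "OpenAI Whisper API": 0.006,
    "Rev Auto (AI)": 0.25,
    "GoTranscript economy": 0.84,
    "Rev.com human standard": 1.50,
    "Rev.com human verbatim": 2.50,
}


def transcription_cost(service: str, audio_minutes: float) -> float:
    """Cost in dollars for one recording, rounded to cents."""
    return round(RATES_PER_MINUTE[service] * audio_minutes, 2)


def compare_costs(audio_minutes: float) -> list:
    """All services as (name, cost) pairs, cheapest first."""
    costs = [(name, transcription_cost(name, audio_minutes)) for name in RATES_PER_MINUTE]
    return sorted(costs, key=lambda pair: pair[1])
```

For a 60-minute interview this reproduces the numbers above: $0.36 for the Whisper API, $15 for Rev Auto, $50.40 for GoTranscript economy, and $90 for Rev.com standard.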
Method 4: Hybrid workflow (recommended for most serious work)
The best practical approach in 2026: AI does the first pass, human refines. Combines AI speed with human accuracy.
Workflow
- Run the interview through Whisper, Otter, or TranscribeVideo.ai for an initial transcript.
- Open the transcript in Descript, oTranscribe, or any editor that keeps audio playback in sync with the text.
- Play the audio at 1.5-2× speed while reading along.
- Pause and correct any errors: proper nouns, technical terms, misheard words, missing fillers.
- Add speaker IDs and timestamps as needed for your use case.
- Export to your delivery format.
Time per hour of audio
AI transcription: 5-30 minutes. Human review pass: 15-30 minutes for intelligent verbatim, 30-60 minutes for strict verbatim. Total: typically 20-90 minutes per hour of audio. Compared to pure manual transcription (4-6 hours per hour of audio), that is roughly 3× faster at worst and well over 10× faster at best, while preserving accuracy.
Cost per hour of audio
$0-15 for AI, plus your time for review. If your time is worth $50/hr and an hour of audio takes about 30 minutes to review, the effective cost is $25 plus any AI fee. That is still cheaper than human services and almost as accurate.
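The time and cost estimates for the hybrid workflow can be recomputed for your own hourly rate and review speed. A minimal sketch; the function and field names are illustrative, and the inputs are the figures quoted in this section.

```python
from typing import NamedTuple


class HybridEstimate(NamedTuple):
    total_minutes: float   # AI pass plus human review
    effective_cost: float  # AI fee plus the value of your review time


def hybrid_estimate(
    audio_hours: float,
    ai_minutes_per_hour: float,
    review_minutes_per_hour: float,
    hourly_rate: float,
    ai_fee_per_hour: float = 0.0,
) -> HybridEstimate:
    """Estimate elapsed time and effective cost of AI draft plus human review."""
    total_minutes = audio_hours * (ai_minutes_per_hour + review_minutes_per_hour)
    review_cost = audio_hours * (review_minutes_per_hour / 60) * hourly_rate
    return HybridEstimate(total_minutes, review_cost + audio_hours * ai_fee_per_hour)


def speedup_vs_manual(total_minutes: float, audio_hours: float, manual_hours_per_hour: float) -> float:
    """How many times faster the hybrid pass is than typing everything by hand."""
    return (audio_hours * manual_hours_per_hour * 60) / total_minutes
```

For one hour of audio with a 5-minute AI pass, a 30-minute review, and time valued at $50/hr, `hybrid_estimate(1, 5, 30, 50)` gives 35 minutes and $25. Feeding the slowest hybrid case (90 minutes) against the fastest manual case (4 hours) into `speedup_vs_manual` shows the low end of the speedup range.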
Method 5: Manual transcription (most labor-intensive, full control)
Old school. Type everything yourself while listening. Used to be the default; in 2026 it's mostly for:
- Languages AI doesn't support well (specific dialects, low-resource languages)
- Audio so poor AI can't extract anything coherent
- Highly technical content where errors compound (specialised medical terminology, niche scientific vocabulary)
- Confidentiality requirements where no audio can leave your machine and running a local AI model such as Whisper isn't an option
Tools that help
- oTranscribe — free web app that combines audio playback controls with a text editor. Hotkeys for play/pause, speed control, and timestamp insertion.
- Express Scribe — desktop app with foot pedal support for professional transcribers.
- InqScribe — paid Mac app with frame-accurate timestamps for video.
Time per hour of audio
4-6 hours for an experienced transcriber. 8-10 hours for someone new. Reserve for rare cases where the alternatives don't work.
Conventions to follow when formatting an interview transcript
Different fields have different conventions. Match yours.
Journalism
- Speaker names in normal case followed by colon: "Sarah:"
- Intelligent verbatim — clean fillers
- Inline timestamps every 30-60 seconds for quote citation
- [brackets] for clarifications added by the journalist
Qualitative research
- Pseudonymous speaker IDs: "P1:", "P2:" or "Subject:", "Interviewer:"
- Strict verbatim including fillers, false starts, pauses
- [laughter], [sigh], [cough] for non-speech sounds
- (...) or [pause - 3 sec] for pauses
- Specific conventions vary by field — check your discipline's standards
Legal
- Names in ALL CAPS followed by colon: "JOHN SMITH:"
- Strict verbatim required
- Page and line numbering
- Certified by court reporter
Podcast / video content
- Names in normal case
- Clean read or intelligent verbatim depending on use
- Timestamp markers for navigation: "[00:14:25]"
- Headers for major topics if used as show notes
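If your transcript lives as structured data (speaker, start time, text), the speaker-label and timestamp conventions above can be applied mechanically. The sketch below is one possible reading of those conventions, not a standard: the style names, the `pseudonyms` mapping, and the 30-second stamping interval are assumptions for illustration. Legal page and line numbering and certification are out of scope.

```python
def clock(seconds: float) -> str:
    """Format seconds as a [HH:MM:SS] marker."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"[{hours:02d}:{minutes:02d}:{secs:02d}]"


def format_turns(turns, style, pseudonyms=None, stamp_every=30):
    """Render (speaker, start_seconds, text) tuples in one of four styles."""
    lines = []
    last_stamp = None
    for speaker, start, text in turns:
        if style == "legal":
            # Names in ALL CAPS followed by a colon.
            lines.append(f"{speaker.upper()}: {text}")
        elif style == "research":
            # Swap real names for pseudonymous IDs where a mapping exists.
            label = (pseudonyms or {}).get(speaker, speaker)
            lines.append(f"{label}: {text}")
        elif style == "podcast":
            # Timestamp marker on every turn for navigation.
            lines.append(f"{clock(start)} {speaker}: {text}")
        elif style == "journalism":
            # Inline timestamp only when stamp_every seconds have passed.
            if last_stamp is None or start - last_stamp >= stamp_every:
                lines.append(f"{speaker}: {clock(start)} {text}")
                last_stamp = start
            else:
                lines.append(f"{speaker}: {text}")
        else:
            raise ValueError(f"unknown style: {style}")
    return "\n".join(lines)
```

With `turns = [("John Smith", 865, "I was there."), ("Sarah", 870, "When?")]`, the `"legal"` style yields `JOHN SMITH: I was there.` and the `"podcast"` style prefixes the first turn with `[00:14:25]`.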
Common interview transcription problems and how to fix them
- Speaker labels keep getting mixed up. AI diarization fails on similar voices, overlapping speech, and quiet speakers. Manual review is required for any interview where speaker identity matters. Listen and correct.
- Proper nouns are wrong. AI hallucinates plausible names. Compile a list of names, places, brands, and technical terms before transcription, then find-and-replace any errors.
- Background noise creates phantom text. AI sometimes invents text from background sounds. Spot-check unusual passages against the audio.
- Long pauses get ignored. AI typically doesn't preserve pauses unless explicitly instructed. For research transcripts, manually insert pause markers based on the audio.
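- Multiple languages in one interview. Use a multilingual model (Whisper supports about 99 languages). Whisper's language parameter takes one language per run, and its auto-detection looks only at the opening of the audio, so for an interview that switches languages the most reliable approach is to split the audio by language and transcribe each part with its language set explicitly.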
- Audio quality is poor. Run audio through Adobe Enhance Speech (free for short clips) or NVIDIA Broadcast for noise removal first. Then transcribe.
- Speakers mumble or speak quickly. Slow playback to 0.7× speed in Descript or oTranscribe and correct by ear during the hybrid review pass.
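Two of the fixes above, correcting proper nouns from a prepared list and inserting pause markers, are mechanical enough to script. A minimal sketch, assuming the transcript is available as Whisper-style segments (dicts with `start`, `end`, `text`). The function names, the 3-second threshold, and the sample corrections are illustrative. Segment timings from AI tools are approximate, so treat the inserted markers as candidates to verify against the audio.

```python
import re


def fix_proper_nouns(text: str, corrections: dict) -> str:
    """Replace known mis-transcriptions; corrections maps wrong -> right."""
    for wrong, right in corrections.items():
        # Whole-word, case-insensitive match so "sara" does not hit "Sarajevo".
        pattern = re.compile(rf"\b{re.escape(wrong)}\b", flags=re.IGNORECASE)
        text = pattern.sub(lambda _match, fixed=right: fixed, text)
    return text


def with_pause_markers(segments, min_gap: float = 3.0) -> str:
    """Join segments, adding [pause - N sec] where silence lasts min_gap or more."""
    parts = []
    previous_end = None
    for seg in segments:
        if previous_end is not None:
            gap = seg["start"] - previous_end
            if gap >= min_gap:
                parts.append(f"[pause - {round(gap)} sec]")
        parts.append(seg["text"].strip())
        previous_end = seg["end"]
    return " ".join(parts)
```

Run the glossary pass first and the pause pass second, then review the result by ear: a find-and-replace only catches the misspellings you anticipated.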
Decision matrix — which method should you use?
| Your situation | Best method |
|---|---|
| Interview on YouTube/Vimeo, casual journalism use | Method 1: URL-based AI |
| Local audio file, casual use | Method 2: Otter / Whisper |
| PhD research, qualitative interview | Method 4: Hybrid (Whisper + manual review) |
| Legal proceedings, court submission | Method 3: Human service (Rev verbatim) |
| Published article quotes for major publication | Method 4: Hybrid |
| Confidential interview, no cloud allowed | Method 2: Whisper local |
| Forensic / linguistic analysis | Method 3 or 5: Human |
| Multilingual interview | Method 2: Whisper (best multilingual AI) |
| Podcast show notes | Method 1 or 2 + light edit |
| Book content from interviews | Method 4: Hybrid + clean read pass |