Podcast Transcription MCP | Whisper + Speaker Diarization for Claude

The Problem

Podcast transcription seems solved until you need:

Speaker attribution: Who said what, with accurate timestamps
High accuracy: Industry-grade word error rate (WER)
AI integration: Direct access from Claude for analysis and summarization
Local processing: No cloud dependencies for sensitive content

Existing solutions either lacked speaker diarization, required cloud upload, or couldn't integrate with AI assistants.

My Approach

I combined two state-of-the-art models into a single MCP server:

Whisper Large-v3 for transcription (1550M parameters, 95%+ accuracy)
pyannote.audio 3.1 for speaker diarization (92.8% speaker precision)

The key challenge was temporal alignment - matching the transcription timestamps with speaker segments. I solved this by:

Running both models in parallel on the same audio
Aligning at the word level using timestamp overlap
Assigning each word to the speaker active during its timestamp

Architecture

Podcast Transcription MCP Server - Architecture Diagram

Key Features

95.2% WER Accuracy: Whisper Large-v3 on English podcasts
92.8% Speaker Precision: pyannote correctly identifies speaker boundaries
Parallel Processing: Transcription and diarization run simultaneously
Multiple Outputs: JSON (with timestamps), plain text, SRT subtitles
GPU Acceleration: 3 min for 1-hour podcast on RTX 4090 (vs 12 min CPU)
Speaker Statistics: Speaking time, word counts, turn-taking patterns

Results & Metrics

Metric	Value
Word Error Rate	95.2% accuracy
Speaker Precision	92.8%
1-hour podcast (GPU)	~3 minutes
1-hour podcast (CPU M1 Pro)	~12 minutes
Memory (GPU mode)	~6GB VRAM
Memory (CPU mode)	~8GB RAM

What I Learned

The hardest part was handling overlapping speech. When two people talk simultaneously, both models struggle:

Whisper often transcribes only the louder speaker
pyannote sometimes creates a third "overlap" segment

I added a post-processing step that:

Detects overlap regions from pyannote
Marks those segments as "multiple speakers"
Uses Whisper's word-level confidence to assign ambiguous words

For production use, I'd add a queue system for batch processing and implement caching for repeated queries on the same audio.

Frequently Asked Questions

What problem does this MCP server solve?

It provides high-quality podcast transcription with speaker attribution directly within Claude Desktop. You can transcribe audio and immediately ask Claude to summarize, extract quotes, or analyze the conversation.

What technologies power this project?

OpenAI Whisper Large-v3 (1550M parameters) for transcription, pyannote.audio 3.1 for speaker diarization, PyTorch with CUDA for GPU acceleration, and the MCP protocol for Claude integration.

How accurate is the speaker identification?

92.8% precision on typical two-person podcasts. Accuracy decreases with more speakers or significant background noise. The system works best with clear audio and distinct speaker voices.

Frequently Asked Questions

OpenAI Whisper Large-v3 (1550M parameters) for transcription, pyannote.audio 3.1 for speaker diarization, PyTorch with CUDA for GPU acceleration, and the MCP protocol for Claude integration.

92.8% precision on typical two-person podcasts. Accuracy decreases with more speakers or significant background noise. The system works best with clear audio and distinct speaker voices.

More Projects

View all

AI/ML

Podcast Content Pipeline

AI/ML

Podcast Vector Search MCP

Financial

Credit Card Benefits Organizer

Built by Abhinav Sinha

AI-First Product Manager who builds production-grade tools. Passionate about turning complex problems into elegant solutions using AI, automation, and modern web technologies.

Connect on LinkedIn Get in Touch View All Projects

The Problem

My Approach

Architecture

Key Features

Results &#x26; Metrics

What I Learned

Frequently Asked Questions

What problem does this MCP server solve?

What technologies power this project?

How accurate is the speaker identification?

Frequently Asked Questions

More Projects

Podcast Content Pipeline

Podcast Vector Search MCP

Credit Card Benefits Organizer

Built by Abhinav Sinha

Results & Metrics