How I Built an AI-Powered Thumbnail Generator for My Podcast
A deep dive into using Claude and Gemini to generate high-converting YouTube thumbnails with psychological hook strategies and a 4-stage pipeline that solves the AI face-morphing problem.
I was spending 2 hours per episode creating YouTube thumbnails for my podcast, Funds and Founders. Finding the right hook text, positioning faces, matching brand colors, ensuring mobile readability—it was a creative bottleneck that didn't scale.
Now it takes 2 minutes.
I built an AI-powered thumbnail generator that combines Claude for psychological hook text generation and Google's Gemini for image creation. The system produces 3 A/B testable thumbnails per episode, each with a different psychological approach designed to drive clicks.
Here's exactly how I built it, the prompts I used, and the critical problem I had to solve: AI face morphing.
A thumbnail generated by the system with hook text overlay
The Architecture
The system is a multi-stage pipeline where each AI handles what it's best at:
```
Episode Data → Claude (Hook Generation) → Claude (Selection) → Gemini (Image) → PIL (Text) → 3 Thumbnails
```

Why this architecture works:
- Claude excels at understanding content and generating strategic text options
- Gemini is powerful for image generation but struggles with text rendering
- PIL (Python Imaging Library) handles text overlay with perfect spelling and consistent styling
The key insight is that each component does ONE thing well. Trying to make Gemini render text or Claude generate images would produce inferior results.
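The flow above can be sketched as a simple orchestrator. This is a minimal sketch with stubbed stages, not my exact code: in the real system, each stub below is the Claude, Gemini, or PIL call described in the rest of this post.

```python
# Minimal pipeline sketch. Each stub stands in for a real
# Claude / Gemini / PIL call from the sections below.

def generate_hooks(episode):
    # Claude would return 5 hook options here (stubbed)
    return [f"hook {i}" for i in range(5)]

def select_best_hooks(hooks, n=3):
    # Claude would rank by click-through potential; stub takes the first n
    return hooks[:n]

def generate_image(episode, hook):
    # Gemini image generation (stubbed as a placeholder path)
    return f"scene_for_{hook}.png"

def add_text_overlay(image, hook):
    # PIL text overlay (stubbed)
    return (image, hook)

def generate_thumbnails(episode):
    hooks = select_best_hooks(generate_hooks(episode))
    return [add_text_overlay(generate_image(episode, h), h) for h in hooks]
```

Each stage is swappable because the interfaces between them are just text and images.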
The Psychology of High-CTR Thumbnails
Before any image generation, Claude analyzes the episode content and generates 5 hook text options, each using a different psychological approach:
| Approach | What It Does | Example |
|---|---|---|
| Curiosity | Creates an information gap viewers must click to fill | "The Secret Nobody Tells You" |
| Bold Claim | Makes a provocative or surprising statement | "This Changes Everything" |
| Emotional | Taps into relatable feelings or experiences | "I Almost Gave Up" |
| Benefit | Clearly states what the viewer will gain | "Double Your Revenue" |
| Contrarian | Presents an unexpected or controversial take | "Stop Following This Advice" |
Here's the actual system prompt I use for Claude:
```
You are an expert YouTube thumbnail strategist who specializes
in creating high-CTR podcast thumbnails.

Effective thumbnail text:
- Is SHORT (4-5 words maximum)
- Creates curiosity or emotional reaction
- Makes a bold or intriguing claim
- Is readable at small sizes
- Works with the visual to tell a story

Generate exactly 5 different hook options, each using a
different psychological approach.
```
Claude generates all 5 options, then I have it select the best 3 for A/B testing based on click-through potential, emotional impact, readability at thumbnail size, and differentiation.
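The hook-generation call itself is straightforward. Here's a minimal sketch that assembles a Messages API request; the payload would be sent with the Anthropic Python SDK via `anthropic.Anthropic().messages.create(**payload)`. The model name and `max_tokens` value are illustrative, not necessarily what I run in production.

```python
# Abbreviated version of the system prompt from this post
SYSTEM_PROMPT = """You are an expert YouTube thumbnail strategist who
specializes in creating high-CTR podcast thumbnails.
Generate exactly 5 different hook options, each using a
different psychological approach."""

def build_hook_request(episode_summary, model="claude-sonnet-4-20250514"):
    # Build the Messages API payload; send it with
    # anthropic.Anthropic().messages.create(**build_hook_request(...))
    return {
        "model": model,          # illustrative model choice
        "max_tokens": 500,
        "system": SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": f"Episode summary:\n{episode_summary}",
        }],
    }
```

A second call with the 5 options and the selection criteria (click-through potential, emotional impact, readability, differentiation) handles the narrowing to 3.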
The Gemini Image Generation Prompt
This is where the magic happens. I built a detailed prompt template that ensures brand consistency across all thumbnails. Here are the critical sections:
Brand Consistency
```
BRAND STYLE (CRITICAL - match the provided reference images):
- Primary brand color: #E87722 (signature orange)
- Secondary color: #000000 (black for contrast)
- Accent color: #8C1C3A (burgundy for highlights)
- Style: Professional podcast thumbnail matching the
  "Funds and Founders" aesthetic
```
I send reference images of my existing branding alongside the prompt. Gemini uses these to match the color palette and visual style.
Face Matching
This was the trickiest part. I send 2-3 reference photos of the guest's face, and the prompt explicitly instructs:
```
FACE CONSISTENCY (CRITICAL):
- Use ONLY the "GUEST FACE REFERENCE" images for the person's face
- DO NOT use any faces from "BRANDING STYLE REFERENCE" images
- Keep all facial features exactly the same as reference
- Preserve natural skin texture with visible pores
- Maintain accurate catchlights in eyes
```
The explicit labeling matters. Gemini gets confused if you don't clearly separate "these are for style reference" from "this is the actual face to use."
Mobile Legibility
YouTube thumbnails are often viewed at 120 pixels wide on mobile. The prompt enforces readability:
```
MOBILE LEGIBILITY (CRITICAL):
- Text must remain readable at 120px thumbnail width
- No text smaller than 1/8th of frame height
- Maximum 5 words total
- Highest-contrast element must be the TEXT, not the face
```
This prevents the common mistake of creating beautiful thumbnails that become illegible blobs on a phone screen.
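A quick way to sanity-check this before publishing is to downscale a finished render to feed size and eyeball it. A minimal PIL sketch:

```python
from PIL import Image

def mobile_preview(img, width=120):
    # Downscale a 1280x720 thumbnail to mobile feed width so you can
    # eyeball whether the hook text is still readable at that size
    height = round(img.height * width / img.width)
    return img.resize((width, height), Image.LANCZOS)
```

If the hook text isn't instantly readable in the 120px preview, the thumbnail fails regardless of how good it looks full-size.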
The Face Morphing Problem (And How I Solved It)
Here's the biggest problem I encountered: when asking Gemini to generate a scene with multiple people's faces simultaneously, it "averages" the faces together. You end up with a weird hybrid that looks like neither person.
The problem in action:
- Input: Guest reference photos + Host reference photos + Scene description
- Expected: Guest on left, host on right, looking distinct
- Actual: Two people who look like a blend of both
The solution: a 4-stage pipeline
Instead of generating everything at once, I break it into stages where each has ONE primary task:
```
Stage 1: Generate scene only (no faces)
          ↓
Stage 2: Add guest face to LEFT side
          ↓
Stage 3: Add host face to RIGHT side
          ↓
Stage 4: Add text overlay (PIL - not Gemini)
```

Here's the actual Stage 2 prompt:
```
Add a person's face to the LEFT EDGE of this image.

FACE PLACEMENT:
- Position: LEFT edge, occupying leftmost 25-30% of frame width
- Crop: Very tight crop showing head only
- Size: Face should fill the full height of the frame
- Direction: Face looking INWARD (toward the right/center)
- Expression: Intense, concerned, or skeptical - NO smiling

FACE CONSISTENCY (CRITICAL):
- Use the provided reference photo EXACTLY
- Match bone structure, skin tone, nose, eyes, jawline EXACTLY
- Preserve natural skin texture with visible pores
- Only adjust expression - do NOT alter any facial features

DO NOT MODIFY:
- The center scene (keep exactly as provided)
- The right side black area (will be used for another face later)
```
Why it works: Each stage has a simpler task. Gemini can focus on matching ONE face at a time rather than juggling multiple references and a complex scene simultaneously.
I save intermediate results (`_stage1_scene.png`, `_stage2_guest.png`) so I can see exactly where something went wrong if the final output isn't right.
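The staged loop with intermediate saves can be sketched as follows. `call_gemini` here is a hypothetical stand-in for the real Gemini image-editing call, and the stage names are illustrative:

```python
from PIL import Image

def run_pipeline(slug, stages, call_gemini, out_dir="."):
    # stages: list of (name, prompt, reference_images) tuples.
    # call_gemini is a stand-in for the actual Gemini image call;
    # it receives the previous stage's output as base_image.
    image = None
    for name, prompt, refs in stages:
        image = call_gemini(prompt, refs, base_image=image)
        # Save every intermediate so a failure at stage 3 is debuggable
        image.save(f"{out_dir}/{slug}_{name}.png")
    return image
```

Because each stage's input is the previous stage's saved output, you can also rerun a single failed stage without regenerating everything before it.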
The 4-Stage Pipeline in Action
Stage 1: Gemini generates the central scene with black edges where faces will be added
Stage 2: Guest face added to the left edge, maintaining exact facial features
Stage 3: Host face added to the right edge
Final: Text overlay added via PIL for perfect spelling and consistent styling
Why PIL for Text, Not AI
After the faces are in place, I add text overlay using Python's PIL library instead of asking Gemini to render it.
Problems with Gemini text rendering:
- May misspell words (I've seen "REVENU" instead of "REVENUE")
- Inconsistent font styling
- Takes 10-20 seconds per image
- Text placement varies unpredictably
Benefits of PIL:
- Perfect spelling every time
- Consistent styling across all thumbnails
- Fast (under 1 second)
- No additional API cost
The code is straightforward:
```python
from PIL import Image, ImageDraw, ImageFont

def add_text_overlay(image, hook_text, logo_path=None, use_outline=True):
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("ArchivoBlack-Regular.ttf", size=120)

    # Center the text horizontally, near the bottom of the frame
    text_w = draw.textlength(hook_text, font=font)
    x = (image.width - text_w) // 2
    y = image.height - 200

    # Draw text with black outline for readability
    if use_outline:
        for dx, dy in [(-4, 0), (4, 0), (0, -4), (0, 4)]:
            draw.text((x + dx, y + dy), hook_text, font=font, fill="black")
    draw.text((x, y), hook_text, font=font, fill="white")
    # (logo compositing via logo_path omitted here)
    return image
```

I bundle the Archivo Black font (bold, high-impact) with the project so it's consistent everywhere.
DOAC-Style Triptych Thumbnails
I also built support for "Diary of a CEO" style thumbnails—the dramatic triptych layout with a central scene:
Layout:
- LEFT 25%: Guest face (intense expression, no smiling)
- CENTER 50%: Dramatic AI-generated scene
- RIGHT 25%: Host face (intense expression, no smiling)
- Text: Lowercase, red box, 2-4 words max ("the truth", "before after")
For the central scene, Claude analyzes the episode content and selects one of four scene types:
| Scene Type | Visual Approach |
|---|---|
| Transformation | Before/after metaphor (damaged → repaired, failing → thriving) |
| Revelation | Discovering something hidden (door opening, light shining into darkness) |
| Threat | Ominous situation (robots, storm, danger approaching) |
| Contrast | Two opposing states side by side (success vs failure) |
This style requires the full 4-stage pipeline because it combines two distinct faces with a generated scene.
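The 25/50/25 split translates directly into pixel regions. A small helper (hypothetical, but the arithmetic matches the layout above) for a standard 1280x720 frame:

```python
def triptych_regions(width=1280, height=720):
    # 25/50/25 split: guest left, AI scene center, host right.
    # Returns (left, top, right, bottom) boxes usable with PIL's
    # Image.crop / Image.paste.
    left = round(width * 0.25)
    center = round(width * 0.50)
    return {
        "guest": (0, 0, left, height),              # left 25%
        "scene": (left, 0, left + center, height),  # center 50%
        "host": (left + center, 0, width, height),  # right 25%
    }
```

These same boundaries feed the Stage 2/3 prompts ("occupying leftmost 25-30% of frame width"), which keeps the prompt instructions and the compositing math in agreement.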
Key Learnings
After building this system and generating hundreds of thumbnails, here's what I learned:
1. Rate limiting is critical. Gemini has strict limits (around 10 requests per minute). Build in delays and retry logic, or you'll hit walls constantly.
2. Reference image ordering matters. Put the main prompt FIRST, then label reference images explicitly. Gemini processes content in order, and clear labels prevent confusion.
```
[Main prompt text]
=== GUEST FACE REFERENCE ===
[Guest image 1]
[Guest image 2]
=== BRANDING STYLE REFERENCE ===
[Logo and color samples]
```

3. Aspect ratio is a hint, not a guarantee
Even with aspect_ratio="16:9" in the config, Gemini sometimes produces square images. Build in post-processing to crop if needed.
4. Save intermediate results. When doing multi-stage generation, save each stage's output. It makes debugging 10x easier when something goes wrong at stage 3.
5. Use PIL for text, always. AI text rendering is unreliable. The few seconds you save aren't worth the spelling errors and inconsistent styling.
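For learning #3, the post-processing crop is a few lines of PIL. A minimal sketch of a center-crop to 16:9 when Gemini ignores the aspect-ratio hint:

```python
from PIL import Image

def crop_to_16_9(img):
    # Center-crop to 16:9 when Gemini returns the wrong aspect ratio
    target = 16 / 9
    ratio = img.width / img.height
    if abs(ratio - target) < 1e-6:
        return img
    if ratio > target:
        # Too wide: trim equal amounts from left and right
        new_w = round(img.height * target)
        x = (img.width - new_w) // 2
        return img.crop((x, 0, x + new_w, img.height))
    # Too tall (e.g. a square output): trim top and bottom
    new_h = round(img.width / target)
    y = (img.height - new_h) // 2
    return img.crop((0, y, img.width, y + new_h))
```

Cropping loses some of the generated image, so it's worth running this before the face stages rather than after, while the edges are still just scene.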
Results
A standard-style thumbnail with different hook approach
The system now generates 3 high-quality, A/B-ready thumbnails in about 2 minutes (most of that is Gemini processing time). Each thumbnail uses a different psychological hook approach, giving real data on what resonates with my audience.
The consistency across thumbnails has improved dramatically—same brand colors, same text styling, same professional quality. And because it's automated, I can generate thumbnails for an entire backlog of episodes without burning out.
If you're running a podcast or YouTube channel and spending hours on thumbnails, the investment in building (or using) an AI pipeline like this pays off quickly. The technology is finally good enough—you just need to understand its quirks and work around them.
Check out Funds and Founders to see these thumbnails in action.