How I Built an AI-Powered Thumbnail Generator for My Podcast

A deep dive into using Claude and Gemini to generate high-converting YouTube thumbnails with psychological hook strategies and a 4-stage pipeline that solves AI face morphing.


I was spending 2 hours per episode creating YouTube thumbnails for my podcast, Funds and Founders. Finding the right hook text, positioning faces, matching brand colors, ensuring mobile readability—it was a creative bottleneck that didn't scale.

Now it takes 2 minutes.

I built an AI-powered thumbnail generator that combines Claude for psychological hook text generation and Google's Gemini for image creation. The system produces 3 A/B testable thumbnails per episode, each with a different psychological approach designed to drive clicks.

Here's exactly how I built it, the prompts I used, and the critical problem I had to solve: AI face morphing.

[Image] A thumbnail generated by the system with hook text overlay

The Architecture

The system is a multi-stage pipeline where each AI handles what it's best at:

Episode Data → Claude (Hook Generation) → Claude (Selection) → Gemini (Image) → PIL (Text) → 3 Thumbnails

Why this architecture works:

  • Claude excels at understanding content and generating strategic text options
  • Gemini is powerful for image generation but struggles with text rendering
  • PIL (Python Imaging Library) handles text overlay with perfect spelling and consistent styling

The key insight is that each component does ONE thing well. Trying to make Gemini render text or Claude generate images would produce inferior results.

The Psychology of High-CTR Thumbnails

Before any image generation, Claude analyzes the episode content and generates 5 hook text options, each using a different psychological approach:

Approach | What It Does | Example
Curiosity | Creates an information gap viewers must click to fill | "The Secret Nobody Tells You"
Bold Claim | Makes a provocative or surprising statement | "This Changes Everything"
Emotional | Taps into relatable feelings or experiences | "I Almost Gave Up"
Benefit | Clearly states what the viewer will gain | "Double Your Revenue"
Contrarian | Presents an unexpected or controversial take | "Stop Following This Advice"

Here's the actual system prompt I use for Claude:

You are an expert YouTube thumbnail strategist who specializes
in creating high-CTR podcast thumbnails.

Effective thumbnail text:
- Is SHORT (4-5 words maximum)
- Creates curiosity or emotional reaction
- Makes a bold or intriguing claim
- Is readable at small sizes
- Works with the visual to tell a story

Generate exactly 5 different hook options, each using a
different psychological approach.

Claude generates all 5 options, then I have it select the best 3 for A/B testing based on click-through potential, emotional impact, readability at thumbnail size, and differentiation.
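
For reference, here's a minimal sketch of how that hook-generation call can look with the Anthropic Python SDK. The model name, token limit, and variable names are illustrative, not the exact code from my pipeline:

import anthropic

HOOK_STRATEGIST_PROMPT = "..."  # the system prompt shown above
episode_summary = "..."         # episode notes or a transcript summary

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=HOOK_STRATEGIST_PROMPT,
    messages=[{
        "role": "user",
        "content": "Episode summary:\n" + episode_summary
                   + "\n\nGenerate 5 hook options, then pick the best 3 for A/B testing.",
    }],
)
hook_options = response.content[0].text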

The Gemini Image Generation Prompt

This is where the magic happens. I built a detailed prompt template that ensures brand consistency across all thumbnails. Here are the critical sections:

Brand Consistency

BRAND STYLE (CRITICAL - match the provided reference images):
- Primary brand color: #E87722 (signature orange)
- Secondary color: #000000 (black for contrast)
- Accent color: #8C1C3A (burgundy for highlights)
- Style: Professional podcast thumbnail matching the
  "Funds and Founders" aesthetic

I send reference images of my existing branding alongside the prompt. Gemini uses these to match the color palette and visual style.

Face Matching

This was the trickiest part. I send 2-3 reference photos of the guest's face, and the prompt explicitly instructs:

FACE CONSISTENCY (CRITICAL):
- Use ONLY the "GUEST FACE REFERENCE" images for the person's face
- DO NOT use any faces from "BRANDING STYLE REFERENCE" images
- Keep all facial features exactly the same as reference
- Preserve natural skin texture with visible pores
- Maintain accurate catchlights in eyes

The explicit labeling matters. Gemini gets confused if you don't clearly separate "these are for style reference" from "this is the actual face to use."

Mobile Legibility

YouTube thumbnails are often viewed at 120 pixels wide on mobile. The prompt enforces readability:

MOBILE LEGIBILITY (CRITICAL):
- Text must remain readable at 120px thumbnail width
- No text smaller than 1/8th of frame height
- Maximum 5 words total
- Highest-contrast element must be the TEXT, not the face

This prevents the common mistake of creating beautiful thumbnails that become illegible blobs on a phone screen.

The Face Morphing Problem (And How I Solved It)

Here's the biggest problem I encountered: when asking Gemini to generate a scene with multiple people's faces simultaneously, it "averages" the faces together. You end up with a weird hybrid that looks like neither person.

The problem in action:

  • Input: Guest reference photos + Host reference photos + Scene description
  • Expected: Guest on left, host on right, looking distinct
  • Actual: Two people who look like a blend of both

The solution: a 4-stage pipeline

Instead of generating everything at once, I break it into stages where each has ONE primary task:

Stage 1: Generate scene only (no faces)

Stage 2: Add guest face to LEFT side

Stage 3: Add host face to RIGHT side

Stage 4: Add text overlay (PIL - not Gemini)

Here's the actual Stage 2 prompt:

Add a person's face to the LEFT EDGE of this image.

FACE PLACEMENT:
- Position: LEFT edge, occupying leftmost 25-30% of frame width
- Crop: Very tight crop showing head only
- Size: Face should fill the full height of the frame
- Direction: Face looking INWARD (toward the right/center)
- Expression: Intense, concerned, or skeptical - NO smiling

FACE CONSISTENCY (CRITICAL):
- Use the provided reference photo EXACTLY
- Match bone structure, skin tone, nose, eyes, jawline EXACTLY
- Preserve natural skin texture with visible pores
- Only adjust expression - do NOT alter any facial features

DO NOT MODIFY:
- The center scene (keep exactly as provided)
- The right side black area (will be used for another face later)

Why it works: Each stage has a simpler task. Gemini can focus on matching ONE face at a time rather than juggling multiple references and a complex scene simultaneously.

I save intermediate results (_stage1_scene.png, _stage2_guest.png) so I can see exactly where something went wrong if the final output isn't right.
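
In code, the staged flow looks roughly like this. generate_scene and add_face are hypothetical wrappers around the Gemini calls described above, and add_text_overlay is the PIL step shown later:

def build_staged_thumbnail(slug, scene_prompt, guest_photos, host_photos, hook_text):
    # Stage 1: central scene only, with black edges reserved for the faces
    scene = generate_scene(scene_prompt)
    scene.save(f"{slug}_stage1_scene.png")

    # Stage 2: guest face on the LEFT edge, matched to the reference photos
    with_guest = add_face(scene, guest_photos, side="left")
    with_guest.save(f"{slug}_stage2_guest.png")

    # Stage 3: host face on the RIGHT edge
    with_both = add_face(with_guest, host_photos, side="right")
    with_both.save(f"{slug}_stage3_host.png")

    # Stage 4: text overlay via PIL, never Gemini
    final = add_text_overlay(with_both, hook_text)
    final.save(f"{slug}_final.png")
    return final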

The 4-Stage Pipeline in Action

[Image] Stage 1: Gemini generates the central scene with black edges where faces will be added

[Image] Stage 2: Guest face added to the left edge, maintaining exact facial features

[Image] Stage 3: Host face added to the right edge

[Image] Final: Text overlay added via PIL for perfect spelling and consistent styling

Why PIL for Text, Not AI

After the faces are in place, I add text overlay using Python's PIL library instead of asking Gemini to render it.

Problems with Gemini text rendering:

  • May misspell words (I've seen "REVENU" instead of "REVENUE")
  • Inconsistent font styling
  • Takes 10-20 seconds per image
  • Text placement varies unpredictably

Benefits of PIL:

  • Perfect spelling every time
  • Consistent styling across all thumbnails
  • Fast (under 1 second)
  • No additional API cost

The code is straightforward:

from PIL import Image, ImageDraw, ImageFont

def add_text_overlay(image, hook_text, logo_path, use_outline=True):
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype("ArchivoBlack-Regular.ttf", size=120)

    # Center the hook text horizontally, near the bottom of the frame
    left, top, right, bottom = draw.textbbox((0, 0), hook_text, font=font)
    x = (image.width - (right - left)) // 2
    y = image.height - (bottom - top) - 80

    # Draw the text with a black outline for readability on any background
    if use_outline:
        for dx, dy in [(-4, 0), (4, 0), (0, -4), (0, 4)]:
            draw.text((x + dx, y + dy), hook_text, font=font, fill="black")

    draw.text((x, y), hook_text, font=font, fill="white")
    # Pasting the logo from logo_path is omitted in this excerpt
    return image

I bundle the Archivo Black font (bold, high-impact) with the project so it's consistent everywhere.
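
Usage is then one call per thumbnail (file names here are illustrative):

from PIL import Image

image = Image.open("episode42_stage3_host.png")
add_text_overlay(image, "DOUBLE YOUR REVENUE", logo_path="funds_and_founders_logo.png")
image.save("episode42_final.png")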

DOAC-Style Triptych Thumbnails

I also built support for "Diary of a CEO" style thumbnails—the dramatic triptych layout with a central scene:

Layout:

  • LEFT 25%: Guest face (intense expression, no smiling)
  • CENTER 50%: Dramatic AI-generated scene
  • RIGHT 25%: Host face (intense expression, no smiling)
  • Text: Lowercase, red box, 2-4 words max ("the truth", "before after")

For the central scene, Claude analyzes the episode content and selects one of four scene types:

Scene Type | Visual Approach
Transformation | Before/after metaphor (damaged → repaired, failing → thriving)
Revelation | Discovering something hidden (door opening, light shining into darkness)
Threat | Ominous situation (robots, storm, danger approaching)
Contrast | Two opposing states side by side (success vs failure)

This style requires the full 4-stage pipeline because you're combining two distinct faces with a generated scene.

Key Learnings

After building this system and generating hundreds of thumbnails, here's what I learned:

1. Rate limiting is critical. Gemini has strict limits (around 10 requests per minute). Build in delays and retry logic, or you'll hit walls constantly; a minimal backoff sketch follows.
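
For example, a simple backoff wrapper (the specific exception type worth catching depends on the Gemini SDK you use):

import time

def call_with_retry(fn, max_attempts=5, base_delay=10):
    # Retry an API call with exponential backoff when we hit rate limits
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # ideally catch the SDK's specific rate-limit error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({exc}); retrying in {delay}s")
            time.sleep(delay)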

2. Reference image ordering matters. Put the main prompt FIRST, then label reference images explicitly. Gemini processes content in order, and clear labels prevent confusion; a sketch of assembling this order follows the example below.

[Main prompt text]
=== GUEST FACE REFERENCE ===
[Guest image 1]
[Guest image 2]
=== BRANDING STYLE REFERENCE ===
[Logo and color samples]
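
Here's a rough sketch of assembling that order with the google-genai Python SDK (the SDK choice, model name, and file names are illustrative, not my exact client code):

from google import genai
from PIL import Image

main_prompt = "..."  # the full thumbnail prompt, including the sections shown earlier

client = genai.Client()  # reads the Gemini API key from the environment

# Main prompt first, then explicitly labeled reference images, in order
contents = [
    main_prompt,
    "=== GUEST FACE REFERENCE ===",
    Image.open("guest_photo_1.jpg"),
    Image.open("guest_photo_2.jpg"),
    "=== BRANDING STYLE REFERENCE ===",
    Image.open("brand_logo_and_colors.png"),
]

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # illustrative model name
    contents=contents,
)
# Extracting the generated image from the response is omitted here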

3. Aspect ratio is a hint, not a guarantee. Even with aspect_ratio="16:9" in the config, Gemini sometimes produces square images. Build in post-processing to crop if needed; a center-crop sketch follows.
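
A simple guard for that, as a sketch: center-crop anything that comes back taller than 16:9.

def enforce_16_9(image):
    # Center-crop an image that came back too tall (e.g., square) down to 16:9
    target_height = int(image.width * 9 / 16)
    if image.height <= target_height:
        return image
    top = (image.height - target_height) // 2
    return image.crop((0, top, image.width, top + target_height))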

4. Save intermediate results. When doing multi-stage generation, save each stage's output. It makes debugging 10x easier when something goes wrong at stage 3.

5. Use PIL for text, always. AI text rendering is unreliable. The few seconds you save aren't worth the spelling errors and inconsistent styling.

Results

[Image] A standard-style thumbnail with a different hook approach

The system now generates 3 high-quality, A/B-ready thumbnails in about 2 minutes (most of that is Gemini processing time). Each thumbnail uses a different psychological hook approach, giving real data on what resonates with my audience.

The consistency across thumbnails has improved dramatically—same brand colors, same text styling, same professional quality. And because it's automated, I can generate thumbnails for an entire backlog of episodes without burning out.

If you're running a podcast or YouTube channel and spending hours on thumbnails, the investment in building (or using) an AI pipeline like this pays off quickly. The technology is finally good enough—you just need to understand its quirks and work around them.

Check out Funds and Founders to see these thumbnails in action.