Automation

Stealth Startup Scraper

Intelligence pipeline that extracts founder profiles from the Stealth Startup Spy newsletter. 1,216+ profiles processed with 18+ structured fields per entry.

Tech Stack: 5 tools
Timeline: Development
Status: In Progress

TL;DR: I built an intelligence pipeline that extracts structured founder and startup data from the Stealth Startup Spy newsletter. It has processed 200+ editions and stores 1,216+ profiles with 18+ structured fields each. It features batch processing, gap detection, and MCP integration.

The Problem

Stealth Startup Spy is a valuable newsletter that covers founders coming out of stealth mode. But the data is:

  • Unstructured: Prose format, not a database
  • Scattered: Across 200+ monthly editions
  • Not searchable: Can't query "founders from Google" or "AI startups"
  • Manual to track: No way to monitor new editions automatically

I wanted to build a searchable database of stealth startups for research and networking.

My Approach

I built a multi-stage extraction pipeline:

  1. Newsletter Ingestion: Fetch Substack content via Firecrawl API
  2. Section Detection: Identify "Founders Coming Out of Stealth" and "Key Talent Going Under Stealth"
  3. Profile Extraction: Parse 18+ structured fields using regex patterns
  4. Validation & Storage: Type checking, constraint enforcement, Supabase insert
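The section-detection stage could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two header titles come from the list above, but the exact header markup, the `SECTION_HEADERS` mapping, and the `split_sections` helper are assumptions made for the example.

```python
import re

# Assumed header spellings. The titles are taken from the pipeline description;
# the surrounding markup (e.g. "##") is a guess, so punctuation is matched loosely.
SECTION_HEADERS = {
    "coming_out": re.compile(
        r"^[^\w\n]*Founders Coming Out of Stealth[^\w\n]*$", re.I | re.M
    ),
    "going_under": re.compile(
        r"^[^\w\n]*Key Talent Going Under Stealth[^\w\n]*$", re.I | re.M
    ),
}


def split_sections(text: str) -> dict[str, str]:
    """Return the body text found under each recognized section header."""
    hits = []
    for name, pattern in SECTION_HEADERS.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), match.end(), name))
    hits.sort()  # order sections by where they appear in the text

    sections = {}
    for i, (_, end, name) in enumerate(hits):
        # A section runs from the end of its header to the next header (or EOF).
        next_start = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections[name] = text[end:next_start].strip()
    return sections


sample = """Intro paragraph.

## Founders Coming Out of Stealth
Jane Doe, ExampleCo

## Key Talent Going Under Stealth
John Roe, formerly at Initech
"""
sections = split_sections(sample)
print(sections["coming_out"])
print(sections["going_under"])
```

Keeping the headers in one mapping means an older header spelling can be supported by adding another pattern, without touching the splitting logic.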

The system uses MCP (Model Context Protocol) for Claude Code integration, enabling direct database operations.

Architecture

[Figure: Stealth Startup Scraper architecture diagram]

Key Features

  • Dual Section Parsing: Handles both "Coming Out of Stealth" and "Going Under Stealth"
  • 18+ Structured Fields: From basic info to funding details
  • Gap Detection: Identifies missing newsletter editions
  • Batch Processing: Process multiple editions in one command
  • Dry-Run Mode: Preview extractions without database writes
  • Idempotent Processing: Safe to re-run without duplicates
  • Monthly Automation: ./run_monthly.sh for scheduled execution
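Gap detection, dry-run mode, and idempotent processing can be illustrated with a small in-memory sketch. It assumes editions are numbered with consecutive integers and that a profile is uniquely identified by edition, name, and company. The function names, the dictionary standing in for the database table, and the key choice are all hypothetical.

```python
def find_missing_editions(archived, latest, first=1):
    """Return edition numbers in [first, latest] absent from the archive."""
    return [n for n in range(first, latest + 1) if n not in archived]


def profile_key(profile):
    """Hypothetical natural key: same edition, name, and company means same profile."""
    return (
        profile["edition"],
        profile["name"].strip().lower(),
        profile["company"].strip().lower(),
    )


def store_profiles(db, profiles, dry_run=False):
    """Add unseen profiles to db (a dict standing in for the real table).

    Returns the number of new profiles; with dry_run=True nothing is written.
    """
    new_keys = set()
    for profile in profiles:
        key = profile_key(profile)
        if key in db or key in new_keys:
            continue  # already stored, or repeated within this batch
        new_keys.add(key)
        if not dry_run:
            db[key] = profile
    return len(new_keys)


missing = find_missing_editions({1, 2, 4, 6}, latest=6)
print(missing)  # [3, 5]

db = {}
batch = [
    {"edition": 4, "name": "Jane Doe", "company": "ExampleCo"},
    {"edition": 4, "name": " jane doe ", "company": "EXAMPLECO"},  # duplicate
]
previewed = store_profiles(db, batch, dry_run=True)  # counts, but writes nothing
inserted = store_profiles(db, batch)
rerun = store_profiles(db, batch)  # nothing new on a second run
print(previewed, inserted, rerun)  # 1 1 0
```

In a real database the same effect would more likely come from a unique constraint plus an upsert than from an in-memory check, but the principle is the same: re-running a batch adds nothing new.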

Results & Metrics

Metric                  Value
Newsletters Archived    200+
Profiles Extracted      1,216+
Fields per Profile      18+
Latest Edition          #266
Processing Rate         ~5 seconds/newsletter
Error Rate              <1%

What I Learned

The hardest part was handling format variations. The newsletter's format evolved over time:

  • Early editions used different section headers
  • Some editions have inline LinkedIn links, others have separate fields
  • Funding info is sometimes detailed, sometimes just "stealth"

I built a flexible regex parser that handles variations:

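# Multiple patterns for the same field
linkedin_patterns = [
    r"Connect on LinkedIn:\s*(\S+)",   # labeled link
    r"LinkedIn:\s*(\S+)",              # shorter label
    # Inline Markdown link; the optional "www." also covers the usual LinkedIn host
    r"\[Connect\]\((https://(?:www\.)?linkedin\.com/[^)]+)\)",
]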
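One plausible way to apply such a pattern list is to try each pattern in turn and keep the first capture. The sketch below assumes that strategy; the `first_match` helper is hypothetical. The pattern list is repeated so the block runs on its own, and the third pattern here allows an optional `www.` prefix, which is an assumption about how the links appear.

```python
import re
from typing import Optional

LINKEDIN_PATTERNS = [
    r"Connect on LinkedIn:\s*(\S+)",
    r"LinkedIn:\s*(\S+)",
    r"\[Connect\]\((https://(?:www\.)?linkedin\.com/[^)]+)\)",
]


def first_match(patterns: list[str], text: str) -> Optional[str]:
    """Return the first capture group from the first pattern that matches."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None  # no variation matched; the field stays empty


labeled = first_match(LINKEDIN_PATTERNS, "Connect on LinkedIn: https://linkedin.com/in/jane")
inline = first_match(LINKEDIN_PATTERNS, "[Connect](https://www.linkedin.com/in/john)")
absent = first_match(LINKEDIN_PATTERNS, "No profile link in this entry")
print(labeled, inline, absent)
```

Returning `None` instead of raising keeps a single unmatched field from failing the whole profile, which suits a format that varies across editions.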

The MCP integration was valuable for ad-hoc queries during development. Instead of writing database queries manually, I could ask Claude to "find all ex-Google founders."

Frequently Asked Questions

What problem does this scraper solve?

It converts unstructured newsletter prose into a searchable database of 1,216+ founder profiles. You can query by prior company, industry, location, funding status, and more.

What technologies power this project?

Python for the scraping and parsing logic, Firecrawl for robust web content extraction, Supabase PostgreSQL for structured storage, and MCP for Claude Code integration.

How accurate is the extraction?

Accuracy is above 99% for structured fields like names, companies, and locations. Some fields, such as "team_size" or "funding_info", depend on what the newsletter actually reports and may be incomplete.
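Because some fields can be missing, one way to model a profile is with required core fields and optional extras. The sketch below is an assumption-laden illustration: only "team_size" and "funding_info" are field names taken from the text, while `FounderProfile`, `build_profile`, and the other field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FounderProfile:
    name: str                            # required
    company: str                         # required
    location: Optional[str] = None       # optional
    team_size: Optional[int] = None      # optional, often absent
    funding_info: Optional[str] = None   # optional, often absent


def parse_team_size(value) -> Optional[int]:
    """Coerce a raw value to a positive int, or None if that is not possible."""
    try:
        size = int(str(value).strip())
    except ValueError:
        return None
    return size if size > 0 else None


def build_profile(raw: dict) -> FounderProfile:
    """Validate a raw extraction dict; missing optional fields become None."""
    name = (raw.get("name") or "").strip()
    company = (raw.get("company") or "").strip()
    if not name or not company:
        raise ValueError("name and company are required")
    return FounderProfile(
        name=name,
        company=company,
        location=(raw.get("location") or "").strip() or None,
        team_size=parse_team_size(raw.get("team_size")),
        funding_info=(raw.get("funding_info") or "").strip() or None,
    )


profile = build_profile({"name": " Jane Doe ", "company": "ExampleCo", "team_size": "12"})
print(profile)
```

Optional fields that cannot be parsed are stored as `None` rather than rejected, so an incomplete newsletter entry still yields a usable record.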



Built by Abhinav Sinha

AI-First Product Manager who builds production-grade tools. Passionate about turning complex problems into elegant solutions using AI, automation, and modern web technologies.