# Understanding Generative Engine Optimization for AI Crawlers
URL: https://madhudadi.in/blog/posts/generative-engine-optimization-for-ai-crawlers-explained
Published: 2026-05-30
Tags: AI, Architecture, Production, SEO
Read time: 16 min
Difficulty: intermediate
> How this blog optimizes for AI crawlers: llms.txt, ai-profile.json, structured data (TechArticle, FAQ, HowTo, Course, BreadcrumbList), speakable markup, articleBody for crawlers, and robots.txt for AI bots.

# The GEO Optimization Stack

Generative Engine Optimization (GEO) is the practice of making your content accessible and usable by AI crawlers — the bots that train and power ChatGPT, Claude, Perplexity, and Google's AI Overviews.

Unlike traditional SEO, which optimizes for search result snippets, GEO optimizes for the AI that reads your entire page and synthesizes an answer. The difference matters.

---

## The GEO Files

### llms.txt

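The `llms.txt` file is a convention proposed in 2024 by Jeremy Howard of Answer.AI; it is a proposal, not a ratified standard. It gives AI crawlers and assistants a Markdown overview of what content is available and where to find it: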

```txt
# Madhu Dadi — AI, Python & Analytics Hub

> A behind-the-scenes series on building a production-grade AI blog platform.

## Core
- Platform name: Madhu Dadi — AI, Python & Analytics Hub by Madhu Dadi
- Author: Madhu Dadi
- Language: English (en, en-IN)
- Content type: Technical tutorials, architecture deep-dives, AI/ML guides
- Target audience: Software engineers, AI/ML practitioners, data engineers

## Featured
- Building This Blog series: /blog/series/building-this-blog
- RAG Chat System: /blog/ask
- Daily Challenge: /blog/challenge

## Sections
- Posts: /blog/posts
- Series: /blog/series
- Tags: /blog/tags
- About: /blog/about
- FAQ: /blog/ask
```

The implementation is a simple API route:

```typescript
export async function GET() {
    const body = generateLlmsTxt();
    return new Response(body, {
        headers: { "Content-Type": "text/plain; charset=utf-8" },
    });
}
```

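The route is at `/blog/llms.txt` and is referenced in `ai-profile.json` for discovery. The proposal expects the file at the site root (`/llms.txt`), so a copy under `/blog/` is only found by clients that are told where to look.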

### ai-profile.json

This file provides structured metadata about the site for AI crawlers:

```json
{
    "@context": "https://schema.org",
    "@type": "WebSite",
    "name": "Madhu Dadi — AI, Python & Analytics Hub",
    "url": "https://madhudadi.in/blog",
    "author": {
        "@type": "Person",
        "name": "Madhu Dadi",
        "description": "AI Developer & Marketing Analytics Leader with 9+ years..."
    },
    "contentAreas": [
        "Python Programming",
        "Artificial Intelligence",
        "RAG Systems",
        "FastAPI",
        "System Design"
    ],
    "llmsTxt": "https://madhudadi.in/blog/llms.txt",
    "feedUrl": "https://madhudadi.in/blog/feed.xml",
    "sitemapUrl": "https://madhudadi.in/blog/sitemap.xml"
}
```

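Why this matters: the goal is to give AI crawlers like ClaudeBot and GPTBot one machine-readable summary of what content is available and how to access it efficiently. Note that `ai-profile.json` is this site's own convention rather than a published standard, and `contentAreas`, `llmsTxt`, `feedUrl`, and `sitemapUrl` are custom keys, not schema.org properties, so no crawler is documented to read the file today. Without it, crawlers discover content through links and sitemaps, which don't convey the site's structure or content areas.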

---

## Structured Data for Rich Results

Every post page generates up to five structured data blocks (HowTo and Course are emitted only for tutorials and series posts):

### 1. TechArticle (Primary)

```json
{
    "@type": "TechArticle",
    "headline": "Building a RAG Chat System From Zero",
    "description": "How the Ask AI page works...",
    "image": "...",
    "author": {"@id": "https://madhudadi.in/#person"},
    "datePublished": "2026-05-11",
    "teaches": ["Embedding Pipeline", "HNSW Index", "Hybrid Search"],
    "educationalLevel": "Advanced",
    "timeRequired": "PT20M",
    "wordCount": 3200
}
```

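`TechArticle` is preferred over `Article` or `BlogPosting` because it is the schema.org subtype intended for technical content, and properties such as `teaches` and `educationalLevel` signal educational material. Google's documentation does not describe special treatment for `TechArticle` beyond its handling of `Article` types, so any ranking benefit for technical queries is unconfirmed.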
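The block above can be assembled from post metadata with a small builder. This is a minimal sketch, not the blog's actual code: the `PostMeta` shape and the names `buildTechArticle` and `toIsoDuration` are assumptions made for illustration. It mainly shows how a read time in minutes maps to the ISO 8601 duration that `timeRequired` expects.

```typescript
// Hypothetical post shape; the field names are assumptions, not the blog's real model.
interface PostMeta {
    title: string;
    excerpt: string;
    publishedAt: string; // ISO date, e.g. "2026-05-11"
    readMinutes: number;
    wordCount: number;
    difficulty: "beginner" | "intermediate" | "advanced";
    topics: string[];
}

// Convert whole minutes to an ISO 8601 duration string.
function toIsoDuration(minutes: number): string {
    const hours = Math.floor(minutes / 60);
    const rest = minutes % 60;
    if (hours === 0) return `PT${rest}M`;
    return rest === 0 ? `PT${hours}H` : `PT${hours}H${rest}M`;
}

// Build the TechArticle JSON-LD object for one post.
function buildTechArticle(post: PostMeta, authorId: string) {
    const level = post.difficulty.charAt(0).toUpperCase() + post.difficulty.slice(1);
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        headline: post.title,
        description: post.excerpt,
        author: { "@id": authorId },
        datePublished: post.publishedAt,
        teaches: post.topics,
        educationalLevel: level,
        timeRequired: toIsoDuration(post.readMinutes),
        wordCount: post.wordCount,
    };
}
```

Keeping the builder a pure function makes it easy to unit-test the JSON-LD separately from page rendering.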

### 2. FAQPage

Generated from post content by extracting H2/H3 headings and their following paragraphs:

```typescript
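// Collect every H2/H3 heading together with its position in the markdown source.
const headingRegex = /^(#{2,3})\s+(.+)$/gm;
const headingPositions: Array<{ index: number; end: number; text: string }> = [];
let match: RegExpExecArray | null;

while ((match = headingRegex.exec(post.content)) !== null) {
    headingPositions.push({
        index: match.index,
        end: match.index + match[0].length,
        text: match[2].trim(),
    });
}

const faqItems: Array<Record<string, unknown>> = [];

for (let i = 0; i < headingPositions.length && faqItems.length < 6; i++) {
    // Assumed section bounds: from the end of this heading line to the start of the next heading.
    const start = headingPositions[i].end;
    const end = headingPositions[i + 1]?.index ?? post.content.length;
    const sectionText = post.content.slice(start, end);
    // First sentence (up to ~300 characters) on a line that does not start with "#".
    const answerMatch = sectionText.match(/(?:^|\n)(?!^#{1,3}\s)([^#\n][\s\S]{0,300}?)\.(?:\s|$)/);
    if (answerMatch) {
        faqItems.push({
            "@type": "Question",
            name: headingPositions[i].text,
            acceptedAnswer: { "@type": "Answer", text: answerMatch[1] },
        });
    }
}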
```

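This turns each H2/H3 heading into a Q&A pair (capped at six), which works best when headings are phrased as questions. Since August 2023 Google shows FAQ rich results only for well-known government and health sites, so on a blog this markup mainly serves other consumers of structured data. The ~30% click-through lift for technical queries is an unverified estimate and should not be expected from Google search today.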
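The extracted `Question` items still need to be wrapped in a `FAQPage` object before they are emitted. A minimal sketch of that step follows; `buildFaqPage` and the `FaqItem` type are hypothetical names, and skipping the block when nothing was extracted is an assumed design choice rather than the blog's confirmed behavior.

```typescript
// Shape of one extracted Q&A pair, mirroring the items pushed into faqItems.
interface FaqItem {
    "@type": "Question";
    name: string;
    acceptedAnswer: { "@type": "Answer"; text: string };
}

// Wrap the items in a FAQPage object, or return null when there is nothing to emit.
function buildFaqPage(items: FaqItem[]) {
    if (items.length === 0) return null;
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        mainEntity: items,
    };
}
```

Returning `null` lets the page template omit the script tag entirely instead of publishing an empty `mainEntity` array.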

### 3. HowTo (For Tutorials)

Tutorial posts (detected by title keywords like "how to", "tutorial", "guide") get an additional HowTo schema:

```json
{
    "@type": "HowTo",
    "name": "Build a RAG Chat System",
    "step": [
        {"@type": "HowToStep", "text": "Set up the embedding pipeline..."},
        {"@type": "HowToStep", "text": "Create the HNSW index..."}
    ]
}
```

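Steps are extracted from H2 headings, filtered to exclude meta-sections like "Prerequisites" and "Conclusion." Google retired HowTo rich results in September 2023, so this markup now serves other consumers of structured data rather than Google's search UI.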
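The detection and extraction described above can be sketched as follows. The keyword list and the two excluded section names come from the text; everything else (`isTutorial`, `extractHowToSteps`, exact matching rules) is an assumption for illustration, not the blog's actual implementation.

```typescript
// Title keywords named in the text; the blog's real list may differ.
const TUTORIAL_KEYWORDS = ["how to", "tutorial", "guide"];
// Meta-sections named in the text; compared case-insensitively.
const META_SECTIONS = ["prerequisites", "conclusion"];

interface HowToStep {
    "@type": "HowToStep";
    text: string;
}

// A post counts as a tutorial when its title contains any keyword.
function isTutorial(title: string): boolean {
    const lower = title.toLowerCase();
    return TUTORIAL_KEYWORDS.some((keyword) => lower.indexOf(keyword) !== -1);
}

// Turn each H2 heading into a step, skipping meta-sections.
function extractHowToSteps(markdown: string): HowToStep[] {
    const steps: HowToStep[] = [];
    // Matches "## Heading" only: "### Heading" fails because whitespace must follow "##".
    const h2Regex = /^##\s+(.+)$/gm;
    let match: RegExpExecArray | null;
    while ((match = h2Regex.exec(markdown)) !== null) {
        const text = (match[1] ?? "").trim();
        if (META_SECTIONS.indexOf(text.toLowerCase()) === -1) {
            steps.push({ "@type": "HowToStep", text });
        }
    }
    return steps;
}
```

Substring matching is deliberately loose here; a title such as "Style Guide Review" would also be flagged as a tutorial, so a production version may want stricter rules.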

### 4. Course (For Series)

Series pages and individual posts within a series include Course schema:

```json
{
    "@type": "Course",
    "name": "Building This Blog: A Production AI Platform",
    "description": "...",
    "coursePrerequisites": [
        {"@type": "LearningResource", "url": "/posts/why-i-built-yet-another-blog-architecture-tech-stack"}
    ]
}
```

### 5. BreadcrumbList

Every page includes breadcrumb schema for navigation context in search results.
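A `BreadcrumbList` is an ordered list of `ListItem` entries with 1-based positions. The sketch below shows one way to build it; `buildBreadcrumbList`, the `Crumb` type, and the example paths are assumptions for illustration.

```typescript
// One breadcrumb level: a display name and a site-relative path.
interface Crumb {
    name: string;
    path: string;
}

// Build BreadcrumbList JSON-LD; baseUrl is the site origin without a trailing slash.
function buildBreadcrumbList(baseUrl: string, crumbs: Crumb[]) {
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        itemListElement: crumbs.map((crumb, i) => ({
            "@type": "ListItem",
            position: i + 1, // schema.org positions start at 1
            name: crumb.name,
            item: `${baseUrl}${crumb.path}`, // absolute URL for each level
        })),
    };
}

// Usage: breadcrumbs for a post page.
const breadcrumbs = buildBreadcrumbList("https://madhudadi.in", [
    { name: "Blog", path: "/blog" },
    { name: "Posts", path: "/blog/posts" },
]);
```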

---

## Speakable Markup

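The `speakable` annotation marks the parts of a page best suited to text-to-speech. Google documents it as a beta feature for Google Assistant; whether AI Overviews use it is not documented, so treat that as a hoped-for side effect: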

```json
{
    "@type": "SpeakableSpecification",
    "cssSelector": ["h1", "h2", ".prose p"]
}
```

This targets the article title, section headings, and body paragraphs — the content that should be read aloud or summarized by Google's AI. Navigation, sidebars, and footers are excluded.

---

## articleBody for AI Crawlers

The TechArticle schema includes an `articleBody` field with the post content, truncated to 8,192 characters (roughly 8 KB) to limit page size:

```typescript
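const CODE_PLACEHOLDER = "\u0000"; // sentinel that survives the punctuation strip
const articleBody = post.content
    // Fenced blocks first: once backticks are stripped, the fence pattern can no longer match.
    .replace(/```[\s\S]*?```/g, CODE_PLACEHOLDER)
    // Strip markdown punctuation (this would also eat the brackets of "[code block]").
    .replace(/[#*`[\]()]/g, "")
    .replace(/\u0000/g, "\n[code block]\n")
    .slice(0, 8192); // 8,192 UTF-16 code units, about 8 KB for ASCII text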
```

This puts the article text directly in the structured data, so an AI crawler can read it from the JSON-LD without extracting it from the rendered HTML. The trade-off: larger HTML pages (up to ~8 KB more per post) in exchange for content that is easier for AI crawlers to consume.

Code blocks are replaced with `[code block]` to reduce token count while preserving the structure.
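Because `articleBody` carries arbitrary post text, the JSON-LD needs care when it is written into the page: a literal `</script>` inside the text would close the script element early. The following is a minimal sketch, assuming the JSON-LD is emitted as an inline script tag; `toJsonLdScript` is a hypothetical helper name, not part of the blog's code.

```typescript
// Serialize a JSON-LD object into an inline script tag.
// Every "<" is escaped as \u003c, which JSON parsers decode back to "<",
// so the payload can never contain a literal "</script>".
function toJsonLdScript(data: unknown): string {
    const json = JSON.stringify(data).replace(/</g, "\\u003c");
    return `<script type="application/ld+json">${json}</script>`;
}

// Usage: the article text mentions a script tag, yet the output stays well-formed.
const jsonLdHtml = toJsonLdScript({
    "@type": "TechArticle",
    articleBody: "Close the tag with </script> at the end.",
});
```

Frameworks that render JSON-LD through their own helpers may already apply this escaping, in which case the extra step is unnecessary.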

---

## robots.txt for AI Crawlers

The robots.txt explicitly welcomes AI crawlers while blocking admin and user-specific paths:

```txt
User-agent: *
Allow: /blog/api/og

User-agent: *
Allow: /blog
Disallow: /blog/admin
Disallow: /blog/profile
Disallow: /blog/bookmarks
Disallow: /blog/api
Disallow: /blog/auth

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
...
Allow: /blog
Allow: /blog/posts
Allow: /blog/series
Allow: /blog/tags
Allow: /blog/ask
Allow: /blog/api/og
Allow: /blog/llms.txt
Allow: /blog/ai-profile.json
Disallow: /blog/admin
```

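The OG image endpoint (`/blog/api/og`) needs its own `Allow` rule because it sits under the disallowed `/blog/api` prefix. Under RFC 9309 the longest matching path wins regardless of order, so `Allow: /blog/api/og` overrides `Disallow: /blog/api`; listing it first is still a sensible habit for older parsers that stop at the first match. Without this rule, social media crawlers (LinkedIn, Facebook, Twitter) can't fetch the share card image.

One caveat: a crawler obeys only the most specific group that names it. GPTBot, ClaudeBot, and PerplexityBot therefore follow their own group and ignore the `User-agent: *` rules, so any path that should stay off-limits to them (for example `/blog/profile` or `/blog/api`) has to be repeated in that group.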
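The precedence between `Allow: /blog/api/og` and `Disallow: /blog/api` can be checked with a small matcher. This is a minimal sketch of longest-match evaluation as described in RFC 9309, using plain prefix matching only (no `*` or `$` wildcards); the `Rule` type and `isAllowed` function are hypothetical names for illustration.

```typescript
// One robots.txt rule from a single user-agent group.
interface Rule {
    allow: boolean;
    path: string;
}

// The longest matching path prefix wins; on a tie, Allow wins.
// A URL that matches no rule is allowed.
function isAllowed(rules: Rule[], urlPath: string): boolean {
    let best: Rule | null = null;
    for (const rule of rules) {
        if (urlPath.indexOf(rule.path) !== 0) continue; // not a prefix match
        if (
            best === null ||
            rule.path.length > best.path.length ||
            (rule.path.length === best.path.length && rule.allow)
        ) {
            best = rule;
        }
    }
    return best === null ? true : best.allow;
}

// The "User-agent: *" rules from the robots.txt above.
const wildcardRules: Rule[] = [
    { allow: true, path: "/blog/api/og" },
    { allow: true, path: "/blog" },
    { allow: false, path: "/blog/admin" },
    { allow: false, path: "/blog/profile" },
    { allow: false, path: "/blog/bookmarks" },
    { allow: false, path: "/blog/api" },
    { allow: false, path: "/blog/auth" },
];
```

With these rules, `/blog/api/og` resolves to allowed and any other `/blog/api/...` path resolves to disallowed, independent of the order in which the rules are listed.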

---

## AI Chat Context

The `llms.txt` route doubles as context for AI assistants: it serves a plain-text summary of the site that a browsing-capable assistant can fetch when a user asks it to look at the site.

The route at `/blog/llms.txt` is dynamically generated and includes:
- Site overview
- Post summaries (title + excerpt for each published post)
- Series structure
- Tag taxonomy
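The generated file can be assembled from those four parts with a small renderer. The sketch below is an assumption about how such a generator might look, not the blog's `generateLlmsTxt`; `renderLlmsTxt`, `SiteIndex`, and `LinkItem` are hypothetical names. It follows the link-list layout of the llms.txt proposal (`- [title](url): note`).

```typescript
// One linked entry: a post or a series, with an optional short note (e.g. the excerpt).
interface LinkItem {
    title: string;
    path: string;
    note?: string;
}

// Hypothetical input: everything the file needs, already loaded from the database.
interface SiteIndex {
    name: string;
    summary: string;
    posts: LinkItem[];
    series: LinkItem[];
    tags: string[];
}

// Format one entry as a markdown list item.
function linkLine(item: LinkItem): string {
    const base = `- [${item.title}](${item.path})`;
    return item.note ? `${base}: ${item.note}` : base;
}

// Emit an H2 section, or nothing when the section is empty.
function section(heading: string, lines: string[]): string[] {
    return lines.length === 0 ? [] : ["", `## ${heading}`].concat(lines);
}

// Render the whole file: H1 title, blockquote summary, then one section per content type.
function renderLlmsTxt(site: SiteIndex): string {
    const out = [`# ${site.name}`, "", `> ${site.summary}`]
        .concat(section("Posts", site.posts.map(linkLine)))
        .concat(section("Series", site.series.map(linkLine)))
        .concat(section("Tags", site.tags.map((tag) => `- ${tag}`)));
    return out.join("\n") + "\n";
}
```

Because the renderer is a pure function of its input, the route handler only has to load published posts and pass them in, which also makes caching the output straightforward.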

---

## What This Achieves

Combined, these optimizations aim to ensure that:

- **ChatGPT** can reference specific posts when answering user questions about Python or AI
- **Claude** can browse the site structure and find relevant content
- **Google AI Overviews** can cite the blog in its answers
- **Perplexity** can include the blog as a source in its responses

The key insight: AI crawlers today consume content much like search crawlers did in 2005. They're simple HTTP clients that follow links and parse HTML. The difference is they care more about structure (schema) and less about keywords. Optimizing for GEO means making your content machine-readable first, human-readable second.

---

## What's Next

The next post covers the gamification engine — how XP, badges, leaderboards, and streaks work under the hood.

---

*Built with JSON-LD, llms.txt, ai-profile.json, and zero third-party SEO plugins.*