Definition
What this term means
Automated bots operated by AI companies to discover, access, and index web content for model training, real-time retrieval, or both. Major AI crawlers include GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and CCBot (Common Crawl); Google-Extended (Google) is a robots.txt control token rather than a separate crawler, governing whether content Googlebot fetches may be used for Google's AI products. Each of these agents serves a different purpose and can be individually controlled through robots.txt directives.
Why it matters
The business impact
AI crawlers are the mechanism through which your content enters the AI ecosystem. If your website blocks these crawlers, your content cannot be indexed, retrieved, or cited by the AI platforms they serve. Understanding which AI crawlers exist, what they are used for, and how to configure access to them is essential for maintaining and growing AI visibility.
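As a concrete illustration, a robots.txt that welcomes retrieval-oriented crawlers while opting out of a training-focused one might look like the sketch below. Which crawlers to allow or block is a policy choice, not a recommendation; the rules shown are placeholders.

```
# Allow OpenAI's crawler site-wide
User-agent: GPTBot
Allow: /

# Allow Anthropic's crawler site-wide
User-agent: ClaudeBot
Allow: /

# Opt out of Common Crawl's archive entirely
User-agent: CCBot
Disallow: /
```

Each `User-agent` block applies only to the named crawler, which is what makes per-crawler control possible.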
Used in context
How you might use this term
“A company configured their robots.txt to explicitly allow GPTBot, ClaudeBot, and PerplexityBot while monitoring their server logs to track crawl frequency. They discovered that PerplexityBot was the most active crawler, visiting key pages daily, which explained why Perplexity cited their content more frequently than other AI platforms.”
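The log monitoring described above can be sketched with a short script that tallies AI-crawler requests in a web server access log. This is a minimal illustration assuming the common Apache/Nginx combined log format, where the user agent is the last double-quoted field; the substrings matched are the crawler names the vendors publish.

```python
import re
from collections import Counter

# User-Agent substrings that identify the major AI crawlers
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler in combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        # In combined log format, the user agent is the last quoted field
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]
        for bot in AI_CRAWLERS:
            if bot in user_agent:
                counts[bot] += 1
    return counts
```

Running this over a day's log and sorting the resulting counter shows which crawler visits most often, as in the scenario above.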
Related terms
Explore connected concepts
Robots.txt
A plain text file placed at the root of a website that tells web crawlers which pages and directories they may access. Robots.txt is the primary mechanism for controlling how both traditional search engine crawlers and AI-specific crawlers (like GPTBot, Google-Extended, and ClaudeBot) interact with your website content.
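Python's standard library can check what a given robots.txt permits for a specific crawler. A minimal sketch, assuming you already have the file's contents as a string; the rules and URLs below are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

def can_crawler_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots = """\
User-agent: GPTBot
Disallow: /private/

User-agent: CCBot
Disallow: /
"""

# GPTBot may fetch public pages but not /private/; CCBot is blocked site-wide
print(can_crawler_fetch(robots, "GPTBot", "https://example.com/blog/post"))
print(can_crawler_fetch(robots, "GPTBot", "https://example.com/private/x"))
print(can_crawler_fetch(robots, "CCBot", "https://example.com/blog/post"))
```

This is handy for verifying that a robots.txt change actually grants or denies the access you intended before deploying it.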
ai.txt
An emerging web standard that allows website owners to declare their preferences for how AI systems may use their content. Similar to robots.txt but specifically designed for AI use cases, ai.txt communicates whether content may be used for AI training, summarisation, or citation, and under what conditions. The specification is still evolving, with growing adoption among publishers and AI companies.
Common Crawl
A nonprofit organisation that maintains a free, publicly accessible archive of web crawl data containing billions of web pages. Common Crawl data is one of the largest sources used to train AI language models, including GPT, Claude, and many others. The dataset is updated regularly and represents a significant portion of the knowledge that modern AI systems draw from.