Definition
What this term means
The massive datasets of text, code, and other content used to teach AI models during their initial training phase. Training data shapes the foundational knowledge of models like GPT, Gemini, and Claude, including what they know about brands, products, and industries. Sources include web crawls (such as Common Crawl), books, academic papers, Wikipedia, and publicly available databases.
Why it matters
The business impact
Your presence (or absence) in training data determines how AI models perceive your brand at a fundamental level. Unlike retrieval-augmented generation (RAG), which retrieves information in real time, training data is baked into the model's weights and influences every response. Ensuring your brand has accurate, authoritative representation across the sources that feed AI training sets is a long-term visibility investment.
Used in context
How you might use this term
“An analysis revealed that a brand's Wikipedia article contained outdated information and its industry association listings were incomplete. After correcting these sources, both of which are common inputs to AI training data, the brand's representation in AI responses improved across multiple models over subsequent months.”
Related terms
Explore connected concepts
LLM
A large language model: a type of artificial intelligence model trained on vast datasets of text to understand, generate, and reason about human language. LLMs power AI assistants and generative search tools such as ChatGPT, Google Gemini, Claude, and Perplexity, which are rapidly becoming the primary way people discover products, services, and information online.
Knowledge Graph
A structured database that maps entities and the relationships between them, creating a web of interconnected knowledge. Google's Knowledge Graph, Wikidata, and similar systems store billions of facts about people, places, organisations, and concepts, powering the knowledge panels, rich results, and AI-generated answers that appear across search and AI platforms.
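The entity-and-relationship structure described above can be sketched in a few lines. This is a minimal illustration of how a knowledge graph stores facts as subject-predicate-object triples; the entities and facts below are invented for the example, not drawn from any real knowledge graph.

```python
# Each fact is a (subject, predicate, object) triple.
# "Acme Corp" and its attributes are illustrative placeholders.
triples = {
    ("Acme Corp", "industry", "Software"),
    ("Acme Corp", "founded_in", "2010"),
    ("Acme Corp", "headquartered_in", "London"),
    ("London", "located_in", "United Kingdom"),
}

def facts_about(entity, triples):
    """Return every (predicate, object) pair recorded for an entity."""
    return {(p, o) for (s, p, o) in triples if s == entity}

print(facts_about("Acme Corp", triples))
```

Because relationships are explicit, a system can traverse them (Acme Corp is in London, London is in the United Kingdom) rather than inferring connections from unstructured text.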
Common Crawl
A nonprofit organisation that maintains a free, publicly accessible archive of web crawl data containing billions of web pages. Common Crawl data is one of the largest sources used to train AI language models, including GPT, Claude, and many others. The dataset is updated regularly and represents a significant portion of the knowledge that modern AI systems draw from.
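Because Common Crawl's index is publicly queryable, you can check whether a domain's pages were captured in a given crawl. The sketch below parses the newline-delimited JSON format the public CDX index API (index.commoncrawl.org) returns; the response here is a hardcoded, illustrative sample rather than a live API call, and the URLs in it are placeholders.

```python
import json

# Illustrative sample of a CDX index response (newline-delimited JSON).
# A real query would fetch this from the Common Crawl index server.
sample_response = "\n".join([
    '{"url": "https://example.com/", "status": "200", "mime": "text/html"}',
    '{"url": "https://example.com/about", "status": "200", "mime": "text/html"}',
    '{"url": "https://example.com/old-page", "status": "404", "mime": "text/html"}',
])

def crawled_pages(cdx_json_lines):
    """Parse CDX JSON lines and keep successfully captured pages."""
    records = [json.loads(line) for line in cdx_json_lines.splitlines() if line]
    return [r["url"] for r in records if r["status"] == "200"]

print(crawled_pages(sample_response))
```

A non-empty result suggests the domain's pages were present in that crawl snapshot, one signal that the content was available to models trained on Common Crawl data.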