Definition
What this term means
AI systems capable of processing, understanding, and generating multiple types of content, including text, images, audio, and video, within a single model. Multimodal AI can interpret a product photograph, read text overlaid on an image, understand a spoken query, and generate a response that combines text with visual elements. Models like GPT-4o and Gemini are natively multimodal.
Why it matters
The business impact
As AI becomes multimodal, brand visibility extends beyond text content. Product images, video content, infographics, and audio all become discoverable and citable by AI systems. Brands that optimise across multiple content formats, with descriptive metadata, alt text, transcripts, and structured data, gain visibility in channels that text-only optimisation misses entirely.
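To make the structured data point concrete, here is a minimal sketch of how a product's image description and video transcript could be expressed as schema.org JSON-LD. The function name `build_product_jsonld`, the example URLs, and the sample product are all hypothetical, and the choice of properties (`caption`, `transcript`, `subjectOf`) is one reasonable option rather than a requirement of any particular search engine or AI system.

```python
import json


def build_product_jsonld(name, description, image_url, image_alt,
                         video_url, transcript):
    """Return a schema.org Product description as a JSON-LD dictionary.

    Illustrative helper only: the property selection is an assumption,
    not a rule set by any search engine or AI platform.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        # Describe the image in text so the visual content is also machine-readable.
        "image": {
            "@type": "ImageObject",
            "contentUrl": image_url,
            "caption": image_alt,
        },
        # Attach the video along with its transcript.
        "subjectOf": {
            "@type": "VideoObject",
            "name": f"{name} video",
            "contentUrl": video_url,
            "transcript": transcript,
        },
    }


if __name__ == "__main__":
    # Hypothetical product and placeholder URLs.
    data = build_product_jsonld(
        name="Trail Running Shoe",
        description="Lightweight trail shoe with a reinforced toe cap.",
        image_url="https://example.com/images/trail-shoe.jpg",
        image_alt="Side view of a grey trail running shoe on a rocky path",
        video_url="https://example.com/videos/trail-shoe.mp4",
        transcript="In this video we test the shoe on wet and dry terrain.",
    )
    print(json.dumps(data, indent=2))
```

The printed JSON would typically be placed in a `<script type="application/ld+json">` element on the product page, alongside matching alt text on the image itself.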
Used in context
How you might use this term
“An e-commerce brand optimised their product images with detailed alt text, structured product data, and video transcripts. Google's multimodal AI began featuring their products in visual search results and AI-generated shopping recommendations, driving a new source of qualified traffic.”
Related terms
Explore connected concepts
LLM
A type of artificial intelligence model trained on vast datasets of text to understand, generate, and reason about human language. LLMs power AI assistants and generative search tools, including ChatGPT, Google Gemini, Claude, and Perplexity, which a growing number of people use to discover products, services, and information online.
Visual Search
Search technology that allows users to find information by uploading an image or pointing a camera at an object, rather than typing a text query. Powered by AI image recognition and multimodal models, visual search can identify products, landmarks, plants, text within images, and more. Google Lens, Pinterest Lens, and Bing Visual Search are among the most widely used visual search platforms.
Voice Search
Search queries spoken aloud to AI-powered voice assistants such as Siri, Alexa, or Google Assistant. Voice searches tend to be longer, more conversational, and more question-based than typed queries. Voice assistants typically return a single answer rather than a list of options, which makes competition for that one position intense.