Major AI Crawler User Agent Headers

Company | Bot Name | User Agent String | Purpose
OpenAI | GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | Used by OpenAI to train and refine generative AI models
OpenAI | ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | Fetches web content on behalf of ChatGPT users for browsing and retrieval
OpenAI | OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | Builds an index of websites that can be surfaced as results in OpenAI’s SearchGPT product
Anthropic | anthropic-ai | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) | Collects information for Anthropic’s AI development
Anthropic | ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | Web crawler operated by Anthropic to download training data for the large language models (LLMs) behind AI products like Claude
Anthropic | Claude-Web | Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html) | Acquires site data to refine Anthropic’s web-focused models
Google | Google-Extended | Mozilla/5.0 (compatible; Google-Extended/1.0; +http://www.google.com/bot.html) | Used to help improve Gemini (formerly Bard) and Vertex AI generative APIs, including future generations of models
Google | GoogleOther | GoogleOther | Used by Google for internal research and development
Apple | Applebot | Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html) | Crawls webpages to improve results for Siri and Spotlight
Apple | Applebot-Extended | Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html) | Determines how data crawled by Applebot may be used for Apple’s foundation models
Microsoft | BingBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 | Microsoft’s web crawler for the Bing search engine
Meta | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/bot.html) | Fetches content for Facebook and other Meta services
Meta | Meta-ExternalAgent | Mozilla/5.0 (compatible; meta-externalagent/1.1; +https://developers.facebook.com/docs/sharing/webmasters/crawler) | Crawls the web for use cases such as training AI models or improving products by indexing content directly
ByteDance | Bytespider | Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html) | Surveys webpages to support TikTok’s content discovery
Cohere | cohere-ai | Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html) | Collects text samples to refine Cohere’s language models
Perplexity | PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Perplexity’s crawler, designed to help the platform build and maintain its own index
Mistral AI | MistralAI-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) | Used by Mistral to fetch citations for Le Chat; it does not crawl the web automatically or collect training data
Common Crawl | CCBot | Mozilla/5.0 (compatible; CCBot/1.0; +http://www.commoncrawl.org/bot.html) | Crawler of the non-profit Common Crawl project, devoted to cataloging the Internet
Diffbot | Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Scrapes webpages to produce structured data for AI systems
DuckDuckGo | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html) | Collects data to deliver AI-backed answers on DuckDuckGo

This table covers the major AI crawlers from companies such as OpenAI, Anthropic, Google, Apple, Microsoft, Meta, and others. Each entry lists the company name, bot name, full user agent string, and the primary purpose of the crawler.
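
If you want a rough picture of how much AI crawler traffic your site already receives, one quick check is to match the User-Agent header of incoming requests against the bot name tokens in the table. The Python sketch below is illustrative only: the access log path and the combined-log-format regex are assumptions, and the token list is a subset you can extend from the table.

    import re
    from collections import Counter

    # Bot name tokens from the table above (matched as substrings of the User-Agent).
    AI_BOT_TOKENS = [
        "GPTBot", "ChatGPT-User", "OAI-SearchBot", "anthropic-ai", "ClaudeBot",
        "Claude-Web", "Google-Extended", "GoogleOther", "Applebot-Extended",
        "Applebot", "bingbot", "FacebookBot", "meta-externalagent", "Bytespider",
        "cohere-ai", "PerplexityBot", "MistralAI-User", "CCBot", "Diffbot",
        "DuckAssistBot",
    ]

    def identify_ai_bot(user_agent: str):
        """Return the first bot token found in the User-Agent, or None."""
        ua = user_agent.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                return token
        return None

    # Example: tally AI crawler hits in a combined-format access log.
    # The path and regex are assumptions; adjust them to your server's log format.
    counts = Counter()
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = re.search(r'"[^"]*" "([^"]*)"\s*$', line)  # last quoted field = User-Agent
            if match:
                bot = identify_ai_bot(match.group(1))
                if bot:
                    counts[bot] += 1

    for bot, hits in counts.most_common():
        print(f"{bot}: {hits}")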

How to Allow Major AI Crawlers in Cloudflare WAF

Method 1: Using Verified Bot Categories in WAF Rules

  1. Log in to your Cloudflare Dashboard

  2. Navigate to Security > WAF > Custom rules

  3. Create a new custom rule:

    • Click “Create rule”
    • Give your rule a descriptive name like “Allow AI Crawlers”
  4. Configure the expression using the following:

    (cf.client.bot and cf.verified_bot_category eq "AI Crawler")
    
  5. Set the action to “Skip” (the successor to the legacy “Allow” action in WAF custom rules) and select which security features to skip - matching AI crawlers will then bypass those rules and reach your website

This functionality is available to all Cloudflare customers, including those on free plans. Cloudflare categorizes verified bots so that site owners have better control over which types of bots can access their content.
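
If you manage Cloudflare configuration as code, the same rule can be created through the Cloudflare Rulesets API instead of the dashboard. The Python sketch below is a rough outline rather than a drop-in script: the endpoint paths and the “skip” action parameters follow the general shape of the Rulesets API but should be verified against Cloudflare’s current documentation, and ZONE_ID and API_TOKEN are placeholders for your own values.

    import requests

    API_TOKEN = "YOUR_API_TOKEN"  # placeholder
    ZONE_ID = "YOUR_ZONE_ID"      # placeholder
    BASE = "https://api.cloudflare.com/client/v4"
    HEADERS = {"Authorization": f"Bearer {API_TOKEN}", "Content-Type": "application/json"}

    # 1. Look up the zone's entrypoint ruleset for the WAF custom rules phase.
    entrypoint = requests.get(
        f"{BASE}/zones/{ZONE_ID}/rulesets/phases/http_request_firewall_custom/entrypoint",
        headers=HEADERS,
        timeout=30,
    ).json()
    ruleset_id = entrypoint["result"]["id"]

    # 2. Append a rule that skips the remaining custom rules for verified AI crawlers.
    #    The action parameters here are an assumption; adjust which rules/products to skip.
    rule = {
        "description": "Allow AI Crawlers",
        "expression": '(cf.client.bot and cf.verified_bot_category eq "AI Crawler")',
        "action": "skip",
        "action_parameters": {"ruleset": "current"},
        "enabled": True,
    }
    resp = requests.post(
        f"{BASE}/zones/{ZONE_ID}/rulesets/{ruleset_id}/rules",
        headers=HEADERS,
        json=rule,
        timeout=30,
    )
    resp.raise_for_status()
    print("Rule created:", resp.json().get("success"))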

Method 2: Specific AI Crawler Setup (More Detailed Control)

If you want more granular control over which specific AI crawlers you allow, you can create a more detailed rule:

  1. Create a new custom rule as described above

  2. Use a more specific expression that targets individual bot user agents:

    (cf.client.bot and (
      http.user_agent contains "GPTBot" or
      http.user_agent contains "ClaudeBot" or
      http.user_agent contains "OAI-SearchBot" or
      http.user_agent contains "anthropic-ai" or
      http.user_agent contains "MistralAI-User" or
      http.user_agent contains "Bytespider" or
      http.user_agent contains "cohere-ai" or
      http.user_agent contains "PerplexityBot"
    ))
    
  3. Set the action to “Skip”, as in Method 1

Remember that this rule will only work if your site is properly proxied through Cloudflare, not just using Cloudflare for DNS. Also note that cf.client.bot only matches bots Cloudflare has verified, so a crawler that is not on Cloudflare’s verified bots list will not match this rule even if its user agent string does.
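
To sanity-check which of the crawlers from the table this Method 2 expression would match, you can mirror its matching logic locally. The sketch below reproduces only the case-sensitive substring semantics of Cloudflare’s contains operator; it cannot reproduce the cf.client.bot verification, which only Cloudflare evaluates, and the sample user agents are simply taken from the table above.

    # Tokens used in the Method 2 expression above.
    METHOD2_TOKENS = [
        "GPTBot", "ClaudeBot", "OAI-SearchBot", "anthropic-ai",
        "MistralAI-User", "Bytespider", "cohere-ai", "PerplexityBot",
    ]

    def would_match(user_agent: str) -> bool:
        """Mirror the expression's case-sensitive `contains` checks.

        The real rule also requires cf.client.bot to be true, which is decided by
        Cloudflare's bot verification and cannot be simulated here.
        """
        return any(token in user_agent for token in METHOD2_TOKENS)

    samples = [
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot",
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "mozilla/5.0 (compatible; gptbot/1.1)",  # no match: `contains` is case-sensitive
    ]
    for ua in samples:
        print(would_match(ua), "-", ua)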

Important Considerations

  • Cloudflare now provides detailed analytics about AI service crawling activity on your site. You can review which AI services are accessing your content and what content they’re accessing most frequently.

  • Different AI crawlers have different purposes - some like GPTBot and ClaudeBot collect training data, while others like OAI-SearchBot are used for search features that do provide attribution and links back to your site.

  • The cf.client.bot condition in these rules relies on Cloudflare’s verified bot detection, so requests that merely spoof an AI crawler’s user agent string without coming from that provider’s infrastructure will not match. You can also spot-check claimed crawlers yourself; see the sketch at the end of this section.

  • Make sure your site is configured to be proxied through Cloudflare (orange cloud icon in DNS settings), as WAF rules only work on proxied traffic.

This configuration will allow AI crawlers to access and index your content, which can help with visibility in AI-powered search tools and assistants.
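
As a complement to Cloudflare’s verification, you can spot-check a client IP that claims to be a crawler using forward-confirmed reverse DNS, a technique several operators (Google, Microsoft, and Apple, for example) support for their bots. The helper below is a generic sketch: the expected-domain suffixes are inputs you should take from each operator’s own documentation, and some operators publish IP address lists instead of reverse DNS records.

    import socket

    def forward_confirmed_rdns(ip: str, expected_suffixes: tuple) -> bool:
        """Check that `ip` reverse-resolves to an expected domain and that the
        hostname resolves back to the same IP (forward-confirmed reverse DNS)."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        except OSError:
            return False
        if not hostname.endswith(expected_suffixes):
            return False
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        except OSError:
            return False
        return ip in forward_ips  # forward confirmation

    # Example (suffixes per Google's crawler verification docs; IP is illustrative):
    # print(forward_confirmed_rdns("66.249.66.1", (".googlebot.com", ".google.com")))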

Cloudflare WAF rule to allow major AI bots

(cf.client.bot and (
  http.user_agent contains "GPTBot" or
  http.user_agent contains "ChatGPT-User" or
  http.user_agent contains "OAI-SearchBot" or
  http.user_agent contains "ClaudeBot" or
  http.user_agent contains "anthropic-ai" or
  http.user_agent contains "claude-web" or
  http.user_agent contains "MistralAI-User" or
  http.user_agent contains "Bytespider" or
  http.user_agent contains "cohere-ai" or
  http.user_agent contains "PerplexityBot" or
  http.user_agent contains "Google-Extended" or
  http.user_agent contains "Bard" or
  http.user_agent contains "Gemini" or
  http.user_agent contains "DeepSeekBot" or
  http.user_agent contains "DeepSeek-R1" or
  http.user_agent contains "GrokBot" or
  http.user_agent contains "xAI" or
  http.user_agent contains "BingBot" or
  http.user_agent contains "Amazonbot" or
  http.user_agent contains "DuckAssistBot" or
  http.user_agent contains "AI2Bot" or
  http.user_agent contains "CCBot" or
  http.user_agent contains "omgili" or
  http.user_agent contains "Diffbot" or
  http.user_agent contains "FacebookBot" or
  http.user_agent contains "Meta-ExternalAgent" or
  http.user_agent contains "YouBot" or
  http.user_agent contains "Applebot-Extended"
))
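
A list this long is easy to let drift out of date, so you may prefer to keep the bot tokens in a plain list and generate the expression from it, then paste the output into the dashboard (or push it via the API sketch above). A minimal example, shown here with a subset of the tokens; extend the list to match the rule above:

    # Bot user-agent tokens to allow; edit this list as crawlers come and go.
    ALLOWED_AI_BOT_TOKENS = [
        "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
        "claude-web", "MistralAI-User", "Bytespider", "cohere-ai", "PerplexityBot",
        "Google-Extended", "DuckAssistBot", "CCBot", "Diffbot", "FacebookBot",
        "Meta-ExternalAgent", "Applebot-Extended",
    ]

    def build_waf_expression(tokens):
        """Render a Cloudflare rule expression like the one above from a token list."""
        clauses = " or\n  ".join(f'http.user_agent contains "{t}"' for t in tokens)
        return f"(cf.client.bot and (\n  {clauses}\n))"

    print(build_waf_expression(ALLOWED_AI_BOT_TOKENS))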