Major AI Crawler User Agent Headers

Company | Bot Name | User Agent String | Purpose
OpenAI | GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | Used by OpenAI to train and refine generative AI models
OpenAI | ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | Fetches web content on behalf of ChatGPT users for browsing and retrieval
OpenAI | OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | Builds an index of websites that can be surfaced as results in OpenAI’s SearchGPT product
Anthropic | anthropic-ai | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) | Collects information for Anthropic’s AI development
Anthropic | ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | Web crawler operated by Anthropic to download training data for the large language models (LLMs) behind AI products like Claude
Anthropic | Claude-Web | Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html) | Acquires site data to refine Anthropic’s web-focused models
Google | Google-Extended | Mozilla/5.0 (compatible; Google-Extended/1.0; +http://www.google.com/bot.html) | Used to help improve Gemini (formerly Bard) and Vertex AI generative APIs, including future generations of models
Google | GoogleOther | GoogleOther | Used by Google for internal research and development
Apple | Applebot | Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html) | Crawls webpages to improve results for Siri and Spotlight
Apple | Applebot-Extended | Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html) | Determines how data crawled by Applebot may be used for Apple’s foundation models
Microsoft | BingBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 | Microsoft’s web crawler for the Bing search engine
Meta | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/bot.html) | Fetches content for Facebook and other Meta services
Meta | Meta-ExternalAgent | Mozilla/5.0 (compatible; meta-externalagent/1.1; +https://developers.facebook.com/docs/sharing/webmasters/crawler) | Crawls the web for use cases such as training AI models or improving products by indexing content directly
ByteDance | Bytespider | Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html) | Surveys webpages to support TikTok’s content discovery
Cohere | cohere-ai | Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html) | Collects text samples to refine Cohere’s language models
Perplexity | PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Perplexity’s crawler, designed to help the platform build and maintain its own index
Mistral AI | MistralAI-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) | Used by Mistral to fetch citations for Le Chat; it does not crawl the web automatically or collect training data
Common Crawl | CCBot | Mozilla/5.0 (compatible; CCBot/1.0; +http://www.commoncrawl.org/bot.html) | Crawler of the non-profit Common Crawl project, devoted to cataloging the Internet
Diffbot | Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Scrapes webpages to produce structured data for AI systems
DuckDuckGo | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html) | Collects data to deliver AI-backed answers on DuckDuckGo

This table covers the major AI crawlers from companies such as OpenAI, Anthropic, Google, Apple, Microsoft, Meta, and others. Each entry lists the company name, bot name, full user agent string, and the primary purpose of the crawler.
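
If you want a rough picture of how much AI crawler traffic your site already receives, one quick check is to match the User-Agent header of incoming requests against the bot name tokens in the table. The Python sketch below is illustrative only: the access log path and the combined-log-format regex are assumptions, and the token list is a subset you can extend from the table.

    import re
    from collections import Counter

    # Bot name tokens from the table above (matched as substrings of the User-Agent).
    AI_BOT_TOKENS = [
        "GPTBot", "ChatGPT-User", "OAI-SearchBot", "anthropic-ai", "ClaudeBot",
        "Claude-Web", "Google-Extended", "GoogleOther", "Applebot-Extended",
        "Applebot", "bingbot", "FacebookBot", "meta-externalagent", "Bytespider",
        "cohere-ai", "PerplexityBot", "MistralAI-User", "CCBot", "Diffbot",
        "DuckAssistBot",
    ]

    def identify_ai_bot(user_agent: str):
        """Return the first bot token found in the User-Agent, or None."""
        ua = user_agent.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                return token
        return None

    # Example: tally AI crawler hits in a combined-format access log.
    # The path and regex are assumptions; adjust them to your server's log format.
    counts = Counter()
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            match = re.search(r'"[^"]*" "([^"]*)"\s*$', line)  # last quoted field = User-Agent
            if match:
                bot = identify_ai_bot(match.group(1))
                if bot:
                    counts[bot] += 1

    for bot, hits in counts.most_common():
        print(f"{bot}: {hits}")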

How to Allow Major AI Crawlers in Cloudflare WAF

Method 1: Using Verified Bot Categories in WAF Rules

  1. Log in to your Cloudflare Dashboard

  2. Navigate to Security > WAF > Custom rules

  3. Create a new custom rule:

    • Click “Create rule”
    • Give your rule a descriptive name like “Allow AI Crawlers”
  4. Configure the expression using the following:

    (cf.client.bot and cf.verified_bot_category eq "AI Crawler")
    
  5. Set the action to “Skip” (the successor to the legacy “Allow” action in WAF custom rules) and select which security features to skip - matching AI crawlers will then bypass those rules and reach your website

This functionality is available to all Cloudflare customers, including those on free plans. Cloudflare categorizes verified bots so that site owners have better control over which types of bots can access their content.
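
If you manage Cloudflare configuration as code, the same rule can be created through the Cloudflare Rulesets API instead of the dashboard. The Python sketch below is a rough outline rather than a drop-in script: the endpoint paths and the “skip” action parameters follow the general shape of the Rulesets API but should be verified against Cloudflare’s current documentation, and ZONE_ID and API_TOKEN are placeholders for your own values.

    import requests

    API_TOKEN = "YOUR_API_TOKEN"  # placeholder
    ZONE_ID = "YOUR_ZONE_ID"      # placeholder
    BASE = "https://api.cloudflare.com/client/v4"
    HEADERS = {"Authorization": f"Bearer {API_TOKEN}", "Content-Type": "application/json"}

    # 1. Look up the zone's entrypoint ruleset for the WAF custom rules phase.
    entrypoint = requests.get(
        f"{BASE}/zones/{ZONE_ID}/rulesets/phases/http_request_firewall_custom/entrypoint",
        headers=HEADERS,
        timeout=30,
    ).json()
    ruleset_id = entrypoint["result"]["id"]

    # 2. Append a rule that skips the remaining custom rules for verified AI crawlers.
    #    The action parameters here are an assumption; adjust which rules/products to skip.
    rule = {
        "description": "Allow AI Crawlers",
        "expression": '(cf.client.bot and cf.verified_bot_category eq "AI Crawler")',
        "action": "skip",
        "action_parameters": {"ruleset": "current"},
        "enabled": True,
    }
    resp = requests.post(
        f"{BASE}/zones/{ZONE_ID}/rulesets/{ruleset_id}/rules",
        headers=HEADERS,
        json=rule,
        timeout=30,
    )
    resp.raise_for_status()
    print("Rule created:", resp.json().get("success"))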

Method 2: Specific AI Crawler Setup (More Detailed Control)

If you want more granular control over which specific AI crawlers you allow, you can create a more detailed rule:

  1. Create a new custom rule as described above

  2. Use a more specific expression that targets individual bot user agents:

    (cf.client.bot and (
      http.user_agent contains "GPTBot" or
      http.user_agent contains "ClaudeBot" or
      http.user_agent contains "OAI-SearchBot" or
      http.user_agent contains "anthropic-ai" or
      http.user_agent contains "MistralAI-User" or
      http.user_agent contains "Bytespider" or
      http.user_agent contains "cohere-ai" or
      http.user_agent contains "PerplexityBot"
    ))
    
  3. Set the action to “Skip”, as in Method 1

Remember that this rule will only work if your site is properly proxied through Cloudflare, not just using Cloudflare for DNS. Also note that cf.client.bot only matches bots Cloudflare has verified, so a crawler that is not on Cloudflare’s verified bots list will not match this rule even if its user agent string does.
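
To sanity-check which of the crawlers from the table this Method 2 expression would match, you can mirror its matching logic locally. The sketch below reproduces only the case-sensitive substring semantics of Cloudflare’s contains operator; it cannot reproduce the cf.client.bot verification, which only Cloudflare evaluates, and the sample user agents are simply taken from the table above.

    # Tokens used in the Method 2 expression above.
    METHOD2_TOKENS = [
        "GPTBot", "ClaudeBot", "OAI-SearchBot", "anthropic-ai",
        "MistralAI-User", "Bytespider", "cohere-ai", "PerplexityBot",
    ]

    def would_match(user_agent: str) -> bool:
        """Mirror the expression's case-sensitive `contains` checks.

        The real rule also requires cf.client.bot to be true, which is decided by
        Cloudflare's bot verification and cannot be simulated here.
        """
        return any(token in user_agent for token in METHOD2_TOKENS)

    samples = [
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot",
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "mozilla/5.0 (compatible; gptbot/1.1)",  # no match: `contains` is case-sensitive
    ]
    for ua in samples:
        print(would_match(ua), "-", ua)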

Important Considerations

  • Cloudflare now provides detailed analytics about AI service crawling activity on your site. You can review which AI services are accessing your content and what content they’re accessing most frequently.

  • Different AI crawlers have different purposes - some like GPTBot and ClaudeBot collect training data, while others like OAI-SearchBot are used for search features that do provide attribution and links back to your site.

  • The cf.client.bot condition in these rules relies on Cloudflare’s verified bot detection, so requests that merely spoof an AI crawler’s user agent string without coming from that provider’s infrastructure will not match. You can also spot-check claimed crawlers yourself; see the sketch at the end of this section.

  • Make sure your site is configured to be proxied through Cloudflare (orange cloud icon in DNS settings), as WAF rules only work on proxied traffic.

This configuration will allow AI crawlers to access and index your content, which can help with visibility in AI-powered search tools and assistants.
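
As a complement to Cloudflare’s verification, you can spot-check a client IP that claims to be a crawler using forward-confirmed reverse DNS, a technique several operators (Google, Microsoft, and Apple, for example) support for their bots. The helper below is a generic sketch: the expected-domain suffixes are inputs you should take from each operator’s own documentation, and some operators publish IP address lists instead of reverse DNS records.

    import socket

    def forward_confirmed_rdns(ip: str, expected_suffixes: tuple) -> bool:
        """Check that `ip` reverse-resolves to an expected domain and that the
        hostname resolves back to the same IP (forward-confirmed reverse DNS)."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
        except OSError:
            return False
        if not hostname.endswith(expected_suffixes):
            return False
        try:
            forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        except OSError:
            return False
        return ip in forward_ips  # forward confirmation

    # Example (suffixes per Google's crawler verification docs; IP is illustrative):
    # print(forward_confirmed_rdns("66.249.66.1", (".googlebot.com", ".google.com")))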

Cloudflare WAF rule to allow major AI bots

(cf.client.bot and (
  http.user_agent contains "GPTBot" or
  http.user_agent contains "ChatGPT-User" or
  http.user_agent contains "OAI-SearchBot" or
  http.user_agent contains "ClaudeBot" or
  http.user_agent contains "anthropic-ai" or
  http.user_agent contains "claude-web" or
  http.user_agent contains "MistralAI-User" or
  http.user_agent contains "Bytespider" or
  http.user_agent contains "cohere-ai" or
  http.user_agent contains "PerplexityBot" or
  http.user_agent contains "Google-Extended" or
  http.user_agent contains "Bard" or
  http.user_agent contains "Gemini" or
  http.user_agent contains "DeepSeekBot" or
  http.user_agent contains "DeepSeek-R1" or
  http.user_agent contains "GrokBot" or
  http.user_agent contains "xAI" or
  http.user_agent contains "BingBot" or
  http.user_agent contains "Amazonbot" or
  http.user_agent contains "DuckAssistBot" or
  http.user_agent contains "AI2Bot" or
  http.user_agent contains "CCBot" or
  http.user_agent contains "omgili" or
  http.user_agent contains "Diffbot" or
  http.user_agent contains "FacebookBot" or
  http.user_agent contains "Meta-ExternalAgent" or
  http.user_agent contains "YouBot" or
  http.user_agent contains "Applebot-Extended"
))
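
A list this long is easy to let drift out of date, so you may prefer to keep the bot tokens in a plain list and generate the expression from it, then paste the output into the dashboard (or push it via the API sketch above). A minimal example, shown here with a subset of the tokens; extend the list to match the rule above:

    # Bot user-agent tokens to allow; edit this list as crawlers come and go.
    ALLOWED_AI_BOT_TOKENS = [
        "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
        "claude-web", "MistralAI-User", "Bytespider", "cohere-ai", "PerplexityBot",
        "Google-Extended", "DuckAssistBot", "CCBot", "Diffbot", "FacebookBot",
        "Meta-ExternalAgent", "Applebot-Extended",
    ]

    def build_waf_expression(tokens):
        """Render a Cloudflare rule expression like the one above from a token list."""
        clauses = " or\n  ".join(f'http.user_agent contains "{t}"' for t in tokens)
        return f"(cf.client.bot and (\n  {clauses}\n))"

    print(build_waf_expression(ALLOWED_AI_BOT_TOKENS))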