Cloudflare Worker — AI Traffic for SearchMention

This guide walks you through connecting your storefront (on Cloudflare) to SearchMention AI Traffic, so you can see AI crawlers and human visits referred by AI apps in your dashboard.

What you need

Requirement | Details
Cloudflare | Your shop’s domain uses Cloudflare (DNS proxied / orange cloud).
SearchMention | A project whose domain matches the site this worker runs on.
API key | sm_live_… from Dashboard → Settings (same key type as other server-side integrations; plans follow SearchMention billing, see product docs).
Tools | Node.js (for Wrangler CLI), or use the Cloudflare dashboard to create routes after deploy.

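The worker does not block or change what shoppers see, and it is designed not to add noticeable latency: the origin request starts before detection runs, and the worker only sends small background requests to SearchMention when it spots AI-related traffic.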


How it works

  1. A request hits your site through Cloudflare.
  2. The worker inspects User-Agent (AI bots) and, if that does not match, Referer (clicks from ChatGPT, Perplexity, etc.), with utm_source on the URL as a fallback when Referer is missing.
  3. If nothing matches, the request continues with no extra work.
  4. If it matches, your origin still responds as usual; the worker asynchronously POSTs a summary to SearchMention’s API.
  5. Data appears under Dashboard → AI Traffic for that project.
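
The flow above can be sketched as a small handler. This is a simplified illustration rather than the shipped script: `makeHandler`, `detect`, and `sendBeacon` are hypothetical stand-ins passed in as parameters, while `ctx.waitUntil` is the Workers runtime call that lets background work continue after the response has been returned.

```javascript
// Simplified sketch of the request flow. detect and sendBeacon are
// hypothetical stand-ins for the detection and reporting logic.
function makeHandler({ originFetch, detect, sendBeacon }) {
  return async function handle(request, ctx) {
    // Steps 1-2: start the origin request, then inspect the request.
    const responsePromise = originFetch(request);
    const detection = detect(request);

    // Steps 3-4: the origin response is returned either way.
    const response = await responsePromise;
    if (detection) {
      // Schedule the report without awaiting it, so the visitor's
      // response is not held up by the SearchMention API call.
      ctx.waitUntil(sendBeacon(request, response, detection));
    }
    return response;
  };
}
```

Passing the collaborators in as parameters keeps the sketch testable outside Cloudflare; the real script calls the global `fetch` directly.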

Before you deploy (checklist)

  • Domain is on Cloudflare and traffic goes through the Worker route you will add.
  • SearchMention project domain equals your store hostname (e.g. www.yourstore.com vs yourstore.com — be consistent with how you track the project).
  • You have an API key (sm_live_…) from Settings on that project.
  • You created a folder on your computer for this worker (e.g. searchmention-ai-tracker/).

Step 1 — Create wrangler.toml

In your project folder, create wrangler.toml. You can copy this template and adjust the worker name if you like:

name = "searchmention-ai-tracker"
main = "worker.js"
compatibility_date = "2024-01-01"

# Optional: define routes here instead of in the dashboard (uncomment and edit):
# routes = [
#   { pattern = "www.yourstore.com/*", zone_name = "yourstore.com" }
# ]

Notes

  • name is the Worker name in Cloudflare.
  • main must point to worker.js (the script below / in this repo).
  • Routes can be set in this file or in Workers & Pages → your worker → Triggers (see Step 6).

Step 2 — Add worker.js

Place worker.js in the same folder as wrangler.toml.

  • On the SearchMention docs site: use the “Worker script” code block at the bottom of the integrations page — copy the full file.
  • From the Git repository: use cloudflare-worker/worker.js.

Do not edit the API URL unless you are self-hosting (see Self-hosted SearchMention).


Step 3 — Install Wrangler and log in

npm install -g wrangler
wrangler login

A browser window opens so you can authorize the CLI to your Cloudflare account.


Step 4 — Set your SearchMention API key (secret)

From the folder that contains wrangler.toml and worker.js:

wrangler secret put SEARCHMENTION_API_KEY

Paste your key when prompted (it starts with sm_live_). This value is not stored in wrangler.toml and is encrypted by Cloudflare.
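
Inside the Worker, a secret set this way is exposed as a property of the `env` object passed to the handler; the script at the end of this guide reads it as `env.SEARCHMENTION_API_KEY`. The sketch below shows a sanity check one might run before sending anything. The helper name `checkApiKey` and its messages are hypothetical; the only format detail taken from this guide is the `sm_live_` prefix.

```javascript
// Hypothetical helper: report whether the secret looks usable.
function checkApiKey(env) {
  const key = env.SEARCHMENTION_API_KEY;
  if (typeof key !== "string" || key.length === 0) {
    return { ok: false, reason: "SEARCHMENTION_API_KEY is not set" };
  }
  if (!key.startsWith("sm_live_")) {
    return { ok: false, reason: "key does not start with sm_live_" };
  }
  return { ok: true, reason: null };
}
```

A check like this only catches an empty or mistyped secret; whether the key is valid for the project is decided by the API.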


Step 5 — Deploy the Worker

wrangler deploy

Wrangler prints your Worker URL and confirms the deployment. If something fails, check that you are in the correct directory and logged in (wrangler whoami).


Step 6 — Route traffic through the Worker

The Worker must run for requests to your storefront. Two common options:

Option A — Route in Cloudflare dashboard

  1. Open Cloudflare dashboard → select your zone (domain).
  2. Go to Workers Routes (or Workers & Pages → your worker → Triggers / Routes, depending on UI).
  3. Add a route that covers your site, for example:
    • yourstore.com/*
    • www.yourstore.com/*
      Add both if you use both hostnames.
  4. Save. Propagation is usually quick (often under a minute).

Option B — Route in wrangler.toml

Uncomment and set the routes block with your real domain and zone, then run wrangler deploy again.
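
To see why both hostnames usually need their own route, the sketch below approximates route matching with a regular expression. It is an illustration only: `routeMatches` is a hypothetical helper, it treats `*` as "any characters", and it does not reproduce every rule of Cloudflare's own matcher.

```javascript
// Approximate a route pattern such as "www.yourstore.com/*".
// Only "*" is treated specially; everything else must match literally.
function routeMatches(pattern, requestUrl) {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  const url = new URL(requestUrl);
  // Routes are compared against host and path, not the query string.
  return new RegExp(`^${escaped}$`, "i").test(url.hostname + url.pathname);
}

// A route for the www hostname does not cover the apex domain:
// routeMatches("www.yourstore.com/*", "https://www.yourstore.com/cart") is true
// routeMatches("www.yourstore.com/*", "https://yourstore.com/cart") is false
```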

Route setting — Fail open (required for storefronts)

After your route exists, open its settings in the Cloudflare dashboard (route details / edit route — exact labels depend on the current UI).

Find the failure / limit behavior and choose:

Setting | What it does
Fail open (proceed) | If the Worker cannot be invoked because a platform limit has been reached (in Cloudflare’s documentation this setting governs requests beyond the plan’s request limit), additional requests bypass the Worker and go straight to your origin. Shoppers still get your normal site.
Fail closed (block) | Requests may not reach your origin when the Worker cannot be invoked, which is bad for a live store.

Turn on Fail open (proceed). SearchMention only needs a best-effort beacon; your checkout and catalog must keep working even if the Worker hiccups. The wording in the dashboard is often along the lines of: “Additional requests will bypass your Worker and proceed to your origin.”

If Cloudflare ever recommends “fail closed” for security-sensitive Workers, that does not apply here — this Worker does not enforce auth or block traffic; it only reports AI-related visits in the background.
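
Independent of the route setting, the same fail-open idea can be applied inside a Worker script. The sketch below is a hypothetical wrapper (`withFailOpen` is not part of the shipped script) that falls back to a plain origin fetch if the tracking logic throws. The Workers runtime also provides `ctx.passThroughOnException()`, which forwards the request to the origin when the script throws an unhandled exception.

```javascript
// Hypothetical wrapper: if the handler throws, serve the origin response.
function withFailOpen(handler, originFetch) {
  return async function (request, env, ctx) {
    try {
      return await handler(request, env, ctx);
    } catch (err) {
      // Tracking must never break the storefront.
      return originFetch(request);
    }
  };
}
```

One caveat with this approach: a request whose body has already been read cannot be replayed as-is, so the wrapper suits handlers that leave the request body untouched.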


Step 7 — Confirm in SearchMention

  1. Open Dashboard → AI Traffic for the project whose API key you used.
  2. Generate some test traffic if needed. A visit from a normal browser will not appear, because only AI bot and AI referral traffic is recorded; the commands under Quick test with curl simulate both kinds.
  3. If nothing appears, see Troubleshooting below.

Self-hosted SearchMention

If you run your own app URL (not searchmention.com), set the ingest URL as a secret:

wrangler secret put SEARCHMENTION_ENDPOINT

Enter the full URL to the visits endpoint, e.g. https://your-domain.com/api/v1/visits. The default in the worker script points at the hosted SearchMention API.
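
The sketch below shows how the endpoint could be resolved and validated before use. The default URL is the one in the worker script; the helper name `resolveEndpoint` and the https-only rule are assumptions added here for illustration.

```javascript
const DEFAULT_ENDPOINT = "https://searchmention.com/api/v1/visits";

// Hypothetical helper: prefer the secret, fall back to the hosted API.
function resolveEndpoint(env) {
  const candidate = env.SEARCHMENTION_ENDPOINT || DEFAULT_ENDPOINT;
  const url = new URL(candidate); // throws on a malformed value
  if (url.protocol !== "https:") {
    // Assumption: the key is sent as a Bearer token, so require TLS.
    throw new Error("SEARCHMENTION_ENDPOINT should be an https:// URL");
  }
  return url.toString();
}
```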


What gets detected (reference)

SearchMention stores visits when the User-Agent matches a known AI bot, the Referer matches a known AI app host, or utm_source on the URL matches an allowlisted hint (when Referer is stripped). The lists evolve; config/ai-bots.php is canonical in production.
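
The three signals are checked in a fixed order in the worker script: User-Agent first, then Referer, then utm_source. The sketch below shows that precedence with deliberately tiny lists. The names `classifyVisit` and `safeHost` and the return labels are hypothetical simplifications; the real lists and visit types are in the worker script and config/ai-bots.php.

```javascript
// Tiny illustrative lists, not the production ones.
const BOTS = [/GPTBot/i, /PerplexityBot/i];
const REFERRER_HOSTS = new Set(["chatgpt.com", "perplexity.ai"]);
const UTM_SOURCES = new Set(["chatgpt.com", "google_ai_mode"]);

function classifyVisit({ userAgent = "", referer = "", url }) {
  // 1. User-Agent wins: a known bot is a bot even if it sends a Referer.
  if (BOTS.some((re) => re.test(userAgent))) return "bot";

  // 2. Referer host, when the header is present and parseable.
  const refHost = safeHost(referer);
  if (refHost && REFERRER_HOSTS.has(refHost)) return "referral";

  // 3. utm_source, matched only against the allowlist.
  const utm = (new URL(url).searchParams.get("utm_source") || "").trim().toLowerCase();
  return UTM_SOURCES.has(utm) ? "referral_utm" : null;
}

function safeHost(value) {
  try {
    return new URL(value).hostname.toLowerCase();
  } catch (_) {
    return null; // empty or malformed header
  }
}
```

With these lists, a URL carrying `utm_source=spring_sale` and no Referer yields `null`, which is why arbitrary campaign tags never reach the dashboard.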

AI bots (examples)

Name | Company
GPTBot, ChatGPT-User, OAI-SearchBot | OpenAI
ClaudeBot, anthropic-ai | Anthropic
PerplexityBot | Perplexity
Google-Extended, Googlebot | Google
Bytespider | ByteDance
CCBot | Common Crawl
Applebot-Extended | Apple
Amazonbot | Amazon
Meta-ExternalAgent | Meta
cohere-ai | Cohere

AI referrals (human clicks from)

Examples include ChatGPT, Perplexity, Gemini, Claude, Copilot, Grok, Meta AI — matched by referrer host (and optional utm_source hints when Referer is stripped). The worker and config/ai-bots.php stay aligned.


Troubleshooting

Problem | What to check
No data in AI Traffic | Worker route covers the hostname shoppers use; API key is for the same SearchMention project as that store domain.
401 / invalid key | Regenerate the key in Settings (requires a paid plan), then update the secret: wrangler secret put SEARCHMENTION_API_KEY.
Worker not running | Route pattern matches the zone; DNS is proxied through Cloudflare; no conflicting Worker sits higher in the route list.
Some visits missing | Traffic must match the bot/referrer rules; normal visitors are not sent.
Free plan | API keys for AI Traffic require Starter or Growth.
UTM in URL but nothing in dashboard | Referrer-based attribution needs a matching HTTP Referer when possible. If Referer is missing, the worker/API only count utm_source values that match the utm_sources list in config/ai-bots.php (e.g. chatgpt.com, google_ai_mode). Arbitrary campaign tags are ignored.

Debug logging (SEARCHMENTION_DEBUG)

  1. In wrangler.toml, under [vars], set SEARCHMENTION_DEBUG = "1" (or add the same variable in Workers → your worker → Settings → Variables).
  2. Redeploy: wrangler deploy.
  3. Tail logs locally: wrangler tail (or open Workers → Logs in the dashboard).

You will see one line per request with bot/referrer detection, and when a beacon is sent, the HTTP status and a short snippet of the API response (useful for spotting 401 or 422).

Quick test with curl

Simulate a referral from ChatGPT (replace the hostname with your store):

curl -sS -o /dev/null -w "%{http_code}\n" \
  -H "Referer: https://chatgpt.com/" \
  "https://www.yourstore.com/"

Then check Dashboard → AI Traffic (and wrangler tail if debug is on). To simulate a bot hit:

curl -sS -o /dev/null -w "%{http_code}\n" \
  -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  "https://www.yourstore.com/"

If these do not create rows, the Worker route may not cover that hostname, the API key may be wrong for the project, or the ingest URL may be misconfigured (SEARCHMENTION_ENDPOINT).


Quick command summary

npm install -g wrangler && wrangler login
wrangler secret put SEARCHMENTION_API_KEY
wrangler deploy

Then add routes in Cloudflare so yourstore.com/* (and www if needed) invoke this Worker.


More detail

  • HTTP API (for custom integrations): see SearchMention’s docs/ai-traffic-api.md in the app repository.
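
For a custom integration, the request sent by the worker script at the end of this guide is a reasonable starting point. The sketch mirrors a subset of the fields and the `Authorization: Bearer` header used in that script. The helper name `buildVisitRequest` and the default `source` value of "custom" are assumptions; docs/ai-traffic-api.md remains the authority on required fields and accepted values.

```javascript
// Hypothetical helper: build fetch() options for one visit.
function buildVisitRequest(apiKey, visit) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      visits: [
        {
          url: visit.url,
          user_agent: visit.userAgent || "",
          visit_type: visit.visitType, // e.g. "bot_training", as in the worker script
          platform: visit.platform || null,
          visited_at: visit.visitedAt || new Date().toISOString(),
          source: visit.source || "custom", // assumption: the worker sends "cloudflare"
        },
      ],
    }),
  };
}

// Usage (not executed here):
// await fetch("https://searchmention.com/api/v1/visits", buildVisitRequest(key, visit));
```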

Worker script — worker.js

Save as worker.js next to wrangler.toml, then run wrangler deploy.

/**
 * SearchMention AI Traffic Tracker — Cloudflare Worker
 *
 * Detects three classes of AI traffic on ecommerce sites and reports
 * them to the SearchMention API:
 *
 *   1. bot_training     — Crawlers harvesting content for model training
 *                         (robots.txt respected, long-term visibility signal)
 *   2. ai_search_fetch  — User-triggered fetchers: an AI assistant is
 *                         loading this page RIGHT NOW to answer a user's
 *                         question (ChatGPT-User, Perplexity-User, etc.)
 *   3. human_referral   — A real human clicked a link in an AI interface
 *                         (chatgpt.com, gemini.google.com, etc.)
 *
 * Detection is prioritized by 2026 ecommerce traffic share:
 *   ChatGPT (~78% of AI referrals) > Gemini (~9%) > Perplexity (~7%)
 *   > Copilot (~3%) > Claude (~3%) > others.
 *
 * Never blocks or modifies the response to the visitor.
 */

/* ---------- Bot user-agent detection ---------- */

// Training crawlers: harvest content for model training. Respect robots.txt.
// Business meaning: long-term brand presence in future model weights.
const TRAINING_BOTS = [
  { name: "GPTBot", pattern: /GPTBot/i, vendor: "OpenAI" },
  { name: "ClaudeBot", pattern: /ClaudeBot/i, vendor: "Anthropic" },
  { name: "anthropic-ai", pattern: /anthropic-ai/i, vendor: "Anthropic" },
  { name: "Google-Extended", pattern: /Google-Extended/i, vendor: "Google" },
  { name: "Applebot-Extended", pattern: /Applebot-Extended/i, vendor: "Apple" },
  { name: "Meta-ExternalAgent", pattern: /Meta-ExternalAgent/i, vendor: "Meta" },
  { name: "Bytespider", pattern: /Bytespider/i, vendor: "ByteDance" },
  { name: "CCBot", pattern: /CCBot/i, vendor: "CommonCrawl" },
  { name: "Amazonbot", pattern: /Amazonbot/i, vendor: "Amazon" },
  { name: "cohere-ai", pattern: /cohere-ai/i, vendor: "Cohere" },
  { name: "DeepSeekBot", pattern: /DeepSeek(?!.*User)/i, vendor: "DeepSeek" },
];

// Search/retrieval crawlers and user-triggered fetchers.
// Business meaning: immediate visibility in AI answers. For ecommerce,
// these hits often correlate with "AI agent is shopping on behalf of a user".
const SEARCH_FETCH_BOTS = [
  // OpenAI
  { name: "ChatGPT-User", pattern: /ChatGPT-User/i, vendor: "OpenAI" },
  { name: "OAI-SearchBot", pattern: /OAI-SearchBot/i, vendor: "OpenAI" },
  // Anthropic
  { name: "Claude-User", pattern: /Claude-User/i, vendor: "Anthropic" },
  { name: "Claude-SearchBot", pattern: /Claude-SearchBot/i, vendor: "Anthropic" },
  // Google
  { name: "Google-CloudVertexBot", pattern: /Google-CloudVertexBot/i, vendor: "Google" },
  { name: "Google-NotebookLM", pattern: /Google-NotebookLM/i, vendor: "Google" },
  { name: "GoogleAgent-Mariner", pattern: /Google-Agent|GoogleAgent|Mariner/i, vendor: "Google" },
  // Perplexity
  { name: "PerplexityBot", pattern: /PerplexityBot/i, vendor: "Perplexity" },
  { name: "Perplexity-User", pattern: /Perplexity-User/i, vendor: "Perplexity" },
  // Meta
  { name: "Meta-ExternalFetcher", pattern: /Meta-ExternalFetcher/i, vendor: "Meta" },
  // DuckDuckGo / Mistral / DeepSeek
  { name: "DuckAssistBot", pattern: /DuckAssistBot/i, vendor: "DuckDuckGo" },
  { name: "MistralAI-User", pattern: /MistralAI-User|Mistral-User/i, vendor: "Mistral" },
  { name: "DeepSeek-User", pattern: /DeepSeek-User/i, vendor: "DeepSeek" },
];

/* ---------- Human referral detection ---------- */

// Ordered by 2026 ecommerce referral share. ChatGPT first = fastest exit
// path for the majority of real AI traffic.
const AI_REFERRER_DOMAINS = [
  {
    name: "ChatGPT",
    // Covers chatgpt.com, chat.openai.com, and the Atlas browser's
    // in-chat origin (chatgpt.com/c/...)
    domains: ["chatgpt.com", "chat.openai.com", "chatgpt.openai.com"],
  },
  {
    name: "Gemini",
    // gemini.google.com is the chat surface. google.com AI Mode appears
    // with gemini or AI-specific query params but comes from google.com,
    // so we handle that via UTM fallback below rather than blanket-match
    // google.com (which would false-positive regular Google organic).
    domains: ["gemini.google.com"],
  },
  {
    name: "Perplexity",
    domains: ["perplexity.ai", "www.perplexity.ai"],
  },
  {
    name: "Copilot",
    domains: [
      "copilot.microsoft.com",
      "www.bing.com/chat",
      "bing.com/chat",
    ],
  },
  {
    name: "Claude",
    domains: ["claude.ai", "www.claude.ai", "claude.com"],
  },
  {
    name: "Meta-AI",
    domains: ["meta.ai", "www.meta.ai"],
  },
  {
    name: "Grok",
    domains: ["grok.com", "www.grok.com", "x.ai", "grok.x.ai"],
  },
  {
    name: "DeepSeek",
    domains: ["chat.deepseek.com", "deepseek.com"],
  },
  {
    name: "Mistral",
    domains: ["chat.mistral.ai"],
  },
];

// Build a hostname -> platform lookup once per worker isolate.
const REFERRER_HOST_INDEX = (() => {
  const index = new Map();
  for (const platform of AI_REFERRER_DOMAINS) {
    for (const domain of platform.domains) {
      // Strip any path fragment (e.g. "bing.com/chat") and key on host only.
      const host = domain.split("/")[0].toLowerCase();
      if (!index.has(host)) index.set(host, platform.name);
    }
  }
  return index;
})();

// UTM fallback: when a real human clicks an AI link, the referrer header
// is often stripped (mobile apps, in-app browsers, no-referrer policy).
// ChatGPT, Perplexity, and Gemini frequently append utm_source tags.
// Treat these as a weaker signal — separate visit_type so downstream can
// distinguish confirmed referrers from inferred ones.
const UTM_SOURCE_MAP = new Map([
  ["chatgpt.com", "ChatGPT"],
  ["chatgpt", "ChatGPT"],
  ["openai", "ChatGPT"],
  ["perplexity.ai", "Perplexity"],
  ["perplexity", "Perplexity"],
  ["gemini.google.com", "Gemini"],
  ["gemini", "Gemini"],
  ["google_ai_mode", "Gemini"],
  ["copilot.microsoft.com", "Copilot"],
  ["copilot", "Copilot"],
  ["claude.ai", "Claude"],
  ["claude", "Claude"],
  ["meta.ai", "Meta-AI"],
  ["grok", "Grok"],
  ["x.ai", "Grok"],
]);

/* ---------- Detection functions ---------- */

function detectBot(userAgent) {
  if (!userAgent) return null;
  for (const bot of SEARCH_FETCH_BOTS) {
    if (bot.pattern.test(userAgent)) {
      return { name: bot.name, vendor: bot.vendor, category: "ai_search_fetch" };
    }
  }
  for (const bot of TRAINING_BOTS) {
    if (bot.pattern.test(userAgent)) {
      return { name: bot.name, vendor: bot.vendor, category: "bot_training" };
    }
  }
  return null;
}

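function detectAiReferrer(referer) {
  if (!referer) return null;
  try {
    const url = new URL(referer);
    const host = url.hostname.toLowerCase();
    const platform = REFERRER_HOST_INDEX.get(host);
    if (!platform) return null;
    // The index keys on host only, so entries that carry a path (e.g.
    // "bing.com/chat") must also match the referrer path; otherwise every
    // ordinary Bing search referral would be reported as Copilot.
    const entry = AI_REFERRER_DOMAINS.find((p) => p.name === platform);
    const matches = entry.domains.some((d) => {
      const [dHost, ...rest] = d.toLowerCase().split("/");
      if (dHost !== host) return false;
      return rest.length === 0 || url.pathname.startsWith("/" + rest.join("/"));
    });
    return matches ? { platform, host } : null;
  } catch (_) {
    return null;
  }
}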

function detectUtmAiSource(url) {
  try {
    const u = new URL(url);
    const source = (u.searchParams.get("utm_source") || "").toLowerCase().trim();
    if (!source) return null;
    const platform = UTM_SOURCE_MAP.get(source);
    return platform ? { platform, raw: source } : null;
  } catch (_) {
    return null;
  }
}

/* ---------- Privacy ---------- */

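// Truncate IPv4 to /24 and IPv6 to /64. This is a common approach in
// privacy-conscious analytics: it keeps a coarse geographic signal while
// reducing user identifiability.
function anonymizeIp(ip) {
  if (!ip) return null;
  if (ip.includes(".")) {
    const parts = ip.split(".");
    if (parts.length === 4) return `${parts[0]}.${parts[1]}.${parts[2]}.0`;
    return null;
  }
  if (ip.includes(":")) {
    // Expand a "::" run before truncating; otherwise a compressed address
    // such as "2001:db8::1" would leak interface bits into the output.
    const [head, tail] = ip.split("::");
    const headParts = head ? head.split(":") : [];
    let parts = headParts;
    if (tail !== undefined) {
      const tailParts = tail ? tail.split(":") : [];
      const missing = Math.max(0, 8 - headParts.length - tailParts.length);
      parts = [...headParts, ...Array(missing).fill("0"), ...tailParts];
    }
    if (parts.length !== 8) return null;
    // First 4 hextets = /64
    return parts.slice(0, 4).join(":") + "::";
  }
  return null;
}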

/* ---------- Debug helpers ---------- */

function isDebugEnabled(env) {
  const v = env.SEARCHMENTION_DEBUG;
  return v === "1" || v === "true" || v === "yes";
}

function debugLog(env, message, detail) {
  if (!isDebugEnabled(env)) return;
  if (detail !== undefined) {
    console.log("[searchmention-ai-tracker]", message, detail);
  } else {
    console.log("[searchmention-ai-tracker]", message);
  }
}

/* ---------- Reporting ---------- */

async function reportVisit(env, request, response, detection) {
  const endpoint =
    env.SEARCHMENTION_ENDPOINT || "https://searchmention.com/api/v1/visits";
  const apiKey = env.SEARCHMENTION_API_KEY;
  if (!apiKey) {
    debugLog(env, "beacon skipped: SEARCHMENTION_API_KEY is not set");
    return;
  }

  // Optional sampling — useful when a client gets a viral spike and
  // you don't want to hammer the API. Value is 0..1, default 1 (report all).
  const sampleRate = parseFloat(env.SEARCHMENTION_SAMPLE_RATE || "1");
  if (sampleRate < 1 && Math.random() > sampleRate) {
    debugLog(env, "beacon skipped: sampled out", { sampleRate });
    return;
  }

  const userAgent = request.headers.get("user-agent") || "";
  const rawIp = request.headers.get("cf-connecting-ip") || null;

  const cf = request.cf || {};
  const payload = {
    visits: [
      {
        url: request.url,
        user_agent: userAgent,
        visit_type: detection.visit_type,
        platform: detection.platform || null,
        bot_name: detection.bot_name || null,
        vendor: detection.vendor || null,
        referrer: detection.referrer || null,
        referrer_host: detection.referrer_host || null,
        method: request.method,
        status_code: response.status,
        ip_address: anonymizeIp(rawIp),
        country: cf.country || null,
        city: cf.city || null,
        visited_at: new Date().toISOString(),
        source: "cloudflare",
      },
    ],
  };

  // Abort the beacon if the API is slow, so background work does not pile up during spikes.
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 3000);

  try {
    const res = await fetch(endpoint, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(payload),
      signal: controller.signal,
    });
    if (isDebugEnabled(env)) {
      const bodyPreview = await res.text();
      debugLog(env, "beacon response", {
        status: res.status,
        visitType: detection.visit_type,
        platform: detection.platform,
        body: bodyPreview.slice(0, 300),
      });
    }
  } catch (err) {
    debugLog(env, "beacon fetch failed", String(err && err.message ? err.message : err));
  } finally {
    clearTimeout(timeout);
  }
}

/* ---------- Main handler ---------- */

export default {
  async fetch(request, env, ctx) {
    // Kick off the origin fetch immediately — detection runs in parallel
    // so we don't add latency to the visitor's response.
    const responsePromise = fetch(request);

    const userAgent = request.headers.get("user-agent") || "";
    const referer = request.headers.get("referer") || "";

    const bot = detectBot(userAgent);
    const aiReferrer = !bot ? detectAiReferrer(referer) : null;
    const utmReferrer = !bot && !aiReferrer ? detectUtmAiSource(request.url) : null;

    let detection = null;
    if (bot) {
      detection = {
        visit_type: bot.category, // "ai_search_fetch" | "bot_training"
        platform: bot.vendor,
        bot_name: bot.name,
        vendor: bot.vendor,
      };
    } else if (aiReferrer) {
      detection = {
        visit_type: "human_referral",
        platform: aiReferrer.platform,
        referrer: referer,
        referrer_host: aiReferrer.host,
      };
    } else if (utmReferrer) {
      detection = {
        visit_type: "human_referral_utm",
        platform: utmReferrer.platform,
        referrer: null,
        referrer_host: null,
      };
    }

    debugLog(env, "request", {
      method: request.method,
      url: request.url,
      detection,
      userAgent: userAgent.slice(0, 200),
    });

    const response = await responsePromise;

    if (detection) {
      ctx.waitUntil(reportVisit(env, request, response, detection));
    }

    return response;
  },
};
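
The script also reads an optional SEARCHMENTION_SAMPLE_RATE variable (a number from 0 to 1, default 1) that the steps above do not cover. The sketch below isolates that decision so it can be reasoned about on its own; `shouldReport` and the injectable `random` parameter are hypothetical names added for illustration.

```javascript
// Hypothetical helper mirroring the sampling check in reportVisit.
function shouldReport(env, random = Math.random) {
  const sampleRate = parseFloat(env.SEARCHMENTION_SAMPLE_RATE || "1");
  // Skip only when the rate is below 1 and the random draw exceeds it.
  // A non-numeric value parses to NaN, which fails the comparison,
  // so everything is reported.
  if (sampleRate < 1 && random() > sampleRate) return false;
  return true;
}
```

With a rate of "0.25", roughly one detected visit in four is reported, so dashboard counts would need to be scaled accordingly.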