Measuring shadow AI

[Key takeaways]

A single audit produces a number. Measurement produces a trend, and only a trend tells you whether governance is working.
Behavioral detection infers unsanctioned AI from how endpoints, identities, and traffic behave, not from a static blocklist that new tools always outrun.
The four telemetry sources you already collect (egress, identity, browser, endpoint) each carry a measurable signal. Correlating them is what raises confidence above any one alone.
Track a small set of durable metrics: prevalence, sanctioned-to-unsanctioned ratio, data-exposure events, and mean time to discovery.
Industry surveys report roughly 80 percent of organizations see moderate to pervasive shadow AI while only about a quarter have real visibility. Measurement is how you close that gap for yourself.

An audit is a number; measurement is a trend line

Most shadow AI programs start and end with a one-time audit. Someone exports 90 days of logs, produces an inventory, and presents a figure: 47 AI tools in use, 12 sanctioned. That number is useful once, and then it decays. It cannot tell you whether the situation is improving, whether a new policy changed behavior, or whether exposure is concentrating in a few risky tools or spreading across many. For any of those questions you need the same signals sampled repeatedly and compared over time.

The distinction matters because the underlying population moves fast. Industry surveys in 2026 consistently report that a large majority of organizations, on the order of 80 percent, observe moderate to pervasive unsanctioned AI use, while only roughly a quarter claim comprehensive visibility into it. Those are aggregate figures and your environment will differ, but they frame the problem: the gap between what is happening and what security can see is the thing you are trying to measure and shrink.

This page is the measurement companion to our hands-on shadow AI discovery playbook. The playbook gets you a baseline inventory in a week. This piece is about what to do after that baseline exists: which behavioral signals to instrument, which metrics to compute from them, and how often to recompute so the number becomes a curve.

Behavioral detection over static blocklists

The naive way to find unsanctioned AI is a blocklist of known domains. It fails predictably: new models, wrappers, and self-hosted endpoints appear weekly, and a blocklist is always a step behind the tool your team adopted yesterday. Behavioral detection inverts the problem. Instead of asking "is this destination on my list of AI vendors," it asks "does this traffic, this OAuth grant, or this process behave like AI use?"

Several behavioral signals generalize beyond any single vendor: scoring outbound destinations by domain age, request frequency, and TLS characteristics rather than by name; flagging processes that spawn unexpected outbound connections, for example an office application invoking a runtime to send data externally; and ranking OAuth grants by the sensitivity of the scopes requested rather than by the popularity of the app. A little-known "AI assistant" that requests read access to an entire mailbox is a finding on scope alone, regardless of whether anyone has heard of it.
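As a rough illustration of scoring a destination by behavior rather than by name, the sketch below combines a few of these signals into one number. The field names, thresholds, and weights are all assumptions made for the example, not recommended values, and TLS characteristics are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Destination:
    domain: str
    domain_age_days: int       # assumed to come from registration data
    requests_per_day: float    # observed request frequency
    bytes_out_per_day: int     # outbound payload volume
    known_to_inventory: bool   # has security already recorded it?


def behavior_score(d: Destination) -> float:
    """Return a score from 0 to 1; higher means more worth reviewing.

    Thresholds and weights are illustrative placeholders.
    """
    score = 0.0
    if d.domain_age_days < 180:         # recently registered domain
        score += 0.3
    if d.requests_per_day > 50:         # frequent, API-like request pattern
        score += 0.2
    if d.bytes_out_per_day > 1_000_000: # sustained outbound data
        score += 0.3
    if not d.known_to_inventory:        # unfamiliar destination
        score += 0.2
    return round(score, 2)
```

The destination's name never enters the score, so a tool that appeared yesterday is scored the same way as one that has been around for years.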

The practical payoff is that behavioral detection degrades gracefully. When a brand-new tool appears, a blocklist reports nothing while a behavioral model still sees an unfamiliar destination receiving encrypted bursts of data from a finance workstation. You may not know the tool's name yet, but you have already measured its existence and its risk.

Four telemetry sources, four measurable slices

Every source below produces a countable, repeatable signal. The goal is not just to detect once but to sample each on a fixed cadence so the counts become time series. Confidence comes from the overlap: a tool visible in three sources is a stronger measurement than one visible in a single log.
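One way to make the overlap countable is to record, for each tool, how many distinct sources have seen it. This is a minimal sketch that assumes sightings have already been normalized into (tool, source) pairs; the source labels are placeholders.

```python
from collections import defaultdict


def source_overlap(sightings):
    """Map each tool to the number of distinct telemetry sources that saw it.

    sightings: iterable of (tool, source) pairs, e.g. ("tool-a", "egress").
    """
    seen = defaultdict(set)
    for tool, source in sightings:
        seen[tool].add(source)
    return {tool: len(sources) for tool, sources in seen.items()}
```

A tool with a count of three or four is a stronger finding than one with a count of one, which may simply be noise in a single log.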

1. Network and DNS egress

Your firewall, secure web gateway, or DNS resolver already records every outbound connection. The measurable signals are the count of distinct AI-associated destinations, the request volume per destination as a proxy for how load-bearing a tool is, and the rate of net-new destinations appearing per week. Rising unique-destination counts week over week are a direct measure of tool sprawl.
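The net-new rate can be sketched as a first-seen count per ISO week. The example assumes egress logs have already been filtered to AI-associated destinations and reduced to (date, domain) pairs; that filtering step is outside the sketch.

```python
from datetime import date


def net_new_per_week(events):
    """Count destinations by the ISO week in which they were first seen.

    events: iterable of (date, domain) pairs.
    Returns {(iso_year, iso_week): count}, sorted by week.
    """
    first_seen = {}
    for day, domain in events:
        if domain not in first_seen or day < first_seen[domain]:
            first_seen[domain] = day

    counts = {}
    for day in first_seen.values():
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))
```

Repeat visits to a known destination do not move the count; only the first sighting of each domain does, which is what makes the series a measure of sprawl rather than of volume.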

2. Identity and OAuth grants

Your IdP logs every SSO login and OAuth authorization. Measure the number of third-party apps holding AI-related or broad data scopes, the count of new grants per period, and the fraction of grants that were never reviewed. Because OAuth grants persist until revoked, this source is where silent, long-lived exposure accumulates, and where a stale-grant count trends up unless someone actively prunes it.
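The three identity counts can be computed from a flat export of grants. The sketch below assumes a hypothetical record shape and a placeholder set of "broad" scope strings; real scope names depend on your IdP and should be substituted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Placeholder scope names for illustration; replace with your IdP's real scopes.
BROAD_SCOPES = {"mail.read", "files.read.all", "drive.readonly"}


@dataclass
class Grant:
    app: str
    scopes: set
    granted_on: date
    reviewed_on: Optional[date] = None  # None means never reviewed


def grant_metrics(grants, period_start):
    """Summarize a list of Grant records for one reporting period."""
    broad_apps = {g.app for g in grants if g.scopes & BROAD_SCOPES}
    new_grants = sum(1 for g in grants if g.granted_on >= period_start)
    unreviewed = sum(1 for g in grants if g.reviewed_on is None)
    return {
        "apps_with_broad_scopes": len(broad_apps),
        "new_grants": new_grants,
        "unreviewed_fraction": unreviewed / len(grants) if grants else 0.0,
    }
```

Running this on the same export each period gives the stale-grant trend directly: if nobody prunes, the unreviewed fraction drifts upward.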

3. Browser and extension telemetry

Most AI use is a browser tab, and network logs describe it poorly because it all rides HTTPS to a few CDNs. Managed-browser and extension inventories make it countable: number of AI sidebars and chat tabs in use, count of "summarize this page" extensions that ship content to a model, and unique users touching each. This slice usually reveals the widest, most casual adoption.

4. Endpoint and developer telemetry

Your EDR and MDM inventory lists installed desktop AI apps, and developer machines carry the newest layer: coding-agent config, IDE extensions, and MCP server definitions in project and user config files. Measure installed AI-app counts, the number of configured MCP servers, and how many of those servers grant broad local or network access. MCP is the fastest-growing and least-visible surface, so its trend line deserves its own attention.
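Counting configured MCP servers can be sketched as a pass over collected config files. The example assumes the configs are JSON with a top-level "mcpServers" object, a layout several MCP clients use; other clients may differ. The test for "broad" access is a deliberately crude, hypothetical heuristic (a remote URL, or a root or home path in the arguments) and would need tuning for real use.

```python
import json


def count_mcp_servers(config_texts):
    """Return (total_servers, broad_servers) across JSON config strings.

    Assumes each config has the shape {"mcpServers": {name: {...}}}.
    """
    total = 0
    broad = 0
    for text in config_texts:
        servers = json.loads(text).get("mcpServers", {})
        for spec in servers.values():
            total += 1
            args = spec.get("args", [])
            # Illustrative heuristic only: remote endpoint, or root/home path argument.
            if "url" in spec or any(a in ("/", "~") for a in args):
                broad += 1
    return total, broad
```

Collecting the config text from endpoints is left to whatever EDR or MDM tooling is already in place; the sketch only covers the counting step.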

What to actually count, and how to compute it

A measurement program needs a small, stable set of metrics that survive tool churn and stay comparable quarter to quarter. Resist the temptation to track everything. Four families carry most of the signal, and each derives directly from the telemetry sources above.

Prevalence is the share of the organization touching AI at all: distinct users seen across all four sources divided by headcount, recomputed each period. It answers "how normal is this," and it almost always climbs, which is fine. The point is to know the slope.
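Computed literally, prevalence is a set union divided by headcount. A minimal sketch, assuming user identifiers have already been normalized so the same person carries the same ID in every source:

```python
def prevalence(users_by_source, headcount):
    """Share of headcount seen using AI in at least one telemetry source.

    users_by_source: dict mapping a source name to a set of user IDs.
    """
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    distinct_users = set().union(*users_by_source.values())
    return len(distinct_users) / headcount
```

The union matters: a user who appears in both browser and egress telemetry counts once, so adding a new source cannot inflate the figure by double counting.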

Sanctioned-to-unsanctioned ratio is the governance number: of the tools and grants you measured, how many are on an approved list compared with those that are tolerated or unknown. This is the metric a policy is supposed to move. If prevalence rises but the ratio holds or improves, governance is keeping pace with adoption. If the ratio falls, adoption is outrunning you.
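A small sketch of the governance number, expressed as the sanctioned share of everything discovered. Expressing it as a share rather than a raw ratio is a choice made here so the value stays between 0 and 1; the function and argument names are illustrative.

```python
def sanctioned_share(discovered, approved):
    """Fraction of discovered tools or grants that are on the approved list."""
    discovered = set(discovered)
    if not discovered:
        raise ValueError("nothing discovered, so the share is undefined")
    return len(discovered & set(approved)) / len(discovered)
```

With the audit figures used earlier, 12 sanctioned tools out of 47 discovered, the share is 12/47, or about 26 percent. The same function applied each period gives the trend a policy is supposed to move.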

Data-exposure events count the moments sensitive data plausibly left the perimeter to an AI endpoint: prompts carrying secrets or PII, uploads to unsanctioned tools, or broad-scope grants that could read sensitive stores. Industry reporting attributes a disproportionate share of exposure to a handful of tools and to categories like source code and legal data, so weight this metric by data sensitivity, not raw event count.
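Weighting by sensitivity can be as simple as a lookup table applied to each event. The categories and weights below are hypothetical placeholders; the right values depend on your own data classification.

```python
# Hypothetical sensitivity weights; tune to your own data classification.
SENSITIVITY_WEIGHTS = {
    "source_code": 5,
    "legal": 5,
    "secret": 5,
    "pii": 4,
    "internal": 1,
}


def weighted_exposure(event_categories, default_weight=1):
    """Sum sensitivity weights over a period's exposure events.

    event_categories: iterable of category labels, one per event.
    """
    return sum(SENSITIVITY_WEIGHTS.get(c, default_weight) for c in event_categories)
```

Under these weights a single source-code upload outweighs several low-sensitivity prompts, which is the intended effect: the metric follows what left, not how often something left.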

Mean time to discovery is the lag between a tool first appearing in any telemetry and security first recording it in the inventory. With quarterly log exports this can be 90 days by construction. Inline visibility drives it toward zero. Falling mean time to discovery is the clearest single proof that your program is getting faster than the problem, and it pairs naturally with a tool-churn rate: new tools appearing and old ones going dormant per period.
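Both figures reduce to simple arithmetic once first-seen and inventoried dates are recorded. A minimal sketch, assuming each tool has one date of each kind and that active tools per period are available as sets:

```python
from datetime import date
from statistics import mean


def mean_time_to_discovery(records):
    """Mean lag in days between first telemetry sighting and inventory entry.

    records: iterable of (first_seen, inventoried) date pairs.
    """
    lags = [(inventoried - first_seen).days for first_seen, inventoried in records]
    if not lags:
        raise ValueError("no records to average")
    return mean(lags)


def churn(previous_active, current_active):
    """Return (new_tools, dormant_tools) between two periods' active sets."""
    previous_active, current_active = set(previous_active), set(current_active)
    return current_active - previous_active, previous_active - current_active
```

For tools first seen 10, 30, and 50 days before they were inventoried, the mean time to discovery is 30 days.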

Prevalence

Distinct AI users across all four telemetry sources divided by headcount, recomputed each period. Measures how widespread AI use is and, more importantly, its slope over time.

Sanctioned-to-unsanctioned ratio

The fraction of discovered tools and OAuth grants that are approved versus tolerated or unknown. The governance metric a policy is meant to move, tracked as a trend.

Data-exposure events

Occurrences where sensitive data plausibly reached an AI endpoint (secrets or PII in prompts, uploads, broad-scope grants). Weight by data sensitivity, not raw count.

Mean time to discovery and churn

Lag between a tool first appearing in telemetry and its entry in the inventory, plus the rate of new and dormant tools per period. A falling mean time to discovery shows the program is outpacing sprawl.

From a snapshot to a standing measurement

Metrics only become a trend if you compute them on a fixed schedule with a consistent method. Set a cadence and hold it: recompute the four metric families weekly or monthly from the same queries, against the same headcount and sanctioned-list definitions, so period-over-period comparisons are honest. Change the method rarely, and annotate the trend line when you do.
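One way to keep comparisons honest is to store a method version alongside each data point and refuse to compute a delta across a method change. This is a sketch under that assumption; the tuple layout and version labels are illustrative.

```python
def comparable_deltas(points):
    """Period-over-period change, skipping pairs where the method changed.

    points: list of (period_label, value, method_version) in time order.
    Returns a list of (period_label, delta), with delta set to None where
    the method differs from the previous period.
    """
    deltas = []
    for (_, prev_value, prev_method), (label, value, method) in zip(points, points[1:]):
        if method == prev_method:
            deltas.append((label, value - prev_value))
        else:
            # Method changed: annotate the trend line instead of comparing.
            deltas.append((label, None))
    return deltas
```

A None in the output marks the spot on the chart that needs an annotation, so a jump caused by a new query is never read as a change in behavior.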

The limiting factor is data freshness. If your only inputs are quarterly log exports, your mean time to discovery cannot beat the export interval, and your prevalence curve is always looking backward. The durable version of measurement is inline: a transparent proxy that observes AI traffic across the browser, desktop, IDE, CLI, and API as it happens, so a net-new tool registers in the count the day it appears rather than the next time someone runs an export. That is the gap Cerbera was built to close, with detection running locally on the endpoint and the same signals feeding both the live inventory and the metrics trend.

Whichever way you implement it, the discipline is the same: pick the small set of metrics that matter, compute them the same way every period, and report the direction rather than the point. A CISO does not need to know you found 47 tools. They need to know that the unsanctioned share is trending down while prevalence trends up, which is exactly what a working governance program looks like on a chart.