What I Took to the Enterprise AI Summit

The room

Gene Kim's 2026 Enterprise AI Summit was held in an exposed-brick former pool hall on downtown San Jose's nightlife strip. The energy of the venue matched the energy of the field. The hall had pivoted from billiards to board games and LAN parties; the bar and leather couches stayed almost untouched. Enterprise IT is in the middle of its own pivot, moving from tech enablement in general to AI enablement in particular.

I'd had a shot at getting in free with a relevant talk and wanted a topic I could speak to from my Adobe work without crossing into anything that would invite scrutiny. I'd just spent two months with Brian Scott onboarding vendors, watching the 10x AI adopters, and noticing the token burn they generated. I'd also been chewing on an interview with Boris from Anthropic, who admitted he'd had to embrace exponential thinking to stay ahead of scaling and adoption. By April, I was convinced we wouldn't have years to ease into cost efficiency. The bills were going to turn into a financial black hole if left unchecked. I could also present the topic without putting Adobe on a single slide.

Looking at the rest of the program, it was a Fortune 500 hitlist with a smattering of startups and a few government agencies, all talking about how to accelerate AI adoption.

The scarcity that shapes systems

Software engineering has always been shaped by whatever resource is scarce. Mainframes made it memory. Client-server made it network bandwidth. Cloud made it API calls and egress. In each transition, engineers who understood the new constraint built systems that scaled; engineers who ignored it built systems that bankrupted their sponsors.

Tokens are the constraint in AI-native computing. Every LLM interaction is metered in tokens: prompts, completions, reasoning steps, tool calls. Tokens determine cost, latency, what fits in a context window, and what gets truncated. They determine whether a workload runs on a small efficient model or requires a frontier model at fifty times the cost.

Most enterprises treat tokens the way early web developers treated database queries: as an implementation detail someone else will worry about later. In a world of exponential growth, later is measured in days.

The Jevons paradox of AI

Per-token costs have fallen roughly 280x in two years. Inference market spend hit $106 billion in 2025 and is projected to reach $255 billion by 2030. That's the Jevons paradox at work: when a resource gets cheaper to use, total consumption of it goes up, because it gets used for far more things. Enterprises are celebrating more miles per gallon while burning fifty times more gallons.

The directional pressure is clear. Organizations that treat tokens as an undifferentiated commodity will end up like the early cloud adopters: surprised by the bill, unable to trace it back to business value.

Borrowing Big-O for tokens

If we share a goal of token efficiency and a framework for evaluating it, we can learn from each other's experiments without re-running them. That's the value Big-O gives programmers. Big-O doesn't tell you the exact runtime of an algorithm; it tells you how runtime grows as input scales. Calling a sort O(n log n) doesn't hand you a number; it puts the algorithm in a category, and the category is what lets engineers make decisions without benchmarking every input.

Big-T applies the same principle to token consumption. It categorizes workloads by how token consumption grows as usage scales, complexity increases, or autonomy expands.

The notation is T(n · k · a):

  • n is the number of requests
  • k is the model calls per request
  • a is the agent depth

When an orchestrator invokes sub-agents that invoke their own sub-agents, you're at T(n · k · a). That's the O(n²) of AI.
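
To make the notation concrete, here's a back-of-the-envelope sketch in TypeScript. The averages are hypothetical; the point is how fast the product n · k · a grows once orchestration and sub-agents enter the picture.

  // Rough token estimate for a workload; the averages are hypothetical.
  function estimateTokens(
    n: number,              // requests per day
    k: number,              // model calls per request
    a: number,              // agent depth (layers of sub-agents)
    tokensPerCall = 2_000,  // assumed average tokens per model call
  ): number {
    return n * k * a * tokensPerCall;
  }

  // A single-call chat feature at 10,000 requests/day.
  console.log(estimateTokens(10_000, 1, 1));  // 20,000,000 tokens/day
  // Same traffic behind an orchestrator making 4 calls per request,
  // with sub-agents three layers deep.
  console.log(estimateTokens(10_000, 4, 3));  // 240,000,000 tokens/day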

The complexity classes

Walking the taxonomy from cheapest to scariest:

T(1) — constant. A cached response, or any operation where the model isn't invoked per request. A bash script or a cache hit lives here.

T(log n) — sublinear. Where most of the quick workload-specific wins are hiding. Any time deterministic code (SQL, regex, embeddings) can reduce the input set before the model sees it, you're here. The tokens that would have been processed are never processed at all.
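
A minimal sketch of the pattern, with callModel as a placeholder rather than any real SDK: a cheap regex pass decides what the model ever sees.

  // Hypothetical sketch: a deterministic regex pass shrinks the input set
  // before any model call. callModel is a placeholder, not a real SDK.
  type Doc = { id: string; text: string };

  declare function callModel(prompt: string): Promise<string>;

  async function summarizeRelevant(docs: Doc[], topic: RegExp): Promise<string> {
    const relevant = docs.filter((d) => topic.test(d.text)); // no tokens spent here
    const excerpts = relevant.map((d) => d.text.slice(0, 1_000)).join("\n---\n");
    return callModel(`Summarize the following excerpts:\n${excerpts}`);
  }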

T(n) — linear. One model call per request, proportional to input size. Recent measurements on Claude models suggest request size stays roughly static across long sessions. Anything that knocks the request size down while producing the same answer is a durable investment.

T(n · k) — multiplicative. k model calls per request: RAG chains, multi-turn conversations, reasoning tokens. The k is often invisible. A reasoning model can consume 128,000 tokens internally on a single request and produce a 500-token visible response. If your costs are climbing while the user-visible output keeps shrinking, you need observability to find the multiplier.
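
A minimal way to surface that multiplier, assuming your client reports token usage per call. The field names here are generic, not any particular SDK's.

  // Illustrative per-request accounting to expose the hidden k.
  type Usage = { inputTokens: number; outputTokens: number; reasoningTokens?: number };

  class RequestMeter {
    calls = 0;
    tokens = 0;
    record(u: Usage): void {
      this.calls += 1;
      this.tokens += u.inputTokens + u.outputTokens + (u.reasoningTokens ?? 0);
    }
    report(requestId: string): void {
      // k and the true token cost, next to whatever the user actually saw.
      console.log(`${requestId}: k=${this.calls}, tokens=${this.tokens}`);
    }
  }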

T(n · k · a) — agent-multiplicative. Where the cost-versus-value debate for agentic AI is going to play out. If you haven't read Antonio Gulli's Agentic Design Patterns: A Hands-On Guide to Building Intelligent Systems, put it on the list. The layered sub-agent designs in there are clearly valuable, and clearly expensive; many of the high-value use cases aren't sustainable at current token prices. What becomes affordable if we drive efficiency down a class is the question worth chasing.

T(∞) — unbounded. Autonomous loops with no termination. Retry logic that re-prompts on every failure. I'm seeing this in production today: a sub-agent's context window gets overrun, it never returns an answer, and the orchestrator has no business logic to fall back on. What does an LLM orchestrator do when it gets an unexpected response and no recovery path? Probably the same thing you would: flail around for whatever seems most sensible and move on.
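
The unglamorous fix is a hard budget and a deterministic fallback, so a failed sub-agent terminates instead of re-prompting forever. A sketch, with callAgent as a placeholder:

  // A hard attempt budget with a deterministic fallback. callAgent is a placeholder.
  declare function callAgent(task: string): Promise<string | null>;

  async function withBudget(task: string, maxAttempts = 3): Promise<string> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      const result = await callAgent(task);
      if (result !== null) return result;
    }
    // Terminate instead of re-prompting: unbounded becomes bounded.
    return "ESCALATE: agent budget exhausted; route to a deterministic fallback";
  }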

Two worked examples

This is where Big-T diverges from Big-O. The expensive part of agentic work usually isn't the algorithm. It's the interface choosing ten agent calls when one shell pipeline would do.

A concrete example: 100 GB of log files on a filesystem. You ask the LLM, "Find the most common errors." The path of least resistance for the model is to enumerate log files, open each one, look for the word ERROR, sort, count, summarize. Six agent calls, minimum. Along the way the model is also negotiating with itself about what counts as a log file (*.log? *.txt? *.out?), what counts as an error, and how to fold near-duplicates with different timestamps and server prefixes.

A different interface, same task:

find . -name "*.log" -exec grep -ih "ERROR" {} + | sort | uniq -c | sort -rn | head

The agent's job collapses to summarizing the output. Iteration over 100 GB leaves the inference pipeline entirely.

Lean further into the LLM side and you arrive at Cloudflare's Code Mode, where MCP tools expose typed APIs and the LLM is instructed to generate code that calls them. Instead of making 31 individual tool calls to create 31 calendar events, the agent writes one TypeScript function that loops over the dates and constructs the payloads. The function runs in a sandboxed runtime. One generation step replaces 31 agent turns.
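
The generated function might look something like the sketch below; createEvent is a stand-in for a typed tool binding, not Cloudflare's actual API surface.

  // Roughly the shape of the generated code. createEvent stands in for a typed
  // MCP tool binding; it is not Cloudflare's actual API.
  declare function createEvent(e: { title: string; date: string }): Promise<void>;

  async function scheduleSeries(title: string, dates: string[]): Promise<void> {
    // The loop runs in the sandboxed runtime, outside the token economy.
    for (const date of dates) {
      await createEvent({ title, date });
    }
  }

  // 31 dates, one generation step, zero additional agent turns:
  // await scheduleSeries("Daily standup", ["2026-03-02", "2026-03-03", /* ... */]);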

Cloudflare measured a 32% token reduction on simple single-operation tasks and an 81% reduction on the 31-event batch. The work was identical. The token cost wasn't.

The Big-T reading: tool-calling for a batch of b operations is T(n · b). Each operation is a full agent turn with context replay. Code generation for the same batch collapses to T(n). The b operations still happen. They happen in compute runtime, outside the token economy.

The five levers

Five places to push, from cheapest move to most strategic.

Model routing. Workloads have a minimum quality bar; routing finds the cheapest model that clears it. Same pattern as database read replicas: send traffic to the cheapest resource that can serve it correctly. Not every task needs a frontier model.
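
In code, this lever can be as small as a table from workload class to the cheapest model that clears its bar. The model names here are placeholders:

  // Placeholder model names; substitute whatever clears your quality floor.
  type Workload = "classification" | "summarization" | "agentic-coding";

  const route: Record<Workload, string> = {
    classification: "small-model",       // cheap model clears the bar
    summarization: "mid-tier-model",
    "agentic-coding": "frontier-model",  // the only class that earns frontier pricing
  };

  const pickModel = (workload: Workload): string => route[workload];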

Prompt engineering. Preprocessing documents to extract only the conceptually relevant sections has delivered 70% token reduction without quality loss. Chain-of-density prompting compresses summaries to a third the length with higher information density. Pattern-based compression on repeated structures: 50–70% on suitable workloads.

Data serialization. For data-heavy workloads, the serialization choice can be the single largest cost driver. Output tokens cost 3–8x more than input tokens (median ~4x), so constraining output format pays back the most per change. Deterministic schema elements don't benefit from LLM generation. Hand them to the decoder.
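
A crude illustration of the gap, using a rough four-characters-per-token heuristic rather than a real tokenizer:

  // Same records, two serializations. Heuristic: ~4 characters per token.
  const rows = [
    { id: 1, status: "open", owner: "kim" },
    { id: 2, status: "closed", owner: "lee" },
  ];

  const verbose = JSON.stringify(rows, null, 2);  // pretty JSON, keys repeated per row
  const compact = ["id|status|owner", ...rows.map((r) => `${r.id}|${r.status}|${r.owner}`)].join("\n");

  const estTokens = (s: string) => Math.ceil(s.length / 4);
  console.log(estTokens(verbose), estTokens(compact));  // compact is roughly a third of verbose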

Caching. Prompt caching can cut input token costs by 90% for stable system prompts. Semantic caching can eliminate model calls outright for high-frequency, low-variance queries. If you've run a CDN or a query cache, the mental model is unchanged: any T(n) workload is a candidate for T(1) on the common case, with T(n) as fallback for misses.
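
The shape of it, with the semantic part reduced to an exact-match lookup for brevity and callModel as a placeholder; a real semantic cache would use embeddings and a similarity threshold.

  // Exact-match stand-in for a semantic cache. callModel is a placeholder.
  declare function callModel(prompt: string): Promise<string>;

  const cache = new Map<string, string>();

  // T(1) on the common case, T(n) fallback on a miss.
  async function cachedAnswer(query: string): Promise<string> {
    const key = query.trim().toLowerCase();  // stand-in for an embedding lookup
    const hit = cache.get(key);
    if (hit !== undefined) return hit;       // no model call at all
    const answer = await callModel(query);
    cache.set(key, answer);
    return answer;
  }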

Abstraction transparency. Credit-based pricing, seat licenses, and "unlimited AI" bundles obscure the relationship between usage and cost. The abstraction isn't always malicious; it's often just a vendor choice. But it blocks optimization. You cannot improve what you cannot measure. Token-aware organizations treat AI billing like cloud billing: per-call metering, cost attribution by team and project, anomaly detection on consumption spikes, dashboards that make unit economics visible. If your tooling doesn't expose token-level telemetry, you're paying for the privilege of not knowing what you're paying for.
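
Per-call metering doesn't require much: a record with enough dimensions to attribute cost will do. The rates below are placeholders, not any vendor's price list.

  // Illustrative metering record; rates are placeholders.
  type CallRecord = { team: string; project: string; model: string; inputTokens: number; outputTokens: number };

  const RATE_PER_1K = { input: 0.003, output: 0.012 };  // placeholder prices

  const cost = (r: CallRecord): number =>
    (r.inputTokens / 1_000) * RATE_PER_1K.input + (r.outputTokens / 1_000) * RATE_PER_1K.output;

  function costByTeam(records: CallRecord[]): Map<string, number> {
    const totals = new Map<string, number>();
    for (const r of records) {
      totals.set(r.team, (totals.get(r.team) ?? 0) + cost(r));
    }
    return totals;
  }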

Workload classification and governance. Make sure the workloads burning the most tokens are the ones producing the most value. A developer using an agent to ship a high-priority feature consumes tokens against a strategic objective. A meeting-summary bot running on every 30-minute standup consumes tokens against a convenience objective. Both show up as "AI usage." Only one has a defensible ROI. The 1% of users consuming 12% of tokens aren't the problem. They're likely your most strategically productive AI users. The problem is not knowing whether that's true, and having no mechanism to find out.

Beyond the levers

An organization that achieves T(log n) and then runs its agents against a self-hosted Jira with a 100-request-per-minute rate limit hasn't solved scaling. It moved the bottleneck. Multiply that across Confluence, M365 Graph API, Salesforce, ServiceNow, and every other system of record. These systems were architected for human-speed access. Agentic access patterns look like batch jobs running at API speed, but without batch-job predictability. The agent decides at runtime which APIs to call, in what order, and how many times. That's a different problem with its own solution space.

Ask, Assist, Automate

The other useful thing from the Summit wasn't on my slides. OpenAI shared findings from their enterprise onboarding and observed that organizations progress through three phases.

Ask. Knowledge-base chatbots and summary use cases. The early days. The model answers questions.

Assist. High-toil regular work: status reports, emails, presentations, draft responses. The model takes a turn with the human still in the loop.

Automate. High-value workstreams where humans are no longer the bottleneck. Provisioning workflows, contract reviews, SDLC tasks. Engineers were never slow to adopt new tech. Legal, finance, and facilities are now under the same pressure, and the pattern there requires intent, because the muscle to automate didn't exist before.

The reason this stuck with me: it's the organizational version of what Big-T is for architecture. Both progress in stages. Both need to advance in step. The orgs that get this right are choosing what's worth automating, then engineering it to a token cost their business can actually support.

The question to take back

Before any AI workload ships: what is the token complexity class of this workload, and is that complexity justified by the value it produces?

This week: run an abstraction audit on your top AI vendors. Can you measure token consumption by workload? If not, getting to observable is the prerequisite for everything else. In parallel, identify your T(n · k) and T(n · k · a) systems and ask whether deterministic code can reduce the prompt. It's not a failure to use AI to bootstrap the work. This is high-value work and deserves the help.

This quarter: stop running everything on the frontier. Get comfortable with the mid-tier and small models that clear your quality floor at a fraction of the cost. Add a caching layer for your high-repetition workloads. Watch your prompt-cache hit rate climb.

Closing

Big-T Notation is a starting point. A shared language for reasoning about token complexity before it becomes a financial crisis. The discipline it asks for is the right question at the right time.

Efficiency is what makes ambition sustainable. The 10x organizations won't use fewer AI tokens. They'll use them better.