Token pricing drops. Your bill keeps rising anyway.



Introduction

This newsletter helps CFOs think better about AI decisions. Each edition tests frameworks from the CFO AI Playbook against real evidence and delivers one concrete action.

This edition covers the total cost of ownership for AI. Recent events at Microsoft and Uber exposed how shifts in pricing and usage are driving up the cost of agentic AI.

Most finance leaders know AI is getting cheaper per token. Fewer know their AI bill is growing anyway. Almost none can explain why.

The Playbook will be published in Q2 2026. Your feedback shapes the final work.

The paradox holds

Per-token prices for frontier AI models have dropped roughly 80% in the last twelve months. By every headline metric, AI got cheaper.

Yet enterprise AI bills are growing. The reason is structural: agentic workflows don't consume tokens the way chatbots do. A single agentic task can consume between 5 and 30 times the tokens of a standard generative query. Some enterprise workflows chain 10 to 20 model calls before returning an answer. Research from Stanford and the University of Michigan confirms this — agentic coding tasks consume on average 3,500 times more tokens than single-turn reasoning tasks, with input tokens rather than output tokens driving the cost. The same context gets fed back into the model at every step of the reasoning loop.

The FinOps Foundation calls this Context Window Creep — the single greatest hidden cost in most production AI applications. Most LLM APIs are stateless: the model has no memory between interactions. Every multi-turn conversation requires the full history to be re-sent at each step. In agentic workflows with dozens of reasoning steps, this compounds rapidly. The advertised per-token price, as the FinOps Foundation puts it, is the tip of a complex iceberg.
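To make the compounding concrete, here is a minimal sketch of how input tokens accumulate when the full history is re-sent at every step. The function name and all figures (a 2,000-token starting context, 500 tokens added per step, 20 steps) are illustrative assumptions, not measurements, and the sketch ignores mitigations such as prompt caching.

```python
def cumulative_input_tokens(steps: int, base_context: int, tokens_added_per_step: int) -> int:
    """Total input tokens billed when the whole context is re-sent at every step."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                   # the full history is sent again at this step
        context += tokens_added_per_step   # this step's result joins the history
    return total


# Hypothetical figures, chosen only to show the shape of the growth.
single_call = cumulative_input_tokens(steps=1, base_context=2_000, tokens_added_per_step=500)
agent_run = cumulative_input_tokens(steps=20, base_context=2_000, tokens_added_per_step=500)

print(single_call)              # 2000
print(agent_run)                # 135000
print(agent_run / single_call)  # 67.5
```

Under these assumptions, twenty steps cost 67.5 times the input tokens of a single call, not twenty times, because each step pays again for everything that came before it.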

The unit got cheaper. The unit count exploded. Most organisations weren't measuring the unit count.


Three challenges

First: token consumption is stochastic. The same task, run twice, can differ by 30 times in token consumption depending on what the agent encounters during execution. This isn't a rounding error — it means variance is structural, not accidental. Traditional budgeting assumes predictable cost-output relationships. AI consumption doesn't have one.

Second: more tokens don't produce better outcomes. This is the finding that should change how finance functions think about AI spend. Accuracy peaks at intermediate token budgets and degrades at the extremes. Excess consumption reflects unproductive exploration — the agent continuing to reason past the point of marginal return. An unconstrained agent is therefore not just expensive; it is often producing worse outputs at the tail. A token budget isn't a constraint on quality. In many cases, it is a quality control mechanism.

Third: agents don't stop running when they stop being useful. Decommissioned agents continue consuming infrastructure, polling endpoints, and triggering API calls — silently, without attribution, and well past the point when anyone considers them active. Cloud-native ERP platforms with agent lifecycle governance are designed to address this. Most organisations are not using those tools deliberately enough. The result is a category of spend that sits in nobody's budget and surfaces only on the invoice.


What it looks like when it lands

Two cases reported this week make the mechanism concrete.

Microsoft launched an internal Claude Code pilot in December 2025, giving thousands of developers in its Experiences and Devices division access to the tool. The pilot was a success by every usage measure. It was also a budget failure. Under token-based billing, the division burned through its entire annual AI budget within months. Microsoft has set a June 30 cancellation deadline and is redirecting affected developers to GitHub Copilot, which it owns and accounts for internally at near-zero marginal cost. The financial nature of the spend changed; the output didn't.

Uber's case is starker. After deploying Claude Code to 5,000 engineers, the company burned through its entire 2026 AI coding tools budget in four months. Per-engineer API costs ran between $500 and $2,000 monthly. The budget collapse happened because adoption spread organically, faster than any forecasting model could track, and nobody had instrumentation to see the aggregate until it was gone.

Neither company had a tool failure. Both had a governance failure: no visibility into consumption as it accumulated, no controls to slow it before the budget ceiling arrived, and no framework to forecast what agentic usage at scale would actually cost. The spend was predictable in structure. It was invisible in practice.


Where the exposure is building

The repricing is no longer theoretical. This week, GitHub moved all Copilot plans to usage-based billing tied directly to token consumption, effective June 1. The seat price is unchanged. What changes is that agentic workflows — which were previously absorbed inside a flat fee — now surface as real costs. GitHub's own announcement is explicit: the product now powers far more complex agentic workflows that consume far more compute, and flat pricing was no longer sustainable. Users with intense agentic usage will see costs increase.

Anthropic's mechanism is different but the effect is similar. Per-token prices have not changed. However, the Opus 4.7 tokenizer generates up to 35% more tokens for the same input text compared to its predecessor. The price per token is identical. The effective cost per request is not. As one market analyst framed it: AI labs cut prices when cash is plentiful and market share matters; they raise effective costs when margins matter — and that condition now applies across all three major vendors.
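The tokenizer effect is simple arithmetic, sketched below. The price of $10 per million tokens and the one-million-token workload are hypothetical placeholders; only the "up to 35%" figure comes from the text above, and it is an upper bound rather than a typical value.

```python
def effective_cost(tokens_before: int, price_per_million: float, token_inflation: float) -> tuple[float, float]:
    """Cost of the same text before and after a tokenizer change, at an unchanged price."""
    before = tokens_before / 1_000_000 * price_per_million
    tokens_after = tokens_before * (1 + token_inflation)  # same text, more tokens
    after = tokens_after / 1_000_000 * price_per_million
    return before, after


# Hypothetical workload and price; 0.35 is the upper bound quoted above.
before, after = effective_cost(tokens_before=1_000_000, price_per_million=10.0, token_inflation=0.35)
print(round(before, 2), round(after, 2))
```

The price list does not move, yet the same workload costs up to 35% more. A budget that tracks price per token alone would not register the change.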

The CFO's job is not to predict how far this goes. It is to build a cost architecture that doesn't depend on that prediction being right. If repricing stays contained, good governance still reduces waste and improves output quality. If it accelerates, organisations with visibility and controls absorb the change; those without it discover the problem on the invoice.

The exposure compounds regardless of how pricing moves. Agents don't live inside a single application. They read from one system, reason, and write to another. Each boundary crossing is increasingly priced: API call fees, data egress charges, write-back fees from receiving platforms. An agent that processes ten thousand records across systems isn't paying for tokens alone. It pays a toll in, a toll for reasoning, and a toll out. Most organisations have no visibility into the aggregate.


The full cost picture

Token consumption is the most visible layer. It is not the only one. CFOs managing AI costs need to account for at least five categories:

Token and inference costs. The volume problem described above, compounded by a cost structure most budgets don't reflect. Output tokens (what the model generates) cost three to five times more than input tokens because generating a response is computationally more expensive than reading a prompt. Multimodal inputs add further cost: processing an image can cost twice the equivalent in text; some models charge eight times more for audio. An agentic workflow that incorporates document retrieval, image analysis, and multi-step reasoning is not paying a flat per-token rate. It is paying a blended rate that varies with every interaction. The result is variable, stochastic, and scaling faster than most forecasting models expect.

Infrastructure and compute. Cloud costs that scale with data volumes, agent workloads, and the embedding and indexing infrastructure that keeps AI systems current. These run whether anyone uses the AI or not. Organisations that index large document repositories to support retrieval-augmented generation pay a continuous cost that grows with the data estate, not with usage.

Maintenance and model drift. AI systems degrade as data environments change. Keeping models accurate requires continuous monitoring, retraining, and validation cycles. Research suggests ongoing maintenance runs at 30 to 50 percent of original build cost annually for typical deployments. This cost is almost never in the original business case.

Integration and boundary costs. The cross-system tolls described above. As agents become more autonomous and multi-system, these costs become material. They are currently the least visible category in most organisations.

Governance and oversight costs. The human review, audit trail, and escalation infrastructure that responsible AI deployment requires. These are not optional in regulated environments. They are also not free. Treating governance as overhead underestimates total cost of ownership and creates the conditions for the failures Microsoft and Uber experienced.

The implication is that a CFO looking only at the API bill is looking at a fraction of the total. Total cost of ownership for AI — across all five categories — is what needs to be governed.
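The blended rate mentioned under token and inference costs can be illustrated with a short sketch. The $3 input price, the token counts, and the function name are assumptions for illustration; the 5x output multiple is the top of the three-to-five range cited above.

```python
def blended_price_per_million(input_tokens: int, output_tokens: int,
                              input_price: float, output_multiple: float) -> float:
    """Token-weighted average price per million tokens for one interaction."""
    output_price = input_price * output_multiple
    weighted = input_tokens * input_price + output_tokens * output_price
    return weighted / (input_tokens + output_tokens)


# Hypothetical interactions at a $3 input price and a 5x output multiple.
chat = blended_price_per_million(1_000, 1_000, input_price=3.0, output_multiple=5.0)
agent = blended_price_per_million(135_000, 10_000, input_price=3.0, output_multiple=5.0)

print(round(chat, 2))   # balanced input/output mix
print(round(agent, 2))  # input-heavy agentic mix
```

In this example the agentic run has the lower blended rate per token, because it is dominated by cheaper input tokens, yet its total cost is far higher because of volume. That is why a rate on its own says little about the bill.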


What governance needs to do

The finance function's job here is not to predict which pricing model wins. It is to build visibility across the full cost picture before the next overrun arrives.

That means four things:

Outcome-anchored measurement. Not tokens consumed in aggregate, but tokens per completed unit of work. If consumption can't be tied to a business outcome, it can't be governed. The metric that matters is efficiency relative to result — not spend relative to budget.

Cross-system visibility. Any dashboard that stops at one system boundary is measuring a fraction of the cost. Agent workflows span ERP, data warehouse, orchestration layer, and external APIs. Governance needs to sit above all of them, covering all five cost categories — not just inference.

Proactive controls, not reactive reports. By the time a monthly invoice explains an overrun, the overrun is paid. Circuit breakers, consumption thresholds, and iteration limits need to run continuously — the same way treasury controls run, not quarterly. These controls also improve output quality: a step limit on an agent that's lost in unproductive reasoning is both a cost control and a performance guardrail.

Decision lineage. When an agent costs significantly more than expected, someone needs to reconstruct why: which model, which context, which policy version governed the execution. Without that trail, AI cost governance is reactive pattern-matching at best. With it, exceptions become precedent and the governance function compounds institutional knowledge over time.
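As an illustration of the proactive controls described above, here is a minimal sketch of a circuit breaker that halts an agent run when it crosses a step limit or a token threshold. The class and function names are hypothetical, and a production control would hook into whatever orchestration layer actually runs the agent.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when an agent run crosses a step or token limit."""


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one agent step and stop the run if a limit is crossed."""
        self.steps_used += 1
        self.tokens_used += tokens
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")


def run_agent(step_token_counts: list[int], budget: RunBudget) -> str:
    """Simulate a run: charge each step until the work ends or the breaker trips."""
    try:
        for tokens in step_token_counts:
            budget.charge(tokens)
    except BudgetExceeded as exc:
        return f"halted: {exc}"
    return "completed"


print(run_agent([1_000, 2_000], RunBudget(max_steps=5, max_tokens=10_000)))
print(run_agent([6_000, 6_000], RunBudget(max_steps=5, max_tokens=10_000)))
```

The check runs on every step rather than at month end, which is the difference between a control and a report.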

These four requirements are extensions of controls finance functions already own. The CFO AI Playbook covers the governance architecture in detail: how to extend COSO principles to agentic AI systems, how to design dual-path monitoring that separates real-time cost signals from longer-term drift detection, and how to build the audit trail infrastructure that makes decision lineage possible in practice. The frameworks apply directly to the cost governance problem the Microsoft and Uber cases expose.
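The first requirement, outcome-anchored measurement, reduces to a simple ratio. The sketch below assumes each agent run is logged as a pair of tokens consumed and whether it completed a unit of work; that log format, the function name, and the figures are illustrative assumptions.

```python
from typing import Optional


def tokens_per_outcome(runs: list[tuple[int, bool]]) -> Optional[float]:
    """Tokens consumed across all runs, divided by the number of completed units of work."""
    total_tokens = sum(tokens for tokens, _ in runs)
    completed = sum(1 for _, done in runs if done)
    if completed == 0:
        return None  # spend with no completed work: there is no outcome to anchor to
    return total_tokens / completed


# Hypothetical log: three completed runs and one failed run that still consumed tokens.
runs = [(40_000, True), (55_000, True), (120_000, False), (45_000, True)]
print(round(tokens_per_outcome(runs)))
```

Failed runs stay in the numerator on purpose: the tokens were paid for, so they belong in the cost of the work that did get completed.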

Monday Moves: from Insight to Action

Three actions finance leaders can take immediately.

Track your full AI spend. Pull costs across all five categories from this article: token and inference, infrastructure, maintenance, integration, and governance overhead. Not just the API invoice. If you can't yet populate all five, the gaps are your risk map.

Inventory your governance measures. Write down what controls you currently have in place: what is monitored, at what frequency, who owns each cost category, and where the escalation threshold sits for agentic spend. A short list is fine. The point is to make the current state visible.

Match governance to overrun risk. Compare your inventory against the four governance requirements in this article: outcome-anchored measurement, cross-system visibility, proactive controls, and decision lineage. Where your controls already cover what Microsoft and Uber lacked, you are reasonably protected. Where the gap is largest, that is where the next overrun is most likely to come from.
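The first action, tracking spend across all five categories, can start as something as simple as the sketch below. The category keys, the function name, and the monthly figures are hypothetical; the point is only that unpopulated categories are reported explicitly rather than silently treated as zero.

```python
# The five cost categories from this article; the key names are illustrative.
CATEGORIES = [
    "token_and_inference",
    "infrastructure",
    "maintenance",
    "integration",
    "governance",
]


def risk_map(known_costs: dict[str, float]) -> dict[str, object]:
    """Summarise which categories have figures and which are blind spots."""
    missing = [c for c in CATEGORIES if c not in known_costs]
    total_known = sum(known_costs.get(c, 0.0) for c in CATEGORIES)
    return {"total_known": total_known, "missing": missing}


# Hypothetical monthly figures: only two of the five categories are populated.
report = risk_map({"token_and_inference": 40_000.0, "infrastructure": 25_000.0})
print(report["total_known"])  # 65000.0
print(report["missing"])      # ['maintenance', 'integration', 'governance']
```

The total is labelled "known" deliberately: with three categories missing, it is a floor on total cost of ownership, not an estimate of it.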

A few questions worth your time:

If you're a CFO or controller: does your Q1 AI bill match your Q1 budget? If not, do you know which of the five cost categories is driving the gap?

If you're on the technical side: of the four governance disciplines above, which feels most underserved in the tools you're actually using?

If you're in a consultancy or vendor role: how are your clients reacting when the bill lands?

And the question I care most about: what did I get wrong? Where has the market moved in a direction this article didn't anticipate? That's the feedback that sharpens the thinking.

Drop a comment or send a message.

I write about this in The CFO AI Playbook, which sets out a full AI implementation framework for finance. Join the waitlist at https://quipucfo.com/#book.

Sources: Bai et al. (2026), "How Do AI Agents Spend Your Money?" arXiv:2604.22750; Ezequiel Massimino, AI Consumption Economics series (LinkedIn, December 2025 — April 2026); The State of Brand, "Every AI Subscription Is a Ticking Time Bomb for Enterprise" (May 2026); Fortune, "Microsoft reports are exposing AI's real cost problem" (May 22, 2026); Crypto Briefing, "Microsoft cancels Claude Code licenses as AI costs surge" (May 2026); GitHub Blog, "GitHub Copilot is moving to usage-based billing" (April 2026); Finout, "Anthropic API Pricing 2026" (April 2026); FinOps Foundation, "GenAI FinOps: How Token Pricing Really Works" (2025).


This week's articles

Microsoft reports are exposing AI’s real cost problem: Using the tech is more expensive than paying human employees

Microsoft is reportedly canceling most internal Claude Code licenses in its Experiences and Devices division by June 30, 2026, and directing thousands of employees toward GitHub Copilot CLI instead. Microsoft isn’t the only company scaling back its internal AI use. Uber’s CTO Praveen Neppalli Naga told The Information in April that the firm had already burnt through its entire 2026 AI coding tools budget in just four months. That comes after the company had actively incentivized adoption through internal leaderboards ranking teams by AI tool usage.

GenAI FinOps: How Token Pricing Really Works

Advertised “per-token price” can be misleading; GenAI costs are determined by operational nuances, and not just list price. Be aware of hidden costs like Context Window Creep, which can exponentially increase spend, especially in long, media-rich interactions. Work with AI engineers to apply a Unit Economics approach to identify token pricing, leveraging techniques like Prompt Caching or Batch Processing, instead of choosing the cheapest-tier model.