Probabilistic AI. Deterministic Problem.

Welcome to this edition of QuipuCFO. This week: what happens when AI meets arithmetic in Excel, why the errors are invisible, and the architectural principle that prevents them. The second article connects this to the broader question of why 40% of agentic AI projects will fail by 2027, drawing on a framework I recently published in FP&A Trends.

Every Final-Year Number Was Wrong

SumProduct, an Excel modeling consultancy, tested Excel’s new COPILOT function. They asked it to project five products over five years using given initial sales figures and annual growth rates. The function returned a complete table. Every column filled, every row consistent. It looked right.

Every number in the final year was wrong. Some errors were minor, others more noticeable: a cell that should have read 22,497 returned 22,517. And the function shows only the output: no formula, no calculation steps, no way to trace how any number was derived. The audit trail doesn’t just fail to flag the error. There is no audit trail.

SumProduct’s assessment: the numbers almost look right. That is the worst part.

The Vendor Agrees

Microsoft’s own documentation for the COPILOT function warns users to avoid it for numerical calculations and to use native Excel formulas instead. The documentation explicitly states: avoid AI-generated outputs for financial reporting, legal documents, or other high-stakes scenarios. When the vendor tells you not to use their product for your use case, that warrants attention.

The broader evidence confirms the pattern. OpenAI’s own research established that even advanced training methods raise LLM accuracy for numerical tasks to 78%. A 22% failure rate on math problems, using the best available methods. CFA Institute testing found it gets worse: their RAG pipeline achieved only 55% accuracy on quantitative data versus 66% on qualitative information. The model does not even need to calculate in that test. It just needs to read a number from a document and report it accurately. It fails 45% of the time. For a dashboard no one acts on, these rates might be acceptable. For a pricing formula, a margin calculation, or a financial projection, they are disqualifying.

The Deterministic Handover

This is not a quality problem that the next model release will solve. It is an architectural mismatch.

LLMs work by predicting the most likely next token in a sequence. For an LLM, “10” is no different from “blue.” Both are tokens. Neither carries mathematical meaning. That mechanism is powerful for pattern recognition, document analysis, and synthesis across unstructured data. It is the wrong mechanism for arithmetic.

Growth projections are deterministic operations: given the same inputs and growth rates, the output must be identical every time. Excel solved this decades ago with formulas that are exact, auditable, and reproducible.
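The point can be made in a few lines. A minimal Python sketch of a deterministic growth projection, using hypothetical figures (not SumProduct's actual test data): same inputs, same outputs, every time, exactly as an Excel formula would behave.

```python
def project_sales(initial: float, growth_rate: float, years: int) -> list[float]:
    """Deterministic projection: year n = initial * (1 + growth_rate)^n.

    Identical inputs always produce identical outputs, just like an
    Excel formula such as =B2*(1+$C$2)^A5.
    """
    return [round(initial * (1 + growth_rate) ** n, 2) for n in range(1, years + 1)]

# Hypothetical figures: 15,000 initial sales growing 8.4% a year for five years.
projection = project_sales(15_000, 0.084, 5)
```

Run it twice, a thousand times, on any machine: the final-year figure never drifts. That reproducibility is the property the COPILOT function cannot offer.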

Process selection is the decision that prevents this class of error. Finance operations fall into distinct categories. Pattern recognition and synthesis tasks suit AI. Exact reproducible outcomes need rules-based automation. Strategic judgment stays with humans. The SumProduct errors exist because AI was applied to tasks in the wrong category.

What I call the Deterministic Handover is the architectural principle that enforces this distinction. In a well-designed system, it should be impossible for numerical outputs to bypass deterministic execution. When an AI system encounters a calculation, it does not attempt the arithmetic. It routes the request to a rules-based engine that computes the result with guaranteed precision. The AI’s role is translation: converting a question into a structured query that a deterministic tool executes.
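A minimal sketch of that routing principle, with the LLM translation step stubbed out by a regex (all names and the question format are hypothetical): the AI layer only produces a structured call, and the arithmetic happens exclusively in the rules-based function.

```python
import re

def deterministic_growth(initial: float, rate: float, years: int) -> float:
    """Rules-based engine: exact, reproducible arithmetic."""
    return initial * (1 + rate) ** years

def handle_request(question: str) -> float:
    """Sketch of a Deterministic Handover: translate, then route.

    A real system would use an LLM for the translation step; the regex
    here is a stand-in. The key property is that no code path lets a
    probabilistic component emit the number itself.
    """
    m = re.search(r"([\d.]+) growing at ([\d.]+)% for (\d+) years", question)
    if m is None:
        raise ValueError("could not translate question into a structured query")
    initial = float(m.group(1))
    rate = float(m.group(2)) / 100
    years = int(m.group(3))
    # Handover: the deterministic engine computes the result.
    return deterministic_growth(initial, rate, years)

result = handle_request("Project 10000 growing at 5% for 3 years")
```

The design choice worth noting: if translation fails, the system raises an error rather than guessing, so a malformed request can never silently become a fabricated number.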

Why This Is a CFO Problem

The danger is not one wrong number. It is interdependence. In a financial model, each calculation feeds the next. Pricing feeds margin analysis, margin feeds inventory thresholds, inventory feeds working capital. The SumProduct test demonstrated exactly this: five products projected over five years, where each year’s output depends on the previous year’s result. The final year was the most wrong because it had the most layers of prediction stacked on top of each other. The errors didn’t cancel out; they varied unpredictably in size and direction across cells. And because the function exposes no formulas or calculation steps, an analyst reviewing the table sees clean rows and has no mechanism to detect the errors without rebuilding the entire model independently.
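The compounding effect is easy to simulate. A hedged sketch, with hypothetical figures: inject a small random error into each year of a chained projection, in the spirit of probabilistic output, and watch the final year drift in an unpredictable direction.

```python
import random

def chained_projection(initial: float, rate: float, years: int,
                       per_step_error: float = 0.0, seed: int = 0) -> float:
    """Each year's output feeds the next year's input, so a per-step
    error compounds through the chain instead of averaging out."""
    rng = random.Random(seed)
    value = initial
    for _ in range(years):
        value *= (1 + rate)
        # Small random perturbation per step, mimicking probabilistic output.
        value *= (1 + rng.uniform(-per_step_error, per_step_error))
    return value

exact = chained_projection(15_000, 0.084, 5)
noisy = chained_projection(15_000, 0.084, 5, per_step_error=0.001, seed=1)
drift = noisy - exact  # size and direction vary with the seed
```

Even a 0.1% per-step error leaves the final year off by an amount that changes sign and magnitude from run to run, which is exactly why the errors in the SumProduct table could not be predicted or netted off.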

The AI-in-Excel trend is accelerating. Add-ins and built-in functions make it easier than ever to put probabilistic prediction inside a tool built for deterministic computation. Before adopting any AI-powered spreadsheet tool, apply process selection at the task level. For every cell, every output: is this pattern recognition or deterministic calculation? If it’s the latter, the spreadsheet already has the right tool. It’s called a formula.

The second article in this edition takes this principle further: how agentic AI architectures apply the Deterministic Handover systematically across FP&A.

Why 40% of Agentic AI Fails before 2027. How to Succeed.

Gartner predicts over 40% of agentic AI projects will be cancelled by the end of 2027, due to escalating costs, unclear value, or poor risk controls. The interesting question is not why 40% fail. It is what the surviving 60% have in common.

I recently published a framework for this in FP&A Trends. The argument: trustworthy AI in finance rests on three architectural foundations.

Match Process to Technology

The first article in this edition showed what happens when AI is applied to a deterministic task: invisible errors that survive review. The process selection framework prevents this by categorising finance operations before choosing technology.

Pattern recognition and synthesis tasks, such as variance analysis, scenario planning, and anomaly detection across large datasets, suit AI. Exact reproducible outcomes, such as journal postings, reconciliations, and cost allocations, need rules-based automation. Strategic judgment, such as accounting policy decisions and impairment assessments, stays with humans. As Boston Consulting Group puts it: use AI where it’s about language, not math.

Agentic AI makes this separation operational. CFA Institute research describes architectures where AI translates natural language into structured database queries, deterministic engines execute the computation, and AI synthesises the results into narrative. The model acts as a linguistic translator, not a calculator. Each component does what it does best. This is the Deterministic Handover applied at system level.
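A minimal sketch of that three-stage pipeline, with both AI stages stubbed out and every name, figure, and data structure hypothetical: the LLM's only outputs are a structured query and narrative text, never a computed number.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StructuredQuery:
    """What the translation stage produces: a machine-checkable query."""
    metric: str
    entity: str
    period: str

def translate(question: str) -> StructuredQuery:
    """AI stage one (stubbed): natural language -> structured query.
    A real system would call an LLM here."""
    return StructuredQuery(metric="gross_margin", entity="Product A", period="2025-Q4")

def execute(query: StructuredQuery, ledger: dict) -> float:
    """Deterministic engine: exact computation from governed data."""
    revenue, cogs = ledger[(query.entity, query.period)]
    return (revenue - cogs) / revenue

def synthesise(query: StructuredQuery, value: float) -> str:
    """AI stage two (templated here): computed number -> narrative."""
    return f"{query.entity} {query.metric} for {query.period}: {value:.1%}"

ledger = {("Product A", "2025-Q4"): (1_200_000.0, 780_000.0)}
q = translate("What was Product A's margin last quarter?")
answer = synthesise(q, execute(q, ledger))
```

Each component does what it does best: the number in the final narrative comes from the ledger and the deterministic `execute` step, so it is exact and auditable even though language models sit on both sides of it.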

Design Controls Into the Architecture

Human oversight points must be designed into agent architecture, not added as afterthoughts. Every agent workflow needs three elements: approval gates specifying which decisions require human sign-off, escalation triggers setting conditions that prompt the agent to pause, and override protocols defining how humans intervene without disrupting routine operations.
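Those three elements can be expressed as configuration rather than prose. A hedged sketch, with hypothetical decision names and thresholds, of controls declared up front in the workflow definition rather than bolted on later:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowControls:
    # Approval gates: decisions that always require human sign-off.
    approval_gates: tuple
    # Escalation trigger: a condition that forces the agent to pause.
    escalation_amount: float
    # Override protocol: named procedure for human intervention.
    override_protocol: str

def requires_human(controls: WorkflowControls, action: str, amount: float) -> bool:
    """An action needs a human when it hits a gate or an escalation trigger;
    everything else proceeds without disrupting routine operations."""
    return action in controls.approval_gates or amount >= controls.escalation_amount

# Hypothetical configuration for one agent workflow.
controls = WorkflowControls(
    approval_gates=("journal_posting", "vendor_payment"),
    escalation_amount=50_000.0,
    override_protocol="pause-and-page",
)
```

Because the gates are data, auditors can review the approval logic directly instead of reverse-engineering it from agent behaviour.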

Monitoring needs two layers. Real-time alerts catch immediate failures: an agent attempting unauthorised access or costs breaching thresholds. Trend analysis identifies gradual degradation: forecast accuracy declining over quarters, exception rates creeping upward. Without both, you catch the catastrophic failures but miss the slow erosion of reliability.
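The two layers answer different questions, which a small sketch makes concrete (thresholds and readings are hypothetical): one check fires on a single bad reading, the other only on a sustained pattern.

```python
def realtime_alert(cost: float, budget: float) -> bool:
    """Layer one: immediate failures, e.g. agent costs breaching a threshold."""
    return cost > budget

def trend_degrading(accuracy_by_quarter: list[float], window: int = 3) -> bool:
    """Layer two: gradual degradation, e.g. forecast accuracy declining
    across the last few consecutive quarters."""
    recent = accuracy_by_quarter[-window:]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))

# Hypothetical readings: no single reading trips an alert,
# but the quarterly trend shows the slow erosion of reliability.
alert = realtime_alert(cost=900.0, budget=1_000.0)      # within budget
eroding = trend_degrading([0.92, 0.91, 0.89, 0.86])     # steady decline
```

With only the first check, the declining accuracy series above would never surface; with only the second, a sudden budget breach would go unnoticed for a quarter.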

Ground AI in Verified Data

KPMG found that 62% of organisations identify weak data governance as the main barrier to AI adoption. The FailSafeQA benchmark showed that even the best-performing model fabricated information in 41% of cases when it lacked sufficient context. Retrieval-Augmented Generation reduces this risk by forcing AI to cite verified sources before generating answers, but it works best for qualitative synthesis: policy interpretation, trend summarisation, narrative analysis. For calculations, deterministic tools remain the only reliable option.
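The grounding discipline is structural: if retrieval returns nothing, the system refuses rather than generates. A toy sketch of that refusal behaviour, with a keyword-overlap retriever standing in for a real vector search and every name hypothetical:

```python
def retrieve(question: str, documents: dict) -> list:
    """Toy retriever: return (doc_id, passage) pairs sharing a query term.
    A production system would use embeddings; word overlap suffices here."""
    terms = set(question.lower().split())
    return [(doc_id, text) for doc_id, text in documents.items()
            if terms & set(text.lower().split())]

def grounded_answer(question: str, documents: dict) -> str:
    """Refuse to generate when no verified source is retrieved -- the point
    of RAG is that fabrication is blocked, not merely discouraged."""
    sources = retrieve(question, documents)
    if not sources:
        return "No verified source found; declining to answer."
    doc_id, passage = sources[0]
    return f"{passage} [source: {doc_id}]"

# Hypothetical governed document store.
docs = {"policy-v3": "Impairment reviews run annually per policy."}
```

Every answer either carries a citation back to a governed source or is an explicit refusal; there is no third path on which the 41% fabrication rate can operate.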

Architecture Determines Survival

The 60% of agentic AI projects that survive 2027 will share a common trait: they matched technology to task, embedded controls from the start, and grounded AI in governed data. That is not a technology choice. It is a governance architecture decision. And it belongs to finance.

The full framework, with implementation examples from Lufthansa, EY, and activity-based costing, is published in FP&A Trends: https://fpa-trends.com/article/agentic-ai-projects-fail-2027-how-fpa-succeeds

This week's articles

Testing Copilot on numerical tasks

Testing Copilot for Excel, SumProduct found that computational results are sometimes wrong, and because the errors tend to be small, they can be difficult to identify. The risk of hallucination limits its use for tasks requiring exact, repeatable outcomes.

RAG for Finance: Automating Document Analysis with LLMs

When using Large Language Models in finance, one of the most pressing issues is hallucination: an LLM generating plausible but incorrect or misleading information due to gaps in its training data. LLMs predict words based on statistical probabilities rather than true understanding, and since their knowledge is static and frozen at the time of training, they struggle with new, proprietary, or real-time information. Retrieval-augmented generation (RAG) is meant to ground outputs in verified information.

Monday Moves: From Insight to Action

Four actions FP&A leaders can take:

  1. Audit your AI portfolio against the process classification framework. Decompose target processes into discrete tasks. This shows where agents add value and what controls each task needs. Are you applying probabilistic AI to deterministic tasks? Document where category mismatches exist.
  2. Map control points for your highest-priority AI application. Where do humans approve? What triggers escalation? Can you explain the approval logic to your auditors?
  3. Test your data foundation. Can you trace AI inputs to source systems? Do training, validation, and production data remain separate? If not, start systematic lineage documentation for high-risk AI applications, and establish clear ownership and documentation.
  4. Enforce calculation routing. Do calculations route to deterministic tools (SQL queries, code execution), or can probabilistic generation produce numbers directly?

Is Your Finance AI Ready for Audit?

The three foundations in this article map directly to the QuipuCFO AI Readiness Assessment. Ten minutes to find out whether your organisation matches technology to task, has controls designed in, and governs its data for AI at https://quipucfo.com/#assessment.

The frameworks in this edition are chapters in the CFO AI Playbook, launching Q2 2026. Join the waiting list at https://quipucfo.com/#book.


Unsubscribe · Preferences