The Adolescence of Procurement Technology


Why your fourth analytics program will fail for the same reason as the first three.

Last year I sat in a steering committee for a global industrial group running roughly $2.3bn through indirect procurement across 38 countries. The CPO opened with a slide titled “Why this time is different.” It listed the three previous analytics programs the function had attempted since 2020. The first was a spend cube refresh that took eleven months and was retired within a year because nobody trusted the category mapping. The second was an AI-assisted contract review tool that flagged so many false positives the legal team quietly stopped logging in. The third was a generative co-pilot embedded in their S2P platform, decommissioned after a board paper cited a savings figure that turned out to be double-counted across two business units.

The fourth program, the one we were there to discuss, had a budget of just under $4m and a board mandate to “deliver autonomous procurement decisioning by FY27.” The executive sponsor wanted to know what would make this attempt different. I told him, honestly, that nothing would, unless we stopped treating the symptom and addressed the cause.

The cause was not the tools. The vendor master held 14 distinct spellings of one of the company’s largest suppliers, call it Company X. The GL coding scheme had drifted across acquired entities for nine years without reconciliation. The historical award data in the contract repository carried no outcome tags. No model, however capable, could compensate for a foundation in that condition. We were not buying intelligence. We were buying a faster way to be wrong.

This is the messy middle that procurement is currently stuck in. The technology is genuinely impressive. The results are not. The gap between the two is almost always a data foundation problem, not a tool problem.

The pattern is now familiar enough to name

I have run or reviewed enough of these programs to recognise the shape. A procurement function buys a category-leading tool. Implementation runs over budget but eventually goes live. The first quarterly readout looks promising because the demos always do. By month nine, business users have stopped using the analytics dashboard because the numbers do not match what they see in their P&L. By month fifteen, the tool is still technically operational but functionally abandoned. The CPO is now hunting for the next platform that will, finally, deliver. The cycle resets.

Three concrete examples make the point better than the abstraction does.

The co-pilot that hallucinates suppliers. A retailer I worked with deployed a procurement co-pilot trained on their internal data. A buyer asked, in natural language, for the total spend with Company X over the prior 24 months. The model returned a confident answer. It was wrong by 31 percent, because the vendor master contained Company X, Company X Ltd, Company X (UK) Ltd, Co X, Company X SA, Compny X (yes, a typo, in production for six years), and eight other variants. Some had been linked through a parent-child hierarchy. Most had not. The model was working as designed. The data underneath it was not.
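To make the failure mode concrete, here is a minimal sketch, in Python, of the kind of name-normalisation pass that surfaces these variants for a steward to review. The supplier names are the ones from the anecdote; the suffix list, threshold, and helper functions are illustrative assumptions, not the retailer's actual pipeline.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative vendor-master extract: raw supplier names as they sit in the system.
vendor_names = [
    "Company X", "Company X Ltd", "Company X (UK) Ltd",
    "Co X", "Company X SA", "Compny X",
]

def normalise(name: str) -> str:
    """Lower-case, strip punctuation, and drop common legal suffixes before comparing."""
    cleaned = "".join(ch.lower() for ch in name if ch.isalnum() or ch.isspace())
    suffixes = {"ltd", "sa", "uk", "plc", "inc", "gmbh"}
    return " ".join(t for t in cleaned.split() if t not in suffixes)

def likely_duplicates(names, threshold=0.85):
    """Return name pairs similar enough to need a steward's review (not an automatic merge)."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

for a, b, score in likely_duplicates(vendor_names):
    print(f"{a!r} ~ {b!r}  (similarity {score})")
```

Note that an abbreviation like "Co X" would not be caught by string similarity alone, which is the point: real deduplication needs registered-entity identifiers, a parent-child hierarchy, and a named owner, not just a script.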

The autonomous PO matching project that broke on GL. A logistics business attempted to automate three-way matching across roughly 180,000 invoices per quarter. The business case modelled 78 percent automation. Actual auto-match in month three was 41 percent. The diagnosis was not the matching engine. It was that the GL coding rules were applied differently across four business units, all folded in through acquisition and never harmonised. The same category of spend, professional services, was being booked against six different account combinations depending on who raised the PO. The exception queue was unmanageable. The team eventually rebuilt the GL mapping from the ground up, which took eleven months and dwarfed the original tooling investment.
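The diagnostic that would have caught this before the business case was signed off is not sophisticated. A sketch, with invented column names and rows standing in for the ERP extract; the logic is simply "how many distinct GL codings does each category attract, and from which business units?"

```python
from collections import defaultdict

# Illustrative PO lines as (category, business_unit, gl_account); invented for the sketch.
po_lines = [
    ("Professional services", "BU1", "6100-010"),
    ("Professional services", "BU2", "7200-300"),
    ("Professional services", "BU3", "6100-450"),
    ("Professional services", "BU4", "6900-010"),
    ("Facilities", "BU1", "6400-100"),
    ("Facilities", "BU2", "6400-100"),
]

# Collect the distinct (business unit, GL account) combinations behind each category.
codings = defaultdict(set)
for category, business_unit, gl_account in po_lines:
    codings[category].add((business_unit, gl_account))

for category, combos in sorted(codings.items()):
    distinct_accounts = {gl for _, gl in combos}
    if len(distinct_accounts) > 1:
        print(f"{category}: {len(distinct_accounts)} distinct GL accounts across "
              f"{len({bu for bu, _ in combos})} business units -> harmonisation candidate")
```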

The generative RFP tool producing commercially naive output. A financial services firm rolled out a tool that drafted RFP documents from prior precedent. The output read well. The commercial structure was poor. Pricing schedules were lifted from contracts that had underperformed. SLA frameworks were copied from agreements that had been terminated for breach. The tool had no way of knowing this, because the historical award data carried no outcome tags. There was no field that said “this contract delivered against forecast” or “this supplier was exited after 14 months for performance.” The model was learning from a corpus that did not distinguish good outcomes from bad ones. Predictably, it averaged them.
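What "outcome tags" could look like in practice, sketched as a minimal record structure. The field names and the outcome taxonomy are assumptions for illustration, not the firm's schema; the useful part is that every contract in the precedent pool carries an explicit signal about how it performed, so a drafting tool can exclude the failures.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Outcome(Enum):
    """Illustrative outcome taxonomy; the real one would be agreed with category and legal teams."""
    DELIVERED_TO_FORECAST = "delivered_to_forecast"
    UNDERPERFORMED = "underperformed"
    EXITED_FOR_PERFORMANCE = "exited_for_performance"
    EXPIRED_NOT_RENEWED = "expired_not_renewed"

@dataclass
class ContractRecord:
    """The minimal structured metadata that lets a drafting tool tell good precedent from bad."""
    contract_id: str
    supplier: str
    category: str
    effective_date: date
    expiry_date: date
    annual_value: float
    auto_renewal: bool
    outcome: Outcome | None = None  # None until the retrospective tagging program reaches it

# A terminated agreement carries an explicit signal, so it can be kept out of the precedent pool.
terminated = ContractRecord(
    contract_id="C-0423", supplier="Example Supplier Ltd", category="Professional services",
    effective_date=date(2021, 3, 1), expiry_date=date(2024, 2, 29),
    annual_value=1_200_000, auto_renewal=False,
    outcome=Outcome.EXITED_FOR_PERFORMANCE,
)

usable_precedent = [c for c in [terminated] if c.outcome == Outcome.DELIVERED_TO_FORECAST]
```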

In each case the tool was capable. The data foundation was not. The gap was invisible until the tool was live, at which point it was expensive to fix and politically difficult to admit.

A post-mortem

One of these is worth being specific about, because the principle matters more than the anecdote. The client remains unidentifiable.

A FTSE 250 manufacturer, roughly $400m of indirect spend across 47,000 active suppliers, ran an 18-month program to deploy an AI-enabled spend analytics platform with the explicit goal of identifying $25m of addressable savings over three years. The tool was best-in-class. The implementation partner was a tier-one consultancy. The executive sponsorship was real. By any conventional measure the program was set up to succeed.

It failed. At the 14-month review the platform had identified $9m of opportunity, of which $3m had been validated, and of which $1.4m had been realised. The board lost confidence. The CPO commissioned an independent diagnostic, which is where I came in. The finding was that 62 percent of the spend cube was sitting in three categories tagged "Other" or "Miscellaneous," not because the tool could not classify it, but because the source data lacked the dimensions required to classify it. The vendor master had no parent-child linkage for 38 percent of suppliers. The PO data had inconsistent UNSPSC tagging, with 71 percent of lines either untagged or tagged only at segment level rather than at level 4 (commodity). The contract repository was 84 percent populated by document but only 19 percent populated by structured metadata such as effective date, expiry, auto-renewal terms, and category. The tool was doing exactly what it was designed to do. It was being asked to find patterns in a dataset that had no patterns to find.
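For readers unfamiliar with UNSPSC depth, a short sketch of how a diagnostic can separate segment-level tags from commodity-level ones, assuming the standard 8-digit segment/family/class/commodity structure. The example tags are illustrative, not the client's data.

```python
def unspsc_depth(code: str) -> int:
    """Return the hierarchy depth of an 8-digit UNSPSC code:
    1 = segment, 2 = family, 3 = class, 4 = commodity.
    Depth is the count of leading two-digit pairs that are populated (non-zero)."""
    if not (code.isdigit() and len(code) == 8):
        return 0  # malformed or missing tag
    depth = 0
    for i in range(0, 8, 2):
        if code[i:i + 2] == "00":
            break
        depth += 1
    return depth

# Illustrative PO line tags; the real check would run over the spend cube extract.
po_tags = ["43000000", "43210000", "43211503", "", "4321", "44120000"]

depths = [unspsc_depth(tag) for tag in po_tags]
at_commodity = sum(1 for d in depths if d == 4)
print(f"{at_commodity}/{len(po_tags)} lines tagged to commodity (level 4)")
print(f"{len(po_tags) - at_commodity}/{len(po_tags)} lines untagged or tagged above commodity level")
```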

The lessons from that engagement, which I now restate to almost every client at kickoff, are these. First, data preparation is not a project phase, it is a permanent capability. Treat it as a one-off cleanup ahead of go-live and it will degrade within 18 months, putting you back where you started. Second, the quality of your analytics output is bounded by the quality of your operational data, not by the sophistication of your model. A capable model on poor data will confidently produce poor answers. Third, structured metadata is the unglamorous infrastructure that makes everything else possible. Outcome tags on contracts, level-4 commodity codes on POs, parent-child hierarchies on the vendor master. None of this is exciting. All of it is load-bearing.

What this period actually is

There is a useful framing borrowed from outside our field. Amodei calls this period the adolescence of technology, the phase where capability has run ahead of judgement and infrastructure has not yet caught up.[1] Procurement is squarely in that phase. The capability of the tooling is real. The infrastructure underneath, in most functions, is not ready to support it. Buyers are being asked to make decisions on the basis of outputs they cannot verify, from systems trained on data they have not audited, against benchmarks that do not exist.

This is not an argument against procurement technology. The technology is the easy part. The difficult, expensive, multi-year work is fixing the foundation so the technology can do what it claims to do. Most procurement functions I see are skipping that work, because the foundation work is not a story you can tell on a steering slide. “We cleansed the vendor master” does not get a standing ovation. “We deployed an autonomous procurement agent” does. The incentive structure rewards the visible. The results follow the invisible.

This is not despair. It is sequence. The functions getting real value from procurement AI today are not the ones with the most ambitious tooling roadmaps. They are the ones who spent two unglamorous years on master data, GL harmonisation, and contract metadata before they bought anything. They are now compounding. The functions that skipped the foundation are running their fourth pilot.

Before you buy another tool, answer these five questions

If you cannot answer these clearly, the next platform will fail in the same way the last three did. The rest of this series unpacks each one in turn, and a short scorecard sketch follows the list.

  1. What percentage of your spend is mapped to a level-4 commodity code, with reconciliation between source and target taxonomies documented? Good looks like 90 percent or higher at level 4, refreshed quarterly, with an owner accountable for drift.
  2. Is your vendor master deduplicated, with parent-child hierarchy maintained, and is there a single steward responsible for it? Good looks like fewer than 2 percent unmatched suppliers above a materiality threshold, hierarchy reviewed monthly, named accountable role.
  3. What proportion of your historical contracts carry structured metadata for category, expiry, value, auto-renewal, and outcome? Good looks like 80 percent or higher across the first four fields, with outcome tags being applied retrospectively as a deliberate program of work.
  4. Is your GL coding scheme harmonised across business units, and is the mapping between GL and category documented and current? Good looks like a single chart of accounts, a documented GL-to-category crosswalk owned jointly by finance and procurement, and a quarterly reconciliation.
  5. Can you identify, for any decision your procurement systems recommend, the source data and the logic that produced the recommendation? Good looks like traceability from output back to source record, model logic documented in business language, and a defined process for challenging the output.
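As flagged above, here is a minimal scorecard sketch that turns the "good looks like" thresholds into a pass/fail readiness check. The metric names and measured values are placeholders; the assumption is that you can already compute each metric from your spend cube, vendor master, contract repository, and GL crosswalk, which is itself the first test.

```python
# Thresholds mirror the "good looks like" benchmarks above; measured values are placeholders.
READINESS_CHECKS = {
    "spend_at_level4_commodity_pct":           {"good_at_least": 90, "measured": 54},
    "unmatched_suppliers_above_threshold_pct": {"good_at_most": 2,   "measured": 11},
    "contracts_with_core_metadata_pct":        {"good_at_least": 80, "measured": 19},
    "gl_to_category_crosswalk_current":        {"good_equals": True, "measured": False},
    "recommendations_traceable_to_source":     {"good_equals": True, "measured": False},
}

def passes(spec: dict) -> bool:
    """Evaluate one check against its threshold."""
    if "good_at_least" in spec:
        return spec["measured"] >= spec["good_at_least"]
    if "good_at_most" in spec:
        return spec["measured"] <= spec["good_at_most"]
    return spec["measured"] == spec["good_equals"]

failures = [name for name, spec in READINESS_CHECKS.items() if not passes(spec)]
if failures:
    print("Not ready for another platform. Fix first:")
    for name in failures:
        print(f"  - {name}")
else:
    print("Foundation checks pass; a tooling conversation is now worth having.")
```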

If your answer to any of these is “we are working on it” or “the implementation partner is handling it,” you are buying a tool to compensate for a foundation problem. The tool will not compensate. It will magnify.

If you cannot answer all five clearly, do not start with another platform demo. Start with a data-readiness diagnostic. That is the only investment that changes the slope of what comes after.

What’s next in this series

The next post, From Vibes to Ground Truth, broadens the diagnosis from your tooling stack to your day-to-day decisions. Why most procurement calls (savings claims, supplier ratings, category bets) rest on opinion rather than evidence. What the cost of running on vibes actually is, in pounds and dollars, for a typical mid-market function. And the four questions a sceptical CFO will ask about your last savings claim that you may struggle to answer. If your last savings number was queried by finance and the room went quiet, read it first.

[1] Dario Amodei, The Adolescence of Technology. The framing is borrowed loosely; the diagnosis here is procurement's own.
