Reference Checks in Enterprise AI Procurement: Why Most Organisations Get Nothing Useful (And What to Ask Instead)
Most AI reference checks confirm little beyond whether a vendor can demo well and say the right things. This article covers why they fail, what they're actually meant to surface, and how to structure the process to get genuine insight before you sign.
You've spent three months running an enterprise AI tender. Shortlisted two vendors. Scored their responses. Sat through demos and technical deep-dives where the outputs looked impressively accurate and the platform felt polished.
Now you're at reference checks.
So you call the three contacts the vendor provided. You ask if they're happy with the platform. They say yes. You ask if they'd recommend the vendor. They say yes. You tick the box and move on.
Twelve months later, you're dealing with adoption that never materialised, token costs running well above what was projected, a model deprecation nobody warned you about in advance, and a vendor relationship that looks nothing like what was sold in the procurement process.
Somewhere in the back of your mind, you're wondering if those reference conversations could have told you this was coming.
They could have. But only if you'd known what to actually ask.
Why AI reference checks fail more often than traditional IT ones
In most IT procurement, the gap between vendor promise and delivery reality is visible fairly quickly. A managed service either meets its SLAs or it doesn't. Software either integrates with your systems or it doesn't.
Enterprise AI is different. The demo is almost always impressive. The outputs in a controlled environment - curated data, optimised prompts, a familiar use case - look compelling. What you're evaluating in the sales process is rarely what you'll experience in production.
The failure modes in enterprise AI don't look like traditional software failures. They look like this: adoption that stalls because users don't trust outputs. Accuracy that degrades on edge cases your use cases are full of. Costs that compound in ways the original model didn't anticipate. A model update that quietly changes output behaviour and breaks workflows you'd built around it.
Vendors don't give you references that have experienced those problems. They give you their best relationships - clients who got good outcomes, projects that landed well, contacts who've agreed to take calls and say positive things.
That isn't dishonest. It's just how the process works.
The problem is that most organisations ask questions designed for friendly references. Questions like "Are you satisfied?" or "Would you recommend them?" produce yes-or-no answers. In a reference call arranged by the vendor, the answer is almost always yes.
At this stage of an AI procurement, you're not trying to confirm basic competence. You already know the vendor can make the technology work at some level or they wouldn't be shortlisted. What you're trying to understand is how they actually operate in production, where friction shows up, and whether that friction matters in your context.
That requires different questions.
What AI reference checks should actually do
A useful reference check in an enterprise AI procurement does three things.
First, it confirms capability in a comparable scenario. Not whether the vendor's platform is impressive in general, but whether they've delivered something structurally similar to what you're proposing. Comparable use case complexity, similar data environment, similar organisational change management challenges. A successful deployment at a financial services firm processing structured documents tells you very little if you're a government agency working with unstructured policy text.
Second, it surfaces behaviour under pressure. What happens when outputs are consistently inaccurate for a specific use case? When a model version update changes the behaviour of workflows you've built? When adoption stalls and users stop trusting the tool? When consumption costs run 30% above projection? The goal isn't to catch the vendor out - it's to understand their response patterns so you know what "normal" looks like once the contract is signed.
Third, it helps you calibrate expectations. If every reference mentions that the vendor is technically strong but slow to respond to accuracy issues, that may not be disqualifying. But it tells you where contract protections, escalation paths, and internal governance need to be stronger. If references consistently say change management support was thinner than expected, you know where to negotiate harder or resource more heavily on your own side.
If your reference checks aren't doing those three things, they're mostly noise.
Asking for the right references in the first place
Before you get to the questions, you need to speak to the right people.
Most AI vendors default to providing their largest, most recognisable clients. Those references look good in the vendor's proposal, but they're only useful if your environment is genuinely comparable - not just in size, but in use case, data complexity, and organisational readiness for AI adoption.
A 500-person professional services firm implementing AI for document review gets very little insight from a reference at a 10,000-person global bank with a dedicated AI centre of excellence and a large internal ML team. The deployment model, the support structure, the internal capability required, and the commercial leverage are all different.
Be specific about what you need. Ask for references from organisations of comparable size, similar AI maturity, and - critically - a comparable use case. If you're deploying AI for contract analysis, a reference from another legal or procurement team tells you more than one from an unrelated function, even at a larger organisation.
Recency matters more in AI than in almost any other technology category. The AI vendor landscape has changed significantly in the past 18 months. A reference from a client onboarded two years ago reflects a different product, different pricing model, and likely a different delivery team. Ask for references where the deployment went live within the last 12 months. That gives you a view of current platform capabilities, current support quality, and how the vendor is handling the rapid evolution of the underlying models.
If a vendor struggles to provide recent references in comparable use cases, that's information in itself.
The questions that actually matter
Start with context, not satisfaction.
Ask the reference to describe the engagement. What were they trying to achieve? What use cases were in scope at launch? How long from contract to go-live? What did the rollout look like? This tells you whether the reference is genuinely comparable to your situation, and it surfaces early signals about implementation complexity before you've asked anything pointed.
Then move into specifics.
"What surprised you about working with this vendor once you were past the sales process?" Almost every reference will have something here. In AI procurement, the gap between sales process and implementation reality is often significant - in either direction.
"Where did the platform underperform against what was demonstrated or promised?" This is more useful than asking whether there were problems. Every AI platform has gaps. The question is whether those gaps overlap with what matters to you.
"What did actual token consumption costs look like compared to what was projected during procurement?" Cost predictability is a genuine risk in enterprise AI that doesn't have a direct equivalent in traditional software licensing. References who've lived through a consumption model for 12 months can give you a ground truth that no vendor pricing model will.
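The gap between projected and actual consumption cost is easy to quantify once a reference shares real numbers. The sketch below is purely illustrative - the token volumes and per-token prices are hypothetical, not drawn from any real vendor's pricing model - but it shows the kind of variance calculation worth doing with a reference's figures in hand.

```python
# Hypothetical sketch: comparing projected vs actual monthly token spend
# under a simple per-1k-token consumption model. All figures are invented.

def monthly_token_cost(tokens_in: int, tokens_out: int,
                       price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one month's consumption under per-1k-token pricing."""
    return (tokens_in / 1000) * price_in_per_1k + (tokens_out / 1000) * price_out_per_1k

# What was projected during procurement (hypothetical volumes and prices).
projected = monthly_token_cost(
    tokens_in=20_000_000, tokens_out=5_000_000,
    price_in_per_1k=0.003, price_out_per_1k=0.015,
)

# What a reference reports after 12 months in production: usage scaled up
# and output-heavy workflows came to dominate consumption.
actual = monthly_token_cost(
    tokens_in=24_000_000, tokens_out=7_000_000,
    price_in_per_1k=0.003, price_out_per_1k=0.015,
)

variance = (actual - projected) / projected
print(f"projected ${projected:,.0f}/month, actual ${actual:,.0f}/month, "
      f"{variance:+.0%} variance")
```

Even a back-of-the-envelope version of this, populated with a reference's real numbers, is more informative than any satisfaction rating.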
"How did the vendor respond when the model was updated or changed?" Model versioning is an enterprise AI-specific risk. You want to know whether the vendor proactively communicated changes, whether workflows broke, and how quickly issues were resolved. A vendor who handles model transitions well is operationally very different from one who doesn't.
"What was adoption actually like, and what did the vendor do to support it?" In enterprise AI, adoption is where most value is lost. References can tell you whether change management support was substantive or superficial, whether users trusted outputs, and how long it took to reach genuine utilisation.
"How did they respond when accuracy or output quality became an issue in your context?" This is the AI-specific equivalent of asking how a vendor handles support escalations. The answer tells you what it's actually like to work through a problem with this vendor in production.
"If you were doing this engagement again, what would you change about how it was set up commercially or contractually?" This question frequently reveals gaps in how agreements were structured - things like inadequate accuracy commitments, weak data handling provisions, or consumption cost controls that weren't tight enough.
"Would you use them again, and for what?" This is more revealing than a blanket recommendation. It clarifies where the vendor genuinely adds value and where they don't. A reference who says "yes, but only for use case X" is giving you more useful signal than one who says "yes, absolutely, highly recommend."
And one of the most useful questions of all: "What didn't we ask that we should have?"
What to listen for in the answers
Pay attention to hesitation. Not silence, but qualification. "They're great, but…" is usually where the useful detail sits. In AI procurement specifically, listen for qualifications about output quality, cost predictability, and vendor responsiveness to accuracy issues - these are the areas where the gap between expectation and reality tends to be widest.
Listen for specificity. Vague praise is meaningless. "The platform is really powerful" tells you nothing. "We reduced document review time by 40% in the first three months, but accuracy on complex cross-jurisdictional contracts was a problem we had to work around" tells you a great deal.
Watch for patterns across references. If multiple references independently raise the same strength or the same limitation, it's probably real. A pattern of strong technical delivery but thin change management support, for example, isn't necessarily disqualifying - but it tells you precisely where to invest more on your own side, and where to push harder in contract negotiations.
Pay attention to what isn't said. If questions about cost predictability drift back to platform features, that's a signal. If questions about accuracy issues get redirected to customer success team quality, that's another. What a reference doesn't address directly is sometimes as informative as what they do.
When to go off-list
Formal procurement processes usually restrict reference checks to contacts provided by the vendor. That's reasonable from a probity perspective, particularly in public sector or highly regulated environments.
But for higher-value or higher-risk enterprise AI deployments, informal calibration through peer networks can add meaningful context. A quiet conversation with a CIO or IT leader at a comparable organisation who has worked with a vendor - even briefly - may surface insights that don't appear in a formal process. Industry forums, technology peer groups, and professional networks are worth activating before reference calls, not after.
This isn't about bypassing governance. It's about understanding how vendors behave outside rehearsed reference calls, and going into formal conversations with better questions as a result.
How this fits into the broader AI procurement process
Reference checks often sit at the end of an evaluation, when a preferred vendor is already emerging and evaluation fatigue has set in. That creates confirmation bias. You're listening for reassurance rather than risk.
In enterprise AI procurement, this is particularly pronounced. By the time references are called, the team has usually seen impressive demos, heard compelling case studies, and built some degree of enthusiasm for the vendor. The reference call becomes a final box to tick rather than a genuine risk assessment.
Treating reference checks as a scored evaluation criterion - with defined questions, consistent weighting, and documented findings - helps counter that bias. They become a structured data point within a broader enterprise AI procurement process, not an afterthought.
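One way to make that structure concrete is to score each reference call against defined question areas with consistent weights. The sketch below is a hypothetical illustration - the question areas, weights, and scores are invented for this example, not a recommended rubric - but it shows how findings can become a comparable data point rather than an impression.

```python
# Hypothetical sketch of scoring reference-check findings as a weighted
# evaluation criterion. Areas, weights, and scores are illustrative only.

WEIGHTS = {
    "comparable_use_case": 0.25,
    "cost_predictability": 0.20,
    "model_change_handling": 0.20,
    "adoption_support": 0.20,
    "accuracy_issue_response": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-area scores (1 = poor, 5 = strong) into one weighted result."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[area] * scores[area] for area in WEIGHTS)

# Documented findings from one reference call, scored 1 to 5.
reference_a = {
    "comparable_use_case": 4,
    "cost_predictability": 2,    # costs ran well above projection
    "model_change_handling": 3,
    "adoption_support": 2,       # change management support was thin
    "accuracy_issue_response": 4,
}

print(f"Reference A weighted score: {weighted_score(reference_a):.2f} / 5")
```

Scoring the same questions across every reference, for every shortlisted vendor, is what turns anecdotes into the cross-reference patterns described earlier.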
There's also a timing consideration specific to AI: if you're assessing references after a proof of concept or pilot, you're in a stronger position. You have your own production experience to pressure-test against what references tell you. The gaps between reference accounts and your own pilot findings are worth examining closely.
What happens if you skip this
Nothing immediately.
The vendor is appointed. The contract is signed. The implementation begins. For a while, everything looks fine - the pilot use cases perform, the demos are repeated internally, early adopters are positive.
Then the harder realities surface. Adoption in the broader user population stalls. Token costs climb as usage scales. A model update changes output behaviour in ways your workflows weren't built to handle. Support response times that felt acceptable in early deployment become a problem when you're operating at scale. The vendor's roadmap moves in a direction that doesn't align with where your use cases are heading.
None of this would necessarily have disqualified the vendor. But much of it could have been anticipated and managed differently - through contract protections, governance structures, or simply more realistic internal expectations - if the signals had been surfaced during procurement.
That's the real cost of weak reference checks in enterprise AI. Not catastrophically bad decisions, but decisions made with less information than was available.
A final note
Reference checks won't tell you everything. They won't eliminate risk or guarantee outcomes. And in a category as fast-moving as enterprise AI, past performance is a less reliable predictor of future delivery than in more stable technology categories.
But they do offer genuine insight into how vendors actually operate in production - not how they present in a tender response, and not how the technology performs in a curated demo environment. In enterprise AI procurement, where the gap between those two things is often significant, that insight is worth more than most organisations invest in capturing it.
This article provides general commercial and procurement commentary only and does not constitute legal, financial, or professional advice.