01 Trust is the bottleneck, not capability
Ask a procurement leader what is stopping them from scaling AI across the function, and the honest answer is rarely "the model isn't good enough." It is some version of: "I can't put something in front of the audit committee that I can't defend." That is the real ceiling. Not capability — defensibility.
And the concern is well-founded, not a reflex. The generative-AI literature has a name for the specific failure mode that makes procurement leaders nervous: hallucination — fluent, confident, plausible output that is simply wrong. Ji et al.'s survey of the field catalogues how pervasive the problem is across generation tasks, and why it "degrades system performance and fails to meet user expectations in many real-world scenarios" (Ji et al., 2022). A model that is right 95% of the time and indistinguishably confident the other 5% is not a model a CPO can hand to the audit committee — not because it is bad, but because they cannot tell which 5% they are looking at.
So the question that actually matters for agentic procurement is not "is the model smart enough?" It is "can I defend any given output?" And that turns out to be an architecture question, not a model question.
02 Four properties of a defensible output
An agent output you can take to the board has four properties. None of them is "the model was very capable." All four are things you build around the model.
- Grounded — every claim traces to a source the buyer controls, not the model's memory.
- Traced — the reasoning is visible, not a black box. You can see how it got there.
- Bounded — the skill operates inside explicit limits, and says so when it hits one.
- Reviewable — a named human decides, on the record, at a defined point.
The middle two have direct support in the research. Yao et al.'s ReAct work showed that interleaving a model's reasoning trace with its actions, so that the "why" is visible step by step and each step can be checked against an external source, reduces the hallucination and error propagation seen in chain-of-thought reasoning alone (Yao et al., 2022). A visible reasoning trace is not just a nice-to-have for auditability; it is part of what makes the output more correct in the first place.
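One way to make the four properties concrete is to treat them as fields an output must carry before it leaves draft status. The sketch below is illustrative only: `AgentOutput` and `missing_properties` are hypothetical names, not part of the skill suite, and a real check would validate more than whether each field is non-empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentOutput:
    """Hypothetical record for one agent output (not the suite's own schema)."""
    claim: str
    sources: List[str] = field(default_factory=list)         # grounded: buyer-controlled sources
    trace: List[str] = field(default_factory=list)           # traced: visible reasoning steps
    limits_checked: List[str] = field(default_factory=list)  # bounded: limits the skill tested
    reviewer: Optional[str] = None                           # reviewable: named human decider


def missing_properties(output: AgentOutput) -> List[str]:
    """Return the names of the four properties this output does not yet have."""
    missing = []
    if not output.sources:
        missing.append("grounded")
    if not output.trace:
        missing.append("traced")
    if not output.limits_checked:
        missing.append("bounded")
    if output.reviewer is None:
        missing.append("reviewable")
    return missing


draft = AgentOutput(claim="Indemnity cap exceeds policy")
print(missing_properties(draft))
```

An output with an empty result from `missing_properties` is not thereby correct. It is only one that a reviewer has enough material to defend or reject.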
A raw model output is a draft. It becomes a decision once it has been grounded in policy, checked against known behaviour, and reviewed by a named person — leaving an audit trail behind it.
The diagram is how the four properties chain together in practice — the path from a raw model output to something with a CPO's name on it. The next three sections walk the three stations on that path that the skill suite already builds for you.
03 Playbook-as-policy
The override file from the customization pattern does a second job that has nothing to do with tuning. It is your written policy, made executable.
When xClause flags a clause, it does not flag it because the model has an opinion. It flags it because the clause conflicts with a specific, named line in your playbook — a position your legal team wrote down. The output is grounded by construction: every finding points back to a policy you can show an auditor. That is the difference between "the AI thinks this indemnity cap is risky" and "this indemnity cap conflicts with playbook §3.2, which your General Counsel signed off in March." The first is an opinion. The second is defensible.
This is the quiet power of keeping policy in a file the skill reads, rather than in the model's training. The model's training is a black box you cannot audit. A playbook file is a document you can put in front of anyone.
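The grounding-by-construction idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not xClause's actual implementation or file format: `PlaybookRule`, `Finding`, and `check_indemnity_cap` are hypothetical names, and the rule is reduced to a single numeric cap so the mechanism stays visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaybookRule:
    """Hypothetical shape of one playbook line; the real override file may differ."""
    section: str
    max_cap_multiple: float  # highest acceptable indemnity cap, as a multiple of fees
    approved_by: str


@dataclass(frozen=True)
class Finding:
    section: str
    message: str


def check_indemnity_cap(cap_multiple: float, rule: PlaybookRule) -> Optional[Finding]:
    """Flag a clause only when it conflicts with a named playbook rule."""
    if cap_multiple <= rule.max_cap_multiple:
        return None  # no conflict with written policy, so nothing to flag
    return Finding(
        section=rule.section,
        message=(
            f"Indemnity cap of {cap_multiple}x fees conflicts with playbook "
            f"§{rule.section} (maximum {rule.max_cap_multiple}x, "
            f"approved by {rule.approved_by})"
        ),
    )


rule = PlaybookRule(section="3.2", max_cap_multiple=2.0, approved_by="General Counsel")
finding = check_indemnity_cap(5.0, rule)
print(finding.message if finding else "no conflict")
```

The design point is that the function cannot return a finding without a rule to cite. There is no code path for "the model thinks this is risky".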
04 Fixtures-as-evals
Every skill in the suite ships a fixtures/sample-inputs.md — three representative inputs. The obvious use is the smoke test: run them after editing your overrides, check the output is well-formed.
But the fixtures do a second, quieter job: they are a trust artifact. They let you say, precisely, "here is exactly what this skill does on these known inputs" — and then show it. That is the core idea behind serious model evaluation. The HELM framework's central argument is that trust in a language model comes from broad, transparent, reproducible measurement — not from the model's reputation, and not from a single headline score (Liang et al., 2022). Fixtures are that idea, scaled down to a single skill: a known input, a known-good output, reproducible on demand.
When the audit committee asks "how do you know it does what you say," fixtures are the answer you can actually run in the room.
A vendor demo proves the skill can produce a good output once. It proves nothing about the other ninety-nine. Fixtures invert that: a small, fixed, public set of inputs you re-run every time you change an override — so you are measuring behaviour you can reproduce, not a performance you watched. If a skill has no fixtures, you do not have an eval. You have a memory of a demo.
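A fixture re-run can be as small as the sketch below. It assumes the fixtures have already been loaded into memory as input and known-good output pairs. How they are parsed out of fixtures/sample-inputs.md is not shown, and `run_fixtures` and `toy_skill` are hypothetical stand-ins rather than anything the suite ships.

```python
from typing import Callable, Dict, List


def run_fixtures(skill: Callable[[str], str], fixtures: Dict[str, str]) -> List[str]:
    """Return every fixture input whose output no longer matches the known-good output."""
    return [text for text, expected in fixtures.items() if skill(text) != expected]


def toy_skill(clause: str) -> str:
    """Stand-in for a real skill: flags one phrase, passes everything else."""
    return "flag" if "unlimited liability" in clause.lower() else "pass"


# Three representative inputs, each paired with its known-good output.
FIXTURES = {
    "Supplier accepts unlimited liability for data loss.": "flag",
    "Liability is capped at 12 months of fees.": "pass",
    "Payment is due within 30 days of invoice.": "pass",
}

failures = run_fixtures(toy_skill, FIXTURES)
print("all fixtures pass" if not failures else f"regressions: {failures}")
```

Running this after every override edit turns "it worked in the demo" into a list that is either empty or names exactly which known input changed behaviour.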
05 The human decision point is a feature
There is a temptation, once a skill is tuned and the fixtures pass, to let it run end to end. Resist it — not because the skill is not good, but because the audit trail needs a name on it.
The well-designed skills in the suite have explicit human decision points. xClause proposes redlines; a person accepts them. xMaverick surfaces a "talk-to-them-Monday" list; a person decides who to call. xRisk flags a sanctions hit as a hard stop; a person — in legal — confirms the match. These are not the skill being incomplete. They are the skill being honest about where machine judgment ends and accountable human judgment begins.
And that named decision point is what the audit trail is made of: who decided, when, on what evidence. An output the model produced is a draft. An output a named person reviewed — against a grounded finding, with the reasoning visible — is a decision. The audit committee can work with decisions. They cannot work with drafts that nobody owns.
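The "who decided, when, on what evidence" record can be sketched as a small immutable structure. This is an assumption about what such a record might hold, not a description of how the skills log decisions: `Decision` and `record_decision` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Decision:
    """Hypothetical audit-trail entry: who decided, when, on what evidence."""
    finding: str
    evidence: List[str]
    reviewer: str
    accepted: bool
    decided_at: datetime


def record_decision(
    finding: str,
    evidence: List[str],
    reviewer: str,
    accepted: bool,
    now: Optional[datetime] = None,
) -> Decision:
    """Refuse to create a decision that nobody owns or that rests on nothing."""
    if not reviewer.strip():
        raise ValueError("a decision needs a named reviewer")
    if not evidence:
        raise ValueError("a decision needs at least one piece of evidence")
    timestamp = now or datetime.now(timezone.utc)
    return Decision(finding, list(evidence), reviewer.strip(), accepted, timestamp)


decision = record_decision(
    finding="Indemnity cap conflicts with playbook 3.2",
    evidence=["playbook 3.2", "contract clause 14.1"],
    reviewer="A. Reviewer",
    accepted=True,
)
print(decision.reviewer, decision.decided_at.isoformat())
```

Raising an error on a blank reviewer is the point of the sketch: without a name, the output stays a draft.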
06 Where the easy part ends
The shape of it is clear enough: trust in an agentic procurement function is not something you wait for the model to earn. It is four properties — grounded, traced, bounded, reviewable — that you build into the architecture around the model.
The skills ship with the building blocks: the playbook seam, the fixtures, the explicit decision points. What they cannot ship is the part that is specific to you — where exactly your decision points sit, what your audit committee actually needs to see, how your risk posture translates into bounds the skill enforces. Designing that governance layer for a real function, with real regulators and a real board, is the work I do hands-on. If you are being asked to scale AI across procurement and the honest blocker is "I can't defend it yet" — that is the conversation. The note below is where it starts.
References
- Ji, Z. et al. (2022). Survey of Hallucination in Natural Language Generation. arXiv:2202.03629
- Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
- Liang, P. et al. (2022). Holistic Evaluation of Language Models. arXiv:2211.09110
