The difference between a skill that runs and a skill you can trust

01 The bar: runs, versus trust

A skill that produces an answer is easy. I could write you a procurement skill this afternoon that classifies a GL export or red-lines an MSA, and it would produce something. It would look the part. It would have a confident tone and a tidy structure.

But "produces something" and "I would put my name on what it produced" are different bars, and the gap between them is the whole job. That gap is trust — and trust is not a feature you bolt on at the end. It is a set of habits the skill has: about saying where its facts come from, about admitting what it does not know, about staying a draft rather than pretending to be a decision.

I went looking for how the most credible teams handle this. Anthropic published claude-for-legal, a full system of skills for legal practice, and I read all of it. Legal and procurement are not the same job, but they share a spine: you are producing work product that someone downstream acts on, and when it is wrong, it is expensive. The bar that system sets (provenance on every citation, a review posture on every output) is the right bar. So I took it and rebuilt the suite around four conventions. None of them change the answer. All of them change whether you can trust it.

02 Provenance: every fact has an address

The most dangerous thing an AI skill can hand you is a confident fact with no address. xRisk tells you a supplier had a credit-rating downgrade. Good — did it read that off the rating agency this morning, or did it half-remember it from training data that went stale eighteen months ago? Those are very different facts, and until now they looked identical on the page.

Every skill that asserts an external fact now tags where that fact came from.

The four provenance tags

Three tags you can cite and move on. One — model-knowledge — the skill flags for a human to check before anyone leans on it.

The rule that matters: anything tagged model-knowledge also carries verify: true and is counted in the output summary. So you see, at a glance, how much of what you are holding is solid and how much needs a human to check it before it moves. The skill is not pretending the two are the same.
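A minimal sketch of that rule, under stated assumptions. Only the tag name model-knowledge comes from the text above. The other three tag names, the Fact class and the summarise helper are hypothetical, as are the sample claims, which exist purely to exercise the code.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical names for the three citable tags; the article names only "model-knowledge".
CITABLE_TAGS = {"live-search", "user-file", "dataset"}
MODEL_KNOWLEDGE = "model-knowledge"


@dataclass(frozen=True)
class Fact:
    claim: str
    provenance: str
    source: str = ""  # URL, file name or dataset id; empty for model knowledge

    def __post_init__(self):
        # Reject untagged or mistyped provenance instead of letting it through silently.
        if self.provenance not in CITABLE_TAGS | {MODEL_KNOWLEDGE}:
            raise ValueError(f"unknown provenance tag: {self.provenance!r}")

    @property
    def verify(self) -> bool:
        # Anything recalled from training data is flagged for a human check.
        return self.provenance == MODEL_KNOWLEDGE


def summarise(facts):
    """Count facts per tag and list the claims that still need verification."""
    counts = Counter(f.provenance for f in facts)
    to_verify = [f.claim for f in facts if f.verify]
    return {"counts": dict(counts), "verify_count": len(to_verify), "to_verify": to_verify}


# Made-up sample claims, for illustration only.
facts = [
    Fact("Supplier A was downgraded", "live-search", "rating agency page (placeholder)"),
    Fact("Supplier A runs two plants", MODEL_KNOWLEDGE),
]
summary = summarise(facts)
print(summary["verify_count"], "of", len(facts), "facts need a human check")
```

Deriving verify from the tag, rather than storing it as a separate field, means a model-knowledge fact cannot be emitted without the flag.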

And the three skills where stale intel does the most damage — xRisk, xMarket, xVendorBench — do not get to emit model-knowledge for their core findings at all. They refuse to run without live search or a recent dataset. That freshness gate was already there; provenance is what makes it visible in the output instead of buried in the skill's conscience.
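One way to picture the freshness gate, as a sketch rather than the suite's actual logic. The 30-day threshold, the FreshnessError name and the returned tag strings are all assumptions; the text above only says "live search or a recent dataset".

```python
from datetime import date, timedelta
from typing import Optional

# Assumed threshold: the article says "recent" without giving a number.
MAX_DATASET_AGE = timedelta(days=30)


class FreshnessError(RuntimeError):
    """Raised when a freshness-gated skill has no current source to read."""


def check_freshness(live_search: bool, dataset_date: Optional[date], today: date) -> str:
    """Return the provenance tag for core findings, or refuse to run."""
    if live_search:
        return "live-search"  # hypothetical tag name
    if dataset_date is not None and today - dataset_date <= MAX_DATASET_AGE:
        return "dataset"  # hypothetical tag name
    # No fallback to model knowledge: the gate fails closed.
    raise FreshnessError("no live search and no recent dataset: refusing to run")
```

Passing today in explicitly, instead of calling date.today() inside the function, keeps the gate easy to test.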

03 Provisional: when the skill is generic

The whole suite runs on a defaults-plus-overrides model — I wrote about that pattern in a separate playbook. Every skill works the moment you unzip it, on a sensible default, and it gets sharp when you drop in your own playbook, your own taxonomy, your own approved-vendor list.
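The defaults-plus-overrides idea can be sketched in a few lines. This assumes a JSON override file and made-up playbook keys; the suite's real file format and field names are not described here, so treat every name below as hypothetical.

```python
import json
from pathlib import Path

# Illustrative defaults; keys and values are invented for this sketch.
DEFAULT_PLAYBOOK = {"payment_terms_days": 60, "auto_renewal": "reject"}


def load_playbook(override_path: Path) -> tuple[dict, bool]:
    """Return (playbook, override_found). Override keys win over defaults."""
    if not override_path.is_file():
        # No override dropped in: run on the generic default position.
        return dict(DEFAULT_PLAYBOOK), False
    overrides = json.loads(override_path.read_text(encoding="utf-8"))
    # Keys absent from the override file keep their default value.
    return {**DEFAULT_PLAYBOOK, **overrides}, True
```

Returning the override_found flag alongside the merged settings lets the caller tell a tuned run from a generic one.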

But there was a gap. If you never dropped in your override, the skill still produced confident output — it just argued a generic buyer's position instead of yours, and nothing on the page told you which one you were holding.

Now it does. Run a skill with no override detected, and the output is stamped [PROVISIONAL]. The structure is sound and the maths is right — but it is a generic position, not yours. Drop in the override and the stamp clears on the next run. xClause already tracked this internally, finding by finding; now the whole suite says it out loud, at the top, where you cannot miss it.
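As a rough illustration of the stamping rule, assuming the skill already knows whether an override was detected. The render_header function and the wording after the stamp are hypothetical; only the [PROVISIONAL] stamp itself comes from the text.

```python
PROVISIONAL_STAMP = "[PROVISIONAL]"


def render_header(skill: str, override_found: bool) -> str:
    """Build the first line of a skill's output."""
    title = f"{skill} output"
    if override_found:
        return title
    # No override detected: say so at the top, where it cannot be missed.
    return f"{PROVISIONAL_STAMP} {title} (generic default position, no override detected)"


print(render_header("xClause", override_found=False))
print(render_header("xClause", override_found=True))
```

Because the stamp is computed on every run, it clears by itself the first time an override is present.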

Provisional is not an error state

A [PROVISIONAL] run is genuinely useful. It is how the skill works on day one, before you have tuned anything, and the default positions are the ones a competent Fortune 500 buyer would hold. The stamp is not a warning that something is broken. It is the skill being honest about which version of itself you are looking at.

04 Decision-support, not the decision

Some of these skills produce output that goes somewhere serious. xSavings feeds a board readout. xESG feeds a CSRD filing. xClause marks up a contract a counterparty's lawyer is going to read line by line. The cost of a quiet error in any of those is not "the model looked silly" — it is a number a CFO defended that does not hold, or a clause that went out soft.

So every skill's output now carries a line that says plainly what it is: decision-support, a draft for human review — not procurement, legal or financial advice. A named person owns the decision, and owns what leaves the building.
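A small sketch of how such a line could be enforced rather than remembered. The review_footer function and the exact wording are assumptions; the point it illustrates is that output is not emitted without a named owner.

```python
def review_footer(owner: str) -> str:
    """Return the decision-support line, refusing to run without a named owner."""
    name = owner.strip()
    if not name:
        raise ValueError("a named decision owner is required")
    return (
        "Decision-support: a draft for human review, "
        "not procurement, legal or financial advice. "
        f"Decision owner: {name}."
    )
```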

That line is not legal cover. It is a description of how the skill should actually be used. The skill is fast, it is thorough, and it will catch things you would miss at four o'clock on a Friday. It is not the CPO, and it is not your counsel. Keeping that boundary explicit on the output is what lets you move quickly on everything inside it.

05 The skill checks its own work

The last convention is the least glamorous and might be the most useful. Every SKILL.md now ends with a Quality checks block — a short, specific list the skill runs through before it emits anything.

It is tailored per skill. xClause confirms the playbook was loaded and quoted, that every finding has a real redline rather than a "review this" placeholder, that the risk scores are calibrated and not all 8s. xRenew confirms every date went through the explicit month-carry-and-day-clamp procedure instead of being freestyled — because models get calendar maths wrong when they improvise. xSpend confirms the category roll-up reconciles to the input file total. xMaverick confirms the talk-to list is framed as root-cause conversations, not accusations.
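The xRenew check names a month-carry-and-day-clamp procedure without spelling it out. One plausible reading, offered as an assumption rather than the skill's actual steps, is below: carry whole months into the year, then clamp the day to the last valid day of the target month.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, carrying into the year and clamping the day."""
    # Month carry: a zero-based month index lets divmod handle year rollover,
    # including negative offsets.
    year_carry, month_index = divmod(start.month - 1 + months, 12)
    year = start.year + year_carry
    month = month_index + 1
    # Day clamp: 31 January plus one month has no 31 February,
    # so fall back to the last day of the target month.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


print(add_months(date(2024, 1, 31), 1))   # 2024-02-29
print(add_months(date(2023, 11, 30), 3))  # 2024-02-29
print(add_months(date(2025, 12, 15), 1))  # 2026-01-15
```

Spelling out both steps is what makes a renewal or notice date reproducible instead of improvised.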

It is the smoke test the skill runs on itself. Think of the difference between an analyst who hands you a model, and an analyst who hands you a model and says: I checked that the totals foot, I walked the edge cases, and here is the one number I am still not sure about. The second one you can work with. The checklist is how every skill becomes the second analyst.
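Two of the checks described above translate naturally into code. This is a hedged sketch: in the suite the checks live as a list inside each SKILL.md, so the function names, the one-cent tolerance and the sample figures here are all assumptions made for illustration.

```python
from decimal import Decimal


def rollup_reconciles(category_totals, input_total, tolerance=Decimal("0.01")):
    """xSpend-style check: the category roll-up must match the input file total."""
    rolled_up = sum(category_totals.values(), Decimal("0"))
    return abs(rolled_up - input_total) <= tolerance


def scores_are_spread(scores):
    """xClause-style check: flag a run where every risk score is identical."""
    return len(scores) < 2 or len(set(scores)) > 1


def run_checks(checks):
    """Take {check name: passed?} and return the names of the checks that failed."""
    return [name for name, passed in checks.items() if not passed]


# Made-up figures, for illustration only.
failed = run_checks({
    "roll-up reconciles to input total": rollup_reconciles(
        {"IT": Decimal("120.50"), "Travel": Decimal("79.50")}, Decimal("200.00")
    ),
    "risk scores are not all the same": scores_are_spread([8, 8, 8]),
})
print(failed)  # ['risk scores are not all the same']
```

Returning the failed check names, rather than a single pass or fail, mirrors the analyst who tells you which number they are still unsure about.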

"The answer was always the easy part. Provenance, a provisional stamp, a review posture, a self-check — none of it changes what the skill says. All of it changes whether you can act on it without re-doing the work yourself." — Why I rebuilt the suite around four conventions

06 What's next: onboarding, then connectors

Two more moves are coming, and I will be honest about the order and the timing.

Onboarding — a setup interview

Right now you tune the suite by editing override files. Powerful — but it is a lot of files. The next release adds a setup interview: answer a handful of questions once, and it writes a single company profile that every skill reads from. Same idea as the override files, with a front door.

Timing: roughly two weeks out.

Connectors — wire it to your systems

Today the skills work on files you hand them. The connectors layer wires them to the systems the data actually lives in — your ERP, your CLM, sanctions lists, commodity feeds. That one will land in a Pro tier, because the value is concrete and the upkeep is real.

Availability: Pro tier.

The free suite stays free, and it stays the whole sixteen skills. The trust layer that shipped today is part of it — not a paid add-on, not a teaser. The point was never to gate the skills. It was to make them worth trusting in the first place. That part is done; the rest is reach.

Already running the suite?

You do not have to do anything. The four conventions live inside the SKILL.md files and the suite README — re-download the bundle, or pull the latest, and your next run picks them up. If you have override files in place, nothing about them changes; the trust layer sits on top.

Martin Bacigal

Founder, ProcureAI

Martin is the founder of ProcureAI and Global Category Manager — IT at Nouryon, where he negotiates the same agentic systems he builds at home. Across Nouryon and Henkel he's booked $16M+ in cumulative IT, SaaS and cybersecurity savings, while leading the global AI capability-building programme that put 350+ procurement professionals across four continents into production with AI workflows.



