Why generic AI plateaus on procurement work

01 The plateau is real

The first time a CPO watches a frontier model draft an RFP, the reaction is some mix of relief and alarm. It's good. It's fast. It's structurally complete. Then they read the second one, and the third, and the shape of the ceiling comes into view: every RFP is competent, and not one of them is theirs. The mandatory requirements are generic. The weightings are textbook. The scoring rubric is what a sharp consultant would propose on day one — not what the category team learned to write after the last three sourcing events went sideways.

This is not a model that needs to be smarter. It is a model that has never been told what you know. And the research community has been clear for years about where that line sits. Brown et al., in the work that introduced few-shot learning at scale, showed that a large model can be pointed at a specific task without retraining it: a handful of the right examples, placed in context at inference time, is often enough (Brown et al., 2020). Lewis et al., introducing retrieval-augmented generation the same year, put it more directly still: a model's parametric memory, everything baked into its weights, gives it fluency, but its ability to "access and precisely manipulate knowledge" remains limited on knowledge-intensive tasks. That gap narrows when you supply non-parametric memory: retrieved, grounded, specific context (Lewis et al., 2020).

Translated out of the literature: the model is the general. Your context is the briefing. A brilliant general with no briefing fights a generic war.

02 Where the value actually accrues

Picture two axes. Along the bottom, how company-specific the task is — from "summarise this contract" on the left to "score this contract against our playbook and our risk appetite" on the right. Up the side, output quality.

[Figure: The capability plateau. Horizontal axis: how company-specific the task is; vertical axis: output quality. The generic model is excellent on the left, and excellent for everyone, which is the problem. As the work gets specific to your function, it bends to average. The override layer is the wedge that lifts the right-hand side back up.]

The generic frontier model traces a particular shape on that chart. On the left — the generic tasks — it is excellent, and it is excellent for everyone, which is exactly the problem: it is a commodity. As you move right, into the work that actually differentiates your function, the line bends down. Not to zero — the model is still articulate, still structurally competent — but to average. To the output a capable outsider would produce. That bend is the plateau.

The override layer is the wedge that lifts the right-hand side back up. It does not make the model smarter; it makes the model informed. And the interesting thing about that wedge is where it sits: entirely on the company-specific side of the chart. The value of an enterprise AI deployment is not spread evenly across the model's capabilities. It is concentrated in the narrow band where your knowledge meets the model's fluency.

03 The thin layer that does the work

Here is what surprises people: the layer that does all of that work is tiny.

xClause's default playbook is a few kilobytes of YAML. A fully tuned version — one carrying a real company's negotiated positions across a dozen clause types — is maybe twice that. It is the smallest file in the skill folder. And it changes every output the skill ever produces.
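To make the size of that file concrete, here is a minimal sketch of how a default playbook and a company override might combine. Everything in it is an assumption for illustration: the clause names, the positions, and the `merge_playbook` helper are hypothetical, and JSON from the standard library stands in for YAML so the example runs without extra packages. It is not xClause's actual file format or loader.

```python
import json
from copy import deepcopy

# Hypothetical default playbook: illustrative clause types and positions only.
DEFAULT_PLAYBOOK = {
    "liability_cap": {"preferred": "200% of annual fees", "fallback": "100% of annual fees"},
    "payment_terms": {"preferred": "net 60", "fallback": "net 45"},
    "termination": {"preferred": "30 days for convenience"},
}

# Hypothetical company override: only the positions that differ from the default.
COMPANY_OVERRIDE = {
    "liability_cap": {"preferred": "300% of annual fees"},
    "data_residency": {"preferred": "EU only", "walk_away": True},
}


def merge_playbook(default: dict, override: dict) -> dict:
    """Return a new playbook in which override values win, key by key."""
    merged = deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge one level deeper.
            merged[key] = merge_playbook(merged[key], value)
        else:
            # New key or plain value: the override replaces the default.
            merged[key] = deepcopy(value)
    return merged


playbook = merge_playbook(DEFAULT_PLAYBOOK, COMPANY_OVERRIDE)
size_bytes = len(json.dumps(playbook).encode("utf-8"))

print(playbook["liability_cap"])
print(f"{size_bytes} bytes")
```

The override carries only what differs from the default, which is why the tuned file stays small: unchanged positions are inherited, and the defaults themselves are never modified.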

That asymmetry (small input, large and consistent effect) is the whole reason the pattern works, and it is well-trodden ground. Retrieval-augmented generation is built on exactly this property: a compact, well-chosen context, supplied at the right moment, moves generation from "fluent and plausible" to "specific and grounded" (Lewis et al., 2020). But the same literature carries the warning. Liu et al. showed that models do not use long context uniformly: information buried in the middle of a sprawling context gets used far less reliably than the same information placed at the beginning or the end (Liu et al., 2023).

The lesson for an enterprise deployment is not "give the model everything." It is "give the model the right thin layer, structured well." Dumping your entire contract-lifecycle system into a prompt is not a context strategy. A tight, curated playbook is.
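One way to picture "the right thin layer, structured well" is the sketch below. It assumes a playbook held as a plain dictionary and uses two hypothetical helpers, `render_briefing` and `build_prompt`; the clause names, the byte budget, and the prompt layout are illustrative choices, not a prescribed format. The briefing is placed first and the task last, so neither sits in the middle of the long document.

```python
# Hypothetical playbook entries, for illustration only.
PLAYBOOK = {
    "liability_cap": {"preferred": "300% of annual fees", "fallback": "100% of annual fees"},
    "payment_terms": {"preferred": "net 60", "fallback": "net 45"},
    "data_residency": {"preferred": "EU only"},
}


def render_briefing(playbook: dict, clause_types: list[str], max_bytes: int = 2048) -> str:
    """Render only the requested clause positions as short, structured lines."""
    lines = ["PLAYBOOK (company positions, apply strictly):"]
    for clause in clause_types:
        positions = playbook.get(clause)
        if positions is None:
            continue  # no company position recorded for this clause type
        details = "; ".join(f"{name}: {value}" for name, value in positions.items())
        lines.append(f"- {clause}: {details}")
    briefing = "\n".join(lines)
    if len(briefing.encode("utf-8")) > max_bytes:
        # Fail loudly: an oversized briefing should be curated, not silently cut.
        raise ValueError("briefing exceeds budget; curate rather than truncate")
    return briefing


def build_prompt(briefing: str, task: str, document: str) -> str:
    """Put the briefing first and the task last, with the long document between."""
    return f"{briefing}\n\nDOCUMENT:\n{document}\n\nTASK:\n{task}"


briefing = render_briefing(PLAYBOOK, ["liability_cap", "data_residency"])
prompt = build_prompt(
    briefing,
    task="Score this contract against the playbook and list every deviation.",
    document="(full contract text goes here)",
)
print(prompt)
```

The budget check is the design choice worth noting: when the briefing outgrows its limit, the sketch refuses to run instead of trimming, which pushes the curation decision back to the people who own the playbook.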

Why this isn't just "fine-tuning"

Retraining a model's weights on your data is expensive, slow, and ties you to a single base model. The override layer does the opposite job: it sits in front of whichever frontier model is best this quarter, costs next to nothing to change, and stays entirely under your control. You are not adapting the model to your company. You are briefing it, every single time it runs.
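The "sits in front of whichever model" idea can be sketched in a few lines. The assumption here is that any model can be treated as a function from prompt text to completion text; `with_briefing`, `vendor_a`, and `vendor_b` are hypothetical stand-ins, and real vendor SDK calls would need a thin adapter to fit this shape.

```python
from typing import Callable

# A "model" here is any function from prompt text to completion text (assumption).
Model = Callable[[str], str]


def with_briefing(model: Model, briefing: str) -> Model:
    """Wrap a model so that every call is prefixed with the same briefing."""
    def briefed(task: str) -> str:
        return model(f"{briefing}\n\nTASK:\n{task}")
    return briefed


# Two stand-in models that echo what they received; placeholders for real APIs.
def vendor_a(prompt: str) -> str:
    return f"[vendor-a] {prompt}"


def vendor_b(prompt: str) -> str:
    return f"[vendor-b] {prompt}"


# Hypothetical briefing text, for illustration only.
BRIEFING = "PLAYBOOK: payment_terms preferred net 60; fallback net 45"

review = with_briefing(vendor_a, BRIEFING)
out_a = review("Flag deviations in clause 7.")

# Swapping vendors changes one argument; the briefing is untouched.
review = with_briefing(vendor_b, BRIEFING)
out_b = review("Flag deviations in clause 7.")

print(out_a)
print(out_b)
```

Nothing about the briefing depends on which model receives it, which is the practical meaning of briefing the model every time it runs instead of adapting its weights.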

04 A moat, not a chore

Most organisations treat the customization layer as a chore — a setup cost, a box to tick before the real value starts. That has it exactly backwards.

The generic model is available to your competitors on the same terms it is available to you. Same weights, same API, same day. If your AI advantage lives in the model, you do not have an advantage — you have a subscription everyone else also has. The advantage lives in the layer the model cannot bring: your negotiated clause positions, your spend taxonomy refined over years, your approved-vendor list, your hard-won sense of which suppliers quietly underperform.

That layer is proprietary by construction. It compounds — every sourcing event, every renewal, every contract teaches it something. And it is portable: it sits in front of whichever frontier model leads this quarter, so you are never betting the function on one vendor's roadmap. The customization layer is not the cost of getting to the value. It is the value.

~4 KB: size of a tuned xClause playbook

100%: share of the skill's outputs it shapes

0: engineers required to write it

05 Where the easy part ends

The argument in this piece is simple enough to state in a sentence: the model is general, the value is in the thin proprietary layer you put in front of it, and that layer is small to write and large in effect.

Simple to state. The part that is not simple — and not a download — is what goes in the layer. Which of your clause positions are genuinely non-standard, and which just feel that way. How to encode a category strategy that currently lives half in a deck and half in three people's heads. How to keep the layer current as the function learns, without it rotting into the same stale system nobody trusts. That work is specific to your house, and it is the work I do hands-on with procurement teams. If that is where you are, the note below is the start of that conversation.

References

Brown, T. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172

Martin Bacigal

Founder, ProcureAI

Martin is the founder of ProcureAI and Global Category Manager — IT at Nouryon, where he negotiates the same agentic systems he builds at home. Across Nouryon and Henkel he's booked $16M+ in cumulative IT, SaaS and cybersecurity savings, while leading the global AI capability-building programme that put 350+ procurement professionals across four continents into production with AI workflows.

