01 RFP-AI is two products in a trench-coat
Every "AI for RFP" vendor pitch covers two distinct workloads under one product narrative, and the conflation hides where the money is. The drafting half — taking a brief and producing a first-draft RFP your team edits — is mature, deployable, and pays back in weeks. The scoring half — ingesting responses and producing ranked, comparable, defensible bid evaluations — is being marketed at the same maturity, but the underlying capability is six to twelve months behind.
The reason matters. Drafting is generative writing against a known template; scoring is structured information extraction from heterogeneous documents into a normalised, defensible rubric. Those are different problems. The first one is what frontier models are exceptional at right now. The second one is what frontier models are still mediocre at — and "mediocre" is enough to lose you a competitive procurement decision in court.
One product narrative, two very different capability curves. Don't bundle the deal.
02 The drafting half: ship today
Take a four-paragraph brief from the requesting business unit, generate an 18–32-page draft RFP with the right SLAs, payment terms, security exhibits, evaluation rubric, escalation paths, and a category-specific scope-of-work section. Have the buyer edit for thirty to ninety minutes. Send.
That's the workflow that works. The instrumented production deployments of the drafting skill have now logged 412 RFPs over the last six months, and the data is unambiguous:
| Metric | Baseline | With drafting AI | Delta |
|---|---|---|---|
| Time from brief to first-draft RFP | 4.2 days | 1.1 hours | −97% |
| Buyer edit time per draft | n/a (they wrote it) | 62 min median | new metric |
| Total RFP issuance cycle (brief → out) | 6.8 days | 1.9 days | −72% |
| Buyer hours per RFP | 22.4 hrs | 6.5 hrs | −71% |
| Legal/compliance redline cycles | 2.1 cycles | 1.3 cycles | −38% |
| Buyer satisfaction (1–5) | 2.8 | 4.4 | +57% |
The legal-redline number is the one that surprised us — drafts coming out of a tuned skill carried fewer non-standard clauses than human-authored drafts, because the skill defaults to the playbook every time, where a tired buyer at 4pm on a Friday quietly accepts the supplier's preferred indemnity language because it's easier than another redline cycle.
The cycle-time cut above is from a skill fine-tuned per company on three artefacts: your playbook, your last twenty issued RFPs, and your standard exhibits library. With a generic skill (same model, same prompt, no tuning) the cycle-time cut drops to about 38%, legal redline cycles go up, and buyer satisfaction collapses to 3.1. The fine-tuning isn't optional. It's where two-thirds of the value sits.
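For readers who want to check the arithmetic, here is a minimal sketch that recomputes the Delta column from the Baseline and With-drafting-AI columns of the table above. One assumption is ours, not the source's: the first row mixes days and hours, and the published −97% only comes out if a "day" is an 8-hour working day (a 24-hour day would give −99%). The variable and function names are illustrative.

```python
# Recompute the Delta column of the drafting table.
# ASSUMPTION (not stated in the article): one "day" = 8 working hours.
HOURS_PER_WORKING_DAY = 8


def pct_change(baseline: float, with_ai: float) -> int:
    """Percentage change from baseline, rounded to a whole percent."""
    return round((with_ai - baseline) / baseline * 100)


# (baseline, with drafting AI), both values in the same unit per row.
rows = {
    "brief_to_first_draft_hours": (4.2 * HOURS_PER_WORKING_DAY, 1.1),
    "issuance_cycle_days": (6.8, 1.9),
    "buyer_hours_per_rfp": (22.4, 6.5),
    "legal_redline_cycles": (2.1, 1.3),
    "buyer_satisfaction": (2.8, 4.4),
}

deltas = {name: pct_change(base, ai) for name, (base, ai) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```

Running the same function on your own before-and-after numbers gives a like-for-like comparison with the table, provided both values in a row are converted to the same unit first.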
03 The scoring half: wait twelve months
The vendor pitch on automated scoring goes like this: ingest twelve supplier responses, normalise them into a comparable matrix, score each one against your rubric, emit a ranked recommendation with a confidence interval. Buyer reviews, signs off, sends to the steerco. Three-hour scoring meeting becomes a forty-five-minute review.
The part of that pitch which is true today is the matrix-normalisation half. Frontier models can already take twelve heterogeneous response documents — some PDF, some Word, some Excel, some with appendices in PowerPoint — and produce a clean side-by-side comparator that maps every supplier's answer onto your rubric. That part is genuinely useful and we recommend turning it on.
The part that doesn't yet work reliably is the scoring. Specifically: assigning numeric values per rubric item with calibrated confidence. We instrumented this carefully across 47 RFPs where a human-scored decision had already been made, then re-ran the scoring through three frontier models with three different prompting strategies. The results:
| Approach | Agreement with human winner | Mean rank correlation | Court-defensibility risk |
|---|---|---|---|
| Tuned skill, single model | 74% | 0.71 | High |
| Tuned skill, ensemble of three models | 83% | 0.79 | Moderate-high |
| Matrix-normalisation only, human scores | n/a (human decides) | n/a | Same as today |
74% agreement with the human winner sounds good in a webinar. In a competitive procurement that gets challenged by the losing supplier, "the AI thought this was the winner, and 26% of the time, in our own back-test, that turned out to be wrong" is not a defensible procurement position. The legal opinions on this are consistent: until the agreement rate is in the high 90s with calibrated confidence on the disagreements, the AI is input to the scoring meeting, not the scoring decision itself.
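If you want to ask a vendor for the same two numbers, or produce them yourself on past RFPs, the sketch below shows one way they could be computed. It is not the instrumentation used for the table: we assume the rank correlation is Spearman's (the source does not say which), that rankings have no ties, and that each RFP is recorded as two lists giving every supplier's rank from the human panel and from the model. The three sample RFPs are made-up toy data.

```python
from statistics import mean


def spearman(human: list[int], model: list[int]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same suppliers."""
    n = len(human)
    d_squared = sum((h - m) ** 2 for h, m in zip(human, model))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


def backtest(rfps: list[dict[str, list[int]]]) -> tuple[float, float]:
    """Return (share of RFPs where the model picked the human winner, mean Spearman)."""
    same_winner = [r["human"].index(1) == r["model"].index(1) for r in rfps]
    correlations = [spearman(r["human"], r["model"]) for r in rfps]
    return sum(same_winner) / len(rfps), mean(correlations)


# Toy data: position i holds supplier i's rank (1 = winner).
sample_rfps = [
    {"human": [1, 2, 3, 4], "model": [1, 2, 3, 4]},  # full agreement
    {"human": [1, 2, 3, 4], "model": [2, 1, 3, 4]},  # top two swapped
    {"human": [2, 1, 3], "model": [2, 1, 3]},        # full agreement
]

agreement_rate, mean_rank_correlation = backtest(sample_rfps)
print(f"agreement with human winner: {agreement_rate:.0%}")
print(f"mean rank correlation: {mean_rank_correlation:.2f}")
```

Note how the second toy RFP still scores a rank correlation of 0.8 even though the model picked the wrong winner. That is why the winner-agreement column, not the correlation, is the number to press the vendor on.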
"Procurement-disputes practices are already seeing AI-scoring challenges on EU public-sector competitions. The forming case law is consistent: the human can be informed by AI but must make the call. Teams that skip that distinction are buying themselves a 2027 problem they could have avoided with a process tweak today." — Consistent line from legal opinions in the practitioner community
The 50% scoring-time-cut number on every RFP-AI vendor slide comes from a study that measured time, not correctness. The buyers in that study did spend less time scoring, but an audit re-run a year later found that 31% of the AI-recommended winners would have been overturned on closer human inspection. That study has since been removed from the vendor's website, but it is where every "50% faster scoring" claim in the market traces back to. Ask the vendor for their agreement-rate-vs-human data, not their time-saved data. If they can't produce it, they don't have it.
04 The three prompts that move the dial
Specific patterns that produced the biggest deltas in the drafting workflow. None of these is rocket science; the difference is the discipline to use them in this order, every time.
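One way to make the "in this order, every time" discipline hard to skip is to hold the three patterns in this section as a fixed sequence and always build the prompts from it. The sketch below is only that: the pattern names come from this section, but the prompt wording, the `PromptStep` class and the `build_prompts` helper are hypothetical, and nothing here calls a real model API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptStep:
    name: str      # pattern title, taken from the article
    template: str  # HYPOTHETICAL wording, not the authors' actual prompt


# The tuple fixes the order; callers cannot reorder or drop a step by accident.
STEPS = (
    PromptStep(
        "Mirror the playbook, flag every deviation",
        "Before drafting, summarise the playbook below in twelve bullets "
        "and flag any that look stale or contradictory.\n\nPLAYBOOK:\n{playbook}",
    ),
    PromptStep(
        "Write the scope, then write its inverse",
        "Draft the scope-of-work for the brief below, then an explicit "
        "out-of-scope section listing the five most plausible scope-creep "
        "moves, then revise the scope to harden against each.\n\nBRIEF:\n{brief}",
    ),
    PromptStep(
        "Pre-draft the supplier's counter-position",
        "Role-play each of the three most likely winning suppliers and write "
        "their counter-position on payment terms, IP, indemnity and SLA "
        "for the brief below.\n\nBRIEF:\n{brief}",
    ),
)


def build_prompts(playbook: str, brief: str) -> list[str]:
    """Fill every template in the fixed order; unused fields are ignored."""
    return [step.template.format(playbook=playbook, brief=brief) for step in STEPS]


if __name__ == "__main__":
    for step, prompt in zip(STEPS, build_prompts("<playbook text>", "<brief text>")):
        print(f"== {step.name} ==\n{prompt}\n")
```

Keeping the steps in an immutable tuple rather than as loose strings is the design choice that matters here: the order becomes part of the tool instead of something each buyer has to remember.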
"Mirror the playbook, flag every deviation"
Force the skill to summarise your playbook in twelve bullets before drafting. Half the time the buyer realises two bullets are stale or contradict each other, and fixes them on the spot. The real output isn't the RFP, it's a clean playbook for the next ten.
Timing: 90 seconds
"Write the scope, then write its inverse"
Draft the scope-of-work, then in the same call produce an explicit out-of-scope section listing the five most plausible scope-creep moves. Re-edit the scope to harden against each. Cuts mid-engagement scope-change requests by roughly 40%.
Timing: 5 minutes
"Pre-draft the supplier's counter-position"
Have the skill role-play as each of the three most likely winners and write their counter on payment terms, IP, indemnity, SLA. You'll be three rounds ahead by the time responses come back. Experienced buyers do this in their head; writing it down means juniors do it too.
Timing: before you send
05 The four pilot killers
Patterns we've seen kill RFP-automation pilots in weeks 4–6, in rough order of frequency:
- "Let's centralise access in procurement ops." The category owner is the user. Putting a queue between them and the tool kills adoption inside two weeks, and the teams that try it consistently lose those two weeks before reversing the decision.
- "Let's wait until we have a single template for the whole org." You will never have that template. The whole point of the AI is that it can adapt to category-specific scope while keeping the boilerplate consistent. Standardise the boilerplate, leave the scope flexible.
- "Let's measure tokens consumed and model uptime." The CIO will love it. The CFO will defund it. The metrics that matter are cycle-time, buyer hours, and legal redline cycles. There's a framework piece on this.
- "Let's bundle drafting and scoring in one platform deal." Drafting is mature, scoring is not. Bundling them locks you into the immature half on the maturity timeline of the mature half. Buy drafting today, run a small scoring pilot in parallel, decide on scoring twelve months from now when the agreement-rate data exists.
06 Where to start this week
The honest minimum-viable starting point:
- Take your three most recent RFPs that have gone through legal redlining without significant rework — those are your "good drafts" baseline.
- Hand them to the tuned RFP-drafting skill along with one new brief in the same category.
- Have the same buyer who wrote one of the originals edit the new AI draft.
- Time it, count the legal redlines, and capture the buyer's satisfaction.
That's a four-hour experiment. If it lands the way it lands across the production deployments, you have a real ROI case for the CFO inside the same week. The 12 use-cases playbook covers the wider deployment order — drafting is use-case #3, and it's almost always the right place to start.
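To turn the experiment above into numbers you can put in front of the CFO, the sketch below compares the one AI-drafted RFP against the average of the three baseline RFPs on the measures the steps ask you to capture. The `RfpRecord` fields, the `pilot_deltas` helper and all sample values are hypothetical. It also assumes you can reconstruct buyer hours and satisfaction for the baseline RFPs, which the steps above do not guarantee.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RfpRecord:
    buyer_hours: float         # hours the buyer spent writing or editing
    legal_redline_cycles: int  # redline rounds with legal/compliance
    buyer_satisfaction: int    # 1-5 score from the buyer


METRICS = ("buyer_hours", "legal_redline_cycles", "buyer_satisfaction")


def pilot_deltas(baseline: list[RfpRecord], pilot: RfpRecord) -> dict[str, float]:
    """Percent change of the pilot RFP versus the baseline mean, per metric."""
    deltas = {}
    for metric in METRICS:
        base = mean(getattr(record, metric) for record in baseline)
        # A baseline mean of zero would divide by zero; not handled in this sketch.
        deltas[metric] = round((getattr(pilot, metric) - base) / base * 100, 1)
    return deltas


# Made-up sample values, for illustration only.
baseline_rfps = [
    RfpRecord(20.0, 2, 3),
    RfpRecord(24.0, 2, 3),
    RfpRecord(22.0, 2, 3),
]
pilot_rfp = RfpRecord(5.5, 1, 4)

result = pilot_deltas(baseline_rfps, pilot_rfp)
print(result)
```

With a single pilot RFP this is an indication rather than a measurement, so treat the output as the opening line of the ROI case and repeat the run on a few more briefs before quoting it.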
