01 How I ran it
Four workloads, three models, three runs each, blind scoring by two procurement leaders and one in-house generative-AI engineer per run. Every input was a real procurement artefact, anonymised. Every output was scored against a published rubric. The exact dataset and rubric are in the methodology appendix; the headline rubric weightings are correctness 50%, defensibility 25%, completeness 15%, and format-readiness 10%.
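As a minimal sketch of how those weightings combine into a composite score: the weights are the ones stated above, while the function and dictionary names are illustrative. It also assumes a plain weighted average over the averaged sub-scores. If scores were aggregated per run and per scorer first, a recomputed composite may differ from a published one by a few tenths.

```python
# Rubric weights as stated in the article; all names here are illustrative.
WEIGHTS = {
    "correctness": 0.50,
    "defensibility": 0.25,
    "completeness": 0.15,
    "format_readiness": 0.10,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted average of the four rubric dimensions (0-100 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Gemini 3.1 Pro sub-scores from the Task 1 (RFP drafting) table.
gemini_rfp = {
    "correctness": 90.2,
    "defensibility": 88.4,
    "completeness": 87.6,
    "format_readiness": 91.0,
}
print(round(composite(gemini_rfp), 1))  # 89.4
```

Because correctness carries half the weight, a one-point correctness gap moves the composite as much as a five-point format-readiness gap.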
Cost is measured at end-of-April 2026 published list rates with a 25% prompt-cache discount applied to the long-context tasks where caching demonstrably reduced the cost. Latency is measured at the 95th percentile end-to-end including tool-use round-trips. All three models were given identical instruction sets per task; the only delta was the model swap.
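The two measurement conventions above can be sketched as follows. The 25% discount is from the text. The nearest-rank percentile method is an assumption, since the text does not say which method was used, and the sample values are hypothetical rather than benchmark measurements.

```python
import math

CACHE_DISCOUNT = 0.25  # the 25% prompt-cache discount described above


def discounted_cost(list_cost_usd: float, cache_applies: bool) -> float:
    """List-rate cost, reduced by the cache discount where caching applied."""
    if cache_applies:
        return list_cost_usd * (1 - CACHE_DISCOUNT)
    return list_cost_usd


def p95(latencies_s: list[float]) -> float:
    """95th percentile by the nearest-rank method (an assumed choice)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds: 1.0, 2.0, ..., 20.0.
samples = [float(n) for n in range(1, 21)]
print(p95(samples))                  # 19.0
print(discounted_cost(2.00, True))   # 1.5
print(discounted_cost(2.00, False))  # 2.0
```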
I didn't run open-weight models (DeepSeek V4, Llama 4.5, Mistral Frontier). They have real cost advantages, but the procurement workloads I tested require either tool-use reliability or long-context fidelity, and on both counts open-weight models trail the frontier by roughly six to nine months as of writing. I'll do a separate open-weight benchmark in Q3.
02 Headline results
| Task | Accuracy winner | Cost winner | Best overall |
|---|---|---|---|
| RFP drafting from a brief | Opus 4.7 (94.1) | GPT-5.5 ($0.18/draft) | GPT-5.5 |
| MSA clause extraction | Opus 4.7 (97.2) | Gemini 3.1 Pro | Opus 4.7 |
| Scope-3 inference (tool-use heavy) | GPT-5.5 (88.4) | GPT-5.5 | GPT-5.5 |
| Supplier risk synthesis (long context) | Opus 4.7 (95.6) | Gemini 3.1 Pro | Opus 4.7 |
| Multilingual contract (CN/JP/KR) | Gemini 3.1 Pro (91.0) | Gemini 3.1 Pro | Gemini 3.1 Pro |
03 Task 1: RFP drafting from a one-paragraph brief
The most commercial of the four workloads — and the one where most procurement desks will deploy first. Input: a one-paragraph brief for a real category, plus the company's playbook, exhibits library, and three reference RFPs. Output: a complete draft RFP for human review.
| Model | Correctness | Defensibility | Completeness | Format | Composite | Cost/draft |
|---|---|---|---|---|---|---|
| Opus 4.7 | 96.1 | 93.2 | 92.0 | 94.5 | 94.1 | $0.41 |
| GPT-5.5 | 93.0 | 91.8 | 89.3 | 92.7 | 92.0 | $0.18 |
| Gemini 3.1 Pro | 90.2 | 88.4 | 87.6 | 91.0 | 89.4 | $0.22 |
Take. Opus 4.7 produces the best draft, but the 2.1-point gap over GPT-5.5 disappears after one round of human editing — and the cost differential is real. For RFP drafting at scale, GPT-5.5 is the correct routing default; reserve Opus 4.7 for the regulated-category drafts where the additional defensibility matters (legal, public-sector, healthcare, financial services).
Where the models diverged
The biggest delta showed up in handling scope-creep prevention. Opus 4.7 spontaneously produced a "what is out of scope" section in 11 of 12 drafts; GPT-5.5 produced it in 6 of 12 without explicit instruction; Gemini 3.1 Pro in 4 of 12. With explicit instruction (the prompt-2 pattern from the RFP automation guide), all three produced it reliably. The lesson: for drafting at scale, prompt for the boundary-setting explicitly even on the strongest model.
04 Task 2: MSA clause extraction (long-context fidelity)
Input: a 78-page MSA with 14 amendments. Output: structured extraction of renewal date, auto-renew window, indemnity cap, IP-assignment language, data-processing addendum status, SLA structure, payment terms, change-control clauses, and the top five clauses that deviate from a standard playbook. This is the long-context task — every model's weak spot, but the area where Opus 4.7's design choices show through.
| Model | Composite | Misses (per 100 clauses) | Hallucination rate | Cost/MSA |
|---|---|---|---|---|
| Opus 4.7 | 97.2 | 0.8 | 0.2% | $1.84 |
| GPT-5.5 | 91.4 | 3.1 | 1.1% | $1.42 |
| Gemini 3.1 Pro | 88.6 | 4.7 | 1.4% | $0.98 |
Take. Opus 4.7 is the clear winner and it isn't close. The hallucination delta — 0.2% vs 1.1% vs 1.4% — is the dispositive number. In contract work, a 1% hallucination rate means roughly one out of every hundred extracted clauses contains content that isn't actually in the contract. That is unacceptable for production legal work; the rework cost dwarfs the model-cost saving.
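To make the rework argument concrete, here is a rough break-even calculation. Rates and per-MSA costs come from the table above. The clause count per MSA is a hypothetical placeholder, and treating the hallucination rate as a per-clause probability is a simplifying assumption.

```python
# Hallucination rates and per-MSA costs from the Task 2 table.
MODELS = {
    "Opus 4.7": {"halluc_rate": 0.002, "cost_per_msa": 1.84},
    "GPT-5.5": {"halluc_rate": 0.011, "cost_per_msa": 1.42},
    "Gemini 3.1 Pro": {"halluc_rate": 0.014, "cost_per_msa": 0.98},
}


def expected_hallucinated(model: str, clauses: int) -> float:
    """Expected number of extracted clauses with content not in the contract."""
    return MODELS[model]["halluc_rate"] * clauses


def break_even_rework_cost(cheaper: str, clauses: int,
                           baseline: str = "Opus 4.7") -> float:
    """Rework cost per hallucinated clause at which the cheaper model's
    per-MSA saving over the baseline is fully consumed."""
    saving = MODELS[baseline]["cost_per_msa"] - MODELS[cheaper]["cost_per_msa"]
    extra = (expected_hallucinated(cheaper, clauses)
             - expected_hallucinated(baseline, clauses))
    return saving / extra


CLAUSES_PER_MSA = 200  # hypothetical; the article does not state a clause count

for name in ("GPT-5.5", "Gemini 3.1 Pro"):
    print(name, round(break_even_rework_cost(name, CLAUSES_PER_MSA), 2))
# GPT-5.5 0.23
# Gemini 3.1 Pro 0.36
```

Under that assumed clause count, the cheaper model's saving is gone as soon as fixing one hallucinated clause costs more than about $0.23 (GPT-5.5) or $0.36 (Gemini 3.1 Pro), which is far below the cost of even a few minutes of reviewer time.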
"On the long contract work the cost discussion is theatre. If the cheaper model produces a clause that isn't there, you've spent the savings four times over arguing with the supplier." — Recurring feedback from in-house counsel reviewing contract-AI outputs
05 Task 3: Scope-3 inference (tool-use heavy)
Input: a top-100 supplier list with names, country, industry-code, and approximate annual spend. Output: an inferred Scope-3 emissions estimate per supplier with confidence interval, citing the public-disclosure sources used. This is a tool-use-heavy workload — the model has to invoke web search, retrieve filings, extract structured data, and synthesise.
| Model | Composite | Tool-use success | Citation accuracy | Cost/supplier |
|---|---|---|---|---|
| GPT-5.5 | 88.4 | 96.7% | 91.2% | $0.07 |
| Opus 4.7 | 85.1 | 92.4% | 94.8% | $0.14 |
| Gemini 3.1 Pro | 78.3 | 89.6% | 85.7% | $0.09 |
Take. GPT-5.5's tool-use reliability is meaningfully ahead — 96.7% successful round-trips vs 92.4% — and at half the per-supplier cost, it's the routing default for any tool-heavy workflow. Opus 4.7's higher citation accuracy is the one place to be careful: if the use-case is audit-defensibility-heavy (CSRD reporting, supplier-disclosure attestation), the citation-accuracy delta probably matters more than the tool-success delta, and Opus 4.7 wins. As covered in the 12 use-cases playbook, the broader question is whether Scope-3 inference is audit-defensible at all today — short answer: not yet.
06 Task 4: Supplier risk synthesis (long context, multiple sources)
Input: a single top-50 supplier across 90 days of news, financial filings, ESG disclosures, sub-tier-supplier graph snapshots, and three industry-analyst reports. Output: a one-page synthesis with calibrated risk score, top-three risk drivers, and a recommended buyer action with a written rationale.
| Model | Composite | Useful coverage | False-flag rate | Cost/synthesis |
|---|---|---|---|---|
| Opus 4.7 | 95.6 | 97.1% | 2.4% | $0.61 |
| GPT-5.5 | 89.2 | 91.4% | 5.8% | $0.32 |
| Gemini 3.1 Pro | 83.5 | 87.0% | 7.1% | $0.29 |
Take. Opus 4.7 is the right tool for this job. The false-flag rate matters: in the supplier-news brief described in the 12 use-cases playbook, the gap between a 2.4% and a 5.8% false-flag rate is the difference between "category owners trust it" and "category owners stop reading it by week 3". The roughly 2× cost premium over GPT-5.5 ($0.61 vs $0.32 per synthesis) is bought back many times over by the adoption-rate delta.
Gemini 3.1 Pro trailed on every English-language workload in this set. It is, however, the dominant model on multilingual contract work — particularly Chinese, Japanese and Korean supplier disclosures. For desks with significant APAC exposure, Gemini 3.1 Pro is the right routing choice for that subset of the work even though it loses every English benchmark in this piece.
07 The routing layer: what I actually deploy
This is the routing logic used by the ProcureAI suite. It is open-sourced in the repo; copy it if you're building your own. The architecture is simple and the simplicity is the point:
Don't pick a model. Pick a router, write down the rules, treat the router as a first-class deployment artefact.
- If task is RFP drafting: GPT-5.5 default; route to Opus 4.7 if category is regulated (defined list).
- If task is contract clause extraction: Opus 4.7 always.
- If task is tool-use-heavy research synthesis: GPT-5.5 default; fall back to Opus 4.7 only if citation-accuracy is the dispositive metric.
- If task is long-context multi-source synthesis (risk, category strategy): Opus 4.7 always.
- If input language is CN / JP / KR: Gemini 3.1 Pro always.
- If estimated cost > $5 per call: insert a cheaper-model preview pass; only escalate to frontier if the cheaper model flags uncertainty.
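The rules above can be sketched as a small routing function. This is an illustrative reading of the list, not the code from the ProcureAI repo: the type names, task labels, and language codes are hypothetical, and the regulated-category set stands in for the "defined list". The rules do not say which wins when they overlap, so checking language first is an assumption.

```python
from dataclasses import dataclass

OPUS, GPT, GEMINI = "Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"

# Hypothetical stand-in for the "defined list" of regulated categories.
REGULATED = {"legal", "public-sector", "healthcare", "financial-services"}
CJK_LANGUAGES = {"zh", "ja", "ko"}  # CN / JP / KR inputs
PREVIEW_THRESHOLD_USD = 5.00


@dataclass
class Task:
    kind: str  # "rfp_draft", "clause_extraction", "tool_research", "long_context_synthesis"
    language: str = "en"
    category: str = ""
    citation_critical: bool = False
    est_cost_usd: float = 0.0


@dataclass
class Route:
    model: str
    preview_first: bool = False  # run a cheaper-model preview pass before escalating


def route(task: Task) -> Route:
    # Assumption: the language rule takes precedence over the task rules.
    if task.language in CJK_LANGUAGES:
        model = GEMINI
    elif task.kind == "rfp_draft":
        model = OPUS if task.category in REGULATED else GPT
    elif task.kind == "clause_extraction":
        model = OPUS
    elif task.kind == "tool_research":
        model = OPUS if task.citation_critical else GPT
    elif task.kind == "long_context_synthesis":
        model = OPUS
    else:
        raise ValueError(f"no routing rule for task kind {task.kind!r}")
    return Route(model, preview_first=task.est_cost_usd > PREVIEW_THRESHOLD_USD)


print(route(Task(kind="rfp_draft", category="healthcare")))
# Route(model='Opus 4.7', preview_first=False)
print(route(Task(kind="long_context_synthesis", est_cost_usd=6.5)))
# Route(model='Opus 4.7', preview_first=True)
```

Raising on an unknown task kind, rather than silently defaulting to a model, keeps the rule set explicit: a new workload has to be given a written rule before it can be routed.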
08 The two prompts that closed the gap
This is where the leaderboard changed once prompting was tuned per model. Two prompt patterns moved every metric; one meta-rule made both of them work.
"Show your work, citation-first"
Force the model to emit citations before the prose summary. This lifted GPT-5.5 citation accuracy on Scope-3 from 91.2% to 94.6%, all but closing Opus 4.7's lead (94.8%) on the one dimension where it had a clear edge. There was no measurable effect on Opus 4.7, which suggests it already does this internally.
+3.4pp on GPT-5.5
"Confidence interval per claim, not per output"
Force the model to assign a confidence interval to each clause / claim / inference, not the overall output. Cut hallucination rates by ~50% on Opus 4.7 and ~33% on GPT-5.5. The single most underused prompt pattern in procurement-AI deployments I've audited.
−50% hallucinations
Don't pick a model. Pick a router
The meta-rule. Pick a routing layer, write down the rules, treat the router as a first-class deployment artefact. The free ProcureAI suite ships with this routing pre-wired across all 16 skills — override files let you tune per task per company.
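The two prompt patterns in this section might look like the following in practice. This is a sketch under stated assumptions: the prompt wording, the JSON output shape, the 0.8 review threshold, and the example reply are all hypothetical, since the exact prompts are not reproduced here.

```python
import json

# Hypothetical wording for the two patterns.
CITATION_FIRST = (
    "First list every source you rely on as a numbered citation. "
    "Only then write the summary, referring to citations by number."
)
PER_CLAIM_CONFIDENCE = (
    "For each clause, claim, or inference, give its own confidence interval "
    "as [low, high] between 0 and 1. Do not give one overall confidence."
)
OUTPUT_SHAPE = (
    'Reply as JSON: {"citations": [...], "claims": [{"text": ..., '
    '"cites": [...], "confidence": [low, high]}], "summary": ...}'
)


def build_prompt(task_instructions: str) -> str:
    """Append both patterns and the expected output shape to the task prompt."""
    return "\n\n".join(
        [task_instructions, CITATION_FIRST, PER_CLAIM_CONFIDENCE, OUTPUT_SHAPE]
    )


def claims_needing_review(raw_reply: str, min_low: float = 0.8) -> list[str]:
    """Claims whose lower confidence bound is under min_low, or that cite nothing."""
    reply = json.loads(raw_reply)
    flagged = []
    for claim in reply["claims"]:
        low, _high = claim["confidence"]
        if low < min_low or not claim["cites"]:
            flagged.append(claim["text"])
    return flagged


# Hypothetical model reply, for illustration only.
example_reply = json.dumps({
    "citations": ["MSA section 12.3", "Amendment 4"],
    "claims": [
        {"text": "Indemnity cap is 2x annual fees", "cites": [1], "confidence": [0.90, 0.99]},
        {"text": "Auto-renew window is 60 days", "cites": [], "confidence": [0.85, 0.95]},
        {"text": "DPA is in force", "cites": [2], "confidence": [0.50, 0.80]},
    ],
    "summary": "Three clauses extracted.",
})
print(claims_needing_review(example_reply))
# ['Auto-renew window is 60 days', 'DPA is in force']
```

Scoring confidence per claim gives a reviewer a short list to check instead of a single number for the whole output, which is the practical point of the second pattern.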
the takeaway