Benchmark · 16 min read · Updated May 2026

Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro on real procurement tasks

I ran the three frontier models on a calibrated set: RFP drafting from a brief, MSA clause extraction, Scope-3 inference, supplier risk synthesis. Below: where each model wins on accuracy, where each wins on cost, and the prompts that closed the gap where the leader changed.

TL;DR

There is no winner. There are four workloads, three models, and a routing decision per workload. Opus 4.7 wins on long-context clause extraction and supplier risk synthesis. GPT-5.5 wins on cost-per-acceptable-RFP-draft and on the Scope-3 inference task (where its tool-use reliability is meaningfully ahead). Gemini 3.1 Pro wins on multilingual contract work and is the only model that handles certain table-heavy Asian-vendor disclosures without hand-holding. The ProcureAI suite routes per task and exposes the routing layer in the open — copy it if you're building your own.

Model naming note

The model identifiers used throughout (Opus 4.7, GPT-5.5, Gemini 3.1 Pro) reflect the version strings in my testing pipeline as of May 2026. Vendors update version numbers frequently — cross-reference the exact model you're routing to against the vendor's current API documentation. The methodology, rubric, and routing logic hold regardless of the version suffix.

01 How I ran it

Four workloads, three models, three runs each, blind scoring by two procurement leaders and one in-house generative-AI engineer per run. Every input was a real procurement artefact, anonymised. Every output was scored against a published rubric. The exact dataset and rubric are in the methodology appendix; the headline rubric weightings: correctness 50%, defensibility 25%, completeness 15%, format-readiness 10%.
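As a minimal sketch of how the headline weightings combine into one composite score: the function name, the dictionary keys, and the example scores are illustrative assumptions, not taken from the methodology appendix, and the sketch does not show how scores are averaged across runs and scorers.

```python
# Rubric weights as stated in the text; the key names are illustrative.
RUBRIC_WEIGHTS = {
    "correctness": 0.50,
    "defensibility": 0.25,
    "completeness": 0.15,
    "format_readiness": 0.10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four rubric dimensions, each scored 0-100."""
    missing = RUBRIC_WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[k] * scores[k] for k in RUBRIC_WEIGHTS)


# Hypothetical scores from a single scorer, for illustration only.
example = {
    "correctness": 90.0,
    "defensibility": 80.0,
    "completeness": 70.0,
    "format_readiness": 60.0,
}
print(round(composite_score(example), 2))
```

Because correctness carries half the weight, a model that is polished but wrong cannot score well on this rubric.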

Cost is measured at end-of-April 2026 published list rates with a 25% prompt-cache discount applied to the long-context tasks where caching demonstrably reduced the cost. Latency is measured at the 95th percentile end-to-end including tool-use round-trips. All three models were given identical instruction sets per task; the only delta was the model swap.
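The two measurement conventions can be sketched as follows. This is a simplification under stated assumptions: the 25% discount is applied to the whole per-call list cost, the 95th percentile uses the nearest-rank method, and the function names and sample numbers are hypothetical.

```python
CACHE_DISCOUNT = 0.25  # the 25% prompt-cache discount described in the text


def effective_cost(list_cost_usd: float, cached: bool) -> float:
    """List-rate cost, reduced by the cache discount when caching applies.

    Assumption: the discount applies to the whole call cost.
    """
    return list_cost_usd * (1 - CACHE_DISCOUNT) if cached else list_cost_usd


def p95(latencies_s: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical numbers, for illustration only.
print(effective_cost(2.00, cached=True))
print(p95([float(s) for s in range(1, 21)]))
```

Measuring latency end-to-end means each sample should cover the whole task, including every tool-use round-trip, not a single model call.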

What I deliberately didn't test

I didn't run open-weight models (DeepSeek V4, Llama 4.5, Mistral Frontier). They have real cost advantages, but the procurement workloads I tested require either tool-use reliability or long-context fidelity, and on both of those open-weight models trail the frontier by roughly six to nine months as of writing. I'll do a separate open-weight benchmark in Q3.

02 Headline results

| Task | Accuracy winner | Cost winner | Best overall |
| --- | --- | --- | --- |
| RFP drafting from a brief | Opus 4.7 (94.1) | GPT-5.5 ($0.18/draft) | GPT-5.5 |
| MSA clause extraction | Opus 4.7 (97.2) | Gemini 3.1 Pro | Opus 4.7 |
| Scope-3 inference (tool-use heavy) | GPT-5.5 (88.4) | GPT-5.5 | GPT-5.5 |
| Supplier risk synthesis (long context) | Opus 4.7 (95.6) | Gemini 3.1 Pro | Opus 4.7 |
| Multilingual contract (CN/JP/KR) | Gemini 3.1 Pro (91.0) | Gemini 3.1 Pro | Gemini 3.1 Pro |
3 / 5: workloads where Opus 4.7 leads on accuracy
$0.18: GPT-5.5 cost per RFP draft
×4.2: cost spread, best vs worst, same task

03 Task 1: RFP drafting from a one-paragraph brief

The most commercial of the four workloads — and the one where most procurement desks will deploy first. Input: a one-paragraph brief for a real category, plus the company's playbook, exhibits library, and three reference RFPs. Output: a complete draft RFP for human review.

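| Model | Correctness | Defensibility | Completeness | Format | Composite | Cost/draft |
| --- | --- | --- | --- | --- | --- | --- |
| Opus 4.7 | 96.1 | 93.2 | 92.0 | 94.5 | 94.1 | $0.41 |
| GPT-5.5 | 93.0 | 91.8 | 89.3 | 92.7 | 92.0 | $0.18 |
| Gemini 3.1 Pro | 90.2 | 88.4 | 87.6 | 91.0 | 89.4 | $0.22 |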

Take. Opus 4.7 produces the best draft, but the 2.1-point gap over GPT-5.5 disappears after one round of human editing — and the cost differential is real. For RFP drafting at scale, GPT-5.5 is the correct routing default; reserve Opus 4.7 for the regulated-category drafts where the additional defensibility matters (legal, public-sector, healthcare, financial services).

Where the models diverged

The biggest delta showed up in handling scope-creep prevention. Opus 4.7 spontaneously produced a "what is out of scope" section in 11 of 12 drafts; GPT-5.5 produced it in 6 of 12 without explicit instruction; Gemini 3.1 Pro in 4 of 12. With explicit instruction (the prompt-2 pattern from the RFP automation guide), all three produced it reliably. The lesson: for drafting at scale, prompt for the boundary-setting explicitly even on the strongest model.

04 Task 2: MSA clause extraction (long-context fidelity)

Input: a 78-page MSA with 14 amendments. Output: structured extraction of renewal date, auto-renew window, indemnity cap, IP-assignment language, data-processing addendum status, SLA structure, payment terms, change-control clauses, and the top five clauses that deviate from a standard playbook. This is the long-context task — every model's weak spot, but the area where Opus 4.7's design choices show through.

| Model | Composite | Misses (per 100 clauses) | Hallucination rate | Cost/MSA |
| --- | --- | --- | --- | --- |
| Opus 4.7 | 97.2 | 0.8 | 0.2% | $1.84 |
| GPT-5.5 | 91.4 | 3.1 | 1.1% | $1.42 |
| Gemini 3.1 Pro | 88.6 | 4.7 | 1.4% | $0.98 |

Take. Opus 4.7 is the clear winner and it isn't close. The hallucination delta — 0.2% vs 1.1% vs 1.4% — is the dispositive number. In contract work, a 1% hallucination rate means roughly one out of every hundred extracted clauses contains content that isn't actually in the contract. That is unacceptable for production legal work; the rework cost dwarfs the model-cost saving.
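To make the hallucination delta concrete, here is a rough sketch that turns the reported rates into expected counts at a given extraction volume. The 5,000-clause volume is a hypothetical figure chosen for illustration, and the calculation assumes the rate applies uniformly per extracted clause.

```python
# Hallucination rates from the MSA results, expressed as fractions.
HALLUCINATION_RATE = {
    "Opus 4.7": 0.002,
    "GPT-5.5": 0.011,
    "Gemini 3.1 Pro": 0.014,
}


def expected_hallucinated(clauses_extracted: int, rate: float) -> float:
    """Expected number of extracted clauses containing invented content."""
    return clauses_extracted * rate


# Hypothetical annual volume, for illustration only.
VOLUME = 5_000
for model, rate in HALLUCINATION_RATE.items():
    count = expected_hallucinated(VOLUME, rate)
    print(f"{model}: about {count:.0f} suspect clauses per {VOLUME:,} extracted")
```

At that volume the sketch gives about 10 suspect clauses for Opus 4.7 against about 55 for GPT-5.5 and about 70 for Gemini 3.1 Pro, each of which someone has to find and correct.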

"On the long contract work the cost discussion is theatre. If the cheaper model produces a clause that isn't there, you've spent the savings four times over arguing with the supplier." — Recurring feedback from in-house counsel reviewing contract-AI outputs

05 Task 3: Scope-3 inference (tool-use heavy)

Input: a top-100 supplier list with names, country, industry-code, and approximate annual spend. Output: an inferred Scope-3 emissions estimate per supplier with confidence interval, citing the public-disclosure sources used. This is a tool-use-heavy workload — the model has to invoke web search, retrieve filings, extract structured data, and synthesise.

| Model | Composite | Tool-use success | Citation accuracy | Cost/supplier |
| --- | --- | --- | --- | --- |
| GPT-5.5 | 88.4 | 96.7% | 91.2% | $0.07 |
| Opus 4.7 | 85.1 | 92.4% | 94.8% | $0.14 |
| Gemini 3.1 Pro | 78.3 | 89.6% | 85.7% | $0.09 |

Take. GPT-5.5's tool-use reliability is meaningfully ahead — 96.7% successful round-trips vs 92.4% — and at half the per-supplier cost, it's the routing default for any tool-heavy workflow. Opus 4.7's higher citation accuracy is the one place to be careful: if the use-case is audit-defensibility-heavy (CSRD reporting, supplier-disclosure attestation), the citation-accuracy delta probably matters more than the tool-success delta, and Opus 4.7 wins. As covered in the 12 use-cases playbook, the broader question is whether Scope-3 inference is audit-defensible at all today — short answer: not yet.

06 Task 4: Supplier risk synthesis (long context, multiple sources)

Input: a single top-50 supplier across 90 days of news, financial filings, ESG disclosures, sub-tier-supplier graph snapshots, and three industry-analyst reports. Output: a one-page synthesis with calibrated risk score, top-three risk drivers, and a recommended buyer action with a written rationale.

| Model | Composite | Useful coverage | False-flag rate | Cost/synthesis |
| --- | --- | --- | --- | --- |
| Opus 4.7 | 95.6 | 97.1% | 2.4% | $0.61 |
| GPT-5.5 | 89.2 | 91.4% | 5.8% | $0.32 |
| Gemini 3.1 Pro | 83.5 | 87.0% | 7.1% | $0.29 |

Take. Opus 4.7 is the right tool for this job. The false-flag rate matters: in the supplier-news brief described in the 12 use-cases playbook, the gap between a 2.4% and a 5.8% false-flag rate is the difference between "category owners trust it" and "category owners stop reading it by week 3". The roughly 2× cost premium over GPT-5.5 ($0.61 vs $0.32 per synthesis) is bought back many times over by the adoption-rate delta.

A note on Gemini 3.1 Pro

Gemini 3.1 Pro trailed on every English-language workload in this set. It is, however, the dominant model on multilingual contract work — particularly Chinese, Japanese and Korean supplier disclosures. For desks with significant APAC exposure, Gemini 3.1 Pro is the right routing choice for that subset of the work even though it loses every English benchmark in this piece.

07 The routing layer: what I actually deploy

The routing logic used by the ProcureAI suite. Open-sourced in the repo; copy it if you're building your own. The architecture is simple and the simplicity is the point:

[Figure: Router architecture. A procurement task enters the router (rules.yaml, a first-class artefact), which dispatches to Opus 4.7 (contracts, risk, long-context; if regulated or long-context), GPT-5.5 (drafting, tool-use), or Gemini 3.1 Pro (CN, JP, KR contracts; if non-English input).]

Don't pick a model. Pick a router, write down the rules, treat the router as a first-class deployment artefact.

  • If task is RFP drafting: GPT-5.5 default; route to Opus 4.7 if category is regulated (defined list).
  • If task is contract clause extraction: Opus 4.7 always.
  • If task is tool-use-heavy research synthesis: GPT-5.5 default; fall back to Opus 4.7 only if citation-accuracy is the dispositive metric.
  • If task is long-context multi-source synthesis (risk, category strategy): Opus 4.7 always.
  • If input language is CN / JP / KR: Gemini 3.1 Pro always.
  • If estimated cost > $5 per call: insert a cheaper-model preview pass; only escalate to frontier if the cheaper model flags uncertainty.
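The rules above can be sketched as a small routing function. This is a hedged illustration, not the code from the ProcureAI repo: the `Task` and `Route` types, the task-kind strings, and the language codes are hypothetical, the regulated list is a placeholder built from the examples given under Task 1, and checking the input language first is my assumption about precedence where rules overlap.

```python
from dataclasses import dataclass

OPUS, GPT, GEMINI = "Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro"

CJK_LANGUAGES = {"zh", "ja", "ko"}  # assumed codes for CN / JP / KR input
# Placeholder for the "defined list" of regulated categories.
REGULATED = {"legal", "public-sector", "healthcare", "financial-services"}
PREVIEW_THRESHOLD_USD = 5.00


@dataclass(frozen=True)
class Task:
    kind: str  # "rfp_drafting", "clause_extraction", "tool_research", "long_context_synthesis"
    language: str = "en"
    category: str = ""
    citation_critical: bool = False
    estimated_cost_usd: float = 0.0


@dataclass(frozen=True)
class Route:
    model: str
    preview_first: bool  # run a cheaper-model preview pass before escalating


def route(task: Task) -> Route:
    preview = task.estimated_cost_usd > PREVIEW_THRESHOLD_USD
    if task.language in CJK_LANGUAGES:  # assumption: language rule takes precedence
        model = GEMINI
    elif task.kind == "rfp_drafting":
        model = OPUS if task.category in REGULATED else GPT
    elif task.kind == "clause_extraction":
        model = OPUS
    elif task.kind == "tool_research":
        model = OPUS if task.citation_critical else GPT
    elif task.kind == "long_context_synthesis":
        model = OPUS
    else:
        raise ValueError(f"no routing rule for task kind: {task.kind!r}")
    return Route(model=model, preview_first=preview)


print(route(Task(kind="rfp_drafting", category="healthcare")))
print(route(Task(kind="clause_extraction", language="ja", estimated_cost_usd=6.5)))
```

Raising on an unknown task kind, rather than silently falling back to a default model, keeps the written rules and the deployed behaviour in step.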

08 The two prompts that closed the gap

This is where the leaderboard changed once prompting was tuned per model. Two prompt patterns moved the numbers; one meta-rule made both of them work.

1. "Show your work, citation-first"

Force the model to emit citations before the prose summary. This lifted GPT-5.5 citation accuracy on Scope-3 from 91.2% to 94.6%, nearly closing Opus 4.7's lead (94.8%) on the one dimension where it had a clear edge. There was no measurable effect on Opus 4.7, which suggests it already does this internally.

Effect: +3.4pp on GPT-5.5
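One way to apply the citation-first pattern is to wrap the task instructions in a fixed response structure. The wording below is an illustrative assumption, not the prompt used in the benchmark, and `citation_first_prompt` is a hypothetical helper name.

```python
def citation_first_prompt(task_instructions: str) -> str:
    """Wrap task instructions so citations are emitted before any prose."""
    return (
        f"{task_instructions.strip()}\n\n"
        "Respond in two parts, in this order:\n"
        "1. CITATIONS: a numbered list of every source you relied on, "
        "each with the exact passage or figure used.\n"
        "2. SUMMARY: the prose answer. Every claim must reference a "
        "citation number from part 1. Do not make a claim that has no citation.\n"
    )


# Hypothetical task text, for illustration only.
print(citation_first_prompt(
    "Estimate Scope-3 emissions for the supplier below, using public disclosures."
))
```

Putting the citations first means the summary can only be written against sources the model has already committed to.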
2. "Confidence interval per claim, not per output"

Force the model to assign a confidence interval to each clause / claim / inference, not the overall output. Cut hallucination rates by ~50% on Opus 4.7 and ~33% on GPT-5.5. The single most underused prompt pattern in procurement-AI deployments I've audited.

Effect: −50% hallucinations
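Per-claim confidence is only useful if something downstream reads it. A minimal sketch, assuming the model returns each claim with a lower and upper confidence bound between 0 and 1: the `Claim` type, the field names, the 0.80 review threshold, and the example claims are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    conf_low: float   # lower bound of the model's stated confidence, 0.0-1.0
    conf_high: float  # upper bound, 0.0-1.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.conf_low <= self.conf_high <= 1.0:
            raise ValueError(f"invalid confidence interval for claim: {self.text!r}")


def needs_review(claims: list[Claim], min_low: float = 0.80) -> list[Claim]:
    """Return the claims whose lower confidence bound falls below the threshold."""
    return [c for c in claims if c.conf_low < min_low]


# Hypothetical extraction output, for illustration only.
claims = [
    Claim("Renewal date is stated in the base agreement", 0.95, 0.99),
    Claim("Indemnity cap was changed by an amendment", 0.55, 0.80),
]
for c in needs_review(claims):
    print(f"REVIEW: {c.text} [{c.conf_low:.2f}, {c.conf_high:.2f}]")
```

Filtering on the lower bound sends a reviewer to the specific weak claim, which a single confidence figure for the whole output cannot do.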
3. Don't pick a model, pick a router

The meta-rule. Pick a routing layer, write down the rules, treat the router as a first-class deployment artefact. The free ProcureAI suite ships with this routing pre-wired across all 16 skills; override files let you tune per task per company.

Martin Bacigal

Founder, ProcureAI

Martin is the founder of ProcureAI and Global Category Manager — IT at Nouryon, where he negotiates the same agentic systems he builds at home. Across Nouryon and Henkel he's booked $16M+ in cumulative IT, SaaS and cybersecurity savings, while leading the global AI capability-building programme that put 350+ procurement professionals across four continents into production with AI workflows.
