Claude Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro on procurement-real tasks

01 How I ran it

Four workloads, three models, three runs each, with blind scoring by two procurement leaders and one in-house generative-AI engineer per run. Every input was a real procurement artefact, anonymised. Every output was scored against a published rubric. The exact dataset and rubric are in the methodology appendix; the headline rubric weightings are correctness 50%, defensibility 25%, completeness 15%, and format-readiness 10%.
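As a minimal sketch of how those weightings combine into one number, the snippet below applies them to the Gemini 3.1 Pro sub-scores reported in the Task 1 table. The function and dictionary names are illustrative, not taken from the methodology appendix, and the published composites may have been computed per run and then averaged, so applying the weights to averaged sub-scores need not reproduce every row exactly.

```python
# Headline rubric weightings as stated in the article.
WEIGHTS = {
    "correctness": 0.50,
    "defensibility": 0.25,
    "completeness": 0.15,
    "format": 0.10,
}


def composite(scores: dict) -> float:
    """Weighted sum of rubric sub-scores (each on a 0-100 scale)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing rubric dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Sub-scores for Gemini 3.1 Pro from the Task 1 table.
gemini_task1 = {
    "correctness": 90.2,
    "defensibility": 88.4,
    "completeness": 87.6,
    "format": 91.0,
}

print(round(composite(gemini_task1), 1))  # 89.4
```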

Cost is measured at end-of-April 2026 published list rates with a 25% prompt-cache discount applied to the long-context tasks where caching demonstrably reduced the cost. Latency is measured at the 95th percentile end-to-end including tool-use round-trips. All three models were given identical instruction sets per task; the only delta was the model swap.
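The two measurements can be sketched as follows. This is an assumption-laden illustration rather than the benchmark harness: the nearest-rank percentile method, the flat application of the 25% discount to the whole call cost, and the names `p95` and `effective_cost` are all choices made for this example.

```python
CACHE_DISCOUNT = 0.25  # prompt-cache discount stated in the article


def p95(latencies_s: list) -> float:
    """95th-percentile latency using the nearest-rank method (an assumed choice)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def effective_cost(list_cost_usd: float, cache_applies: bool) -> float:
    """List-rate cost, discounted only where caching actually reduced the cost."""
    if cache_applies:
        return list_cost_usd * (1 - CACHE_DISCOUNT)
    return list_cost_usd


samples = [float(s) for s in range(1, 101)]  # hypothetical end-to-end timings, seconds
print(p95(samples))
print(effective_cost(2.00, cache_applies=True))
```

Each latency sample here is assumed to cover the whole call, tool-use round-trips included, so that a single slow tool call shows up in the tail rather than being averaged away.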

What I deliberately didn't test

I didn't run open-weight models (DeepSeek V4, Llama 4.5, Mistral Frontier). They have real cost advantages, but the procurement workloads I tested depend on either tool-use reliability or long-context fidelity, and on both counts open-weight models trail the frontier by roughly six to nine months as of writing. I'll run a separate open-weight benchmark in Q3.

02 Headline results

Task	Accuracy winner	Cost winner	Best overall
RFP drafting from a brief	Opus 4.7 (94.1)	GPT-5.5 ($0.18/draft)	GPT-5.5
MSA clause extraction	Opus 4.7 (97.2)	Gemini 3.1 Pro	Opus 4.7
Scope-3 inference (tool-use heavy)	GPT-5.5 (88.4)	GPT-5.5	GPT-5.5
Supplier risk synthesis (long context)	Opus 4.7 (95.6)	Gemini 3.1 Pro	Opus 4.7
Multilingual contract (CN/JP/KR)	Gemini 3.1 Pro (91.0)	Gemini 3.1 Pro	Gemini 3.1 Pro

3 / 5 Workloads won by Opus 4.7 (on accuracy)

$0.18 GPT-5.5 cost per RFP draft

×4.2 Cost spread, best vs worst, same task

03 Task 1: RFP drafting from a one-paragraph brief

This is the most commercial of the four workloads, and the one most procurement desks will deploy first. Input: a one-paragraph brief for a real category, plus the company's playbook, exhibits library, and three reference RFPs. Output: a complete draft RFP for human review.

Model	Correctness	Defensibility	Completeness	Format	Composite	Cost/draft
Opus 4.7	96.1	93.2	92.0	94.5	94.1	$0.41
GPT-5.5	93.0	91.8	89.3	92.7	92.0	$0.18
Gemini 3.1 Pro	90.2	88.4	87.6	91.0	89.4	$0.22

Take. Opus 4.7 produces the best draft, but the 2.1-point gap over GPT-5.5 disappears after one round of human editing — and the cost differential is real. For RFP drafting at scale, GPT-5.5 is the correct routing default; reserve Opus 4.7 for the regulated-category drafts where the additional defensibility matters (legal, public-sector, healthcare, financial services).

Where the models diverged

The biggest delta showed up in scope-creep prevention. Without explicit instruction, Opus 4.7 produced a "what is out of scope" section in 11 of 12 drafts, GPT-5.5 in 6 of 12, and Gemini 3.1 Pro in 4 of 12. With explicit instruction (the prompt-2 pattern from the RFP automation guide), all three produced it reliably. The lesson: for drafting at scale, prompt for the boundary-setting explicitly, even on the strongest model.

04 Task 2: MSA clause extraction (long-context fidelity)

Input: a 78-page MSA with 14 amendments. Output: structured extraction of renewal date, auto-renew window, indemnity cap, IP-assignment language, data-processing addendum status, SLA structure, payment terms, change-control clauses, and the top five clauses that deviate from a standard playbook. This is a long-context fidelity task: every model's weak spot, but also the area where Opus 4.7's design choices show through.

Model	Composite	Misses (per 100 clauses)	Hallucination rate	Cost/MSA
Opus 4.7	97.2	0.8	0.2%	$1.84
GPT-5.5	91.4	3.1	1.1%	$1.42
Gemini 3.1 Pro	88.6	4.7	1.4%	$0.98

Take. Opus 4.7 is the clear winner and it isn't close. The hallucination delta — 0.2% vs 1.1% vs 1.4% — is the dispositive number. In contract work, a 1% hallucination rate means roughly one out of every hundred extracted clauses contains content that isn't actually in the contract. That is unacceptable for production legal work; the rework cost dwarfs the model-cost saving.
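The rework argument can be made concrete with a break-even calculation: how much may one hallucinated clause cost to fix before the cheaper model stops being cheaper? The sketch below uses the cost and hallucination figures from the table; the 200 extracted clauses per MSA is a hypothetical figure chosen for illustration, as is the function name.

```python
def break_even_rework_cost(
    cost_accurate: float,
    cost_cheap: float,
    halluc_accurate: float,
    halluc_cheap: float,
    clauses_per_contract: int,
) -> float:
    """Rework cost per hallucinated clause at which the cheaper model's saving is used up."""
    saving = cost_accurate - cost_cheap
    extra_hallucinations = clauses_per_contract * (halluc_cheap - halluc_accurate)
    if extra_hallucinations <= 0:
        raise ValueError("the cheaper model must hallucinate more for a break-even to exist")
    return saving / extra_hallucinations


CLAUSES_PER_MSA = 200  # hypothetical; the article does not state a clause count

# Opus 4.7 ($1.84, 0.2%) against GPT-5.5 ($1.42, 1.1%) and Gemini 3.1 Pro ($0.98, 1.4%).
vs_gpt = break_even_rework_cost(1.84, 1.42, 0.002, 0.011, CLAUSES_PER_MSA)
vs_gemini = break_even_rework_cost(1.84, 0.98, 0.002, 0.014, CLAUSES_PER_MSA)
print(f"GPT-5.5 break-even: ${vs_gpt:.2f} per hallucinated clause")
print(f"Gemini 3.1 Pro break-even: ${vs_gemini:.2f} per hallucinated clause")
```

Under that assumption the break-even sits at roughly $0.23 per hallucinated clause against GPT-5.5 and $0.36 against Gemini 3.1 Pro. Any realistic review or supplier dispute costs far more than that, which is the arithmetic behind the claim that rework dwarfs the model-cost saving.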

"On the long contract work the cost discussion is theatre. If the cheaper model produces a clause that isn't there, you've spent the savings four times over arguing with the supplier." — Recurring feedback from in-house counsel reviewing contract-AI outputs

05 Task 3: Scope-3 inference (tool-use heavy)

Input: a top-100 supplier list with names, country, industry-code, and approximate annual spend. Output: an inferred Scope-3 emissions estimate per supplier with confidence interval, citing the public-disclosure sources used. This is a tool-use-heavy workload — the model has to invoke web search, retrieve filings, extract structured data, and synthesise.

Model	Composite	Tool-use success	Citation accuracy	Cost/supplier
GPT-5.5	88.4	96.7%	91.2%	$0.07
Opus 4.7	85.1	92.4%	94.8%	$0.14
Gemini 3.1 Pro	78.3	89.6%	85.7%	$0.09

Take. GPT-5.5's tool-use reliability is meaningfully ahead — 96.7% successful round-trips vs 92.4% — and at half the per-supplier cost, it's the routing default for any tool-heavy workflow. Opus 4.7's higher citation accuracy is the one place to be careful: if the use-case is audit-defensibility-heavy (CSRD reporting, supplier-disclosure attestation), the citation-accuracy delta probably matters more than the tool-success delta, and Opus 4.7 wins. As covered in the 12 use-cases playbook, the broader question is whether Scope-3 inference is audit-defensible at all today — short answer: not yet.

06 Task 4: Supplier risk synthesis (long context, multiple sources)

Input: a single top-50 supplier across 90 days of news, financial filings, ESG disclosures, sub-tier-supplier graph snapshots, and three industry-analyst reports. Output: a one-page synthesis with calibrated risk score, top-three risk drivers, and a recommended buyer action with a written rationale.

Model	Composite	Useful coverage	False-flag rate	Cost/synthesis
Opus 4.7	95.6	97.1%	2.4%	$0.61
GPT-5.5	89.2	91.4%	5.8%	$0.32
Gemini 3.1 Pro	83.5	87.0%	7.1%	$0.29

Take. Opus 4.7 is the right tool for this job. The false-flag rate matters: in the supplier-news brief described in the 12 use-cases playbook, the gap between a 2.4% and a 5.8% false-flag rate is the difference between "category owners trust it" and "category owners stop reading it by week 3". The roughly 2× cost premium over GPT-5.5 ($0.61 vs $0.32 per synthesis) is bought back many times over by the adoption-rate delta.

A note on Gemini 3.1 Pro

Gemini 3.1 Pro posted the lowest composite score on every English-language workload in this set, although it was the cheapest option on the two long-context tasks. It is, however, the dominant model on multilingual contract work, particularly Chinese, Japanese and Korean supplier disclosures. For desks with significant APAC exposure, Gemini 3.1 Pro is the right routing choice for that subset of the work, even though it loses every English benchmark in this piece on quality.

07 The routing layer: what I actually deploy

This is the routing logic used by the ProcureAI suite. It is open-sourced in the repo; copy it if you're building your own. The architecture is simple, and the simplicity is the point:

Router architecture

Don't pick a model. Pick a router, write down the rules, treat the router as a first-class deployment artefact.

If task is RFP drafting: GPT-5.5 default; route to Opus 4.7 if category is regulated (defined list).
If task is contract clause extraction: Opus 4.7 always.
If task is tool-use-heavy research synthesis: GPT-5.5 default; fall back to Opus 4.7 only if citation-accuracy is the dispositive metric.
If task is long-context multi-source synthesis (risk, category strategy): Opus 4.7 always.
If input language is CN / JP / KR: Gemini 3.1 Pro always.
If estimated cost > $5 per call: insert a cheaper-model preview pass; only escalate to frontier if the cheaper model flags uncertainty.
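The rules above can be sketched as a small, testable function. This is an illustration, not the code in the ProcureAI repo: the task-kind strings, the `Task` and `Route` types, and the language codes are hypothetical, and the regulated-category list is taken from the categories named in the Task 1 take. The rules do not state a precedence when two of them overlap (for example, clause extraction on a Japanese contract), so this sketch assumes the language rule is checked first.

```python
from dataclasses import dataclass
from typing import Optional

OPUS = "Opus 4.7"
GPT = "GPT-5.5"
GEMINI = "Gemini 3.1 Pro"

# Assumption: regulated categories as named in the Task 1 take.
REGULATED_CATEGORIES = {"legal", "public-sector", "healthcare", "financial-services"}
CJK_LANGUAGES = {"zh", "ja", "ko"}  # hypothetical codes for CN / JP / KR input
PREVIEW_THRESHOLD_USD = 5.00


@dataclass(frozen=True)
class Task:
    kind: str  # "rfp_drafting", "clause_extraction", "tool_research", "long_context_synthesis"
    language: str = "en"
    category: Optional[str] = None
    citation_critical: bool = False
    estimated_cost_usd: float = 0.0


@dataclass(frozen=True)
class Route:
    model: str
    preview_first: bool  # run a cheaper-model preview pass before escalating


def pick_model(task: Task) -> str:
    # Assumed precedence: the language rule wins over the task-kind rules.
    if task.language in CJK_LANGUAGES:
        return GEMINI
    if task.kind == "rfp_drafting":
        return OPUS if task.category in REGULATED_CATEGORIES else GPT
    if task.kind == "clause_extraction":
        return OPUS
    if task.kind == "tool_research":
        return OPUS if task.citation_critical else GPT
    if task.kind == "long_context_synthesis":
        return OPUS
    raise ValueError(f"no routing rule for task kind {task.kind!r}")


def route(task: Task) -> Route:
    preview = task.estimated_cost_usd > PREVIEW_THRESHOLD_USD
    return Route(model=pick_model(task), preview_first=preview)


print(route(Task(kind="rfp_drafting", category="healthcare")))
print(route(Task(kind="long_context_synthesis", estimated_cost_usd=6.40)))
```

Raising on an unknown task kind, rather than silently defaulting to one model, keeps the rule set explicit: a new workload forces someone to write down a rule for it.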

08 The two prompts that closed the gap

This is where the leaderboard changed once prompting was tuned per model. Two prompt patterns moved every metric; one meta-rule made both of them work.

"Show your work, citation-first"

Force the model to emit citations before the prose summary. This lifted GPT-5.5 citation accuracy on Scope-3 from 91.2% to 94.6%, nearly closing Opus 4.7's lead (94.8%) on the one dimension where it had a clear edge. There was no measurable effect on Opus 4.7, which suggests it already does this internally.

+3.4pp on GPT-5.5
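One way to apply the pattern is to prepend a fixed instruction and then check that the output actually honours the ordering. The wording of the instruction, the `CITATIONS` and `SUMMARY` headings, and both function names below are hypothetical; the article specifies only that citations must come before the prose.

```python
# Hypothetical instruction text; tune the wording per model.
CITATION_FIRST_INSTRUCTION = (
    "First list every source you rely on under the heading CITATIONS, one per line. "
    "Only then write the prose under the heading SUMMARY. "
    "Do not make any claim in the summary that is not backed by a listed citation."
)


def build_prompt(task_prompt: str) -> str:
    """Prepend the citation-first instruction to an existing task prompt."""
    return f"{CITATION_FIRST_INSTRUCTION}\n\n{task_prompt}"


def citations_come_first(output: str) -> bool:
    """True only if both headings are present and CITATIONS precedes SUMMARY."""
    citations_at = output.find("CITATIONS")
    summary_at = output.find("SUMMARY")
    return citations_at != -1 and summary_at != -1 and citations_at < summary_at


good = "CITATIONS\n- Supplier annual report\n\nSUMMARY\nEstimated emissions are..."
bad = "SUMMARY\nEstimated emissions are...\n\nCITATIONS\n- Supplier annual report"
print(citations_come_first(good), citations_come_first(bad))
```

An output that fails the check can be retried or sent to review, so the ordering becomes an enforced contract instead of a hope.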

"Confidence interval per claim, not per output"

Force the model to assign a confidence interval to each clause / claim / inference, not the overall output. Cut hallucination rates by ~50% on Opus 4.7 and ~33% on GPT-5.5. The single most underused prompt pattern in procurement-AI deployments I've audited.

−50% hallucinations
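A per-claim interval is only useful if something downstream reads it. The sketch below assumes the model returns each claim with a lower and upper confidence bound between 0 and 1, and flags the ones that fall under a review threshold. The `Claim` type, the `needs_review` helper, and the 0.8 threshold are illustrative assumptions, not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    low: float   # lower bound of the model's stated confidence, 0..1
    high: float  # upper bound of the model's stated confidence, 0..1

    def __post_init__(self) -> None:
        if not 0.0 <= self.low <= self.high <= 1.0:
            raise ValueError(f"invalid confidence interval for claim: {self.text!r}")


def needs_review(claims: list, min_low: float = 0.8) -> list:
    """Return the claims whose lower confidence bound is below the threshold."""
    return [claim for claim in claims if claim.low < min_low]


extracted = [
    Claim("Renewal date is 1 March 2027", low=0.95, high=0.99),
    Claim("Indemnity cap is 2x annual fees", low=0.55, high=0.80),
    Claim("Auto-renew window is 90 days", low=0.85, high=0.95),
]

for claim in needs_review(extracted):
    print(f"REVIEW: {claim.text} ({claim.low:.2f}-{claim.high:.2f})")
```

Scoring each claim separately means one shaky clause is routed to a human without discarding the rest of the extraction, which a single output-level score cannot do.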

Don't pick a model — pick a router

The meta-rule. Pick a routing layer, write down the rules, treat the router as a first-class deployment artefact. The free ProcureAI suite ships with this routing pre-wired across all 16 skills — override files let you tune per task per company.

Martin Bacigal

Founder, ProcureAI

Martin is the founder of ProcureAI and Global Category Manager — IT at Nouryon, where he negotiates the same agentic systems he builds at home. Across Nouryon and Henkel he's booked $16M+ in cumulative IT, SaaS and cybersecurity savings, while leading the global AI capability-building programme that put 350+ procurement professionals across four continents into production with AI workflows.

Keep reading

RFP automation honestly assessed: where it lands, where it wastes a year

The 12 procurement AI use-cases that pay for themselves in one quarter

Build vs. buy vs. fractional: the three-question test for procurement AI