The Benchmark Quarterly Vol. IV · No. 2 · May 2026

Four models,
one scoreboard.

A side-by-side reading of Kimi K2.6, MiniMax M2.7, Claude Opus 4.7, and GPT-5.5 across the benchmarks the labs themselves report. Open weights versus closed flagships, spring 2026.

Model           Lab            Release
Kimi K2.6       Moonshot AI    Open weights
MiniMax M2.7    MiniMax        Open weights
Opus 4.7        Anthropic      Closed flagship
GPT-5.5         OpenAI         Closed flagship
Headline benchmarks

Figure 01 — Composite. Composite benchmark comparison.

GPT-5.5 leads on terminal use and knowledge-work tasks; Opus 4.7 holds the line on long-form coding; the open models close the gap on the composite index for a tenth of the price. — Editor's note

Artificial Analysis Index

GPT-5.5         57
Opus 4.7        57
Kimi K2.6       54
MiniMax M2.7    50
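
How does a single index come out of many tests? The Artificial Analysis methodology is not reproduced in this issue, but the general shape of any composite is the same: normalize each sub-score, then average. A minimal sketch in Python, with invented benchmark names and values, not the index's actual formula or weights:

    # Illustrative composite score: an equally weighted mean of sub-scores,
    # each already normalized to a 0-100 scale. Benchmark names and values
    # are invented; this is not the Artificial Analysis formula.
    def composite(scores):
        """Mean of normalized (0-100) benchmark scores."""
        return sum(scores.values()) / len(scores)

    example = {"coding": 62.0, "reasoning": 55.0, "knowledge": 54.0}
    print(round(composite(example), 1))  # 57.0

Real indices usually weight sub-scores unevenly; the equal weighting above is only for illustration.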

Hallucination rate (lower is better)

MiniMax M2.7    34%
Opus 4.7        36%
Kimi K2.6       39%
GPT-5.5         not published
Figure 02 — Cost. Blended price per million tokens; cost comparison.
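
A blended price folds input and output token rates into one number. The weighting behind Figure 02 is not stated here; a common convention is a 3:1 input:output ratio, which the sketch below assumes, with placeholder prices rather than any vendor's actual rates.

    # Minimal blended-price sketch, assuming a 3:1 input:output token
    # weighting (an assumption; Figure 02's weights are not stated).
    # Prices are placeholders, not actual vendor rates.
    def blended_price(input_usd_per_m, output_usd_per_m,
                      input_weight=3.0, output_weight=1.0):
        """Weighted average price per million tokens."""
        total = input_weight + output_weight
        return (input_usd_per_m * input_weight
                + output_usd_per_m * output_weight) / total

    # Hypothetical model at $2 per 1M input tokens, $10 per 1M output:
    print(f"${blended_price(2.00, 10.00):.2f} per 1M tokens")  # $4.00

Under a 3:1 weighting, output-heavy workloads such as long-form coding or agentic loops can cost more than the blended figure suggests, which is worth keeping in mind when reading a single price bar.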

Sources. Artificial Analysis Intelligence Index v4.0; OpenAI GPT-5.5 announcement (April 23, 2026); Anthropic Opus 4.7 system card; MiniMax M2.7 release notes; Moonshot Kimi K2.6 technical report; Vellum and BuildFast comparison reports.

Method. Vendor-reported scores where available. Empty cells indicate the lab did not publish a comparable number. Terminal-Bench 2.0 and tau-2 Bench Telecom use original prompts without prompt tuning. SWE-bench Verified for MiniMax is from its release notes; Opus 4.7's number is the published headline figure.

Caveats. Self-reported benchmarks favor the publishing lab. Independent verification lags release dates. Token efficiency, latency, and real-world coding behavior are not captured in any single composite score.