The Benchmark Quarterly Vol. IV · No. 2 · May 2026

Four models,
one scoreboard.

A side-by-side reading of Kimi K2.6, MiniMax M2.7, Claude Opus 4.7, and GPT-5.5 across the benchmarks the labs themselves report. Open weights versus closed flagships, spring 2026.

Model           Lab            Release
Kimi K2.6       Moonshot AI    Open weights
MiniMax M2.7    MiniMax        Open weights
Opus 4.7        Anthropic      Closed flagship
GPT-5.5         OpenAI         Closed flagship
Headline benchmarks

Figure 01 — Composite. Composite benchmark comparison.

GPT-5.5 leads on terminal use and knowledge-work tasks; Opus 4.7 holds the line on long-form coding; the open models close the gap on the composite index for a tenth of the price. — Editor's note

Artificial Analysis Index

GPT-5.5         57
Opus 4.7        57
Kimi K2.6       54
MiniMax M2.7    50
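
How does a single index come out of many tests? The Artificial Analysis methodology is not reproduced in this issue, but the general shape of any composite is the same: normalize each sub-score, then average. A minimal sketch in Python, with invented benchmark names and values, not the index's actual formula or weights:

    # Illustrative composite score: an equally weighted mean of sub-scores,
    # each already normalized to a 0-100 scale. Benchmark names and values
    # are invented; this is not the Artificial Analysis formula.
    def composite(scores):
        """Mean of normalized (0-100) benchmark scores."""
        return sum(scores.values()) / len(scores)

    example = {"coding": 62.0, "reasoning": 55.0, "knowledge": 54.0}
    print(round(composite(example), 1))  # 57.0

Real indices usually weight sub-scores unevenly; the equal weighting above is only for illustration.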

Hallucination rate (lower is better)

MiniMax M2.7    34%
Opus 4.7        36%
Kimi K2.6       39%
GPT-5.5         not published
Figure 02 — Cost. Blended price per million tokens; cost comparison.
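
A blended price folds input and output token rates into one number. The weighting behind Figure 02 is not stated here; a common convention is a 3:1 input:output ratio, which the sketch below assumes, with placeholder prices rather than any vendor's actual rates.

    # Minimal blended-price sketch, assuming a 3:1 input:output token
    # weighting (an assumption; Figure 02's weights are not stated).
    # Prices are placeholders, not actual vendor rates.
    def blended_price(input_usd_per_m, output_usd_per_m,
                      input_weight=3.0, output_weight=1.0):
        """Weighted average price per million tokens."""
        total = input_weight + output_weight
        return (input_usd_per_m * input_weight
                + output_usd_per_m * output_weight) / total

    # Hypothetical model at $2 per 1M input tokens, $10 per 1M output:
    print(f"${blended_price(2.00, 10.00):.2f} per 1M tokens")  # $4.00

Under a 3:1 weighting, output-heavy workloads such as long-form coding or agentic loops can cost more than the blended figure suggests, which is worth keeping in mind when reading a single price bar.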

Sources. Artificial Analysis Intelligence Index v4.0; OpenAI GPT-5.5 announcement (April 23, 2026); Anthropic Opus 4.7 system card; MiniMax M2.7 release notes; Moonshot Kimi K2.6 technical report; Vellum and BuildFast comparison reports.

Method. Vendor-reported scores where available. Empty cells indicate the lab did not publish a comparable number. Terminal-Bench 2.0 and tau-2 Bench Telecom use original prompts without prompt tuning. SWE-bench Verified for MiniMax is from its release notes; Opus 4.7's number is the published headline figure.

Caveats. Self-reported benchmarks favor the publishing lab. Independent verification lags release dates. Token efficiency, latency, and real-world coding behavior are not captured in any single composite score.