Qwen3.5 Plus 2026-02-15 Reasoning (medium)
Qwen
Consistency 10.00 · Attempt pass rate 100.0%
Total Tests: 10
Fully passed tests: 10/10
Score: 10.00
Reasoning score: 7.83
Output Tokens: 407
Consistency: 10.00
Attempt pass rate: 100.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 9.25 | 25 | 2,255 |
| Data parsing and extraction | 1/1 | 10.00 | 9.75 | 110 | 3,822 |
| Domain specific | 2/2 | 10.00 | 5.38 | 23 | 17,004 |
| Instructions following | 2/2 | 10.00 | 7.63 | 69 | 13,421 |
| Puzzle Solving | 3/3 | 10.00 | 8.00 | 180 | 17,572 |
GLM 5 Reasoning (medium)
Z.ai
Consistency 9.05 · Attempt pass rate 95.0%
Total Tests: 10
Fully passed tests: 9/10
Score: 9.50
Reasoning score: 9.22
Output Tokens: 15,960
Consistency: 9.05
Attempt pass rate: 95.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 9.50 | 449 | 1,559 |
| Data parsing and extraction | 1/1 | 10.00 | 9.90 | 214 | 2,232 |
| Domain specific | 1/2 | 7.75 | 8.00 | 13,012 | 13,656 |
| Instructions following | 2/2 | 9.75 | 9.63 | 115 | 1,919 |
| Puzzle Solving | 3/3 | 10.00 | 9.37 | 2,170 | 4,817 |
GPT-5.2 Reasoning (medium)
OpenAI
Consistency 9.19 · Attempt pass rate 75.0%
Total Tests: 10
Fully passed tests: 7/10
Score: 8.10
Reasoning score: 7.34
Output Tokens: 630
Consistency: 9.19
Attempt pass rate: 75.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 9.75 | 28 | 526 |
| Data parsing and extraction | 1/1 | 10.00 | 9.50 | 86 | 84 |
| Domain specific | 1/2 | 5.50 | 2.50 | 28 | 8,988 |
| Instructions following | 2/2 | 9.50 | - | 62 | 525 |
| Puzzle Solving | 1/3 | 7.00 | 8.25 | 426 | 464 |
Kimi K2.5 Reasoning (medium)
MoonshotAI
Consistency 9.06 · Attempt pass rate 75.0%
Total Tests: 10
Fully passed tests: 7/10
Score: 7.95
Reasoning score: 8.98
Output Tokens: 10,043
Consistency: 9.06
Attempt pass rate: 75.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 9.78 | 223 | 1,061 |
| Data parsing and extraction | 1/1 | 10.00 | 9.75 | 318 | 2,247 |
| Domain specific | 1/2 | 7.75 | 7.38 | 7,094 | 16,171 |
| Instructions following | 2/2 | 9.50 | 9.45 | 1,966 | 2,663 |
| Puzzle Solving | 1/3 | 5.00 | 8.97 | 442 | 4,237 |
Claude Sonnet 4.6 Reasoning (medium)
Anthropic
Consistency 9.06 · Attempt pass rate 75.0%
Total Tests: 10
Fully passed tests: 7/10
Score: 7.75
Reasoning score: 8.72
Output Tokens: 33,105
Consistency: 9.06
Attempt pass rate: 75.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 1/2 | 5.50 | 9.88 | 455 | 445 |
| Data parsing and extraction | 1/1 | 10.00 | 9.50 | 531 | 433 |
| Domain specific | 0/2 | 3.25 | 4.75 | 31,512 | 21,237 |
| Instructions following | 2/2 | 10.00 | 10.00 | 201 | 328 |
| Puzzle Solving | 3/3 | 10.00 | 9.50 | 406 | 456 |
StepFun: Step 3.5 Flash Reasoning (medium) (free available)
StepFun
Consistency 10.00 · Attempt pass rate 70.0%
Total Tests: 10
Fully passed tests: 7/10
Score: 7.55
Reasoning score: 9.13
Output Tokens: 28,305
Consistency: 10.00
Attempt pass rate: 70.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 10.00 | 567 | 2,199 |
| Data parsing and extraction | 1/1 | 10.00 | 9.50 | 195 | 3,413 |
| Domain specific | 1/2 | 5.50 | 7.38 | 24,525 | 27,552 |
| Instructions following | 2/2 | 9.75 | 10.00 | 1,120 | 1,769 |
| Puzzle Solving | 1/3 | 5.00 | 9.00 | 1,898 | 5,657 |
GPT-5 Nano Reasoning (medium)
OpenAI
Consistency 9.99 · Attempt pass rate 70.0%
Total Tests: 10
Fully passed tests: 7/10
Score: 7.50
Reasoning score: 6.05
Output Tokens: 1,385
Consistency: 9.99
Attempt pass rate: 70.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 5.38 | 220 | 5,952 |
| Data parsing and extraction | 1/1 | 10.00 | 8.00 | 115 | 3,520 |
| Domain specific | 1/2 | 5.50 | 4.00 | 102 | 9,152 |
| Instructions following | 2/2 | 9.75 | 8.90 | 192 | 3,008 |
| Puzzle Solving | 1/3 | 4.83 | 5.33 | 756 | 8,064 |
gpt-oss-120b Reasoning (medium) (free available)
OpenAI
Consistency 5.38 · Attempt pass rate 65.0%
Total Tests: 10
Fully passed tests: 4/10
Score: 7.25
Reasoning score: 8.35
Output Tokens: 3,636
Consistency: 5.38
Attempt pass rate: 65.0%
Flaky tests: 5
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 2/2 | 10.00 | 10.00 | 175 | 683 |
| Data parsing and extraction | 0/1 | 5.50 | 9.50 | 113 | 285 |
| Domain specific | 0/2 | 5.50 | 6.75 | 2,041 | 2,221 |
| Instructions following | 2/2 | 10.00 | 9.50 | 69 | 1,202 |
| Puzzle Solving | 0/3 | 5.33 | 7.17 | 1,238 | 1,217 |
MiniMax M2.5 Reasoning (medium)
MiniMax
Consistency 7.28 · Attempt pass rate 65.0%
Total Tests: 10
Fully passed tests: 5/10
Score: 7.15
Reasoning score: 7.69
Output Tokens: 9,753
Consistency: 7.28
Attempt pass rate: 65.0%
Flaky tests: 3
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 1/2 | 7.75 | 8.25 | 7 | 3,492 |
| Data parsing and extraction | 1/1 | 10.00 | 9.65 | 102 | 2,046 |
| Domain specific | 1/2 | 5.50 | 5.13 | 8,835 | 116,546 |
| Instructions following | 1/2 | 8.50 | 8.25 | 629 | 1,803 |
| Puzzle Solving | 1/3 | 6.00 | 8.00 | 180 | 8,796 |
MiMo-V2-Flash Reasoning (medium)
Xiaomi
Consistency 8.16 · Attempt pass rate 60.0%
Total Tests: 10
Fully passed tests: 5/10
Score: 6.60
Reasoning score: 7.92
Output Tokens: 5,691
Consistency: 8.16
Attempt pass rate: 60.0%
Flaky tests: 2
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 1/2 | 5.50 | 9.75 | 185 | 839 |
| Data parsing and extraction | 0/1 | 5.50 | 7.00 | 103 | 4,367 |
| Domain specific | 2/2 | 10.00 | 7.38 | 4,867 | 9,578 |
| Instructions following | 2/2 | 10.00 | 9.75 | 115 | 1,854 |
| Puzzle Solving | 0/3 | 3.17 | 6.75 | 421 | 4,923 |
Claude Sonnet 4.6 No Reasoning
Anthropic
Consistency 9.29 · Attempt pass rate 55.0%
Total Tests: 10
Fully passed tests: 5/10
Score: 6.25
Reasoning score: -
Output Tokens: 1,098
Consistency: 9.29
Attempt pass rate: 55.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 454 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 98 | 0 |
| Domain specific | 2/2 | 10.00 | - | 18 | 0 |
| Instructions following | 1/2 | 5.25 | - | 136 | 0 |
| Puzzle Solving | 1/3 | 6.67 | - | 392 | 0 |
Qwen3.5 Plus 2026-02-15 No Reasoning
Qwen
Consistency 10.00 · Attempt pass rate 50.0%
Total Tests: 10
Fully passed tests: 5/10
Score: 5.70
Reasoning score: -
Output Tokens: 433
Consistency: 10.00
Attempt pass rate: 50.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 8 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 100 | 0 |
| Domain specific | 1/2 | 5.50 | - | 4 | 0 |
| Instructions following | 2/2 | 9.50 | - | 48 | 0 |
| Puzzle Solving | 1/3 | 5.00 | - | 273 | 0 |
GLM 4.7 Flash Reasoning (medium)
Z.ai
Consistency 8.11 · Attempt pass rate 50.0%
Total Tests: 10
Fully passed tests: 4/10
Score: 5.45
Reasoning score: 8.04
Output Tokens: 5,579
Consistency: 8.11
Attempt pass rate: 50.0%
Flaky tests: 2
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 1/2 | 5.50 | 9.25 | 384 | 1,056 |
| Data parsing and extraction | 1/1 | 10.00 | 9.40 | 468 | 2,195 |
| Domain specific | 1/2 | 5.50 | 7.00 | 3,798 | 4,035 |
| Instructions following | 1/2 | 7.25 | 9.75 | 265 | 1,428 |
| Puzzle Solving | 0/3 | 2.67 | 6.33 | 664 | 4,078 |
Claude Opus 4.6 Reasoning (medium)
Anthropic
Consistency 8.11 · Attempt pass rate 50.0%
Total Tests: 10
Fully passed tests: 4/10
Score: 5.40
Reasoning score: 9.50
Output Tokens: 5,900
Consistency: 8.11
Attempt pass rate: 50.0%
Flaky tests: 2
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 3.25 | 10.00 | 398 | 340 |
| Data parsing and extraction | 0/1 | 5.50 | 9.75 | 351 | 436 |
| Domain specific | 0/2 | 1.00 | 8.88 | 4,606 | 3,015 |
| Instructions following | 2/2 | 9.50 | 9.25 | 173 | 332 |
| Puzzle Solving | 2/3 | 7.00 | 9.67 | 372 | 395 |
GLM 5 No Reasoning
Z.ai
Consistency 9.27 · Attempt pass rate 45.0%
Total Tests: 10
Fully passed tests: 4/10
Score: 5.30
Reasoning score: -
Output Tokens: 337
Consistency: 9.27
Attempt pass rate: 45.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 7 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 81 | 0 |
| Domain specific | 0/2 | 1.00 | - | 6 | 0 |
| Instructions following | 2/2 | 10.00 | - | 42 | 0 |
| Puzzle Solving | 1/3 | 6.33 | - | 201 | 0 |
GLM 4.7 Flash No Reasoning
Z.ai
Consistency 9.26 · Attempt pass rate 35.0%
Total Tests: 10
Fully passed tests: 3/10
Score: 5.05
Reasoning score: -
Output Tokens: 207
Consistency: 9.26
Attempt pass rate: 35.0%
Flaky tests: 1
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 8 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 82 | 0 |
| Domain specific | 2/2 | 10.00 | - | 8 | 0 |
| Instructions following | 0/2 | 4.50 | - | 32 | 0 |
| Puzzle Solving | 0/3 | 3.17 | - | 77 | 0 |
MiMo-V2-Flash No Reasoning
Xiaomi
Consistency 8.11 · Attempt pass rate 40.0%
Total Tests: 10
Fully passed tests: 3/10
Score: 4.80
Reasoning score: -
Output Tokens: 66,101
Consistency: 8.11
Attempt pass rate: 40.0%
Flaky tests: 2
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 12 | 0 |
| Data parsing and extraction | 0/1 | 5.50 | - | 106 | 0 |
| Domain specific | 1/2 | 7.75 | - | 8 | 0 |
| Instructions following | 1/2 | 5.25 | - | 43 | 0 |
| Puzzle Solving | 1/3 | 4.83 | - | 65,932 | 0 |
Kimi K2.5 No Reasoning
MoonshotAI
Consistency 10.00 · Attempt pass rate 30.0%
Total Tests: 10
Fully passed tests: 3/10
Score: 4.00
Reasoning score: -
Output Tokens: 284
Consistency: 10.00
Attempt pass rate: 30.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 7 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 72 | 0 |
| Domain specific | 1/2 | 5.50 | - | 10 | 0 |
| Instructions following | 1/2 | 5.50 | - | 40 | 0 |
| Puzzle Solving | 0/3 | 2.00 | - | 155 | 0 |
GPT-4o-mini No Reasoning
OpenAI
Consistency 9.97 · Attempt pass rate 20.0%
Total Tests: 10
Fully passed tests: 2/10
Score: 3.55
Reasoning score: -
Output Tokens: 323
Consistency: 9.97
Attempt pass rate: 20.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 4 | 0 |
| Data parsing and extraction | 1/1 | 10.00 | - | 74 | 0 |
| Domain specific | 0/2 | 1.00 | - | 4 | 0 |
| Instructions following | 1/2 | 5.50 | - | 46 | 0 |
| Puzzle Solving | 0/3 | 3.50 | - | 195 | 0 |
Qwen3 Coder Next No Reasoning
Qwen
Consistency 10.00 · Attempt pass rate 20.0%
Total Tests: 10
Fully passed tests: 2/10
Score: 3.00
Reasoning score: -
Output Tokens: 736
Consistency: 10.00
Attempt pass rate: 20.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | - | 14 | 0 |
| Data parsing and extraction | 0/1 | 1.00 | - | 100 | 0 |
| Domain specific | 1/2 | 5.50 | - | 8 | 0 |
| Instructions following | 1/2 | 5.00 | - | 42 | 0 |
| Puzzle Solving | 0/3 | 2.00 | - | 572 | 0 |
Qwen3 Coder Next Reasoning (medium)
Qwen
Consistency 9.96 · Attempt pass rate 20.0%
Total Tests: 10
Fully passed tests: 2/10
Score: 2.95
Reasoning score: 3.83
Output Tokens: 671
Consistency: 9.96
Attempt pass rate: 20.0%
Flaky tests: 0
| Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|
| Anti-AI Tricks | 0/2 | 1.00 | 1.00 | 12 | 0 |
| Data parsing and extraction | 0/1 | 1.00 | 4.00 | 100 | 0 |
| Domain specific | 1/2 | 5.50 | 5.00 | 8 | 0 |
| Instructions following | 1/2 | 5.00 | 8.00 | 50 | 0 |
| Puzzle Solving | 0/3 | 1.83 | 3.50 | 501 | 0 |