Show HN: AIBenchy – Independent AI Leaderboard

Qwen3.5 Plus 2026-02-15 Reasoning (medium)

Qwen

Consistency 10.00 · Attempt pass rate 100.0%

10.00 7.83 1.3203 $0.13203 10/10

Total Tests: 10

Fully passed tests: 10/10

Score: 10.00

Reasoning score: 7.83

Output Tokens: 407

Consistency: 10.00

Attempt pass rate: 100.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	9.25	25	2,255
Data parsing and extraction	1/1	10.00	9.75	110	3,822
Domain specific	2/2	10.00	5.38	23	17,004
Instructions following	2/2	10.00	7.63	69	13,421
Puzzle Solving	3/3	10.00	8.00	180	17,572

GLM 5 Reasoning (medium)

Z.ai

Consistency 9.05 · Attempt pass rate 95.0%

9.50 9.22 0.7003 $0.06303 9/10

Total Tests: 10

Fully passed tests: 9/10

Score: 9.50

Reasoning score: 9.22

Output Tokens: 15,960

Consistency: 9.05

Attempt pass rate: 95.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	9.50	449	1,559
Data parsing and extraction	1/1	10.00	9.90	214	2,232
Domain specific	1/2	7.75	8.00	13,012	13,656
Instructions following	2/2	9.75	9.63	115	1,919
Puzzle Solving	3/3	10.00	9.37	2,170	4,817
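
The overall Score appears to be the mean of the category scores weighted by the number of tests in each category. This is an inference from the published figures, not something the page states. A minimal Python sketch that checks it against the GLM 5 rows above (`parse_rows` and `weighted_score` are hypothetical helper names):

```python
# Category rows copied from the GLM 5 Reasoning (medium) table:
# category, passed/total tests, category score (remaining columns omitted).
GLM5_ROWS = (
    "Anti-AI Tricks\t2/2\t10.00\n"
    "Data parsing and extraction\t1/1\t10.00\n"
    "Domain specific\t1/2\t7.75\n"
    "Instructions following\t2/2\t9.75\n"
    "Puzzle Solving\t3/3\t10.00\n"
)


def parse_rows(text):
    """Return a list of (tests_in_category, category_score) tuples."""
    rows = []
    for line in text.strip().splitlines():
        cells = line.split("\t")
        _passed, total = cells[1].split("/")
        rows.append((int(total), float(cells[2])))
    return rows


def weighted_score(rows):
    """Mean of category scores, weighted by tests per category (assumed formula)."""
    total_tests = sum(n for n, _ in rows)
    return sum(n * score for n, score in rows) / total_tests


rows = parse_rows(GLM5_ROWS)
overall = weighted_score(rows)
print(f"{overall:.2f}")  # 9.50, matching the listed Score for GLM 5
```

The same weighting also reproduces the listed scores for GPT-5.2 (8.10) and Kimi K2.5 (7.95), though the leaderboard may compute the figure differently internally.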

GPT-5.2 Reasoning (medium)

OpenAI

Consistency 9.19 · Attempt pass rate 75.0%

8.10 7.34 2.3140 $0.16199 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 8.10

Reasoning score: 7.34

Output Tokens: 630

Consistency: 9.19

Attempt pass rate: 75.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	9.75	28	526
Data parsing and extraction	1/1	10.00	9.50	86	84
Domain specific	1/2	5.50	2.50	28	8,988
Instructions following	2/2	9.50	-	62	525
Puzzle Solving	1/3	7.00	8.25	426	464

Kimi K2.5 Reasoning (medium)

MoonshotAI

Consistency 9.06 · Attempt pass rate 75.0%

7.95 8.98 1.1257 $0.07880 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.95

Reasoning score: 8.98

Output Tokens: 10,043

Consistency: 9.06

Attempt pass rate: 75.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	9.78	223	1,061
Data parsing and extraction	1/1	10.00	9.75	318	2,247
Domain specific	1/2	7.75	7.38	7,094	16,171
Instructions following	2/2	9.50	9.45	1,966	2,663
Puzzle Solving	1/3	5.00	8.97	442	4,237

Claude Sonnet 4.6 Reasoning (medium)

Anthropic

Consistency 9.06 · Attempt pass rate 75.0%

7.75 8.72 12.1517 $0.85062 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.75

Reasoning score: 8.72

Output Tokens: 33,105

Consistency: 9.06

Attempt pass rate: 75.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	1/2	5.50	9.88	455	445
Data parsing and extraction	1/1	10.00	9.50	531	433
Domain specific	0/2	3.25	4.75	31,512	21,237
Instructions following	2/2	10.00	10.00	201	328
Puzzle Solving	3/3	10.00	9.50	406	456

StepFun: Step 3.5 Flash Reasoning (medium) Free Available

Stepfun

Consistency 10.00 · Attempt pass rate 70.0%

7.55 9.13 0.0000 $0.00000 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.55

Reasoning score: 9.13

Output Tokens: 28,305

Consistency: 10.00

Attempt pass rate: 70.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	10.00	567	2,199
Data parsing and extraction	1/1	10.00	9.50	195	3,413
Domain specific	1/2	5.50	7.38	24,525	27,552
Instructions following	2/2	9.75	10.00	1,120	1,769
Puzzle Solving	1/3	5.00	9.00	1,898	5,657

GPT-5 Nano Reasoning (medium)

OpenAI

Consistency 9.99 · Attempt pass rate 70.0%

7.50 6.05 0.1796 $0.01258 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.50

Reasoning score: 6.05

Output Tokens: 1,385

Consistency: 9.99

Attempt pass rate: 70.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	5.38	220	5,952
Data parsing and extraction	1/1	10.00	8.00	115	3,520
Domain specific	1/2	5.50	4.00	102	9,152
Instructions following	2/2	9.75	8.90	192	3,008
Puzzle Solving	1/3	4.83	5.33	756	8,064

gpt-oss-120b Reasoning (medium) Free Available

OpenAI

Consistency 5.38 · Attempt pass rate 65.0%

7.25 8.35 0.0372 $0.00149 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 7.25

Reasoning score: 8.35

Output Tokens: 3,636

Consistency: 5.38

Attempt pass rate: 65.0%

Flaky tests: 5

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	2/2	10.00	10.00	175	683
Data parsing and extraction	0/1	5.50	9.50	113	285
Domain specific	0/2	5.50	6.75	2,041	2,221
Instructions following	2/2	10.00	9.50	69	1,202
Puzzle Solving	0/3	5.33	7.17	1,238	1,217

MiniMax M2.5 Reasoning (medium)

MiniMax

Consistency 7.28 · Attempt pass rate 65.0%

7.15 7.69 3.1568 $0.15784 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 7.15

Reasoning score: 7.69

Output Tokens: 9,753

Consistency: 7.28

Attempt pass rate: 65.0%

Flaky tests: 3

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	1/2	7.75	8.25	7	3,492
Data parsing and extraction	1/1	10.00	9.65	102	2,046
Domain specific	1/2	5.50	5.13	8,835	116,546
Instructions following	1/2	8.50	8.25	629	1,803
Puzzle Solving	1/3	6.00	8.00	180	8,796

#10

MiMo-V2-Flash Reasoning (medium)

Xiaomi

Consistency 8.16 · Attempt pass rate 60.0%

6.60 7.92 0.1318 $0.00660 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 6.60

Reasoning score: 7.92

Output Tokens: 5,691

Consistency: 8.16

Attempt pass rate: 60.0%

Flaky tests: 2

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	1/2	5.50	9.75	185	839
Data parsing and extraction	0/1	5.50	7.00	103	4,367
Domain specific	2/2	10.00	7.38	4,867	9,578
Instructions following	2/2	10.00	9.75	115	1,854
Puzzle Solving	0/3	3.17	6.75	421	4,923

#11

Claude Sonnet 4.6 No Reasoning

Anthropic

Consistency 9.29 · Attempt pass rate 55.0%

6.25 - 0.5202 $0.02601 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 6.25

Reasoning score: -

Output Tokens: 1,098

Consistency: 9.29

Attempt pass rate: 55.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	454
Data parsing and extraction	1/1	10.00	-	98
Domain specific	2/2	10.00	-	18
Instructions following	1/2	5.25	-	136
Puzzle Solving	1/3	6.67	-	392

#12

Qwen3.5 Plus 2026-02-15 No Reasoning

Qwen

Consistency 10.00 · Attempt pass rate 50.0%

5.70 - 0.0466 $0.00234 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 5.70

Reasoning score: -

Output Tokens: 433

Consistency: 10.00

Attempt pass rate: 50.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	8
Data parsing and extraction	1/1	10.00	-	100
Domain specific	1/2	5.50	-	4
Instructions following	2/2	9.50	-	48
Puzzle Solving	1/3	5.00	-	273

#13

GLM 4.7 Flash Reasoning (medium)

Z.ai

Consistency 8.11 · Attempt pass rate 50.0%

5.45 8.04 0.1476 $0.00591 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.45

Reasoning score: 8.04

Output Tokens: 5,579

Consistency: 8.11

Attempt pass rate: 50.0%

Flaky tests: 2

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	1/2	5.50	9.25	384	1,056
Data parsing and extraction	1/1	10.00	9.40	468	2,195
Domain specific	1/2	5.50	7.00	3,798	4,035
Instructions following	1/2	7.25	9.75	265	1,428
Puzzle Solving	0/3	2.67	6.33	664	4,078

#14

Claude Opus 4.6 Reasoning (medium)

Anthropic

Consistency 8.11 · Attempt pass rate 50.0%

5.40 9.50 6.9512 $0.27806 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.40

Reasoning score: 9.50

Output Tokens: 5,900

Consistency: 8.11

Attempt pass rate: 50.0%

Flaky tests: 2

Category	Fully passed tests	Score	Reasoning score	Output Tokens	Reasoning Tokens
Anti-AI Tricks	0/2	3.25	10.00	398	340
Data parsing and extraction	0/1	5.50	9.75	351	436
Domain specific	0/2	1.00	8.88	4,606	3,015
Instructions following	2/2	9.50	9.25	173	332
Puzzle Solving	2/3	7.00	9.67	372	395

#15

GLM 5 No Reasoning

Z.ai

Consistency 9.27 · Attempt pass rate 45.0%

5.30 - 0.0426 $0.00171 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.30

Reasoning score: -

Output Tokens: 337

Consistency: 9.27

Attempt pass rate: 45.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	7
Data parsing and extraction	1/1	10.00	-	81
Domain specific	0/2	1.00	-	6
Instructions following	2/2	10.00	-	42
Puzzle Solving	1/3	6.33	-	201

#16

GLM 4.7 Flash No Reasoning

Z.ai

Consistency 9.26 · Attempt pass rate 35.0%

5.05 - 0.0084 $0.00026 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 5.05

Reasoning score: -

Output Tokens: 207

Consistency: 9.26

Attempt pass rate: 35.0%

Flaky tests: 1

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	8
Data parsing and extraction	1/1	10.00	-	82
Domain specific	2/2	10.00	-	8
Instructions following	0/2	4.50	-	32
Puzzle Solving	0/3	3.17	-	77

#17

MiMo-V2-Flash No Reasoning

Xiaomi

Consistency 8.11 · Attempt pass rate 40.0%

4.80 - 0.6484 $0.01946 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 4.80

Reasoning score: -

Output Tokens: 66,101

Consistency: 8.11

Attempt pass rate: 40.0%

Flaky tests: 2

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	12
Data parsing and extraction	0/1	5.50	-	106
Domain specific	1/2	7.75	-	8
Instructions following	1/2	5.25	-	43
Puzzle Solving	1/3	4.83	-	65,932

#18

Kimi K2.5 No Reasoning

MoonshotAI

Consistency 10.00 · Attempt pass rate 30.0%

4.00 - 0.0507 $0.00153 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 4.00

Reasoning score: -

Output Tokens: 284

Consistency: 10.00

Attempt pass rate: 30.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	7
Data parsing and extraction	1/1	10.00	-	72
Domain specific	1/2	5.50	-	10
Instructions following	1/2	5.50	-	40
Puzzle Solving	0/3	2.00	-	155

#19

GPT-4o-mini No Reasoning

OpenAI

Consistency 9.97 · Attempt pass rate 20.0%

3.55 - 0.0310 $0.00063 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 3.55

Reasoning score: -

Output Tokens: 323

Consistency: 9.97

Attempt pass rate: 20.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	4
Data parsing and extraction	1/1	10.00	-	74
Domain specific	0/2	1.00	-	4
Instructions following	1/2	5.50	-	46
Puzzle Solving	0/3	3.50	-	195

#20

Qwen3 Coder Next No Reasoning

Qwen

Consistency 10.00 · Attempt pass rate 20.0%

3.00 - 0.0405 $0.00081 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 3.00

Reasoning score: -

Output Tokens: 736

Consistency: 10.00

Attempt pass rate: 20.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	-	14
Data parsing and extraction	0/1	1.00	-	100
Domain specific	1/2	5.50	-	8
Instructions following	1/2	5.00	-	42
Puzzle Solving	0/3	2.00	-	572

#21

Qwen3 Coder Next Reasoning (medium)

Qwen

Consistency 9.96 · Attempt pass rate 20.0%

2.95 3.83 0.0381 $0.00077 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 2.95

Reasoning score: 3.83

Output Tokens: 671

Consistency: 9.96

Attempt pass rate: 20.0%

Flaky tests: 0

Category	Fully passed tests	Score	Reasoning score	Output Tokens
Anti-AI Tricks	0/2	1.00	1.00	12
Data parsing and extraction	0/1	1.00	4.00	100
Domain specific	1/2	5.50	5.00	8
Instructions following	1/2	5.00	8.00	50
Puzzle Solving	0/3	1.83	3.50	501
