
Show HN: AIBenchy – Independent AI Leaderboard

XCSme | 2026-02-18 02:31 UTC | source
#1

Qwen3.5 Plus 2026-02-15 Reasoning (medium)

Qwen

Consistency 10.00 · Attempt pass rate 100.0%

10.00 7.83 1.3203 $0.13203 10/10

Total Tests: 10

Fully passed tests: 10/10

Score: 10.00

Reasoning score: 7.83

Output Tokens: 407

Consistency: 10.00

Attempt pass rate: 100.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 9.25 | 25 | 2,255
Data parsing and extraction | 1/1 | 10.00 | 9.75 | 110 | 3,822
Domain specific | 2/2 | 10.00 | 5.38 | 23 | 17,004
Instructions following | 2/2 | 10.00 | 7.63 | 69 | 13,421
Puzzle Solving | 3/3 | 10.00 | 8.00 | 180 | 17,572
#2

GLM 5 Reasoning (medium)

Z.ai

Consistency 9.05 · Attempt pass rate 95.0%

9.50 9.22 0.7003 $0.06303 9/10

Total Tests: 10

Fully passed tests: 9/10

Score: 9.50

Reasoning score: 9.22

Output Tokens: 15,960

Consistency: 9.05

Attempt pass rate: 95.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 9.50 | 449 | 1,559
Data parsing and extraction | 1/1 | 10.00 | 9.90 | 214 | 2,232
Domain specific | 1/2 | 7.75 | 8.00 | 13,012 | 13,656
Instructions following | 2/2 | 9.75 | 9.63 | 115 | 1,919
Puzzle Solving | 3/3 | 10.00 | 9.37 | 2,170 | 4,817
#3

GPT-5.2 Reasoning (medium)

OpenAI

Consistency 9.19 · Attempt pass rate 75.0%

8.10 7.34 2.3140 $0.16199 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 8.10

Reasoning score: 7.34

Output Tokens: 630

Consistency: 9.19

Attempt pass rate: 75.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 9.75 | 28 | 526
Data parsing and extraction | 1/1 | 10.00 | 9.50 | 86 | 84
Domain specific | 1/2 | 5.50 | 2.50 | 28 | 8,988
Instructions following | 2/2 | 9.50 | - | 62 | 525
Puzzle Solving | 1/3 | 7.00 | 8.25 | 426 | 464
#4

Kimi K2.5 Reasoning (medium)

MoonshotAI

Consistency 9.06 · Attempt pass rate 75.0%

7.95 8.98 1.1257 $0.07880 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.95

Reasoning score: 8.98

Output Tokens: 10,043

Consistency: 9.06

Attempt pass rate: 75.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 9.78 | 223 | 1,061
Data parsing and extraction | 1/1 | 10.00 | 9.75 | 318 | 2,247
Domain specific | 1/2 | 7.75 | 7.38 | 7,094 | 16,171
Instructions following | 2/2 | 9.50 | 9.45 | 1,966 | 2,663
Puzzle Solving | 1/3 | 5.00 | 8.97 | 442 | 4,237
#5

Claude Sonnet 4.6 Reasoning (medium)

Anthropic

Consistency 9.06 · Attempt pass rate 75.0%

7.75 8.72 12.1517 $0.85062 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.75

Reasoning score: 8.72

Output Tokens: 33,105

Consistency: 9.06

Attempt pass rate: 75.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 1/2 | 5.50 | 9.88 | 455 | 445
Data parsing and extraction | 1/1 | 10.00 | 9.50 | 531 | 433
Domain specific | 0/2 | 3.25 | 4.75 | 31,512 | 21,237
Instructions following | 2/2 | 10.00 | 10.00 | 201 | 328
Puzzle Solving | 3/3 | 10.00 | 9.50 | 406 | 456
#6

StepFun: Step 3.5 Flash Reasoning (medium) Free Available

Stepfun

Consistency 10.00 · Attempt pass rate 70.0%

7.55 9.13 0.0000 $0.00000 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.55

Reasoning score: 9.13

Output Tokens: 28,305

Consistency: 10.00

Attempt pass rate: 70.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 10.00 | 567 | 2,199
Data parsing and extraction | 1/1 | 10.00 | 9.50 | 195 | 3,413
Domain specific | 1/2 | 5.50 | 7.38 | 24,525 | 27,552
Instructions following | 2/2 | 9.75 | 10.00 | 1,120 | 1,769
Puzzle Solving | 1/3 | 5.00 | 9.00 | 1,898 | 5,657
#7

GPT-5 Nano Reasoning (medium)

OpenAI

Consistency 9.99 · Attempt pass rate 70.0%

7.50 6.05 0.1796 $0.01258 7/10

Total Tests: 10

Fully passed tests: 7/10

Score: 7.50

Reasoning score: 6.05

Output Tokens: 1,385

Consistency: 9.99

Attempt pass rate: 70.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 5.38 | 220 | 5,952
Data parsing and extraction | 1/1 | 10.00 | 8.00 | 115 | 3,520
Domain specific | 1/2 | 5.50 | 4.00 | 102 | 9,152
Instructions following | 2/2 | 9.75 | 8.90 | 192 | 3,008
Puzzle Solving | 1/3 | 4.83 | 5.33 | 756 | 8,064
#8

gpt-oss-120b Reasoning (medium) Free Available

OpenAI

Consistency 5.38 · Attempt pass rate 65.0%

7.25 8.35 0.0372 $0.00149 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 7.25

Reasoning score: 8.35

Output Tokens: 3,636

Consistency: 5.38

Attempt pass rate: 65.0%

Flaky tests: 5

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 2/2 | 10.00 | 10.00 | 175 | 683
Data parsing and extraction | 0/1 | 5.50 | 9.50 | 113 | 285
Domain specific | 0/2 | 5.50 | 6.75 | 2,041 | 2,221
Instructions following | 2/2 | 10.00 | 9.50 | 69 | 1,202
Puzzle Solving | 0/3 | 5.33 | 7.17 | 1,238 | 1,217
#9

MiniMax M2.5 Reasoning (medium)

MiniMax

Consistency 7.28 · Attempt pass rate 65.0%

7.15 7.69 3.1568 $0.15784 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 7.15

Reasoning score: 7.69

Output Tokens: 9,753

Consistency: 7.28

Attempt pass rate: 65.0%

Flaky tests: 3

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 1/2 | 7.75 | 8.25 | 7 | 3,492
Data parsing and extraction | 1/1 | 10.00 | 9.65 | 102 | 2,046
Domain specific | 1/2 | 5.50 | 5.13 | 8,835 | 116,546
Instructions following | 1/2 | 8.50 | 8.25 | 629 | 1,803
Puzzle Solving | 1/3 | 6.00 | 8.00 | 180 | 8,796
#10

MiMo-V2-Flash Reasoning (medium)

Xiaomi

Consistency 8.16 · Attempt pass rate 60.0%

6.60 7.92 0.1318 $0.00660 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 6.60

Reasoning score: 7.92

Output Tokens: 5,691

Consistency: 8.16

Attempt pass rate: 60.0%

Flaky tests: 2

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 1/2 | 5.50 | 9.75 | 185 | 839
Data parsing and extraction | 0/1 | 5.50 | 7.00 | 103 | 4,367
Domain specific | 2/2 | 10.00 | 7.38 | 4,867 | 9,578
Instructions following | 2/2 | 10.00 | 9.75 | 115 | 1,854
Puzzle Solving | 0/3 | 3.17 | 6.75 | 421 | 4,923
#11

Claude Sonnet 4.6 No Reasoning

Anthropic

Consistency 9.29 · Attempt pass rate 55.0%

6.25 - 0.5202 $0.02601 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 6.25

Reasoning score: -

Output Tokens: 1,098

Consistency: 9.29

Attempt pass rate: 55.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 454 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 98 | 0
Domain specific | 2/2 | 10.00 | - | 18 | 0
Instructions following | 1/2 | 5.25 | - | 136 | 0
Puzzle Solving | 1/3 | 6.67 | - | 392 | 0
#12

Qwen3.5 Plus 2026-02-15 No Reasoning

Qwen

Consistency 10.00 · Attempt pass rate 50.0%

5.70 - 0.0466 $0.00234 5/10

Total Tests: 10

Fully passed tests: 5/10

Score: 5.70

Reasoning score: -

Output Tokens: 433

Consistency: 10.00

Attempt pass rate: 50.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 8 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 100 | 0
Domain specific | 1/2 | 5.50 | - | 4 | 0
Instructions following | 2/2 | 9.50 | - | 48 | 0
Puzzle Solving | 1/3 | 5.00 | - | 273 | 0
#13

GLM 4.7 Flash Reasoning (medium)

Z.ai

Consistency 8.11 · Attempt pass rate 50.0%

5.45 8.04 0.1476 $0.00591 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.45

Reasoning score: 8.04

Output Tokens: 5,579

Consistency: 8.11

Attempt pass rate: 50.0%

Flaky tests: 2

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 1/2 | 5.50 | 9.25 | 384 | 1,056
Data parsing and extraction | 1/1 | 10.00 | 9.40 | 468 | 2,195
Domain specific | 1/2 | 5.50 | 7.00 | 3,798 | 4,035
Instructions following | 1/2 | 7.25 | 9.75 | 265 | 1,428
Puzzle Solving | 0/3 | 2.67 | 6.33 | 664 | 4,078
#14

Claude Opus 4.6 Reasoning (medium)

Anthropic

Consistency 8.11 · Attempt pass rate 50.0%

5.40 9.50 6.9512 $0.27806 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.40

Reasoning score: 9.50

Output Tokens: 5,900

Consistency: 8.11

Attempt pass rate: 50.0%

Flaky tests: 2

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 3.25 | 10.00 | 398 | 340
Data parsing and extraction | 0/1 | 5.50 | 9.75 | 351 | 436
Domain specific | 0/2 | 1.00 | 8.88 | 4,606 | 3,015
Instructions following | 2/2 | 9.50 | 9.25 | 173 | 332
Puzzle Solving | 2/3 | 7.00 | 9.67 | 372 | 395
#15

GLM 5 No Reasoning

Z.ai

Consistency 9.27 · Attempt pass rate 45.0%

5.30 - 0.0426 $0.00171 4/10

Total Tests: 10

Fully passed tests: 4/10

Score: 5.30

Reasoning score: -

Output Tokens: 337

Consistency: 9.27

Attempt pass rate: 45.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 7 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 81 | 0
Domain specific | 0/2 | 1.00 | - | 6 | 0
Instructions following | 2/2 | 10.00 | - | 42 | 0
Puzzle Solving | 1/3 | 6.33 | - | 201 | 0
#16

GLM 4.7 Flash No Reasoning

Z.ai

Consistency 9.26 · Attempt pass rate 35.0%

5.05 - 0.0084 $0.00026 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 5.05

Reasoning score: -

Output Tokens: 207

Consistency: 9.26

Attempt pass rate: 35.0%

Flaky tests: 1

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 8 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 82 | 0
Domain specific | 2/2 | 10.00 | - | 8 | 0
Instructions following | 0/2 | 4.50 | - | 32 | 0
Puzzle Solving | 0/3 | 3.17 | - | 77 | 0
#17

MiMo-V2-Flash No Reasoning

Xiaomi

Consistency 8.11 · Attempt pass rate 40.0%

4.80 - 0.6484 $0.01946 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 4.80

Reasoning score: -

Output Tokens: 66,101

Consistency: 8.11

Attempt pass rate: 40.0%

Flaky tests: 2

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 12 | 0
Data parsing and extraction | 0/1 | 5.50 | - | 106 | 0
Domain specific | 1/2 | 7.75 | - | 8 | 0
Instructions following | 1/2 | 5.25 | - | 43 | 0
Puzzle Solving | 1/3 | 4.83 | - | 65,932 | 0
#18

Kimi K2.5 No Reasoning

MoonshotAI

Consistency 10.00 · Attempt pass rate 30.0%

4.00 - 0.0507 $0.00153 3/10

Total Tests: 10

Fully passed tests: 3/10

Score: 4.00

Reasoning score: -

Output Tokens: 284

Consistency: 10.00

Attempt pass rate: 30.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 7 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 72 | 0
Domain specific | 1/2 | 5.50 | - | 10 | 0
Instructions following | 1/2 | 5.50 | - | 40 | 0
Puzzle Solving | 0/3 | 2.00 | - | 155 | 0
#19

GPT-4o-mini No Reasoning

OpenAI

Consistency 9.97 · Attempt pass rate 20.0%

3.55 - 0.0310 $0.00063 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 3.55

Reasoning score: -

Output Tokens: 323

Consistency: 9.97

Attempt pass rate: 20.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 4 | 0
Data parsing and extraction | 1/1 | 10.00 | - | 74 | 0
Domain specific | 0/2 | 1.00 | - | 4 | 0
Instructions following | 1/2 | 5.50 | - | 46 | 0
Puzzle Solving | 0/3 | 3.50 | - | 195 | 0
#20

Qwen3 Coder Next No Reasoning

Qwen

Consistency 10.00 · Attempt pass rate 20.0%

3.00 - 0.0405 $0.00081 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 3.00

Reasoning score: -

Output Tokens: 736

Consistency: 10.00

Attempt pass rate: 20.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | - | 14 | 0
Data parsing and extraction | 0/1 | 1.00 | - | 100 | 0
Domain specific | 1/2 | 5.50 | - | 8 | 0
Instructions following | 1/2 | 5.00 | - | 42 | 0
Puzzle Solving | 0/3 | 2.00 | - | 572 | 0
#21

Qwen3 Coder Next Reasoning (medium)

Qwen

Consistency 9.96 · Attempt pass rate 20.0%

2.95 3.83 0.0381 $0.00077 2/10

Total Tests: 10

Fully passed tests: 2/10

Score: 2.95

Reasoning score: 3.83

Output Tokens: 671

Consistency: 9.96

Attempt pass rate: 20.0%

Flaky tests: 0

Category | Fully passed tests | Score | Reasoning score | Output Tokens | Reasoning Tokens
Anti-AI Tricks | 0/2 | 1.00 | 1.00 | 12 | 0
Data parsing and extraction | 0/1 | 1.00 | 4.00 | 100 | 0
Domain specific | 1/2 | 5.50 | 5.00 | 8 | 0
Instructions following | 1/2 | 5.00 | 8.00 | 50 | 0
Puzzle Solving | 0/3 | 1.83 | 3.50 | 501 | 0
1 point | 1 comment
Hey HN, like many of you, I'm tired of public AI leaderboards that mostly recycle the same saturated/overfitted benchmarks (MMLU, HumanEval, etc.) and often miss fast/cheap variants or real daily pain points.

A couple days ago I launched AIBenchy — a small, opinionated leaderboard running my own custom tests focused on end-user/dev scenarios that actually trip up models today.

Current tests cover categories like:

- Anti-AI Tricks (classic gotchas like "count the Rs in strawberry", logic traps)

- Instruction following & consistency

- Data parsing/extraction

- Domain-specific tasks

- Puzzle solving / edge-case reasoning
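The post does not define how the per-category scores roll up into the headline Score, but the published rows look consistent with a mean of the category scores weighted by the number of tests in each category. The Python sketch below checks that guess against two rows copied from the leaderboard tables. The function name `weighted_score` and the weighting rule itself are assumptions inferred from the numbers, not a documented AIBenchy method.

```python
def weighted_score(rows):
    """Mean of category scores, weighted by the number of tests per category.

    rows: list of (tests_in_category, category_score) tuples.
    NOTE: this weighting rule is an assumption inferred from the published rows.
    """
    total_tests = sum(n for n, _ in rows)
    return sum(n * score for n, score in rows) / total_tests


# (tests in category, category score), copied from the leaderboard tables.
glm5 = [(2, 10.00), (1, 10.00), (2, 7.75), (2, 9.75), (3, 10.00)]   # GLM 5 Reasoning (medium)
gpt52 = [(2, 10.00), (1, 10.00), (2, 5.50), (2, 9.50), (3, 7.00)]   # GPT-5.2 Reasoning (medium)

print(round(weighted_score(glm5), 2))   # 9.5, matching the listed Score of 9.50
print(round(weighted_score(gpt52), 2))  # 8.1, matching the listed Score of 8.10
```

Under this reading, a category with three tests (Puzzle Solving) moves the headline Score three times as much as a one-test category (Data parsing and extraction).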

Recent additions (just pushed today):

- Reasoning score (new!): A separate judge LLM evaluates the chain-of-thought for efficiency: does it repeat itself, loop, think forever, brute-force enumerate every possibility (looking at you, some Qwen3.5 runs), or get to the point cleanly? This penalizes "cheaty" high-token reasoning even if the final answer is correct. Goal: reward smart, concise thinking over exhaustive trial-and-error.

- Stability metric: Measures consistency across runs (some models flake on the same prompt).
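As a rough illustration of the stability idea, here is a minimal Python sketch that derives "Attempt pass rate", "Fully passed tests", and "Flaky tests" from per-attempt results. The definitions are assumptions: a test counts as fully passed when every attempt passes, and as flaky when attempts disagree. The number of attempts per test is not stated in the post; two attempts per test is used here only because it reproduces the GLM 5 row (9/10, 95.0%, 1 flaky). The `stability_summary` function, the test names, and the results are hypothetical, and the 0-10 Consistency score is not reconstructed because its formula is not given.

```python
def stability_summary(attempts):
    """Summarise repeated runs.

    attempts: dict mapping a test name to a list of booleans, one per attempt.
    The definitions below are assumptions, not AIBenchy's documented ones.
    """
    total = sum(len(runs) for runs in attempts.values())
    passed = sum(sum(runs) for runs in attempts.values())
    # Fully passed: every attempt of the test succeeded.
    fully_passed = sum(1 for runs in attempts.values() if all(runs))
    # Flaky: at least one pass and at least one fail on the same test.
    flaky = sum(1 for runs in attempts.values() if any(runs) and not all(runs))
    return {
        "attempt_pass_rate": 100.0 * passed / total,
        "fully_passed": fully_passed,
        "flaky": flaky,
    }


# Hypothetical data: 10 tests, 2 attempts each, one test passing only once.
attempts = {f"test_{i}": [True, True] for i in range(9)}
attempts["test_9"] = [True, False]

summary = stability_summary(attempts)
print(summary)  # {'attempt_pass_rate': 95.0, 'fully_passed': 9, 'flaky': 1}
```

A model that fails a test on every attempt is consistent but wrong, so it lowers the pass rate without adding to the flaky count.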

Right now the leaderboard has ~20 models (Qwen3.5 Plus currently topping it, followed by GLM 5, various GPT/Claude variants, etc.), but it's super early/WIP:

- Manual runs + small test set

- No public submission of tests yet (open to ideas!)

- Focused on transparency & practical usefulness over massive scale

I'd love feedback from HN:

- What custom tests / gotchas / use-cases should I add next?

- Thoughts on the reasoning score — fair way to judge efficiency, or too subjective?

- Models/variants I'm missing (especially fast/cheap ones ignored elsewhere)?

- Should I let people submit their own prompts/tests eventually?

Thanks for checking it out: https://aibenchy.com

Appreciate any roast/ideas — building this to scratch my own itch.

Comments

XCSme | 2026-02-18 02:38 UTC
Offtopic: Dang, I'm fighting with the Hacker News formatting. Does anyone have a link to the HN formatting guide?