Show HN: I taught LLMs to play Magic: The Gathering against each other

GregorStocks | 2026-02-17 16:22 UTC | source

mage-bench is a fork of XMage that enables large language models to play Magic: The Gathering against each other across multiple formats — Commander, Standard, Modern, and Legacy.

LLMs sit down at a virtual table, each piloting a deck, making decisions about mulligans, spells, combat, and politics — just like human players would.

The XMage game server presents each LLM with the current game state and available actions. The LLM chooses what to do, and the game engine enforces the rules. No shortcuts, no simplified rulesets — the full complexity of Magic.

94 points | 73 comments | original link

I've been teaching LLMs to play Magic: The Gathering recently, via MCP tools hooked up to the open-source XMage codebase. It's still pretty buggy and I think there's significant room for existing models to get better at it via tooling improvements, but it pretty much works today. The ratings for expensive frontier models are artificially low right now because I've been focusing on cheaper models until I work out the bugs, so they don't have a lot of games in the system.

Comments

aethrum | 2026-02-17 17:48 UTC

I love magic. Can these do politics or is it just board state?

GregorStocks | 2026-02-17 17:55 UTC

I want them to do politics in Commander, and theoretically they should - the chat log is exposed in the MCP tools just like the rest of the game history, and their prompts tell them to use chat.

In practice they haven't really talked to each other, though. They've mostly just interpreted the prompts as "you should have a running monologue in chat". Not sure how much of this is issues with the harness vs the prompt, but I'm hoping to dig into it in the future.

steveBK123 | 2026-02-17 17:48 UTC

Why are all these Show HN posts overloaded with “i taught AI how to do things i used to do for entertainment” ?

Can we automate the unpleasantries in life instead of the pleasures?

kenforthewin | 2026-02-17 17:53 UTC

Does an AI also playing your game somehow detract from the pleasure you derive from it? I find it entertaining both to play the games, and see how LLMs perform on them; I don't see how these are in any way mutually exclusive.

qsort | 2026-02-17 17:58 UTC

Game AIs are probably one of the most harmless and unambiguously good applications of technology. As I said in another message, I used to play competitive MtG and I would have loved to have a competent AI opponent. Imagine the possibilities: after a tournament you could get to review the games and figure out what you did wrong and improve, like you would do in chess or backgammon.

I get the complaint, but how is this something that removes the human element at all?

zahlman | 2026-02-17 19:25 UTC

I think Show HN is far more overloaded with "I one-shotted an automation I find useful and then asked an LLM to explain why this is actually revolutionary".

kenforthewin | 2026-02-17 17:51 UTC

Nice work. I think games are a great way to benchmark AI, especially games that involve long term strategy. I recently built an agent harness for NetHack - https://glyphbox.app/ - like you I suspect that there's a lot you can do at the harness / tool level to improve performance with existing models.

qsort | 2026-02-17 17:54 UTC

This is a fantastic idea, I used to play MtG competitively and a strong artificial opponent was always something I'd have loved.

The issue I see is that you'd need a huge amount of games to tell who's better (you need that between humans too, the game is very high variance.)

Another problem is that giving a positional evaluation to count mistakes is hard because MtG, in addition to having randomness, has private information. It could be rational for both players to believe they're currently winning even if they're both perfect bayesians. You'd need to have something that approximates "this is the probability of winning the game from this position, given all the information I have," which is almost certainly asymmetric and much more complicated than the equivalent for a game with randomness but not private information such as backgammon.

GregorStocks | 2026-02-17 18:00 UTC

You wouldn't really need a _ton_ of games to get plausible data, but unfortunately today each game costs real money - typically a dollar or more with my current harness, though I'm hoping to optimize it and of course I expect model costs to continue to decline over time. But even reasonably-expensive models today are making tons of blunders that a tournament grinder wouldn't.

I'm not trying to compute a chess-style "player X was at 0.4 before this move and at 0.2 afterwards, so it was a -0.2 blunder", but I do have "blunder analysis" where I just ask Opus to second-guess every decision after the game is over - there's a bit more information on the Methodology page. So then you can compare models by looking at how often they blunder, rather than the binary win/loss data. If you look at individual games you can jump to the "blunders" on the timeline - most of the time I agree with Opus's analysis.

oflannabhra | 2026-02-17 17:54 UTC

This is really cool! I really liked the architecture explanation.

Once you get solid rankings for the different LLMs, I think a huge feature of a system like this would be to allow LLMs to pilot user decks to evaluate changes to the deck.

I'm guessing the costs of that would be pretty big, but if decent piloting is ever enabled by the cheaper models, it could be a huge change to how users evaluate their deck construction.

Especially for formats like Commander where cooperation and coordination amongst players can't be evaluated through pure simulation, and the singleton nature makes specific card changes very difficult to evaluate as testing requires many, many games.

jamilton | 2026-02-17 17:55 UTC

Cool. How’d you pick decks?

GregorStocks | 2026-02-17 18:02 UTC

For the 1v1 formats (Standard, Modern, Legacy) I'm basically just using the current metagame from MTGGoldfish. For Commander they get a random precon. At some point I might want a 1v1 "less complicated lines than Standard" format, the LLMs don't always understand the strategy of weird decks like Doomsday or Mill.

chc4 | 2026-02-17 17:58 UTC

It's really funny reading the thought processes, where most of the time the agent doesn't actually remember trivial things about the cards they or their opponent are playing (thinking they have different mana costs, have different effects, mix up their effect with another card). The fact they're able to take game actions and win against other agants is cute, but it doesn't inspire much confidence.

The agents also constantly seem to evaluate if they're "behind" or "ahead" based on board state, which is a weird way of thinking about most games and often hard to evalaute, especially for decks like control which card more about resources like mana and card advantage, and always plan on stabalizing late game.

GregorStocks | 2026-02-17 18:20 UTC

You might be looking at really old games (meaning, like, Saturday) - I've made a lot of harness improvements recently which should make the "what does this card do?" hallucinations less common. But yeah, it still happens, especially with cheaper models - it's hard to balance "shoving everything they need into the context" against "avoid paying a billion dollars per game or overwhelming their short-term memory". I think the real solution here will be to expose more powerful MCP tools and encourage them to use the tools heavily, but most current models have problems with large MCP toolsets so I'm leaving that as a TODO for now until solutions like Anthropic's https://www.anthropic.com/engineering/code-execution-with-mc... become widespread.

spelunker | 2026-02-17 18:00 UTC

This is neat! What kind of steering or context did you provide to the LLMs? Super basic like "You are playing a card game called Magic: The Gathering", or more complex?

GregorStocks | 2026-02-17 18:03 UTC

My general intention is to tell them "you're playing MTG, your goal is to win, here are the tools available to you, follow whatever strategy you want" - I don't want to spoon-feed them strategy, that defeats the purpose of the benchmark.

You can see the current prompt at https://github.com/GregorStocks/mage-bench/blob/master/puppe...:

  "default": "You are a competitive Magic: The Gathering player. Your goal is to WIN the game. Play to maximize your win rate \u2014 make optimal strategic decisions, not flashy or entertaining ones. Think carefully about sequencing, card evaluation, and combat math.\n\nGAME LOOP - follow this exactly:\n1. Call pass_priority - this blocks until you have a decision to make, then returns your choices (response_type, choices, context, etc.)\n2. Read the choices, then call choose_action with your decision\n3. Go back to step 1\n\nCRITICAL RULES:\n- pass_priority returns your choices directly. Read them before calling choose_action.\n- When pass_priority shows playable cards, you should play them before passing. Only pass (answer=false) when you have nothing more you want to play this phase.\n\nUNDERSTANDING pass_priority OUTPUT:\n- All cards listed in response_type=select are confirmed castable with your current mana. The server pre-filters to only show cards you can legally play right now.\n- mana_pool shows your current floating mana (e.g. {\"R\": 2, \"W\": 1}).\n- untapped_lands shows how many untapped lands you control.\n- Cards with [Cast] are spells from your hand. Cards with [Activate] are abilities on permanents you control.\n\nMULLIGAN DECISIONS:\nWhen you see \"Mulligan\" in GAME_ASK, your_hand shows your current hand.\n- choose_action(answer=true) means YES MULLIGAN - throw away this hand and draw new cards\n- choose_action(answer=false) means NO KEEP - keep this hand and start playing\nThink carefully: answer=false means KEEP, answer=true means MULLIGAN.\n\nOBJECT IDs:\nEvery game object (cards in hand, permanents, stack items, graveyard/exile cards) has a short ID like \"p1\", \"p2\", etc. These IDs are stable \u2014 a card keeps its ID as it moves between zones. Use the id parameter in choose_action(id=\"p3\") instead of index when selecting objects. Use short IDs with get_oracle_text(object_id=\"p3\") and in mana_plan entries ({\"tap\":\"p3\"}).\n\nHOW ACTIONS WORK:\n- response_type=select: Cards listed are confirmed playable with your current mana. Play a card with choose_action(id=\"p3\"). Pass with choose_action(answer=false) only when you are done playing cards this phase.\n- response_type=boolean with no playable cards: Pass with choose_action(answer=false).\n- GAME_ASK (boolean): Answer true/false based on what's being asked.\n- GAME_CHOOSE_ABILITY (index): Pick an ability by index.\n- GAME_TARGET (index or id): Pick a target. If required=true, you must pick one.\n\nCOMBAT - ATTACKING:\nWhen you see combat_phase=\"declare_attackers\", use batch declaration:\n- choose_action(attackers=[\"p1\",\"p2\",\"p3\"]) declares multiple attackers at once and auto-confirms.\n- choose_action(attackers=[\"all\"]) declares all possible attackers.\n- To skip attacking, call choose_action(answer=false).\n\nCOMBAT - BLOCKING:\nWhen you see combat_phase=\"declare_blockers\", use batch declaration:\n- choose_action(blockers=[{\"id\":\"p5\",\"blocks\":\"p1\"},{\"id\":\"p6\",\"blocks\":\"p2\"}]) declares blockers and their assignments at once.\n- Use IDs from incoming_attackers for the \"blocks\" field.\n- To not block, call choose_action(answer=false).\n\nCHAT:\nUse send_chat_message to talk to your opponents during the game. React to big plays, comment on the board state, or just have fun. Check the recent_chat field in pass_priority results to see what others are saying."

They also get a small "personality" on top of that, e.g.:

"grudge-holder": { "name_part": "Grudge", "prompt_suffix": "You remember every card that wronged you. Take removal personally. Target whoever hurt you last. Keep a mental scoreboard of grievances. Forgive nothing. When a creature you liked dies, vow revenge." }, "teacher": { "name_part": "Teach", "prompt_suffix": "You explain your reasoning like you're coaching a newer player. Talk through sequencing decisions, threat evaluation, and common mistakes. Be patient and clear. Point out what the correct play is and why." },

Then they also see the documentation for the MCP tools: https://mage-bench.com/mcp-tools/. For now I've tried to keep that concise to avoid "too many MCP tools in context" issues - I expect that as solutions like tool search (https://www.anthropic.com/engineering/code-execution-with-mc...) become widespread I'll be able to add fancier tools for some models.

zahlman | 2026-02-17 19:23 UTC

How do the models know the rules of the game? Are they just supposed to use the MCP tools to figure it out? (Do they have to keep doing that from scratch?)

protocolture | 2026-02-17 22:43 UTC

>You are a competitive Magic: The Gathering player.

"If I get access to a deodorant item I should definitely not use it"

yomismoaqui | 2026-02-17 18:15 UTC

I was curious if there is something equivalent to AlphaGo but for MTG.

From the little I have seen they are different beasts (hidden information, number and complexity of rules...).

PS: Does this count as nerdsniping?

GregorStocks | 2026-02-17 18:27 UTC

I'm not aware of any good ML models for MTG. I'm just using off-the-shelf LLMs with a custom harness. It'd certainly be possible to do RLHF or something using the harness I've built, but it'd be expensive - anybody want to give me a few million dollars of OpenRouter credits so I can give it a shot?

portly | 2026-02-17 18:20 UTC

With the direction MtG is currently heading, I kind of want to break out and just play some in-Universe sets that are community made on an FOSS client. How nice would it be to just play the game in its original spirit.

GregorStocks | 2026-02-17 18:25 UTC

You might be interested in Premodern: https://premodernmagic.com/. You can play it on regular old MTGO.

FOSS Magic clients are in a legal gray area at best. My mental model is that Wizards de facto tolerate clients like XMage and Forge because their UX is awful, but if you made something that's actually as user-friendly as MTGO/Arena, they'd sue you and you would lose.

ddtaylor | 2026-02-17 18:47 UTC

GCCG has been around for a while and the clients at times had to download card images and metadata from the public Wizards site

dgxyz | 2026-02-17 18:37 UTC

I still play 4th edition against some friends. We have had the decks well over a couple of decades after we bought them! That and Catan.

Best to do this stuff in person I find.

saghm | 2026-02-17 19:47 UTC

Sounds like you want Cockatrice: https://cockatrice.github.io/

The rules aren't embedded into the client; it's "just" a virtual tabletop where you enforce the rules the same way you would playing with a friend in person. Cards have to be imported but it's fairly automatic (basically just clicking a few buttons after startup), so you could either only import the sets you want or just not use the ones you don't want (which is also how it tends to work when playing informally in person; it's not like you usually have a judge to enforce that you or your friends are playing by whatever rules you agree to).