ClashBot
I challenged myself to build an AI system to play Clash Royale, a complex real-time strategy game in which you place troops to destroy your opponent's towers. Two things made this difficult: matches have a long horizon, running around three minutes, and there are over 100 different cards, each with its own unique interactions.
To build such a system, I modeled my approach on AlphaStar, an AI by Google DeepMind that beat professional players at StarCraft II. DeepMind first used supervised learning on millions of human replays to imitate human play, then used reinforcement learning in a simulated environment to further refine the agent's playstyle.
For the supervised learning phase, there were no official game replays I could access, so I looked online and found Stats Royale, a YouTube channel that had automatically uploaded tens of thousands of match videos. I figured that, with computer vision, I could turn these videos into an effective dataset.
For the reinforcement learning phase, I wanted an environment that could simulate the game thousands of times faster than real time. So I set out to recreate a 1:1 replica of the game's interactions, using a CSV of card stats that I found in the game files.
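Loading those stats is a few lines with the csv module. The column names and values below are assumptions for illustration, not the real schema of the extracted file:

```python
import csv
import io

def load_card_stats(fp):
    """Parse the card-stats CSV into {card name: row dict of stats}."""
    return {row["name"]: row for row in csv.DictReader(fp)}

# Toy example with made-up columns and values.
sample = "name,hitpoints,damage,speed\nKnight,1400,160,medium\n"
stats = load_card_stats(io.StringIO(sample))
print(stats["Knight"]["damage"])  # → 160
```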
Because the game is quite complex, I needed a custom model architecture. I feed the game state and a grid of troop positions into an LSTM, which also looks over the past several frames. Its output is fed into two heads that make the game decisions, plus a value network. The first head determines which card is played: slot 1, 2, 3, 4, or NO-OP when no card is played. The second decides where the card is placed. The value network estimates which player is winning (and will drive the critic in actor-critic learning for RL).
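This architecture can be sketched in PyTorch. All dimensions here, including the placement grid size and the encoded state width, are my placeholder assumptions, not the real hyperparameters:

```python
import torch
import torch.nn as nn

class ClashNet(nn.Module):
    def __init__(self, state_dim=64, hidden=128, n_tiles=18 * 32):
        super().__init__()
        # The LSTM summarises the current frame plus the past several frames.
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.card_head = nn.Linear(hidden, 5)         # slots 1-4 + NO-OP
        self.place_head = nn.Linear(hidden, n_tiles)  # which tile to place on
        self.value_head = nn.Linear(hidden, 1)        # who is winning

    def forward(self, frames):
        out, _ = self.lstm(frames)   # (batch, time, hidden)
        h = out[:, -1]               # summary after the latest frame
        return self.card_head(h), self.place_head(h), self.value_head(h)

net = ClashNet()
window = torch.randn(1, 8, 64)       # one 8-frame window of encoded game state
card_logits, place_logits, value = net(window)
```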
The first challenge I tackled was reading the board state. The game has no official API to extract it, so everything had to come from computer vision. To count each player's elixir (currency), I masked the frame for a specific purple color and measured the length of the bar. To read the timer, I initially used EasyOCR, but it was slower than I wanted. So I first located the digits by shooting a ray from the right side of the screen and flood-filling white pixels. Then I classified each digit by comparing its pixels against a mask for every digit and picking the closest match. This was over a hundred times faster.
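Both tricks are simple to express with NumPy. The purple thresholds, bar geometry, and toy templates below are placeholders, not the values I actually used:

```python
import numpy as np

# Assumed BGR bounds for the elixir bar's purple; the real values would
# come from inspecting frames by hand.
PURPLE_LO = np.array([120, 40, 160])
PURPLE_HI = np.array([200, 110, 255])

def elixir_from_bar(bar_row, max_elixir=10):
    """bar_row: (W, 3) strip of pixels sampled across the elixir bar."""
    mask = np.all((bar_row >= PURPLE_LO) & (bar_row <= PURPLE_HI), axis=-1)
    return round(mask.mean() * max_elixir, 1)  # filled fraction -> elixir

def classify_digit(patch, templates):
    """patch and templates are boolean masks of a digit's white pixels;
    pick the template with the fewest mismatching pixels."""
    return min(templates, key=lambda d: np.count_nonzero(patch ^ templates[d]))

# Toy elixir example: the left 70% of a 100-px bar is purple.
bar = np.zeros((100, 3), dtype=np.uint8)
bar[:70] = (160, 80, 200)
print(elixir_from_bar(bar))  # → 7.0

# Toy digit example: a fake "1" (middle column) vs a blank placeholder.
t_one = np.zeros((5, 3), dtype=bool); t_one[:, 1] = True
t_blank = np.zeros((5, 3), dtype=bool)
noisy = t_one.copy(); noisy[0, 0] = True  # one stray pixel
print(classify_digit(noisy, {1: t_one, 0: t_blank}))  # → 1
```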
Reading the cards in hand was another challenge. Cards could be either in color or greyed out, depending on whether the player could afford them. I decided to classify only the colored cards, because the greyed-out version had a timer overlay that would make pixel-based matching error-prone. After processing a match, I could analytically solve for what the grey cards must have been. To classify the colored cards, I split each card into 16 regions and took the average color of each region. I compared those averages against reference images for each card and picked the closest match. I found that matching in the LAB color space instead of RGB further improved accuracy. I tested this on hundreds of random frames and was surprised by how consistent it was.
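The classifier can be sketched as follows. The card names in the reference dict are hypothetical stand-ins, and for brevity this operates on raw channels; the real pipeline would convert to LAB first (e.g. with OpenCV's `cv2.cvtColor(img, cv2.COLOR_BGR2LAB)`):

```python
import numpy as np

def region_signature(card, grid=4):
    """Average colour of each cell in a grid x grid split (16 regions)."""
    h, w = card.shape[:2]
    sig = np.empty((grid, grid, 3))
    for i in range(grid):
        for j in range(grid):
            cell = card[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            sig[i, j] = cell.reshape(-1, 3).mean(axis=0)
    return sig.ravel()

def classify_card(card, references):
    """references: card name -> precomputed signature of its reference image."""
    sig = region_signature(card)
    return min(references, key=lambda n: np.linalg.norm(sig - references[n]))

# Toy references: solid-colour stand-ins with hypothetical card names.
red = np.zeros((32, 32, 3)); red[:] = (200, 30, 30)
blue = np.zeros((32, 32, 3)); blue[:] = (30, 30, 200)
refs = {"knight": region_signature(red), "archers": region_signature(blue)}

query = np.zeros((32, 32, 3)); query[:] = (210, 40, 40)  # a slightly-off red
print(classify_card(query, refs))  # → knight
```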
Getting the placement of cards would also be necessary for the dataset. I determined when cards were played based on when the elixir count decreased. At that moment, I checked which hand slot had emptied; that was the card played. To determine where the card was played, I used a YOLOv11 model trained to look at the field and locate where a stopwatch icon appeared, since that icon always marks the exact tile where a troop was placed.
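The when-and-which logic amounts to a scan over frames: a drop in elixir flags a play, and the newly empty slot names the card. A sketch over a hypothetical per-frame record (the real pipeline works on the CV readings described above):

```python
def detect_plays(frames):
    """frames: list of (elixir, hand) pairs, where hand is a 4-slot tuple
    and None marks an empty slot. Returns (frame_index, card) pairs."""
    plays = []
    for t in range(1, len(frames)):
        e_prev, hand_prev = frames[t - 1]
        e_now, hand_now = frames[t]
        if e_now < e_prev:  # elixir dropped: a card was just played
            for prev_card, now_card in zip(hand_prev, hand_now):
                if prev_card is not None and now_card is None:
                    plays.append((t, prev_card))
    return plays

frames = [
    (5, ("A", "B", "C", "D")),
    (3, ("A", None, "C", "D")),  # elixir fell by 2: "B" was played
    (4, ("A", "E", "C", "D")),   # slot refilled while elixir regenerates
]
print(detect_plays(frames))  # → [(1, 'B')]
```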
I didn’t just want to determine what cards were in hand, but also the exact order of the cards in the deck (played cards always go to the bottom of a player's deck). Doing so would also let me solve for the greyed-out cards. I assigned each card an index (0-7) and kept track of the exact order of the deck and hand, recording which detected label was associated with each card index. Using the timing of when cards were played, I could reconstruct the hand at any given timestamp. This worked nearly all of the time, but there were rare edge cases where card placements were missed or falsely detected. These were dangerous because a single error completely ruined the solve at the end. To fix this, I tried a number of approaches and settled on an algorithm that experimentally removed placement instances or inserted new ones between two existing ones; if an edit increased the solve accuracy, I kept it. Usually one or two edits were made per match, and this did a great job of improving reliability.
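The underlying cycle model is simple: the deck is an 8-card queue, the hand is its first four cards, and a played card rotates to the back. A minimal sketch of that model (the edit-search repair built on top of it is omitted):

```python
from collections import deque

def simulate_deck(order, plays):
    """order: the 8 cards in initial cycle order; plays: cards played so far.
    Returns the current 4-card hand."""
    q = deque(order)
    for card in plays:
        assert card in list(q)[:4], "a play must come from the current hand"
        q.remove(card)
        q.append(card)  # played cards go to the bottom of the cycle
    return list(q)[:4]

print(simulate_deck(list("ABCDEFGH"), ["A", "C"]))  # → ['B', 'D', 'E', 'F']
```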
Next, I tackled troop detection, and for that I wanted a YOLO model. I first looked on Roboflow Universe for datasets made by other people, and while there were many, with 240+ distinct labels needed, their sizes of one to two thousand images would not suffice. So I decided to create a synthetic dataset: I extracted the game assets from the .apkg and placed the sprites on a blank arena, with annotations generated automatically. This saved me dozens of hours of annotating by hand.
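Generating one synthetic sample reduces to pasting a sprite at a random spot and emitting the YOLO-format label (class id, normalised centre, normalised size). A sketch that skips alpha compositing and multi-sprite scenes:

```python
import random
import numpy as np

def paste_sprite(arena, sprite, cls_id, rng=random):
    """Paste sprite into arena in place; return the YOLO label line."""
    H, W = arena.shape[:2]
    h, w = sprite.shape[:2]
    x = rng.randint(0, W - w)
    y = rng.randint(0, H - h)
    arena[y:y + h, x:x + w] = sprite  # real version composites with alpha
    cx, cy = (x + w / 2) / W, (y + h / 2) / H
    return f"{cls_id} {cx:.6f} {cy:.6f} {w / W:.6f} {h / H:.6f}"

arena = np.zeros((100, 100, 3), dtype=np.uint8)
sprite = np.full((10, 10, 3), 255, dtype=np.uint8)
label = paste_sprite(arena, sprite, 3)
```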