In my first year of coding, I wrote a really cute, copyright-safe Flappy Bird knock-off called “Bumble B.” With a little more experience under my belt, I thought it’d be fun to beat a game like Bumble B with machine learning.
Last December I started looking into reinforcement learning. I decided I’d tackle Bumble B with an algorithm called Q-learning, which I am in no way qualified to explain. If you want a real explanation, see this resource that I referenced.
Q-learning trains a neural network by trying random stuff and learning from the stuff that works. Every run, the AI tries the game, and for every frame of the game, we either give the AI cookies or take away cookies. If we give it cookies, the AI will use the information in the frame to train the model. That means it’ll remember where the bee was with respect to the next gate and whether it decided to jump or not. If we revoke cookies, that means the AI did something stupid, and it’ll try to avoid doing the same stupid thing in the future.
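For anyone who wants the cookie metaphor in code, here is a minimal sketch of the usual Q-learning training target. It is not the code from this project: the function name `q_target`, the discount factor `gamma`, and the example numbers are all assumptions made for illustration. The idea is that the network's estimate for the action it just took gets nudged toward the cookies it received on this frame plus a discounted estimate of the best it can do from the next frame.

```python
def q_target(reward, next_q_values, done, gamma=0.99):
    """Value the network is nudged toward for the action it just took.

    gamma (the discount factor) is an assumed hyperparameter.
    """
    if done:
        # The bee died on this frame, so there is no future to account for.
        return reward
    # Otherwise: cookies now, plus discounted cookies from the best next move.
    return reward + gamma * max(next_q_values)


# Hypothetical frame: 0.1 cookies now, and the network currently rates the
# next frame's options as [do nothing, jump] = [0.5, 2.0].
alive_target = q_target(0.1, [0.5, 2.0], done=False, gamma=0.9)
dead_target = q_target(-1.0, [0.5, 2.0], done=True, gamma=0.9)
print(alive_target, dead_target)
```

Training then amounts to repeatedly adjusting the network so its output for the chosen action moves closer to this target.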
Neural network architecture for Bumble B. The inputs are the x and y distances to the next gate, and the outputs are jump or do nothing.
I tried this last December. First, I had to make a lightweight version of Bumble B in Python. As much as I adore Matplotlib, I needed something faster, so I went with John Zelle’s graphics.py. The bare-bones game is pretty straightforward to code. Then, I had to make it compatible with the Q-learning wrapper and design the neural network architecture. I initially had many hidden layers and many inputs for the bee velocity, position, the pipe positions, etc. Eventually I realized that the only inputs that really matter are the x and y distances to the center of the next gate, as depicted in the figure. And more neurons mean more training, so I only included one hidden layer with six neurons.
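To make the architecture concrete, here is a rough sketch of a network with two inputs, one hidden layer of six neurons, and two outputs, written with only the standard library. The layer sizes come from the description above; everything else is an assumption for illustration, including the tanh activation, the random initialization, the order of the outputs, and the helper names `init_network`, `forward`, `choose_action`, and `count_parameters`.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_OUTPUTS = 2, 6, 2   # sizes taken from the post
ACTIONS = ("do nothing", "jump")           # output order is an assumption


def init_network(seed=0):
    """Build random weights and zero biases for both layers."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases

    return layer(N_INPUTS, N_HIDDEN), layer(N_HIDDEN, N_OUTPUTS)


def forward(network, dx, dy):
    """Return (hidden activations, output values) for one frame."""
    (w1, b1), (w2, b2) = network
    inputs = [dx, dy]
    # tanh is an assumed activation; the post does not say which one was used.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w1, b1)]
    outputs = [sum(w * h for w, h in zip(row, hidden)) + b
               for row, b in zip(w2, b2)]
    return hidden, outputs


def choose_action(outputs):
    """Pick the index of the larger output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def count_parameters(network):
    """Total number of weights and biases in the network."""
    n_weights = sum(len(row) for weights, _ in network for row in weights)
    n_biases = sum(len(biases) for _, biases in network)
    return n_weights + n_biases


network = init_network()
hidden, outputs = forward(network, dx=0.5, dy=-0.2)
print(ACTIONS[choose_action(outputs)], count_parameters(network))
```

A network this small has only a few dozen weights and biases to learn, which is part of the appeal of cutting the inputs down to two.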
The trickiest part of this problem is the cookie-allocation algorithm. Some other folks who have done Flappy Bird with Q-learning give a small cookie reward for not dying, a big cookie reward for getting through a gate, and a big cookie deduction for dying. This is what I used at first. I spent some frustrating days over winter break watching my Python console output failure after failure, each run reporting that the bee learned to fold its wings and plummet into oblivion, incentivized by the meager pity points I was giving it for not jumping into a pipe or the ceiling. Q-learning for Flappy Bird-style games is tricky business because the game is so high stakes. Anyone who’s played it knows how tiny mistakes in timing lead to instant death. You can imagine how hard the game is, then, for a blind AI who only knows two numbers about its environment and learns through receiving cookies from a mysterious higher power.
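Here is a small sketch of that first cookie scheme, mostly to show how a do-nothing bee can still come out ahead. The reward values (0.1, 1.0, and -1.0) and the function names are made up for illustration; the post's actual numbers aren't given. The 40-frames-then-death run is likewise an assumed reading of a bee that never jumps.

```python
# Hypothetical cookie values; the real ones used in the project may differ.
SURVIVE_REWARD = 0.1   # small cookie for each frame survived
GATE_REWARD = 1.0      # big cookie for clearing a gate
DEATH_PENALTY = -1.0   # big deduction for dying


def baseline_reward(died, passed_gate):
    """Reward for a single frame under the survive/gate/death scheme."""
    if died:
        return DEATH_PENALTY
    if passed_gate:
        return GATE_REWARD
    return SURVIVE_REWARD


def freefall_return(frames_alive):
    """Total cookies for a bee that never jumps, then dies on the next frame."""
    rewards = [baseline_reward(died=False, passed_gate=False)
               for _ in range(frames_alive)]
    rewards.append(baseline_reward(died=True, passed_gate=False))
    return sum(rewards)


print(freefall_return(40))
```

Under these assumed values the freefall run still nets a positive total, so nothing in the scheme strongly discourages folding the wings and collecting pity points.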
Fast forward six months. I was chatting with my roommate about this problem when he suggested a cookie gradient to coax the bee towards the center of the next gate. This way, the bee is trained to survive, but also learns to yearn for something greater than 41 frames of cookie-flavored freefall. After some experimentation, I went with a 1/r^3 reward (maxing it out at 1 so it doesn’t blow up), where r denotes the distance to the center of the next gate.
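A minimal sketch of that cookie gradient, assuming r is the straight-line distance computed from the same x and y offsets the network sees. The function name `gate_reward` and the units of distance are assumptions, and how this term is combined with the survival and death cookies is not shown here.

```python
import math


def gate_reward(dx, dy):
    """Reward of 1/r^3, capped at 1, where r is the distance to the gate center."""
    r = math.hypot(dx, dy)
    if r <= 1.0:
        # Inside the cap: 1/r^3 would be 1 or more (and undefined at r = 0).
        return 1.0
    return 1.0 / r ** 3


# The reward drops off quickly as the bee strays from the gate center.
for distance in (0.5, 1.0, 2.0, 4.0):
    print(distance, gate_reward(distance, 0.0))
```

The steep falloff means the bee gets almost nothing far from the gate and a rapidly growing reward as it closes in, which is what pulls it toward the center.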
I started by training on a game where the gate width is 12.5 times the height of the bee. To visualize the network as we watch the AI play, I dusted off some code I wrote for making pretty neural network art. Let’s look at our champions.
This is the first bot that could jump through ten gates. The top left node represents the vertical distance to the center of the next gate while the bottom left node represents the horizontal distance. See how the top left neuron oscillates as the bee jumps up and down, and how the bottom left neuron cycles from large to small values as the bee passes the gate. White lines are weak connections, and orange and purple lines are stronger positive and negative connections. Watch how signals propagate through the network with quasi-rhythmic pulses. Normally, the “do nothing” output is activated, but when the hidden layer neurons collaborate in just the right way, the “jump” output outshines it, sending the bee skyward!
I call the next bot The Divebomber. It learned a cool strategy. Before each gate, it plummets dramatically, bouncing at the very last moment, and then it jumps a second time to get the air for the next divebomb. Notice how it seems to know exactly where the bottom pipe is. That’s a totally learned behavior! I never gave it any information about how far the pipes are. It only knows where the middle of the gate is. That information became encoded in the network after my cookie conditioning. I find the pulsing of this network particularly gorgeous.
Those earlier models were quick to train. When I decreased the gate width to ten times the bee height, it took 3828 tries (7 hours) to get past ten gates! Remarkably, this model continued to perform strongly when I lowered the width even further, to 8.25 times the bee height. Even though I designed the network with six hidden neurons, this network found a highly efficient solution that only needs three.
While I don’t think Q-learning was the most efficient way to attack this problem, I’m delighted by the elegance and sophistication that emerged from this silly cookie-fueled experiment. I know neural networks are just matrices. But watching the little guy fly while pulses bounce around this electronic brain we’ve created… doesn’t it feel just a tiny bit alive?
My code is available here.