Using Deep Reinforcement Learning To Play Atari Space Invaders

Chloe Wang
6 min read · Nov 19, 2021

If you aren’t familiar with DRL, I would recommend reading this article before this one. It provides context on many concepts that were used in this model that I do not discuss here.

Deep reinforcement learning (DRL) is a subset of machine learning that essentially combines neural networks with reinforcement learning to help software agents learn to achieve their goals. It brings together function approximation and target optimization, allowing agents to map states and actions to the rewards they lead to.

Based on this concept, I was able to teach a DRL agent how to play Atari Space Invaders.

My agent playing Atari Space Invaders


  • I was able to teach an RL agent how to play Atari Space Invaders using concepts from both RL and DL.
  • I used OpenAI Gym Retro to create the environment that my agent played in. It was released as part of an initiative encouraging DRL research that generalizes across many different but similar environments.
  • The neural network in this model is used to process frames from the game to understand where objects are and what the agent is doing. It uses 3 convolutional layers and 3 dense (fully-connected) layers to do so.
  • My model trained for 10k steps (~4 hrs) and played decently, but Google DeepMind recommends training for 10M-40M steps for optimal playing.
  • Here is a video of me explaining the project and watching the agent play.
  • This project was a replication of Nicholas Renotte’s DRL model.

Model Walkthrough

This model used TensorFlow 2.3.1, keras-rl2 (a reinforcement learning library), and OpenAI Gym Retro. OpenAI Gym Retro was released a few years ago and supports over 1,000 retro game environments.

Creating The Environment

‘NOOP’ is no action.

Here, I created the environment. Then I pulled the height, width, and channels out of a particular environment frame (equivalent to part of a state) so that I could pass it through a neural network. I also grabbed the actions and unwrapped them so I could see what they were.

Then I tested the environment with 5 episodes (5 games) to see how it performed. In each episode, I reset the environment and set the score to 0. As the episode was running, the agent chose a random action and provided information like the reward and state. I then used this information to output the score.

In general, the agent does okay, but that’s just because it’s taking random actions.

Building The DL Model

After that, I started building the deep learning model with Keras, NumPy, and TensorFlow. I defined a model that takes the height, width, and channels (pulled from the observation space) and the actions (pulled from the action space) as parameters. The shapes of these parameters define the model.

Inside the function I assigned Sequential() to model, which passes the frames through sequentially. Then using model.add, I could stack layers in the DL network.

The first layer is Convolution2D, where I passed through a few parameters, including:

  • The number of filters (32). The goal was to train the filters so that they could detect objects in the frames, like the enemies.
  • The size of the filters (8x8 units) and the number of strides.
  • The ReLU activation function and the shape of the frame.

Note that in this first layer, I actually pass through multiple frames from the memory data frames. The second and third layers are also convolutional layers, with some different filters, filter sizes, and strides.

Then I flattened the convolutional output into a single vector, allowing me to pass it through dense (fully-connected) layers. The first dense layer has 512 units. The second dense layer compresses this slightly to 256 units. The third dense layer has one unit per action: 6 units.

Finally, I built the model by assigning the output of the build_model function to model.
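A sketch of what build_model might look like (the filter sizes and strides here follow Nicholas Renotte’s tutorial and are assumptions; the leading 3 in the input shape comes from keras-rl2’s window length of 3 stacked frames):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Convolution2D, Dense, Flatten

def build_model(height, width, channels, actions):
    model = Sequential()
    # Three convolutional layers compress the stacked frames spatially.
    model.add(Convolution2D(32, (8, 8), strides=(4, 4), activation='relu',
                            input_shape=(3, height, width, channels)))
    model.add(Convolution2D(64, (4, 4), strides=(2, 2), activation='relu'))
    model.add(Convolution2D(64, (3, 3), activation='relu'))
    # Flatten to a single vector, then compress down to one output per action.
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

model = build_model(210, 160, 3, 6)  # Atari frames are 210x160x3
model.summary()
```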

Here, you can see the full model in a more readable form. Notice how the size of the outputs in the convolutional layers decreases, since the goal is essentially to compress the inputs. The flatten layer then transforms the output, allowing the dense layers to fully compress the image down to just 6 actions. The right column shows the parameters, and at the bottom, the total number of parameters: 34,812,326. This goes to show why it’s so useful to incorporate the data interpretation abilities of deep learning into reinforcement learning models.
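That 34,812,326 figure can be checked by hand. Assuming the layer shapes from the tutorial (8x8 stride-4, 4x4 stride-2, and 3x3 stride-1 convolutions over 210x160x3 frames, with a window length of 3), the count works out exactly:

```python
# Verify the quoted parameter count without TensorFlow.
def conv2d_params(filters, kh, kw, in_channels):
    return filters * (kh * kw * in_channels) + filters  # weights + biases

def dense_params(units, in_units):
    return units * in_units + units

def conv_out(size, k, s):
    return (size - k) // s + 1  # 'valid' padding output size

h, w = 210, 160
c1 = conv2d_params(32, 8, 8, 3)
c2 = conv2d_params(64, 4, 4, 32)
c3 = conv2d_params(64, 3, 3, 64)
h1, w1 = conv_out(h, 8, 4), conv_out(w, 8, 4)    # 51 x 39
h2, w2 = conv_out(h1, 4, 2), conv_out(w1, 4, 2)  # 24 x 18
h3, w3 = conv_out(h2, 3, 1), conv_out(w2, 3, 1)  # 22 x 16
flat = 3 * h3 * w3 * 64                          # 3 stacked frames, 64 filters
d1 = dense_params(512, flat)                     # by far the largest layer
d2 = dense_params(256, 512)
d3 = dense_params(6, 256)
total = c1 + c2 + c3 + d1 + d2 + d3
print(total)  # 34812326
```

Almost all of those parameters sit in the first dense layer, which is why flattening a large convolutional output is so expensive.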

Creating The RL Agent

Finally, I built the RL agent with keras-rl2. I imported SequentialMemory, which allows the agent to retain memory from previous games. I also imported LinearAnnealedPolicy, which decays the exploration rate as the agent gets closer to the optimal strategy, and EpsGreedyQPolicy, which makes the agent usually take the action with the best expected reward while still exploring occasionally.

I defined the agent and passed through the model and actions. Then I defined the policy by wrapping the epsilon greedy policy in the linear annealed policy. I defined the memory using sequential memory with a buffer limit of 1000 and a window length of 3 (meaning the model stores the past 3 frames at each step to capture what the previous steps looked like).

Then I defined the agent by passing through the model, memory, and policy. I also added a dueling network, which splits the value and advantage estimates and helps the model learn when taking an action matters and when it doesn’t. I also passed in the number of actions and the number of warm-up steps the model should take before learning begins.
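The dueling network splits the Q-value estimate into a state value V(s) and per-action advantages A(s, a), then recombines them. A minimal NumPy sketch of that recombination (the average-advantage variant):

```python
import numpy as np

def dueling_q(value, advantages):
    # Q(s, a) = V(s) + A(s, a) - mean(A): subtracting the mean advantage
    # keeps V and A identifiable, so the network can learn how good a state
    # is independently of which action it takes there.
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

q = dueling_q(2.0, [1.0, 0.0, -1.0])
print(q)  # [3. 2. 1.]
```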

Training and Testing

To start training the model, I assigned the output of the build_agent function to dqn. Then I added an optimizer for the neural network with a learning rate of 0.0001. I fitted the model, specified how long I wanted to train it (10,000 steps), and turned off visualization of the training to speed up the process.

Finally, I tested the model for 10 episodes and output the reward from each one. At this point in its training, it was playing decently and had started to develop a strategy of targeting the mothership (the purple ship at the top). Essentially, it was learning to play the game.

Ideally, the model should be trained for 10M–40M steps according to Google DeepMind, but I didn’t exactly have the time for that (considering that training for only 10k steps took 14,675 seconds, or about 4 hours). If the model were trained for that many steps, though, it would approach optimal play.
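For a rough sense of scale, a naive linear extrapolation of that training time (assuming time per step stays constant, which real training won’t exactly) shows why 10M steps was out of reach:

```python
# Naive linear extrapolation of the training time quoted above.
seconds_per_step = 14675 / 10_000            # measured: 10k steps in 14,675 s
days_for_10m = 10_000_000 * seconds_per_step / 86_400
print(f'{days_for_10m:.0f} days')            # on the order of 170 days
```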

Here is a video of me explaining the model and watching the agent play.

Have feedback or questions? Send me an email and I’ll be happy to respond!

You can check out more of what I’m up to in my quarterly newsletters. Sign up here.