A Look Into Neural Networks and Deep Reinforcement Learning
I would recommend reading my article that covers DL at a surface level before reading this article. It provides context on ML, perceptrons, and the basic operation of ANNs.
Machine learning (ML), which provides computers the ability to learn automatically and improve from experience without being explicitly programmed to do so, is the largest and most popular subset of AI. However, standard ML models often struggle with the high-dimensional data found in realistic problems, and they depend on hand-picked features rather than extracting the relevant features from a dataset themselves.
Deep learning (DL) is defined as a collection of statistical ML techniques that are used to learn feature hierarchies based on neural networks, which mimic the brain.
DL is capable of extracting features autonomously, meaning it requires little assistance from the programmer. Additionally, the structure of neural networks, the main component of any deep learning model, can partially solve the dimensionality challenge by mimicking the way our brains work. Thus, DL can overcome a number of the challenges that ML models face.
DL models are built primarily out of neural networks. There are many types of neural networks, each with different operations and applications, including ANNs and CNNs. These can then be applied to reinforcement learning, another subset of ML that allows an agent to learn in an unknown environment by taking actions that optimize for reward.
Artificial Neural Networks
Artificial Neural Networks (ANNs) are the baseline network in deep learning (i.e. they are regular neural networks). ANNs are sets of algorithms, modeled after the human brain and its neurons, that are designed to recognize patterns in data. They can interpret sensory data through a version of machine perception, labeling, or clustering raw input. The patterns they recognize are numerical and contained in vectors, which is the form that real-world data such as images, sound, and text must be translated into.
ANNs help cluster unlabeled data according to similarities among the example input data, such as comparing similar documents, images, or sounds.
They can also classify data according to a labeled dataset they were trained on, such as detecting faces and objects, or recognizing gestures in a video.
ANNs can extract features that are fed to additional algorithms for clustering and classification. In this way, ANNs can be thought of as components of larger ML applications for RL, classification, regression, etc.
Key Concepts of ANNs
ANNs are distinguished from single-hidden-layer neural networks (single-layer perceptrons) by their depth; they have more than one hidden layer. Each layer of nodes in an ANN trains on a distinct set of features according to the previous layer’s output. Thus, the farther you advance into an ANN, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
This leads us to a concept known as feature hierarchy, which is a hierarchy of increasing complexity and abstraction that makes ANNs capable of handling extremely large, high-dimensional datasets with billions of parameters that pass through (non-linear) activation functions. Feature hierarchy allows neural networks to discover patterns in unlabeled, unstructured data that humans would otherwise be unable to find. Thus, one of the problems DL solves best is processing and clustering realistic, raw, unlabeled data.
ANNs also perform automatic feature extraction without human intervention, unlike normal ML algorithms. A major part of designing a traditional ML algorithm is working out which features in the data are relevant. This part of the process usually takes programmers a lot of time, and its effectiveness depends on the skillfulness of the programmers themselves.
ANNs overcome this and perform feature extraction that can take data scientists years to accomplish. By using each node layer, ANNs can learn features automatically as they repeatedly try to reconstruct the input from which they draw their samples, attempting to minimize the difference between the network’s predictions and the probability distribution of the input itself.
These ANNs can recognize correlations between relevant features and optimal results — they draw connections between feature signals and what those features represent, either as a full reconstruction or with labeled data.
ANNs end in an output layer — a logistic classifier that assigns a likelihood to a particular outcome/label.
Artificial Neural Network Elements
ANNs use a number of elements, such as nodes, weights, and functions, to operate.
All forms of deep learning consist of stacked neural networks, which are layers made of nodes. A node is a spot where a computation occurs, and is loosely modeled after a neuron in the brain. On a technical level, it combines input data with coefficients known as weights, which amplify or dampen that input. The weights assign significance to the different inputs, which helps the ANN determine which input is the most helpful in classifying data without error.
In the full ANN, input-weight products are summed. That sum is then passed through a node’s activation function to determine how far that signal should progress and ultimately affect the final outcome of the ANN.
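To make the arithmetic of a single node concrete, here is a minimal sketch in plain Python. The input values, the weights, and the choice of a sigmoid activation are all assumptions picked for illustration, and the optional bias term is one that comes up later in the article.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(inputs, weights, bias=0.0):
    """Sum the input-weight products, add a bias, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)

# Hypothetical values: the second input carries the largest weight,
# so it has the most influence on the node's output.
inputs = [0.5, 1.0, -1.0]
weights = [0.2, 0.8, 0.1]
activation = node_output(inputs, weights)
print(round(activation, 3))
```

Here the weighted sum is 0.1 + 0.8 - 0.1 = 0.8, and the sigmoid turns that into a signal strength of roughly 0.69.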
Note that a node layer is a row of nodes that turn on or off as the input is fed through the network, and that each layer’s output is also the next layer’s input. By pairing models’ adjustable weights with input features, we can assign significance to those features.
The goal of an ANN is to arrive at a position of least error as quickly as possible. Each step of an ANN is a guess (an error measurement and a slight update in weights). The model improves over time as it consistently updates its parameters. By incrementally adjusting the weight coefficients and comparing the outputs, the ANN slowly notices which features of the data are important.
Essentially, ANNs are corrective feedback loops:
- input * weight = guess
- real data - guess = error
- error * weight’s contribution to error = adjustment
This process is known as backpropagation. It is a key supervised learning method for ANNs. Weights are randomly assigned to each input, and if the output is incorrect, the weights are incorrect, so the model ‘propagates backwards’ to update the weights and reduce error.
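The three-step feedback loop above can be sketched for the simplest possible case: one input, one weight, and one training example. This is only an illustration under assumed values. The learning rate is a name introduced here, and the weight starts at 0 rather than a random value so the result is reproducible.

```python
def train_single_weight(x, target, weight=0.0, learning_rate=0.1, steps=50):
    """Repeat the guess -> error -> adjustment loop for one weight."""
    for _ in range(steps):
        guess = x * weight                       # input * weight = guess
        error = target - guess                   # real data - guess = error
        adjustment = learning_rate * error * x   # error * weight's contribution
        weight += adjustment
    return weight

# Hypothetical example: learn that the output should be 3 times the input.
learned_weight = train_single_weight(x=2.0, target=6.0)
print(round(learned_weight, 4))
```

In this linear case the weight’s contribution to the error is just the input x, so each pass moves the weight a little closer to 3.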
Gradient descent is a common optimization algorithm that adjusts weights according to error (the word ‘gradient’ can be thought of as slope). In gradient descent, you’re observing how the ANN’s error and a single weight are related to each other to determine which value of the weight will produce the least error. In other words, what value of the weight leads to the correct output?
As an ANN learns, it adjusts the weights so that it can map a signal to a meaning correctly. This relationship can be modeled by the derivative dE/dw (the derivative of the error E with respect to a weight w), which measures the degree to which a slight change in weight causes a slight change in error. Since each weight’s effect passes through many activations and sums over multiple layers, the chain rule can be used to look back through the network’s activations and outputs to arrive at the weight in question and its relationship to the overall error.
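As a rough illustration of the chain rule at work, the sketch below builds a tiny two-layer network (one sigmoid hidden node feeding one linear output node) and computes the derivative of the error with respect to the first weight in two ways: analytically, by multiplying the local derivatives together, and numerically, by nudging the weight slightly. The network shape, the squared-error measure, and all the numbers are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Tiny network: hidden = sigmoid(w1 * x), output = w2 * hidden."""
    hidden = sigmoid(w1 * x)
    output = w2 * hidden
    return hidden, output

def error(x, target, w1, w2):
    _, output = forward(x, w1, w2)
    return 0.5 * (output - target) ** 2

def error_gradient_w1(x, target, w1, w2):
    """Chain rule: dE/dw1 = dE/doutput * doutput/dhidden * dhidden/dz * dz/dw1."""
    hidden, output = forward(x, w1, w2)
    d_error_d_output = output - target
    d_output_d_hidden = w2
    d_hidden_d_z = hidden * (1.0 - hidden)  # derivative of the sigmoid
    d_z_d_w1 = x
    return d_error_d_output * d_output_d_hidden * d_hidden_d_z * d_z_d_w1

x, target, w1, w2 = 1.5, 1.0, 0.4, -0.3
analytic = error_gradient_w1(x, target, w1, w2)

# Numerical check: how much does the error change for a tiny change in w1?
eps = 1e-6
numeric = (error(x, target, w1 + eps, w2) - error(x, target, w1 - eps, w2)) / (2 * eps)
print(analytic, numeric)
```

The two numbers should agree to several decimal places, which is a handy sanity check whenever gradients are derived by hand.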
By this process, we can think of training an ANN as adjusting the model’s weights in response to error until the error can’t be reduced anymore.
Backpropagation and gradient descent are not the same. Backpropagation is the process of deriving the gradient (the generalization of the derivative to multivariable functions; the local slope of a function, which allows you to predict the effect of a small change in any direction), while gradient descent is the process of using that gradient to go back and update the weights.
Types of Activation Functions
Remember that in an ANN, the input is fed to an input layer, and nodes perform transformations on the inputs using weights and biases. The outputs of these nodes pass through activation functions before they move on to the next hidden layer.
Many different activation functions exist. Without them, you would only be able to perform linear transformations on each node’s output, which isn’t ideal for trying to model complex, multi-dimensional problems.
Activation functions introduce a non-linearity that allows the model to recognize complex patterns in the data.
The first step is to introduce a threshold-based classifier that determines whether or not a neuron should be activated based on the value from its transformation. If the input to the activation function is greater than the threshold, then the neuron is activated. Otherwise, it isn’t activated and the output isn’t considered for the next hidden layer.
Binary Step Function
Binary step functions are essentially binary classifiers, meaning they aren’t useful when there are many classes in a target variable. A binary step function states that if the input is less than 0, then the output is 0, and if it’s greater than 0, the output is 1. When the output is 1, the neuron is activated.
A step function is a drawback in the backpropagation process because its gradient is 0 everywhere except at the threshold itself, where it is undefined. Gradients are calculated to update weights and biases, so if the gradient always equals 0, the weights and biases are never updated and error isn’t reduced.
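A short sketch shows why a step activation gives backpropagation nothing to work with. Treating an input exactly at the threshold as "activated", and reporting the gradient there as 0, are conventions chosen for this example.

```python
def binary_step(x, threshold=0.0):
    """Output 1 (activated) at or above the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

def binary_step_gradient(x, threshold=0.0):
    """The step is flat on both sides of the threshold, so its slope is 0.
    (Exactly at the threshold the slope is undefined; 0 is used here.)"""
    return 0.0

sample_inputs = [-2.0, -0.1, 0.3, 5.0]
outputs = [binary_step(v) for v in sample_inputs]
gradients = [binary_step_gradient(v) for v in sample_inputs]
print(outputs)
print(gradients)
```

However far the input is from the threshold, the gradient stays at 0, so a weight update that is scaled by this gradient never changes anything.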
The binary step function was binary because its output had no component of x. Linear functions change this: the activation is proportional to the input. The gradient, rather than being 0, is a constant that does not depend on the input x.
However, during backpropagation, the updating factor would be the same everywhere (the derivative is constant), so the gradient carries no information about the input. In addition, stacking layers of linear functions only produces another linear function, so the ANN won’t be able to identify complex patterns in the data. This makes linear functions suitable only for simple tasks.
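One way to see the limits of linear activations is to stack two linear layers and notice that a single linear layer does exactly the same job. The slopes and intercepts below are arbitrary values chosen for illustration.

```python
def linear(x, slope, intercept):
    return slope * x + intercept

def two_linear_layers(x):
    """Two stacked layers, each with a linear activation."""
    hidden = linear(x, slope=2.0, intercept=1.0)
    return linear(hidden, slope=-3.0, intercept=0.5)

def single_equivalent_layer(x):
    """-3 * (2x + 1) + 0.5 simplifies to -6x - 2.5: still one straight line."""
    return linear(x, slope=-6.0, intercept=-2.5)

test_points = [-2.0, 0.0, 0.5, 3.0]
stacked = [two_linear_layers(x) for x in test_points]
collapsed = [single_equivalent_layer(x) for x in test_points]
print(stacked)
print(collapsed)
```

Adding more linear layers only changes the slope and intercept of the same straight line, which is why depth alone does not help without a non-linear activation.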
The sigmoid function is one of the most commonly used non-linear activation functions. It squashes any input value into the range between 0 and 1. Since it’s non-linear, a network whose nodes use the sigmoid activation function produces a non-linear output. This allows it to detect complex patterns in the data.
Sigmoid functions face challenges in backpropagation too, since the network doesn’t really learn near the asymptotes where the gradient value approaches 0.
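The flattening near the asymptotes is easy to see numerically. The sketch below uses the standard sigmoid formula and its derivative, s(x) * (1 - s(x)); the input values are picked only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

gradient_at_center = sigmoid_gradient(0.0)   # steepest point of the curve
gradient_in_tail = sigmoid_gradient(10.0)    # far out along the asymptote
print(gradient_at_center, gradient_in_tail)
```

The gradient peaks at 0.25 in the center and shrinks to a tiny fraction of that in the tails, so weight updates that pass through a saturated sigmoid node become very small.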
The tanh function is similar to the sigmoid function, but it is symmetric around the origin and its range is from -1 to 1. This also means that the inputs to the next layers won’t always be positive, since the output will always be between -1 and 1. All the other properties of this function are the same as the sigmoid, where it is continuous and differentiable at all points.
The tanh function is often preferred over the sigmoid function because it’s symmetric around the origin (0-centered) and its gradients are not restricted to move in a certain direction (-1 to 1 is both negative and positive, vs 0 to 1 which is only positive).
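The close relationship between the two functions can be checked directly: tanh is a sigmoid that has been stretched and shifted so that it is centered on 0. The identity tanh(x) = 2 * sigmoid(2x) - 1 used below is a standard one, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh written as a rescaled, recentered sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

sample_inputs = [-2.0, -0.7, 0.0, 0.7, 2.0]
tanh_outputs = [math.tanh(x) for x in sample_inputs]
sigmoid_outputs = [sigmoid(x) for x in sample_inputs]

# Over inputs that are symmetric around 0, tanh outputs average to 0,
# while sigmoid outputs average to 0.5.
tanh_mean = sum(tanh_outputs) / len(tanh_outputs)
sigmoid_mean = sum(sigmoid_outputs) / len(sigmoid_outputs)
print(tanh_mean, sigmoid_mean)
```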
ReLU (Rectified Linear Unit)
The ReLU is another non-linear activation function, but it’s advantageous in that it doesn’t activate all the nodes at the same time. The nodes are deactivated if the output of the linear transformation is less than 0; so for negative input values, the result is 0, and the node isn’t activated.
ReLU may face a similar problem to the binary step function: during backpropagation, the gradient equals 0 for every negative input, so nodes whose inputs stay negative are never updated and become ‘dead nodes’.
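Here is a small sketch of how a ‘dead node’ comes about. The starting weight, the inputs, the target, the squared-error measure, and the learning rate are all assumptions chosen so that the node’s pre-activation value is negative for every input.

```python
def relu(z):
    return max(0.0, z)

def relu_gradient(z):
    """Slope is 1 for positive inputs and 0 for negative inputs."""
    return 1.0 if z > 0 else 0.0

weight = -1.0            # hypothetical weight that has drifted negative
inputs = [0.5, 1.0, 2.0]
target = 1.0
learning_rate = 0.1

for x in inputs:
    z = weight * x                     # always negative for these inputs
    output = relu(z)                   # so the node outputs 0
    # Gradient of 0.5 * (output - target)^2 with respect to the weight.
    gradient = (output - target) * relu_gradient(z) * x
    weight -= learning_rate * gradient

print(weight)
```

Even though the node is wrong every time (it outputs 0 when the target is 1), the weight never moves, because the ReLU gradient is 0 for each of these inputs.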
Convolutional Neural Networks
CNNs are a type of ANN used for processing structured arrays of data, such as images. They’re a type of feed forward neural network, often with 20–30 layers. These hidden layers are generally convolutional layers followed by activation and pooling layers. CNNs are often used in problems for visual applications, so they become relevant when introducing deep reinforcement learning.
Feed forward neural networks are ANNs in which each node is only connected to nodes in the next layer, so information is only processed in one direction. The opposite would be recurrent neural networks, where some pathways loop back to earlier nodes.
Regular ANNs don’t scale well to process full images. For instance, for an image that is 32 x 32 x 3 in size (32 pixels wide, 32 pixels high, 3 color channels), a single fully connected node in the first hidden layer would have 3072 weights (32*32*3). This fully connected structure is impractical at larger sizes (at 200 x 200 x 3, each node would have 120,000 weights). This full connectivity isn’t necessary and can easily lead to overfitting (when a model is too attuned to one dataset and cannot be applied to any other dataset).
CNNs take advantage of the fact that their inputs consist of data arrays, so they constrain the architecture so that it’s more practical.
The layers of a CNN have 3 dimensions for width, height, and depth (depth as in the third dimension of an activation volume). The nodes in a layer are only connected to a small region of the layer before it, instead of all the nodes being fully connected. This is how a CNN can take a 32 x 32 x 3 image and output a 1 x 1 x 10 volume, since the CNN will have reduced the full image into a single vector of class scores arranged along the depth dimension.
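The difference in scale between full connectivity and local connectivity can be worked out with a few multiplications. The image sizes come from the example above; the 5 x 5 region size is a hypothetical value chosen for comparison.

```python
def dense_weights_per_node(width, height, channels):
    """A fully connected node needs one weight per input value."""
    return width * height * channels

def local_weights_per_node(region_size, channels):
    """A locally connected node only needs weights for its small region."""
    return region_size * region_size * channels

small_image = dense_weights_per_node(32, 32, 3)
large_image = dense_weights_per_node(200, 200, 3)
local_region = local_weights_per_node(5, 3)   # assumed 5 x 5 region
print(small_image, large_image, local_region)
```

The locally connected node needs 75 weights no matter how large the image is, while the fully connected node grows from 3,072 to 120,000 weights.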
From this process, CNNs are much better at picking up image patterns like lines, gradients, circles, and even complex patterns such as eyes. They can operate on raw images and do not need preprocessing.
What makes CNNs distinctive is that they have convolutional layers, which, when stacked on top of each other, can recognize increasingly sophisticated shapes. For example, 3–4 convolutional layers may be able to recognize handwritten digits, while 25 layers could recognize faces.
Convolutional layers are the key building block of CNNs. You can visualize a convolutional layer as many small square matrices called convolutional kernels, which slide over an image to look for patterns. When a part of the image matches the kernel’s pattern, the kernel returns a large positive value, and when there is no match, the kernel returns a 0 or negative value.
For example, picture a 9x9 image of a plus sign: a matrix with 1’s along its middle row and middle column and 0’s everywhere else.
If we wanted to know if this image had vertical lines, we could slide a vertical line convolutional kernel over the image at every position. At each position, we would multiply each element of the kernel by the element of the image that it covers, and sum up the results.
Here, our vertical line kernel is a 3x3 matrix with 1’s in the 2nd column and 0’s in the 1st and 3rd columns to represent a vertical line. Because the kernel is 3 elements wide, it fits at only 7 different positions horizontally (and 7 vertically) in the plus sign matrix. Applying the vertical line kernel to the plus sign image therefore produces a new image that is a 7x7 matrix.
For the parts of the original matrix that contained a vertical line, the kernel returned a value of 3. In places that had horizontal lines, it returned a value of 1, and in empty areas, 0. This process is what allows us to detect features in images.
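The plus sign example can be reproduced in a few lines of plain Python. The exact layout of the image is an assumption here (a 9x9 matrix with 1’s along the middle row and middle column), chosen because it is consistent with the 7x7 output and the values of 3, 1, and 0 described above. Note that, as in most CNN libraries, the kernel is slid over the image without being flipped.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside
    the image; at each position, multiply element-wise and sum."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Assumed layout: 1's along the middle row and middle column of a 9x9 grid.
size = 9
mid = size // 2
plus_image = [[1 if r == mid or c == mid else 0 for c in range(size)]
              for r in range(size)]

vertical_kernel = [[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]]

feature_map = convolve2d_valid(plus_image, vertical_kernel)
for row in feature_map:
    print(row)
```

The middle column of the printed 7x7 matrix is all 3’s (the vertical bar), the rows that overlap the horizontal bar contain 1’s, and everything else is 0.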
In practice, a convolution kernel also contains weights and biases, similar to the formula for linear regression. That way, each input pixel is multiplied by the weight and the bias is added.
An Example of a Convolution On An Image
Let’s perform a convolution on an image by detecting lines using a more sophisticated 9x9 convolutional kernel.
Then, let’s take a 204 x 175 image of a cat, which is represented as a matrix with values from 0 to 1, where 1 is white and 0 is black.
To perform the convolution, we slide our 9x9 vertical line kernel over the image, where it acts as a filter for vertical line detection. The vertical stripes on the cat’s head are highlighted in the output image, which is 8 pixels smaller than the input in each dimension because a 9x9 kernel only fits at positions where it lies entirely inside the image.
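The 8-pixel shrinkage follows from simple arithmetic, assuming the kernel moves one pixel at a time and the image is not padded (both assumptions, since the article doesn’t state them):

```python
def valid_output_size(image_size, kernel_size):
    """Number of positions where the kernel fits fully inside the image,
    assuming a stride of 1 and no padding."""
    return image_size - kernel_size + 1

output_shape = (valid_output_size(204, 9), valid_output_size(175, 9))
print(output_shape)
```

A 9x9 kernel therefore turns the 204 x 175 cat image into a 196 x 167 output.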
Ultimately, although the concept is simple, convolutional operations have the ability to detect simple features like corners or vertical lines. When a layer is made up of a series of convolution kernels (a convolutional layer) the layers can add up to find complex features.
The feature detector kernels aren’t programmed by humans, but are in fact learned by neural networks during training and serve as the first stage of the image recognition process.
Applying Activation Functions
For our example, we can apply non-linear activation functions like the sigmoid or ReLU function. By introducing non-linearity, our CNN can create a much more defined output. If this activation function weren’t present, all the layers of the neural network could be condensed into a single matrix multiplication, and the network wouldn’t be as effective in detecting vertical lines.
Prerequisites To Deep Reinforcement Learning
Deep reinforcement learning essentially combines neural networks with reinforcement learning to help software agents learn to achieve their goals. It brings together function approximation and target optimization, and allows agents to map state and actions to the rewards they lead to (Pathmind).
Reinforcement learning (RL) is a type of ML where an agent is placed in an unknown environment and learns how to behave in that environment by performing certain actions and observing the rewards it gets from those actions. It refers to goal-oriented algorithms that learn to achieve a complex objective. RL models perform particularly well in ambiguous, real-life environments, where they can choose among many possible actions, rather than being limited to the fixed options of a repeated video game.
Relevant Terms —
Agent: The object that takes actions.
Action (A): The set of all possible moves an agent can make. a is a specific action contained within the set.
Discount factor (γ): The factor multiplied by future rewards as discovered by an agent in order to dampen the rewards’ effect on the agent’s choice of action. It’s designed to make future rewards worth less than immediate rewards, so the agent favors rewards it can obtain sooner.
State (S): A situation where the agent finds itself. An instantaneous configuration between the agent and other significant objects like tools, obstacles, enemies, or prizes.
Environment: The world through which an agent moves, and which responds to the agent. It takes the agent’s current state and action as input and returns the agent’s reward and next state as output.
Reward (R): Feedback by which the agent measures the success or failure of its actions in a given state.
Policy (π): The strategy an agent employs to determine its next action based on its current state. It maps states to actions with the highest reward.
Value (V): The expected long-term return with discount, as opposed to the short-term reward (R). Discounting makes rewards that arrive further in the future count for less.
Action value (Q): The expected long-term return with discount and relation to the current action. Q maps state-action pairs to rewards.
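To give a feel for how the discount factor shapes the long-term return that V and Q estimate, here is a minimal sketch. The reward sequence and the discount values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, shrinking each one by gamma for every step it lies
    in the future: r0 + gamma * r1 + gamma^2 * r2 + ..."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

rewards = [1.0, 1.0, 1.0]            # hypothetical rewards over three steps
patient = discounted_return(rewards, gamma=1.0)
impatient = discounted_return(rewards, gamma=0.5)
print(patient, impatient)
```

With no discounting the three rewards add up to 3.0, while with a discount factor of 0.5 they are worth 1 + 0.5 + 0.25 = 1.75, so the same future rewards matter less.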
In essence, environments are functions that transform an action taken in the current state into the next state and reward, and agents are functions that transform the new state and reward into the next action. Understand that in most RL models, we can know and set the agent’s function, but not the environment’s function.
The RL Learning Process
In the basic RL learning process, the environment first sends a state to the agent, and the agent takes an action in response. The environment then responds to that action with the next state and a reward. The agent updates its knowledge based on that reward, and the loop continues until the environment sends a terminal state.
In short, the agent is put in a Stage 0 environment and takes a random action. If it receives a reward, it moves to a further stage. If it doesn’t, it stays in Stage 0.
This loop can be modeled by the following diagram.
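The same loop can also be sketched in code. Everything here is hypothetical: the toy corridor environment, its reward of 1 for reaching the last cell, and the purely random policy. The learning step, where the agent would update its knowledge from the reward, is left out to keep the loop itself visible.

```python
import random

class CorridorEnvironment:
    """Hypothetical toy environment: walk from cell 0 to the goal at cell 4."""
    GOAL = 4

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return
        (next_state, reward, done)."""
        self.position = min(max(self.position + action, 0), self.GOAL)
        done = self.position == self.GOAL
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def run_episode(env, policy, max_steps=1000):
    """Environment sends a state, agent acts, environment answers with the
    next state and a reward, until a terminal state is reached."""
    state = env.reset()
    total_reward = 0.0
    steps_taken = 0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        steps_taken += 1
        if done:
            break
    return state, total_reward, steps_taken

rng = random.Random(0)   # fixed seed so the run is repeatable

def random_policy(state):
    return rng.choice([-1, 1])

final_state, total_reward, steps_taken = run_episode(CorridorEnvironment(), random_policy)
print(final_state, total_reward, steps_taken)
```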
RL algorithms are time efficient because they can run through the same states over and over again while experimenting with different actions, allowing them to infer which actions are the best from which states. This gives them the potential to learn much more than humans, since they can ‘relive’ the same ‘moment’ an innumerable number of times.
RL can really only be thought about sequentially in terms of state-action pairs that occur one after another. It judges actions based on the results they produce, since its goal is to learn a sequence of actions that’ll lead an agent to achieving its goal.
The goal of RL is to pick the best known action for any given state, so actions need to be ranked and assigned values relative to one another. Since these actions are state-dependent, they form state-action pairs (an action taken from a certain state).
Q-functions take an agent’s state and action as input and map them to probable rewards. RL is the process of running the agent through sequences of state-action pairs, observing the rewards, and adapting the predictions of the Q-function to those rewards until it accurately predicts the best path for that agent to take. Choosing the best predicted action in each state gives the policy, which was discussed earlier.
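One common way to adapt a Q-function to observed rewards is the tabular Q-learning update. The article doesn’t name a specific algorithm, so treat this as one possible instance rather than the method being described; the states, actions, learning rate, and discount factor below are all hypothetical.

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, actions,
                      learning_rate=0.5, gamma=0.9):
    """Nudge Q(state, action) toward reward + gamma * best Q(next_state, .)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += learning_rate * (target - q_table[(state, action)])

def greedy_action(q_table, state, actions):
    """The policy: pick the action with the highest predicted value."""
    return max(actions, key=lambda a: q_table[(state, a)])

actions = ["left", "right"]
q_table = defaultdict(float)   # every unseen state-action pair starts at 0

# Moving right from state 3 reaches the goal (state 4) and earns a reward of 1.
q_learning_update(q_table, state=3, action="right", reward=1.0,
                  next_state=4, actions=actions)
# Moving right from state 2 earns nothing yet, but leads to a promising state.
q_learning_update(q_table, state=2, action="right", reward=0.0,
                  next_state=3, actions=actions)

print(q_table[(3, "right")], q_table[(2, "right")])
print(greedy_action(q_table, 2, actions))
```

After these two updates, the value of the reward has started to flow backwards from state 3 to state 2, which is how the Q-function gradually learns the best path.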
RL tries to model the complex probability distribution of rewards in relation to many state-action pairs, which is why RL problems are usually formalized as Markov decision processes (a framework for sequential decision-making in which the next state and reward depend only on the current state and action).
Neural Networks and Deep Reinforcement Learning
Neural networks are function approximators, so they’re useful in RL when the state or action spaces are too large to be completely known, as they are in most real-world environments.
Neural networks can approximate value functions: they can learn to map states to values, or state-action pairs to Q-values. Instead of using a lookup table to store, index, and update all possible states and their values (which is not practical for large problems), a neural network can sample from the state or action space and learn to predict how valuable each state or action is relative to the agent’s goal.
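The idea of learning from a sample of states and then predicting values for states that were never seen can be illustrated with the simplest possible function approximator. In this sketch a single linear unit stands in for the neural network, and the ‘true’ value of each state (0.5 + 0.05 * state) is invented purely for the demonstration.

```python
def train_value_approximator(samples, learning_rate=0.1, epochs=2000):
    """Fit value ~ weight * feature + bias with stochastic gradient descent."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for feature, target_value in samples:
            prediction = weight * feature + bias
            error = target_value - prediction
            weight += learning_rate * error * feature
            bias += learning_rate * error
    return weight, bias

def feature_of(state):
    """Scale states 0..10 into the range 0..1 to keep training stable."""
    return state / 10.0

def assumed_true_value(state):
    """Made-up value function, used only to generate training targets."""
    return 0.5 + 0.05 * state

# Only some of the states are ever sampled...
sampled_states = [0, 2, 4, 6, 8, 10]
samples = [(feature_of(s), assumed_true_value(s)) for s in sampled_states]
weight, bias = train_value_approximator(samples)

# ...yet the approximator can still estimate the value of an unseen state.
predicted_value = weight * feature_of(5) + bias
print(round(predicted_value, 4))
```

A lookup table would have no entry for state 5, whereas the approximator generalizes from its neighbors. A real deep RL system would replace the linear unit with a multi-layer network.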
In RL, CNNs can be used to recognize an agent’s state when the input is visual, like the screen of a video game or the terrain beneath a drone (image recognition applications). Given an image that represents a state, a CNN can rank the actions possible to perform in that state.
For example, if you were trying to train an agent to play Super Mario, running right may return 5 points, jumping 7, and running left 0.
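Ranking the actions then comes down to sorting them by their predicted values. The numbers below are the ones from the Super Mario example; the third action is assumed to be running left, and in a real system the values would come from the network’s output rather than a hard-coded dictionary.

```python
# Hypothetical predicted values for one game screen (state).
predicted_values = {"run right": 5.0, "jump": 7.0, "run left": 0.0}

# Rank the actions from most to least valuable.
ranked_actions = sorted(predicted_values, key=predicted_values.get, reverse=True)
best_action = ranked_actions[0]
print(ranked_actions)
print(best_action)
```

For this state the agent would choose to jump, since that action has the highest predicted value.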
In the beginning of RL, neural network weights are initialized randomly. Using feedback from the environment, the neural network updates the weights to improve its interpretation of state-action pairs. This is similar to how weights are updated in supervised learning, except that supervised learning relies on labeled data, whereas here the feedback comes from rewards.
Overall, deep reinforcement learning is highly effective because it allows agents to learn in incredibly complex, high-dimensional environments, which is practical when trying to learn in the real world.
Case Study: DeepMind’s AlphaZero
One of the most famous deep reinforcement learning systems is AlphaZero from Google DeepMind, which taught itself how to play chess, shogi, and Go from scratch, and beat a world-champion program at each game. To learn each game, an untrained neural network played millions of games against itself, with reinforcement learning adjusting it along the way. It started by playing randomly, but as it adjusted its parameters to optimize for a higher reward, it became a master at each game.
AlphaZero exhibited the quick learning capabilities of deep reinforcement learning systems. Over the course of the 9 hours it trained, it played 44 million games against itself on specialized Google hardware. After 2 hours, it performed better than most human players, and after 4, it could beat the best chess engine in the world.
So far, it is one of the most remarkable demonstrations of deep reinforcement learning, and the technology behind it is quickly being applied to problems in many other industries.