# An Introduction to Reinforcement Learning | by Angjelin Hila | May, 2024

## A deep dive into the rudiments of reinforcement learning, including model-based and model-free methods

## What is Reinforcement Learning?

One path toward engineering intelligence lies with emulating biological organisms.

Biological organisms transduce information from the environment, process it (what cognitive science studies), and output behaviour conducive to survival. Such behaviours, at the most basic level, involve foraging for food, reproducing, and avoiding harm. They also involves the wide spectrum of human activity such as play, creativity, problem-solving, design and engineering, socializing, romance, and intellectual life.

Now, how do we engineer a system that is able to do all of the above?

If we were to model a simple organism as a function of some environment, we would need a model of the agent, the environment, and some function that moves that agent from some present state to some desired state.

In psychology, two major schools of thought purport to explain human behaviour: behaviourism and cognitive science. Behaviourists understand behaviour as a function of learning mechanisms, where learning can be attributed to output behaviour. Cognitive science, on the other hand, models agent interaction with the environment through the information-processing approach. In this approach, the agent transduces external stimuli into an internal representation initially by the senses and subsequently subjects it to layers of transformation and integration all the way up to thinking and reasoning faculties, before returning some behaviourial output. In the former approach, learning is understood largely as a function of environmental conditioning, whereas in the latter, mental representations are considered indispensable in predicting behaviour. Reinforcement learning borrows mostly from the behaviorist approach where environmental reward dictates the evolution of the agent within search space.

Operant conditioning, the school of behaviorist psychology that reigned in the 1950s-60s, defined learning as the product of the environmental mechanisms of reward and punishment. Precursors to operant conditioning included the Law of Effect proposed by Edward Thorndike which proposed that behaviors that produce satisfying effects are more likely to recur, whereas behaviors that produce dissatisfying effects less likely. B.F. Skinner operationalized effects in terms of reinforcement and punishment. Reinforcement increases the likelihood of the recurrence of a behavior, whether it be approach or removal of the inhibitory factor. Approach is termed positive reinforcement, and the reversal of avoidance, negative reinforcement. An example of positive reinforcement includes becoming good at a sport and winning often. An example of negative reinforcement includes removing the inhibitory stimulus, e.g. the school bully who taunts you during games. Operant conditioning predicts that you’re likely to repeat behaviours that receive the greatest reward. Punishment, on the other hand, consists of controlling the behavioural effect by either adding a negative consequence (positive punishment) or removing the reward associated with the behaviour (negative punishment). When fouling causes expulsion from the game, it illustrates positive punishment. When you perform poorly and lose games it illustrates negative punishment, which may cause avoidance of playing in the future.

Much of the game of life in human society is replete with secondary reinforcers or socially constructed rewards and punishments that condition behaviour. These include money, grades, university admittance criteria, rules for winning and losing games, which build upon natural reinforcers that are closer to biological needs like food, reproduction, and social approbation.

Memory plays an important role in learning because it enables the retention of prior experiences. Evidence shows that memory encodes the rewards and punishments more so than the content of the experience (Tyng et al., 2017). Subjects are likely to remember rewarding experiences fondly and thereby likely to repeat them, and negative experiences unfavourably, and likely to avoid them in the future. The mechanisms of memory are complicated and diverse, and evidence suggests that subjects play an active role in reshaping their memories by recalling them (Spens & Burgess, 2024). This fact complicates the picture for behaviorism because the subject’s interpretation of an experience can be retrospectively modified and reframed, making prediction on conditioning principles alone difficult. Furthermore, rewards and punishments oversimplify the landscape of positive and negative affects, which comprises a complex terrain of valleys and troughs, nested dependencies, and is better modeled as a continuous spectrum rather than a binary space.

These complexities notwithstanding, reinforcement learning comprises an array of mathematical techniques that adapt the behavioural ontology of agent, environment, and rewards in order to model artificial intelligence. As we will see below, aspects of reinforcement learning emerge from control theory, whose precursors extend into physics and engineering, and other aspects emerge more directly from psychology and biology. Since both the objects of control theory and living systems comprise dynamical systems that must stay within an optimal range of far-from thermodynamic equilibrium, the underlying principles are amenable to the goals of reinforcement learning and artificial intelligence more broadly.

Dynamic programming emerged chiefly from control theory as a mathematical optimization method that enables larger problems to be broken down recursively into sub-problems as a means of solving the larger problem. Generally speaking, recursion refers to a function that passes itself directly or indirectly as a parameter.

In this article, we will focus chiefly on the elements of dynamic programming, with a focus on discrete and finite games. However, dynamic programming exhibits a variety of limitations that are in part addressed by model-free approaches to reinforcement learning and others by combining dynamic programming with artificial neural networks, once called neurodynamic programming. More broadly, the marriage of reinforcement learning and artificial neural networks is termed deep reinforcement learning. These models incorporate the strengths of deep learning within reinforcement learning techniques. The most popular of these algorithms include the Deep Q-Networks (DQN), which were introduced by DeepMind in 2013. This family of algorithms leverages deep learning to approximate the Q-function. Since function-approximation is one of the shortcomings of reinforcement learning, these algorithms represent a major improvement of the reinforcement paradigm.

Other shortcomings addressed by DQN include conferring flexibility in capturing nonlinear dynamics, admitting a much wider range of dimensions without becoming computationally intractable from the curse of dimensionality, and greater generalization capacity over the environment.

Neurodynamic programming represents a step in the direction of leveraging the cognitive paradigm in psychology to address the shortcomings of the purely behaviourist approach. It is worth noting, however, that while scientific progress has been made in understanding the hierarchical organization and processing of lower-level perceptual information, the scaffolding of that information to thought and consciousness remains, more or less, scientifically elusive. For this reason, artificial neural networks (ANNs) as yet lack the complex generalization capacity of human intelligence which tends to learn with exponentially smaller samples than ANNs. We will discuss the implications of adopting the principles of reinforcement learning toward artificial general intelligence (AGI) in the last section of the article.

## Decision Theory & Control Theory

Before delving into the mathematical elements of dynamic programming and reinforcement learning, it is important to flesh out the relationship between the philosophical and mathematical branch of decision theory and reinforcement learning. While decision theory consists primarily of mathematical formalizations of rational choice theory, they overlap with the goals of reinforcement learning insofar as reinforcement learning seeks to scaffold its models into successful artificial agents that can interact with complex environments and information landscapes.

Decision theory, also known as choice theory, was developed in the 20th century at the heel of the growing formalization of instrumental reason. Specifically, it uses probability theory to quantify the probability of agent actions given their preferences. A crowning achievement of this formalization effort was the Von-Neumann-Morgenstern utility procedure. In a nutshell, the procedure states that agents tend to choose actions that maximize utility given the utility expectations of available choices.

Control theory emerges from the fields of mechanical and electrical engineering and concerns optimizing the states and performance of dynamical systems relative to desired parameters, such as maintaining some steady-state temperature range. The essential mechanism consists of a controller that measures the desired variable and compares it to a set point, whose difference is fed as feedback for correction. The broad strokes of control theory mirror metabolic processes of living organisms, who maintain a set point of internal temperature against variable external conditions. The connection of control theory to decision theory is obvious: both rely on feedback from the environment to maintain or advance the state of the system toward some form of optimality.

Mathematically, subsets of both control and decision problems can be reduced to optimization problems solvable through dynamic programming. Dynamical programming solves general stochastic optimal control problems (afflicted by the curse of dimensionality — meaning that computational requirements grow exponentially with the number of state variables) by decomposing them into smaller sub-problems and computing the value function. As we demonstrate the rudiments of reinforcement learning, we will delve into the heart of dynamic programming: the recursive relationship between the state and value functions of the agent.

Reinforcement learning and decision theory overlap in defining a procedure for maximizing reward or utility. However, whereas utility is explicitly defined in decision theory, which aims to model economic behaviour, in reinforcement learning utility is substituted by cumulative reward. Different policies relative to different task goals can be applied toward maximizing cumulative reward, which, as we will see, depends on the inverse relationship between the polar directions of exploration and exploitation termed the exploration-exploitation dilemma.

Let’s begin by outlining the ontology underlying reinforcement models.

## States, Actions & Rewards

Reinforcement learning leverages the theoretical apparatus of decision theory to construct models comprising agents, environments, and a dynamic evolution rule. The evolution rule permits an agent to pursue rewards within its environment, also termed observation.

The agent is defined as an output from the environment to a decision. We call a particular decision an action. The mapping from the present state of the network to an action is called the policy. The policy guides actions as mappings from states to outcomes.

Formally, therefore, a policy is a function that maps a state to an action. It can be represented by the conditional probability of an action given the current state, where the Greek symbol 𝛑 stands for policy:

Transition dynamics define the next state given the input reward as a probability distribution over all possible states and reward values:

The formula above defines the probability of the next state and reward pair as equal to the conditional probability of the next state s’ and reward r given the current state s and action a.

An action changes the environment by accruing a reward. The reward, in turn, changes the agent state or observation. The reward input determines the future action outputs based on the policy.

Generally, there are two types of policies:

Deterministic: given present state/environment, there’s one and only one action the agent can take.

Stochastic: given present state/environment, there are multiple actions an agent can take.

A reward is typically formalized as a scalar value, x.

Given a particular reward, the agent is faced with an optimization dilemma: should the agent maximize short-term rewards or cumulative rewards over its complete life history?

This is known as the exploration-exploitation dilemma. In other words, the transition function should aim to optimize the trade-off between exploring the environment and exploiting the knowledge it has accumulated by reaping maximal reward.

Optimal solutions to the exploration-exploitation dilemma depend on the type of tasks we want the model to learn, which range from finite to undefined (continuously or discretely infinite). The game of chess, for example, can be formalized as an episodic task because it has a a finite configuration space and a predefined end-state of three possible outcomes: win, lose, draw. This means that optimal successor states given current states can be computed through deterministic transition dynamics, where for every state there is a single optimal action.

However, most tasks do not have a finite configuration space nor predefined end-states. We classify these as continuous tasks, and optimize them through model-free approaches. In model-free approaches, instead of computing the transition dynamics, the model samples from the environment in order to compute optimal successor states. Put differently, instead of planning its actions through foresight, it uses trial-and-error to learn about the environment.

There are generally two approaches to model-free reinforcement learning: Monte Carlo approach and Temporal-difference learning. Since averages over sufficient samples converge to expectations, model-free approaches estimate expectations through sample means. Monte Carlo methods compute value functions by estimating the expected cumulative returns of a sufficiently large sample of state-action pairs. Some Monte-Carlo methods evaluate the value function only at the end of the task for episodic tasks. For continuous tasks, the definition of an episode varies and can be set by the designer such as based on time intervals.

Unlike Monte-Carlo search, Temporal-difference learning, meanwhile, estimate the value function incrementally by leveraging differences between time-steps. Because of the incremental approach of temporal-difference methods, they exhibit a lower variance from the actual expected value than Monte-Carlo methods who rely on sampling means.

To recapitulate: the agent navigates its environment through mappings from current state and action-space pairs to state-spaces. Transition dynamics compute all possible mappings for finite configuration spaces with predefined end-states. In lieu of a predefined end-state and finite state-spaces, model-free approaches continuously sample from the environment to find the best policy.

Dynamic programming computes state-transition probabilities and expected reward from all state-action pairs. To understand how this process works, we need to understand Markov processes.

Next we’re going to learn the mathematical model that enables the agent to compute the optimal successor state. As we discussed earlier, optimality resolves into the exploration-exploitation dilemma, which varies with the type of task we’re trying to model. Looking into the structure of rewards will help us understand this better.

## Quantifying Reward

We quantify reward in reinforcement learning through a scalar value accrued to the agent from the environment upon taking an action. The value of this reward indicates the immediate goodness of the action with respect to its end-goal.

Cumulative reward, or return, on the other hand, refers to the sum of all hitherto accumulated rewards from the environment. The goal of the agent is not merely to optimize immediate reward but to optimize cumulative reward. The former represents myopic agents who maximize short-term gains, whereas the latter far-sighted agents who seek to maximize long-term gains.

Since for the most part we want agents to maximize highest rewards sooner rather than later, discounting is introduced to incentivize current maximal reward over later maximal reward.

We quantify cumulative reward G with discounting by the expression below:

Here, cumulative reward G equals to the sum of products of a reward and its discount factor gamma 𝜸, which is always a value between 0 and 1: {0,1}. Gamma is incrementally exponentiated with each time-step, which means that across infinite time-steps gamma approaches zero.

As gamma approaches 0, it incentivizes short-term gains, whereas if gamma approaches 1, it incentivizes long-term gains since across infinite iterations the reward sum will itself approach infinity.

Because most tasks are bounded in time, gamma discounting imposes upper bounds on rewards when it is below the value of 1.

The condensed equation for cumulative reward with discounting is given below, where G stands for the sum of expected rewards R, which is multiplied by the discounting factor, gamma. Cumulative reward is therefore computed as the sum of the reward and the discounting factor:

## Markov Decision Process (MDP)

So far we’ve discussed the probabilistic definition of the policy as a mapping from a state to an action, transition dynamics as the probability of moving from one state to another given reward, and the formula for how reward is calculated.

Now, we’re going to step back a little and provide some supplementary theory that defines these probabilistic transition chains. We will start with something called a Markov process. Markov processes are stochastic processes that satisfy the Markov property. A stochastic process is a process that varies randomly. The Markov property states that for every state, successor states are conditioned only by present states.

Because prior states do not bear on future states, processes that satisfy the Markov property are called memoryless. Imagine a set of fixed destinations that recur daily as you leave your home to go work before returning home again. In other words, we have a cyclical process with a beginning and an end. Now further imagine that your decision to move from one destination to the next only depends on your current destination, not your previous history of destinations. Initially, every connected destination would have an equal distribution of probabilities. For example, if upon leaving home you have the option to drive or take the metro, we’d ascribe initial probabilities to each of these possible future states as 0.5. Over iterations of all possible routes these probabilities might stabilize to some frequency distribution with some routes skewing preferentially over others. (This type of probability is called empirical probability because it averages outcomes over possible events relative to a finite number of tests) That distribution equilibrium would be the Markov chain or process.

Now you’re probably thinking: how do you define events and states? Isn’t the world infinitely complex to be talking about fixed possible states and stable probability distributions?

Quite true, but since we’re after a mathematical formalism of agents in environments, we need therefore to distinguish between the types of tasks or environments we are trying to model. To do this, we need to specify the representation of both time steps and state spaces, that is, the distributions of all possible states. The square matrix below provides a definition of Markov chains with respect to axes of state-space and time:

The state-space can be defined as countable/finite or continuous, where finite state-spaces describe all the possible configurations of the system through combinatorics, while continuous state-spaces describe all the possible configurations through a continuous function.

Finite and countably infinite spaces take integers or rational numbers as their measurable space, whereas continuous spaces take real numbers.

Likewise, the axis of time can be defined as discrete or continuous.

Discrete-time processes count phase-transitions as discontinuous but can be modeled on either countable or uncountable state-spaces, where uncountable refers to infinite decimal expansions of real numbers. This is in fact how your computer counts time — it does so in discrete steps. The interval between the steps varies across architectures, but a cycle is usually measured as the length of the time-step required to change a register state.

Continuous-time chains count phase-transitions as continuous and can be modeled on countable or uncountable state-spaces.

The term Markov process is typically reserved for continuous-time processes, whereas the term Markov chain describes a subset of those: discrete-time, stochastic control processes. In the rest of the article, we will focus on discrete-time, finite state-spaces.

So far our Markov chains are very simplistic as they describe transitions between states with fixed probabilities. We’re missing two ingredients important for modelling behaviour in our ontology: actions and rewards.

Adducing rewards to transition probabilities constitute Markov Reward Processes. Markov Reward Processes assign a reward to each transition state, (defined as a positive or negative integer) thereby nudging the system toward some desired state. Recall our cumulative reward formula as the sum of expected rewards multiplied with some discounting factor. A Markov Reward Process allows us then to calculate the value of the state v(s) as the probability of cumulative reward G (where G is averaged over a large sample of iterations) given initial state S:

The last variable we need to adduce in order to scaffold to Markov Decision Processes are actions. The agent begins with equally distributed probabilities of a set of possible actions and subsequently updates the transition function as a mapping from current state and action to the next state and reward. We’ve ended up full-circle to the transition dynamics that we described earlier:

## Dynamic Programming & Bellman Optimality

This brings us to the concept of dynamic programming, developed by Bellman (1957).

Understanding dynamic programming will help us understand approximation methods like Monte Carlo Search and Temporal Difference, which do not require complete knowledge of the environment like dynamic programming does. These model-free methods approximate the deterministic policy of dynamic programming in lieu of perfect information. As such, they provide powerful mechanisms that approximate real-world learning.

The core idea behind how dynamic programming searches and finds the optimal agent state concerns the relationship between the state-value function and the action-value function. These are recursively related.

Let’s expound on these ideas with a relatable example. Let’s say that you are at a suboptimal state in your life and want to change this. Let’s further say that you have a tangible goal or place you would like to be in the future within some realistic time horizon. In order to arrive at the grand goal (here you can substitute anything: better job, start a family etc), you will need to take a series of smaller steps or actions that will be conducive to your desired outcome. Translated in the language of reinforcement learning, your current state will be assigned a value. Given your current state and value, you will take actions. These actions will also be evaluated with respect to your overall goal and current state. A good action will receive a higher valuation than a bad action. Feedback from the environment will determine the value of the action (how these are determined varies with the task). The evaluation of the state will affect the valuation of the available actions and successor states. And the evaluation of the actions will recursively affect the value of the current state. In other words, actions and states are dynamically connected through recursion.

Now, in real life, your goal and the action-steps to get to that goal cannot be specified as a deterministic system with discrete time-steps and and a discrete state-space (though perhaps they could be approximated this way). Instead, dynamic programming assumes a specifiable environment much like the game of chess, where time-steps and action-spaces are abstracted as discrete and finite. The overlap with real life closes on the fact that a larger goal will be approached through optimization of smaller sub-goals conducive to that larger goal.

Dynamic programming therefore will assume the following values: (Ω,A,𝒫), where Ω represents the total of all possible states, A an action event as a subset of the finite sample space, and P as the probability assigned to each action event by some policy function 𝝅.

Now, if you think back to our deterministic transition dynamics, since the sets of states, actions, and rewards are finite, any particular state and reward pair will have a probability of those values occurring given some prior state and action pair. These probabilities are specified as discrete probability distributions of random variables since the state space is discrete. We said that sequences consisting of states, actions, and rewards are Markov Decision Processes (MDPs) that seek to maximize expected cumulative reward over time, where reward is represented as a scalar value.

Now the question we need to address is how does a Markov Decision Process maximize cumulative reward given the assumptions we’ve specified? The answer is provided by the Bellman Optimality Equations which relate two functions: the state-value function and the action-value function.

## State-Value Function

The state-value function can be defined as the sum of the probabilities of all possible actions an agent can take under a policy 𝝅, where, for each action, it’s value is determined by the sum of all weighted values of possible successor states.

Put more simply, the state-value function defines the expected cumulative reward an agent can obtain starting from a particular state (s) by following policy 𝝅.

The above equation contains two terms: a) the sum of the probabilities of all possible actions an agent can take in state (s) following policy 𝝅, and b) an inner sum for each possible action that computes the weighted values of all possible successor states. The term within the square brackets computes the contributions of each action’s possible states as the sum of the immediate reward R(s, a, s’) and discounted reward by gamma factor 𝛾.

Another way of expressing the state-value function is the following:

The above formula defines the value of the next state as the expected return E𝝅 computed as the conditional probability of getting reward R at time t given state s at time t. The reward R is calculated as the sum of products of expected returns in successor states and gamma discounting.

To help understand this better, imagine an agent in a 3 x 3 grid-world that has four possible actions — up, down, right, left — available at each time-step.

We initialize the state-values to 0, and use the Bellman equation for the state-value function to optimize the state-values given the distribution of rewards in the grid. We use (row, col) indexing to identify each position the grid.

Assuming that the policy is equally distributed across each action, and with a discounting factor of 0.9, the state-value function for the initial state (1,1), would be computed in the following way:

The constants in each inner sum represent the rewards which are distributed in the grid according to some outcome we want the agent to achieve. The inner sums represent the immediate reward plus the product of the discounting factor and the cumulative value of the next state. The ratios in the outer sum represent the distribution of total probability given the number of actions. Since there are four possible actions, we can weight the inner sums initially by equally distributed probabilities summing into total probability. The state-value would then be computed for each possible state in the state-space and iterated until the sums converge to stable values.

## Action-Value Function

As we saw the action-value function is embedded within the state-value function as its second term. This means that the action-value function computes the values of all the possible actions in state (s) as the sum of the immediate reward obtained from the transition from (s) to (s’) and the expected cumulative reward of the next state (s’) given the action, given by the formula below:

In other words, the action value function computes the cumulative reward of taking action a in state (s), where the expected return is the sum of the immediate state transition — denoted by R(s, a,s’) — and the discounted value of the cumulative reward of the next state s’— denoted by 𝛾∑𝝅(a’|s’)Q(s’,a’).

Another notation for formulating the action-value function is in terms of the expected return E given state and action pair (s, a) when following the optimal policy 𝝅:

The state-value functions and the action-value functions are related in the sense that the state-value function can be given by the policy and the action-value function Q(s,a).

Therefore, each function contains itself as a parameter, albeit computing the successor transition state, as evinced by the formulas above. The formula for V(s) contains V(s’) and the formula for Q(s, a) contains Q(s’,a).

Put differently, they contain each other within their parameters: the value of the state V(s) depends on the value of successor states computed through Q(s,a) and the value of the action Q(s,a) depends on the value of the successor state computed through V(s’).

As such, the action-value function and the state-value function are recursively related: the value of the action-state pairs determine the value of the state, which conversely determines the value of the action.

The state-value function takes as its prior the state, and yields an expected value E. The action value function takes as its prior state and action pairs, to compute the reward, the expected cumulative return E.

The Bellman Optimality Equations therefore express the recursive iteration of the state-value and action-value functions until they converge on optimal values. The Bellman Equation for the state-value function is expressed below:

where the value of the current state is defined as the maximum reward of any possible action computed as the reward for taking action a at state (s) and the product of the value of the next action s’ and its discount factor gamma.

The Bellman equation averages each all possible actions from the current state and weights them according to their probability of occurring.

## Model Free Methods: Monte Carlo & Temporal Difference

The above example describes a deterministic model where the transition dynamics are known and can thus be perfectly computed. This is because we have complete knowledge of the environment.

However, for most tasks we don’t have complete knowledge of the environment. In lieu of this information, we cannot proceed with deterministic transition dynamics precisely because we cannot solve the dynamic programming equations. To overcome this problem, we can use techniques that borrow from statistics by inferring the state of the environment from a sample.

In Monte Carlo methods, we approximate expected returns with the average of sample returns. As the sample approaches infinity, the average returns converge to the true value of expected returns. We do this by letting the agent run through an entire episode until termination before computing the value function. We then take N number of episode samples and use the mean to approximate the expected value of the target state. Now, as you might be already wondering, how an episode is defined varies with the task and purpose of the model. For example, in a game of chess we can define an episode as a run through an entire game or an arbitrary series of steps.

We can write the MC update rule as the following:

Where V(s) n+1 denotes the value of the next episode, S(s)n denotes the cumulative value of the state and G the value of the reward. We add the cumulative reward G to the state value and divide by the number of episodes or samples.

We can algebraically rearrange the MC update rule to:

Unlike Monte Carlo methods, where we evaluate the value function only after each episode, with Temporal Difference (TD) we evaluate the state value function after each time-step or increment. Since we start with no information about the environment, we have to initialize the values of V(s) to 0 or some other values, which will subsequently be updated with every time step.

We compute the value of the state in TD with two steps. First we compute the error of the step and next we use an update rule to change the value of the state. The error is given by the following difference formula:

Where, 𝜹t stands for the error, R(t+1) the reward from the action, V(S t+1) the estimated value of the next state, and V(S) the value of the current state. The fact that TD uses the estimated value of the next state to evaluate the current state is called bootsrapping. In effect, we subtract the value of the current state from the sum of the reward of the action and the product of the discounting factor and the value of the next state. This enables an immediate update of the value of the state with every time-step.

By adding the observed discrepancy between the expected and observed reward 𝜹 times 𝛼 (the learning rate) we close the discrepancy between observation and expectation:

The role of 𝛼 determines the degree to which the TD algorithm learns, where 𝛼 is a real positive number. Typically, 𝛼 is set to values like [0.1, 0.01, 0.001]. A higher value 𝛼 ensures that the updates are more aggressive, whereas a lower value ensures more conservative updates. The value of 𝛼 affects the exploration-exploitation trade-off, where higher 𝛼 leans on exploration and lower 𝛼 leans on exploitation.

While both MC and TD methods proceed blindly without any prior knowledge of the environment, the merit of Temporal Difference is that it computes online updates at every time-step and the merit of Monte Carlo methods is unbiased estimation due to relying on sampling alone to estimate the value. A drawback of TD methods includes high bias, whereas a drawback of MC methods include overlooking important updates, and thereby higher variance. This suggests that the optimum between the two learning strategies must exist somewhere in between.

The TD approach can be optimized by changing the single-step evaluation strategy to n-steps. As we will see, doing this enables compromising between TD and MC. When we evaluate the state value every n-steps, we do so by estimating n-steps into the future instead of after every step.

A modified approach to n-step TD is TD(𝝀). TD(𝝀) methods use a parameter called eligibility traces to credit state-action pairs that occurred in the past. Instead of estimating n-steps into the future, eligibility traces assign credit to state-action pairs over multiple TD steps. Eligibility traces enable past state-action pairs to receive credit for contributing to observed-reward transitions. Eligibility traces are represented as vectors or matrices associated with each state-action pair. The eligibility trace for a time step is computed recursively as follows:

Where the lambda 𝝀 parameter controls the degree of bootstrapping. When 𝝀 =1, bootstrapping is eliminated and the update rule reduces to Monte Carlo. When 𝝀 = 0, it reduces to a TD time-step with bootstrapping termed TD(0). TD(𝝀) generalizes TD and MC as a continuum where TD(0) denotes single step TD and TD(1) denotes the limit of extending TD to ∞ steps, which reduces to MC. As you can see from the formula, the eligibility trace parameter is computed recursively, wherein the value of the eligibility trace for the next time step takes as input the eligibility trace from the previous step. When E(s) = 0, bootstrapping is eliminated. The TD(𝝀) update rule is computed the same as the TD and MC update rule except by multiplying the eligibility trace to the error as shown below:

## Augmenting Reinforcement Learning with ANNs

Whether model-based or model-free, RL algorithms encounter scaling problems because of the curse of dimensionality, have trouble generalizing across different types of environments, and suffer from sample inefficiency.

Artificial Neural Networks (ANNs) provide a powerful method of rectifying some of the limits inherent within the RL architecture. In particular, ANNs improve sampling efficiency, environment generalization, and scaling problems caused by the curse of dimensionality. They reduce sample inefficiency through superior generalization capacity by virtue of the fact that they learn a general function from the data. This also enables them to scale better since the number of hidden layers and neurons per hidden layer can be increased. Too many hidden layers and neurons, however, can also lead to computational scaling problems (the curse of dimensionality is inescapable beyond certain ranges). They are further beset by the problem of the non-stationarity of target states, since traditionally ANNs require the ground truth (in the case of RL this amounts to expected return) to be set in advance, while RL algorithms find the optimal state through an update function, whether on-policy or off-policy.

Unlike traditional RL algorithms which rely on probabilistic transition rules, the application of ANNs to reinforcement learning uses function approximation to compute the state and state-action values. While any number of function approximation methods can be applied such as linear approximation and tile-coding, artificial neural networks constitute the most powerful technique due to their generalization power that leverages nonlinear function approximation.

Let’s look at two approaches that involve applying artificial neural networks to reinforcement learning: Deep Q Learning (DQN) and Deep Temporal Difference with eligibility traces (TD(𝝀)). Since we don’t know the target values in advance, MC or TD are used to create an estimate of the target state: the expected return. This is then used as the target value to be approximated by the function (really, the gradient which is the partial derivative of the error of the entire network with respect to the network parameter 𝜃). ANNs approximate the target value by computing the error between the target estimate and the output and then computing the error through backpropagation and reducing it through an optimization algorithm. The most common optimization algorithm is a variation of gradient descent such as stochastic gradient descent.

## Off-Policy DQN

Q-Learning is an off-policy version of SARSA (States, Actions, Rewards, States’, Actions’), where the next state-action pair Q(s’, a’) is estimated by selecting the maximum estimated value of the next state. In other words, Q-Learning selects the maximum value of Q(s’,a’) across actions available in the next state s’. This means that it doesn’t use the policy 𝛑 to learn Q(s’,a’). SARSA, on the other hand, is an on-policy method that selects an action from the previous action taken and an estimate of the next state-action pair, Q(s’,a’). This means that it uses the policy 𝛑, namely the probability of an action given the state, to learn the Q-function.

In Deep Q-Learning, the action-value function Q(a, s) is represented via Q(a,s, 𝜃 ) where 𝜃 represent the neural network parameters. Theta 𝜃 parameters are equivalent to weights w in neural networks, which are associated with connections between neurons. The weights determine the strength of the connections and are retroactively adjusted through backpropagation in order to minimize the error. DQN takes as input a high-dimensional representation of the environment and outputs a vector of action-values for each possible action. The expected return is typically approximated through an MC or TD approach. Backpropagation with an optimization function are then used to compute policy gradient and reduce the error by adjusting the policy network parameters 𝜃.

Because ANNs are highly sensitive to new information, it can cause catastrophic forgetting, where new information can overwrite previously written information. A method to manage catastrophic forgetting is to employ experience reply, a technique that stores past experiences and reuses them to train the network.

## On-Policy Deep TD(𝝀)

ANNs can also be applied to TD(λ) methods, where the state observation is fed as input into an ANN, which then approximates the action-value function as output. Due to the on-policy nature of TD(λ), Deep TD(λ) approaches are best suited for tasks that require long-term dependencies between states.

Training online learning methods like TD(λ) can be challenging because the distribution of the environment changes with every or n steps due to bootstrapping. This is called nonstationarity and impedes the convergence of the ANN parameters 𝜃 toward optimality. The interdependence of succeeding states in online learning can cause catastrophic forgetting, where the update interferes with past learning. Furthermore, the combination of eligibility traces which assign credit to past actions and ANNs can create additional complications in the backpropagation step.

A way to mitigate these challenges involves utilizing a technique called experience replay. Experience replay stores agent learned episodes as vectors of [s, a, r, s’] in a memory buffer. During training, the network samples from its memory buffer of stored learned vectors to update the network parameters. This provides the network with greater stability and makes it less prone to catastrophic interference from high-variance new experiences that result in a larger error or temporal difference between steps.

Deep TD(λ) algorithms have shown to excel in continuous control tasks where the state-space is continuous and the target unknown or unclear. These include continuous control tasks in robotics, autonomous cars, and financial markets.

## Reinforcement Learning and Artificial General Intelligence

What are the implications of reinforcement learning for artificial general intelligence?

Notwithstanding the fact that “intelligence” is an ill-formed variable since it meshes disparate competencies into a single notion, what’s termed “general intelligence” sits on top of evolved competencies of living organisms, which require the transduction of worldly information for survival and reproduction. Intelligence, even in the human context, cannot be extricated from the contours of organismic viability. This isn’t, however, the orthodoxy. The general wisdom argues that intelligence is more akin to a program or software that computes inferences on the basis of available information.

The latter conception consists of two models, which are mistakenly thought of as competing. One model describes intelligence as following procedures, whereas the other describes intelligence as generalizing from data for optimal prediction. The former is generally much better understood, whereas the latter amounts to a cluster of techniques that reliably improve the strength of predictions. Animal intelligence is largely based on the latter model.

The most successful paradigm of the second model is deep learning through artificial neural networks. The chief advantage of ANN architectures is that they enable generalization from data without prior information or concepts, although this is not to be confused with unsupervised learning. ANNs first build a model through training and then make predictions on the basis of that model on new data. It is thought, therefore, that the brain does something similar (after factoring pre-training from evolution). However, there are currently two weaknesses within ANNs. The first weakness is that the goal or outcome has to be set by the human designer. An ANN cannot of its own accord conceive of goals. It cannot, a fortiriori, tell the difference between truth and falsity of its own accord. The human designer must supply the true outcome in order for the model to learn to approximate that outcome. The second weakness is that an ANN, without reinforcement learning, cannot search an environment to optimize its own state. For this reason, the combination of the generalization and predictive power of ANNs with the decision optimization power of reinforcement learning makes for a formidable amalgamation.

It is on this basis that some have argued that reinforcement learning represents the clearest path toward artificial general intelligence (Sutton, 2014). The intuition behind this is clear: reinforcement learning comes closest to modelling living systems, which when enhanced with other successful architectures like transformers may lead to a model of AI that replicates (and exceeds!) all human capabilities.

However, if humans are the basis of general intelligence, then the conception of general intelligence cannot be one that divorces intelligence from survival constraints and some form of embodiment. On the other hand, if general intelligence can be defined without reference to living organisms, then it isn’t clear what it would look like — purely abstract models escape satisfactory formalization despite attempts like Marcus Hutter’s AIXI. In the abstract, it can be conceived of some perfectly rational agent that solves problems by virtue of reasoning and computational power alone. The cleavage between information and embodiment is a gambit for a much wider discussion that is beyond the scope of this article. If interested, this paper provides a good starting point.

However, there are good reasons to doubt that reinforcement learning suffices for artificial general intelligence. Some reasons for this include the very definition of general intelligence. Most current AI researchers still rely on a behaviourist conception of intelligence without factoring explicit internal representations as necessary ingredients. And they have good reason to think so. Symbolic AI, in which hopes of general AI were pinned before the success of deep learning, proved to be a failure. Symbolic AI refers to approaches to artificial intelligence based primarily on explicitly coded logical rules and knowledge stores for optimal inference generation.

The tension between symbolic AI and neural networks may, however, be unfounded. Many researchers believe that the quest for artificial general intelligence lies in combining these approaches in the right way. Reasons for thinking that neural nets approximate the native ontology of the brain include the fact that mathematical logic is not quite how the brain reasons: that is, it doesn’t compute necessary and sufficient conditions, or crisp membership, as much as graded membership, which is approximated by the likes of fuzzy logic and at which ANNs excel.

Neural networks consist of a black-box hierarchical architecture of hidden layers parametrized to achieve the desired output through highly calibrated dynamic learning rates, activation functions, connection weights, and optimization algorithms calibrated to minimize error. Beyond highly calibrated hyperparameters like the above, the human designer does not understand how information is processed in the hidden layers. The assumption is that the same is the case with the brain, where information is not stored as combinations of discrete representational units (whether analog or imagistic) but as a vast, distributed architecture of billions of neurons. What we think of as linguistically structured thoughts are not internally represented in the brain that way at all: there’s no specific combination of neurons that stand for the word being or the sentence “Existence as determinate being is in essence being for another” for example.

Linguistic competence is instead embedded in a vast network of semantic connections and reproduction rules reinforced through experience and augmented by imagistic and analog representations. In other words, language and thought as we represent them reflectively, but also behaviourally in writing and speech, do not have brain analogues that mirror their explicit structure (in other words, isomorphic mapping between grammar and the native ontology of the brain), but are instead embedded in distributed networks of neural assemblies characterized by degrees of connectivity and connection strengths.

On the other hand, it seems that neural nets seem unable to instantiate the structured thought-processes that some argue are the seat of reason and human intelligence. After all, explicit reasoning constitutes the chief means of human intellectual achievement, and this does not appear to be something that current neural nets are able to replicate. A salient illustration comes from Gödel’s Incompleteness Theorems, where a formal system alone cannot establish the truth of certain statements on the basis of proof alone. (If interested, check out this article that I wrote that explains Godel’s proof). Meanwhile, the human subject can verify the truth of such a statement despite failure of axiomatic deduction. Foregoing the complicated and contested implications of this uncoupling of truth and proof for computation, it is additionally worth noting that the human agent actively pursues theories of the world, whereas current RL algorithms do so in a very rudimentary sense, though robotics will likely eventually advance toward similar capabilities. Meanwhile the linguistic state of the art, LLMs, regurgitate linguistically indistinguishable analogues to human speech and writing when prompted while exhibiting exponentially faster recall speeds and stores of information orders of magnitude larger. Much hangs in the balance of understanding this distinction: humans actively pursue theories of the world as well as other creative pursuits as part of their cultural programming, which co-opts mechanisms tailored toward survival and reproductive success. In other words, all human activity occurs within the basin of evolutionary constraints. As such, humans and all living organisms, constitute autonomous systems that replicate and endogenously reproduce their own identity conditions. Human and animal intelligence are therefore inextricable from the boundary conditions of survival, barring any measure of cultural independence from strict adaptationism (a big topic which engenders wide disagreement).

Current AI does not approximate autonomous systems that endogenously propel themselves in the world. Nor do they generate their own environmental milieu and reconfigure their own search spaces in the way humans and other animals do. The absence of this constraint currently allows the human designer to set AI’s informational salience, e.g. text-generation, environmental detection etc. Even if the architecture evolves into a bona fide general problem-solving machine, unless it becomes capable of reflective awareness it cannot be said to possess general intelligence. Definitions of general intelligence canonically omit the variable of global awareness — the equivalent to what the ancient Greeks termed nous — as the hallmark of human intelligence. They do so because reflective and global awareness remain recalcitrant to reverse engineering and analysis into parts. For this reason, reflective awareness is dismissed as an ingredient of intelligence. However, admitting recalcitrance to current scientific explanation does not by the same token imply rejecting physicalism or an an endorsement of non-naturalism. Rather, it signals admission of lack of understanding. Given this gap in understanding, I hypothesize that reflective awareness is an extension of sentience which is a fundamental property of living organisms. In asserting this, I do not imply that autonomous systems cannot be engineered through means other than natural selection, though I leave open the possibility that they may remain opaque to scientific analysis in the foreseeable future. If reinforcement learning hopes to amount to general intelligence, the agent should posses as a prior a powerful architecture that not only hosts complex representations of the world, but maintains a global view from the inside of those very representations. This means that while model-world interactivity is indispensable to the task, the native architecture will require a complex hierarchical internal structure with capacities for multi-modal information processing and integration.

## Selected References

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

Neftci, E. O., & Averbeck, B. B. (2019, March 4). Reinforcement learning in artificial and Biological Systems. Nature News. https://www.nature.com/articles/s42256-019-0025-4

Sharma, S. (2024, March 7). Learning to Mix 𝑛-Step Returns: Generalizing 𝜆-Returns for Deep Reinforcement Learning. Ar5iv. https://ar5iv.labs.arxiv.org/html/1705.07445

Sanghi, Nimish. Deep Reinforcement Learning with Python: With PYTORCH, Tensorflow and Openai Gym. Apress, 2021.

Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 299, 103535. https://doi.org/10.1016/j.artint.2021.103535

Spens, E., & Burgess, N. (2024, January 19). A generative model of memory construction and consolidation. Nature News. https://www.nature.com/articles/s41562-023-01799-z

Sutton, Richard S. Introduction to Reinforcement Learning. MIT Press.

Tyng, C. M., Amin, H. U., Saad, M. N. M., & Malik, A. S. (2017, August 24). The influences of emotion on learning and memory. Frontiers in psychology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5573739/

White, A., Modayil, J., & Sutton, R. (2014). Surprise and Curiosity for Big Data Robotics. Association for the Advancement of Artificial Intelligence, 19–22.

Source link