## 1. Introduction

The best way to understand something is to try and explain it. And if you keep getting better every time you try to explain it, well, that’s roughly the gist of what *Reinforcement Learning (RL)* is about. Given how different RL is from Supervised or Unsupervised Learning, I figured that the best strategy is to go slow, and going slow means starting with the *Markov* assumption, introducing the concept of a scalar *reward* and building up to *Markov Reward Processes* (MRPs). Then, once our *agent* can start making decisions, we have ourselves a *Markov Decision Process* (MDP). (An *agent* is something that interacts with the environment by executing actions, taking observations, and receiving eventual rewards.) An MDP can be described as the problem to be solved via RL, i.e. *Reinforcement Learning* is a framework for solving problems that can be expressed as Markov Decision Processes. This theoretical flow is of course not very original, and most RL lectures or textbooks begin the same way. We can use *Dynamic Programming (DP)* to solve MDPs when the model is fully known; I will dedicate the next post entirely to Dynamic Programming solutions for MDPs. Model-free solutions to MDPs, which form the heart of most modern RL applications, will be discussed in a future post. This article is about getting ourselves familiar with the basics. Throughout, I will be referring (with and without explicit mention) to a few excellent resources: *Richard Sutton’s* absolute Bible of the field, *David Silver’s* quite remarkable lecture series, *Max Lapan’s* eminently readable practical introduction to Reinforcement Learning, and an extremely well-written article from *Analytics Vidhya*. Any other source will hopefully be cited. There will be more math than code in this particular article (because too often in *Machine Learning*, the underlying mathematical framework is all but ignored), but worry not, that will be rectified in subsequent posts.
Let’s get on with it then, shall we?

## 2. Reinforcement Learning

Literally everyone in the world has now heard of *Machine Learning*, and by extension, *Supervised Learning*. *Reinforcement Learning* uses some established *Supervised Learning* machinery, such as neural networks, to learn data representations, but the way *RL* handles a learning situation is all its own. Nor does *RL* look for hidden structure in unlabeled data, as is the case in *Unsupervised Learning*. *Reinforcement Learning* is how your algorithm learns what to do in a given situation, over many trials, to maximize a scalar reward signal. One always has to consider the exploration–exploitation trade-off when approaching an *RL* problem. As *Sutton* puts it:

“*To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.*”

This is not something that one typically has to consider in a *Machine Learning* problem. *RL* consists of a goal-oriented *agent* interacting with an uncertain *environment*; unlike other typical *ML* problems, which might look at a series of isolated subproblems, *Reinforcement Learning* looks at the *whole* problem. An *agent* always has an explicit goal, receives delayed feedback from its *environment*, and can influence the *environment* with its decisions. The diagram below (courtesy *Lapan*) shows a standard *RL* formulation. The *actions* of the *agent* influence the *environment*, and the *agent* receives feedback from the *environment* in the form of *observations* and a scalar *reward*.

Let us now, following *Sutton*, formally define the major elements of an *RL* system that we will encounter over and over in *Reinforcement Learning* problems.

*Policy:* A policy defines how a learning agent will behave at a given time. It can be thought of as a mapping from perceived states of the environment to actions to be taken when in those states. The policy alone is sufficient to determine agent behavior, and it is stochastic in general.

*Reward Signal:* This defines the goal in an RL problem. At each time step, the agent receives a single number (scalar) called the reward from the environment. The agent’s only objective is to maximize the cumulative reward that it receives. The *Reward Hypothesis* states that “*All goals can be described by the maximization of expected cumulative reward*”. Rewards are the immediate and defining features of the problem faced by the agent. The reward signal is the primary basis for altering the policy; a policy may be altered if an action selected by it gives a low reward. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.

*Value Function:* A value function specifies what is good in the long run, as opposed to the immediacy of a reward. A state might always yield a low (immediate) reward but still have a high value because it is regularly followed by other states that yield high rewards. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. I know I haven’t formally defined a *state* yet, but that will happen naturally when we define *Markov Processes* in the next section. Till then, you can intuitively think of a state as a description of the current situation in which the agent finds itself. We want our agent to undertake actions that bring about the states of highest value, not the highest reward, because these actions obtain the greatest amount of cumulative reward over many time-steps.

*Model:* A model of the environment is something that allows inferences to be made about how the environment will behave. Given a state and an action, the model might predict the resultant next state and next reward. As mentioned briefly in the introduction, methods that use models and planning to solve reinforcement learning problems are called model-based methods, as opposed to simpler model-free methods, which are explicitly trial-and-error learners and can be viewed as almost the opposite of planning.

I don’t want to make this discussion too verbose, so any other concepts we require will come up as and when we need them for a practical problem. Let us move on to the mathematical basis of *Reinforcement Learning*.

## 3. Markov Processes

Formally, a stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A *Markov process* is a process which possesses this property. Another way of framing it: the *“future is independent of the past, given the present”*.

Thus, the state captures all the relevant information from its history and is enough, on its own, to describe the future. A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … (which taken together comprise the *state space*) with the Markov property. The image below, from *David Silver’s* lecture notes, formalizes the definition:

The state transition probability matrix defines the probabilities of transition from all states *s* to all successor states *s’*. Thus, to completely define a *Markov* process, all you need is a set of states that your system can be in, as well as a transition matrix. Considering this a review article of sorts, I don’t mind shamelessly pilfering examples from other resources, such as the one below from *Lapan*, which is a very simple example of a *Markov chain or process*. We consider a system where we try to predict the probability that a rainy day leads to a sunny day, or vice versa. An example transition matrix is shown below:
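As a code sketch, such a transition matrix and a sampler for the chain might look like this (the sunny → rainy probability of 0.2 follows the example; the remaining entries are illustrative assumptions):

```python
import random

# Transition matrix for the two-state weather chain. The
# sunny -> rainy probability (0.2) follows the example; the
# rainy-row probabilities are illustrative assumptions.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.1, "rainy": 0.9},
}

def sample_chain(start, n_steps, seed=0):
    """Sample a trajectory; the next state depends only on the current one."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(n_steps):
        state = rng.choices(list(P[state]), weights=list(P[state].values()))[0]
        path.append(state)
    return path

print(sample_chain("sunny", 7))
```

Note that the sampler never looks at the path so far, only at the current state: that is the Markov property in action.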

So, the probability that the next day will be rainy given that today is sunny is 0.2, and so on. Remember the *Markov* property, which states that we only need to know the state of the system today in order to predict the next state (tomorrow); the past history can be safely discarded. We talked about a reward signal in the previous section, and you may have intuited that if we introduce the concept of a *reward* to a *Markov Process*, then we arrive at…

## 4. Markov Reward Processes

In a *Markov Process*, we observe a series of state transitions, but now we want to associate a scalar reward with each of these transitions. In simple terms, we want to figure out whether a transition is good or bad. So, all our observations have a reward attached to every transition of the system. One way of representing this is to introduce another matrix, with rewards (instead of transition probabilities, like before) assigned to each state transition. The formal definition of a *Markov Reward Process (MRP)* follows, as usual, from a *David Silver* lecture slide:

So, we have introduced a *reward function* and a *discount factor (gamma)* into the existing formalism. These concepts can be combined to define the *return*, which is:
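In Sutton and Silver’s notation, with discount factor $\gamma \in [0, 1]$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$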

Thus, for every time-step *t*, the *return* is defined as the sum of subsequent rewards, multiplied by increasing powers of the *discount factor*. So the *discount factor* gives us the present value of future rewards. A value of 0 for the *discount* implies that the *return* just equals the immediate reward (‘near-sightedness’), while a *discount factor* of 1 means that the *return* takes into account all future *rewards* (‘far-sightedness’). So the closer the *gamma factor* is to 1, the more future *rewards* we want to take into account. How you set the *discount factor* depends on the particular situation; for instance, in finance applications, immediate rewards may matter more than long-term rewards. It is possible to use undiscounted *MRPs* in finite horizon environments where there are a limited number of steps that can be taken (*Lapan* uses the example of Tic-Tac-Toe, which can have at most 9 steps).
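To make the computation concrete, here is a tiny sketch of the return for one observed chain of rewards (the reward sequence is an illustrative assumption):

```python
def discounted_return(rewards, gamma):
    """Return G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one observed chain."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]                 # an illustrative reward sequence
print(discounted_return(rewards, 0.0))    # near-sighted: only the immediate reward
print(discounted_return(rewards, 1.0))    # far-sighted: the plain sum of all rewards
print(discounted_return(rewards, 0.9))    # in between: 1 + 0.9*2 + 0.81*3 = 5.23
```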

*Return* is not very useful in practice, as it is defined for every specific chain we observe from our *MRP*. Thus, it can vary widely even for the same state. However, if we calculate the mathematical expectation of the return for any state (by averaging the returns of a large number of sampled chains), we get a more useful quantity, called the *value function v(s)*, which gives the long-term value of the state *s* and is defined as:
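In Silver’s notation:

$$v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]$$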

So, for every state *s*, the value *v(s)* is the average return we get by following the *Markov reward process*. The *Bellman equation* for *MRPs* breaks down the value function into two parts, as shown below (excuse the handwriting!):
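In symbols, the two parts are the immediate reward and the discounted value of the successor state:

$$v(s) = \mathbb{E}\left[ R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \right]$$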

This *Bellman* equation can be expressed using matrices:
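In matrix form, the equation reads $v = \mathcal{R} + \gamma P v$, i.e. $(I - \gamma P)v = \mathcal{R}$. A minimal NumPy sketch, with all numbers (transition matrix, expected rewards, discount) assumed for illustration:

```python
import numpy as np

# An illustrative two-state MRP: the transition matrix P, the expected
# immediate rewards R and the discount factor gamma are assumed numbers.
P = np.array([[0.8, 0.2],
              [0.1, 0.9]])
R = np.array([0.6, -0.8])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v.
# Rearranged as (I - gamma * P) v = R, it is a plain linear system,
# solvable directly in O(n^3) for n states.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

Solving the system this way is exact, but the cubic cost is why larger MRPs call for iterative techniques instead.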

The above is a system of linear equations and can be solved directly, though with considerable computational complexity: O(n³) for n states. For larger *MRPs*, the solution has to be iterative, using techniques such as *Dynamic Programming* and *Temporal-Difference Learning*. It’s about time for us to introduce *actions*, or *decisions*, into our *Markov Reward Processes*, which leads us into…

## 5. Markov Decision Processes

Fairly intuitively, a *Markov Decision Process* is a *Markov Reward Process* with *decisions*. An *MDP* is an environment in which all states are *Markov*. *MDPs* are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the *agent*. The thing it interacts with, comprising everything outside the *agent*, is called the *environment*. These interact continually, the *agent* selecting *actions* and the *environment* responding to these *actions* and presenting new situations to the *agent*. The *environment* also gives rise to *rewards*, special numerical values that the *agent* seeks to maximize over time through its choice of *actions*. In other words, *MDPs* frame *Reinforcement Learning* problems. The *agent-environment* interaction in a *Markov Decision Process* is shown below:

A formal definition of *MDPs* follows from *David Silver’s* lecture slides:

We see that the only thing different from the *MRP* definition in the section above is the introduction of *actions*. As we have seen, a policy fully determines how an agent will behave at a given time; that is, a policy governs the decision-making capacity of an *agent*. A fixed *policy* means no decision making, and the *MDP* then reduces to an *MRP*. Formally, a policy is defined as a probability distribution over actions for every possible state:
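In Silver’s notation:

$$\pi(a \mid s) = \mathbb{P}\left[ A_t = a \mid S_t = s \right]$$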

*Policies* are stationary (time-independent): in keeping with the *Markov* assumption, *MDP* policies depend only on the current state and not on the history. We can define two types of value functions, the *state value function* and the *state-action value (or action-value) function*, as follows:
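In symbols:

$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$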

Similar to the calculation for *MRPs*, the *Bellman Expectation Equation* breaks the two value functions down into the sum of the immediate reward and the discounted value of the successor state:
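Written out in Silver’s notation:

$$v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$

$$q_\pi(s, a) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]$$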

The *optimal value function* is the best possible performance in an *MDP*, and an *MDP* is said to be solved when the *optimal value function* is known.

For any *Markov Decision Process*, there exists an optimal policy that is better than or equal to all other policies. In addition, there is a theorem which states that all optimal policies achieve the optimal value function, as well as the optimal action-value function.

There is always a deterministic optimal policy for any *MDP*, and knowing the optimal *action-value function* means that we immediately have the optimal policy. **The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.** These equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions, from which an optimal policy can be determined with relative ease.

The *Bellman Optimality Equation* is non-linear and in general does not possess a closed-form solution. It is instead solved by iterative methods such as *Policy Iteration*, *Value Iteration*, *Q-Learning* and *Sarsa*. *Policy Iteration* and *Value Iteration* will be discussed in the next post, which will deal with solving *MDPs* using *Dynamic Programming* techniques; *Q-Learning* and *Sarsa* will be elaborated upon in a future post.
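As a small preview of those iterative methods, here is a minimal *Value Iteration* sketch on a tiny, made-up MDP (all states, actions, transition probabilities and rewards are illustrative assumptions):

```python
# mdp[s][a] = list of (probability, next_state, reward) triples.
# Everything here is an assumed toy example, not from the article.
mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9

def value_iteration(mdp, gamma, tol=1e-8):
    """Apply the Bellman optimality backup repeatedly until values converge."""
    v = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                       for outcomes in actions.values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

def greedy_policy(mdp, gamma, v):
    """Acting greedily w.r.t. the optimal values yields an optimal policy."""
    return {s: max(actions,
                   key=lambda a: sum(p * (r + gamma * v[s2])
                                     for p, s2, r in actions[a]))
            for s, actions in mdp.items()}

v_star = value_iteration(mdp, gamma)
print(v_star)
print(greedy_policy(mdp, gamma, v_star))
```

Here the optimal policy turns out to be deterministic, as the theory above promises: one best action per state, read off directly from the optimal action-values.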

Thus we come to the end of the first post on *Reinforcement Learning* basics. The subsequent posts will contain working code examples, but I felt it important to motivate the mathematical background in this particular article. We started off with the *Markov* assumption, went through the *Bellman equation* for *Markov Reward Processes*, and ended up formulating optimal policies for *Markov Decision Processes*. The resources mentioned in the introduction are highly recommended if you want a deeper understanding of *MDPs*, beyond the basic concepts I have included here. I am also quite new to *Reinforcement Learning*, so please let me know if there are any conceptual errors, or if anything could be written better. Till the next time then! Ciao!
