An Introduction to Reinforcement Learning – I :: Markov Decision Processes

1. Introduction

The best way to understand something is to try and explain it. And if you keep getting better every time you try to explain it, well, that’s roughly the gist of what Reinforcement Learning (RL) is about.  Given how different RL is from Supervised or Unsupervised Learning, I figured that the best strategy is to go slow, and to go slow is to start with the Markov assumption, introduce the concept of a scalar reward and build into Markov Reward Processes (MRPs). Then, once our agent can start making decisions we have ourselves a Markov Decision Process (MDP) (An agent is something which interacts with the environment by executing certain actions, taking observations, and receiving eventual rewards for this). A MDP can be described as the problem to be resolved via RL, i.e. Reinforcement Learning is a framework for solving problems that can be expressed as Markov Decision Processes. This theoretical flow is of course not very original, and most RL lectures or text books begin as such. We can use Dynamic Programming (DP) to solve MDPs when the model is fully known. I will dedicate the next post entirely to Dynamic Programming solutions for MDPs. We will discuss model-free solutions to MDPs, which form the heart of most modern RL applications, in a future post. This article is about getting ourselves familiar with the basics. Throughout, I will be referring (with and without explicit mention) to a few excellent resources, such as Richard Sutton’s absolute Bible of the field, David Silver’s quite remarkable lecture seriesMax Lapan’s eminently readable practical introduction to Reinforcement Learning and an extremely well-written article from Analytics VidhyaAny other source will hopefully be cited. There will be more math than code in this particular article (because too often in Machine Learning, the underlying mathematical framework is all but ignored) but worry not, that will be rectified in subsequent posts. Let’s get on with it then shall we?

2. Reinforcement Learning

Literally everyone in the world has now heard of Machine Learning, and by extension, Supervised Learning. Reinforcement Learning uses some established Supervised Learning algorithms such as neural networks to learn data representation, but the way RL handles a learning situation is all its own. RL also does not look for hidden structure in unlabeled data, as is the case in Unsupervised Learning. Reinforcement Learning is how your algorithm learns what to do in a given situation, over many trials, to maximize a scalar reward signal. One always has to consider the exploration-exploitation trade-off while considering a RL problem. As Sutton puts it:

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.

This is not something that one has to typically consider in a Machine Learning problem. RL consists of a goal-oriented agent interacting with an uncertain environment, and, unlike other typical ML problems, which might look at a series of isolated subproblems, Reinforcement Learning looks at the whole problem. An agent always has an explicit goal, receives delayed feedback from its environment, and can influence the environment with its decisions. The diagram below (courtesy Lapan) shows a standard RL formulation. The actions of the agent influence the environment, and the agent receives feedback from the environment. This feedback is in the form of observations and a scalar reward.

Screen Shot 2019-04-19 at 8.36.55 PM


Let us formally, following Sutton, define the major elements of a RL system that we will encounter over and over in Reinforcement Learning problems.

Policy: A policy defines how a learning agent will behave at a given time. A policy can be thought of as a mapping from perceived states of the environment to actions to be taken when in those states. The policy alone is sufficient to determine agent behavior and is stochastic in general.

Reward Signal: This defines the goal in a RL problem. At each time step, the agent receives a single number (scalar) called the reward from the environment. The agent’s only objective is to maximize the cumulative reward that it receives. The Reward Hypothesis states that “All goals can be described by the maximization of expected cumulative reward“. Rewards are the immediate and defining features of the problem faced by the agent. The reward signal is the primary basis for altering the policy; a policy may be altered if an action selected by it gives a low reward. In general, reward signals may be stochastic functions of the state of the environment and the actions taken.

Value Function:  A value function specifies what is good in the long run, as opposed to the immediacy of a reward. A state might always yield a low (immediate) reward but still have a high value because it is regularly followed by other states that yield high rewards. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. I know I haven’t formally defined a state yet but that will happen naturally when we define Markov Processes in the next section. Till then you can intuitively think of a state as a description of the current situation in which the agent finds itself. We want our agent to undertake actions that bring about the states of highest value, not the highest reward, because these actions obtain the greatest amount of cumulative reward over many time-steps.

Model: A model of the environment is something that allows inferences to be made about how the environment will behave. Given a state and action, the model might predict the resultant next state and next reward. As mentioned briefly in the introduction, methods for solving reinforcement learning problems that use models and planning are model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning.

I don’t want to make this discussion too verbose, so any other knowledge that we require will come up as and when we need to understand it in order to apply it to a practical problem. Let us move on to the mathematical basis of Reinforcement Learning. 

3. Markov Processes

Formally, a stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A Markov process is a process which possesses this property. Another way of framing a Markov process is that the “future is independent of the past, given the present”. 

Screen Shot 2019-04-19 at 11.16.02 PM.png

Thus, the state captures all the relevant information from its history, and is enough, on its own, to describe the future. A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … (which taken together comprise the state space) with the Markov property. The image below, from David Silver’s lecture notes, formalizes the definition:

Screen Shot 2019-04-19 at 11.19.58 PM.png

The state transition probability matrix defines the probabilities of transition from all states s to all successor states s’. Thus, to completely define a Markov processall you need is a set of states that your system can be in, as well as a transition matrix. Considering this a review article of sorts, I don’t mind shamelessly pilfering examples from other resources, such as the one below from Lapan which is a very simple example of a Markov chain or processWe consider a system where we try to predict the probability that a rainy day leads to a sunny day, or vice-versa. An example transition matrix is shown below:

Screen Shot 2019-04-19 at 11.29.06 PM

So, the probability that the next day will be rainy given that today is sunny is 0.2, and so on. Remember the Markov property, which states that we only need to know the state of the system today in order to predict the next state (tomorrow). The past history can be safely discarded. We talked about a reward signal in the previous section, and you may have intuited that if we introduce the concept of a Reward to a Markov Process then we arrive at…

4. Markov Reward Processes

In a Markov Process, we observe a series of state transitions, but now we want to associate a scalar reward to each of these transitions. In simplistic terms, we want to figure out if it is a good transition or a bad transition. So, all our observations have a reward attached to every transition of the system. One way of representing this is introducing another matrix, with rewards (instead of transition probabilities like before) assigned for each state transition. The formal definition of a Markov Reward Process (MRP) follows, as usual, from a David Silver lecture slide:

Screen Shot 2019-04-19 at 11.46.16 PM.png

So, we have introduced a Reward function and a discount (or gamma) factor to the existing formalism. These concepts can both be combined into defining the Return, which is:

Screen Shot 2019-04-19 at 11.49.15 PM.png

Thus, for every time-step t, the return is defined as the sum of subsequent rewards, multiplied by increasing powers of the discount factor. So the discount factor gives us the present value of future rewards. A value of 0 for the discount will imply that the return would just equal the immediate reward (‘near-sightedness’), while a discount factor of 1 means that the return takes into account all future rewards (‘far-sightedness’). So the closer the gamma factor is to 1the more future rewards we want to take into account. How much you set the discount factor at depends on the particular situation, for instance, in finance applications, immediate rewards may have more benefits than long-term rewards. It is possible to use un-discounted MRPs in finite horizon environments where there are a limited number of steps that can be taken (Lapan uses the example of Tic-Tac-Toe which can have at most 9 steps). Return is not very useful in practice, as it is defined for every specific chain we observe from our MRP. Thus, it can vary widely even for the same state. However, if we calculate the mathematical expectation of return for any state (by calculating the expectation of a large number of Markov processes), we get a more useful quantity, called a value function v(s), which gives the long term value of the state s, and is defined as:

Screen Shot 2019-04-20 at 12.32.02 AM.png

So, for every state s, the value v (s) is the average return we get by following the Markov reward process. The Bellman equation for MRPs breaks down the value function into two parts, as shown below (excuse the handwriting!):

Scan Apr 20, 2019 at 12.43 AM.JPG

This Bellman equation can be expressed using matrices:

Screen Shot 2019-04-20 at 12.52.49 AM.png

The above is a system of linear equations and can be solved directly, though with considerable computational complexity (O(n^3)). For larger MRPs, solutions have to be iterative using techniques such as Dynamic Programming and Temporal-Difference LearningIt’s about time for us to introduce actions or decisions into our Markov Reward Processes, which leads us into…

5. Markov Decision Processes

Fairly intuitively, a Markov Decision Process is a Markov Reward Process with decisions. An MDP is an environment in which all states are Markov. MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent. The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions. In other words, MDPs frame Reinforcement Learning problems. The agent-environment interaction in a Markov Decision Process is shown below:

Screen Shot 2019-04-20 at 1.16.11 AM

A formal definition of MDPs follows from David Silver’s lecture slides:

Screen Shot 2019-04-20 at 1.21.36 AM.png

We see that the only thing different from the MRP definition in the above section is the introduction of actions. As we have seen above, a policy fully determines how an agent will behave at a given time, that is, a policy governs the decision-making capacity of an agent. A fixed policy means no decision making, and therefore the MDP becomes a MRP. Formally, a policy is defined as the probability distribution over actions for every possible state:

Screen Shot 2019-04-20 at 1.30.02 AM.png

Policies, which are stationary (time independent), follow the Markov assumption and hence MDP policies depend only on the current state and not the history. We can define two types of value functions, the state value function and the state-action value (or action-value) functions as follows:

Screen Shot 2019-04-20 at 1.51.39 AM.png

Similar to the calculation for MRPs, the Bellman Expectation Equation sees the two value functions broken down into the sum of the immediate reward and the discounted value of the successor state:

Screen Shot 2019-04-20 at 1.56.06 AM.png

The optimal value function is the best possible performance in an MDP, and a MDP is said to be solved when the optimal value function is known.

Screen Shot 2019-04-20 at 2.01.02 AM

For any Markov Decision Process, there exists an optimal policy that is better than or equal to other existing policies. In addition, there is a theorem which states that all optimal policies achieve the optimal value function. The theorem also states that all optimal policies achieve the optimal action-value function.

Screen Shot 2019-04-20 at 2.08.26 AM.png

There is always a deterministic optimal policy for any MDP and knowing the optimal action-value function means that we immediately have the optimal policy. The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that stateThese equations are special consistency conditions that the optimal value functions must satisfy and that can, in principle, be solved for the optimal value functions, from which an optimal policy can be determined with relative ease.

Scan Apr 20, 2019 at 2.23 AM.JPG

The Bellman Optimality Equation is non-linear and in general does not possess a closed-form solution. It is solved by iterative methods such as Policy Iteration, Value Iteration, Q-Learning and Sarsa. Policy Iteration and Value Iteration will be discussed in the next post, which will deal with solving MDPs using Dynamic Programming techniques. Q-Learning and Sarsa will be elaborated upon in a future post. 

Thus we come to the end of the first post on Reinforcement Learning basics. The subsequent post will contain working code examples, but I felt it important to somewhat motivate the mathematical background in this particular article. We started off with the Markov assumption, went into the Bellman equation for Markov Reward Processes and ended up with formulating optimal policies for Markov Decision Processes. The resources mentioned in the introduction are highly recommended if you want to get a deeper understanding of MDPs, beyond the basic concepts that I have included in this article. I am also completely new to Reinforcement Learning so please let me know if there are any conceptual errors, or if anything could be written better. Till the next time then! Ciao!

One thought on “An Introduction to Reinforcement Learning – I :: Markov Decision Processes

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s