Introduction to Reinforcement Learning!

Reinforcement learning (RL) is an area of machine learning inspired by behaviorist psychology concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Source - Wikipedia


The Setting

  • Reinforcement Learning is characterized by an agent trying to learn by interacting with its environment.
  • At each time step t, the agent receives the environment's state and chooses an appropriate action in response. At the next time step (t+1), the agent receives a reward from the environment (which can be either encouraging or discouraging) along with the next state of the environment.
  • Every agent's task is to maximize the expected cumulative reward, i.e. the expected sum of rewards to be received over all future time steps (a minimal code sketch of this loop follows below).
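
The loop above is easy to see in code. Below is a minimal sketch, assuming the gymnasium package (the maintained fork of OpenAI Gym) is installed; the random agent is only a stand-in for a real policy:

```python
import gymnasium as gym

# Create an environment; CartPole is a classic control task.
env = gym.make("CartPole-v1")

state, info = env.reset()        # the agent receives the initial environment state
cumulative_reward = 0.0

for t in range(1000):
    action = env.action_space.sample()  # a real agent would pick an action from its policy
    # At the next time step the environment returns a reward and the next state.
    state, reward, terminated, truncated, info = env.step(action)
    cumulative_reward += reward
    if terminated or truncated:         # the episode reached a terminal state (or a time limit)
        break

print("Cumulative reward for this episode:", cumulative_reward)
env.close()
```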

Episodic vs. Continuous Tasks

There are two types of tasks in reinforcement learning:

  • Episodic tasks
  • Continuous tasks

Episodic Tasks

In episodic tasks, the agent interacts with the environment in episodes or parts, i.e. there is no single continuous interaction. Here the environment has some predefined terminal states, and upon reaching one of those states the episode terminates. Episodic tasks are more common than you think; consider an agent playing a game of chess. Here, the episode will end when one game is over. If the agent wins, it'll receive some positive reward, and if it loses, it'll receive some negative reward. Other examples include -

  • Any board or Atari game
  • When the agent needs to reach some destination
  • Control tasks, like a robotic arm throwing a dart

Continuous Tasks

Continuous tasks, in contrast, don't have fixed ending states. Here, the agent needs to interact with the environment and make decisions in real time. The following example will make this clear.

Consider the task of teaching a self-driving car how to drive. If it were an episodic task, the car would need to crash every time in order to end an episode and update its policy, which obviously isn't practical. Continuous tasks are much more like a real-world setting, where the agent needs to continuously interact with the environment and keep evolving.


The Reward Hypothesis

The reward hypothesis states that all goals can be framed as the maximization of expected cumulative reward.

Cumulative Reward (Return)

The cumulative reward at any time step is the sum of rewards that will be received in all future time steps. We call this the Return (G[t]).

G[t] = R[t+1] + R[t+2] + R[t+3] + …
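
As a quick sanity check, the return is just a plain sum of whatever rewards arrive after time step t; the reward values below are made up purely for illustration:

```python
# Hypothetical rewards R[t+1], R[t+2], R[t+3], R[t+4] observed after time step t.
future_rewards = [1.0, 0.0, -2.0, 5.0]

# The return G[t] is simply their sum.
G_t = sum(future_rewards)
print(G_t)  # 4.0
```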

Discounted Return

Using the return would mean the agent weighs rewards in the near and the distant future equally, but episodes can often be very long, and future rewards are not as relevant as present rewards. That's where the discounted return comes into the picture.

In discounted return, we use a hyper-parameter γ (gamma) to control the agent's farsightedness.

  • The discounted return at time step t is G[t] := R[t+1] + γR[t+2] + γ^2 R[t+3] + …
  • The discount rate γ is something that you set to refine the goal that you have for the agent.
    • It must satisfy 0 ≤ γ ≤ 1.
    • If γ = 0, the agent only cares about the most immediate reward.
    • If γ = 1, the return is not discounted.
    • For larger values of γ, the agent cares more about the distant future. Smaller values of γ result in more extreme discounting, where in the most extreme case the agent only cares about the most immediate reward (a short sketch of this effect follows below).
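
A small sketch of how γ changes the return, reusing the same made-up reward sequence as before:

```python
def discounted_return(rewards, gamma):
    """Compute G[t] = R[t+1] + γ·R[t+2] + γ^2·R[t+3] + ... for a finite list of rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, -2.0, 5.0]  # hypothetical R[t+1], R[t+2], ...

print(discounted_return(rewards, gamma=0.0))  # 1.0   -> only the most immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # 3.025 -> distant rewards are damped but still matter
print(discounted_return(rewards, gamma=1.0))  # 4.0   -> the undiscounted return
```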

MDP and One Step Dynamics

Markov Decision Process (MDP)

Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald A. Howard's book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automatic control, economics, and manufacturing.

Before I explain MDPs any further, we need to know a couple of new terms.

  • State Space - The state space S is the set of all possible (non-terminal) states.
    • In episodic tasks, we use S+ to refer to the set of all states, including terminal states.
  • Action Space - The action space A is the set of all possible actions. (Alternatively, A(s) refers to the set of actions available in state s∈S.)
  • Policy - A policy is a mapping from states to actions; it is the agent's strategy. It can be as simple as a dictionary or as complex as a deep neural network.
  • Expected Cumulative Reward (Expected Return) - The expected return E[G[t]] is the return the agent can expect to receive if it is in state s at time t and follows the policy thereafter.
  • One Step Dynamics - The one-step dynamics of the environment determine how the environment decides the next state and reward at any given time step (this is especially useful in dynamic programming). The dynamics can be defined by specifying p(s′, r ∣ s, a) = P(S[t+1] = s′, R[t+1] = r ∣ S[t] = s, A[t] = a) for each possible s′, r, s, and a; a small example follows this list.
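
One way to picture both a policy and the one-step dynamics is as plain lookup tables. The two-state environment below is entirely made up (the state names, actions, rewards, and probabilities are all illustrative):

```python
# p(s', r | s, a) for a hypothetical two-state environment.
# Keys are (state, action) pairs; values map (next_state, reward) to a probability.
one_step_dynamics = {
    ("low", "recharge"): {("high", 0.0): 1.0},
    ("low", "search"):   {("high", -3.0): 0.3, ("low", 1.0): 0.7},
    ("high", "search"):  {("high", 1.0): 0.8, ("low", 1.0): 0.2},
    ("high", "wait"):    {("high", 0.5): 1.0},
}

# A (deterministic) policy is just a mapping from states to actions.
policy = {"low": "recharge", "high": "search"}

# Sanity check: the probabilities over (s', r) must sum to 1 for every (s, a) pair.
for (s, a), outcomes in one_step_dynamics.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9
```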

Finite MDPs

A Markov decision process is a 5-tuple (S, A, P[a], R[a], γ), where

  • S is a finite set of states,
  • A is a finite set of actions (alternatively, A[s] is the finite set of actions available from state s),
  • P[a](s, s′) = P(S[t+1] = s′ ∣ S[t] = s, A[t] = a) is the probability that action a in state s at time t will lead to state s′ at time t+1,
  • R[a](s, s′) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s′ due to action a,
  • γ ∈ [0, 1] is the discount factor, which represents the difference in importance between future rewards and present rewards.

    Source - Wikipedia

    Note - The theory of Markov decision processes does not state that S or A is finite, but the basic algorithms assume that they are finite.
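
To tie the pieces together, here is a sketch of how the 5-tuple might be held in code. The names and numbers are illustrative, and the joint dynamics p(s′, r ∣ s, a) from the previous snippet stands in for both P[a] and R[a]:

```python
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: set      # S - a finite set of states
    actions: set     # A - a finite set of actions
    dynamics: dict   # p(s', r | s, a) - encodes both P[a](s, s') and R[a](s, s')
    gamma: float     # γ - the discount factor, 0 <= γ <= 1

# A tiny, made-up instance (only a few dynamics entries shown).
mdp = FiniteMDP(
    states={"low", "high"},
    actions={"search", "wait", "recharge"},
    dynamics={
        ("low", "recharge"): {("high", 0.0): 1.0},
        ("high", "wait"): {("high", 0.5): 1.0},
    },
    gamma=0.9,
)
```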

Written on June 26, 2018