Gridworld

Introduction

The aim of this lab is to familiarize you with Markov decision processes and three fundamental algorithms:

  • iterative policy evaluation
  • policy iteration
  • value iteration

You will implement these algorithms from scratch on the Gridworld environment.

The gridworld environment

The figure above represents our gridworld problem, where the cells of the rectangular grid correspond to the states of the Markov decision process $(\mathcal{X}, \mathcal{A}, r, p, \gamma)$. In each cell, four actions are possible, $\mathcal{A} = \{north, south, east, west\}$, which deterministically move the agent one cell in the respective direction on the grid. Note that actions that would take the agent off the grid leave its location unchanged. The reward is $-1$ on all transitions until a terminal state, i.e., a shaded cell, is reached. For this application we consider a uniformly random policy and a discount factor $\gamma = 1$.

Question 1

Define a list states which encodes all possible states of this environment, omitting the absorbing states, and a list actions which enumerates the possible actions.

In [9]:
## Code goes here
## %load solutions/question_1.py
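
One possible encoding, as a sketch: since the figure is not reproduced here, the 4×4 grid size and the placement of the shaded terminal cells in two opposite corners are assumptions, not given by the lab.

```python
# Sketch assuming a 4x4 grid with terminal (shaded) cells in the
# top-left and bottom-right corners.
N = 4
terminal_states = [(0, 0), (N - 1, N - 1)]

# All non-terminal states, encoded as (row, column) pairs.
states = [(i, j) for i in range(N) for j in range(N)
          if (i, j) not in terminal_states]

actions = ["north", "south", "east", "west"]
```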

Question 2

Write a function step which, given a current state and an action, moves the agent according to that action and returns the resulting state and reward.

In [10]:
## Code goes here
## %load solutions/question_2.py
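
A minimal sketch of such a step function, again assuming a 4×4 grid with terminal corners; returning a (next state, reward) pair is one convenient convention, not the only possible signature.

```python
N = 4                                   # assumed grid size
terminal_states = [(0, 0), (N - 1, N - 1)]
moves = {"north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Move one cell in the given direction; an action that would take
    the agent off the grid leaves the state unchanged. The reward is
    -1 on every transition, as stated in the environment description."""
    i, j = state
    di, dj = moves[action]
    ni, nj = i + di, j + dj
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j                   # bumped into the wall: stay put
    return (ni, nj), -1
```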

Question 3

Write a function that implements the iterative policy evaluation algorithm to give an estimate of the value function.

In [5]:
## Code goes here
## %load solutions/question_3.py
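
A sketch of iterative policy evaluation for the uniformly random policy, with in-place sweeps until the largest update falls below a tolerance. The grid geometry and the name policy_evaluation are assumptions; with a 4×4 grid and terminal corners this reproduces the classic values (−14 next to a terminal cell, −22 in the far corners).

```python
N = 4                                   # assumed grid size
terminal_states = [(0, 0), (N - 1, N - 1)]
states = [(i, j) for i in range(N) for j in range(N)
          if (i, j) not in terminal_states]
actions = ["north", "south", "east", "west"]
moves = {"north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def step(state, action):
    i, j = state
    di, dj = moves[action]
    ni, nj = i + di, j + dj
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j
    return (ni, nj), -1

def policy_evaluation(gamma=1.0, theta=1e-10):
    """Iterative policy evaluation for the equiprobable random policy:
    sweep over the states, replacing V(s) by the expected one-step
    backup, until the largest change in a sweep is below theta."""
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}  # terminals stay 0
    while True:
        delta = 0.0
        for s in states:
            v = sum((r + gamma * V[s2]) / len(actions)
                    for s2, r in (step(s, a) for a in actions))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```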

Question 4

Write a function that implements the policy iteration algorithm to get an estimate of the optimal policy.

In [6]:
## Code goes here 
## %load solutions/question_4.py
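
A sketch of policy iteration under the same assumed geometry. One caveat worth hedging: with $\gamma = 1$, an arbitrary initial deterministic policy may never reach a terminal state, so exact evaluation of it would not terminate; the sketch below therefore truncates each evaluation to a fixed number of sweeps (a modified policy iteration), which is a design choice, not something the lab prescribes.

```python
N = 4                                   # assumed grid size
terminal_states = [(0, 0), (N - 1, N - 1)]
states = [(i, j) for i in range(N) for j in range(N)
          if (i, j) not in terminal_states]
actions = ["north", "south", "east", "west"]
moves = {"north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def step(state, action):
    i, j = state
    di, dj = moves[action]
    ni, nj = i + di, j + dj
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j
    return (ni, nj), -1

def policy_iteration(gamma=1.0, eval_sweeps=100):
    """Alternate truncated policy evaluation and greedy improvement
    until the policy no longer changes."""
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    policy = {s: "north" for s in states}
    while True:
        # Truncated evaluation: eval_sweeps in-place sweeps, so an
        # improper policy (one that loops forever) cannot hang us.
        for _ in range(eval_sweeps):
            for s in states:
                s2, r = step(s, policy[s])
                V[s] = r + gamma * V[s2]
        # Improvement: act greedily with respect to the current V.
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return V, policy
```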

Question 5

Write a function that implements the value iteration algorithm to get an estimate of the optimal policy.

In [7]:
## Code goes here
## %load solutions/question_5.py
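
A sketch of value iteration under the same assumed 4×4 geometry: replace the expectation over actions by a maximum in each backup, then read off a greedy policy from the converged values. With the $-1$-per-step reward and $\gamma = 1$, the optimal value of a state is minus its distance to the nearest terminal cell.

```python
N = 4                                   # assumed grid size
terminal_states = [(0, 0), (N - 1, N - 1)]
states = [(i, j) for i in range(N) for j in range(N)
          if (i, j) not in terminal_states]
actions = ["north", "south", "east", "west"]
moves = {"north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def step(state, action):
    i, j = state
    di, dj = moves[action]
    ni, nj = i + di, j + dj
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j
    return (ni, nj), -1

def value_iteration(gamma=1.0, theta=1e-10):
    """Sweep with the Bellman optimality backup until convergence,
    then extract a greedy policy from the resulting values."""
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s in states:
            v = max(r + gamma * V[s2]
                    for s2, r in (step(s, a) for a in actions))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    policy = {s: max(actions,
                     key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
              for s in states}
    return V, policy
```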

Question 6

Do the above estimates make sense? Experiment with what happens when you change the reward.

In [8]:
## Code goes here
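
One quick sanity check, as a sketch under the same assumed geometry: with $\gamma = 1$ every optimal value is (reward per step) × (distance to the nearest terminal cell), so scaling the reward scales the whole value function without changing which actions are greedy.

```python
N = 4                                   # assumed grid size
terminal_states = [(0, 0), (N - 1, N - 1)]
states = [(i, j) for i in range(N) for j in range(N)
          if (i, j) not in terminal_states]
actions = ["north", "south", "east", "west"]
moves = {"north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def step(state, action, reward=-1.0):
    # Same dynamics as before, but with a configurable per-step reward.
    i, j = state
    di, dj = moves[action]
    ni, nj = i + di, j + dj
    if not (0 <= ni < N and 0 <= nj < N):
        ni, nj = i, j
    return (ni, nj), reward

def value_iteration(reward=-1.0, gamma=1.0, theta=1e-10):
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s in states:
            v = max(r + gamma * V[s2]
                    for s2, r in (step(s, a, reward) for a in actions))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

V1 = value_iteration(reward=-1.0)
V5 = value_iteration(reward=-5.0)
# Every value is scaled by the same factor of 5; the greedy policy,
# which only compares values, is unchanged.
```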

Question 7

Try to reproduce the results from the lecture notes about the Markov decision process example taken from Wikipedia, i.e., the value function for a random policy, and the optimal value function and policy.

In [9]:
## Code goes here