We are going to learn a (parametrized) policy to win at the CartPole challenge using the REINFORCE algorithm. More specifically, we use the following parametrization for $\pi_\theta$ \begin{equation*} \mbox{logit} ~Pr(\text{right} \mid X = x) = x^\top \theta, \end{equation*} where $x$ is the state vector and $\theta$ a vector of parameters to be learnt. Note that we intentionally did not include an intercept in the above parametrization.
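To make the parametrization concrete, here is a minimal sketch (not part of the assignment; the function names are our own) of how one could compute the action probabilities and sample an action from $\pi_\theta$:

In [ ]:
import numpy as np

def prob_right(x, theta):
    # Pr(right | x) = sigmoid(x^T theta); Pr(left | x) = 1 - Pr(right | x)
    return 1.0 / (1.0 + np.exp(-x @ theta))

def sample_action(x, theta, rng):
    # 1 = right, 0 = left, drawn according to the parametrized policy
    return int(rng.random() < prob_right(x, theta))

For instance, with $\theta = 0$ both actions are sampled with probability $1/2$.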

Question 1:

Show that we have \begin{equation*} \nabla \log \pi_\theta(\text{right} \mid x) = x \pi_\theta(\text{left} \mid x) \end{equation*} and \begin{equation*} \nabla \log \pi_\theta(\text{left} \mid x) = -x \pi_\theta(\text{right} \mid x). \end{equation*}

In [ ]:
## Answer goes here
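In case it is useful for checking your own answer, here is one possible derivation sketch, writing $p = \pi_\theta(\text{right} \mid x) = \sigma(x^\top \theta)$ with $\sigma$ the sigmoid function: \begin{align*} \nabla \log p &= \nabla\left[-\log\left(1 + e^{-x^\top\theta}\right)\right] = \frac{x\, e^{-x^\top\theta}}{1 + e^{-x^\top\theta}} = x\,(1 - p) = x\,\pi_\theta(\text{left} \mid x), \\ \nabla \log (1 - p) &= -\frac{\nabla p}{1 - p} = -\frac{x\, p\,(1 - p)}{1 - p} = -x\, p = -x\,\pi_\theta(\text{right} \mid x), \end{align*} where we used the sigmoid derivative $\nabla p = x\, p\,(1 - p)$.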

Question 2:

Write a function that learns an optimal parametrized policy for the CartPole challenge. Note that you may want to set the discount factor to $\gamma = 1$ and the learning rate to $\eta = 0.001$.

Hint: The deque object from the collections module might be useful to store an entire episode efficiently.

In [ ]:
# %load reinforce.py
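If it helps to see the overall structure before loading reinforce.py, here is a minimal REINFORCE sketch for the parametrization above. It assumes the classic gym reset/step API (4-tuple returned by env.step); the function name reinforce and its signature are our own choices and need not match the loaded solution.

In [ ]:
# A possible REINFORCE sketch (assumes the pre-0.26 gym API; adapt if using gymnasium).
import gym
import numpy as np
from collections import deque

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce(n_episodes=1000, gamma=1.0, eta=0.001, seed=0):
    env = gym.make("CartPole-v1")
    rng = np.random.default_rng(seed)
    theta = np.zeros(env.observation_space.shape[0])
    total_rewards = []

    for _ in range(n_episodes):
        episode = deque()                        # stores (state, action, reward) triplets
        x = env.reset()
        done = False
        while not done:
            p_right = sigmoid(x @ theta)
            a = int(rng.random() < p_right)      # 1 = right, 0 = left
            x_next, r, done, _ = env.step(a)
            episode.append((x, a, r))
            x = x_next

        # Backward pass: accumulate returns and update theta.
        # With gamma = 1 the usual gamma**t factor in the update equals 1 and is omitted.
        G = 0.0
        for x_t, a_t, r_t in reversed(episode):
            G = r_t + gamma * G
            p_right = sigmoid(x_t @ theta)
            # grad log pi(a_t | x_t): x (1 - p) if right, -x p if left (see Question 1)
            grad_log = x_t * (1.0 - p_right) if a_t == 1 else -x_t * p_right
            theta = theta + eta * G * grad_log

        total_rewards.append(sum(r for _, _, r in episode))

    env.close()
    return theta, total_rewards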

Question 3:

Think about $Q$-learning or SARSA learning strategies. What could be a drawback of such approaches for the CartPole problem? Does the REINFORCE strategy suffer from this limitation?

In [ ]:
## Answer goes here

Question 4:

Run your brand new algorithm for, say, $N = 1000$ episodes, and plot the evolution of the cumulative reward as training progresses.

In [1]:
# %load question_4.py
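As a possible reference point, the cell below shows how one could call the hypothetical reinforce() sketch from above for $N = 1000$ episodes and plot the per-episode cumulative reward together with a moving average (the smoothing window is an arbitrary choice):

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

theta, total_rewards = reinforce(n_episodes=1000, gamma=1.0, eta=0.001)

window = 20
smoothed = np.convolve(total_rewards, np.ones(window) / window, mode="valid")

plt.plot(total_rewards, alpha=0.3, label="cumulative reward per episode")
plt.plot(np.arange(window - 1, len(total_rewards)), smoothed,
         label=f"{window}-episode moving average")
plt.xlabel("Episode")
plt.ylabel("Cumulative reward")
plt.legend()
plt.show()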

Question 5:

Play a game using the parametrized policy you have just learnt.

In [2]:
# %load question_5.py
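Here is a possible way to play a single game with the learnt parameters (again assuming the classic gym API and the theta returned by the hypothetical reinforce() sketch above); the policy is used greedily, i.e. the most probable action is taken at each step:

In [ ]:
import gym
import numpy as np

env = gym.make("CartPole-v1")
x = env.reset()
done, total = False, 0.0
while not done:
    env.render()                              # comment out if no display is available
    p_right = 1.0 / (1.0 + np.exp(-x @ theta))
    a = int(p_right > 0.5)                    # act greedily w.r.t. the learnt policy
    x, r, done, _ = env.step(a)
    total += r
env.close()
print("Cumulative reward:", total)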