We are going to learn a (parametrized) policy to win at the CartPole challenge using the REINFORCE algorithm. More specifically, we use the following parametrization for $\pi_\theta$: \begin{equation*} \mbox{logit} ~Pr(\text{right} \mid X = x) = x^\top \theta, \end{equation*} where $x$ is the state vector and $\theta$ a vector of parameters to be learnt. Note that we deliberately did not include an intercept in the above parametrization.
Show that we have \begin{equation*} \nabla \log \pi_\theta(\text{right} \mid x) = x \pi_\theta(\text{left} \mid x) \end{equation*} and \begin{equation*} \nabla \log \pi_\theta(\text{left} \mid x) = -x \pi_\theta(\text{right} \mid x). \end{equation*}
## Answer goes here
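This question asks for a pen-and-paper proof, but a quick numerical sanity check of both identities can catch algebra mistakes. The sketch below compares the claimed closed-form gradients against central finite differences; all function names here are illustrative helpers, not part of any required solution.

```python
import numpy as np

def pi_right(x, theta):
    # Logistic policy: Pr(right | x) = 1 / (1 + exp(-x^T theta))
    return 1.0 / (1.0 + np.exp(-x @ theta))

def numerical_grad(f, theta, eps=1e-6):
    # Central finite-difference approximation of the gradient of f at theta.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, theta = rng.normal(size=4), rng.normal(size=4)

# Claimed: grad log pi(right | x) = x * pi(left | x)
assert np.allclose(
    numerical_grad(lambda t: np.log(pi_right(x, t)), theta),
    x * (1.0 - pi_right(x, theta)), atol=1e-5)

# Claimed: grad log pi(left | x) = -x * pi(right | x)
assert np.allclose(
    numerical_grad(lambda t: np.log(1.0 - pi_right(x, t)), theta),
    -x * pi_right(x, theta), atol=1e-5)
```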
Write a function that learns an optimal parametrized policy for the CartPole challenge. Note that you may want to set the discount factor to $\gamma = 1$ and the learning rate to $\eta = 0.001$.
Hint: The `deque` object from the `collections` module might be useful to store an entire episode efficiently.
# %load reinforce.py
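As a starting point, here is a minimal sketch of the REINFORCE gradient step for the logistic policy above, using the gradient identities from the previous question. It covers only the update from one stored episode (the environment interaction is left out), and the function names are placeholders rather than a reference solution.

```python
import numpy as np
from collections import deque

def pi_right(x, theta):
    # Logistic policy: Pr(right | x) = 1 / (1 + exp(-x^T theta))
    return 1.0 / (1.0 + np.exp(-x @ theta))

def grad_log_pi(x, a, theta):
    # Uses the identities derived above (a = 1 codes "right", a = 0 "left"):
    #   grad log pi(right | x) =  x * pi(left  | x)
    #   grad log pi(left  | x) = -x * pi(right | x)
    p = pi_right(x, theta)
    return x * (1.0 - p) if a == 1 else -x * p

def reinforce_update(theta, episode, gamma=1.0, eta=0.001):
    """One REINFORCE gradient ascent step from a complete episode.

    episode: deque of (state, action, reward) triples, in time order.
    """
    G = 0.0                      # return-to-go, accumulated backwards
    grad = np.zeros_like(theta)
    for x, a, r in reversed(episode):
        G = r + gamma * G
        grad += G * grad_log_pi(x, a, theta)
    return theta + eta * grad

# Dry run on a hypothetical one-step episode.
x = np.array([0.1, 0.0, -0.05, 0.0])
episode = deque([(x, 1, 1.0)])
theta = reinforce_update(np.zeros(4), episode)
```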
Think about $Q$-learning or SARSA learning strategies. What could be a drawback of such approaches for the CartPole problem? Does the REINFORCE strategy suffer from this limitation?
## Answer goes here
Run your brand new algorithm for, say, $N = 1000$ episodes, and plot the evolution of the cumulative reward as training goes on.
# %load question_4.py
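A possible way to visualize the training curve, assuming your training loop returns the list of per-episode cumulative rewards. The `rewards` array below is synthetic placeholder data, and the plotting step is skipped if matplotlib is unavailable.

```python
import numpy as np

def moving_average(rewards, window=50):
    # Smooth the noisy per-episode rewards with a sliding-window mean.
    rewards = np.asarray(rewards, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")

# Placeholder training curve; replace with the rewards from your training loop.
rewards = np.minimum(200.0, 10.0 + 0.25 * np.arange(1000))
smoothed = moving_average(rewards, window=50)

try:
    import matplotlib.pyplot as plt
    plt.plot(rewards, alpha=0.3, label="per-episode reward")
    plt.plot(np.arange(49, 1000), smoothed, label="moving average (50)")
    plt.xlabel("episode")
    plt.ylabel("cumulative reward")
    plt.legend()
    plt.savefig("training_curve.png")
except ImportError:
    pass  # plotting is optional for this sketch
```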
Using the parametrized policy you just learnt, play a game with it.
# %load question_5.py