\(\def\lim#1#2{ \underset{#1 \rightarrow #2}{lim} }\) \(\def\min#1{ \underset{#1}{min} \hspace{.1cm} }\) \(\def\argmin#1{ \underset{#1}{argmin} \hspace{.1cm} }\)

\(\def\*J#1{ {J^*_{#1}} }\) \(\def\~*J#1{ \overset{\sim}{J^*_{#1}} }\) \(\def\ud#1#2{ \underset{#2}{#1} }\)

Clearnotes

Deterministic

Strategy for approaching this book and the author's style

LINKS

  • Why do people rewrite the DP problem in terms of Q?

  • Multi-step lookahead vs. one-step lookahead

    What is your proposal?

  • Potential Improvements

  • Mathematics Courses
  • Intuition/rigor-based experimentation
  • Rollout

    Definitions

    Points

    Dynamic Programming Algorithm

    Principle of Optimality

    Q factors and Q learning

    Approximation in Value Space and Rollout: use \(\~*J{k} \approx \*J{k}\)

    Stochastic Dynamic Programming

    TODO: Need to review the previous sections and get a high-level overview.

    RL SPIN EXAMPLE Understanding

    What is rollout? Is it an estimate of \(J(u_0, u_1, \dots, u_{k-1}, u_k)\)? Are you sure? Let's read it again. What is the difference between Q-factor estimation and optimal policy optimization?
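    As a concrete anchor (my own sketch, not the book's code; the names dynamics, stage_cost, terminal_cost, base_policy, and controls below are placeholders I am assuming): rollout estimates the Q-factor of each candidate control at stage k by applying that control and then simulating a fixed base (heuristic) policy out to the horizon, and the rollout policy picks the control with the best estimated Q-factor.

        # Minimal rollout sketch for a deterministic finite-horizon problem,
        # Bertsekas-style cost minimization (all function names are assumed placeholders):
        #   x_{k+1} = dynamics(k, x_k, u_k), stage cost stage_cost(k, x_k, u_k),
        #   terminal cost terminal_cost(x_N), heuristic base_policy(j, x).

        def q_rollout(k, x, u, N, dynamics, stage_cost, terminal_cost, base_policy):
            """Estimate Q_k(x, u): apply u now, then follow the base policy up to horizon N."""
            cost = stage_cost(k, x, u)
            x = dynamics(k, x, u)
            for j in range(k + 1, N):
                u_j = base_policy(j, x)          # tail controls come from the heuristic
                cost += stage_cost(j, x, u_j)
                x = dynamics(j, x, u_j)
            return cost + terminal_cost(x)

        def rollout_control(k, x, N, controls, dynamics, stage_cost, terminal_cost, base_policy):
            """One-step lookahead: pick the control with the smallest rollout Q-factor."""
            return min(controls(k, x),
                       key=lambda u: q_rollout(k, x, u, N, dynamics, stage_cost,
                                               terminal_cost, base_policy))

    Read this way, rollout is Q-factor estimation with the base policy's tail cost standing in for \(\*J{k+1}\), not an optimization over all policies.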

    OK, so let's go back to the basics: G()

    A trajectory is \((x_0, u_0, x_1, u_1, \dots)\), with reward \(r(x_t, u_t)\) and transition probabilities \(p(x_{t+1} \mid x_t, u_t)\).

    \(V(\tau) = E[R(\tau)]\). The goal is to find the maximal trajectory, which is given by the greedy optimal policy \(\max_{a_t} \big( r(x_t, a_t) + \gamma \, E[V^*(x_{t+1})] \big)\).
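    To make the greedy step concrete (a minimal sketch, assuming a small finite problem where both the transition model and \(V^*\) happen to be available as lookup tables; every name below is a placeholder of mine):

        def greedy_action(x, actions, r, p, V_star, gamma):
            """argmax_a [ r(x, a) + gamma * E_{x' ~ p(.|x, a)}[ V*(x') ] ].

            Assumes p[(x, a)] is a dict {next_state: probability} and
            V_star is a dict {state: optimal value}.
            """
            def one_step_backup(a):
                expected_next = sum(prob * V_star[x_next]
                                    for x_next, prob in p[(x, a)].items())
                return r(x, a) + gamma * expected_next
            return max(actions, key=one_step_backup)

    Approximation in value space is essentially this same computation with some \(\~*J{k}\) plugged in for the unknown \(V^*\).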

    \(R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\), \(V^{\pi}(s) = E_{\tau \sim \pi}[\, R(\tau) \mid s_0 = s \,]\), \(Q^{\pi}(s, a) = E_{\tau \sim \pi}[\, R(\tau) \mid s_0 = s, a_0 = a \,]\)
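    One way to read these definitions operationally (a hedged Monte Carlo sketch; the step(s, a) -> (s', r, done) simulator interface and all names are my own assumptions, not an existing API):

        def sample_return(step, policy, s0, gamma, horizon, a0=None):
            """Sample one trajectory from s0 (optionally forcing the first action a0)
            and accumulate its discounted return R(tau) = sum_t gamma^t r(s_t, a_t)."""
            s, ret, discount = s0, 0.0, 1.0
            for t in range(horizon):
                a = a0 if (t == 0 and a0 is not None) else policy(s)
                s, r, done = step(s, a)          # assumed simulator call
                ret += discount * r
                discount *= gamma
                if done:
                    break
            return ret

        def mc_value(step, policy, s0, gamma, horizon, n=1000):
            """V^pi(s0) ~ average of R(tau) over n trajectories started at s0 under pi."""
            return sum(sample_return(step, policy, s0, gamma, horizon) for _ in range(n)) / n

        def mc_q(step, policy, s0, a0, gamma, horizon, n=1000):
            """Q^pi(s0, a0) ~ average of R(tau) over n trajectories forced to start with a0."""
            return sum(sample_return(step, policy, s0, gamma, horizon, a0=a0)
                       for _ in range(n)) / n

    The expectation is over all trajectories the policy and the dynamics can generate, which is generally intractable to compute exactly; sample averages like these are one standard replacement.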

    In practice, \(V^{\pi}(s_t)\) cannot be computed exactly, so it has to be approximated. What is the reason? What is \(\hat{R}\)? How can we get \(R\) if the model of the environment doesn't tell us? (Check the book by Dimitri Bertsekas; there the reward/cost is only needed for the terminal states.)