https://github.com/laserany/snake-ai-model
https://github.com/deepmind/open_spiel
KL-divergence
Advantage: how much better or worse an action is than the average action the current policy would take in a given state, measured in expected return: A(s, a) = Q(s, a) - V(s).
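A minimal sketch of the advantage for a single state, assuming the action values Q(s, a) and the policy's action probabilities are already known. The function name and the numbers are hypothetical and chosen only for illustration; in practice the advantage is usually estimated from sampled returns rather than computed exactly.

```python
def advantages(q_values, policy_probs):
    """Return A(s, a) = Q(s, a) - V(s) for each action in one state."""
    # V(s) is the policy-weighted average of the action values.
    state_value = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - state_value for q in q_values]


# Hypothetical numbers: one state, three actions.
q_values = [1.0, 2.0, 4.0]
policy_probs = [0.5, 0.25, 0.25]

adv = advantages(q_values, policy_probs)
print(adv)  # [-1.0, 0.0, 2.0]
```

A positive value means the action is better than what the policy does on average in that state, a negative value means it is worse, and the policy-weighted mean of the advantages is zero.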
The main idea behind PPO is
https://spinningup.openai.com/en/latest/algorithms/ppo.html