Policy Composition
What we want to do:
The idea is to train multiple policies, each with its own 'desires' (as defined by its own unique reward structure), and then combine them in a structured way.
As an example, we could have one policy incentivized to find hidden passages, one to find stairs, one to kill monsters, and one to avoid taking damage. By weighting how much we care about each of these desires, we create a composite policy that leverages the learned behaviors of each.
https://deepmind.com/blog/article/fast-reinforcement-learning-through-the-composition-of-behaviours
https://www.pnas.org/content/117/48/30079
https://arxiv.org/pdf/2106.13105.pdf
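A minimal sketch of what this could look like in the value-based setting, assuming we already have a per-desire Q-function for each reward (here toy Q-tables indexed `[state, action]`). All names, weights, and numbers below are illustrative assumptions, not taken from the linked papers:

```python
# Toy sketch of weighted policy composition over per-desire Q-functions.
# Assumes tabular Q-tables of shape [n_states, n_actions]; a real agent
# would use learned Q-networks, but the composition rule is the same.
import numpy as np

def composite_action(q_tables, weights, state):
    """Pick the action maximizing the weighted sum of per-policy Q-values.

    Each Q-table encodes one desire (find passages, reach stairs, ...),
    and `weights` says how much we currently care about each of them.
    """
    combined = sum(w * q[state] for w, q in zip(weights, q_tables))
    return int(np.argmax(combined))

def dominant_desire(q_tables, weights, state, action):
    """The built-in explainability angle: report which weighted desire
    contributes most to the value of the action just taken."""
    contributions = [w * q[state, action] for w, q in zip(weights, q_tables)]
    return int(np.argmax(contributions))

# Illustrative usage: two desires, two states, three actions.
q_explore = np.array([[1.0, 0.0, 0.0],
                      [0.2, 0.1, 0.0]])  # rewarded for finding hidden passages
q_stairs = np.array([[0.0, 0.0, 1.0],
                     [0.0, 0.9, 0.0]])   # rewarded for reaching stairs
a = composite_action([q_explore, q_stairs], [0.3, 0.7], state=0)
why = dominant_desire([q_explore, q_stairs], [0.3, 0.7], state=0, action=a)
```

With the stairs desire weighted more heavily, the composite agent picks the stairs-seeking action, and `dominant_desire` tells us that the stairs policy is the one driving that choice at this moment.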
Why:
- Might mitigate/solve the sparse-reward problem in many of the MiniHack/NetHack environments.
- Might help with skill transfer across environments
- Something something Qualia
- If it works, provides built-in explainability: we can inspect which goals the agent is prioritizing at any given moment.
- I think that if I can combine a policy incentivized to find hidden things with one that goes to the stairs, then maybe we can finally beat that stupid Corridor environment.