Policy Composition
What we want to do:
The idea is to train multiple policies, each with its own 'desires' (as defined by its own unique reward structure), and then combine them in a structured way.
As an example, we could have one policy incentivized to find hidden passages, one to find stairs, one to kill monsters, and one to avoid taking damage. By weighting how much we care about each of these desires, we create a composite policy that leverages the learned behaviors of each.
https://deepmind.com/blog/article/fast-reinforcement-learning-through-the-composition-of-behaviours
https://www.pnas.org/content/117/48/30079
https://arxiv.org/pdf/2106.13105.pdf
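A minimal sketch of what this could look like in the value-based setting, assuming we already have a per-desire Q-function for each reward (here toy Q-tables indexed `[state, action]`). All names, weights, and numbers below are illustrative assumptions, not taken from the linked papers:

```python
# Toy sketch of weighted policy composition over per-desire Q-functions.
# Assumes tabular Q-tables of shape [n_states, n_actions]; a real agent
# would use learned Q-networks, but the composition rule is the same.
import numpy as np

def composite_action(q_tables, weights, state):
    """Pick the action maximizing the weighted sum of per-policy Q-values.

    Each Q-table encodes one desire (find passages, reach stairs, ...),
    and `weights` says how much we currently care about each of them.
    """
    combined = sum(w * q[state] for w, q in zip(weights, q_tables))
    return int(np.argmax(combined))

def dominant_desire(q_tables, weights, state, action):
    """The built-in explainability angle: report which weighted desire
    contributes most to the value of the action just taken."""
    contributions = [w * q[state, action] for w, q in zip(weights, q_tables)]
    return int(np.argmax(contributions))

# Illustrative usage: two desires, two states, three actions.
q_explore = np.array([[1.0, 0.0, 0.0],
                      [0.2, 0.1, 0.0]])  # rewarded for finding hidden passages
q_stairs = np.array([[0.0, 0.0, 1.0],
                     [0.0, 0.9, 0.0]])   # rewarded for reaching stairs
a = composite_action([q_explore, q_stairs], [0.3, 0.7], state=0)
why = dominant_desire([q_explore, q_stairs], [0.3, 0.7], state=0, action=a)
```

With the stairs desire weighted more heavily, the composite agent picks the stairs-seeking action, and `dominant_desire` tells us that the stairs policy is the one driving that choice at this moment.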
Why:
- Might mitigate/solve the sparse-reward problem in many of the MiniHack/NetHack environments.
- Might help with skill transfer across environments
- Something something Qualia
- If it works, provides built-in explainability: we can inspect which goals the agent is prioritizing at any given moment.
- I think that if I can combine a policy incentivized to find hidden things with one that goes to the stairs, then maybe we can finally beat that stupid Corridor environment.