```
### 💡 The idea
The Flatland environment presents challenges to learning algorithms with regard to organizing the agent's behavior in the presence of other agents, as well as coping with sparse and delayed rewards.
### 🗂️ Files and usage
All necessary functions can be found in `neurips2020-flatland-baselines/models/custom_loss_model.py`
An example config file can be found in `neurips2020-flatland-baselines/experiments/flatland_sparse/small_v0/tree_obs_fc_net/ImitationLearning/MARWIL.yaml`
Training is broadly divided into two phases:
* Pure Imitation Learning (Phase 1)
  * This step involves training in a purely offline process via stored experiences
* Hybrid or Mixed Learning (Phase 2)
  * This step involves training in both an offline process via stored experiences and actual env simulation experience
  * If Phase 1 is skipped this is a new experiment; otherwise training is restored from a checkpoint saved at the end of Phase 1
## Generate and save Experiences from expert actions
This can be done by running the `ImitationLearning/saving_experiences.py` file.
A saved version of the experiences can also be found in the `ImitationLearning/SaveExperiences` folder. You can copy them to the default input location with:
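(A sketch only: the destination path below is a placeholder for the offline `input` location set in your experiment config, not something defined by the repo.)

```python
import glob
import os
import shutil

# Placeholder destination: replace with the offline `input` directory
# referenced by your experiment YAML (e.g. the MARWIL or DQfD config).
dest = "/tmp/flatland-expert-experiences"
os.makedirs(dest, exist_ok=True)

# Copy every saved experience file into the offline input directory.
for path in glob.glob("ImitationLearning/SaveExperiences/*"):
    shutil.copy(path, dest)
```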
This involves mixed training in a ratio of 25% expert experience to 75% simulation experience. This is a deviation from the earlier [DQfD](https://arxiv.org/pdf/1704.03732.pdf) paper, which includes a pure imitation (pre-training) step.
A nice explanation and summary can be found [here](https://danieltakeshi.github.io/2019/05/11/dqfd-followups/) and [here](https://danieltakeshi.github.io/2019/04/30/il-and-rl/)
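As a rough illustration of the 25%/75% mix (in the actual experiments the mixing is configured in the YAML file and handled by the trainer; the buffers and their contents below are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in experience stores: stored expert transitions and transitions
# collected from the live Flatland simulation.
expert_buffer = [{"source": "expert", "id": i} for i in range(1_000)]
sim_buffer = [{"source": "simulation", "id": i} for i in range(10_000)]

def sample_mixed_batch(batch_size=256, expert_fraction=0.25):
    """Compose a batch with roughly 25% expert and 75% simulation samples."""
    n_expert = int(batch_size * expert_fraction)
    n_sim = batch_size - n_expert
    expert_idx = rng.integers(0, len(expert_buffer), size=n_expert)
    sim_idx = rng.integers(0, len(sim_buffer), size=n_sim)
    return [expert_buffer[i] for i in expert_idx] + [sim_buffer[i] for i in sim_idx]

batch = sample_mixed_batch()
print(sum(s["source"] == "expert" for s in batch) / len(batch))  # -> 0.25
```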
Config file: `dqn_DQfD.yaml`
(Currently the Ape-X version does not work with the custom loss model; use the `dqn_DQfD.yaml` config instead. DQN is similar to Ape-X but slower.)
* Prioritised Experience Replay (PER), done individually for the expert and simulation samples. Currently [Distributed PER](https://arxiv.org/abs/1803.00933) is used, but it remains to be checked whether it is applied separately to the expert and simulation data or to the mixed samples (see the buffer sketch at the end of this section).
* Custom loss with an imitation loss applied only to the best expert trajectory (see the sketch after this list). The L2 loss can be ignored for now, as we do not want to regularise at this stage and the value of regularisation in reinforcement learning is debatable.
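A conceptual sketch of such a combined loss, assuming a PyTorch setting and using plain cross-entropy as a stand-in for the imitation term (the DQfD paper itself uses a large-margin supervised loss, and the actual implementation lives in `custom_loss_model.py`):

```python
import torch
import torch.nn.functional as F

def td_plus_imitation_loss(q_values, actions, td_targets, is_expert,
                           imitation_weight=1.0):
    """TD loss over all samples plus an imitation loss on expert samples only.

    q_values:   (batch, num_actions) predicted Q-values
    actions:    (batch,) actions taken (the expert's action for expert samples)
    td_targets: (batch,) bootstrapped TD targets
    is_expert:  (batch,) boolean mask marking expert transitions
    """
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.smooth_l1_loss(q_taken, td_targets)

    # Cross-entropy stand-in for the supervised/imitation term, masked so it
    # only affects transitions that came from the expert data.
    ce = F.cross_entropy(q_values, actions, reduction="none")
    expert_mask = is_expert.float()
    imitation_loss = (ce * expert_mask).sum() / expert_mask.sum().clamp(min=1.0)

    return td_loss + imitation_weight * imitation_loss

# Tiny usage example with random tensors.
loss = td_plus_imitation_loss(
    q_values=torch.randn(32, 5),
    actions=torch.randint(0, 5, (32,)),
    td_targets=torch.randn(32),
    is_expert=torch.rand(32) < 0.25,
)
```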
Distributed PER seems to be an important part of the good results achieved by Ape-X DQfD. This could be implemented by running the simulation and outputting the experiences; the expert and simulation experiences could then be prioritised separately in a separate process and updated at reasonable intervals, along the lines of the sketch below.
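A minimal, illustrative sketch of keeping two separately prioritised buffers and drawing a mixed batch from them (proportional prioritisation only; a real setup would use distributed replay actors rather than this toy class):

```python
import numpy as np

rng = np.random.default_rng(0)

class SimplePER:
    """Minimal proportional prioritised replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.data = []
        self.priorities = []

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority ** self.alpha)

    def sample(self, n):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = rng.choice(len(self.data), size=n, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha

# Expert and simulation streams are prioritised independently, then mixed
# 25% / 75% when building a training batch.
expert_per = SimplePER(capacity=10_000)
sim_per = SimplePER(capacity=100_000)

for i in range(100):
    expert_per.add({"id": i, "source": "expert"})
    sim_per.add({"id": i, "source": "simulation"})

expert_batch, _ = expert_per.sample(8)   # 25% of a 32-sample batch
sim_batch, _ = sim_per.sample(24)        # 75% of a 32-sample batch
```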