# Global observation experiments with conv-net models
TLDR: Experiments with PPO on the stock global observations.
We compare two CNN policy architectures: Nature-CNN and IMPALA-CNN.
## Method
In this experiment, we compare the performance of two established CNN architectures on the stock global observation (https://flatlandrl-docs.aicrowd.com/04_specifications.html#id11).
In the first case, agents are based on the Nature-CNN architecture [2] that consists of 3 convolutional layers followed by a dense layer.
In the second case, the agents are based on the IMPALA-CNN [1] network, which consists of a 15-layer residual
network followed by a dense layer.
We employ PPO [3], a policy gradient algorithm.
All agents share the same centralized policy network.
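
As a point of reference, the sketch below shows the Nature-CNN policy head described above in PyTorch. It is an illustration only, not the exact model used in these experiments: the number of input channels and the action-space size are placeholders, while the filter sizes, strides, and the 512-unit dense layer follow the original paper [2].

```python
import torch
import torch.nn as nn


class NatureCNN(nn.Module):
    """Nature-CNN policy head: three convolutional layers followed by a dense layer [2].

    Input channels and number of actions are illustrative placeholders; the
    filter sizes and strides follow Mnih et al. [2].
    """

    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # The flattened size depends on the observation resolution, so it is
        # inferred lazily; 512 hidden units follow the original architecture.
        self.dense = nn.LazyLinear(512)
        self.logits = nn.Linear(512, num_actions)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        x = self.conv(obs)
        x = torch.relu(self.dense(x))
        return self.logits(x)
```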
## Implementation details
--
## Results
TODO
## References
[1] Lasse Espeholt et al. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures”. In: Proceedings of the 35th International Conference on Machine Learning. Vol. 80. 2018, pp. 1407–1416. URL: [https://arxiv.org/abs/1802.01561](https://arxiv.org/abs/1802.01561)
[2] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533. ISSN: 1476-4687. DOI: 10.1038/nature14236. URL: [https://www.nature.com/articles/nature14236](https://www.nature.com/articles/nature14236)
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
# Stock tree observation experiments with fc-models
TLDR: Experiments with PPO and APEX-DQN on the stock tree-obs.
We use fc-networks to represent the policy and value function.
## Methods
We evaluate the performance of the stock tree observation on small_v0 environments (https://flatlandrl-docs.aicrowd.com/04_specifications.html#id11).
To this end, we compare PPO [1] (policy gradient) and APEX-DQN [2] (value-based), two well-established RL methods.
All agents share the same centralized policy/value-function network, represented by a fully connected network
(two hidden layers, 256 units each, ReLU activations).
Tested modifications (only on APEX-DQN):
- **Deeper network**: To test the impact of deeper networks, we ran experiments with
three hidden layers (256 units each, ReLU activations).
- **Invalid action masking**: At each time step, an agent has to choose between 5 actions (stop, left, forward, right, noop).
However, of forward, left, and right, only one or two actions are ever valid.
Choosing an invalid action results in a noop action.
Additionally, the noop action is redundant, since its effect can always be achieved through one of the other actions.
To address this, we evaluated the impact of masking all invalid actions by setting their logits to large negative values (see the masking sketch after this list).
- **Skip no-choice cells**: Agents on straight cells can only stop or move forward.
Stopping only makes sense right in front of an intersection (trains stop automatically if blocked by another train).
We therefore test skipping all agents on straight cells that are not adjacent to an intersection and performing forward actions for them.
- **Deadlock reward**: Preventing deadlocks, where two trains block each other, is a fundamental problem in Flatland.
In our experiments, we regularly see RL agents fail to avoid these situations, and when they do avoid them, it seems to be only by chance.
To ease the deadlock problem, we tested a modified reward function that includes a deadlock reward to penalize agents when they run
into a deadlock (see the reward-shaping sketch after this list).
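
A minimal sketch of the masking step referenced above, assuming the per-agent validity mask is computed elsewhere (e.g. from the rail transitions); the function name and tensor shapes are illustrative, not the exact implementation:

```python
import torch


def mask_invalid_actions(logits: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """Assign a large negative logit to invalid actions so they receive
    (numerically) zero probability after the softmax.

    logits:     [batch, num_actions] raw policy outputs
    valid_mask: [batch, num_actions] boolean, True where an action is valid
    """
    return logits.masked_fill(~valid_mask, -1e9)
```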
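
Similarly, a sketch of the deadlock reward as a shaping term added on top of the stock per-agent rewards; the penalty value and the deadlock-detection signal are placeholders, not the values used in the experiments:

```python
# Assumed penalty value, for illustration only.
DEADLOCK_PENALTY = -5.0


def shape_rewards(rewards: dict, deadlocked: dict) -> dict:
    """Add a deadlock penalty to the stock per-agent rewards.

    rewards:    {agent_handle: reward} as returned by env.step()
    deadlocked: {agent_handle: bool}, True for agents detected in a deadlock
                (how deadlocks are detected is left out of this sketch)
    """
    return {
        handle: r + (DEADLOCK_PENALTY if deadlocked.get(handle, False) else 0.0)
        for handle, r in rewards.items()
    }
```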
## Implementation details
- We used the built-in RLlib fully connected network
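
A sketch of the corresponding model section of the RLlib config (two hidden layers of 256 units with ReLU, as described above); all other config entries are omitted and the exact values used in the experiments may differ:

```python
# Model section of an RLlib trainer config; only the keys relevant to the
# fully connected network are shown here.
config = {
    "model": {
        "fcnet_hiddens": [256, 256],
        "fcnet_activation": "relu",
    },
}
```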
## Results
Coming soon
## References
[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[2] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., & Silver, D. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.