Commit 33f9587b authored by metataro

tree_obs and global_obs experiments documentation

parent d6abdd52
# Stock tree observation experiments with fc-models
TLDR: Experiments with PPO and APEX-DQN on the stock tree-obs.
We use fully connected networks to represent the policy and value function.
## Methods
We evaluate the performance of the stock tree observation (https://flatlandrl-docs.aicrowd.com/04_specifications.html#id11) on small_v0 environments.
We compare PPO [1] (policy gradient) and APEX-DQN [2] (value based), two well-established RL methods.
All agents share the same centralized policy/value function network, represented by a fully connected network
(two hidden layers, 256 units each, ReLU activations).
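For illustration, a minimal tf.keras sketch of an equivalent network (the experiments use RLlib's built-in fully connected model, configured via `fcnet_hiddens: [256, 256]` in the YAML configs; this is not the repository's model code):
```
import tensorflow as tf

def build_fc_policy_value_net(obs_size: int, num_actions: int) -> tf.keras.Model:
    """Sketch of the shared torso: two hidden layers with 256 ReLU units each,
    followed by a policy-logits head and a value head."""
    obs = tf.keras.Input(shape=(obs_size,), name="obs")
    x = tf.keras.layers.Dense(256, activation="relu")(obs)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    logits = tf.keras.layers.Dense(num_actions, name="policy_logits")(x)  # one logit per action
    value = tf.keras.layers.Dense(1, name="value")(x)  # state-value estimate
    return tf.keras.Model(inputs=obs, outputs=[logits, value])
```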
Tested modifications (only on APEX-DQN):
- **Deeper network**: To test the impact of deeper networks, we ran experiments with
three hidden layers (256 units each, ReLU activations).
- **Invalid action masking**: At each time step, an agent has to choose between 5 actions (stop, left, forward, right, noop).
However, out of forward, left, and right, only one or two actions are ever valid.
Choosing an invalid action results in a noop.
Additionally, the noop action is redundant, since its effect can always be achieved through one of the other actions.
To address this, we evaluated the impact of masking all invalid actions by setting their logits to large negative values.
- **Skip no-choice cells**: Agents on straight track can only stop or move forward.
Stopping only makes sense right in front of an intersection (trains stop automatically if blocked by another train).
We therefore test skipping all agents on straight cells that are not adjacent to an intersection and performing the forward action for them.
- **Deadlock reward**: Preventing deadlocks, where two trains block each other, is a fundamental problem in Flatland.
In our experiments, we regularly see RL agents fail to avoid these situations, and when they do avoid them, it seems to be only by chance.
To ease the deadlock problem, we tested a modified reward function that includes a deadlock reward to penalize agents when they run
into a deadlock.
## Implementation details
- We used the built-in RLlib fully connected network model.
## Results
Coming soon
## References
[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[2] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., & Silver, D. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
# Global observation experiments
TLDR: Experiments with PPO on the stock global observation.
We compare two CNN policy architectures: Nature-CNN and IMPALA-CNN.
## Method
In these experiments, we compare the performance of two established CNN architectures on the stock global observation
(https://flatlandrl-docs.aicrowd.com/04_specifications.html#id11).
In the first case, agents are based on the Nature-CNN architecture [2], which consists of three convolutional layers followed by a dense layer.
In the second case, agents are based on the IMPALA-CNN network [1], which consists of a 15-layer residual network
followed by a dense layer.
We employ PPO [3], a policy gradient algorithm.
All agents share the same centralized policy/value function network.
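For reference, the Nature-CNN feature extractor consists of three convolutional layers followed by a dense layer [2]. A minimal tf.keras sketch (not the repository's `global_obs_model`; kernel sizes and strides follow [2], but same-padding is used here so the stack also fits the 32x32 observations used in the configs below):
```
import tensorflow as tf

def build_nature_cnn(height: int, width: int, n_channels: int) -> tf.keras.Model:
    """Sketch of the Nature-CNN feature extractor: three conv layers + one dense layer."""
    obs = tf.keras.Input(shape=(height, width, n_channels), name="global_obs")
    x = tf.keras.layers.Conv2D(32, kernel_size=8, strides=4, padding="same", activation="relu")(obs)
    x = tf.keras.layers.Conv2D(64, kernel_size=4, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, kernel_size=3, strides=1, padding="same", activation="relu")(x)
    x = tf.keras.layers.Flatten()(x)
    features = tf.keras.layers.Dense(512, activation="relu")(x)  # dense layer on top of the conv stack
    return tf.keras.Model(inputs=obs, outputs=features)
```
The IMPALA-CNN variant replaces this plain convolutional stack with the deeper 15-layer residual architecture of [1].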
## Implementation details
- Categorical feature layers (e.g. direction) are one-hot encoded.
- Feature layers containing counts (e.g. #trains ready to depart, #malfunction steps) are log-transformed
(for convenience we set log(0) = 0). Both transforms are sketched below.
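A minimal NumPy sketch of these two transforms (layer names and shapes are illustrative; the actual preprocessing lives in the repository's observation code):
```
import numpy as np

def one_hot_layer(categorical_layer, num_categories):
    """One-hot encode an (H, W) integer feature layer into (H, W, num_categories).
    Assumes category indices lie in [0, num_categories)."""
    return np.eye(num_categories, dtype=np.float32)[categorical_layer]

def log_transform_counts(count_layer):
    """Log-transform an (H, W) count layer, with log(0) := 0 for convenience."""
    counts = count_layer.astype(np.float32)
    out = np.zeros_like(counts)
    positive = counts > 0
    out[positive] = np.log(counts[positive])
    return out
```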
## Run the experiments
```
# impala_cnn
python train.py --config-file experiments/flatland_sparse/global_obs_impala_cnn/[ppo|apex].yaml
# nature_cnn
python train.py --config-file experiments/flatland_sparse/global_obs_nature_cnn/[ppo|apex].yaml
```
## Results
Coming soon
## References
[1] Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., et al. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 1407–1416. URL: [https://arxiv.org/abs/1802.01561](https://arxiv.org/abs/1802.01561)
[2] Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. doi: 10.1038/nature14236. URL: [https://www.nature.com/articles/nature14236](https://www.nature.com/articles/nature14236)
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
flatland-random-sparse-small-global-impala-cnn-apex:
run: APEX
env: flatland_sparse
stop:
@@ -27,7 +27,7 @@ flatland-random-sparse-small-tree-fc-ppo:
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "global_obs", "apex"] # TODO should be set programmatically
tags: ["small_v0", "global_obs", "apex", "impala_cnn"] # TODO should be set programmatically
model:
custom_model: global_obs_model
flatland-random-sparse-small-global-impala-cnn-ppo:
run: PPO
env: flatland_sparse
stop:
@@ -19,7 +19,7 @@ flatland-sparse-global-conv-ppo:
sgd_minibatch_size: 100 # 500
num_sgd_iter: 10
num_workers: 2
num_envs_per_worker: 5
batch_mode: truncate_episodes
observation_filter: NoFilter
vf_share_layers: True
@@ -38,7 +38,7 @@ flatland-sparse-global-conv-ppo:
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "global_obs"] # TODO should be set programmatically
tags: ["small_v0", "global_obs", "ppo", "impala_cnn"] # TODO should be set programmatically
model:
custom_model: global_obs_model
flatland-random-sparse-small-global-nature-cnn-apex:
run: APEX
env: flatland_sparse
stop:
timesteps_total: 100000000 # 1e8
checkpoint_freq: 10
checkpoint_at_end: True
keep_checkpoints_num: 5
checkpoint_score_attr: episode_reward_mean
config:
num_workers: 15
num_envs_per_worker: 5
num_gpus: 1
hiddens: []
dueling: False
env_config:
observation: global
observation_config:
max_width: 32
max_height: 32
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "global_obs", "apex", "nature_cnn"] # TODO should be set programmatically
model:
custom_model: global_obs_model
custom_options:
architecture: nature
flatland-random-sparse-small-global-nature-cnn-ppo:
run: PPO
env: flatland_sparse
stop:
timesteps_total: 10000000 # 1e7
checkpoint_freq: 10
checkpoint_at_end: True
keep_checkpoints_num: 5
checkpoint_score_attr: episode_reward_mean
config:
clip_rewards: True
clip_param: 0.1
vf_clip_param: 500.0
entropy_coeff: 0.01
# effective batch_size: train_batch_size * num_agents_in_each_environment [5, 10]
# see https://github.com/ray-project/ray/issues/4628
train_batch_size: 1000 # 5000
rollout_fragment_length: 50 # 100
sgd_minibatch_size: 100 # 500
num_sgd_iter: 10
num_workers: 2
num_envs_per_worker: 5
batch_mode: truncate_episodes
observation_filter: NoFilter
vf_share_layers: True
vf_loss_coeff: 0.5
num_gpus: 0
env_config:
observation: global
observation_config:
max_width: 32
max_height: 32
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "global_obs", "ppo", "nature_cnn"] # TODO should be set programmatically
model:
custom_model: global_obs_model
custom_options:
architecture: nature
# Stock tree observation experiments
TLDR: Experiments with PPO and APEX-DQN on the stock tree-obs, using fully connected networks to represent the policy and value function.
## Methods
We evaluate the performance of the stock tree observation (https://flatlandrl-docs.aicrowd.com/04_specifications.html#id11)
on small_v0 environments. We compare PPO [1] (policy gradient) and APEX-DQN [2] (value based), two well-established
RL methods. All agents share the same centralized policy/value function network, represented by a fully
connected network (two hidden layers, 256 units each, ReLU activations).
### Invalid action masking
At each time step, an agent has to choose between 5 actions (stop, left, forward, right, noop). However, out of
forward, left, and right, only one or two actions are ever valid. Choosing an invalid action results in a noop.
Additionally, the noop action is redundant, since its effect can always be achieved through one of the other actions.
To address this, we evaluated the impact of masking all invalid actions.
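A minimal sketch of the masking step (assuming a per-agent binary availability mask, e.g. as exposed via `available_actions_obs` in the configs below; see also the implementation details):
```
import numpy as np

def mask_invalid_action_logits(logits, available_actions):
    """Set the logits of unavailable actions to a large negative value so that
    their softmax probability becomes (numerically) zero.

    logits:            (num_actions,) raw policy logits
    available_actions: (num_actions,) binary mask, 1 = action currently valid
    """
    return np.where(available_actions > 0, logits, -1e8)
```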
### Skip no-choice cells
Agents on straight track can only stop or move forward.
Stopping only makes sense right in front of an intersection (trains stop automatically if blocked by another train).
We therefore test skipping all agents on straight cells that are not adjacent to an intersection and performing the
forward action for them.
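A minimal sketch of the idea (not the repository's implementation, which is enabled via `skip_no_choice_cells` in the configs below; `on_decision_cell` is a hypothetical per-agent flag):
```
# Flatland's "move forward" action id (RailEnvActions.MOVE_FORWARD).
MOVE_FORWARD = 2

def fill_no_choice_actions(policy_actions, on_decision_cell):
    """Combine policy actions (for agents that face a real choice) with
    automatic forward moves for agents on no-choice straight cells.

    on_decision_cell is a hypothetical dict: agent_id -> True if the agent
    currently needs a real decision from the policy.
    """
    actions = {}
    for agent_id, needs_decision in on_decision_cell.items():
        if needs_decision:
            actions[agent_id] = policy_actions[agent_id]  # let the learned policy decide
        else:
            actions[agent_id] = MOVE_FORWARD  # no real choice: just keep moving forward
    return actions
```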
### Deadlock reward
Preventing deadlocks, where two trains block each other, is a fundamental problem in Flatland.
In our experiments, we regularly see RL agents fail to avoid these situations. To ease the deadlock problem,
we tested a modified reward function that includes a deadlock reward to penalize agents when they run into a deadlock.
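A minimal sketch of the modified per-agent reward (the bookkeeping that detects when an agent first enters a deadlock is omitted; the configs below use `deadlock_reward: -100`):
```
def shaped_reward(env_reward, newly_deadlocked, deadlock_reward=-100.0):
    """Add a one-time deadlock penalty to the environment reward when the
    agent is first detected to be in a deadlock."""
    return env_reward + (deadlock_reward if newly_deadlocked else 0.0)
```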
## Implementation details
- We used the built-in fully connected network model of RLlib.
- Invalid actions are masked by setting their logits to large negative values.
## Run the experiments
```
# fc_net
python train.py --config-file experiments/flatland_sparse/tree_obs_fc_net/[ppo|apex].yaml
# fc_net + action_mask
python train.py --config-file experiments/flatland_sparse/tree_obs_fc_net_action_mask/[ppo|apex].yaml
# fc_net + skip_no_choice_cells
python train.py --config-file experiments/flatland_sparse/tree_obs_fc_net_skip_no_choice/[ppo|apex].yaml
# fc_net + deadlock_reward
python train.py --config-file experiments/flatland_sparse/tree_obs_fc_net_dlock_reward/[ppo|apex].yaml
```
## Results
Coming soon
## References
[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[2] Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., & Silver, D. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.
flatland-sparse-small-action-mask-tree-fc-apex:
run: APEX
env: flatland_sparse
stop:
timesteps_total: 100000000 # 1e8
checkpoint_freq: 10
checkpoint_at_end: True
# keep_checkpoints_num: 5
checkpoint_score_attr: episode_reward_mean
config:
num_workers: 15
num_envs_per_worker: 5
num_gpus: 1
hiddens: []
dueling: False
env_config:
available_actions_obs: True
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "new_tree_obs", "apex", "action_mask"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False
flatland-sparse-small-action-mask-tree-fc-ppo:
run: PPO
env: flatland_sparse
stop:
timesteps_total: 10000000 # 1e7
checkpoint_freq: 10
checkpoint_at_end: True
checkpoint_score_attr: episode_reward_mean
config:
clip_rewards: False
# clip_param: 0.1
vf_clip_param: 500.0
entropy_coeff: 0.01
# effective batch_size: train_batch_size * num_agents_in_each_environment
# see https://github.com/ray-project/ray/issues/4628
train_batch_size: 1000 # 5000
rollout_fragment_length: 50 # 100
sgd_minibatch_size: 100 # 500
num_sgd_iter: 10
num_workers: 15
num_envs_per_worker: 5
batch_mode: truncate_episodes
observation_filter: NoFilter
vf_share_layers: True
vf_loss_coeff: 0.05
num_gpus: 0
env_config:
available_actions_obs: True
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "new_tree_obs", "ppo", "action_mask"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False
flatland-sparse-small-sparse-reward-tree-fc-apex:
run: APEX
env: flatland_sparse
stop:
timesteps_total: 100000000 # 1e8
checkpoint_freq: 10
checkpoint_at_end: True
# keep_checkpoints_num: 5
checkpoint_score_attr: episode_reward_mean
config:
num_workers: 15
num_envs_per_worker: 5
num_gpus: 0
env_config:
deadlock_reward: -100
skip_no_choice_cells: False
available_actions_obs: False
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "tree_obs", "apex", "deadlock_reward"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False
flatland-sparse-small-sparse-reward-tree-fc-ppo:
run: PPO
env: flatland_sparse
stop:
timesteps_total: 10000000 # 1e7
checkpoint_freq: 10
checkpoint_at_end: True
checkpoint_score_attr: episode_reward_mean
config:
clip_rewards: False
# clip_param: 0.1
# vf_clip_param: 500.0
vf_clip_param: 10.0
entropy_coeff: 0.01
# effective batch_size: train_batch_size * num_agents_in_each_environment
# see https://github.com/ray-project/ray/issues/4628
train_batch_size: 1000 # 5000
rollout_fragment_length: 50 # 100
sgd_minibatch_size: 100 # 500
num_sgd_iter: 10
num_workers: 15
num_envs_per_worker: 5
batch_mode: truncate_episodes
observation_filter: NoFilter
vf_share_layers: True
vf_loss_coeff: 0.05
num_gpus: 0
env_config:
deadlock_reward: -100
skip_no_choice_cells: False
available_actions_obs: False
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "tree_obs", "ppo", "deadlock_reward"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False
flatland-sparse-small-skip-no-choice-tree-fc-apex:
run: APEX
env: flatland_sparse
stop:
timesteps_total: 100000000 # 1e8
checkpoint_freq: 10
checkpoint_at_end: True
# keep_checkpoints_num: 5
checkpoint_score_attr: episode_reward_mean
config:
num_workers: 15
num_envs_per_worker: 5
num_gpus: 1
hiddens: []
dueling: False
env_config:
skip_no_choice_cells: True
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "new_tree_obs", "apex", "skip_no_choice_cells"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False
flatland-sparse-small-skip-no-choice-tree-fc-ppo:
run: PPO
env: flatland_sparse
stop:
timesteps_total: 10000000 # 1e7
checkpoint_freq: 10
checkpoint_at_end: True
checkpoint_score_attr: episode_reward_mean
config:
clip_rewards: False
# clip_param: 0.1
vf_clip_param: 500.0
entropy_coeff: 0.01
# effective batch_size: train_batch_size * num_agents_in_each_environment
# see https://github.com/ray-project/ray/issues/4628
train_batch_size: 1000 # 5000
rollout_fragment_length: 50 # 100
sgd_minibatch_size: 100 # 500
num_sgd_iter: 10
num_workers: 15
num_envs_per_worker: 5
batch_mode: truncate_episodes
observation_filter: NoFilter
vf_share_layers: True
vf_loss_coeff: 0.05
num_gpus: 0
env_config:
skip_no_choice_cells: True
observation: tree
observation_config:
max_depth: 2
shortest_path_max_depth: 30
generator: sparse_rail_generator
generator_config: small_v0
wandb:
project: flatland
entity: masterscrat
tags: ["small_v0", "new_tree_obs", "ppo", "skip_no_choice_cells"] # TODO should be set programmatically
model:
fcnet_activation: relu
fcnet_hiddens: [256, 256]
vf_share_layers: True # False