Commit 7f08312e authored by MasterScrat

Merge branch 'apex-baselines' into 'master'

Action masking and skipping baselines

See merge request !12
parents 00b5d4b5 f8d3ea39
# Frame skipping and action masking
```{admonition} TL;DR
We tested how skipping states in which agents do not need to choose an action, and masking invalid actions, affect the performance of RL agents.
```
### 💡 The idea
In these experiments we looked at two modifications with the goal of making learning easier for RL agents:
- **Skipping "no-choice" cells**: Between intersections, agents can only move forward or stop. Since trains automatically stop and wait when blocked by another train until the path is free again, manually stopping only makes sense right before an intersection, where an agent might need to let other trains pass first in order to prevent deadlocks. To capitalize on this, we skip all cells where the agent is neither on nor next to an intersection cell (a sketch of the corresponding decision-cell test is shown at the end of this section). This should help agents in the sense that they no longer need to learn to keep moving forward between intersections. Skipping steps also shortens the perceived episode length.
- **Action masking**: The Flatland action space consists of 5 discrete actions (noop, forward, stop, left, right). However, at any given timestep only a subset of these actions is available. For example, left and right are mutually exclusive, and both are invalid on straight cells. Furthermore, the noop action is not required and can be removed completely. We test masking out all invalid actions as well as noop so that agents can focus on relevant actions only.
We implemented and evaluated skipping "no-choice" cells for both DQN APEX [1] and PPO agents [2]. Currently, action masking is only implemented for PPO agents. All experiments are performed on `small_v0` environments with the stock tree observations.
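The test that decides whether a step is surfaced to the agent is shown in the `SkipNoChoiceCellsWrapper` diff at the bottom of this merge request; roughly, it looks like the following sketch (adapted for illustration, not the verbatim implementation):
```python
# Sketch of the decision-cell test used by the skipping wrapper (adapted from
# the SkipNoChoiceCellsWrapper diff below; names mirror that code). A step is
# only surfaced to the agent when it has not been placed on the grid yet, is
# still at its initial position, or sits on / next to a switch ("decision cell").
def on_decision_cell(agent, decision_cells) -> bool:
    return (agent.position is None                        # not placed yet
            or agent.position == agent.initial_position   # still at the start cell
            or agent.position in decision_cells)          # on or next to a switch
```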
### 🗂️ Files and usage
All configuration files to run these experiments can be found in `baselines/action_masking_and_skipping`:
```shell script
# APEX:
python ./train.py -f baselines/action_masking_and_skipping/apex_tree_obs_small_v0.yaml
# APEX + SkipNoChoice
python ./train.py -f baselines/action_masking_and_skipping/apex_tree_obs_small_v0_skip.yaml
# PPO
python ./train.py -f baselines/action_masking_and_skipping/ppo_tree_obs_small_v0.yaml
# PPO + SkipNoChoice
python ./train.py -f baselines/action_masking_and_skipping/ppo_tree_obs_small_v0_skip.yaml
# PPO + ActionMasking
python ./train.py -f baselines/action_masking_and_skipping/ppo_tree_obs_small_v0_mask.yaml
```
#### Skipping "no-choice" cells
Skipping "no-choice" cells is implemented in the gym_env wrapper [SkipNoChoiceCellsWrapper](https://gitlab.aicrowd.com/flatland/neurips2020-flatland-baselines/blob/master/envs/flatland/utils/gym_env_wrappers.py#L90) and can be controlled with the following environment configurations:
- `skip_no_choice_cells` (bool): Setting this to `True` will activate the skipping of "no-choice" cells.
- `accumulate_skipped_rewards` (bool): Setting this to `True` will accumulate all skipped reward.
- `discounting` (float): Discount factor when summing the rewards. This should be set to the same value as the environment's discount factor γ.
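For orientation, the sketch below shows roughly how these options are forwarded to the wrapper inside `FlatlandSparse` (mirroring the diff at the bottom of this merge request; `env` and `env_config` are stand-ins for the already-built Flatland gym environment and its configuration dict):
```python
# Minimal sketch (not the full FlatlandSparse builder): forwarding the
# env_config options above to the skipping wrapper.
from envs.flatland.utils.gym_env_wrappers import SkipNoChoiceCellsWrapper

if env_config.get('skip_no_choice_cells', False):
    env = SkipNoChoiceCellsWrapper(
        env,
        env_config.get('accumulate_skipped_rewards', False),
        discounting=env_config.get('discounting', 1.),  # keep equal to gamma
    )
```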
#### Action masking
Action masking requires an `available_actions` observation that encodes which actions are currently available. This is implemented in the gym_env wrapper [AvailableActionsWrapper](https://gitlab.aicrowd.com/flatland/neurips2020-flatland-baselines/blob/master/envs/flatland/utils/gym_env_wrappers.py#L35) and can be controlled with the following environment configuration options:
- `available_actions_obs` (bool): Setting this to `True` activates the `available_actions` observation.
- `allow_noop` (bool): If set to `False`, the `noop` action is always marked as invalid, effectively removing it from the action space.
Action masking is implemented for the `fully_connected_model`, `global_dens_obs_model` and `global_obs_model` custom models and can be controlled with the following custom model option (a combined sketch follows the list):
- `mask_unavailable_actions` (bool): Setting this to `True` activates the action masking.
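Putting the two pieces together, the sketch below loosely mirrors the `FlatlandSparse` diff and the `*_mask.yaml` config further down; `env`, `env_config` and `model_config` are illustrative stand-ins, not repository code:
```python
# Minimal sketch: enable the available_actions observation without noop ...
from envs.flatland.utils.gym_env_wrappers import AvailableActionsWrapper

if env_config.get('available_actions_obs', False):
    env = AvailableActionsWrapper(env, env_config.get('allow_noop', True))

# ... and tell the custom model to mask the logits of unavailable actions
# (same options as in the ppo_tree_obs_small_v0_mask.yaml config below).
model_config = {
    "custom_model": "fully_connected_model",
    "custom_options": {
        "layers": [256, 256],
        "activation": "relu",
        "layer_norm": False,
        "mask_unavailable_actions": True,
    },
}
```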
### 📦 Implementation Details
- Available actions are encoded as 0/1 vectors with one element per action; a 1 indicates that the action with the corresponding index is available.
- Invalid actions are masked by setting their logits to large negative values, effectively setting their softmax probabilities to zero (see the sketch after this list).
- For the PPO experiments we do **not** share weights between the policy and the value function.
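A standalone illustration of the masking step (not the actual model code; the action ordering and numbers are made up, and `available_actions` plays the role of the 0/1 vector produced by the `AvailableActionsWrapper`):
```python
import tensorflow as tf

# 0/1 availability vector: forward, stop, left available; right unavailable
# (noop already removed via allow_noop=False). Ordering here is illustrative.
available_actions = tf.constant([1., 1., 1., 0.])
logits = tf.constant([0.3, -0.1, 0.7, 1.2])  # raw logits from the policy network

# log(0) = -inf is clipped to the most negative float32, so masked actions end
# up with a (numerically) zero softmax probability.
inf_mask = tf.maximum(tf.math.log(available_actions), tf.float32.min)
masked_probs = tf.nn.softmax(logits + inf_mask)
print(masked_probs.numpy())  # probability of the last (masked) action is ~0
```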
### 📈 Results
![](training_performance_overview.png)
W&B report: [https://app.wandb.ai/masterscrat/flatland/reports/Skipping-%22no-choice%22-cells-and-action-masking-expriments--VmlldzoxNTY5NDM](https://app.wandb.ai/masterscrat/flatland/reports/Skipping-%22no-choice%22-cells-and-action-masking-expriments--VmlldzoxNTY5NDM)
- Vanilla APEX agents learn fastest with respect to the number of samples.
- APEX+SkipNoChoice agents require more frames to learn compared to vanilla APEX but converge to similarly strong policies.
- For PPO, both SkipNoChoice and ActionMasking improved learning.
- APEX, APEX+SkipNoChoice and PPO+SkipNoChoice converge to similarly strong policies after 15M frames. PPO and PPO+ActionMasking end up with slightly worse performance. It is not clear whether these agents would profit from additional frames. Policy-gradient methods like PPO are generally less sample-efficient than DQN-based algorithms like APEX.
- For PPO agents, skipping leads to more diverse batches, while APEX agents already have a large replay buffer that ensures diverse update samples. The semantics of the skipping "no-choice" cells modification therefore differ somewhat between the two algorithms.
### 🔗 References
[1] Horgan, D. et al. (2018). Distributed prioritized experience replay. [arXiv:1803.00933](https://arxiv.org/abs/1803.00933).
[2] Schulman, J. et al. (2017). Proximal policy optimization algorithms. [arXiv:1707.06347](https://arxiv.org/abs/1707.06347).
### 🌟 Credits
[Christian Scheller](https://scholar.google.com/citations?user=LD2JqfMAAAAJ)
[Florian Laurent](https://twitter.com/MasterScrat)
[Adrian Egli](https://twitter.com/ae_sbb)
```yaml
apex-tree-obs-small-v0:
    run: APEX
    env: flatland_sparse
    stop:
        timesteps_total: 15000000  # 1.5e7
    checkpoint_at_end: True
    checkpoint_score_attr: episode_reward_mean
    num_samples: 3
    config:
        num_workers: 15
        num_envs_per_worker: 5
        num_gpus: 0

        env_config:
            observation: tree
            observation_config:
                max_depth: 2
                shortest_path_max_depth: 30

            generator: sparse_rail_generator
            generator_config: small_v0

            wandb:
                project: flatland
                entity: masterscrat
                tags: ["small_v0", "tree_obs", "apex"]  # TODO should be set programmatically

        model:
            fcnet_activation: relu
            fcnet_hiddens: [256, 256]
            vf_share_layers: True  # False
```
```yaml
apex-tree-obs-small-v0-skip:
    run: APEX
    env: flatland_sparse
    stop:
        timesteps_total: 15000000  # 1.5e7
    checkpoint_at_end: True
    checkpoint_score_attr: episode_reward_mean
    num_samples: 3
    config:
        num_workers: 15
        num_envs_per_worker: 5
        num_gpus: 0
        gamma: 0.99

        env_config:
            observation: tree
            observation_config:
                max_depth: 2
                shortest_path_max_depth: 30

            generator: sparse_rail_generator
            generator_config: small_v0

            skip_no_choice_cells: True
            accumulate_skipped_rewards: True
            discounting: 0.99  # TODO set automatically, should be equal to gamma

            wandb:
                project: flatland
                entity: masterscrat
                tags: ["small_v0", "tree_obs", "apex", "skip"]  # TODO should be set programmatically

        model:
            fcnet_activation: relu
            fcnet_hiddens: [256, 256]
            vf_share_layers: True  # False
```
```yaml
ppo-tree-obs-small-v0:
    run: PPO
    env: flatland_sparse
    stop:
        timesteps_total: 15000000  # 1.5e7
    checkpoint_at_end: True
    checkpoint_score_attr: episode_reward_mean
    num_samples: 3
    config:
        num_workers: 15
        num_envs_per_worker: 5
        num_gpus: 0

        clip_rewards: False
        vf_clip_param: 500.0
        entropy_coeff: 0.01
        # effective batch_size: train_batch_size * num_agents_in_each_environment [5, 10]
        # see https://github.com/ray-project/ray/issues/4628
        train_batch_size: 1000  # 5000
        rollout_fragment_length: 50  # 100
        sgd_minibatch_size: 100  # 500
        vf_share_layers: False

        env_config:
            observation: tree
            observation_config:
                max_depth: 2
                shortest_path_max_depth: 30

            generator: sparse_rail_generator
            generator_config: small_v0

            wandb:
                project: flatland
                entity: masterscrat
                tags: ["small_v0", "tree_obs", "ppo"]  # TODO should be set programmatically

        model:
            fcnet_activation: relu
            fcnet_hiddens: [256, 256]
            vf_share_layers: True  # False
```
```yaml
sparse-mask-ppo-tree-obs-small-v0:
    run: PPO
    env: flatland_sparse
    stop:
        timesteps_total: 15000000  # 1.5e7
    checkpoint_at_end: True
    checkpoint_score_attr: episode_reward_mean
    num_samples: 3
    config:
        num_workers: 15
        num_envs_per_worker: 5
        num_gpus: 0
        gamma: 0.99

        clip_rewards: False
        vf_clip_param: 500.0
        entropy_coeff: 0.01
        # effective batch_size: train_batch_size * num_agents_in_each_environment [5, 10]
        # see https://github.com/ray-project/ray/issues/4628
        train_batch_size: 1000  # 5000
        rollout_fragment_length: 50  # 100
        sgd_minibatch_size: 100  # 500
        vf_share_layers: False

        env_config:
            observation: tree
            observation_config:
                max_depth: 2
                shortest_path_max_depth: 30

            generator: sparse_rail_generator
            generator_config: small_v0

            available_actions_obs: True
            allow_noop: False

            wandb:
                project: flatland
                entity: masterscrat
                tags: ["small_v0", "tree_obs", "ppo", "mask"]  # TODO should be set programmatically

        model:
            custom_model: fully_connected_model
            custom_options:
                layers: [256, 256]
                activation: relu
                layer_norm: False
                mask_unavailable_actions: True
```
```yaml
ppo-tree-obs-small-v0-skip:
    run: PPO
    env: flatland_sparse
    stop:
        timesteps_total: 15000000  # 1.5e7
    checkpoint_at_end: True
    checkpoint_score_attr: episode_reward_mean
    num_samples: 3
    config:
        num_workers: 15
        num_envs_per_worker: 5
        num_gpus: 0
        gamma: 0.99

        clip_rewards: False
        vf_clip_param: 500.0
        entropy_coeff: 0.01
        # effective batch_size: train_batch_size * num_agents_in_each_environment [5, 10]
        # see https://github.com/ray-project/ray/issues/4628
        train_batch_size: 1000  # 5000
        rollout_fragment_length: 50  # 100
        sgd_minibatch_size: 100  # 500
        vf_share_layers: False

        env_config:
            observation: tree
            observation_config:
                max_depth: 2
                shortest_path_max_depth: 30

            generator: sparse_rail_generator
            generator_config: small_v0

            skip_no_choice_cells: True
            accumulate_skipped_rewards: True
            discounting: 0.99  # TODO set automatically, should be equal to gamma

            wandb:
                project: flatland
                entity: masterscrat
                tags: ["small_v0", "tree_obs", "ppo", "skip"]  # TODO should be set programmatically

        model:
            fcnet_activation: relu
            fcnet_hiddens: [256, 256]
            vf_share_layers: True  # False
```
Diff of `SkipNoChoiceCellsWrapper` in `envs/flatland/utils/gym_env_wrappers.py` (constructor and decision-cell test):
```diff
@@ -89,16 +89,19 @@ def find_all_cells_where_agent_can_choose(rail_env: RailEnv):
 class SkipNoChoiceCellsWrapper(gym.Wrapper):
-    def __init__(self, env, accumulate_skipped_rewards) -> None:
+    def __init__(self, env, accumulate_skipped_rewards: bool, discounting: float) -> None:
         super().__init__(env)
         self._switches = None
         self._switches_neighbors = None
         self._decision_cells = None
         self._accumulate_skipped_rewards = accumulate_skipped_rewards
-        self._skipped_rewards = defaultdict(float)
+        self._discounting = discounting
+        self._skipped_rewards = defaultdict(list)
 
     def _on_decision_cell(self, agent: EnvAgent):
-        return agent.position is None or agent.position in self._decision_cells
+        return agent.position is None \
+               or agent.position == agent.initial_position \
+               or agent.position in self._decision_cells
 
     def _on_switch(self, agent: EnvAgent):
         return agent.position in self._switches
```
Diff of the `step` method of `SkipNoChoiceCellsWrapper` (discounted accumulation of skipped rewards):
```diff
@@ -117,10 +120,13 @@ class SkipNoChoiceCellsWrapper(gym.Wrapper):
                     d[agent_id] = done[agent_id]
                     i[agent_id] = info[agent_id]
                     if self._accumulate_skipped_rewards:
-                        r[agent_id] += self._skipped_rewards[agent_id]
-                        self._skipped_rewards[agent_id] = 0.
+                        discounted_skipped_reward = r[agent_id]
+                        for skipped_reward in reversed(self._skipped_rewards[agent_id]):
+                            discounted_skipped_reward = self._discounting*discounted_skipped_reward + skipped_reward
+                        r[agent_id] = discounted_skipped_reward
+                        self._skipped_rewards[agent_id] = []
                 elif self._accumulate_skipped_rewards:
-                    self._skipped_rewards[agent_id] += reward[agent_id]
+                    self._skipped_rewards[agent_id].append(reward[agent_id])
             d['__all__'] = done['__all__']
             action_dict = {}
         return StepOutput(o, r, d, i)
```
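To make the accumulation above concrete: if rewards r1 and r2 were collected during skipped steps (oldest first) and r3 is the reward of the step that is finally surfaced, the agent receives r1 + γ·r2 + γ²·r3. A small standalone check with made-up numbers:
```python
# Standalone check of the discounted accumulation in the hunk above
# (made-up reward values; gamma matches the `discounting` config option).
gamma = 0.99
skipped = [-1.0, -1.0]   # rewards of the skipped steps, oldest first
current = -1.0           # reward of the surfaced step

acc = current
for r in reversed(skipped):
    acc = gamma * acc + r

expected = skipped[0] + gamma * skipped[1] + gamma ** 2 * current
assert abs(acc - expected) < 1e-9
print(acc)  # -2.9701
```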
Diff of the `FlatlandSparse` environment builder (wiring up the new wrapper options):
```diff
@@ -56,9 +56,10 @@ class FlatlandSparse(FlatlandBase):
             deadlock_reward = env_config.get('deadlock_reward', 0)
             self._env = DeadlockResolutionWrapper(self._env, deadlock_reward)
         if env_config.get('skip_no_choice_cells', False):
-            self._env = SkipNoChoiceCellsWrapper(self._env, env_config.get('accumulate_skipped_rewards', False))
+            self._env = SkipNoChoiceCellsWrapper(self._env, env_config.get('accumulate_skipped_rewards', False),
+                                                 discounting=env_config.get('discounting', 1.))
         if env_config.get('available_actions_obs', False):
-            self._env = AvailableActionsWrapper(self._env)
+            self._env = AvailableActionsWrapper(self._env, env_config.get('allow_noop', True))
 
     @property
     def observation_space(self) -> gym.spaces.Space:
```
Diff of `FullyConnectedModel` (separate value-function branch when `vf_share_layers` is `False`):
```diff
@@ -16,6 +16,7 @@ class FullyConnectedModel(TFModelV2):
         self._action_space = action_space
         self._options = model_config['custom_options']
         self._mask_unavailable_actions = self._options.get("mask_unavailable_actions", False)
+        self._vf_share_layers = model_config["vf_share_layers"]
 
         if self._mask_unavailable_actions:
             observations = tf.keras.layers.Input(shape=obs_space.original_space['obs'].shape)
@@ -26,6 +27,9 @@ class FullyConnectedModel(TFModelV2):
         fc_out = FullyConnected(layers=self._options['layers'], activation=activation,
                                 layer_norm=self._options['layer_norm'], activation_out=True)(observations)
         logits = tf.keras.layers.Dense(units=action_space.n)(fc_out)
+        if not self._vf_share_layers:
+            fc_out = FullyConnected(layers=self._options['layers'], activation=activation,
+                                    layer_norm=self._options['layer_norm'], activation_out=True)(observations)
         baseline = tf.keras.layers.Dense(units=1)(fc_out)
         self._model = tf.keras.Model(inputs=[observations], outputs=[logits, baseline])
         self.register_variables(self._model.variables)
```