Compare revisions

Changes are shown as if the source revision was being merged into the target revision.
Commits on Source (33), showing 276 additions and 78 deletions.

🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
===
🚂 This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
---
This starter kit contains 2 example policies to get started with this challenge:
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
For your own experiments you can use either the full or the reduced action space. The helper below maps actions from the reduced action space back to Flatland actions:
```python
def map_action(action):
    # if full action space is used -> no mapping required
    if get_action_size() == get_flatland_full_action_size():
        return action

    # if reduced action space is used -> the action has to be mapped to real flatland actions
    # The reduced action space removes the DO_NOTHING action from Flatland.
    if action == 0:
        return RailEnvActions.MOVE_LEFT
    if action == 1:
        return RailEnvActions.MOVE_FORWARD
    if action == 2:
        return RailEnvActions.MOVE_RIGHT
    if action == 3:
        return RailEnvActions.STOP_MOVING
```
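A short usage sketch for the mapping (not taken verbatim from the repo): the policy picks actions in the (possibly reduced) space, and `map_actions` from [utils/agent_action_config.py](./utils/agent_action_config.py) is assumed to apply `map_action` to every entry of the action dictionary before the environment is stepped. `env`, `obs`, `info` and `policy` are placeholders, and the `policy.act(handle, state, eps)` call follows this repo's policy interface.

```python
from utils.agent_action_config import map_actions

# Sketch only: `env` is a Flatland RailEnv, `policy` a trained policy,
# `obs`/`info` come from a previous env.reset()/env.step() call.
action_dict = {}
for agent_handle in env.get_agent_handles():
    if info['action_required'][agent_handle]:
        # The policy acts in the (possibly reduced) action space ...
        action_dict[agent_handle] = policy.act(agent_handle, obs[agent_handle], 0.0)

# ... and map_actions translates the whole dict back to Flatland actions.
obs, all_rewards, done, info = env.step(map_actions(action_dict))
```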
**🔗 [Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
Use

```python
set_action_size_full()
```

or

```python
set_action_size_reduced()
```

to select the action space. The reduced action space just removes DO_NOTHING.
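The two switches live in [utils/agent_action_config.py](./utils/agent_action_config.py). As a rough, hypothetical sketch of what such a toggle can look like (the actual implementation in the repo may differ):

```python
# Hypothetical sketch of an action-size toggle; not a copy of utils/agent_action_config.py.
_FLATLAND_FULL_ACTION_SIZE = 5  # DO_NOTHING, MOVE_LEFT, MOVE_FORWARD, MOVE_RIGHT, STOP_MOVING
_action_size = _FLATLAND_FULL_ACTION_SIZE


def set_action_size_full():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE


def set_action_size_reduced():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE - 1  # drop DO_NOTHING


def get_action_size():
    return _action_size


def get_flatland_full_action_size():
    return _FLATLAND_FULL_ACTION_SIZE
```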
**🔗 [Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
---
The policy is based on the FastTreeObs observation from the official starter kit (NeurIPS 2020 Flatland Challenge), but the FastTreeObs in this repo is an extended version:
[fast_tree_obs.py](./utils/fast_tree_obs.py)
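A hedged sketch of how the observation builder is wired up; `FastTreeObs(max_depth=...)` mirrors the `--max_depth` training parameter, and the commented lines follow the calls visible in the multi_agent_training.py diff further down:

```python
from utils.fast_tree_obs import FastTreeObs

# Assumption: FastTreeObs takes a max_depth argument, mirroring the --max_depth CLI flag.
tree_observation = FastTreeObs(max_depth=2)

# In multi_agent_training.py the builder is then handed to the env factory, roughly:
# train_env = create_rail_env(train_env_params, tree_observation)
# obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
```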
**🔗 [Submit a trained policy](https://flatland.aicrowd.com/getting-started/first-submission.html)**
---
Have a look at the [run.py](./run.py) file. There you can select whether PPO or DDDQN is used as the RL agent.
```python
####################################################
# EVALUATION PARAMETERS
set_action_size_full()

# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116591 adrian_egli
    # graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
    # Fri, 22 Jan 2021 23:37:56
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116658 adrian_egli
    # graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
    # Sat, 23 Jan 2021 07:41:35
    set_action_size_reduced()
    load_policy = "PPO"
    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
    EPSILON = 0.0

if True:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116659 adrian_egli
    # graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
    # Sat, 23 Jan 2021 07:45:49
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # !! This is not a RL solution !!!!
    # -------------------------------------------------------------------------------------------------------
    # 116727 adrian_egli
    # graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
    # Sat, 23 Jan 2021 14:31:50
    set_action_size_reduced()
    load_policy = "DeadLockAvoidance"
    checkpoint = None
    EPSILON = 0.0
```
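Only one of the `if` blocks above should be set to `True` at a time. Further down, run.py turns the chosen `load_policy` and `checkpoint` into a policy object; the sketch below condenses that selection chain (the `checkpoint is not None` guard is added here for clarity and is not necessarily present in the file):

```python
from argparse import Namespace

from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# state_size and local_env are created earlier in run.py (placeholders here).
if load_policy == "DDDQN":
    policy = DDDQNPolicy(state_size, get_action_size(),
                         Namespace(**{'use_gpu': False}), evaluation_mode=True)
elif load_policy == "PPO":
    policy = PPOPolicy(state_size, get_action_size())
elif load_policy == "DeadLockAvoidance":
    policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
else:
    policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False,
                       in_parameters=Namespace(**{'use_gpu': False}))

if checkpoint is not None:  # the DeadLockAvoidance heuristic runs without a checkpoint
    policy.load(checkpoint)
policy.reset(local_env)
```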
The single-agent example is meant as a minimal demonstration of how to use DQN. The multi-agent example is a better starting point for creating your own solution.
---
A deadlock-avoidance agent is implemented as well. It only lets each train take its shortest route and tries to avoid as many deadlocks as possible; a usage sketch follows the link below.
* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)
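A hedged sketch of driving the deadlock-avoidance agent during evaluation, following the `start_episode`/`start_step`/`act`/`end_step` pattern used in run.py; the exact observation format and `act` signature are assumptions:

```python
from utils.agent_action_config import get_action_size, map_actions
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# `local_env` is an already created Flatland RailEnv (placeholder here).
policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
policy.start_episode(train=False)

obs, info = local_env.reset()
done = {'__all__': False}
steps = 0
while not done['__all__']:
    policy.start_step(train=False)
    action_dict = {}
    for agent_handle in local_env.get_agent_handles():
        # Assumption: the agent only needs its handle and the current step as "observation".
        action_dict[agent_handle] = policy.act(agent_handle, [agent_handle, steps], 0.0)
    policy.end_step(train=False)
    obs, all_rewards, done, info = local_env.step(map_actions(action_dict))
    steps += 1
```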
You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)
Sample training usage
---
The policy interface has changed; please have a look at
* [policy.py](./reinforcement_learning/policy.py)
Train the multi-agent policy for 150 episodes:
```bash
python reinforcement_learning/multi_agent_training.py -n 150
```

---
See the tensorboard training output to get some insights:
```
tensorboard --logdir ./runs_bench
```
The multi-agent policy training can be tuned using command-line arguments:
---
```bash
python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000 \
    --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
```
[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py) has new and changed parameters. The most important ones for training are:

* policy : [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : Default value = DeadLockAvoidance
* use_fast_tree_observation : [false, true] : Default value = true
* action_size : [full, reduced] : Default value = full

```console
usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
                               ...
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
                               [--render] [--load_policy LOAD_POLICY]
                               [--use_fast_tree_observation]
                               [--max_depth MAX_DEPTH] [--policy POLICY]
                               [--action_size ACTION_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
                        ...
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render              render 1 episode in 100
  --load_policy LOAD_POLICY
                        policy filename (reference) to load
  --use_fast_tree_observation
                        use FastTreeObs instead of stock TreeObs
  --max_depth MAX_DEPTH
                        max depth
  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
                        DeadLockAvoidanceWithDecision, MultiDecision]
  --action_size ACTION_SIZE
                        define the action size [reduced,full]
```
[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)
[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)

---
If you have any questions, write to me on the official Discord channel **aiAdrian**
(Adrian Egli - adrian.egli@gmail.com)

Credits
---
* Florian Laurent <florian@aicrowd.com>
* Erik Nygren <erik.nygren@sbb.ch>
* Adrian Egli <adrian.egli@sbb.ch>
* Sharada Mohanty <mohanty@aicrowd.com>
* Christian Baumberger <christian.baumberger@sbb.ch>
* Guillaume Mollard <guillaume.mollard2@gmail.com>
Main links
---
* [Flatland documentation](https://flatland.aicrowd.com/)
* [NeurIPS 2020 Challenge](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)
* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)
Communication
---
......
......@@ -22,7 +22,8 @@ from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action, \
set_action_size_reduced, set_action_size_full, map_action_policy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
base_dir = Path(__file__).resolve().parent.parent
......@@ -169,20 +170,26 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
scores_window = deque(maxlen=checkpoint_interval) # todo smooth when rendering instead
completion_window = deque(maxlen=checkpoint_interval)
if train_params.action_size == "reduced":
set_action_size_reduced()
else:
set_action_size_full()
# Double Dueling DQN policy
policy = None
if False:
if train_params.policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
if True:
elif train_params.policy == "PPO":
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
if False:
policy = DeadLockAvoidanceAgent(train_env, get_action_size())
if False:
elif train_params.policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(train_env, get_action_size(), enable_eps=False)
elif train_params.policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), train_params)
policy = DeadLockAvoidanceWithDecisionAgent(train_env, state_size, get_action_size(), inter_policy)
if False:
elif train_params.policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), train_params)
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
# make sure that at least one policy is set
if policy is None:
......@@ -211,7 +218,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
hdd.free / (2 ** 30)))
# TensorBoard writer
writer = SummaryWriter()
writer = SummaryWriter(comment="_" + train_params.policy + "_" + train_params.action_size)
training_timer = Timer()
training_timer.start()
......@@ -235,8 +242,12 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Reset environment
reset_timer.start()
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
if train_params.n_agent_fixed:
number_of_agents = n_agents
train_env_params.n_agents = n_agents
else:
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
train_env = create_rail_env(train_env_params, tree_observation)
obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
......@@ -308,7 +319,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
learn_timer.start()
policy.step(agent_handle,
agent_prev_obs[agent_handle],
agent_prev_action[agent_handle] - 1,
map_action_policy(agent_prev_action[agent_handle]),
all_rewards[agent_handle],
agent_obs[agent_handle],
done[agent_handle])
......@@ -431,6 +442,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
writer.add_scalar("timer/learn", learn_timer.get(), episode_idx)
writer.add_scalar("timer/preproc", preproc_timer.get(), episode_idx)
writer.add_scalar("timer/total", training_timer.get_current(), episode_idx)
writer.flush()
def format_action_prob(action_probs):
......@@ -503,7 +515,8 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=12000, type=int)
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=5000, type=int)
parser.add_argument("--n_agent_fixed", help="hold the number of agent fixed", action='store_true')
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=1,
......@@ -531,6 +544,10 @@ if __name__ == "__main__":
parser.add_argument("--use_fast_tree_observation", help="use FastTreeObs instead of stock TreeObs",
action='store_true')
parser.add_argument("--max_depth", help="max depth", default=2, type=int)
parser.add_argument("--policy",
help="policy name [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision]",
default="DeadLockAvoidance")
parser.add_argument("--action_size", help="define the action size [reduced,full]", default="full", type=str)
training_params = parser.parse_args()
env_params = [
......
......@@ -34,7 +34,7 @@ class EpisodeBuffers:
class ActorCriticModel(nn.Module):
def __init__(self, state_size, action_size, device, hidsize1=128, hidsize2=128):
def __init__(self, state_size, action_size, device, hidsize1=512, hidsize2=256):
super(ActorCriticModel, self).__init__()
self.device = device
self.actor = nn.Sequential(
......@@ -85,7 +85,7 @@ class ActorCriticModel(nn.Module):
return obj
def load(self, filename):
print("load policy from file", filename)
print("load model from file", filename)
self.actor = self._load(self.actor, filename + ".actor")
self.critic = self._load(self.critic, filename + ".value")
......@@ -284,6 +284,8 @@ class PPOPolicy(LearningPolicy):
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
else:
print(" >> file not found!")
return obj
def load(self, filename):
......
......@@ -3,6 +3,7 @@ from collections import namedtuple
import gym
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
......@@ -36,6 +37,9 @@ def cartpole(use_dddqn=False):
episode = 0
checkpoint_interval = 20
scores_window = deque(maxlen=100)
writer = SummaryWriter()
while True:
episode += 1
state = env.reset()
......@@ -79,6 +83,10 @@ def cartpole(use_dddqn=False):
policy.memory)),
end=" ")
writer.add_scalar("CartPole/value", tot_reward, episode)
writer.add_scalar("CartPole/smoothed_value", np.mean(scores_window), episode)
writer.flush()
if __name__ == "__main__":
cartpole()
'''
I did experiments in an early submission. Please note that the epsilon can have an
effect on the evaluation outcome:
DDDQNPolicy experiments - EPSILON impact analysis
----------------------------------------------------------------------------------------
checkpoint = "./checkpoints/201124171810-7800.pth" # Training on AGENTS=10 with Depth=2
......@@ -25,14 +27,17 @@ from pathlib import Path
import numpy as np
from flatland.core.env_observation_builder import DummyObservationBuilder
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.predictions import ShortestPathPredictorForRailEnv
from flatland.evaluators.client import FlatlandRemoteClient
from flatland.evaluators.client import TimeoutException
from reinforcement_learning.ppo_agent import PPOPolicy
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from utils.agent_action_config import get_action_size, map_actions
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size, map_actions, set_action_size_reduced, set_action_size_full
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
from utils.deadlock_check import check_if_all_blocked
from utils.fast_tree_obs import FastTreeObs
......@@ -41,29 +46,68 @@ from utils.observation_utils import normalize_observation
base_dir = Path(__file__).resolve().parent.parent
sys.path.append(str(base_dir))
from reinforcement_learning.dddqn_policy import DDDQNPolicy
####################################################
# EVALUATION PARAMETERS
set_action_size_full()
# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True
USE_PPO_AGENT = True
# Checkpoint to use (remember to push it!)
checkpoint = "./checkpoints/201219090514-8600.pth" #
# checkpoint = "./checkpoints/201215212134-12000.pth" #
checkpoint = "./checkpoints/201220171629-12000.pth" # DDDQN - EPSILON: 0.0 - 13.940848323912533
checkpoint = "./checkpoints/201220203236-12000.pth" # PPO - EPSILON: 0.0 - 13.660942453931114
checkpoint = "./checkpoints/201220214325-12000.pth" # PPO - EPSILON: 0.0 - 13.463600936043
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116591 adrian_egli
# graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
# Fri, 22 Jan 2021 23:37:56
set_action_size_reduced()
load_policy = "DDDQN"
checkpoint = "./checkpoints/210122120236-3000.pth" # 17.011131341978228
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116658 adrian_egli
# graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
# Sat, 23 Jan 2021 07:41:35
set_action_size_reduced()
load_policy = "PPO"
checkpoint = "./checkpoints/210122235754-5000.pth" # 16.00113400887389
EPSILON = 0.0
if True:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116659 adrian_egli
# graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
# Sat, 23 Jan 2021 07:45:49
set_action_size_reduced()
load_policy = "DDDQN"
checkpoint = "./checkpoints/210122165109-5000.pth" # 17.993750197899438
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# !! This is not a RL solution !!!!
# -------------------------------------------------------------------------------------------------------
# 116727 adrian_egli
# graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
# Sat, 23 Jan 2021 14:31:50
set_action_size_reduced()
load_policy = "DeadLockAvoidance"
checkpoint = None
EPSILON = 0.0
# Use last action cache
USE_ACTION_CACHE = False
USE_DEAD_LOCK_AVOIDANCE_AGENT = False # 21.54485505223213
USE_MULTI_DECISION_AGENT = False
# Observation parameters (must match training parameters!)
observation_tree_depth = 2
......@@ -102,15 +146,6 @@ else:
n_nodes = sum([np.power(4, i) for i in range(observation_tree_depth + 1)])
state_size = n_features_per_node * n_nodes
action_size = get_action_size()
# Creates the policy. No GPU on evaluation server.
if not USE_PPO_AGENT:
trained_policy = DDDQNPolicy(state_size, action_size, Namespace(**{'use_gpu': False}), evaluation_mode=True)
else:
trained_policy = PPOPolicy(state_size, action_size)
trained_policy.load(checkpoint)
#####################################################################
# Main evaluation loop
#####################################################################
......@@ -145,9 +180,25 @@ while True:
tree_observation.set_env(local_env)
tree_observation.reset()
policy = trained_policy
if USE_MULTI_DECISION_AGENT:
policy = DeadLockAvoidanceWithDecisionAgent(local_env, state_size, action_size, trained_policy)
# Creates the policy. No GPU on evaluation server.
if load_policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), Namespace(**{'use_gpu': False}), evaluation_mode=True)
elif load_policy == "PPO":
policy = PPOPolicy(state_size, get_action_size())
elif load_policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
elif load_policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), Namespace(**{'use_gpu': False}), evaluation_mode=True)
policy = DeadLockAvoidanceWithDecisionAgent(local_env, state_size, get_action_size(), inter_policy)
elif load_policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), Namespace(**{'use_gpu': False}))
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False,
in_parameters=Namespace(**{'use_gpu': False}))
policy.load(checkpoint)
policy.reset(local_env)
observation = tree_observation.get_many(list(range(nb_agents)))
......@@ -168,9 +219,6 @@ while True:
agent_last_action = {}
nb_hit = 0
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
policy = DeadLockAvoidanceAgent(local_env, action_size)
policy.start_episode(train=False)
while True:
try:
......@@ -185,14 +233,7 @@ while True:
time_start = time.time()
action_dict = {}
policy.start_step(train=False)
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
observation = np.zeros((local_env.get_num_agents(), 2))
for agent_handle in range(nb_agents):
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
observation[agent_handle][0] = agent_handle
observation[agent_handle][1] = steps
if info['action_required'][agent_handle]:
if agent_handle in agent_last_obs and np.all(
agent_last_obs[agent_handle] == observation[agent_handle]):
......@@ -234,7 +275,11 @@ while True:
step_time = time.time() - time_start
time_taken_per_step.append(step_time)
nb_agents_done = sum(done[idx] for idx in local_env.get_agent_handles())
nb_agents_done = 0
for i_agent, agent in enumerate(local_env.agents):
# manage the boolean flag to check if all agents are indeed done (or done_removed)
if (agent.status in [RailAgentStatus.DONE, RailAgentStatus.DONE_REMOVED]):
nb_agents_done += 1
if VERBOSE or done['__all__']:
print(
......