Compare revisions

Egli Adrian (IT-SCI-API-PFI)
--- a/README.md
+++ b/README.md
-🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
-===
+🚂 This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
+---

-This starter kit contains 2 example policies to get started with this challenge: 
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
+You can use either the full or the reduced action space for your own experiments.
+
+```python
+def map_action(action):
+    # if full action space is used -> no mapping required
+    if get_action_size() == get_flatland_full_action_size():
+        return action
+    
+    # if reduced action space is used -> the action has to be mapped to real flatland actions
+    # The reduced action space removes the DO_NOTHING action from Flatland.
+    if action == 0:
+        return RailEnvActions.MOVE_LEFT
+    if action == 1:
+        return RailEnvActions.MOVE_FORWARD
+    if action == 2:
+        return RailEnvActions.MOVE_RIGHT
+    if action == 3:
+        return RailEnvActions.STOP_MOVING
+```

-**🔗 [Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
+```python
+set_action_size_full()
+```
+or 
+```python
+set_action_size_reduced()
+```
+Call one of the two functions above to select the action space. The reduced action space just removes `DO_NOTHING`.

-**🔗 [Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
+---
+The policy used here is based on the FastTreeObs from the official starter kit - NeurIPS 2020 Flatland Challenge, but the
+ FastTreeObs in this repo is an extended version.
+[fast_tree_obs.py](./utils/fast_tree_obs.py)

-**🔗 [Submit a trained policy](https://flatland.aicrowd.com/getting-started/first-submission.html)**
+---
+Have a look at the [run.py](./run.py) file. There you can select PPO or DDDQN as the RL agent.
+ 
+```python
+####################################################
+# EVALUATION PARAMETERS
+set_action_size_full()
+
+# Print per-step logs
+VERBOSE = True
+USE_FAST_TREEOBS = True
+
+if False:
+    # -------------------------------------------------------------------------------------------------------
+    # RL solution
+    # -------------------------------------------------------------------------------------------------------
+    # 116591 adrian_egli
+    # graded	71.305	0.633	RL	Successfully Graded ! More details about this submission can be found at:
+    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
+    # Fri, 22 Jan 2021 23:37:56
+    set_action_size_reduced()
+    load_policy = "DDDQN"
+    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
+    EPSILON = 0.0
+
+if False:
+    # -------------------------------------------------------------------------------------------------------
+    # RL solution
+    # -------------------------------------------------------------------------------------------------------
+    # 116658 adrian_egli
+    # graded	73.821	0.655	RL	Successfully Graded ! More details about this submission can be found at:
+    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
+    # Sat, 23 Jan 2021 07:41:35
+    set_action_size_reduced()
+    load_policy = "PPO"
+    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
+    EPSILON = 0.0
+
+if True:
+    # -------------------------------------------------------------------------------------------------------
+    # RL solution
+    # -------------------------------------------------------------------------------------------------------
+    # 116659 adrian_egli
+    # graded	80.579	0.715	RL	Successfully Graded ! More details about this submission can be found at:
+    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
+    # Sat, 23 Jan 2021 07:45:49
+    set_action_size_reduced()
+    load_policy = "DDDQN"
+    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
+    EPSILON = 0.0
+
+if False:
+    # -------------------------------------------------------------------------------------------------------
+    # !! This is not a RL solution !!!!
+    # -------------------------------------------------------------------------------------------------------
+    # 116727 adrian_egli
+    # graded	106.786	0.768	RL	Successfully Graded ! More details about this submission can be found at:
+    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
+    # Sat, 23 Jan 2021 14:31:50
+    set_action_size_reduced()
+    load_policy = "DeadLockAvoidance"
+    checkpoint = None
+    EPSILON = 0.0
+```

-The single-agent example is meant as a minimal example of how to use DQN. The multi-agent is a better starting point to create your own solution.
+---
+A deadlock avoidance agent is implemented. The agent only lets each train take its shortest route, and it tries to avoid as many deadlocks as possible.
+* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)

-You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)

-Sample training usage
 ---
+The policy interface has changed; please have a look at
+* [policy.py](./reinforcement_learning/policy.py)

-Train the multi-agent policy for 150 episodes:
-
-```bash
-python reinforcement_learning/multi_agent_training.py -n 150
+---
+See the TensorBoard training output for some insights:
+```
+tensorboard --logdir ./runs_bench 
 ```

-The multi-agent policy training can be tuned using command-line arguments:
+---
+```
+python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000 \
+    --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
+```

-```console 
-usage: multi_agent_training.py [-h] [-n N_EPISODES] [-t TRAINING_ENV_CONFIG]
+[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py)
+has new or changed parameters. The most important ones for training are:
+ * policy :  [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : Default value
+   DeadLockAvoidance 
+ * use_fast_tree_observation : [false,true] : Default value = true  
+ * action_size: [full, reduced] : Default value = full
+``` 
+usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
+                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
@@ -42,12 +144,16 @@ usage: multi_agent_training.py [-h] [-n N_EPISODES] [-t TRAINING_ENV_CONFIG]
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
-                               [--render RENDER]
+                               [--render] [--load_policy LOAD_POLICY]
+                               [--use_fast_tree_observation]
+                               [--max_depth MAX_DEPTH] [--policy POLICY]
+                               [--action_size ACTION_SIZE]

 optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
+  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
@@ -82,20 +188,40 @@ optional arguments:
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
-  --render RENDER       render 1 episode in 100
-```
+  --render              render 1 episode in 100
+  --load_policy LOAD_POLICY
+                        policy filename (reference) to load
+  --use_fast_tree_observation
+                        use FastTreeObs instead of stock TreeObs
+  --max_depth MAX_DEPTH
+                        max depth
+  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
+                        DeadLockAvoidanceWithDecision, MultiDecision]
+  --action_size ACTION_SIZE
+                        define the action size [reduced,full]
+```                        

-[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)

-[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
+---
+If you have any questions, write to me on the official Discord channel: **aiAdrian**
+(Adrian Egli - adrian.egli@gmail.com) 
+
+
+Credits
+---

-[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA) 
+* Florian Laurent <florian@aicrowd.com>
+* Erik Nygren <erik.nygren@sbb.ch>
+* Adrian Egli <adrian.egli@sbb.ch>
+* Sharada Mohanty <mohanty@aicrowd.com>
+* Christian Baumberger <christian.baumberger@sbb.ch>
+* Guillaume Mollard <guillaume.mollard2@gmail.com>

 Main links
 ---

 * [Flatland documentation](https://flatland.aicrowd.com/)
-* [NeurIPS 2020 Challenge](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)
+* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)

 Communication
 ---
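
The README change above describes the reduced action space as the full Flatland action space without `DO_NOTHING`. A minimal sketch of that mapping as a lookup table follows. `RailEnvActions` is redefined here as a stand-in for `flatland.envs.rail_env.RailEnvActions` so the snippet runs without Flatland, and `REDUCED_TO_FULL` and the `reduced` flag are hypothetical names, not the repo's `get_action_size()` mechanism.

```python
from enum import IntEnum


class RailEnvActions(IntEnum):
    # Stand-in mirroring flatland.envs.rail_env.RailEnvActions (assumption:
    # used only so that this sketch runs without Flatland installed).
    DO_NOTHING = 0
    MOVE_LEFT = 1
    MOVE_FORWARD = 2
    MOVE_RIGHT = 3
    STOP_MOVING = 4


# Hypothetical lookup table: reduced index -> Flatland action (DO_NOTHING dropped).
REDUCED_TO_FULL = (
    RailEnvActions.MOVE_LEFT,
    RailEnvActions.MOVE_FORWARD,
    RailEnvActions.MOVE_RIGHT,
    RailEnvActions.STOP_MOVING,
)


def map_action(action, reduced=True):
    """Map a policy output to a Flatland action."""
    if not reduced:
        # Full action space: the index already is a Flatland action.
        return RailEnvActions(action)
    if not 0 <= action < len(REDUCED_TO_FULL):
        raise ValueError(f"action {action} is outside the reduced action space")
    return REDUCED_TO_FULL[action]
```

Unlike the README snippet, an out-of-range index raises here instead of silently returning `None`.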

--- a/checkpoints/201112143850-5400.pth.local
+++ b/checkpoints/201112143850-5400.pth.local
--- a/checkpoints/201112143850-5400.pth.target
+++ b/checkpoints/201112143850-5400.pth.target
--- a/checkpoints/201113211844-6100.pth.local
+++ b/checkpoints/201113211844-6100.pth.local
--- a/checkpoints/201113211844-6100.pth.target
+++ b/checkpoints/201113211844-6100.pth.target
--- a/checkpoints/210122120236-3000.pth.local
+++ b/checkpoints/210122120236-3000.pth.local
--- a/checkpoints/210122120236-3000.pth.target
+++ b/checkpoints/210122120236-3000.pth.target
--- a/checkpoints/210122165109-5000.pth.local
+++ b/checkpoints/210122165109-5000.pth.local
--- a/checkpoints/210122165109-5000.pth.target
+++ b/checkpoints/210122165109-5000.pth.target
--- a/checkpoints/210122235754-5000.pth.actor
+++ b/checkpoints/210122235754-5000.pth.actor
--- a/checkpoints/210122235754-5000.pth.optimizer
+++ b/checkpoints/210122235754-5000.pth.optimizer
--- a/checkpoints/210122235754-5000.pth.value
+++ b/checkpoints/210122235754-5000.pth.value
--- a/dump.rdb
+++ b/dump.rdb
--- a/reinforcement_learning/dddqn_policy.py
+++ b/reinforcement_learning/dddqn_policy.py
@@ -2,7 +2,6 @@ import copy
 import os
 import pickle
 import random
-from collections import namedtuple, deque, Iterable

 import numpy as np
 import torch
@@ -10,14 +9,18 @@ import torch.nn.functional as F
 import torch.optim as optim

 from reinforcement_learning.model import DuelingQNetwork
-from reinforcement_learning.policy import Policy
+from reinforcement_learning.policy import Policy, LearningPolicy
+from reinforcement_learning.replay_buffer import ReplayBuffer


-class DDDQNPolicy(Policy):
+class DDDQNPolicy(LearningPolicy):
    """Dueling Double DQN policy"""

-    def __init__(self, state_size, action_size, parameters, evaluation_mode=False):
-        self.parameters = parameters
+    def __init__(self, state_size, action_size, in_parameters, evaluation_mode=False):
+        print(">> DDDQNPolicy")
+        super(DDDQNPolicy, self).__init__()
+
+        self.ddqn_parameters = in_parameters
        self.evaluation_mode = evaluation_mode

        self.state_size = state_size
@@ -26,17 +29,17 @@ class DDDQNPolicy(Policy):
        self.hidsize = 128

        if not evaluation_mode:
-            self.hidsize = parameters.hidden_size
-            self.buffer_size = parameters.buffer_size
-            self.batch_size = parameters.batch_size
-            self.update_every = parameters.update_every
-            self.learning_rate = parameters.learning_rate
-            self.tau = parameters.tau
-            self.gamma = parameters.gamma
-            self.buffer_min_size = parameters.buffer_min_size
+            self.hidsize = self.ddqn_parameters.hidden_size
+            self.buffer_size = self.ddqn_parameters.buffer_size
+            self.batch_size = self.ddqn_parameters.batch_size
+            self.update_every = self.ddqn_parameters.update_every
+            self.learning_rate = self.ddqn_parameters.learning_rate
+            self.tau = self.ddqn_parameters.tau
+            self.gamma = self.ddqn_parameters.gamma
+            self.buffer_min_size = self.ddqn_parameters.buffer_min_size

            # Device
-        if parameters.use_gpu and torch.cuda.is_available():
+        if self.ddqn_parameters.use_gpu and torch.cuda.is_available():
            self.device = torch.device("cuda:0")
            # print("🐇 Using GPU")
        else:
@@ -44,18 +47,22 @@ class DDDQNPolicy(Policy):
            # print("🐢 Using CPU")

        # Q-Network
-        self.qnetwork_local = DuelingQNetwork(state_size, action_size, hidsize1=self.hidsize, hidsize2=self.hidsize).to(
-            self.device)
+        self.qnetwork_local = DuelingQNetwork(state_size,
+                                              action_size,
+                                              hidsize1=self.hidsize,
+                                              hidsize2=self.hidsize).to(self.device)

        if not evaluation_mode:
            self.qnetwork_target = copy.deepcopy(self.qnetwork_local)
            self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.learning_rate)
            self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
-
            self.t_step = 0
            self.loss = 0.0
+        else:
+            self.memory = ReplayBuffer(action_size, 1, 1, self.device)
+            self.loss = 0.0

-    def act(self, state, eps=0.):
+    def act(self, handle, state, eps=0.):
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.qnetwork_local.eval()
        with torch.no_grad():
@@ -66,10 +73,6 @@ class DDDQNPolicy(Policy):
        # Epsilon-greedy action selection
        if random.random() >= eps:
            return np.argmax(action_values.cpu().data.numpy())
-            qvals = action_values.cpu().data.numpy()[0]
-            qvals = qvals - np.min(qvals)
-            qvals = qvals / (1e-5 + np.sum(qvals))
-            return np.argmax(np.random.multinomial(1, qvals))
        else:
            return random.choice(np.arange(self.action_size))

@@ -88,7 +91,7 @@ class DDDQNPolicy(Policy):

    def _learn(self):
        experiences = self.memory.sample()
-        states, actions, rewards, next_states, dones = experiences
+        states, actions, rewards, next_states, dones, _ = experiences

        # Get expected Q values from local model
        q_expected = self.qnetwork_local(states).gather(1, actions)
@@ -151,63 +154,11 @@ class DDDQNPolicy(Policy):
            self.memory.memory = pickle.load(f)

    def test(self):
-        self.act(np.array([[0] * self.state_size]))
+        self.act(0, np.array([[0] * self.state_size]))
        self._learn()

    def clone(self):
-        me = DDDQNPolicy(self.state_size, self.action_size, self.parameters, evaluation_mode=True)
+        me = DDDQNPolicy(self.state_size, self.action_size, self.ddqn_parameters, evaluation_mode=True)
        me.qnetwork_local = copy.deepcopy(self.qnetwork_local)
        me.qnetwork_target = copy.deepcopy(self.qnetwork_target)
        return me
-
-
-Experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
-
-
-class ReplayBuffer:
-    """Fixed-size buffer to store experience tuples."""
-
-    def __init__(self, action_size, buffer_size, batch_size, device):
-        """Initialize a ReplayBuffer object.
-
-        Params
-        ======
-            action_size (int): dimension of each action
-            buffer_size (int): maximum size of buffer
-            batch_size (int): size of each training batch
-        """
-        self.action_size = action_size
-        self.memory = deque(maxlen=buffer_size)
-        self.batch_size = batch_size
-        self.device = device
-
-    def add(self, state, action, reward, next_state, done):
-        """Add a new experience to memory."""
-        e = Experience(np.expand_dims(state, 0), action, reward, np.expand_dims(next_state, 0), done)
-        self.memory.append(e)
-
-    def sample(self):
-        """Randomly sample a batch of experiences from memory."""
-        experiences = random.sample(self.memory, k=self.batch_size)
-
-        states = torch.from_numpy(self.__v_stack_impr([e.state for e in experiences if e is not None])) \
-            .float().to(self.device)
-        actions = torch.from_numpy(self.__v_stack_impr([e.action for e in experiences if e is not None])) \
-            .long().to(self.device)
-        rewards = torch.from_numpy(self.__v_stack_impr([e.reward for e in experiences if e is not None])) \
-            .float().to(self.device)
-        next_states = torch.from_numpy(self.__v_stack_impr([e.next_state for e in experiences if e is not None])) \
-            .float().to(self.device)
-        dones = torch.from_numpy(self.__v_stack_impr([e.done for e in experiences if e is not None]).astype(np.uint8)) \
-            .float().to(self.device)
-
-        return states, actions, rewards, next_states, dones
-
-    def __len__(self):
-        """Return the current size of internal memory."""
-        return len(self.memory)
-
-    def __v_stack_impr(self, states):
-        sub_dim = len(states[0][0]) if isinstance(states[0], Iterable) else 1
-        np_states = np.reshape(np.array(states), (len(states), sub_dim))
-        return np_states
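
After this change, `DDDQNPolicy.act` keeps a plain epsilon-greedy rule: take the argmax of the Q-values unless a random draw falls below `eps`. A minimal sketch of that rule on a plain list of Q-values, without PyTorch, is shown below. The function name `epsilon_greedy` and the `rng` parameter are illustrative assumptions, not part of the repo.

```python
import random


def epsilon_greedy(action_values, eps=0.0, rng=random):
    """Return the greedy action with probability 1 - eps, else a uniform random one."""
    if rng.random() >= eps:
        # Greedy branch: index of the largest Q-value (first one on ties).
        return max(range(len(action_values)), key=action_values.__getitem__)
    # Exploration branch: uniform over all action indices.
    return rng.randrange(len(action_values))


q_values = [0.1, 0.7, 0.2, 0.0]
print(epsilon_greedy(q_values, eps=0.0))  # greedy choice: 1
```

With `eps=0.0`, as set by `EPSILON = 0.0` in the evaluation parameters, the rule is fully deterministic.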
--- a/reinforcement_learning/deadlockavoidance_with_decision_agent.py
+++ b/reinforcement_learning/deadlockavoidance_with_decision_agent.py
+from flatland.envs.agent_utils import RailAgentStatus
+from flatland.envs.rail_env import RailEnv, RailEnvActions
+
+from reinforcement_learning.policy import HybridPolicy
+from reinforcement_learning.ppo_agent import PPOPolicy
+from utils.agent_action_config import map_rail_env_action
+from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
+
+
+class DeadLockAvoidanceWithDecisionAgent(HybridPolicy):
+
+    def __init__(self, env: RailEnv, state_size, action_size, learning_agent):
+        print(">> DeadLockAvoidanceWithDecisionAgent")
+        super(DeadLockAvoidanceWithDecisionAgent, self).__init__()
+        self.env = env
+        self.state_size = state_size
+        self.action_size = action_size
+        self.learning_agent = learning_agent
+        self.dead_lock_avoidance_agent = DeadLockAvoidanceAgent(self.env, action_size, False)
+        self.policy_selector = PPOPolicy(state_size, 2)
+
+        self.memory = self.learning_agent.memory
+        self.loss = self.learning_agent.loss
+
+    def step(self, handle, state, action, reward, next_state, done):
+        select = self.policy_selector.act(handle, state, 0.0)
+        self.policy_selector.step(handle, state, select, reward, next_state, done)
+        self.dead_lock_avoidance_agent.step(handle, state, action, reward, next_state, done)
+        self.learning_agent.step(handle, state, action, reward, next_state, done)
+        self.loss = self.learning_agent.loss
+
+    def act(self, handle, state, eps=0.):
+        select = self.policy_selector.act(handle, state, eps)
+        if select == 0:
+            return self.learning_agent.act(handle, state, eps)
+        return self.dead_lock_avoidance_agent.act(handle, state, -1.0)
+
+    def save(self, filename):
+        self.dead_lock_avoidance_agent.save(filename)
+        self.learning_agent.save(filename)
+        self.policy_selector.save(filename + '.selector')
+
+    def load(self, filename):
+        self.dead_lock_avoidance_agent.load(filename)
+        self.learning_agent.load(filename)
+        self.policy_selector.load(filename + '.selector')
+
+    def start_step(self, train):
+        self.dead_lock_avoidance_agent.start_step(train)
+        self.learning_agent.start_step(train)
+        self.policy_selector.start_step(train)
+
+    def end_step(self, train):
+        self.dead_lock_avoidance_agent.end_step(train)
+        self.learning_agent.end_step(train)
+        self.policy_selector.end_step(train)
+
+    def start_episode(self, train):
+        self.dead_lock_avoidance_agent.start_episode(train)
+        self.learning_agent.start_episode(train)
+        self.policy_selector.start_episode(train)
+
+    def end_episode(self, train):
+        self.dead_lock_avoidance_agent.end_episode(train)
+        self.learning_agent.end_episode(train)
+        self.policy_selector.end_episode(train)
+
+    def load_replay_buffer(self, filename):
+        self.dead_lock_avoidance_agent.load_replay_buffer(filename)
+        self.learning_agent.load_replay_buffer(filename)
+        self.policy_selector.load_replay_buffer(filename + ".selector")
+
+    def test(self):
+        self.dead_lock_avoidance_agent.test()
+        self.learning_agent.test()
+        self.policy_selector.test()
+
+    def reset(self, env: RailEnv):
+        self.env = env
+        self.dead_lock_avoidance_agent.reset(env)
+        self.learning_agent.reset(env)
+        self.policy_selector.reset(env)
+
+    def clone(self):
+        return self
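
The agent above follows a selector pattern: a two-action policy decides, per call, whether the learning agent or the rule-based deadlock avoidance agent picks the action. The sketch below shows only that dispatch. `ConstantPolicy` and `SelectingAgent` are hypothetical stand-ins, and unlike the original the sketch forwards `eps` unchanged to both sub-policies.

```python
class ConstantPolicy:
    """Hypothetical stub policy that always returns the same value from act()."""

    def __init__(self, value):
        self.value = value

    def act(self, handle, state, eps=0.0):
        return self.value


class SelectingAgent:
    """Delegates act() to one of two sub-policies, chosen by a selector policy."""

    def __init__(self, selector, learning_agent, rule_based_agent):
        self.selector = selector
        self.learning_agent = learning_agent
        self.rule_based_agent = rule_based_agent

    def act(self, handle, state, eps=0.0):
        # The selector outputs 0 (use the learning agent) or 1 (use the rule-based agent).
        select = self.selector.act(handle, state, eps)
        if select == 0:
            return self.learning_agent.act(handle, state, eps)
        return self.rule_based_agent.act(handle, state, eps)


agent = SelectingAgent(ConstantPolicy(1), ConstantPolicy(2), ConstantPolicy(4))
print(agent.act(handle=0, state=[0.0]))  # selector returns 1, so the rule-based action 4 is used
```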
--- a/reinforcement_learning/evaluate_agent.py
+++ b/reinforcement_learning/evaluate_agent.py
@@ -26,7 +26,8 @@ from utils.observation_utils import normalize_observation
 from reinforcement_learning.dddqn_policy import DDDQNPolicy


-def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render, allow_skipping, allow_caching):
+def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render,
+                allow_skipping, allow_caching):
    # Evaluation is faster on CPU (except if you use a really huge policy)
    parameters = {
        'use_gpu': False
@@ -140,11 +141,12 @@ def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size,

                    else:
                        preproc_timer.start()
-                        norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth, observation_radius=observation_radius)
+                        norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth,
+                                                         observation_radius=observation_radius)
                        preproc_timer.end()

                        inference_timer.start()
-                        action = policy.act(norm_obs, eps=0.0)
+                        action = policy.act(agent, norm_obs, eps=0.0)
                        inference_timer.end()

                    action_dict.update({agent: action})
@@ -319,12 +321,15 @@ def evaluate_agents(file, n_evaluation_episodes, use_gpu, render, allow_skipping

    results = []
    if render:
-        results.append(eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping, allow_caching))
+        results.append(
+            eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping,
+                        allow_caching))

    else:
        with Pool() as p:
            results = p.starmap(eval_policy,
-                                [(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render, allow_skipping, allow_caching)
+                                [(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render,
+                                  allow_skipping, allow_caching)
                                 for seed in
                                 range(total_nb_eval)])

@@ -367,10 +372,12 @@ if __name__ == "__main__":

    parser.add_argument("--use_gpu", dest="use_gpu", help="use GPU if available", action='store_true')
    parser.add_argument("--render", help="render a single episode", action='store_true')
-    parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked", action='store_true')
+    parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked",
+                        action='store_true')
    parser.add_argument("--allow_caching", help="caches the last observation-action pair", action='store_true')
    args = parser.parse_args()

    os.environ["OMP_NUM_THREADS"] = str(1)
-    evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu, render=args.render,
+    evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu,
+                    render=args.render,
                    allow_skipping=args.allow_skipping, allow_caching=args.allow_caching)
--- a/reinforcement_learning/multi_agent_training.py
+++ b/reinforcement_learning/multi_agent_training.py
--- a/reinforcement_learning/multi_decision_agent.py
+++ b/reinforcement_learning/multi_decision_agent.py
+from flatland.envs.rail_env import RailEnv
+
+from reinforcement_learning.dddqn_policy import DDDQNPolicy
+from reinforcement_learning.policy import LearningPolicy, DummyMemory
+from reinforcement_learning.ppo_agent import PPOPolicy
+
+
+class MultiDecisionAgent(LearningPolicy):
+
+    def __init__(self, state_size, action_size, in_parameters=None):
+        print(">> MultiDecisionAgent")
+        super(MultiDecisionAgent, self).__init__()
+        self.state_size = state_size
+        self.action_size = action_size
+        self.in_parameters = in_parameters
+        self.memory = DummyMemory()
+        self.loss = 0
+
+        self.ppo_policy = PPOPolicy(state_size, action_size, use_replay_buffer=False, in_parameters=in_parameters)
+        self.dddqn_policy = DDDQNPolicy(state_size, action_size, in_parameters)
+        self.policy_selector = PPOPolicy(state_size, 2)
+
+
+    def step(self, handle, state, action, reward, next_state, done):
+        self.ppo_policy.step(handle, state, action, reward, next_state, done)
+        self.dddqn_policy.step(handle, state, action, reward, next_state, done)
+        select = self.policy_selector.act(handle, state, 0.0)
+        self.policy_selector.step(handle, state, select, reward, next_state, done)
+
+    def act(self, handle, state, eps=0.):
+        select = self.policy_selector.act(handle, state, eps)
+        if select == 0:
+            return self.dddqn_policy.act(handle, state, eps)
+        return self.ppo_policy.act(handle, state, eps)
+
+    def save(self, filename):
+        self.ppo_policy.save(filename)
+        self.dddqn_policy.save(filename)
+        self.policy_selector.save(filename + '.selector')
+
+    def load(self, filename):
+        self.ppo_policy.load(filename)
+        self.dddqn_policy.load(filename)
+        self.policy_selector.load(filename + '.selector')
+
+    def start_step(self, train):
+        self.ppo_policy.start_step(train)
+        self.dddqn_policy.start_step(train)
+        self.policy_selector.start_step(train)
+
+    def end_step(self, train):
+        self.ppo_policy.end_step(train)
+        self.dddqn_policy.end_step(train)
+        self.policy_selector.end_step(train)
+
+    def start_episode(self, train):
+        self.ppo_policy.start_episode(train)
+        self.dddqn_policy.start_episode(train)
+        self.policy_selector.start_episode(train)
+
+    def end_episode(self, train):
+        self.ppo_policy.end_episode(train)
+        self.dddqn_policy.end_episode(train)
+        self.policy_selector.end_episode(train)
+
+    def load_replay_buffer(self, filename):
+        self.ppo_policy.load_replay_buffer(filename)
+        self.dddqn_policy.load_replay_buffer(filename)
+        self.policy_selector.load_replay_buffer(filename)
+
+    def test(self):
+        self.ppo_policy.test()
+        self.dddqn_policy.test()
+        self.policy_selector.test()
+
+    def reset(self, env: RailEnv):
+        self.ppo_policy.reset(env)
+        self.dddqn_policy.reset(env)
+        self.policy_selector.reset(env)
+
+    def clone(self):
+        multi_decision_agent = MultiDecisionAgent(
+            self.state_size,
+            self.action_size,
+            self.in_parameters
+        )
+        multi_decision_agent.ppo_policy = self.ppo_policy.clone()
+        multi_decision_agent.dddqn_policy = self.dddqn_policy.clone()
+        multi_decision_agent.policy_selector = self.policy_selector.clone()
+        return multi_decision_agent
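
Judging only from the methods the agents in this diff forward to their sub-policies, a policy in this revision appears to expose `act(handle, state, eps)`, `step(...)`, and a set of lifecycle hooks that take a `train` flag. The sketch below is an assumption reconstructed from those call sites, not the content of `reinforcement_learning/policy.py`. `PolicySketch` and `RandomPolicy` are hypothetical names.

```python
import random


class PolicySketch:
    """Assumed shape of the policy interface, inferred from call sites in this diff."""

    def act(self, handle, state, eps=0.0):
        raise NotImplementedError

    def step(self, handle, state, action, reward, next_state, done):
        raise NotImplementedError

    # Lifecycle hooks default to no-ops so simple policies can ignore them.
    def start_step(self, train):
        pass

    def end_step(self, train):
        pass

    def start_episode(self, train):
        pass

    def end_episode(self, train):
        pass

    def reset(self, env):
        pass

    def save(self, filename):
        pass

    def load(self, filename):
        pass

    def clone(self):
        return self


class RandomPolicy(PolicySketch):
    """Smallest possible implementation: ignores the state and acts at random."""

    def __init__(self, action_size, seed=0):
        self.action_size = action_size
        self.rng = random.Random(seed)

    def act(self, handle, state, eps=0.0):
        return self.rng.randrange(self.action_size)

    def step(self, handle, state, action, reward, next_state, done):
        pass  # nothing to learn
```

The `handle` argument is the agent index, which lets a single policy object keep per-agent state if it needs to.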
--- a/reinforcement_learning/multi_policy.py
+++ b/reinforcement_learning/multi_policy.py
 import numpy as np
-from flatland.envs.rail_env import RailEnvActions
+from flatland.envs.rail_env import RailEnv

 from reinforcement_learning.policy import Policy
-from reinforcement_learning.ppo.ppo_agent import PPOAgent
+from reinforcement_learning.ppo_agent import PPOPolicy
 from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
-from utils.extra import ExtraPolicy


 class MultiPolicy(Policy):
@@ -13,20 +12,20 @@ class MultiPolicy(Policy):
        self.action_size = action_size
        self.memory = []
        self.loss = 0
-        self.extra_policy = ExtraPolicy(state_size, action_size)
-        self.ppo_policy = PPOAgent(state_size + action_size, action_size, n_agents, env)
+        self.deadlock_avoidance_policy = DeadLockAvoidanceAgent(env, action_size, False)
+        self.ppo_policy = PPOPolicy(state_size + action_size, action_size)

    def load(self, filename):
        self.ppo_policy.load(filename)
-        self.extra_policy.load(filename)
+        self.deadlock_avoidance_policy.load(filename)

    def save(self, filename):
        self.ppo_policy.save(filename)
-        self.extra_policy.save(filename)
+        self.deadlock_avoidance_policy.save(filename)

    def step(self, handle, state, action, reward, next_state, done):
-        action_extra_state = self.extra_policy.act(handle, state, 0.0)
-        action_extra_next_state = self.extra_policy.act(handle, next_state, 0.0)
+        action_extra_state = self.deadlock_avoidance_policy.act(handle, state, 0.0)
+        action_extra_next_state = self.deadlock_avoidance_policy.act(handle, next_state, 0.0)

        extended_state = np.copy(state)
        for action_itr in np.arange(self.action_size):
@@ -35,11 +34,11 @@ class MultiPolicy(Policy):
        for action_itr in np.arange(self.action_size):
            extended_next_state = np.append(extended_next_state, [int(action_extra_next_state == action_itr)])

-        self.extra_policy.step(handle, state, action, reward, next_state, done)
+        self.deadlock_avoidance_policy.step(handle, state, action, reward, next_state, done)
        self.ppo_policy.step(handle, extended_state, action, reward, extended_next_state, done)

    def act(self, handle, state, eps=0.):
-        action_extra_state = self.extra_policy.act(handle, state, 0.0)
+        action_extra_state = self.deadlock_avoidance_policy.act(handle, state, 0.0)
        extended_state = np.copy(state)
        for action_itr in np.arange(self.action_size):
            extended_state = np.append(extended_state, [int(action_extra_state == action_itr)])
@@ -47,18 +46,18 @@ class MultiPolicy(Policy):
        self.loss = self.ppo_policy.loss
        return action_ppo

-    def reset(self):
-        self.ppo_policy.reset()
-        self.extra_policy.reset()
+    def reset(self, env: RailEnv):
+        self.ppo_policy.reset(env)
+        self.deadlock_avoidance_policy.reset(env)

    def test(self):
        self.ppo_policy.test()
-        self.extra_policy.test()
+        self.deadlock_avoidance_policy.test()

-    def start_step(self):
-        self.extra_policy.start_step()
-        self.ppo_policy.start_step()
+    def start_step(self, train):
+        self.deadlock_avoidance_policy.start_step(train)
+        self.ppo_policy.start_step(train)

-    def end_step(self):
-        self.extra_policy.end_step()
-        self.ppo_policy.end_step()
+    def end_step(self, train):
+        self.deadlock_avoidance_policy.end_step(train)
+        self.ppo_policy.end_step(train)
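
`MultiPolicy` feeds the deadlock avoidance agent's suggestion to PPO by appending a one-hot encoding of the suggested action to the observation, which is why `PPOPolicy` is constructed with `state_size + action_size` inputs. A minimal sketch of that state extension on plain lists follows. The helper name `extend_state` is hypothetical, and the original uses `np.append` in a loop instead.

```python
def extend_state(state, suggested_action, action_size):
    """Append a one-hot encoding of suggested_action to the state vector."""
    one_hot = [int(suggested_action == a) for a in range(action_size)]
    return list(state) + one_hot


print(extend_state([0.5, 0.25], suggested_action=2, action_size=4))
# [0.5, 0.25, 0, 0, 1, 0]
```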
--- a/reinforcement_learning/ordered_policy.py
+++ b/reinforcement_learning/ordered_policy.py
@@ -15,7 +15,7 @@ class OrderedPolicy(Policy):
    def __init__(self):
        self.action_size = 5

-    def act(self, state, eps=0.):
+    def act(self, handle, state, eps=0.):
        _, distance, _ = split_tree_into_feature_groups(state, 1)
        distance = distance[1:]
        min_dist = min_gt(distance, 0)