Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (60)
Showing with 594 additions and 273 deletions
🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
===
🚂 This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
---
This starter kit contains 2 example policies to get started with this challenge:
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
You can use either the full or the reduced action space for your own experiments.
```python
def map_action(action):
    # If the full action space is used -> no mapping required.
    if get_action_size() == get_flatland_full_action_size():
        return action

    # If the reduced action space is used -> the action has to be mapped to real Flatland actions.
    # The reduced action space removes the DO_NOTHING action from Flatland.
    if action == 0:
        return RailEnvActions.MOVE_LEFT
    if action == 1:
        return RailEnvActions.MOVE_FORWARD
    if action == 2:
        return RailEnvActions.MOVE_RIGHT
    if action == 3:
        return RailEnvActions.STOP_MOVING
```
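As a quick sanity check of the mapping (a minimal sketch; it assumes the helpers above are importable from `utils/agent_action_config.py`, as they are in the training scripts):

```python
from flatland.envs.rail_env import RailEnvActions
from utils.agent_action_config import set_action_size_reduced, map_action

set_action_size_reduced()                          # drop DO_NOTHING -> 4 actions left
assert map_action(2) == RailEnvActions.MOVE_RIGHT  # index 2 of the reduced space
```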
**🔗 [Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
Select the action space with

```python
set_action_size_full()
```

or

```python
set_action_size_reduced()
```

The reduced action space just removes DO_NOTHING.
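In a training or evaluation loop the mapping is applied to whatever the policy returns before the actions are handed to Flatland. A minimal sketch, assuming a `policy` object with the `act(handle, state, eps)` interface used throughout this repo and the `obs`/`info` dictionaries returned by the environment:

```python
from utils.agent_action_config import map_action

def collect_actions(env, obs, info, policy, eps=0.0):
    """Ask the policy for an action per agent and map it back to a Flatland action."""
    action_dict = {}
    for handle in env.get_agent_handles():
        if info['action_required'][handle]:
            action = policy.act(handle, obs[handle], eps=eps)
            action_dict[handle] = map_action(action)  # no-op when the full action space is active
    return action_dict

# inside the episode loop (env, obs, info and policy come from the surrounding script):
# next_obs, all_rewards, done, info = env.step(collect_actions(env, obs, info, policy))
```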
**🔗 [Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
---
The policy used here is based on the FastTreeObs from the official starter kit for the NeurIPS 2020 Flatland Challenge, but the FastTreeObs in this repo is an extended version.
[fast_tree_obs.py](./utils/fast_tree_obs.py)
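The observation builder is plugged into the environment like any other Flatland observation builder. A rough sketch of the wiring; the `max_depth` constructor argument is an assumption taken from the `--max_depth` training flag, so check `fast_tree_obs.py` for the exact signature:

```python
from utils.fast_tree_obs import FastTreeObs

# Constructor argument assumed from the --max_depth training flag; see fast_tree_obs.py.
tree_observation = FastTreeObs(max_depth=2)

# Wiring into Flatland (rail/schedule generators omitted for brevity):
#   env = RailEnv(..., obs_builder_object=tree_observation)
#   state_size = tree_observation.observation_dim   # used to size the policy networks
```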
**🔗 [Submit a trained policy](https://flatland.aicrowd.com/getting-started/first-submission.html)**
---
Have a look at the [run.py](./run.py) file. There you can select whether PPO or DDDQN is used as the RL agent; a sketch of how that selection might be turned into a policy object follows the excerpt below.
```python
####################################################
# EVALUATION PARAMETERS
set_action_size_full()

# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116591 adrian_egli
    # graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
    # Fri, 22 Jan 2021 23:37:56
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116658 adrian_egli
    # graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
    # Sat, 23 Jan 2021 07:41:35
    set_action_size_reduced()
    load_policy = "PPO"
    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
    EPSILON = 0.0

if True:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116659 adrian_egli
    # graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
    # Sat, 23 Jan 2021 07:45:49
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # !! This is not a RL solution !!!!
    # -------------------------------------------------------------------------------------------------------
    # 116727 adrian_egli
    # graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
    # Sat, 23 Jan 2021 14:31:50
    set_action_size_reduced()
    load_policy = "DeadLockAvoidance"
    checkpoint = None
    EPSILON = 0.0
```
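The `load_policy` string presumably selects the policy class further down in `run.py`. A hedged sketch of that dispatch, reusing the constructors that appear in the training code; `eval_env` and `eval_params` are placeholder names, not the actual variables in `run.py`:

```python
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# Illustrative dispatch only; it mirrors the policy selection in multi_agent_training.py.
if load_policy == "DDDQN":
    policy = DDDQNPolicy(state_size, get_action_size(), eval_params, evaluation_mode=True)
elif load_policy == "DeadLockAvoidance":
    policy = DeadLockAvoidanceAgent(eval_env, get_action_size(), enable_eps=False)  # needs no checkpoint
else:  # "PPO"
    policy = PPOPolicy(state_size, get_action_size())

if checkpoint is not None:
    policy.load(checkpoint)
```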
The single-agent example is meant as a minimal example of how to use DQN. The multi-agent example is a better starting point for creating your own solution.
---
A deadlock avoidance agent is implemented. The agent only lets each train take its shortest route and tries to avoid as many deadlocks as possible; a simplified sketch of the underlying deadlock check is shown after the link below.
* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)
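A simplified version of the underlying deadlock test, adapted from a helper that this changeset removes from `multi_agent_training.py` (the shipped `DeadLockAvoidanceAgent` is more elaborate and also takes agent status into account): an agent counts as deadlocked when every cell it could move into is occupied by another agent.

```python
import numpy as np
from flatland.core.grid.grid4_utils import get_new_position

def get_agent_positions(env):
    """Map every occupied cell to the handle of the agent standing on it (-1 = free)."""
    agent_positions = np.full((env.height, env.width), -1)
    for handle in env.get_agent_handles():
        position = env.agents[handle].position
        if position is not None:          # only agents that are already on the grid
            agent_positions[position] = handle
    return agent_positions

def is_deadlocked(handle, env, agent_positions):
    """True if every transition the agent could take leads into a cell occupied by another agent."""
    agent = env.agents[handle]
    position = agent.position if agent.position is not None else agent.initial_position
    possible_transitions = env.rail.get_transitions(*position, agent.direction)
    for direction in range(4):
        if possible_transitions[direction] == 1:
            blocker = agent_positions[get_new_position(position, direction)]
            if blocker == -1:
                return False  # at least one reachable cell is free
    return True               # every reachable cell is blocked -> (local) deadlock
```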
You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)
Sample training usage
---
The policy interface has changed; please have a look at the following file (a condensed view of the interface is sketched below):
* [policy.py](./reinforcement_learning/policy.py)
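In short, `act` and `step` now receive the agent handle as their first argument and `reset` receives the environment. A condensed view of the base class (see `policy.py` for the full set of hooks):

```python
class Policy:
    def act(self, handle, state, eps=0.):          # handle is the agent id (new)
        raise NotImplementedError

    def step(self, handle, state, action, reward, next_state, done):
        raise NotImplementedError

    def save(self, filename): ...                  # persistence, implemented by subclasses
    def load(self, filename): ...

    # per-step / per-episode hooks, no-ops by default
    def start_step(self, train): pass
    def end_step(self, train): pass
    def start_episode(self, train): pass
    def end_episode(self, train): pass

    def reset(self, env):                          # now receives the RailEnv
        pass
```

Concrete policies derive from `LearningPolicy`, `HeuristicPolicy` or `HybridPolicy`, which are thin markers on top of this base class.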
Train the multi-agent policy for 150 episodes:
```bash
python reinforcement_learning/multi_agent_training.py -n 150
```

---
See the tensorboard training output to get some insights:
```
tensorboard --logdir ./runs_bench
```
---

The multi-agent policy training can be tuned using command-line arguments:

```
python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000
    --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
```
[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py) has new or changed parameters. The most important ones for training are:

* policy : [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : Default value = DeadLockAvoidance
* use_fast_tree_observation : [false, true] : Default value = true
* action_size : [full, reduced] : Default value = full

```console
usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
                               ...
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
                               [--render] [--load_policy LOAD_POLICY]
                               [--use_fast_tree_observation]
                               [--max_depth MAX_DEPTH] [--policy POLICY]
                               [--action_size ACTION_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
                        ...
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render              render 1 episode in 100
  --load_policy LOAD_POLICY
                        policy filename (reference) to load
  --use_fast_tree_observation
                        use FastTreeObs instead of stock TreeObs
  --max_depth MAX_DEPTH
                        max depth
  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
                        DeadLockAvoidanceWithDecision, MultiDecision]
  --action_size ACTION_SIZE
                        define the action size [reduced,full]
```
[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)
[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)

[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
---
If you have any questions, write to me on the official Discord channel **aiAdrian**
(Adrian Egli - adrian.egli@gmail.com)
Credits
---
* Florian Laurent <florian@aicrowd.com>
* Erik Nygren <erik.nygren@sbb.ch>
* Adrian Egli <adrian.egli@sbb.ch>
* Sharada Mohanty <mohanty@aicrowd.com>
* Christian Baumberger <christian.baumberger@sbb.ch>
* Guillaume Mollard <guillaume.mollard2@gmail.com>
Main links
---
* [Flatland documentation](https://flatland.aicrowd.com/)
* [NeurIPS 2020 Challenge](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)
* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)
Communication
---
......
File deleted
File deleted
File added
File added
File added
File added
File added
File added
File added
File deleted
......@@ -2,7 +2,6 @@ import copy
import os
import pickle
import random
from collections import namedtuple, deque, Iterable
import numpy as np
import torch
......@@ -10,13 +9,15 @@ import torch.nn.functional as F
import torch.optim as optim
from reinforcement_learning.model import DuelingQNetwork
from reinforcement_learning.policy import Policy
from reinforcement_learning.policy import Policy, LearningPolicy
from reinforcement_learning.replay_buffer import ReplayBuffer
class DDDQNPolicy(Policy):
class DDDQNPolicy(LearningPolicy):
"""Dueling Double DQN policy"""
def __init__(self, state_size, action_size, in_parameters, evaluation_mode=False):
print(">> DDDQNPolicy")
super(Policy, self).__init__()
self.ddqn_parameters = in_parameters
......@@ -55,11 +56,13 @@ class DDDQNPolicy(Policy):
self.qnetwork_target = copy.deepcopy(self.qnetwork_local)
self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.learning_rate)
self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
self.t_step = 0
self.loss = 0.0
else:
self.memory = ReplayBuffer(action_size, 1, 1, self.device)
self.loss = 0.0
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
self.qnetwork_local.eval()
with torch.no_grad():
......@@ -88,7 +91,7 @@ class DDDQNPolicy(Policy):
def _learn(self):
experiences = self.memory.sample()
states, actions, rewards, next_states, dones = experiences
states, actions, rewards, next_states, dones, _ = experiences
# Get expected Q values from local model
q_expected = self.qnetwork_local(states).gather(1, actions)
......@@ -151,7 +154,7 @@ class DDDQNPolicy(Policy):
self.memory.memory = pickle.load(f)
def test(self):
self.act(np.array([[0] * self.state_size]))
self.act(0, np.array([[0] * self.state_size]))
self._learn()
def clone(self):
......@@ -159,55 +162,3 @@ class DDDQNPolicy(Policy):
me.qnetwork_target = copy.deepcopy(self.qnetwork_local)
me.qnetwork_target = copy.deepcopy(self.qnetwork_target)
return me
Experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
class ReplayBuffer:
"""Fixed-size buffer to store experience tuples."""
def __init__(self, action_size, buffer_size, batch_size, device):
"""Initialize a ReplayBuffer object.
Params
======
action_size (int): dimension of each action
buffer_size (int): maximum size of buffer
batch_size (int): size of each training batch
"""
self.action_size = action_size
self.memory = deque(maxlen=buffer_size)
self.batch_size = batch_size
self.device = device
def add(self, state, action, reward, next_state, done):
"""Add a new experience to memory."""
e = Experience(np.expand_dims(state, 0), action, reward, np.expand_dims(next_state, 0), done)
self.memory.append(e)
def sample(self):
"""Randomly sample a batch of experiences from memory."""
experiences = random.sample(self.memory, k=self.batch_size)
states = torch.from_numpy(self.__v_stack_impr([e.state for e in experiences if e is not None])) \
.float().to(self.device)
actions = torch.from_numpy(self.__v_stack_impr([e.action for e in experiences if e is not None])) \
.long().to(self.device)
rewards = torch.from_numpy(self.__v_stack_impr([e.reward for e in experiences if e is not None])) \
.float().to(self.device)
next_states = torch.from_numpy(self.__v_stack_impr([e.next_state for e in experiences if e is not None])) \
.float().to(self.device)
dones = torch.from_numpy(self.__v_stack_impr([e.done for e in experiences if e is not None]).astype(np.uint8)) \
.float().to(self.device)
return states, actions, rewards, next_states, dones
def __len__(self):
"""Return the current size of internal memory."""
return len(self.memory)
def __v_stack_impr(self, states):
sub_dim = len(states[0][0]) if isinstance(states[0], Iterable) else 1
np_states = np.reshape(np.array(states), (len(states), sub_dim))
return np_states
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.rail_env import RailEnv, RailEnvActions
from reinforcement_learning.policy import HybridPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import map_rail_env_action
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
class DeadLockAvoidanceWithDecisionAgent(HybridPolicy):
def __init__(self, env: RailEnv, state_size, action_size, learning_agent):
print(">> DeadLockAvoidanceWithDecisionAgent")
super(DeadLockAvoidanceWithDecisionAgent, self).__init__()
self.env = env
self.state_size = state_size
self.action_size = action_size
self.learning_agent = learning_agent
self.dead_lock_avoidance_agent = DeadLockAvoidanceAgent(self.env, action_size, False)
self.policy_selector = PPOPolicy(state_size, 2)
self.memory = self.learning_agent.memory
self.loss = self.learning_agent.loss
def step(self, handle, state, action, reward, next_state, done):
select = self.policy_selector.act(handle, state, 0.0)
self.policy_selector.step(handle, state, select, reward, next_state, done)
self.dead_lock_avoidance_agent.step(handle, state, action, reward, next_state, done)
self.learning_agent.step(handle, state, action, reward, next_state, done)
self.loss = self.learning_agent.loss
def act(self, handle, state, eps=0.):
select = self.policy_selector.act(handle, state, eps)
if select == 0:
return self.learning_agent.act(handle, state, eps)
return self.dead_lock_avoidance_agent.act(handle, state, -1.0)
def save(self, filename):
self.dead_lock_avoidance_agent.save(filename)
self.learning_agent.save(filename)
self.policy_selector.save(filename + '.selector')
def load(self, filename):
self.dead_lock_avoidance_agent.load(filename)
self.learning_agent.load(filename)
self.policy_selector.load(filename + '.selector')
def start_step(self, train):
self.dead_lock_avoidance_agent.start_step(train)
self.learning_agent.start_step(train)
self.policy_selector.start_step(train)
def end_step(self, train):
self.dead_lock_avoidance_agent.end_step(train)
self.learning_agent.end_step(train)
self.policy_selector.end_step(train)
def start_episode(self, train):
self.dead_lock_avoidance_agent.start_episode(train)
self.learning_agent.start_episode(train)
self.policy_selector.start_episode(train)
def end_episode(self, train):
self.dead_lock_avoidance_agent.end_episode(train)
self.learning_agent.end_episode(train)
self.policy_selector.end_episode(train)
def load_replay_buffer(self, filename):
self.dead_lock_avoidance_agent.load_replay_buffer(filename)
self.learning_agent.load_replay_buffer(filename)
self.policy_selector.load_replay_buffer(filename + ".selector")
def test(self):
self.dead_lock_avoidance_agent.test()
self.learning_agent.test()
self.policy_selector.test()
def reset(self, env: RailEnv):
self.env = env
self.dead_lock_avoidance_agent.reset(env)
self.learning_agent.reset(env)
self.policy_selector.reset(env)
def clone(self):
return self
......@@ -26,7 +26,8 @@ from utils.observation_utils import normalize_observation
from reinforcement_learning.dddqn_policy import DDDQNPolicy
def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render, allow_skipping, allow_caching):
def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render,
allow_skipping, allow_caching):
# Evaluation is faster on CPU (except if you use a really huge policy)
parameters = {
'use_gpu': False
......@@ -140,11 +141,12 @@ def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size,
else:
preproc_timer.start()
norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth, observation_radius=observation_radius)
norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth,
observation_radius=observation_radius)
preproc_timer.end()
inference_timer.start()
action = policy.act(norm_obs, eps=0.0)
action = policy.act(agent, norm_obs, eps=0.0)
inference_timer.end()
action_dict.update({agent: action})
......@@ -319,12 +321,15 @@ def evaluate_agents(file, n_evaluation_episodes, use_gpu, render, allow_skipping
results = []
if render:
results.append(eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping, allow_caching))
results.append(
eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping,
allow_caching))
else:
with Pool() as p:
results = p.starmap(eval_policy,
[(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render, allow_skipping, allow_caching)
[(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render,
allow_skipping, allow_caching)
for seed in
range(total_nb_eval)])
......@@ -367,10 +372,12 @@ if __name__ == "__main__":
parser.add_argument("--use_gpu", dest="use_gpu", help="use GPU if available", action='store_true')
parser.add_argument("--render", help="render a single episode", action='store_true')
parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked", action='store_true')
parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked",
action='store_true')
parser.add_argument("--allow_caching", help="caches the last observation-action pair", action='store_true')
args = parser.parse_args()
os.environ["OMP_NUM_THREADS"] = str(1)
evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu, render=args.render,
evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu,
render=args.render,
allow_skipping=args.allow_skipping, allow_caching=args.allow_caching)
......@@ -9,19 +9,22 @@ from pprint import pprint
import numpy as np
import psutil
from flatland.core.grid.grid4_utils import get_new_position
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.malfunction_generators import malfunction_from_params, MalfunctionParameters
from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.predictions import ShortestPathPredictorForRailEnv
from flatland.envs.rail_env import RailEnv, RailEnvActions, fast_count_nonzero
from flatland.envs.rail_env import RailEnv, RailEnvActions
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.schedule_generators import sparse_schedule_generator
from flatland.utils.rendertools import RenderTool
from torch.utils.tensorboard import SummaryWriter
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOAgent
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action, \
set_action_size_reduced, set_action_size_full, map_action_policy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
base_dir = Path(__file__).resolve().parent.parent
sys.path.append(str(base_dir))
......@@ -78,41 +81,6 @@ def create_rail_env(env_params, tree_observation):
)
def get_agent_positions(env):
agent_positions: np.ndarray = np.full((env.height, env.width), -1)
for agent_handle in env.get_agent_handles():
agent = env.agents[agent_handle]
if agent.status == RailAgentStatus.ACTIVE:
position = agent.position
if position is None:
position = agent.initial_position
agent_positions[position] = agent_handle
return agent_positions
def check_for_dealock(handle, env, agent_positions):
agent = env.agents[handle]
if agent.status == RailAgentStatus.DONE or agent.status == RailAgentStatus.DONE_REMOVED:
return False
position = agent.position
if position is None:
position = agent.initial_position
possible_transitions = env.rail.get_transitions(*position, agent.direction)
num_transitions = fast_count_nonzero(possible_transitions)
for dir_loop in range(4):
if possible_transitions[dir_loop] == 1:
new_position = get_new_position(position, dir_loop)
opposite_agent = agent_positions[new_position]
if opposite_agent != handle and opposite_agent != -1:
num_transitions -= 1
else:
return False
is_deadlock = num_transitions <= 0
return is_deadlock
def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Environment parameters
n_agents = train_env_params.n_agents
......@@ -188,10 +156,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Calculate the state size given the depth of the tree observation and the number of features
state_size = tree_observation.observation_dim
# The action space of flatland is 5 discrete actions
action_size = 5
action_count = [0] * action_size
action_count = [0] * get_flatland_full_action_size()
action_dict = dict()
agent_obs = [None] * n_agents
agent_prev_obs = [None] * n_agents
......@@ -205,12 +170,33 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
scores_window = deque(maxlen=checkpoint_interval) # todo smooth when rendering instead
completion_window = deque(maxlen=checkpoint_interval)
if train_params.action_size == "reduced":
set_action_size_reduced()
else:
set_action_size_full()
# Double Dueling DQN policy
policy = DDDQNPolicy(state_size, action_size, train_params)
if True:
policy = PPOAgent(state_size, action_size)
if train_params.policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
elif train_params.policy == "PPO":
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
elif train_params.policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(train_env, get_action_size(), enable_eps=False)
elif train_params.policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), train_params)
policy = DeadLockAvoidanceWithDecisionAgent(train_env, state_size, get_action_size(), inter_policy)
elif train_params.policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), train_params)
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
# make sure that at least one policy is set
if policy is None:
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
# Load existing policy
if train_params.load_policy is not "":
if train_params.load_policy != "":
policy.load(train_params.load_policy)
# Loads existing replay buffer
......@@ -232,7 +218,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
hdd.free / (2 ** 30)))
# TensorBoard writer
writer = SummaryWriter()
writer = SummaryWriter(comment="_" + train_params.policy + "_" + train_params.action_size)
training_timer = Timer()
training_timer.start()
......@@ -256,12 +242,16 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Reset environment
reset_timer.start()
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
if train_params.n_agent_fixed:
number_of_agents = n_agents
train_env_params.n_agents = n_agents
else:
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
train_env = create_rail_env(train_env_params, tree_observation)
obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
policy.reset()
policy.reset(train_env)
reset_timer.end()
if train_params.render:
......@@ -296,10 +286,9 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
agent = train_env.agents[agent_handle]
if info['action_required'][agent_handle]:
update_values[agent_handle] = True
action = policy.act(agent_obs[agent_handle], eps=eps_start)
action_count[action] += 1
actions_taken.append(action)
action = policy.act(agent_handle, agent_obs[agent_handle], eps=eps_start)
action_count[map_action(action)] += 1
actions_taken.append(map_action(action))
else:
# An action is not required if the train hasn't joined the railway network,
# if it already reached its target, or if is currently malfunctioning.
......@@ -311,42 +300,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Environment step
step_timer.start()
next_obs, all_rewards, done, info = train_env.step(action_dict)
# Reward shaping .Dead-lock .NotMoving .NotStarted
if False:
agent_positions = get_agent_positions(train_env)
for agent_handle in train_env.get_agent_handles():
agent = train_env.agents[agent_handle]
act = action_dict.get(agent_handle, RailEnvActions.MOVE_FORWARD)
if agent.status == RailAgentStatus.ACTIVE:
pos = agent.position
dir = agent.direction
possible_transitions = train_env.rail.get_transitions(*pos, dir)
num_transitions = fast_count_nonzero(possible_transitions)
if act == RailEnvActions.STOP_MOVING:
all_rewards[agent_handle] -= 2.0
if num_transitions == 1:
if act != RailEnvActions.MOVE_FORWARD:
all_rewards[agent_handle] -= 1.0
if check_for_dealock(agent_handle, train_env, agent_positions):
all_rewards[agent_handle] -= 5.0
elif agent.status == RailAgentStatus.READY_TO_DEPART:
all_rewards[agent_handle] -= 5.0
else:
if True:
agent_positions = get_agent_positions(train_env)
for agent_handle in train_env.get_agent_handles():
agent = train_env.agents[agent_handle]
act = action_dict.get(agent_handle, RailEnvActions.MOVE_FORWARD)
if agent.status == RailAgentStatus.ACTIVE:
if done[agent_handle] == False:
if check_for_dealock(agent_handle, train_env, agent_positions):
all_rewards[agent_handle] -= 1000.0
done[agent_handle] = True
next_obs, all_rewards, done, info = train_env.step(map_actions(action_dict))
step_timer.end()
# Render an episode at some interval
......@@ -365,7 +319,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
learn_timer.start()
policy.step(agent_handle,
agent_prev_obs[agent_handle],
agent_prev_action[agent_handle],
map_action_policy(agent_prev_action[agent_handle]),
all_rewards[agent_handle],
agent_obs[agent_handle],
done[agent_handle])
......@@ -415,19 +369,19 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
policy.save_replay_buffer('./replay_buffers/' + training_id + '-' + str(episode_idx) + '.pkl')
# reset action count
action_count = [0] * action_size
action_count = [0] * get_flatland_full_action_size()
print(
'\r🚂 Episode {}'
'\t 🚉 nAgents {}'
'\t 🏆 Score: {:7.3f}'
'\t 🚉 nAgents {:2}/{:2}'
' 🏆 Score: {:7.3f}'
' Avg: {:7.3f}'
'\t 💯 Done: {:6.2f}%'
' Avg: {:6.2f}%'
'\t 🎲 Epsilon: {:.3f} '
'\t 🔀 Action Probs: {}'.format(
episode_idx,
train_env_params.n_agents,
train_env_params.n_agents, number_of_agents,
normalized_score,
smoothed_normalized_score,
100 * completion,
......@@ -488,6 +442,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
writer.add_scalar("timer/learn", learn_timer.get(), episode_idx)
writer.add_scalar("timer/preproc", preproc_timer.get(), episode_idx)
writer.add_scalar("timer/total", training_timer.get_current(), episode_idx)
writer.flush()
def format_action_prob(action_probs):
......@@ -517,6 +472,7 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
score = 0.0
obs, info = env.reset(regenerate_rail=True, regenerate_schedule=True)
policy.reset(env)
final_step = 0
policy.start_episode(train=False)
......@@ -530,10 +486,10 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
action = 0
if info['action_required'][agent]:
if tree_observation.check_is_observation_valid(agent_obs[agent]):
action = policy.act(agent_obs[agent], eps=0.0)
action = policy.act(agent, agent_obs[agent], eps=0.0)
action_dict.update({agent: action})
policy.end_step(train=False)
obs, all_rewards, done, info = env.step(action_dict)
obs, all_rewards, done, info = env.step(map_actions(action_dict))
for agent in env.get_agent_handles():
score += all_rewards[agent]
......@@ -559,17 +515,18 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=25000, type=int)
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=0,
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=5000, type=int)
parser.add_argument("--n_agent_fixed", help="hold the number of agent fixed", action='store_true')
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=0,
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("--n_evaluation_episodes", help="number of evaluation episodes", default=5, type=int)
parser.add_argument("--checkpoint_interval", help="checkpoint interval", default=200, type=int)
parser.add_argument("--n_evaluation_episodes", help="number of evaluation episodes", default=10, type=int)
parser.add_argument("--checkpoint_interval", help="checkpoint interval", default=100, type=int)
parser.add_argument("--eps_start", help="max exploration", default=1.0, type=float)
parser.add_argument("--eps_end", help="min exploration", default=0.05, type=float)
parser.add_argument("--eps_end", help="min exploration", default=0.01, type=float)
parser.add_argument("--eps_decay", help="exploration decay", default=0.9975, type=float)
parser.add_argument("--buffer_size", help="replay buffer size", default=int(1e7), type=int)
parser.add_argument("--buffer_size", help="replay buffer size", default=int(32_000), type=int)
parser.add_argument("--buffer_min_size", help="min buffer size to start training", default=0, type=int)
parser.add_argument("--restore_replay_buffer", help="replay buffer to restore", default="", type=str)
parser.add_argument("--save_replay_buffer", help="save replay buffer at each evaluation interval", default=False,
......@@ -587,6 +544,10 @@ if __name__ == "__main__":
parser.add_argument("--use_fast_tree_observation", help="use FastTreeObs instead of stock TreeObs",
action='store_true')
parser.add_argument("--max_depth", help="max depth", default=2, type=int)
parser.add_argument("--policy",
help="policy name [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision]",
default="DeadLockAvoidance")
parser.add_argument("--action_size", help="define the action size [reduced,full]", default="full", type=str)
training_params = parser.parse_args()
env_params = [
......
from flatland.envs.rail_env import RailEnv
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.policy import LearningPolicy, DummyMemory
from reinforcement_learning.ppo_agent import PPOPolicy
class MultiDecisionAgent(LearningPolicy):
def __init__(self, state_size, action_size, in_parameters=None):
print(">> MultiDecisionAgent")
super(MultiDecisionAgent, self).__init__()
self.state_size = state_size
self.action_size = action_size
self.in_parameters = in_parameters
self.memory = DummyMemory()
self.loss = 0
self.ppo_policy = PPOPolicy(state_size, action_size, use_replay_buffer=False, in_parameters=in_parameters)
self.dddqn_policy = DDDQNPolicy(state_size, action_size, in_parameters)
self.policy_selector = PPOPolicy(state_size, 2)
def step(self, handle, state, action, reward, next_state, done):
self.ppo_policy.step(handle, state, action, reward, next_state, done)
self.dddqn_policy.step(handle, state, action, reward, next_state, done)
select = self.policy_selector.act(handle, state, 0.0)
self.policy_selector.step(handle, state, select, reward, next_state, done)
def act(self, handle, state, eps=0.):
select = self.policy_selector.act(handle, state, eps)
if select == 0:
return self.dddqn_policy.act(handle, state, eps)
return self.policy_selector.act(handle, state, eps)
def save(self, filename):
self.ppo_policy.save(filename)
self.dddqn_policy.save(filename)
self.policy_selector.save(filename)
def load(self, filename):
self.ppo_policy.load(filename)
self.dddqn_policy.load(filename)
self.policy_selector.load(filename)
def start_step(self, train):
self.ppo_policy.start_step(train)
self.dddqn_policy.start_step(train)
self.policy_selector.start_step(train)
def end_step(self, train):
self.ppo_policy.end_step(train)
self.dddqn_policy.end_step(train)
self.policy_selector.end_step(train)
def start_episode(self, train):
self.ppo_policy.start_episode(train)
self.dddqn_policy.start_episode(train)
self.policy_selector.start_episode(train)
def end_episode(self, train):
self.ppo_policy.end_episode(train)
self.dddqn_policy.end_episode(train)
self.policy_selector.end_episode(train)
def load_replay_buffer(self, filename):
self.ppo_policy.load_replay_buffer(filename)
self.dddqn_policy.load_replay_buffer(filename)
self.policy_selector.load_replay_buffer(filename)
def test(self):
self.ppo_policy.test()
self.dddqn_policy.test()
self.policy_selector.test()
def reset(self, env: RailEnv):
self.ppo_policy.reset(env)
self.dddqn_policy.reset(env)
self.policy_selector.reset(env)
def clone(self):
multi_descision_agent = MultiDecisionAgent(
self.state_size,
self.action_size,
self.in_parameters
)
multi_descision_agent.ppo_policy = self.ppo_policy.clone()
multi_descision_agent.dddqn_policy = self.dddqn_policy.clone()
multi_descision_agent.policy_selector = self.policy_selector.clone()
return multi_descision_agent
import numpy as np
from flatland.envs.rail_env import RailEnv
from reinforcement_learning.policy import Policy
from reinforcement_learning.ppo_agent import PPOAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
......@@ -12,7 +13,7 @@ class MultiPolicy(Policy):
self.memory = []
self.loss = 0
self.deadlock_avoidance_policy = DeadLockAvoidanceAgent(env, action_size, False)
self.ppo_policy = PPOAgent(state_size + action_size, action_size)
self.ppo_policy = PPOPolicy(state_size + action_size, action_size)
def load(self, filename):
self.ppo_policy.load(filename)
......@@ -45,9 +46,9 @@ class MultiPolicy(Policy):
self.loss = self.ppo_policy.loss
return action_ppo
def reset(self):
self.ppo_policy.reset()
self.deadlock_avoidance_policy.reset()
def reset(self, env: RailEnv):
self.ppo_policy.reset(env)
self.deadlock_avoidance_policy.reset(env)
def test(self):
self.ppo_policy.test()
......
......@@ -15,7 +15,7 @@ class OrderedPolicy(Policy):
def __init__(self):
self.action_size = 5
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
_, distance, _ = split_tree_into_feature_groups(state, 1)
distance = distance[1:]
min_dist = min_gt(distance, 0)
......
import torch.nn as nn
from flatland.envs.rail_env import RailEnv
class DummyMemory:
def __init__(self):
self.memory = []
def __len__(self):
return 0
class Policy:
def step(self, handle, state, action, reward, next_state, done):
raise NotImplementedError
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
raise NotImplementedError
def save(self, filename):
......@@ -13,16 +22,16 @@ class Policy:
def load(self, filename):
raise NotImplementedError
def start_step(self,train):
def start_step(self, train):
pass
def end_step(self,train):
def end_step(self, train):
pass
def start_episode(self,train):
def start_episode(self, train):
pass
def end_episode(self,train):
def end_episode(self, train):
pass
def load_replay_buffer(self, filename):
......@@ -31,8 +40,23 @@ class Policy:
def test(self):
pass
def reset(self):
def reset(self, env: RailEnv):
pass
def clone(self):
return self
\ No newline at end of file
return self
class HeuristicPolicy(Policy):
def __init__(self):
super(HeuristicPolicy).__init__()
class LearningPolicy(Policy):
def __init__(self):
super(LearningPolicy).__init__()
class HybridPolicy(Policy):
def __init__(self):
super(HybridPolicy).__init__()
......@@ -3,18 +3,16 @@ import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
# Hyperparameters
from reinforcement_learning.policy import Policy
from reinforcement_learning.policy import LearningPolicy
from reinforcement_learning.replay_buffer import ReplayBuffer
device = torch.device("cpu") # "cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)
# https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
class DataBuffers:
class EpisodeBuffers:
def __init__(self):
self.reset()
......@@ -36,15 +34,17 @@ class DataBuffers:
class ActorCriticModel(nn.Module):
def __init__(self, state_size, action_size, hidsize1=128, hidsize2=128):
def __init__(self, state_size, action_size, device, hidsize1=512, hidsize2=256):
super(ActorCriticModel, self).__init__()
self.device = device
self.actor = nn.Sequential(
nn.Linear(state_size, hidsize1),
nn.Tanh(),
nn.Linear(hidsize1, hidsize2),
nn.Tanh(),
nn.Linear(hidsize2, action_size)
)
nn.Linear(hidsize2, action_size),
nn.Softmax(dim=-1)
).to(self.device)
self.critic = nn.Sequential(
nn.Linear(state_size, hidsize1),
......@@ -52,18 +52,18 @@ class ActorCriticModel(nn.Module):
nn.Linear(hidsize1, hidsize2),
nn.Tanh(),
nn.Linear(hidsize2, 1)
)
).to(self.device)
def forward(self, x):
raise NotImplementedError
def act_prob(self, states, softmax_dim=0):
x = self.actor(states)
prob = F.softmax(x, dim=softmax_dim)
return prob
def get_actor_dist(self, state):
action_probs = self.actor(state)
dist = Categorical(action_probs)
return dist
def evaluate(self, states, actions):
action_probs = self.act_prob(states)
action_probs = self.actor(states)
dist = Categorical(action_probs)
action_logprobs = dist.log_prob(actions)
dist_entropy = dist.entropy()
......@@ -79,49 +79,96 @@ class ActorCriticModel(nn.Module):
if os.path.exists(filename):
print(' >> ', filename)
try:
obj.load_state_dict(torch.load(filename, map_location=device))
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
return obj
def load(self, filename):
print("load policy from file", filename)
print("load model from file", filename)
self.actor = self._load(self.actor, filename + ".actor")
self.critic = self._load(self.critic, filename + ".critic")
self.critic = self._load(self.critic, filename + ".value")
class PPOAgent(Policy):
def __init__(self, state_size, action_size):
super(PPOAgent, self).__init__()
class PPOPolicy(LearningPolicy):
def __init__(self, state_size, action_size, use_replay_buffer=False, in_parameters=None):
print(">> PPOPolicy")
super(PPOPolicy, self).__init__()
# parameters
self.learning_rate = 0.1e-3
self.gamma = 0.98
self.ppo_parameters = in_parameters
if self.ppo_parameters is not None:
self.hidsize = self.ppo_parameters.hidden_size
self.buffer_size = self.ppo_parameters.buffer_size
self.batch_size = self.ppo_parameters.batch_size
self.learning_rate = self.ppo_parameters.learning_rate
self.gamma = self.ppo_parameters.gamma
# Device
if self.ppo_parameters.use_gpu and torch.cuda.is_available():
self.device = torch.device("cuda:0")
# print("🐇 Using GPU")
else:
self.device = torch.device("cpu")
# print("🐢 Using CPU")
else:
self.hidsize = 128
self.learning_rate = 1.0e-3
self.gamma = 0.95
self.buffer_size = 32_000
self.batch_size = 1024
self.device = torch.device("cpu")
self.surrogate_eps_clip = 0.1
self.K_epoch = 3
self.weight_loss = 0.9
self.K_epoch = 10
self.weight_loss = 0.5
self.weight_entropy = 0.01
# objects
self.memory = DataBuffers()
self.buffer_min_size = 0
self.use_replay_buffer = use_replay_buffer
self.current_episode_memory = EpisodeBuffers()
self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
self.loss = 0
self.actor_critic_model = ActorCriticModel(state_size, action_size)
self.actor_critic_model = ActorCriticModel(state_size, action_size, self.device,
hidsize1=self.hidsize,
hidsize2=self.hidsize)
self.optimizer = optim.Adam(self.actor_critic_model.parameters(), lr=self.learning_rate)
self.lossFunction = nn.MSELoss()
self.loss_function = nn.MSELoss() # nn.SmoothL1Loss()
def reset(self):
def reset(self, env):
pass
def act(self, state, eps=None):
def act(self, handle, state, eps=None):
# sample an action to take
prob = self.actor_critic_model.act_prob(torch.from_numpy(state).float())
return Categorical(prob).sample().item()
torch_state = torch.tensor(state, dtype=torch.float).to(self.device)
dist = self.actor_critic_model.get_actor_dist(torch_state)
action = dist.sample()
return action.item()
def step(self, handle, state, action, reward, next_state, done):
# record transitions ([state] -> [action] -> [reward, nextstate, done])
prob = self.actor_critic_model.act_prob(torch.from_numpy(state).float())
transition = (state, action, reward, next_state, prob[action].item(), done)
self.memory.push_transition(handle, transition)
# record transitions ([state] -> [action] -> [reward, next_state, done])
torch_action = torch.tensor(action, dtype=torch.float).to(self.device)
torch_state = torch.tensor(state, dtype=torch.float).to(self.device)
# evaluate actor
dist = self.actor_critic_model.get_actor_dist(torch_state)
action_logprobs = dist.log_prob(torch_action)
transition = (state, action, reward, next_state, action_logprobs.item(), done)
self.current_episode_memory.push_transition(handle, transition)
def _push_transitions_to_replay_buffer(self,
state_list,
action_list,
reward_list,
state_next_list,
done_list,
prob_a_list):
for idx in range(len(reward_list)):
state_i = state_list[idx]
action_i = action_list[idx]
reward_i = reward_list[idx]
state_next_i = state_next_list[idx]
done_i = done_list[idx]
prob_action_i = prob_a_list[idx]
self.memory.add(state_i, action_i, reward_i, state_next_i, done_i, prob_action_i)
def _convert_transitions_to_torch_tensors(self, transitions_array):
# build empty lists(arrays)
......@@ -138,60 +185,87 @@ class PPOAgent(Policy):
discounted_reward = 0
done_list.insert(0, 1)
else:
discounted_reward = reward_i + self.gamma * discounted_reward
done_list.insert(0, 0)
discounted_reward = reward_i + self.gamma * discounted_reward
reward_list.insert(0, discounted_reward)
state_next_list.insert(0, state_next_i)
prob_a_list.insert(0, prob_action_i)
if self.use_replay_buffer:
self._push_transitions_to_replay_buffer(state_list, action_list,
reward_list, state_next_list,
done_list, prob_a_list)
# convert data to torch tensors
states, actions, rewards, states_next, dones, prob_actions = \
torch.tensor(state_list, dtype=torch.float).to(device), \
torch.tensor(action_list).to(device), \
torch.tensor(reward_list, dtype=torch.float).to(device), \
torch.tensor(state_next_list, dtype=torch.float).to(device), \
torch.tensor(done_list, dtype=torch.float).to(device), \
torch.tensor(prob_a_list).to(device)
# standard-normalize rewards
rewards = (rewards - rewards.mean()) / (rewards.std() + 1.e-5)
torch.tensor(state_list, dtype=torch.float).to(self.device), \
torch.tensor(action_list).to(self.device), \
torch.tensor(reward_list, dtype=torch.float).to(self.device), \
torch.tensor(state_next_list, dtype=torch.float).to(self.device), \
torch.tensor(done_list, dtype=torch.float).to(self.device), \
torch.tensor(prob_a_list).to(self.device)
return states, actions, rewards, states_next, dones, prob_actions
def _get_transitions_from_replay_buffer(self, states, actions, rewards, states_next, dones, probs_action):
if len(self.memory) > self.buffer_min_size and len(self.memory) > self.batch_size:
states, actions, rewards, states_next, dones, probs_action = self.memory.sample()
actions = torch.squeeze(actions)
rewards = torch.squeeze(rewards)
states_next = torch.squeeze(states_next)
dones = torch.squeeze(dones)
probs_action = torch.squeeze(probs_action)
return states, actions, rewards, states_next, dones, probs_action
def train_net(self):
for handle in range(len(self.memory)):
agent_episode_history = self.memory.get_transitions(handle)
# All agents have to propagate their experiences made during past episode
for handle in range(len(self.current_episode_memory)):
# Extract agent's episode history (list of all transitions)
agent_episode_history = self.current_episode_memory.get_transitions(handle)
if len(agent_episode_history) > 0:
# convert the replay buffer to torch tensors (arrays)
# Convert the replay buffer to torch tensors (arrays)
states, actions, rewards, states_next, dones, probs_action = \
self._convert_transitions_to_torch_tensors(agent_episode_history)
# Optimize policy for K epochs:
for _ in range(self.K_epoch):
# evaluating actions (actor) and values (critic)
for k_loop in range(int(self.K_epoch)):
if self.use_replay_buffer:
states, actions, rewards, states_next, dones, probs_action = \
self._get_transitions_from_replay_buffer(
states, actions, rewards, states_next, dones, probs_action
)
# Evaluating actions (actor) and values (critic)
logprobs, state_values, dist_entropy = self.actor_critic_model.evaluate(states, actions)
# finding the ratios (pi_thetas / pi_thetas_replayed):
# Finding the ratios (pi_thetas / pi_thetas_replayed):
ratios = torch.exp(logprobs - probs_action.detach())
# finding Surrogate Loss:
# Finding Surrogate Loss
advantages = rewards - state_values.detach()
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - self.surrogate_eps_clip, 1 + self.surrogate_eps_clip) * advantages
surr2 = torch.clamp(ratios, 1. - self.surrogate_eps_clip, 1. + self.surrogate_eps_clip) * advantages
# The loss function is used to estimate the gradient. An entropy-based heuristic penalizes
# the loss when the policy becomes deterministic, because in that case the gradient would
# become very flat and therefore no longer useful.
loss = \
-torch.min(surr1, surr2) \
+ self.weight_loss * self.lossFunction(state_values, rewards) \
+ self.weight_loss * self.loss_function(state_values, rewards) \
- self.weight_entropy * dist_entropy
# make a gradient step
# Make a gradient step
self.optimizer.zero_grad()
loss.mean().backward()
self.optimizer.step()
# store current loss to the agent
self.loss = loss.mean().detach().numpy()
# Transfer the current loss to the agent's loss (information), for debugging purposes only
self.loss = loss.mean().detach().cpu().numpy()
self.memory.reset()
# Reset all collected transition data
self.current_episode_memory.reset()
def end_episode(self, train):
if train:
......@@ -207,9 +281,11 @@ class PPOAgent(Policy):
if os.path.exists(filename):
print(' >> ', filename)
try:
obj.load_state_dict(torch.load(filename, map_location=device))
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
else:
print(" >> file not found!")
return obj
def load(self, filename):
......@@ -219,7 +295,7 @@ class PPOAgent(Policy):
self.optimizer = self._load(self.optimizer, filename + ".optimizer")
def clone(self):
policy = PPOAgent(self.state_size, self.action_size)
policy = PPOPolicy(self.state_size, self.action_size)
policy.actor_critic_model = copy.deepcopy(self.actor_critic_model)
policy.optimizer = copy.deepcopy(self.optimizer)
return self