
Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (87)
Showing 512 additions and 223 deletions
This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
---

You can use either the full or the reduced action space for your own experiments. The `map_action` helper maps reduced actions back to real Flatland actions:

```python
def map_action(action):
    # if full action space is used -> no mapping required
    if get_action_size() == get_flatland_full_action_size():
        return action

    # if reduced action space is used -> the action has to be mapped to real Flatland actions.
    # The reduced action space removes the DO_NOTHING action from Flatland.
    if action == 0:
        return RailEnvActions.MOVE_LEFT
    if action == 1:
        return RailEnvActions.MOVE_FORWARD
    if action == 2:
        return RailEnvActions.MOVE_RIGHT
    if action == 3:
        return RailEnvActions.STOP_MOVING
```
Select the action space by calling

```python
set_action_size_full()
```
or
```python
set_action_size_reduced()
```
The reduced action space just removes DO_NOTHING from the full Flatland action space.
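As an illustration, here is a minimal sketch of how these helpers could fit together. The function names (`get_action_size`, `get_flatland_full_action_size`, `set_action_size_full`, `set_action_size_reduced`, `map_rail_env_action`) are used elsewhere in this repo (presumably in `utils/agent_action_config.py`), but the bodies below are assumptions, not the repo's implementation:

```python
# Hypothetical sketch of the action-size helpers; see utils/agent_action_config.py
# for the real implementation.

# Flatland's full action space: DO_NOTHING, MOVE_LEFT, MOVE_FORWARD, MOVE_RIGHT, STOP_MOVING
_FLATLAND_FULL_ACTION_SIZE = 5
_action_size = _FLATLAND_FULL_ACTION_SIZE


def get_flatland_full_action_size():
    return _FLATLAND_FULL_ACTION_SIZE


def get_action_size():
    return _action_size


def set_action_size_full():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE


def set_action_size_reduced():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE - 1  # drop DO_NOTHING


def map_rail_env_action(action):
    # Assumed inverse of map_action: convert a real Flatland action into the
    # reduced action index (MOVE_LEFT..STOP_MOVING -> 0..3).
    if get_action_size() == get_flatland_full_action_size():
        return action
    return max(0, int(action) - 1)
```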
---
The policy is based on the FastTreeObs observation from the official starter kit - NeurIPS 2020 Flatland Challenge, but the FastTreeObs in this repo is an extended version (a usage sketch follows the link below):
[fast_tree_obs.py](./utils/fast_tree_obs.py)
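A minimal sketch of plugging FastTreeObs into a Flatland environment, assuming a `FastTreeObs(max_depth=...)` constructor (suggested by the training script's `--max_depth` argument); check `utils/fast_tree_obs.py` for the actual signature:

```python
# Sketch only: the FastTreeObs constructor arguments are assumed, not verified.
from flatland.envs.rail_env import RailEnv
from flatland.envs.rail_generators import sparse_rail_generator

from utils.fast_tree_obs import FastTreeObs

obs_builder = FastTreeObs(max_depth=2)  # assumed constructor parameter

env = RailEnv(
    width=30,
    height=30,
    rail_generator=sparse_rail_generator(max_num_cities=3),
    number_of_agents=5,
    obs_builder_object=obs_builder,  # standard Flatland hook for custom observations
)
obs, info = env.reset()
```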
---
Have a look at the [run.py](./run.py) file. There you can select whether to use PPO or DDDQN as the RL agent:
```python
####################################################
# EVALUATION PARAMETERS
set_action_size_full()

# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116591 adrian_egli
    # graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
    # Fri, 22 Jan 2021 23:37:56
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116658 adrian_egli
    # graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
    # Sat, 23 Jan 2021 07:41:35
    set_action_size_reduced()
    load_policy = "PPO"
    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
    EPSILON = 0.0

if True:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116659 adrian_egli
    # graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
    # Sat, 23 Jan 2021 07:45:49
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # !! This is not a RL solution !!!!
    # -------------------------------------------------------------------------------------------------------
    # 116727 adrian_egli
    # graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
    # Sat, 23 Jan 2021 14:31:50
    set_action_size_reduced()
    load_policy = "DeadLockAvoidance"
    checkpoint = None
    EPSILON = 0.0
```
---
A deadlock avoidance agent is also implemented. It only lets each train take its shortest route and tries to avoid as many deadlocks as possible (a simplified sketch of the idea follows the link below):
* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)
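Purely for intuition, here is a simplified sketch of that idea. The helpers `shortest_path_action` and `next_cell_is_blocked_by_opposing_train` are hypothetical and only name the two decisions involved; this is not the repo's implementation, which lives in `utils/dead_lock_avoidance_agent.py`:

```python
# Simplified illustration of deadlock avoidance along the shortest path.
# `shortest_path_action` and `next_cell_is_blocked_by_opposing_train` are
# hypothetical helpers used only to show the decision logic.
from flatland.envs.rail_env import RailEnvActions


def deadlock_avoidance_action(env, handle):
    # Always follow the shortest route ...
    action = shortest_path_action(env, handle)

    # ... but stop if entering the next cell could create a head-on deadlock.
    if next_cell_is_blocked_by_opposing_train(env, handle, action):
        return RailEnvActions.STOP_MOVING
    return action
```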
---
The policy interface has changed; please have a look at the file below (a sketch of the interface follows the link):
* [policy.py](./reinforcement_learning/policy.py)
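For orientation, here is a minimal sketch of the policy interface, inferred from the method names the agents in this diff implement; the actual base classes in `policy.py` (e.g. `Policy`, `LearningPolicy`, `HybridPolicy`, `DummyMemory`) may differ in detail:

```python
# Minimal sketch of the policy interface, inferred from the methods used by the
# agents in this diff; see reinforcement_learning/policy.py for the real definition.
class Policy:
    def act(self, handle, state, eps=0.0):
        raise NotImplementedError

    def step(self, handle, state, action, reward, next_state, done):
        raise NotImplementedError

    def save(self, filename):
        pass

    def load(self, filename):
        pass

    # Hooks called around each environment step and episode during training.
    def start_step(self, train):
        pass

    def end_step(self, train):
        pass

    def start_episode(self, train):
        pass

    def end_episode(self, train):
        pass

    def load_replay_buffer(self, filename):
        pass

    def test(self):
        pass

    def reset(self, env):
        pass

    def clone(self):
        return self  # subclasses may return a deep copy instead
```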
---
See the tensorboard training output to get some insights:
```
tensorboard --logdir ./runs_bench
```

---
```
python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000 --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
```

[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py) has new or changed parameters. The most important ones for training are:
* policy : [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : default value = DeadLockAvoidance
* use_fast_tree_observation : [false, true] : default value = true
* action_size : [full, reduced] : default value = full

```
usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
@@ -42,12 +144,16 @@
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
                               [--render] [--load_policy LOAD_POLICY]
                               [--use_fast_tree_observation]
                               [--max_depth MAX_DEPTH] [--policy POLICY]
                               [--action_size ACTION_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
@@ -82,20 +188,40 @@
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render              render 1 episode in 100
  --load_policy LOAD_POLICY
                        policy filename (reference) to load
  --use_fast_tree_observation
                        use FastTreeObs instead of stock TreeObs
  --max_depth MAX_DEPTH
                        max depth
  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
                        DeadLockAvoidanceWithDecision, MultiDecision]
  --action_size ACTION_SIZE
                        define the action size [reduced,full]
```
[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)
[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)

---
If you have any questions, write to me on the official Discord channel **aiAdrian**
(Adrian Egli - adrian.egli@gmail.com).
Credits
---
* Florian Laurent <florian@aicrowd.com>
* Erik Nygren <erik.nygren@sbb.ch>
* Adrian Egli <adrian.egli@sbb.ch>
* Sharada Mohanty <mohanty@aicrowd.com>
* Christian Baumberger <christian.baumberger@sbb.ch>
* Guillaume Mollard <guillaume.mollard2@gmail.com>
Main links
---
* [Flatland documentation](https://flatland.aicrowd.com/)
* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)

Communication
---
5 files deleted, 7 files added (file names and contents are not shown in this view).
@@ -2,7 +2,6 @@ import copy
import os
import pickle
import random

import numpy as np
import torch
@@ -10,14 +9,18 @@ import torch.nn.functional as F
import torch.optim as optim

from reinforcement_learning.model import DuelingQNetwork
from reinforcement_learning.policy import Policy, LearningPolicy
from reinforcement_learning.replay_buffer import ReplayBuffer


class DDDQNPolicy(LearningPolicy):
    """Dueling Double DQN policy"""

    def __init__(self, state_size, action_size, in_parameters, evaluation_mode=False):
        print(">> DDDQNPolicy")
        super(Policy, self).__init__()
        self.ddqn_parameters = in_parameters
        self.evaluation_mode = evaluation_mode
        self.state_size = state_size
@@ -26,17 +29,17 @@ class DDDQNPolicy(Policy):
        self.hidsize = 128

        if not evaluation_mode:
            self.hidsize = self.ddqn_parameters.hidden_size
            self.buffer_size = self.ddqn_parameters.buffer_size
            self.batch_size = self.ddqn_parameters.batch_size
            self.update_every = self.ddqn_parameters.update_every
            self.learning_rate = self.ddqn_parameters.learning_rate
            self.tau = self.ddqn_parameters.tau
            self.gamma = self.ddqn_parameters.gamma
            self.buffer_min_size = self.ddqn_parameters.buffer_min_size

        # Device
        if self.ddqn_parameters.use_gpu and torch.cuda.is_available():
            self.device = torch.device("cuda:0")
            # print("🐇 Using GPU")
        else:
@@ -44,18 +47,22 @@ class DDDQNPolicy(Policy):
            # print("🐢 Using CPU")

        # Q-Network
        self.qnetwork_local = DuelingQNetwork(state_size,
                                              action_size,
                                              hidsize1=self.hidsize,
                                              hidsize2=self.hidsize).to(self.device)

        if not evaluation_mode:
            self.qnetwork_target = copy.deepcopy(self.qnetwork_local)
            self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.learning_rate)
            self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
            self.t_step = 0
            self.loss = 0.0
        else:
            self.memory = ReplayBuffer(action_size, 1, 1, self.device)
            self.loss = 0.0

    def act(self, handle, state, eps=0.):
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.qnetwork_local.eval()
        with torch.no_grad():
@@ -66,10 +73,6 @@ class DDDQNPolicy(Policy):

        # Epsilon-greedy action selection
        if random.random() >= eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))
@@ -88,7 +91,7 @@ class DDDQNPolicy(Policy):
    def _learn(self):
        experiences = self.memory.sample()
        states, actions, rewards, next_states, dones, _ = experiences

        # Get expected Q values from local model
        q_expected = self.qnetwork_local(states).gather(1, actions)
@@ -151,63 +154,11 @@ class DDDQNPolicy(Policy):
                self.memory.memory = pickle.load(f)

    def test(self):
        self.act(0, np.array([[0] * self.state_size]))
        self._learn()

    def clone(self):
        me = DDDQNPolicy(self.state_size, self.action_size, self.ddqn_parameters, evaluation_mode=True)
        me.qnetwork_target = copy.deepcopy(self.qnetwork_local)
        me.qnetwork_target = copy.deepcopy(self.qnetwork_target)
        return me
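For context, here is a minimal sketch of how a policy with this interface can be driven from a loop. The hyper-parameter names mirror the attributes read from `in_parameters` above, but the values, the `state_size`, and the loop itself are illustrative assumptions, not the repo's `multi_agent_training.py`:

```python
# Illustrative only: a tiny driving loop for a policy with the act/step interface
# shown above (one call per agent handle). Not the repo's training script.
from argparse import Namespace

import numpy as np

from reinforcement_learning.dddqn_policy import DDDQNPolicy

params = Namespace(hidden_size=128, buffer_size=32_000, batch_size=128,
                   update_every=8, learning_rate=0.5e-4, tau=1e-3, gamma=0.99,
                   buffer_min_size=0, use_gpu=False)

state_size = 56   # assumed observation size (e.g. FastTreeObs feature vector)
action_size = 4   # reduced action space
policy = DDDQNPolicy(state_size, action_size, params)

state = np.zeros(state_size, dtype=np.float32)
for handle in range(5):                               # one decision per agent
    action = policy.act(handle, state, eps=0.1)
    next_state, reward, done = state, 0.0, False      # placeholder transition
    policy.step(handle, state, action, reward, next_state, done)
```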
The in-file replay buffer below (together with the removed `from collections import namedtuple, deque, Iterable` import) was deleted from the DDDQN policy file; the ReplayBuffer is now imported from reinforcement_learning.replay_buffer instead. The removed code is shown for reference:

Experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, device):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.device = device

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = Experience(np.expand_dims(state, 0), action, reward, np.expand_dims(next_state, 0), done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(self.__v_stack_impr([e.state for e in experiences if e is not None])) \
            .float().to(self.device)
        actions = torch.from_numpy(self.__v_stack_impr([e.action for e in experiences if e is not None])) \
            .long().to(self.device)
        rewards = torch.from_numpy(self.__v_stack_impr([e.reward for e in experiences if e is not None])) \
            .float().to(self.device)
        next_states = torch.from_numpy(self.__v_stack_impr([e.next_state for e in experiences if e is not None])) \
            .float().to(self.device)
        dones = torch.from_numpy(self.__v_stack_impr([e.done for e in experiences if e is not None]).astype(np.uint8)) \
            .float().to(self.device)

        return states, actions, rewards, next_states, dones

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

    def __v_stack_impr(self, states):
        sub_dim = len(states[0][0]) if isinstance(states[0], Iterable) else 1
        np_states = np.reshape(np.array(states), (len(states), sub_dim))
        return np_states
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.rail_env import RailEnv, RailEnvActions

from reinforcement_learning.policy import HybridPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import map_rail_env_action
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent


class DeadLockAvoidanceWithDecisionAgent(HybridPolicy):

    def __init__(self, env: RailEnv, state_size, action_size, learning_agent):
        print(">> DeadLockAvoidanceWithDecisionAgent")
        super(DeadLockAvoidanceWithDecisionAgent, self).__init__()
        self.env = env
        self.state_size = state_size
        self.action_size = action_size
        self.learning_agent = learning_agent
        self.dead_lock_avoidance_agent = DeadLockAvoidanceAgent(self.env, action_size, False)
        self.policy_selector = PPOPolicy(state_size, 2)
        self.memory = self.learning_agent.memory
        self.loss = self.learning_agent.loss

    def step(self, handle, state, action, reward, next_state, done):
        select = self.policy_selector.act(handle, state, 0.0)
        self.policy_selector.step(handle, state, select, reward, next_state, done)
        self.dead_lock_avoidance_agent.step(handle, state, action, reward, next_state, done)
        self.learning_agent.step(handle, state, action, reward, next_state, done)
        self.loss = self.learning_agent.loss

    def act(self, handle, state, eps=0.):
        select = self.policy_selector.act(handle, state, eps)
        if select == 0:
            return self.learning_agent.act(handle, state, eps)
        return self.dead_lock_avoidance_agent.act(handle, state, -1.0)

    def save(self, filename):
        self.dead_lock_avoidance_agent.save(filename)
        self.learning_agent.save(filename)
        self.policy_selector.save(filename + '.selector')

    def load(self, filename):
        self.dead_lock_avoidance_agent.load(filename)
        self.learning_agent.load(filename)
        self.policy_selector.load(filename + '.selector')

    def start_step(self, train):
        self.dead_lock_avoidance_agent.start_step(train)
        self.learning_agent.start_step(train)
        self.policy_selector.start_step(train)

    def end_step(self, train):
        self.dead_lock_avoidance_agent.end_step(train)
        self.learning_agent.end_step(train)
        self.policy_selector.end_step(train)

    def start_episode(self, train):
        self.dead_lock_avoidance_agent.start_episode(train)
        self.learning_agent.start_episode(train)
        self.policy_selector.start_episode(train)

    def end_episode(self, train):
        self.dead_lock_avoidance_agent.end_episode(train)
        self.learning_agent.end_episode(train)
        self.policy_selector.end_episode(train)

    def load_replay_buffer(self, filename):
        self.dead_lock_avoidance_agent.load_replay_buffer(filename)
        self.learning_agent.load_replay_buffer(filename)
        self.policy_selector.load_replay_buffer(filename + ".selector")

    def test(self):
        self.dead_lock_avoidance_agent.test()
        self.learning_agent.test()
        self.policy_selector.test()

    def reset(self, env: RailEnv):
        self.env = env
        self.dead_lock_avoidance_agent.reset(env)
        self.learning_agent.reset(env)
        self.policy_selector.reset(env)

    def clone(self):
        return self
@@ -26,7 +26,8 @@ from utils.observation_utils import normalize_observation
from reinforcement_learning.dddqn_policy import DDDQNPolicy


def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render,
                allow_skipping, allow_caching):
    # Evaluation is faster on CPU (except if you use a really huge policy)
    parameters = {
        'use_gpu': False
@@ -140,11 +141,12 @@ def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size,
            else:
                preproc_timer.start()
                norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth,
                                                 observation_radius=observation_radius)
                preproc_timer.end()

            inference_timer.start()
            action = policy.act(agent, norm_obs, eps=0.0)
            inference_timer.end()
            action_dict.update({agent: action})
@@ -319,12 +321,15 @@ def evaluate_agents(file, n_evaluation_episodes, use_gpu, render, allow_skipping
    results = []
    if render:
        results.append(
            eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping,
                        allow_caching))
    else:
        with Pool() as p:
            results = p.starmap(eval_policy,
                                [(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render,
                                  allow_skipping, allow_caching)
                                 for seed in
                                 range(total_nb_eval)])
@@ -367,10 +372,12 @@ if __name__ == "__main__":
    parser.add_argument("--use_gpu", dest="use_gpu", help="use GPU if available", action='store_true')
    parser.add_argument("--render", help="render a single episode", action='store_true')
    parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked",
                        action='store_true')
    parser.add_argument("--allow_caching", help="caches the last observation-action pair", action='store_true')

    args = parser.parse_args()
    os.environ["OMP_NUM_THREADS"] = str(1)
    evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu,
                    render=args.render,
                    allow_skipping=args.allow_skipping, allow_caching=args.allow_caching)
from flatland.envs.rail_env import RailEnv

from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.policy import LearningPolicy, DummyMemory
from reinforcement_learning.ppo_agent import PPOPolicy


class MultiDecisionAgent(LearningPolicy):

    def __init__(self, state_size, action_size, in_parameters=None):
        print(">> MultiDecisionAgent")
        super(MultiDecisionAgent, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.in_parameters = in_parameters
        self.memory = DummyMemory()
        self.loss = 0

        self.ppo_policy = PPOPolicy(state_size, action_size, use_replay_buffer=False, in_parameters=in_parameters)
        self.dddqn_policy = DDDQNPolicy(state_size, action_size, in_parameters)
        self.policy_selector = PPOPolicy(state_size, 2)

    def step(self, handle, state, action, reward, next_state, done):
        self.ppo_policy.step(handle, state, action, reward, next_state, done)
        self.dddqn_policy.step(handle, state, action, reward, next_state, done)
        select = self.policy_selector.act(handle, state, 0.0)
        self.policy_selector.step(handle, state, select, reward, next_state, done)

    def act(self, handle, state, eps=0.):
        select = self.policy_selector.act(handle, state, eps)
        if select == 0:
            return self.dddqn_policy.act(handle, state, eps)
        return self.policy_selector.act(handle, state, eps)

    def save(self, filename):
        self.ppo_policy.save(filename)
        self.dddqn_policy.save(filename)
        self.policy_selector.save(filename)

    def load(self, filename):
        self.ppo_policy.load(filename)
        self.dddqn_policy.load(filename)
        self.policy_selector.load(filename)

    def start_step(self, train):
        self.ppo_policy.start_step(train)
        self.dddqn_policy.start_step(train)
        self.policy_selector.start_step(train)

    def end_step(self, train):
        self.ppo_policy.end_step(train)
        self.dddqn_policy.end_step(train)
        self.policy_selector.end_step(train)

    def start_episode(self, train):
        self.ppo_policy.start_episode(train)
        self.dddqn_policy.start_episode(train)
        self.policy_selector.start_episode(train)

    def end_episode(self, train):
        self.ppo_policy.end_episode(train)
        self.dddqn_policy.end_episode(train)
        self.policy_selector.end_episode(train)

    def load_replay_buffer(self, filename):
        self.ppo_policy.load_replay_buffer(filename)
        self.dddqn_policy.load_replay_buffer(filename)
        self.policy_selector.load_replay_buffer(filename)

    def test(self):
        self.ppo_policy.test()
        self.dddqn_policy.test()
        self.policy_selector.test()

    def reset(self, env: RailEnv):
        self.ppo_policy.reset(env)
        self.dddqn_policy.reset(env)
        self.policy_selector.reset(env)

    def clone(self):
        multi_descision_agent = MultiDecisionAgent(
            self.state_size,
            self.action_size,
            self.in_parameters
        )
        multi_descision_agent.ppo_policy = self.ppo_policy.clone()
        multi_descision_agent.dddqn_policy = self.dddqn_policy.clone()
        multi_descision_agent.policy_selector = self.policy_selector.clone()
        return multi_descision_agent
import numpy as np

from flatland.envs.rail_env import RailEnv
from reinforcement_learning.policy import Policy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent


class MultiPolicy(Policy):
@@ -13,20 +12,20 @@ class MultiPolicy(Policy):
        self.action_size = action_size
        self.memory = []
        self.loss = 0
        self.deadlock_avoidance_policy = DeadLockAvoidanceAgent(env, action_size, False)
        self.ppo_policy = PPOPolicy(state_size + action_size, action_size)

    def load(self, filename):
        self.ppo_policy.load(filename)
        self.deadlock_avoidance_policy.load(filename)

    def save(self, filename):
        self.ppo_policy.save(filename)
        self.deadlock_avoidance_policy.save(filename)

    def step(self, handle, state, action, reward, next_state, done):
        action_extra_state = self.deadlock_avoidance_policy.act(handle, state, 0.0)
        action_extra_next_state = self.deadlock_avoidance_policy.act(handle, next_state, 0.0)

        extended_state = np.copy(state)
        for action_itr in np.arange(self.action_size):
@@ -35,11 +34,11 @@ class MultiPolicy(Policy):
        for action_itr in np.arange(self.action_size):
            extended_next_state = np.append(extended_next_state, [int(action_extra_next_state == action_itr)])

        self.deadlock_avoidance_policy.step(handle, state, action, reward, next_state, done)
        self.ppo_policy.step(handle, extended_state, action, reward, extended_next_state, done)

    def act(self, handle, state, eps=0.):
        action_extra_state = self.deadlock_avoidance_policy.act(handle, state, 0.0)
        extended_state = np.copy(state)
        for action_itr in np.arange(self.action_size):
            extended_state = np.append(extended_state, [int(action_extra_state == action_itr)])
@@ -47,18 +46,18 @@ class MultiPolicy(Policy):
        self.loss = self.ppo_policy.loss
        return action_ppo

    def reset(self, env: RailEnv):
        self.ppo_policy.reset(env)
        self.deadlock_avoidance_policy.reset(env)

    def test(self):
        self.ppo_policy.test()
        self.deadlock_avoidance_policy.test()

    def start_step(self, train):
        self.deadlock_avoidance_policy.start_step(train)
        self.ppo_policy.start_step(train)

    def end_step(self, train):
        self.deadlock_avoidance_policy.end_step(train)
        self.ppo_policy.end_step(train)
@@ -15,7 +15,7 @@ class OrderedPolicy(Policy):
    def __init__(self):
        self.action_size = 5

    def act(self, handle, state, eps=0.):
        _, distance, _ = split_tree_into_feature_groups(state, 1)
        distance = distance[1:]
        min_dist = min_gt(distance, 0)