Compare revisions

Changes are shown as if the source revision was being merged into the target revision.
Commits on Source (33), showing 276 additions and 78 deletions.

🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
===
🚂 This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
---
This starter kit contains 2 example policies to get started with this challenge:
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
For your own experiments you can use either the full or the reduced action space. The helper below maps actions from the reduced action space back to Flatland actions:
```python
def map_action(action):
    # if full action space is used -> no mapping required
    if get_action_size() == get_flatland_full_action_size():
        return action

    # if reduced action space is used -> the action has to be mapped to real flatland actions
    # The reduced action space removes the DO_NOTHING action from Flatland.
    if action == 0:
        return RailEnvActions.MOVE_LEFT
    if action == 1:
        return RailEnvActions.MOVE_FORWARD
    if action == 2:
        return RailEnvActions.MOVE_RIGHT
    if action == 3:
        return RailEnvActions.STOP_MOVING
```
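A short usage sketch for the mapping (not taken verbatim from the repo): the policy picks actions in the (possibly reduced) space, and `map_actions` from [utils/agent_action_config.py](./utils/agent_action_config.py) is assumed to apply `map_action` to every entry of the action dictionary before the environment is stepped. `env`, `obs`, `info` and `policy` are placeholders, and the `policy.act(handle, state, eps)` call follows this repo's policy interface.

```python
from utils.agent_action_config import map_actions

# Sketch only: `env` is a Flatland RailEnv, `policy` a trained policy,
# `obs`/`info` come from a previous env.reset()/env.step() call.
action_dict = {}
for agent_handle in env.get_agent_handles():
    if info['action_required'][agent_handle]:
        # The policy acts in the (possibly reduced) action space ...
        action_dict[agent_handle] = policy.act(agent_handle, obs[agent_handle], 0.0)

# ... and map_actions translates the whole dict back to Flatland actions.
obs, all_rewards, done, info = env.step(map_actions(action_dict))
```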
**🔗 [Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
Use

```python
set_action_size_full()
```

or

```python
set_action_size_reduced()
```

to select the action space. The reduced action space just removes DO_NOTHING.
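The two switches live in [utils/agent_action_config.py](./utils/agent_action_config.py). As a rough, hypothetical sketch of what such a toggle can look like (the actual implementation in the repo may differ):

```python
# Hypothetical sketch of an action-size toggle; not a copy of utils/agent_action_config.py.
_FLATLAND_FULL_ACTION_SIZE = 5  # DO_NOTHING, MOVE_LEFT, MOVE_FORWARD, MOVE_RIGHT, STOP_MOVING
_action_size = _FLATLAND_FULL_ACTION_SIZE


def set_action_size_full():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE


def set_action_size_reduced():
    global _action_size
    _action_size = _FLATLAND_FULL_ACTION_SIZE - 1  # drop DO_NOTHING


def get_action_size():
    return _action_size


def get_flatland_full_action_size():
    return _FLATLAND_FULL_ACTION_SIZE
```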
**🔗 [Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
---
The policy is based on the FastTreeObs observation from the official starter kit (NeurIPS 2020 Flatland Challenge), but the FastTreeObs in this repo is an extended version:
[fast_tree_obs.py](./utils/fast_tree_obs.py)
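A hedged sketch of how the observation builder is wired up; `FastTreeObs(max_depth=...)` mirrors the `--max_depth` training parameter, and the commented lines follow the calls visible in the multi_agent_training.py diff further down:

```python
from utils.fast_tree_obs import FastTreeObs

# Assumption: FastTreeObs takes a max_depth argument, mirroring the --max_depth CLI flag.
tree_observation = FastTreeObs(max_depth=2)

# In multi_agent_training.py the builder is then handed to the env factory, roughly:
# train_env = create_rail_env(train_env_params, tree_observation)
# obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
```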
**🔗 [Submit a trained policy](https://flatland.aicrowd.com/getting-started/first-submission.html)**
---
Have a look at the [run.py](./run.py) file. There you can select whether PPO or DDDQN is used as the RL agent.
```python
####################################################
# EVALUATION PARAMETERS
set_action_size_full()

# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116591 adrian_egli
    # graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
    # Fri, 22 Jan 2021 23:37:56
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116658 adrian_egli
    # graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
    # Sat, 23 Jan 2021 07:41:35
    set_action_size_reduced()
    load_policy = "PPO"
    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
    EPSILON = 0.0

if True:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116659 adrian_egli
    # graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
    # Sat, 23 Jan 2021 07:45:49
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # !! This is not a RL solution !!!!
    # -------------------------------------------------------------------------------------------------------
    # 116727 adrian_egli
    # graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
    # Sat, 23 Jan 2021 14:31:50
    set_action_size_reduced()
    load_policy = "DeadLockAvoidance"
    checkpoint = None
    EPSILON = 0.0
```
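Only one of the `if` blocks above should be set to `True` at a time. Further down, run.py turns the chosen `load_policy` and `checkpoint` into a policy object; the sketch below condenses that selection chain (the `checkpoint is not None` guard is added here for clarity and is not necessarily present in the file):

```python
from argparse import Namespace

from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# state_size and local_env are created earlier in run.py (placeholders here).
if load_policy == "DDDQN":
    policy = DDDQNPolicy(state_size, get_action_size(),
                         Namespace(**{'use_gpu': False}), evaluation_mode=True)
elif load_policy == "PPO":
    policy = PPOPolicy(state_size, get_action_size())
elif load_policy == "DeadLockAvoidance":
    policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
else:
    policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False,
                       in_parameters=Namespace(**{'use_gpu': False}))

if checkpoint is not None:  # the DeadLockAvoidance heuristic runs without a checkpoint
    policy.load(checkpoint)
policy.reset(local_env)
```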
The single-agent example is meant as a minimal demonstration of how to use DQN. The multi-agent example is a better starting point for creating your own solution.
---
A deadlock-avoidance agent is implemented as well. It only lets each train take its shortest route and tries to avoid as many deadlocks as possible; a usage sketch follows the link below.
* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)
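A hedged sketch of driving the deadlock-avoidance agent during evaluation, following the `start_episode`/`start_step`/`act`/`end_step` pattern used in run.py; the exact observation format and `act` signature are assumptions:

```python
from utils.agent_action_config import get_action_size, map_actions
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# `local_env` is an already created Flatland RailEnv (placeholder here).
policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
policy.start_episode(train=False)

obs, info = local_env.reset()
done = {'__all__': False}
steps = 0
while not done['__all__']:
    policy.start_step(train=False)
    action_dict = {}
    for agent_handle in local_env.get_agent_handles():
        # Assumption: the agent only needs its handle and the current step as "observation".
        action_dict[agent_handle] = policy.act(agent_handle, [agent_handle, steps], 0.0)
    policy.end_step(train=False)
    obs, all_rewards, done, info = local_env.step(map_actions(action_dict))
    steps += 1
```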
You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)
Sample training usage
---
The policy interface has changed; please have a look at
* [policy.py](./reinforcement_learning/policy.py)
Train the multi-agent policy for 150 episodes:
```bash
python reinforcement_learning/multi_agent_training.py -n 150
```

---
See the tensorboard training output to get some insights:
```
tensorboard --logdir ./runs_bench
```
The multi-agent policy training can be tuned using command-line arguments:
---
```bash
python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000 \
    --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
```
[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py) has new and changed parameters. The most important ones for training are:

* policy : [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : Default value = DeadLockAvoidance
* use_fast_tree_observation : [false, true] : Default value = true
* action_size : [full, reduced] : Default value = full

```console
usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
                               ...
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
                               [--render] [--load_policy LOAD_POLICY]
                               [--use_fast_tree_observation]
                               [--max_depth MAX_DEPTH] [--policy POLICY]
                               [--action_size ACTION_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
                        ...
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render              render 1 episode in 100
  --load_policy LOAD_POLICY
                        policy filename (reference) to load
  --use_fast_tree_observation
                        use FastTreeObs instead of stock TreeObs
  --max_depth MAX_DEPTH
                        max depth
  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
                        DeadLockAvoidanceWithDecision, MultiDecision]
  --action_size ACTION_SIZE
                        define the action size [reduced,full]
```
[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)
[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)

---
If you have any questions, write to me on the official Discord channel **aiAdrian**
(Adrian Egli - adrian.egli@gmail.com)

Credits
---
* Florian Laurent <florian@aicrowd.com>
* Erik Nygren <erik.nygren@sbb.ch>
* Adrian Egli <adrian.egli@sbb.ch>
* Sharada Mohanty <mohanty@aicrowd.com>
* Christian Baumberger <christian.baumberger@sbb.ch>
* Guillaume Mollard <guillaume.mollard2@gmail.com>
Main links
---
* [Flatland documentation](https://flatland.aicrowd.com/)
* [NeurIPS 2020 Challenge](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)
* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)
Communication
---
......
......@@ -22,7 +22,8 @@ from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action, \
set_action_size_reduced, set_action_size_full, map_action_policy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
base_dir = Path(__file__).resolve().parent.parent
......@@ -169,20 +170,26 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
scores_window = deque(maxlen=checkpoint_interval) # todo smooth when rendering instead
completion_window = deque(maxlen=checkpoint_interval)
if train_params.action_size == "reduced":
set_action_size_reduced()
else:
set_action_size_full()
# Double Dueling DQN policy
policy = None
if False:
if train_params.policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
if True:
elif train_params.policy == "PPO":
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
if False:
policy = DeadLockAvoidanceAgent(train_env, get_action_size())
if False:
elif train_params.policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(train_env, get_action_size(), enable_eps=False)
elif train_params.policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), train_params)
policy = DeadLockAvoidanceWithDecisionAgent(train_env, state_size, get_action_size(), inter_policy)
if False:
elif train_params.policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), train_params)
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
# make sure that at least one policy is set
if policy is None:
......@@ -211,7 +218,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
hdd.free / (2 ** 30)))
# TensorBoard writer
writer = SummaryWriter()
writer = SummaryWriter(comment="_" + train_params.policy + "_" + train_params.action_size)
training_timer = Timer()
training_timer.start()
......@@ -235,8 +242,12 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Reset environment
reset_timer.start()
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
if train_params.n_agent_fixed:
number_of_agents = n_agents
train_env_params.n_agents = n_agents
else:
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
train_env = create_rail_env(train_env_params, tree_observation)
obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
......@@ -308,7 +319,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
learn_timer.start()
policy.step(agent_handle,
agent_prev_obs[agent_handle],
agent_prev_action[agent_handle] - 1,
map_action_policy(agent_prev_action[agent_handle]),
all_rewards[agent_handle],
agent_obs[agent_handle],
done[agent_handle])
......@@ -431,6 +442,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
writer.add_scalar("timer/learn", learn_timer.get(), episode_idx)
writer.add_scalar("timer/preproc", preproc_timer.get(), episode_idx)
writer.add_scalar("timer/total", training_timer.get_current(), episode_idx)
writer.flush()
def format_action_prob(action_probs):
......@@ -503,7 +515,8 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=12000, type=int)
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=5000, type=int)
parser.add_argument("--n_agent_fixed", help="hold the number of agent fixed", action='store_true')
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=1,
......@@ -531,6 +544,10 @@ if __name__ == "__main__":
parser.add_argument("--use_fast_tree_observation", help="use FastTreeObs instead of stock TreeObs",
action='store_true')
parser.add_argument("--max_depth", help="max depth", default=2, type=int)
parser.add_argument("--policy",
help="policy name [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision]",
default="DeadLockAvoidance")
parser.add_argument("--action_size", help="define the action size [reduced,full]", default="full", type=str)
training_params = parser.parse_args()
env_params = [
......
......@@ -34,7 +34,7 @@ class EpisodeBuffers:
class ActorCriticModel(nn.Module):
def __init__(self, state_size, action_size, device, hidsize1=128, hidsize2=128):
def __init__(self, state_size, action_size, device, hidsize1=512, hidsize2=256):
super(ActorCriticModel, self).__init__()
self.device = device
self.actor = nn.Sequential(
......@@ -85,7 +85,7 @@ class ActorCriticModel(nn.Module):
return obj
def load(self, filename):
print("load policy from file", filename)
print("load model from file", filename)
self.actor = self._load(self.actor, filename + ".actor")
self.critic = self._load(self.critic, filename + ".value")
......@@ -284,6 +284,8 @@ class PPOPolicy(LearningPolicy):
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
else:
print(" >> file not found!")
return obj
def load(self, filename):
......
......@@ -3,6 +3,7 @@ from collections import namedtuple
import gym
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
......@@ -36,6 +37,9 @@ def cartpole(use_dddqn=False):
episode = 0
checkpoint_interval = 20
scores_window = deque(maxlen=100)
writer = SummaryWriter()
while True:
episode += 1
state = env.reset()
......@@ -79,6 +83,10 @@ def cartpole(use_dddqn=False):
policy.memory)),
end=" ")
writer.add_scalar("CartPole/value", tot_reward, episode)
writer.add_scalar("CartPole/smoothed_value", np.mean(scores_window), episode)
writer.flush()
if __name__ == "__main__":
cartpole()
'''
I did experiments in an early submission. Please note that the epsilon can have an
effect on the evaluation outcome:
DDDQNPolicy experiments - EPSILON impact analysis
----------------------------------------------------------------------------------------
checkpoint = "./checkpoints/201124171810-7800.pth" # Training on AGENTS=10 with Depth=2
......@@ -25,14 +27,17 @@ from pathlib import Path
import numpy as np
from flatland.core.env_observation_builder import DummyObservationBuilder
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.predictions import ShortestPathPredictorForRailEnv
from flatland.evaluators.client import FlatlandRemoteClient
from flatland.evaluators.client import TimeoutException
from reinforcement_learning.ppo_agent import PPOPolicy
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from utils.agent_action_config import get_action_size, map_actions
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size, map_actions, set_action_size_reduced, set_action_size_full
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
from utils.deadlock_check import check_if_all_blocked
from utils.fast_tree_obs import FastTreeObs
......@@ -41,29 +46,68 @@ from utils.observation_utils import normalize_observation
base_dir = Path(__file__).resolve().parent.parent
sys.path.append(str(base_dir))
from reinforcement_learning.dddqn_policy import DDDQNPolicy
####################################################
# EVALUATION PARAMETERS
set_action_size_full()
# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True
USE_PPO_AGENT = True
# Checkpoint to use (remember to push it!)
checkpoint = "./checkpoints/201219090514-8600.pth" #
# checkpoint = "./checkpoints/201215212134-12000.pth" #
checkpoint = "./checkpoints/201220171629-12000.pth" # DDDQN - EPSILON: 0.0 - 13.940848323912533
checkpoint = "./checkpoints/201220203236-12000.pth" # PPO - EPSILON: 0.0 - 13.660942453931114
checkpoint = "./checkpoints/201220214325-12000.pth" # PPO - EPSILON: 0.0 - 13.463600936043
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116591 adrian_egli
# graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
# Fri, 22 Jan 2021 23:37:56
set_action_size_reduced()
load_policy = "DDDQN"
checkpoint = "./checkpoints/210122120236-3000.pth" # 17.011131341978228
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116658 adrian_egli
# graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
# Sat, 23 Jan 2021 07:41:35
set_action_size_reduced()
load_policy = "PPO"
checkpoint = "./checkpoints/210122235754-5000.pth" # 16.00113400887389
EPSILON = 0.0
if True:
# -------------------------------------------------------------------------------------------------------
# RL solution
# -------------------------------------------------------------------------------------------------------
# 116659 adrian_egli
# graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
# Sat, 23 Jan 2021 07:45:49
set_action_size_reduced()
load_policy = "DDDQN"
checkpoint = "./checkpoints/210122165109-5000.pth" # 17.993750197899438
EPSILON = 0.0
if False:
# -------------------------------------------------------------------------------------------------------
# !! This is not a RL solution !!!!
# -------------------------------------------------------------------------------------------------------
# 116727 adrian_egli
# graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
# http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
# Sat, 23 Jan 2021 14:31:50
set_action_size_reduced()
load_policy = "DeadLockAvoidance"
checkpoint = None
EPSILON = 0.0
# Use last action cache
USE_ACTION_CACHE = False
USE_DEAD_LOCK_AVOIDANCE_AGENT = False # 21.54485505223213
USE_MULTI_DECISION_AGENT = False
# Observation parameters (must match training parameters!)
observation_tree_depth = 2
......@@ -102,15 +146,6 @@ else:
n_nodes = sum([np.power(4, i) for i in range(observation_tree_depth + 1)])
state_size = n_features_per_node * n_nodes
action_size = get_action_size()
# Creates the policy. No GPU on evaluation server.
if not USE_PPO_AGENT:
trained_policy = DDDQNPolicy(state_size, action_size, Namespace(**{'use_gpu': False}), evaluation_mode=True)
else:
trained_policy = PPOPolicy(state_size, action_size)
trained_policy.load(checkpoint)
#####################################################################
# Main evaluation loop
#####################################################################
......@@ -145,9 +180,25 @@ while True:
tree_observation.set_env(local_env)
tree_observation.reset()
policy = trained_policy
if USE_MULTI_DECISION_AGENT:
policy = DeadLockAvoidanceWithDecisionAgent(local_env, state_size, action_size, trained_policy)
# Creates the policy. No GPU on evaluation server.
if load_policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), Namespace(**{'use_gpu': False}), evaluation_mode=True)
elif load_policy == "PPO":
policy = PPOPolicy(state_size, get_action_size())
elif load_policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(local_env, get_action_size(), enable_eps=False)
elif load_policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), Namespace(**{'use_gpu': False}), evaluation_mode=True)
policy = DeadLockAvoidanceWithDecisionAgent(local_env, state_size, get_action_size(), inter_policy)
elif load_policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), Namespace(**{'use_gpu': False}))
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False,
in_parameters=Namespace(**{'use_gpu': False}))
policy.load(checkpoint)
policy.reset(local_env)
observation = tree_observation.get_many(list(range(nb_agents)))
......@@ -168,9 +219,6 @@ while True:
agent_last_action = {}
nb_hit = 0
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
policy = DeadLockAvoidanceAgent(local_env, action_size)
policy.start_episode(train=False)
while True:
try:
......@@ -185,14 +233,7 @@ while True:
time_start = time.time()
action_dict = {}
policy.start_step(train=False)
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
observation = np.zeros((local_env.get_num_agents(), 2))
for agent_handle in range(nb_agents):
if USE_DEAD_LOCK_AVOIDANCE_AGENT:
observation[agent_handle][0] = agent_handle
observation[agent_handle][1] = steps
if info['action_required'][agent_handle]:
if agent_handle in agent_last_obs and np.all(
agent_last_obs[agent_handle] == observation[agent_handle]):
......@@ -234,7 +275,11 @@ while True:
step_time = time.time() - time_start
time_taken_per_step.append(step_time)
nb_agents_done = sum(done[idx] for idx in local_env.get_agent_handles())
nb_agents_done = 0
for i_agent, agent in enumerate(local_env.agents):
# manage the boolean flag to check if all agents are indeed done (or done_removed)
if (agent.status in [RailAgentStatus.DONE, RailAgentStatus.DONE_REMOVED]):
nb_agents_done += 1
if VERBOSE or done['__all__']:
print(
......