Commit acb3cda3 authored by MasterScrat

Cleanup, preparing README to move to starter kit repo

parent 6c54e0fd
🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
===
This starter kit contains two example policies to get you started with this challenge:
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
Sequential Agent
---
The sequential agent shows a very simple baseline that moves the agents one after the other, taking each one from its starting point to its target along the shortest possible path.
This is very inefficient, but it solves all the instances generated by the `sparse_level_generator`. However, when scored in the AIcrowd challenge, this agent fails because of the time it takes to solve an episode.
Here you see it in action:
![Sequential_Agent](https://i.imgur.com/DsbG6zK.gif)
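For intuition, here is a minimal sketch of this strategy (an illustration only, not the repository's exact code, which lives in the sequential agent script further down and in `reinforcement_learning/ordered_policy.py`): one train at a time is allowed to move while every other train is told to stop, and the moving train simply follows its shortest path. `next_shortest_path_action` is a hypothetical helper standing in for that shortest-path lookup.

```python
from flatland.envs.rail_env import RailEnvActions


def run_sequentially(env, next_shortest_path_action, max_steps):
    """Move one agent at a time along its shortest path (illustrative sketch)."""
    obs, info = env.reset()
    done = {handle: False for handle in env.get_agent_handles()}
    done['__all__'] = False

    for _ in range(max_steps):
        if done['__all__']:
            break
        # Pick the first agent that has not reached its target yet.
        active = next(h for h in env.get_agent_handles() if not done[h])
        # All other agents stand still; only `active` advances along its shortest path.
        action_dict = {h: RailEnvActions.STOP_MOVING for h in env.get_agent_handles()}
        action_dict[active] = next_shortest_path_action(env, active)
        obs, rewards, done, info = env.step(action_dict)
```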
Reinforcement Learning
---
The reinforcement learning agents show how to use a simple DQN algorithm implemented using PyTorch to solve small Flatland problems.
- **[🔗 Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
- **[🔗 Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
- **[🔗 Submit a trained multi-agent policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
The single-agent example is meant as a minimal example of how to use DQN. The multi-agent example is a better starting point to create your own solution.
You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)
```bash
# Train the single-agent policy for 150 episodes
python reinforcement_learning/single_agent_training.py -n 150

# Train the multi-agent policy for 150 episodes
python reinforcement_learning/multi_agent_training.py -n 150
```
The multi-agent policy training can be tuned using command-line arguments:
```console
usage: multi_agent_training.py [-h] [-n N_EPISODES] [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
                               [--eps_start EPS_START] [--eps_end EPS_END]
                               [--eps_decay EPS_DECAY]
                               [--buffer_size BUFFER_SIZE]
                               [--buffer_min_size BUFFER_MIN_SIZE]
                               [--restore_replay_buffer RESTORE_REPLAY_BUFFER]
                               [--save_replay_buffer SAVE_REPLAY_BUFFER]
                               [--batch_size BATCH_SIZE] [--gamma GAMMA]
                               [--tau TAU] [--learning_rate LEARNING_RATE]
                               [--hidden_size HIDDEN_SIZE]
optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
                        evaluation config id (eg 0 for Test_0)
  --n_evaluation_episodes N_EVALUATION_EPISODES
                        number of evaluation episodes
  --checkpoint_interval CHECKPOINT_INTERVAL
                        checkpoint interval
  --eps_start EPS_START
                        max exploration
  --eps_end EPS_END     min exploration
  --eps_decay EPS_DECAY
                        exploration decay
  --buffer_size BUFFER_SIZE
                        replay buffer size
  --buffer_min_size BUFFER_MIN_SIZE
                        min buffer size to start training
  --restore_replay_buffer RESTORE_REPLAY_BUFFER
                        replay buffer to restore
  --save_replay_buffer SAVE_REPLAY_BUFFER
                        save replay buffer at each evaluation interval
  --batch_size BATCH_SIZE
                        minibatch size
  --gamma GAMMA         discount factor
  --update_every UPDATE_EVERY
                        how often to update the network
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render RENDER       render 1 episode in 100
```
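To make the exploration flags concrete: `--eps_start`, `--eps_end` and `--eps_decay` define an epsilon-greedy schedule. A typical multiplicative decay is sketched below; the authoritative schedule is the one implemented in `multi_agent_training.py`, and the numbers here are placeholders rather than the script's defaults.

```python
# Sketch of a multiplicative epsilon-greedy schedule (placeholder values,
# not the defaults of multi_agent_training.py).
eps_start, eps_end, eps_decay = 1.0, 0.01, 0.997

eps = eps_start
for episode in range(1500):
    # ... run one training episode, taking a random action with probability `eps` ...
    eps = max(eps_end, eps_decay * eps)  # decay after each episode, never below eps_end
```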
[**📈 Results using the multi-agent example with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
![Conflict_Avoidance](https://i.imgur.com/AvBHKaD.gif)
Main links
---
@@ -27,7 +27,7 @@ from reinforcement_learning.dddqn_policy import DDDQNPolicy
def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render, allow_skipping, allow_caching):
# Evaluation is faster on CPU (except if you use a really huge policy)
parameters = {
'use_gpu': False
}
@@ -14,38 +14,45 @@ sys.path.append(str(base_dir))
from reinforcement_learning.ordered_policy import OrderedPolicy
"""
This file shows how to move agents in a sequential way: it moves the trains one by one, following a shortest path strategy.
This is obviously very slow, but it's a good way to get familiar with the different Flatland components: RailEnv, TreeObsForRailEnv, etc...
multi_agent_training.py is a better starting point to train your own solution!
"""
np.random.seed(2)
x_dim = np.random.randint(8, 20)
y_dim = np.random.randint(8, 20)
n_agents = np.random.randint(3, 8)
n_goals = n_agents + np.random.randint(0, 3)
min_dist = int(0.75 * min(x_dim, y_dim))
env = RailEnv(
width=x_dim,
height=y_dim,
rail_generator=complex_rail_generator(
nr_start_goal=n_goals, nr_extra=5, min_dist=min_dist,
max_dist=99999,
seed=0
),
schedule_generator=complex_schedule_generator(),
obs_builder_object=TreeObsForRailEnv(max_depth=1, predictor=ShortestPathPredictorForRailEnv()),
number_of_agents=n_agents)
env.reset(True, True)
tree_depth = 1
observation_helper = TreeObsForRailEnv(max_depth=tree_depth, predictor=ShortestPathPredictorForRailEnv())
env_renderer = RenderTool(env, gl="PGL", )
handle = env.get_agent_handles()
n_episodes = 10
max_steps = 100 * (env.height + env.width)
record_images = False
policy = OrderedPolicy()
action_dict = dict()
for trials in range(1, n_episodes + 1):
# Reset environment
obs, info = env.reset(True, True)
done = env.dones
@@ -19,22 +19,28 @@ sys.path.append(str(base_dir))
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from utils.observation_utils import normalize_observation
# TODO:
# - add timeout handling
# - keep only code relative to training and running the DDQN agent
# Writeup how to improve:
# - use FastTreeObsForRailEnv instead of TreeObsForRailEnv
# - add PER (using custom code, or using cpprb)
# - other improvements from Rainbow paper eg n-step... (https://arxiv.org/abs/1710.02298)
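# (Sketch of the n-step idea mentioned above, not part of this script.)
# n-step learning replaces the 1-step TD target
#     r_t + gamma * max_a Q(s_{t+1}, a)
# with a target that accumulates n discounted rewards before bootstrapping:
#     r_t + gamma * r_{t+1} + ... + gamma**(n-1) * r_{t+n-1} + gamma**n * max_a Q(s_{t+n}, a)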
####################################################
# EVALUATION PARAMETERS
# Print detailed logs (disable when submitting)
VERBOSE = True
# Checkpoint to use (remember to push it!)
checkpoint = "checkpoints/sample-checkpoint.pth"
# Use last action cache
USE_ACTION_CACHE = True
# Observation parameters (must match training parameters!)
observation_tree_depth = 2
observation_radius = 10
observation_max_path_depth = 30
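# (Sketch, not part of this script.) The network input size follows from these values:
# the tree observation branches 4 ways per node, so roughly
#     n_nodes = sum(4 ** i for i in range(observation_tree_depth + 1))
#     state_size = n_nodes * n_features_per_node
# where n_features_per_node is provided by the tree observation builder.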
@@ -64,9 +70,9 @@ evaluation_number = 0
while True:
evaluation_number += 1
# We use a dummy observation and call TreeObsForRailEnv ourselves when needed.
# This way we decide if we want to calculate the observations or not instead
# of having them calculated every time we perform an env step.
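# (Sketch with hypothetical variable names.) The manual call later in the loop
# looks roughly like:
#     observation = tree_observation.get_many(list(range(nb_agents)))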
time_start = time.time()
observation, info = remote_client.env_create(
obs_builder_object=DummyObservationBuilder()
@@ -77,7 +83,7 @@ while True:
# If the remote_client returns False on a `env_create` call,
# then it basically means that your agent has already been
# evaluated on all the required evaluation environments,
# and hence it's safe to break out of the main evaluation loop.
break
local_env = remote_client.env
@@ -97,6 +103,8 @@ while True:
time_taken_per_step = []
steps = 0
# Action cache: keep track of last observation to avoid running the same inference multiple times.
# This only makes sense for deterministic policies.
agent_last_obs = {}
agent_last_action = {}
nb_hit = 0
@@ -109,15 +117,17 @@ while True:
obs_time, agent_time, step_time = 0.0, 0.0, 0.0
no_ops_mode = False
if not check_if_all_blocked(env=local_env):
time_start = time.time()
action_dict = {}
for agent in range(nb_agents):
if observation[agent] and info['action_required'][agent]:
if agent in agent_last_obs and np.all(agent_last_obs[agent] == observation[agent]):
# cache hit
action = agent_last_action[agent]
nb_hit += 1
else:
# otherwise, run normalization and inference
norm_obs = normalize_observation(observation[agent], tree_depth=observation_tree_depth, observation_radius=observation_radius)
action = policy.act(norm_obs, eps=0.0)
@@ -139,7 +149,7 @@ while True:
obs_time = time.time() - time_start
else:
# Fully deadlocked: perform no-ops to finish the episode ASAP 🏃💨
no_ops_mode = True
time_start = time.time()
@@ -149,7 +159,7 @@ while True:
nb_agents_done = sum(done[idx] for idx in local_env.get_agent_handles())
if not done['__all__'] and VERBOSE:
print("Step {}/{}\tAgents done: {}\t Obs time {:.3f}s\t Inference time {:.5f}s\t Step time {:.3f}s\t Cache hits {}\t No-ops? {}".format(
str(steps).zfill(4),
max_nb_steps,
@@ -181,7 +191,7 @@ while True:
print("Mean/Std of Time per Step : ", np_time_taken_per_step.mean(), np_time_taken_per_step.std())
print("=" * 100)
print("Evaluation of all environments complete...")
print("Evaluation of all environments complete!")
########################################################################
# Submit your Results
#
#!/bin/bash
# maximizes pytorch throughput when using multiprocessing
# export OMP_NUM_THREADS="1"
python ./run.py