Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (60)
Showing with 594 additions and 273 deletions
🚂 Starter Kit - NeurIPS 2020 Flatland Challenge
===
🚂 This code is based on the official starter kit - NeurIPS 2020 Flatland Challenge
---
This starter kit contains 2 example policies to get started with this challenge:
- a simple single-agent DQN method
- a more robust multi-agent DQN method that you can submit out of the box to the challenge 🚀
You can use either the full or the reduced action space for your own experiments.
```python
def map_action(action):
    # If the full action space is used -> no mapping required.
    if get_action_size() == get_flatland_full_action_size():
        return action

    # If the reduced action space is used -> the action has to be mapped to real Flatland actions.
    # The reduced action space removes the DO_NOTHING action from Flatland.
    if action == 0:
        return RailEnvActions.MOVE_LEFT
    if action == 1:
        return RailEnvActions.MOVE_FORWARD
    if action == 2:
        return RailEnvActions.MOVE_RIGHT
    if action == 3:
        return RailEnvActions.STOP_MOVING
```
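As a quick sanity check of the mapping (a minimal sketch; it assumes the helpers above are importable from `utils/agent_action_config.py`, as they are in the training scripts):

```python
from flatland.envs.rail_env import RailEnvActions
from utils.agent_action_config import set_action_size_reduced, map_action

set_action_size_reduced()                          # drop DO_NOTHING -> 4 actions left
assert map_action(2) == RailEnvActions.MOVE_RIGHT  # index 2 of the reduced space
```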
**🔗 [Train the single-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/single-agent.html)**
Select the action space with

```python
set_action_size_full()
```

or

```python
set_action_size_reduced()
```

The reduced action space just removes DO_NOTHING.
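In a training or evaluation loop the mapping is applied to whatever the policy returns before the actions are handed to Flatland. A minimal sketch, assuming a `policy` object with the `act(handle, state, eps)` interface used throughout this repo and the `obs`/`info` dictionaries returned by the environment:

```python
from utils.agent_action_config import map_action

def collect_actions(env, obs, info, policy, eps=0.0):
    """Ask the policy for an action per agent and map it back to a Flatland action."""
    action_dict = {}
    for handle in env.get_agent_handles():
        if info['action_required'][handle]:
            action = policy.act(handle, obs[handle], eps=eps)
            action_dict[handle] = map_action(action)  # no-op when the full action space is active
    return action_dict

# inside the episode loop (env, obs, info and policy come from the surrounding script):
# next_obs, all_rewards, done, info = env.step(collect_actions(env, obs, info, policy))
```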
**🔗 [Train the multi-agent DQN policy](https://flatland.aicrowd.com/getting-started/rl/multi-agent.html)**
---
The policy used here is based on the FastTreeObs from the official starter kit for the NeurIPS 2020 Flatland Challenge, but the FastTreeObs in this repo is an extended version.
[fast_tree_obs.py](./utils/fast_tree_obs.py)
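The observation builder is plugged into the environment like any other Flatland observation builder. A rough sketch of the wiring; the `max_depth` constructor argument is an assumption taken from the `--max_depth` training flag, so check `fast_tree_obs.py` for the exact signature:

```python
from utils.fast_tree_obs import FastTreeObs

# Constructor argument assumed from the --max_depth training flag; see fast_tree_obs.py.
tree_observation = FastTreeObs(max_depth=2)

# Wiring into Flatland (rail/schedule generators omitted for brevity):
#   env = RailEnv(..., obs_builder_object=tree_observation)
#   state_size = tree_observation.observation_dim   # used to size the policy networks
```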
**🔗 [Submit a trained policy](https://flatland.aicrowd.com/getting-started/first-submission.html)**
---
Have a look at the [run.py](./run.py) file. There you can select whether PPO or DDDQN is used as the RL agent; a sketch of how that selection might be turned into a policy object follows the excerpt below.
```python
####################################################
# EVALUATION PARAMETERS
set_action_size_full()

# Print per-step logs
VERBOSE = True
USE_FAST_TREEOBS = True

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116591 adrian_egli
    # graded 71.305 0.633 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/51
    # Fri, 22 Jan 2021 23:37:56
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122120236-3000.pth"  # 17.011131341978228
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116658 adrian_egli
    # graded 73.821 0.655 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/52
    # Sat, 23 Jan 2021 07:41:35
    set_action_size_reduced()
    load_policy = "PPO"
    checkpoint = "./checkpoints/210122235754-5000.pth"  # 16.00113400887389
    EPSILON = 0.0

if True:
    # -------------------------------------------------------------------------------------------------------
    # RL solution
    # -------------------------------------------------------------------------------------------------------
    # 116659 adrian_egli
    # graded 80.579 0.715 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/53
    # Sat, 23 Jan 2021 07:45:49
    set_action_size_reduced()
    load_policy = "DDDQN"
    checkpoint = "./checkpoints/210122165109-5000.pth"  # 17.993750197899438
    EPSILON = 0.0

if False:
    # -------------------------------------------------------------------------------------------------------
    # !! This is not a RL solution !!!!
    # -------------------------------------------------------------------------------------------------------
    # 116727 adrian_egli
    # graded 106.786 0.768 RL Successfully Graded ! More details about this submission can be found at:
    # http://gitlab.aicrowd.com/adrian_egli/neurips2020-flatland-starter-kit/issues/54
    # Sat, 23 Jan 2021 14:31:50
    set_action_size_reduced()
    load_policy = "DeadLockAvoidance"
    checkpoint = None
    EPSILON = 0.0
```
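The `load_policy` string presumably selects the policy class further down in `run.py`. A hedged sketch of that dispatch, reusing the constructors that appear in the training code; `eval_env` and `eval_params` are placeholder names, not the actual variables in `run.py`:

```python
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_action_size
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent

# Illustrative dispatch only; it mirrors the policy selection in multi_agent_training.py.
if load_policy == "DDDQN":
    policy = DDDQNPolicy(state_size, get_action_size(), eval_params, evaluation_mode=True)
elif load_policy == "DeadLockAvoidance":
    policy = DeadLockAvoidanceAgent(eval_env, get_action_size(), enable_eps=False)  # needs no checkpoint
else:  # "PPO"
    policy = PPOPolicy(state_size, get_action_size())

if checkpoint is not None:
    policy.load(checkpoint)
```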
The single-agent example is meant as a minimal example of how to use DQN. The multi-agent example is a better starting point for creating your own solution.
---
A deadlock avoidance agent is implemented. The agent only lets each train take its shortest route and tries to avoid as many deadlocks as possible; a simplified sketch of the underlying deadlock check is shown after the link below.
* [dead_lock_avoidance_agent.py](./utils/dead_lock_avoidance_agent.py)
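A simplified version of the underlying deadlock test, adapted from a helper that this changeset removes from `multi_agent_training.py` (the shipped `DeadLockAvoidanceAgent` is more elaborate and also takes agent status into account): an agent counts as deadlocked when every cell it could move into is occupied by another agent.

```python
import numpy as np
from flatland.core.grid.grid4_utils import get_new_position

def get_agent_positions(env):
    """Map every occupied cell to the handle of the agent standing on it (-1 = free)."""
    agent_positions = np.full((env.height, env.width), -1)
    for handle in env.get_agent_handles():
        position = env.agents[handle].position
        if position is not None:          # only agents that are already on the grid
            agent_positions[position] = handle
    return agent_positions

def is_deadlocked(handle, env, agent_positions):
    """True if every transition the agent could take leads into a cell occupied by another agent."""
    agent = env.agents[handle]
    position = agent.position if agent.position is not None else agent.initial_position
    possible_transitions = env.rail.get_transitions(*position, agent.direction)
    for direction in range(4):
        if possible_transitions[direction] == 1:
            blocker = agent_positions[get_new_position(position, direction)]
            if blocker == -1:
                return False  # at least one reachable cell is free
    return True               # every reachable cell is blocked -> (local) deadlock
```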
You can fully train the multi-agent policy in Colab for free! [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1GbPwZNQU7KJIJtilcGBTtpOAD3EabAzJ?usp=sharing)
Sample training usage
---
The policy interface has changed; please have a look at the following file (a condensed view of the interface is sketched below):
* [policy.py](./reinforcement_learning/policy.py)
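In short, `act` and `step` now receive the agent handle as their first argument and `reset` receives the environment. A condensed view of the base class (see `policy.py` for the full set of hooks):

```python
class Policy:
    def act(self, handle, state, eps=0.):          # handle is the agent id (new)
        raise NotImplementedError

    def step(self, handle, state, action, reward, next_state, done):
        raise NotImplementedError

    def save(self, filename): ...                  # persistence, implemented by subclasses
    def load(self, filename): ...

    # per-step / per-episode hooks, no-ops by default
    def start_step(self, train): pass
    def end_step(self, train): pass
    def start_episode(self, train): pass
    def end_episode(self, train): pass

    def reset(self, env):                          # now receives the RailEnv
        pass
```

Concrete policies derive from `LearningPolicy`, `HeuristicPolicy` or `HybridPolicy`, which are thin markers on top of this base class.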
Train the multi-agent policy for 150 episodes:
```bash
python reinforcement_learning/multi_agent_training.py -n 150
```

---
See the tensorboard training output to get some insights:
```
tensorboard --logdir ./runs_bench
```
---

The multi-agent policy training can be tuned using command-line arguments:

```
python reinforcement_learning/multi_agent_training.py --use_fast_tree_observation --checkpoint_interval 1000 -n 5000
    --policy DDDQN -t 2 --action_size reduced --buffer_size 128000
```
[multi_agent_training.py](./reinforcement_learning/multi_agent_training.py) has new or changed parameters. The most important ones for training are:

* policy : [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision] : Default value = DeadLockAvoidance
* use_fast_tree_observation : [false, true] : Default value = true
* action_size : [full, reduced] : Default value = full

```console
usage: multi_agent_training.py [-h] [-n N_EPISODES] [--n_agent_fixed]
                               [-t TRAINING_ENV_CONFIG]
                               [-e EVALUATION_ENV_CONFIG]
                               [--n_evaluation_episodes N_EVALUATION_EPISODES]
                               [--checkpoint_interval CHECKPOINT_INTERVAL]
                               ...
                               [--hidden_size HIDDEN_SIZE]
                               [--update_every UPDATE_EVERY]
                               [--use_gpu USE_GPU] [--num_threads NUM_THREADS]
                               [--render] [--load_policy LOAD_POLICY]
                               [--use_fast_tree_observation]
                               [--max_depth MAX_DEPTH] [--policy POLICY]
                               [--action_size ACTION_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  -n N_EPISODES, --n_episodes N_EPISODES
                        number of episodes to run
  --n_agent_fixed       hold the number of agent fixed
  -t TRAINING_ENV_CONFIG, --training_env_config TRAINING_ENV_CONFIG
                        training config id (eg 0 for Test_0)
  -e EVALUATION_ENV_CONFIG, --evaluation_env_config EVALUATION_ENV_CONFIG
                        ...
  --use_gpu USE_GPU     use GPU if available
  --num_threads NUM_THREADS
                        number of threads PyTorch can use
  --render              render 1 episode in 100
  --load_policy LOAD_POLICY
                        policy filename (reference) to load
  --use_fast_tree_observation
                        use FastTreeObs instead of stock TreeObs
  --max_depth MAX_DEPTH
                        max depth
  --policy POLICY       policy name [DDDQN, PPO, DeadLockAvoidance,
                        DeadLockAvoidanceWithDecision, MultiDecision]
  --action_size ACTION_SIZE
                        define the action size [reduced,full]
```
[**📈 Performance training in environments of various sizes**](https://wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Starter-Kit-Training-in-environments-of-various-sizes--VmlldzoxNjgxMTk)
[**📈 Performance with various hyper-parameters**](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)

[![](https://i.imgur.com/Lqrq5GE.png)](https://app.wandb.ai/masterscrat/flatland-examples-reinforcement_learning/reports/Flatland-Examples--VmlldzoxNDI2MTA)
---
If you have any questions, write to me on the official Discord channel **aiAdrian**
(Adrian Egli - adrian.egli@gmail.com)
Credits
---
* Florian Laurent <florian@aicrowd.com>
* Erik Nygren <erik.nygren@sbb.ch>
* Adrian Egli <adrian.egli@sbb.ch>
* Sharada Mohanty <mohanty@aicrowd.com>
* Christian Baumberger <christian.baumberger@sbb.ch>
* Guillaume Mollard <guillaume.mollard2@gmail.com>
Main links
---
* [Flatland documentation](https://flatland.aicrowd.com/)
* [NeurIPS 2020 Challenge](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)
* [Flatland Challenge](https://www.aicrowd.com/challenges/flatland)
Communication
---
......
File deleted
File deleted
File added
File added
File added
File added
File added
File added
File added
File deleted
......@@ -2,7 +2,6 @@ import copy
import os
import pickle
import random
from collections import namedtuple, deque, Iterable
import numpy as np
import torch
......@@ -10,13 +9,15 @@ import torch.nn.functional as F
import torch.optim as optim
from reinforcement_learning.model import DuelingQNetwork
from reinforcement_learning.policy import Policy
from reinforcement_learning.policy import Policy, LearningPolicy
from reinforcement_learning.replay_buffer import ReplayBuffer
class DDDQNPolicy(Policy):
class DDDQNPolicy(LearningPolicy):
"""Dueling Double DQN policy"""
def __init__(self, state_size, action_size, in_parameters, evaluation_mode=False):
print(">> DDDQNPolicy")
super(Policy, self).__init__()
self.ddqn_parameters = in_parameters
......@@ -55,11 +56,13 @@ class DDDQNPolicy(Policy):
self.qnetwork_target = copy.deepcopy(self.qnetwork_local)
self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.learning_rate)
self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
self.t_step = 0
self.loss = 0.0
else:
self.memory = ReplayBuffer(action_size, 1, 1, self.device)
self.loss = 0.0
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
self.qnetwork_local.eval()
with torch.no_grad():
......@@ -88,7 +91,7 @@ class DDDQNPolicy(Policy):
def _learn(self):
experiences = self.memory.sample()
states, actions, rewards, next_states, dones = experiences
states, actions, rewards, next_states, dones, _ = experiences
# Get expected Q values from local model
q_expected = self.qnetwork_local(states).gather(1, actions)
......@@ -151,7 +154,7 @@ class DDDQNPolicy(Policy):
self.memory.memory = pickle.load(f)
def test(self):
self.act(np.array([[0] * self.state_size]))
self.act(0, np.array([[0] * self.state_size]))
self._learn()
def clone(self):
......@@ -159,55 +162,3 @@ class DDDQNPolicy(Policy):
me.qnetwork_target = copy.deepcopy(self.qnetwork_local)
me.qnetwork_target = copy.deepcopy(self.qnetwork_target)
return me
Experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
class ReplayBuffer:
"""Fixed-size buffer to store experience tuples."""
def __init__(self, action_size, buffer_size, batch_size, device):
"""Initialize a ReplayBuffer object.
Params
======
action_size (int): dimension of each action
buffer_size (int): maximum size of buffer
batch_size (int): size of each training batch
"""
self.action_size = action_size
self.memory = deque(maxlen=buffer_size)
self.batch_size = batch_size
self.device = device
def add(self, state, action, reward, next_state, done):
"""Add a new experience to memory."""
e = Experience(np.expand_dims(state, 0), action, reward, np.expand_dims(next_state, 0), done)
self.memory.append(e)
def sample(self):
"""Randomly sample a batch of experiences from memory."""
experiences = random.sample(self.memory, k=self.batch_size)
states = torch.from_numpy(self.__v_stack_impr([e.state for e in experiences if e is not None])) \
.float().to(self.device)
actions = torch.from_numpy(self.__v_stack_impr([e.action for e in experiences if e is not None])) \
.long().to(self.device)
rewards = torch.from_numpy(self.__v_stack_impr([e.reward for e in experiences if e is not None])) \
.float().to(self.device)
next_states = torch.from_numpy(self.__v_stack_impr([e.next_state for e in experiences if e is not None])) \
.float().to(self.device)
dones = torch.from_numpy(self.__v_stack_impr([e.done for e in experiences if e is not None]).astype(np.uint8)) \
.float().to(self.device)
return states, actions, rewards, next_states, dones
def __len__(self):
"""Return the current size of internal memory."""
return len(self.memory)
def __v_stack_impr(self, states):
sub_dim = len(states[0][0]) if isinstance(states[0], Iterable) else 1
np_states = np.reshape(np.array(states), (len(states), sub_dim))
return np_states
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.rail_env import RailEnv, RailEnvActions
from reinforcement_learning.policy import HybridPolicy
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import map_rail_env_action
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
class DeadLockAvoidanceWithDecisionAgent(HybridPolicy):
def __init__(self, env: RailEnv, state_size, action_size, learning_agent):
print(">> DeadLockAvoidanceWithDecisionAgent")
super(DeadLockAvoidanceWithDecisionAgent, self).__init__()
self.env = env
self.state_size = state_size
self.action_size = action_size
self.learning_agent = learning_agent
self.dead_lock_avoidance_agent = DeadLockAvoidanceAgent(self.env, action_size, False)
self.policy_selector = PPOPolicy(state_size, 2)
self.memory = self.learning_agent.memory
self.loss = self.learning_agent.loss
def step(self, handle, state, action, reward, next_state, done):
select = self.policy_selector.act(handle, state, 0.0)
self.policy_selector.step(handle, state, select, reward, next_state, done)
self.dead_lock_avoidance_agent.step(handle, state, action, reward, next_state, done)
self.learning_agent.step(handle, state, action, reward, next_state, done)
self.loss = self.learning_agent.loss
def act(self, handle, state, eps=0.):
select = self.policy_selector.act(handle, state, eps)
if select == 0:
return self.learning_agent.act(handle, state, eps)
return self.dead_lock_avoidance_agent.act(handle, state, -1.0)
def save(self, filename):
self.dead_lock_avoidance_agent.save(filename)
self.learning_agent.save(filename)
self.policy_selector.save(filename + '.selector')
def load(self, filename):
self.dead_lock_avoidance_agent.load(filename)
self.learning_agent.load(filename)
self.policy_selector.load(filename + '.selector')
def start_step(self, train):
self.dead_lock_avoidance_agent.start_step(train)
self.learning_agent.start_step(train)
self.policy_selector.start_step(train)
def end_step(self, train):
self.dead_lock_avoidance_agent.end_step(train)
self.learning_agent.end_step(train)
self.policy_selector.end_step(train)
def start_episode(self, train):
self.dead_lock_avoidance_agent.start_episode(train)
self.learning_agent.start_episode(train)
self.policy_selector.start_episode(train)
def end_episode(self, train):
self.dead_lock_avoidance_agent.end_episode(train)
self.learning_agent.end_episode(train)
self.policy_selector.end_episode(train)
def load_replay_buffer(self, filename):
self.dead_lock_avoidance_agent.load_replay_buffer(filename)
self.learning_agent.load_replay_buffer(filename)
self.policy_selector.load_replay_buffer(filename + ".selector")
def test(self):
self.dead_lock_avoidance_agent.test()
self.learning_agent.test()
self.policy_selector.test()
def reset(self, env: RailEnv):
self.env = env
self.dead_lock_avoidance_agent.reset(env)
self.learning_agent.reset(env)
self.policy_selector.reset(env)
def clone(self):
return self
......@@ -26,7 +26,8 @@ from utils.observation_utils import normalize_observation
from reinforcement_learning.dddqn_policy import DDDQNPolicy
def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render, allow_skipping, allow_caching):
def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size, state_size, seed, render,
allow_skipping, allow_caching):
# Evaluation is faster on CPU (except if you use a really huge policy)
parameters = {
'use_gpu': False
......@@ -140,11 +141,12 @@ def eval_policy(env_params, checkpoint, n_eval_episodes, max_steps, action_size,
else:
preproc_timer.start()
norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth, observation_radius=observation_radius)
norm_obs = normalize_observation(obs[agent], tree_depth=observation_tree_depth,
observation_radius=observation_radius)
preproc_timer.end()
inference_timer.start()
action = policy.act(norm_obs, eps=0.0)
action = policy.act(agent, norm_obs, eps=0.0)
inference_timer.end()
action_dict.update({agent: action})
......@@ -319,12 +321,15 @@ def evaluate_agents(file, n_evaluation_episodes, use_gpu, render, allow_skipping
results = []
if render:
results.append(eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping, allow_caching))
results.append(
eval_policy(params, file, eval_per_thread, max_steps, action_size, state_size, 0, render, allow_skipping,
allow_caching))
else:
with Pool() as p:
results = p.starmap(eval_policy,
[(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render, allow_skipping, allow_caching)
[(params, file, 1, max_steps, action_size, state_size, seed * nb_threads, render,
allow_skipping, allow_caching)
for seed in
range(total_nb_eval)])
......@@ -367,10 +372,12 @@ if __name__ == "__main__":
parser.add_argument("--use_gpu", dest="use_gpu", help="use GPU if available", action='store_true')
parser.add_argument("--render", help="render a single episode", action='store_true')
parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked", action='store_true')
parser.add_argument("--allow_skipping", help="skips to the end of the episode if all agents are deadlocked",
action='store_true')
parser.add_argument("--allow_caching", help="caches the last observation-action pair", action='store_true')
args = parser.parse_args()
os.environ["OMP_NUM_THREADS"] = str(1)
evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu, render=args.render,
evaluate_agents(file=args.file, n_evaluation_episodes=args.n_evaluation_episodes, use_gpu=args.use_gpu,
render=args.render,
allow_skipping=args.allow_skipping, allow_caching=args.allow_caching)
......@@ -9,19 +9,22 @@ from pprint import pprint
import numpy as np
import psutil
from flatland.core.grid.grid4_utils import get_new_position
from flatland.envs.agent_utils import RailAgentStatus
from flatland.envs.malfunction_generators import malfunction_from_params, MalfunctionParameters
from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.predictions import ShortestPathPredictorForRailEnv
from flatland.envs.rail_env import RailEnv, RailEnvActions, fast_count_nonzero
from flatland.envs.rail_env import RailEnv, RailEnvActions
from flatland.envs.rail_generators import sparse_rail_generator
from flatland.envs.schedule_generators import sparse_schedule_generator
from flatland.utils.rendertools import RenderTool
from torch.utils.tensorboard import SummaryWriter
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.ppo_agent import PPOAgent
from reinforcement_learning.deadlockavoidance_with_decision_agent import DeadLockAvoidanceWithDecisionAgent
from reinforcement_learning.multi_decision_agent import MultiDecisionAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.agent_action_config import get_flatland_full_action_size, get_action_size, map_actions, map_action, \
set_action_size_reduced, set_action_size_full, map_action_policy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
base_dir = Path(__file__).resolve().parent.parent
sys.path.append(str(base_dir))
......@@ -78,41 +81,6 @@ def create_rail_env(env_params, tree_observation):
)
def get_agent_positions(env):
agent_positions: np.ndarray = np.full((env.height, env.width), -1)
for agent_handle in env.get_agent_handles():
agent = env.agents[agent_handle]
if agent.status == RailAgentStatus.ACTIVE:
position = agent.position
if position is None:
position = agent.initial_position
agent_positions[position] = agent_handle
return agent_positions
def check_for_dealock(handle, env, agent_positions):
agent = env.agents[handle]
if agent.status == RailAgentStatus.DONE or agent.status == RailAgentStatus.DONE_REMOVED:
return False
position = agent.position
if position is None:
position = agent.initial_position
possible_transitions = env.rail.get_transitions(*position, agent.direction)
num_transitions = fast_count_nonzero(possible_transitions)
for dir_loop in range(4):
if possible_transitions[dir_loop] == 1:
new_position = get_new_position(position, dir_loop)
opposite_agent = agent_positions[new_position]
if opposite_agent != handle and opposite_agent != -1:
num_transitions -= 1
else:
return False
is_deadlock = num_transitions <= 0
return is_deadlock
def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Environment parameters
n_agents = train_env_params.n_agents
......@@ -188,10 +156,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Calculate the state size given the depth of the tree observation and the number of features
state_size = tree_observation.observation_dim
# The action space of flatland is 5 discrete actions
action_size = 5
action_count = [0] * action_size
action_count = [0] * get_flatland_full_action_size()
action_dict = dict()
agent_obs = [None] * n_agents
agent_prev_obs = [None] * n_agents
......@@ -205,12 +170,33 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
scores_window = deque(maxlen=checkpoint_interval) # todo smooth when rendering instead
completion_window = deque(maxlen=checkpoint_interval)
if train_params.action_size == "reduced":
set_action_size_reduced()
else:
set_action_size_full()
# Double Dueling DQN policy
policy = DDDQNPolicy(state_size, action_size, train_params)
if True:
policy = PPOAgent(state_size, action_size)
if train_params.policy == "DDDQN":
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
elif train_params.policy == "PPO":
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
elif train_params.policy == "DeadLockAvoidance":
policy = DeadLockAvoidanceAgent(train_env, get_action_size(), enable_eps=False)
elif train_params.policy == "DeadLockAvoidanceWithDecision":
# inter_policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
inter_policy = DDDQNPolicy(state_size, get_action_size(), train_params)
policy = DeadLockAvoidanceWithDecisionAgent(train_env, state_size, get_action_size(), inter_policy)
elif train_params.policy == "MultiDecision":
policy = MultiDecisionAgent(state_size, get_action_size(), train_params)
else:
policy = PPOPolicy(state_size, get_action_size(), use_replay_buffer=False, in_parameters=train_params)
# make sure that at least one policy is set
if policy is None:
policy = DDDQNPolicy(state_size, get_action_size(), train_params)
# Load existing policy
if train_params.load_policy is not "":
if train_params.load_policy != "":
policy.load(train_params.load_policy)
# Loads existing replay buffer
......@@ -232,7 +218,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
hdd.free / (2 ** 30)))
# TensorBoard writer
writer = SummaryWriter()
writer = SummaryWriter(comment="_" + train_params.policy + "_" + train_params.action_size)
training_timer = Timer()
training_timer.start()
......@@ -256,12 +242,16 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Reset environment
reset_timer.start()
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
if train_params.n_agent_fixed:
number_of_agents = n_agents
train_env_params.n_agents = n_agents
else:
number_of_agents = int(min(n_agents, 1 + np.floor(episode_idx / 200)))
train_env_params.n_agents = episode_idx % number_of_agents + 1
train_env = create_rail_env(train_env_params, tree_observation)
obs, info = train_env.reset(regenerate_rail=True, regenerate_schedule=True)
policy.reset()
policy.reset(train_env)
reset_timer.end()
if train_params.render:
......@@ -296,10 +286,9 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
agent = train_env.agents[agent_handle]
if info['action_required'][agent_handle]:
update_values[agent_handle] = True
action = policy.act(agent_obs[agent_handle], eps=eps_start)
action_count[action] += 1
actions_taken.append(action)
action = policy.act(agent_handle, agent_obs[agent_handle], eps=eps_start)
action_count[map_action(action)] += 1
actions_taken.append(map_action(action))
else:
# An action is not required if the train hasn't joined the railway network,
# if it already reached its target, or if is currently malfunctioning.
......@@ -311,42 +300,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
# Environment step
step_timer.start()
next_obs, all_rewards, done, info = train_env.step(action_dict)
# Reward shaping .Dead-lock .NotMoving .NotStarted
if False:
agent_positions = get_agent_positions(train_env)
for agent_handle in train_env.get_agent_handles():
agent = train_env.agents[agent_handle]
act = action_dict.get(agent_handle, RailEnvActions.MOVE_FORWARD)
if agent.status == RailAgentStatus.ACTIVE:
pos = agent.position
dir = agent.direction
possible_transitions = train_env.rail.get_transitions(*pos, dir)
num_transitions = fast_count_nonzero(possible_transitions)
if act == RailEnvActions.STOP_MOVING:
all_rewards[agent_handle] -= 2.0
if num_transitions == 1:
if act != RailEnvActions.MOVE_FORWARD:
all_rewards[agent_handle] -= 1.0
if check_for_dealock(agent_handle, train_env, agent_positions):
all_rewards[agent_handle] -= 5.0
elif agent.status == RailAgentStatus.READY_TO_DEPART:
all_rewards[agent_handle] -= 5.0
else:
if True:
agent_positions = get_agent_positions(train_env)
for agent_handle in train_env.get_agent_handles():
agent = train_env.agents[agent_handle]
act = action_dict.get(agent_handle, RailEnvActions.MOVE_FORWARD)
if agent.status == RailAgentStatus.ACTIVE:
if done[agent_handle] == False:
if check_for_dealock(agent_handle, train_env, agent_positions):
all_rewards[agent_handle] -= 1000.0
done[agent_handle] = True
next_obs, all_rewards, done, info = train_env.step(map_actions(action_dict))
step_timer.end()
# Render an episode at some interval
......@@ -365,7 +319,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
learn_timer.start()
policy.step(agent_handle,
agent_prev_obs[agent_handle],
agent_prev_action[agent_handle],
map_action_policy(agent_prev_action[agent_handle]),
all_rewards[agent_handle],
agent_obs[agent_handle],
done[agent_handle])
......@@ -415,19 +369,19 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
policy.save_replay_buffer('./replay_buffers/' + training_id + '-' + str(episode_idx) + '.pkl')
# reset action count
action_count = [0] * action_size
action_count = [0] * get_flatland_full_action_size()
print(
'\r🚂 Episode {}'
'\t 🚉 nAgents {}'
'\t 🏆 Score: {:7.3f}'
'\t 🚉 nAgents {:2}/{:2}'
' 🏆 Score: {:7.3f}'
' Avg: {:7.3f}'
'\t 💯 Done: {:6.2f}%'
' Avg: {:6.2f}%'
'\t 🎲 Epsilon: {:.3f} '
'\t 🔀 Action Probs: {}'.format(
episode_idx,
train_env_params.n_agents,
train_env_params.n_agents, number_of_agents,
normalized_score,
smoothed_normalized_score,
100 * completion,
......@@ -488,6 +442,7 @@ def train_agent(train_params, train_env_params, eval_env_params, obs_params):
writer.add_scalar("timer/learn", learn_timer.get(), episode_idx)
writer.add_scalar("timer/preproc", preproc_timer.get(), episode_idx)
writer.add_scalar("timer/total", training_timer.get_current(), episode_idx)
writer.flush()
def format_action_prob(action_probs):
......@@ -517,6 +472,7 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
score = 0.0
obs, info = env.reset(regenerate_rail=True, regenerate_schedule=True)
policy.reset(env)
final_step = 0
policy.start_episode(train=False)
......@@ -530,10 +486,10 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
action = 0
if info['action_required'][agent]:
if tree_observation.check_is_observation_valid(agent_obs[agent]):
action = policy.act(agent_obs[agent], eps=0.0)
action = policy.act(agent, agent_obs[agent], eps=0.0)
action_dict.update({agent: action})
policy.end_step(train=False)
obs, all_rewards, done, info = env.step(action_dict)
obs, all_rewards, done, info = env.step(map_actions(action_dict))
for agent in env.get_agent_handles():
score += all_rewards[agent]
......@@ -559,17 +515,18 @@ def eval_policy(env, tree_observation, policy, train_params, obs_params):
if __name__ == "__main__":
parser = ArgumentParser()
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=25000, type=int)
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=0,
parser.add_argument("-n", "--n_episodes", help="number of episodes to run", default=5000, type=int)
parser.add_argument("--n_agent_fixed", help="hold the number of agent fixed", action='store_true')
parser.add_argument("-t", "--training_env_config", help="training config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=0,
parser.add_argument("-e", "--evaluation_env_config", help="evaluation config id (eg 0 for Test_0)", default=1,
type=int)
parser.add_argument("--n_evaluation_episodes", help="number of evaluation episodes", default=5, type=int)
parser.add_argument("--checkpoint_interval", help="checkpoint interval", default=200, type=int)
parser.add_argument("--n_evaluation_episodes", help="number of evaluation episodes", default=10, type=int)
parser.add_argument("--checkpoint_interval", help="checkpoint interval", default=100, type=int)
parser.add_argument("--eps_start", help="max exploration", default=1.0, type=float)
parser.add_argument("--eps_end", help="min exploration", default=0.05, type=float)
parser.add_argument("--eps_end", help="min exploration", default=0.01, type=float)
parser.add_argument("--eps_decay", help="exploration decay", default=0.9975, type=float)
parser.add_argument("--buffer_size", help="replay buffer size", default=int(1e7), type=int)
parser.add_argument("--buffer_size", help="replay buffer size", default=int(32_000), type=int)
parser.add_argument("--buffer_min_size", help="min buffer size to start training", default=0, type=int)
parser.add_argument("--restore_replay_buffer", help="replay buffer to restore", default="", type=str)
parser.add_argument("--save_replay_buffer", help="save replay buffer at each evaluation interval", default=False,
......@@ -587,6 +544,10 @@ if __name__ == "__main__":
parser.add_argument("--use_fast_tree_observation", help="use FastTreeObs instead of stock TreeObs",
action='store_true')
parser.add_argument("--max_depth", help="max depth", default=2, type=int)
parser.add_argument("--policy",
help="policy name [DDDQN, PPO, DeadLockAvoidance, DeadLockAvoidanceWithDecision, MultiDecision]",
default="DeadLockAvoidance")
parser.add_argument("--action_size", help="define the action size [reduced,full]", default="full", type=str)
training_params = parser.parse_args()
env_params = [
......
from flatland.envs.rail_env import RailEnv
from reinforcement_learning.dddqn_policy import DDDQNPolicy
from reinforcement_learning.policy import LearningPolicy, DummyMemory
from reinforcement_learning.ppo_agent import PPOPolicy
class MultiDecisionAgent(LearningPolicy):
def __init__(self, state_size, action_size, in_parameters=None):
print(">> MultiDecisionAgent")
super(MultiDecisionAgent, self).__init__()
self.state_size = state_size
self.action_size = action_size
self.in_parameters = in_parameters
self.memory = DummyMemory()
self.loss = 0
self.ppo_policy = PPOPolicy(state_size, action_size, use_replay_buffer=False, in_parameters=in_parameters)
self.dddqn_policy = DDDQNPolicy(state_size, action_size, in_parameters)
self.policy_selector = PPOPolicy(state_size, 2)
def step(self, handle, state, action, reward, next_state, done):
self.ppo_policy.step(handle, state, action, reward, next_state, done)
self.dddqn_policy.step(handle, state, action, reward, next_state, done)
select = self.policy_selector.act(handle, state, 0.0)
self.policy_selector.step(handle, state, select, reward, next_state, done)
def act(self, handle, state, eps=0.):
select = self.policy_selector.act(handle, state, eps)
if select == 0:
return self.dddqn_policy.act(handle, state, eps)
return self.policy_selector.act(handle, state, eps)
def save(self, filename):
self.ppo_policy.save(filename)
self.dddqn_policy.save(filename)
self.policy_selector.save(filename)
def load(self, filename):
self.ppo_policy.load(filename)
self.dddqn_policy.load(filename)
self.policy_selector.load(filename)
def start_step(self, train):
self.ppo_policy.start_step(train)
self.dddqn_policy.start_step(train)
self.policy_selector.start_step(train)
def end_step(self, train):
self.ppo_policy.end_step(train)
self.dddqn_policy.end_step(train)
self.policy_selector.end_step(train)
def start_episode(self, train):
self.ppo_policy.start_episode(train)
self.dddqn_policy.start_episode(train)
self.policy_selector.start_episode(train)
def end_episode(self, train):
self.ppo_policy.end_episode(train)
self.dddqn_policy.end_episode(train)
self.policy_selector.end_episode(train)
def load_replay_buffer(self, filename):
self.ppo_policy.load_replay_buffer(filename)
self.dddqn_policy.load_replay_buffer(filename)
self.policy_selector.load_replay_buffer(filename)
def test(self):
self.ppo_policy.test()
self.dddqn_policy.test()
self.policy_selector.test()
def reset(self, env: RailEnv):
self.ppo_policy.reset(env)
self.dddqn_policy.reset(env)
self.policy_selector.reset(env)
def clone(self):
multi_descision_agent = MultiDecisionAgent(
self.state_size,
self.action_size,
self.in_parameters
)
multi_descision_agent.ppo_policy = self.ppo_policy.clone()
multi_descision_agent.dddqn_policy = self.dddqn_policy.clone()
multi_descision_agent.policy_selector = self.policy_selector.clone()
return multi_descision_agent
import numpy as np
from flatland.envs.rail_env import RailEnv
from reinforcement_learning.policy import Policy
from reinforcement_learning.ppo_agent import PPOAgent
from reinforcement_learning.ppo_agent import PPOPolicy
from utils.dead_lock_avoidance_agent import DeadLockAvoidanceAgent
......@@ -12,7 +13,7 @@ class MultiPolicy(Policy):
self.memory = []
self.loss = 0
self.deadlock_avoidance_policy = DeadLockAvoidanceAgent(env, action_size, False)
self.ppo_policy = PPOAgent(state_size + action_size, action_size)
self.ppo_policy = PPOPolicy(state_size + action_size, action_size)
def load(self, filename):
self.ppo_policy.load(filename)
......@@ -45,9 +46,9 @@ class MultiPolicy(Policy):
self.loss = self.ppo_policy.loss
return action_ppo
def reset(self):
self.ppo_policy.reset()
self.deadlock_avoidance_policy.reset()
def reset(self, env: RailEnv):
self.ppo_policy.reset(env)
self.deadlock_avoidance_policy.reset(env)
def test(self):
self.ppo_policy.test()
......
......@@ -15,7 +15,7 @@ class OrderedPolicy(Policy):
def __init__(self):
self.action_size = 5
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
_, distance, _ = split_tree_into_feature_groups(state, 1)
distance = distance[1:]
min_dist = min_gt(distance, 0)
......
import torch.nn as nn
from flatland.envs.rail_env import RailEnv
class DummyMemory:
def __init__(self):
self.memory = []
def __len__(self):
return 0
class Policy:
def step(self, handle, state, action, reward, next_state, done):
raise NotImplementedError
def act(self, state, eps=0.):
def act(self, handle, state, eps=0.):
raise NotImplementedError
def save(self, filename):
......@@ -13,16 +22,16 @@ class Policy:
def load(self, filename):
raise NotImplementedError
def start_step(self,train):
def start_step(self, train):
pass
def end_step(self,train):
def end_step(self, train):
pass
def start_episode(self,train):
def start_episode(self, train):
pass
def end_episode(self,train):
def end_episode(self, train):
pass
def load_replay_buffer(self, filename):
......@@ -31,8 +40,23 @@ class Policy:
def test(self):
pass
def reset(self):
def reset(self, env: RailEnv):
pass
def clone(self):
return self
\ No newline at end of file
return self
class HeuristicPolicy(Policy):
def __init__(self):
super(HeuristicPolicy).__init__()
class LearningPolicy(Policy):
def __init__(self):
super(LearningPolicy).__init__()
class HybridPolicy(Policy):
def __init__(self):
super(HybridPolicy).__init__()
......@@ -3,18 +3,16 @@ import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
# Hyperparameters
from reinforcement_learning.policy import Policy
from reinforcement_learning.policy import LearningPolicy
from reinforcement_learning.replay_buffer import ReplayBuffer
device = torch.device("cpu") # "cuda:0" if torch.cuda.is_available() else "cpu")
print("device:", device)
# https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
class DataBuffers:
class EpisodeBuffers:
def __init__(self):
self.reset()
......@@ -36,15 +34,17 @@ class DataBuffers:
class ActorCriticModel(nn.Module):
def __init__(self, state_size, action_size, hidsize1=128, hidsize2=128):
def __init__(self, state_size, action_size, device, hidsize1=512, hidsize2=256):
super(ActorCriticModel, self).__init__()
self.device = device
self.actor = nn.Sequential(
nn.Linear(state_size, hidsize1),
nn.Tanh(),
nn.Linear(hidsize1, hidsize2),
nn.Tanh(),
nn.Linear(hidsize2, action_size)
)
nn.Linear(hidsize2, action_size),
nn.Softmax(dim=-1)
).to(self.device)
self.critic = nn.Sequential(
nn.Linear(state_size, hidsize1),
......@@ -52,18 +52,18 @@ class ActorCriticModel(nn.Module):
nn.Linear(hidsize1, hidsize2),
nn.Tanh(),
nn.Linear(hidsize2, 1)
)
).to(self.device)
def forward(self, x):
raise NotImplementedError
def act_prob(self, states, softmax_dim=0):
x = self.actor(states)
prob = F.softmax(x, dim=softmax_dim)
return prob
def get_actor_dist(self, state):
action_probs = self.actor(state)
dist = Categorical(action_probs)
return dist
def evaluate(self, states, actions):
action_probs = self.act_prob(states)
action_probs = self.actor(states)
dist = Categorical(action_probs)
action_logprobs = dist.log_prob(actions)
dist_entropy = dist.entropy()
......@@ -79,49 +79,96 @@ class ActorCriticModel(nn.Module):
if os.path.exists(filename):
print(' >> ', filename)
try:
obj.load_state_dict(torch.load(filename, map_location=device))
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
return obj
def load(self, filename):
print("load policy from file", filename)
print("load model from file", filename)
self.actor = self._load(self.actor, filename + ".actor")
self.critic = self._load(self.critic, filename + ".critic")
self.critic = self._load(self.critic, filename + ".value")
class PPOAgent(Policy):
def __init__(self, state_size, action_size):
super(PPOAgent, self).__init__()
class PPOPolicy(LearningPolicy):
def __init__(self, state_size, action_size, use_replay_buffer=False, in_parameters=None):
print(">> PPOPolicy")
super(PPOPolicy, self).__init__()
# parameters
self.learning_rate = 0.1e-3
self.gamma = 0.98
self.ppo_parameters = in_parameters
if self.ppo_parameters is not None:
self.hidsize = self.ppo_parameters.hidden_size
self.buffer_size = self.ppo_parameters.buffer_size
self.batch_size = self.ppo_parameters.batch_size
self.learning_rate = self.ppo_parameters.learning_rate
self.gamma = self.ppo_parameters.gamma
# Device
if self.ppo_parameters.use_gpu and torch.cuda.is_available():
self.device = torch.device("cuda:0")
# print("🐇 Using GPU")
else:
self.device = torch.device("cpu")
# print("🐢 Using CPU")
else:
self.hidsize = 128
self.learning_rate = 1.0e-3
self.gamma = 0.95
self.buffer_size = 32_000
self.batch_size = 1024
self.device = torch.device("cpu")
self.surrogate_eps_clip = 0.1
self.K_epoch = 3
self.weight_loss = 0.9
self.K_epoch = 10
self.weight_loss = 0.5
self.weight_entropy = 0.01
# objects
self.memory = DataBuffers()
self.buffer_min_size = 0
self.use_replay_buffer = use_replay_buffer
self.current_episode_memory = EpisodeBuffers()
self.memory = ReplayBuffer(action_size, self.buffer_size, self.batch_size, self.device)
self.loss = 0
self.actor_critic_model = ActorCriticModel(state_size, action_size)
self.actor_critic_model = ActorCriticModel(state_size, action_size, self.device,
hidsize1=self.hidsize,
hidsize2=self.hidsize)
self.optimizer = optim.Adam(self.actor_critic_model.parameters(), lr=self.learning_rate)
self.lossFunction = nn.MSELoss()
self.loss_function = nn.MSELoss() # nn.SmoothL1Loss()
def reset(self):
def reset(self, env):
pass
def act(self, state, eps=None):
def act(self, handle, state, eps=None):
# sample an action to take
prob = self.actor_critic_model.act_prob(torch.from_numpy(state).float())
return Categorical(prob).sample().item()
torch_state = torch.tensor(state, dtype=torch.float).to(self.device)
dist = self.actor_critic_model.get_actor_dist(torch_state)
action = dist.sample()
return action.item()
def step(self, handle, state, action, reward, next_state, done):
# record transitions ([state] -> [action] -> [reward, nextstate, done])
prob = self.actor_critic_model.act_prob(torch.from_numpy(state).float())
transition = (state, action, reward, next_state, prob[action].item(), done)
self.memory.push_transition(handle, transition)
# record transitions ([state] -> [action] -> [reward, next_state, done])
torch_action = torch.tensor(action, dtype=torch.float).to(self.device)
torch_state = torch.tensor(state, dtype=torch.float).to(self.device)
# evaluate actor
dist = self.actor_critic_model.get_actor_dist(torch_state)
action_logprobs = dist.log_prob(torch_action)
transition = (state, action, reward, next_state, action_logprobs.item(), done)
self.current_episode_memory.push_transition(handle, transition)
def _push_transitions_to_replay_buffer(self,
state_list,
action_list,
reward_list,
state_next_list,
done_list,
prob_a_list):
for idx in range(len(reward_list)):
state_i = state_list[idx]
action_i = action_list[idx]
reward_i = reward_list[idx]
state_next_i = state_next_list[idx]
done_i = done_list[idx]
prob_action_i = prob_a_list[idx]
self.memory.add(state_i, action_i, reward_i, state_next_i, done_i, prob_action_i)
def _convert_transitions_to_torch_tensors(self, transitions_array):
# build empty lists(arrays)
......@@ -138,60 +185,87 @@ class PPOAgent(Policy):
discounted_reward = 0
done_list.insert(0, 1)
else:
discounted_reward = reward_i + self.gamma * discounted_reward
done_list.insert(0, 0)
discounted_reward = reward_i + self.gamma * discounted_reward
reward_list.insert(0, discounted_reward)
state_next_list.insert(0, state_next_i)
prob_a_list.insert(0, prob_action_i)
if self.use_replay_buffer:
self._push_transitions_to_replay_buffer(state_list, action_list,
reward_list, state_next_list,
done_list, prob_a_list)
# convert data to torch tensors
states, actions, rewards, states_next, dones, prob_actions = \
torch.tensor(state_list, dtype=torch.float).to(device), \
torch.tensor(action_list).to(device), \
torch.tensor(reward_list, dtype=torch.float).to(device), \
torch.tensor(state_next_list, dtype=torch.float).to(device), \
torch.tensor(done_list, dtype=torch.float).to(device), \
torch.tensor(prob_a_list).to(device)
# standard-normalize rewards
rewards = (rewards - rewards.mean()) / (rewards.std() + 1.e-5)
torch.tensor(state_list, dtype=torch.float).to(self.device), \
torch.tensor(action_list).to(self.device), \
torch.tensor(reward_list, dtype=torch.float).to(self.device), \
torch.tensor(state_next_list, dtype=torch.float).to(self.device), \
torch.tensor(done_list, dtype=torch.float).to(self.device), \
torch.tensor(prob_a_list).to(self.device)
return states, actions, rewards, states_next, dones, prob_actions
def _get_transitions_from_replay_buffer(self, states, actions, rewards, states_next, dones, probs_action):
if len(self.memory) > self.buffer_min_size and len(self.memory) > self.batch_size:
states, actions, rewards, states_next, dones, probs_action = self.memory.sample()
actions = torch.squeeze(actions)
rewards = torch.squeeze(rewards)
states_next = torch.squeeze(states_next)
dones = torch.squeeze(dones)
probs_action = torch.squeeze(probs_action)
return states, actions, rewards, states_next, dones, probs_action
def train_net(self):
for handle in range(len(self.memory)):
agent_episode_history = self.memory.get_transitions(handle)
# All agents have to propagate their experiences made during past episode
for handle in range(len(self.current_episode_memory)):
# Extract agent's episode history (list of all transitions)
agent_episode_history = self.current_episode_memory.get_transitions(handle)
if len(agent_episode_history) > 0:
# convert the replay buffer to torch tensors (arrays)
# Convert the replay buffer to torch tensors (arrays)
states, actions, rewards, states_next, dones, probs_action = \
self._convert_transitions_to_torch_tensors(agent_episode_history)
# Optimize policy for K epochs:
for _ in range(self.K_epoch):
# evaluating actions (actor) and values (critic)
for k_loop in range(int(self.K_epoch)):
if self.use_replay_buffer:
states, actions, rewards, states_next, dones, probs_action = \
self._get_transitions_from_replay_buffer(
states, actions, rewards, states_next, dones, probs_action
)
# Evaluating actions (actor) and values (critic)
logprobs, state_values, dist_entropy = self.actor_critic_model.evaluate(states, actions)
# finding the ratios (pi_thetas / pi_thetas_replayed):
# Finding the ratios (pi_thetas / pi_thetas_replayed):
ratios = torch.exp(logprobs - probs_action.detach())
# finding Surrogate Loss:
# Finding Surrogate Loss
advantages = rewards - state_values.detach()
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - self.surrogate_eps_clip, 1 + self.surrogate_eps_clip) * advantages
surr2 = torch.clamp(ratios, 1. - self.surrogate_eps_clip, 1. + self.surrogate_eps_clip) * advantages
# The loss function is used to estimate the gradient. An entropy-based heuristic penalizes
# the loss when the policy becomes deterministic, because in that case the gradient would
# become very flat and therefore no longer useful.
loss = \
-torch.min(surr1, surr2) \
+ self.weight_loss * self.lossFunction(state_values, rewards) \
+ self.weight_loss * self.loss_function(state_values, rewards) \
- self.weight_entropy * dist_entropy
# make a gradient step
# Make a gradient step
self.optimizer.zero_grad()
loss.mean().backward()
self.optimizer.step()
# store current loss to the agent
self.loss = loss.mean().detach().numpy()
# Transfer the current loss to the agent's loss (information), for debugging purposes only
self.loss = loss.mean().detach().cpu().numpy()
self.memory.reset()
# Reset all collected transition data
self.current_episode_memory.reset()
def end_episode(self, train):
if train:
......@@ -207,9 +281,11 @@ class PPOAgent(Policy):
if os.path.exists(filename):
print(' >> ', filename)
try:
obj.load_state_dict(torch.load(filename, map_location=device))
obj.load_state_dict(torch.load(filename, map_location=self.device))
except:
print(" >> failed!")
else:
print(" >> file not found!")
return obj
def load(self, filename):
......@@ -219,7 +295,7 @@ class PPOAgent(Policy):
self.optimizer = self._load(self.optimizer, filename + ".optimizer")
def clone(self):
policy = PPOAgent(self.state_size, self.action_size)
policy = PPOPolicy(self.state_size, self.action_size)
policy.actor_critic_model = copy.deepcopy(self.actor_critic_model)
policy.optimizer = copy.deepcopy(self.optimizer)
return self