Each node is filled with information gathered along the path to the node. Currently each node contains 9 features.
For training purposes the tree is flattened into a single array (with depth 2 and 9 features per node this yields 9 × (1 + 4 + 16) = 189 values, as computed in the state-size snippet below).
## Training
### Setting up the environment
Let us now train a simple double dueling DQN agent to navigate to its target on Flatland. We start by importing the necessary modules:
```
import numpy as np

from flatland.envs.generators import complex_rail_generator
from flatland.envs.observations import TreeObsForRailEnv
from flatland.envs.rail_env import RailEnv
from flatland.utils.rendertools import RenderTool
from utils.observation_utils import norm_obs_clip, split_tree
```
For this simple example we want to train on randomly generated levels using the `complex_rail_generator`. We use the following parameters for our first experiment:
```
# Parameters for the environment
x_dim = 10    # width of the grid
y_dim = 10    # height of the grid
n_agents = 1  # number of trains
n_goals = 5   # number of start-goal pairs the generator creates
min_dist = 5  # minimum distance between a start and its goal
```
As mentioned above, for this experiment we are going to use the tree observation and thus we load the observation builder:
```
# We are training an Agent using the Tree Observation with depth 2
observation_builder = TreeObsForRailEnv(max_depth=2)
```
And pass it as an argument to the environment setup:
```
env = RailEnv(width=x_dim,
              height=y_dim,
              rail_generator=complex_rail_generator(nr_start_goal=n_goals,
                                                    nr_extra=5,
                                                    min_dist=min_dist,
                                                    max_dist=99999,
                                                    seed=0),
              obs_builder_object=observation_builder,
              number_of_agents=n_agents)
```
We have now successfully set up the environment for training. To visualize it we also initialize the renderer:
```
env_renderer = RenderTool(env, gl="PILSVG")
```
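If you want to inspect a freshly generated level before training starts, you can reset the environment once and render a single frame, using the same `renderEnv` call that appears in the training loop below:
```
# Quick sanity check: generate a level and draw it once
obs = env.reset()
env_renderer.renderEnv(show=True, show_observations=True)
```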
### Setting up the agent
To set up an appropriate agent we need the sizes of the state and action spaces. From the discussion of the tree observation above we end up with:
```
# Given the depth of the tree observation and the number of features per node we get the following state_size
features_per_node = 9
tree_depth = 2
nr_nodes = 0
for i in range(tree_depth + 1):
    nr_nodes += np.power(4, i)             # 1 + 4 + 16 = 21 nodes for a depth-2 tree
state_size = features_per_node * nr_nodes  # 9 * 21 = 189

# The action space of flatland is 5 discrete actions
action_size = 5
```
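The `agent` used in the training loop below is the double dueling DQN agent from the baselines repository; its implementation is not reproduced here. To give an idea of the dueling part, here is a minimal, illustrative PyTorch sketch of a dueling Q-network, not the repository's actual code:
```
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Illustrative dueling Q-network: separate value and advantage streams."""

    def __init__(self, state_size, action_size, hidden_size=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_size, hidden_size), nn.ReLU())
        self.value = nn.Linear(hidden_size, 1)                 # state value V(s)
        self.advantage = nn.Linear(hidden_size, action_size)   # advantages A(s, a)

    def forward(self, state):
        x = self.shared(state)
        v = self.value(x)
        a = self.advantage(x)
        # Combine the streams; subtracting the mean advantage keeps Q identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```
The "double" part refers to using the online network to select the greedy action and the target network to evaluate it when computing the TD target, which reduces Q-value overestimation.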
In the `training_navigation.py` file you will find further variables that we initialize in order to keep track of the training progress.
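For reference, the training loop below assumes bookkeeping variables along these lines; the names match the loop, but the values shown here are illustrative, not the ones from the repository:
```
n_trials = 1000                              # number of training episodes
max_steps = 3 * (x_dim + y_dim)              # per-episode step budget (illustrative)
eps, eps_end, eps_decay = 1.0, 0.005, 0.998  # epsilon-greedy schedule (illustrative)
Training = True                              # set to False to render a trained agent
action_dict = dict()                         # maps agent handle -> chosen action
action_prob = [0] * action_size              # counts how often each action is taken
agent_obs = [None] * env.get_num_agents()    # current normalized observations
agent_next_obs = [None] * env.get_num_agents()
```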
Below you see example code to train an agent. It is important to note that we reshape and normalize the tree observation provided by the environment to facilitate training.
To do so, we use the utility functions `split_tree(tree=np.array(obs[a]), num_features_per_node=features_per_node, current_depth=0)` and `norm_obs_clip()`. Feel free to modify the normalization as you see fit.
```
# Split the observation tree into its parts and normalize the observation using the utility functions.
# Build agent specific local observation
for a in range(env.get_num_agents()):
    rail_data, distance_data, agent_data = split_tree(tree=np.array(obs[a]),
                                                      num_features_per_node=features_per_node,
                                                      current_depth=0)
    rail_data = norm_obs_clip(rail_data)
    distance_data = norm_obs_clip(distance_data)
    agent_data = np.clip(agent_data, -1, 1)
    agent_obs[a] = np.concatenate((np.concatenate((rail_data, distance_data)), agent_data))
```
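`norm_obs_clip` is provided by `utils/observation_utils.py`. For intuition only, a simplified stand-in could look like the following; this is a hypothetical sketch, not the shipped implementation:
```
import numpy as np

def norm_obs_clip_sketch(obs, clip_min=-1, clip_max=1):
    # Hypothetical sketch: ignore sentinel "infinity" entries, scale the rest
    # to [0, 1], then clip everything into [clip_min, clip_max].
    obs = np.asarray(obs, dtype=float)
    finite = obs[obs < 1000]   # assume very large values mark unreachable branches
    max_obs = finite.max() if finite.size else 1.0
    return np.clip(obs / max(max_obs, 1.0), clip_min, clip_max)
```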
We now use the normalized `agent_obs` for our training loop:
```
for trials in range(1, n_trials + 1):

    # Reset environment
    obs = env.reset(True, True)
    if not Training:
        env_renderer.set_new_rail()

    # Split the observation tree into its parts and normalize the observation using the utility functions.
    # Build agent specific local observation
    for a in range(env.get_num_agents()):
        rail_data, distance_data, agent_data = split_tree(tree=np.array(obs[a]),
                                                          num_features_per_node=features_per_node,
                                                          current_depth=0)
        rail_data = norm_obs_clip(rail_data)
        distance_data = norm_obs_clip(distance_data)
        agent_data = np.clip(agent_data, -1, 1)
        agent_obs[a] = np.concatenate((np.concatenate((rail_data, distance_data)), agent_data))

    # Reset score and done
    score = 0
    env_done = 0

    # Run episode
    for step in range(max_steps):

        # Only render when not training
        if not Training:
            env_renderer.renderEnv(show=True, show_observations=True)

        # Choose the actions
        for a in range(env.get_num_agents()):
            if not Training:
                eps = 0

            action = agent.act(agent_obs[a], eps=eps)
            action_dict.update({a: action})

            # Count the number of actions taken for statistics
            action_prob[action] += 1

        # Environment step
        next_obs, all_rewards, done, _ = env.step(action_dict)

        for a in range(env.get_num_agents()):
            rail_data, distance_data, agent_data = split_tree(tree=np.array(next_obs[a]),
                                                              num_features_per_node=features_per_node,
                                                              current_depth=0)
            rail_data = norm_obs_clip(rail_data)
            distance_data = norm_obs_clip(distance_data)
            agent_data = np.clip(agent_data, -1, 1)
            agent_next_obs[a] = np.concatenate((np.concatenate((rail_data, distance_data)), agent_data))

        # Update replay buffer and train agent
        for a in range(env.get_num_agents()):

            # Remember and train agent
            if Training:
                agent.step(agent_obs[a], action_dict[a], all_rewards[a], agent_next_obs[a], done[a])

            # Update the current score
            score += all_rewards[a] / env.get_num_agents()

        agent_obs = agent_next_obs.copy()
        if done['__all__']:
            env_done = 1
            break

    # Epsilon decay
    eps = max(eps_end, eps_decay * eps)  # decrease epsilon after each episode
```
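As a quick sanity check of the schedule: with the illustrative `eps_decay = 0.998` from above and `eps` starting at 1.0, epsilon drops to roughly 0.37 after 500 episodes and to about 0.135 after 1000 (0.998^1000 ≈ 0.135), so exploration fades gradually rather than switching off abruptly.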
Running the `training_navigation.py` file trains a simple agent to navigate to any random target within the railway network. After training you should see a learning curve similar to this one:
*(learning curve plot)*