From 8d2070f7f4dd7dea8f3e851808044403b3245a1f Mon Sep 17 00:00:00 2001
From: u229589 <christian.baumberger@sbb.ch>
Date: Thu, 17 Oct 2019 11:56:42 +0200
Subject: [PATCH] adjust documentation on how to call the reset method in
 RailEnv

---
 changelog.md                                  |   1 +
 docs/specifications/railway.md                | 121 +++++++++---------
 docs/tutorials/01_gettingstarted.rst          |   6 +-
 docs/tutorials/02_observationbuilder.rst      |   4 +
 .../03_rail_and_schedule_generator.md         |  12 +-
 docs/tutorials/04_stochasticity.md            |   9 +-
 6 files changed, 84 insertions(+), 69 deletions(-)

diff --git a/changelog.md b/changelog.md
index 4cb76d2e..3d256c1d 100644
--- a/changelog.md
+++ b/changelog.md
@@ -9,6 +9,7 @@ Changes since Flatland 2.0.0
 
 ### Changes in rail generator and `RailEnv`
 - renaming of `distance_maps` into `distance_map`
+- by default, the `reset` method of `RailEnv` is no longer called in the constructor; it must therefore be called explicitly after the creation of a `RailEnv` object
 
 Changes since Flatland 1.0.0
 --------------------------
diff --git a/docs/specifications/railway.md b/docs/specifications/railway.md
index ec707f87..69abd15a 100644
--- a/docs/specifications/railway.md
+++ b/docs/specifications/railway.md
@@ -11,15 +11,15 @@ This documentation illustrates the dynamics and possibilities of Flatland enviro
 
 ### Environment
 
-Before describing the Flatland at hand, let us first define terms which will be used in this specification. Flatland is grid-like n-dimensional space of any size. A cell is the elementary element of the grid.  The cell is defined as a location where any objects can be located at. The term agent is defined as an entity that can move within the grid and must solve tasks. An agent can move in any arbitrary direction on well-defined transitions from cells to cell. The cell where the agent is located at must have enough capacity to hold the agent on. Every agent reserves exact one capacity or resource. The capacity of a cell is usually one. Thus usually only one agent can be at same time located at a given cell. The agent movement possibility can be restricted by limiting the allowed transitions. 
+Before describing the Flatland at hand, let us first define the terms used in this specification. Flatland is a grid-like n-dimensional space of any size. A cell is the elementary element of the grid and is defined as a location where objects can be placed. An agent is an entity that can move within the grid and must solve tasks. An agent can move in any direction along well-defined transitions from cell to cell. The cell where an agent is located must have enough capacity to hold it; every agent occupies exactly one unit of capacity (resource). The capacity of a cell is usually one, so usually only one agent can occupy a given cell at a time. Agent movement can be restricted by limiting the allowed transitions.
 
 Flatland is a discrete time simulation. A discrete time simulation performs all actions with constant time step. In Flatland the simulation step moves the time forward in equal duration of time. At each step the agents can choose an action. For the chosen action the attached transition will be executed. While executing a transition Flatland checks whether the requested transition is valid. If the transition is valid the transition will update the agents position. In case the transition call is not allowed the agent will not move.
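The validity check performed in each simulation step can be sketched as follows (a minimal illustration with hypothetical names, not the actual Flatland code):

```python
# Hypothetical sketch of the step semantics described above: the
# transition attached to the chosen action is validated against the
# current cell, and the agent only moves when it is allowed.
OFFSETS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def apply_action(pos, heading, allowed):
    """Return the agent's new position, or the old one if the requested
    transition is not valid at the current cell."""
    if heading not in allowed.get(pos, set()):
        return pos  # invalid transition: the agent does not move
    dr, dc = OFFSETS[heading]
    return (pos[0] + dr, pos[1] + dc)

# A straight North-South passage at cell (1, 1):
rails = {(1, 1): {"N", "S"}}
print(apply_action((1, 1), "N", rails))  # (0, 1): valid, agent moves
print(apply_action((1, 1), "E", rails))  # (1, 1): invalid, agent stays
```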
 
-In general each cell has a only one cell type attached. With the help of the cell type the allowed transitions can be defined for all agents. 
+In general each cell has only one cell type attached. The cell type defines the allowed transitions for all agents.
 
-Flatland supports many different types of agents. In consequence the cell type can be further defined per agent type. In consequence the allowed transition for a agent at a given cell is now defined by the cell type and agent's type.  
+Flatland supports many different types of agents. The cell type can therefore be further refined per agent type, so that the allowed transitions for an agent at a given cell are defined by both the cell type and the agent's type.
 
-For each agent type Flatland can have a different action space. 
+For each agent type Flatland can have a different action space.
 
 
 #### Grid
@@ -33,20 +33,20 @@ Within this documentation we use North, East, West, South as orientation indicat
 
 Cells are enumerated starting from NW, East-West axis is the second coordinate, North-South is the first coordinate as commonly used in matrix notation.
 
-Two cells $`i`$ and $`j`$ ($`i \neq j`$) are considered neighbors when the Euclidean distance between them is $`|\vec{x_i}-\vec{x_j}<= \sqrt{2}|`$. This means that the grid does not wrap around as if on a torus. (Two cells are considered neighbors when they share one edge or on node.) 
+Two cells $`i`$ and $`j`$ ($`i \neq j`$) are considered neighbors when the Euclidean distance between them is $`|\vec{x_i}-\vec{x_j}| \leq \sqrt{2}`$. This means that the grid does not wrap around as if on a torus. (Two cells are considered neighbors when they share one edge or one node.)
 
 ![cell_table](https://drive.google.com/uc?export=view&id=109cD1uihDvTWnQ7PPTxC9AiNphlsY92r)
 
 For each cell the allowed transitions to all neighboring 4 cells are defined. This can be extended to include transition probabilities as well.
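The neighborhood rule above can be sketched in a few lines (an illustration, not part of Flatland; squared distances avoid floating-point comparisons for integer grid coordinates):

```python
# Two distinct cells are neighbors iff their Euclidean distance is at
# most sqrt(2), i.e. they share an edge or a corner; the grid does not
# wrap around. Since d <= sqrt(2) iff d^2 <= 2, we compare squared
# distances, which is exact for integer coordinates.
def are_neighbors(cell_i, cell_j):
    if cell_i == cell_j:
        return False
    dr = cell_i[0] - cell_j[0]
    dc = cell_i[1] - cell_j[1]
    return dr * dr + dc * dc <= 2

print(are_neighbors((0, 0), (0, 1)))  # True: shared edge
print(are_neighbors((0, 0), (1, 1)))  # True: shared node (corner)
print(are_neighbors((0, 0), (0, 2)))  # False: distance 2
```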
 
 
-#### Tile Types 
+#### Tile Types
 
 ###### Railway Grid
 
-Each Cell within the simulation grid consists of a distinct tile type which in turn limit the movement possibilities of the agent through the cell. For railway specific problem 8 basic tile types can be defined which describe a rail network. As a general fact in railway network when on navigation choice must be taken at maximum two options are available. 
+Each cell within the simulation grid has a distinct tile type which limits the movement possibilities of the agent through the cell. For the railway-specific problem, 8 basic tile types can be defined which describe a rail network. As a general fact, whenever a navigation choice must be taken in a railway network, at most two options are available.
 
-The following image gives an overview of the eight basic types. These can be rotated in steps of 45° and mirrored along the North-South of East-West axis. Please refer to Appendix A for a complete list of tiles. 
+The following image gives an overview of the eight basic types. These can be rotated in steps of 45° and mirrored along the North-South or East-West axis. Please refer to Appendix A for a complete list of tiles.
 
 
 ![cell_types](https://drive.google.com/uc?export=view&id=164iowmfRQ9O34hquxLhO2xxt49NE473P)
@@ -56,9 +56,9 @@ As a general consistency rule, it can be said that each connection out of a tile
 
 ![consistency_rule](https://drive.google.com/uc?export=view&id=1iaMIokHZ9BscMJ_Vi9t8QX_-8DzOjBKE)
 
-In the image above on the left picture there is an inconsistency at the eastern end of cell (3,2) since the there is no valid neighbor for cell (3,2). In the right picture a Cell (3,2) consists of a dead-end which leaves no unconnected transitions. 
+In the left picture above there is an inconsistency at the eastern end of cell (3,2), since there is no valid neighbor for cell (3,2). In the right picture, cell (3,2) contains a dead-end, which leaves no unconnected transitions.
 
-Case 0 represents a wall, thus no agent can occupy the tile at any time. 
+Case 0 represents a wall, thus no agent can occupy the tile at any time.
 
 Case 1 represent a passage through the tile. While on the tile the agent on can make no navigation decision. The agent can only decide to either continue, i.e. passing on to the next connected tile, wait or move backwards (moving the tile visited before).
 
@@ -66,20 +66,20 @@ Case 2 represents a simple switch thus when coming the top position (south in th
 
 Case 3 can be seen as a superposition of Case 1. As with any other tile at maximum one agent can occupy the cell at a given time.
 
-Case 4 represents a single-slit switch. In the example a navigation choice is possible when coming from West or South. 
+Case 4 represents a single-slip switch. In the example a navigation choice is possible when coming from West or South.
 
-In Case 5 coming from all direction a navigation choice must be taken. 
+In Case 5 a navigation choice must be taken when coming from any direction.
 
-Case 7 represents a deadend, thus only stop or backwards motion is possible when an agent occupies this cell. 
+Case 7 represents a dead-end; only stopping or backwards motion is possible when an agent occupies this cell.
 
 
 ###### Tile Types of Wall-Based Cell Games (Theseus and Minotaur's puzzle, Labyrinth Game)
 
-The Flatland approach can also be used the describe a variety of cell based logic games. While not going into any detail at all it is still worthwhile noting that the games are usually visualized using cell grid with wall describing forbidden transitions (negative formulation). 
+The Flatland approach can also be used to describe a variety of cell-based logic games. Without going into detail, it is worth noting that such games are usually visualized using a cell grid with walls describing forbidden transitions (negative formulation).
 
 ![minotaurus](https://drive.google.com/uc?export=view&id=1WbU6YGopLKqAjVD6-r9UhCIzDfLisb5U)
 
-Left: Wall-based Grid definition (negative definition), Right: lane-based Grid definition (positive definition) 
+Left: Wall-based Grid definition (negative definition), Right: lane-based Grid definition (positive definition)
 
 
 ## Train Traffic Management
@@ -155,12 +155,12 @@ Given the complexity and the high dependence of the multi-agent system a communi
 *   Communication must converge in a feasible time
 *   Communication…
 
-Depending on the game configuration every agent can be informed about the position of the other agents present in the respective observation range. For a local observation space the agent knows the distance to the next agent (defined with the agent type) in each direction. If no agent is present the the distance can simply be -1 or null. 
+Depending on the game configuration every agent can be informed about the positions of the other agents present in the respective observation range. For a local observation space the agent knows the distance to the next agent (per agent type) in each direction. If no agent is present, the distance can simply be -1 or null.
 
 
-#### Action Negotiation 
+#### Action Negotiation
 
-In order to avoid illicit situations ( for example agents crashing into each other) the intended actions for each agent in the observation range is known. Depending on the known movement intentions new movement intention must be generated by the agents. This is called a negotiation round. After a fixed amount of negotiation round the last intended action is executed for each agent. An illicit situation results in ending the game with a fixed low rewards. 
+In order to avoid illicit situations (for example agents crashing into each other), the intended action of each agent in the observation range is known. Based on the known movement intentions, new movement intentions must be generated by the agents. This is called a negotiation round. After a fixed number of negotiation rounds the last intended action is executed for each agent. An illicit situation ends the game with a fixed low reward.
 
 
 ### Actions
@@ -168,7 +168,7 @@ In order to avoid illicit situations ( for example agents crashing into each oth
 
 #### Navigation
 
-The agent can be located at any cell except on case 0 cells. The agent can move along the rails to another unoccupied cell or it can just wait where he is currently located at.  
+The agent can be located at any cell except on case 0 cells. The agent can move along the rails to another unoccupied cell, or it can simply wait at its current location.
 
 Flatland is a discrete time simulation. A discrete time simulation performs all actions in a discrete time with constant time step. In Flatland the simulation step is fixed and the time moves forward in equal duration of time. At each step every agent can choose an action. For the chosen action the attached transition will be executed. While executing a transition Flatland checks whether the requested transition is valid. If the transition is valid the transition will update the agents position. In case the transition call is not allowed the agent will not move.
 
@@ -176,14 +176,14 @@ If the agent calls an action and the attached transition is not allowed at curre
 
 An agent can move with a definable maximum speed. The default and absolute maximum speed is one spatial unit per time step. If an agent is defined to move slower, it can take a navigation action only ever N steps with N being an integer. For the transition to be made the same action must be taken N times consecutively. An agent can also have a maximum speed of 0 defined, thus it can never take a navigation step. This would be the case where an agent represents a good to be transported which can never move on its own.
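The fractional-speed mechanics can be sketched as follows (illustrative names, not the Flatland API): an agent with speed 1/N advances its in-cell fraction each step and completes the cell transition only on the N-th consecutive step.

```python
# Illustrative sketch (not the actual Flatland code): an agent moving at
# speed 1/N advances its positional fraction by `speed` per step and
# only changes cell once the fraction reaches 1.
def advance(position_fraction, speed):
    """Return (new_fraction, cell_transition_completed)."""
    position_fraction += speed
    if position_fraction >= 1.0:
        return 0.0, True   # the agent enters the next cell this step
    return position_fraction, False

frac, moved = 0.0, False
steps = 0
while not moved:            # a speed-0.25 agent needs 4 steps per cell
    frac, moved = advance(frac, 0.25)
    steps += 1
print(steps)  # 4
```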
 
-An agent can be defined to be picked up/dropped off by another agent or to pick up/drop off another agent. When agent A is picked up by another agent B it is said that A is linked to B. The linked agent loses all its navigation possibilities. On the other side it inherits the position from the linking agent for the time being linked. Linking and unlinking between two agents is only possible the participating agents have the same space-time coordinates for the linking and unlinking action.  
+An agent can be defined to be picked up/dropped off by another agent or to pick up/drop off another agent. When agent A is picked up by another agent B, A is said to be linked to B. The linked agent loses all its navigation possibilities; in return it inherits the position of the linking agent for as long as it is linked. Linking and unlinking between two agents is only possible if the participating agents have the same space-time coordinates for the linking and unlinking action.
 
 
 #### Transportation
 
-In railway the transportation of goods or passengers is essential. Consequently agents can transport goods or passengers. It's depending on the agent's type. If the agent is a freight train, it will transport goods. It's passenger train it will transport passengers only.  But the transportation capacity for both kind of trains limited. Passenger trains have a maximum number of seats restriction. The freight trains have a maximal number of tons restriction. 
+In railways, the transportation of goods or passengers is essential, so agents can transport goods or passengers depending on the agent's type. A freight train transports goods; a passenger train transports passengers only. The transportation capacity of both kinds of trains is limited: passenger trains have a maximum number of seats, and freight trains a maximum number of tons.
 
-Passenger can take or switch trains only at stations. Passengers are agents with traveling needs.  A common passenger like to move from a starting location to a destination and it might like using trains or walking. Consequently a future Flatland must also support passenger movement (walk) in the grid and not only by using train. The goal of a passenger is to reach in an optimal manner its destination.  The quality of traveling is measured by the reward function.     
+Passengers can take or switch trains only at stations. Passengers are agents with traveling needs. A typical passenger wants to move from a starting location to a destination and may do so by train or on foot. Consequently, a future Flatland must also support passenger movement (walking) in the grid, not only travel by train. The goal of a passenger is to reach its destination in an optimal manner. The quality of traveling is measured by the reward function.
 
 Goods will be only transported over the railway network. Goods are agents with transportation needs. They can start their transportation chain at any station. Each good has a station as the destination attached. The destination is the end of the transportation. It's the transportation goal. Once a good reach its destination it will disappear. Disappearing mean the goods leave Flatland. Goods can't move independently on the grid. They can only move by using trains. They can switch trains at any stations. The goal of the system is to find for goods the right trains to get a feasible transportation chain.  The quality of the transportation chain is measured by the reward function.
 
@@ -216,7 +216,7 @@ The environment should allow for a broad class of problem instances. Thus the co
 
 For the train traffic the configurations should be as follows:
 
-Cell types: Case 0 - 7 
+Cell types: Case 0 - 7
 
 Agent Types allowed: Active Agents with Speed 1 and no goals, Passive agents with goals
 
@@ -236,14 +236,14 @@ It should be check prior to solving the problem that the Goal location for each
 
 #### Railway-specific Use-Cases
 
-A first idea for a Cost function for generic applicability is as follows. For each agent and each goal sum up 
+A first idea for a cost function with generic applicability is as follows. For each agent and each goal, sum up
 
 
 
 *   The timestep when the goal has been reached when not target time is given in the goal.
 *   The absolute value of the difference between the target time and the arrival time of the agent.
 
-An additional refinement proven meaningful for situations where not target time is given is to weight the longest arrival time higher as the sum off all arrival times. 
+An additional refinement, proven meaningful for situations where no target time is given, is to weight the longest arrival time higher than the sum of all arrival times.
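Our reading of the two bullet points and the refinement can be sketched as follows (the weighting scheme and names are assumptions for illustration, not an official definition):

```python
# Hypothetical sketch of the proposed cost function: per agent/goal, use
# the arrival time when no target time is given, otherwise the absolute
# deviation from the target; when no target times exist at all, weight
# the latest arrival higher than the plain sum.
def episode_cost(arrivals, targets, late_weight=2.0):
    cost = 0.0
    for arrival, target in zip(arrivals, targets):
        cost += arrival if target is None else abs(arrival - target)
    if all(t is None for t in targets):
        # count the longest arrival time (late_weight - 1) extra times
        cost += (late_weight - 1.0) * max(arrivals)
    return cost

print(episode_cost([10, 20], [None, None]))  # 10 + 20 + 1.0 * 20 = 50.0
print(episode_cost([10, 20], [12, 20]))      # |10-12| + |20-20| = 2.0
```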
 
 
 #### Further Examples (Games)
@@ -251,7 +251,7 @@ An additional refinement proven meaningful for situations where not target time
 
 ### Initialization
 
-Given that we want a generalizable agent to solve the problem, training must be performed on a diverse training set. We therefore need a level generator which can create novel tasks for to be solved in a reliable and fast fashion. 
+Given that we want a generalizable agent to solve the problem, training must be performed on a diverse training set. We therefore need a level generator which can create novel tasks to be solved in a reliable and fast fashion.
 
 
 #### Level Generator
@@ -286,7 +286,7 @@ In this section we define a few simple tasks related to railway traffic that we
 
 #### Simple Navigation
 
-In order to onboard the broad reinforcement learning community this task is intended as an introduction to the Railway@Flatland environment. 
+In order to onboard the broader reinforcement learning community, this task is intended as an introduction to the Railway@Flatland environment.
 
 
 ##### Task
@@ -367,8 +367,8 @@ The separation between rail generator and schedule generator reflects the organi
 Usually, there is a third organisation, which ensures discrimination-free access to the infrastructure for concurrent requests for the infrastructure in a **schedule planning phase**.
 However, in the **Flat**land challenge, we focus on the re-scheduling problem during live operations.
 
-Technically, 
-```python 
+Technically,
+```python
 RailGeneratorProduct = Tuple[GridTransitionMap, Optional[Any]]
 RailGenerator = Callable[[int, int, int, int], RailGeneratorProduct]
 
@@ -383,10 +383,10 @@ def sparse_rail_generator(num_cities=5, num_intersections=4, num_trainstations=2
                           num_neighb=3, grid_mode=False, enhance_intersection=False, seed=1):
 
     def generator(width, height, num_agents, num_resets=0):
-    
+
         # generate the grid and (optionally) some hints for the schedule_generator
         ...
-         
+
         return grid_map, {'agents_hints': {
             'num_agents': num_agents,
             'agent_start_targets_nodes': agent_start_targets_nodes,
@@ -405,7 +405,7 @@ def sparse_schedule_generator(speed_ratio_map: Mapping[float, float] = None) ->
         # - (initial) speed
         # - malfunction
         ...
-                
+
         return agents_position, agents_direction, agents_target, speeds, agents_malfunction
 
     return generator
@@ -424,7 +424,7 @@ The environment's `reset` takes care of applying the two generators:
              ):
         self.rail_generator: RailGenerator = rail_generator
         self.schedule_generator: ScheduleGenerator = schedule_generator
-        
+
     def reset(self, regen_rail=True, replace_agents=True):
         rail, optionals = self.rail_generator(self.width, self.height, self.get_num_agents(), self.num_resets)
 
@@ -440,21 +440,21 @@ The environment's `reset` takes care of applying the two generators:
 
 
 ### RailEnv Speeds
-One of the main contributions to the complexity of railway network operations stems from the fact that all trains travel at different speeds while sharing a very limited railway network. 
+One of the main contributions to the complexity of railway network operations stems from the fact that all trains travel at different speeds while sharing a very limited railway network.
 
-The different speed profiles can be generated using the `schedule_generator`, where you can actually chose as many different speeds as you like. 
-Keep in mind that the *fastest speed* is 1 and all slower speeds must be between 1 and 0. 
+The different speed profiles can be generated using the `schedule_generator`, where you can choose as many different speeds as you like.
+Keep in mind that the *fastest speed* is 1 and all slower speeds must be between 0 and 1.
 For the submission scoring you can assume that there will be no more than 5 speed profiles.
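For example, a speed-ratio mapping passed to the schedule generator could look like this (we assume the map gives the fraction of agents drawn with each speed; the concrete profile values are illustrative, not prescribed):

```python
# Illustrative speed profiles: the fastest speed is 1 and all slower
# speeds lie strictly between 0 and 1. The values are the assumed
# fractions of agents per speed and should sum to 1.
speed_ratio_map = {
    1.0: 0.25,    # fast passenger trains
    0.5: 0.25,    # slower commuter trains
    0.25: 0.5,    # freight trains
}
assert abs(sum(speed_ratio_map.values()) - 1.0) < 1e-9
assert max(speed_ratio_map) == 1.0  # fastest profile is speed 1
```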
 
 
-Currently (as of **Flat**land 2.0), an agent keeps its speed over the whole episode. 
+Currently (as of **Flat**land 2.0), an agent keeps its speed over the whole episode.
 
-Because the different speeds are implemented as fractions the agents ability to perform actions has been updated. 
-We **do not allow actions to change within the cell **. 
-This means that each agent can only chose an action to be taken when entering a cell (ie. positional fraction is 0). 
-There is some real railway specific considerations such as reserved blocks that are similar to this behavior. 
-But more importantly we disabled this to simplify the use of machine learning algorithms with the environment. 
-If we allow stop actions in the middle of cells. then the controller needs to make much more observations and not only at cell changes. 
+Because the different speeds are implemented as fractions, the agents' ability to perform actions has been updated.
+We **do not allow actions to change within a cell**.
+This means that each agent can only choose an action when entering a cell (i.e. when its positional fraction is 0).
+There are real railway-specific considerations, such as reserved blocks, that are similar to this behavior.
+More importantly, we disabled this to simplify the use of machine learning algorithms with the environment.
+If we allowed stop actions in the middle of cells, the controller would need to make many more observations, not only at cell changes.
 (Not set in stone and could be updated if the need arises).
 
 The chosen action is then executed when a step to the next cell is valid. For example
@@ -464,9 +464,9 @@ The chosen action is then executed when a step to the next cell is valid. For ex
     - Agents can make observations at any time step. Make sure to discard observations without any information. See this [example](https://gitlab.aicrowd.com/flatland/baselines/blob/master/torch_training/training_navigation.py) for a simple implementation.
 - The environment checks if agent is allowed to move to next cell only at the time of the switch to the next cell
 
-In your controller, you can check whether an agent requires an action by checking `info`: 
+In your controller, you can check whether an agent requires an action by checking `info`:
 ```python
-obs, rew, done, info = env.step(actions) 
+obs, rew, done, info = env.step(actions)
 ...
 action_dict = dict()
 for a in range(env.get_num_agents()):
@@ -474,17 +474,17 @@ for a in range(env.get_num_agents()):
         action_dict.update({a: ...})
 
 ```
-Notice that `info['action_required'][a]` 
+Notice that `info['action_required'][a]`:
 * if the agent breaks down (see stochasticity below) on entering the cell (no distance elpased in the cell), an action required as long as the agent is broken down;
 when it gets back to work, the action chosen just before will be taken and executed at the end of the cell; you may check whether the agent
 gets healthy again in the next step by checking `info['malfunction'][a] == 1`.
-* when the agent has spent enough time in the cell, the next cell may not be free and the agent has to wait. 
+* when the agent has spent enough time in the cell, the next cell may not be free and the agent has to wait.
 
 
-Since later versions of **Flat**land might have varying speeds during episodes. 
-Therefore, we return the agents' speed - in your controller, you can get the agents' speed from the `info` returned by `step`: 
+Later versions of **Flat**land might have varying speeds during episodes.
+Therefore, we return the agents' speed; in your controller, you can get the agents' speed from the `info` returned by `step`:
 ```python
-obs, rew, done, info = env.step(actions) 
+obs, rew, done, info = env.step(actions)
 ...
 for a in range(env.get_num_agents()):
     speed = info['speed'][a]
@@ -501,7 +501,7 @@ Notice that we do not guarantee that the speed will be computed at each step, bu
 
 ### RailEnv Malfunctioning / Stochasticity
 
-Stochastic events may happen during the episodes. 
+Stochastic events may happen during the episodes.
 This is very common for railway networks where the initial plan usually needs to be rescheduled during operations as minor events such as delayed departure from trainstations, malfunctions on trains or infrastructure or just the weather lead to delayed trains.
 
 We implemted a poisson process to simulate delays by stopping agents at random times for random durations. The parameters necessary for the stochastic events can be provided when creating the environment.
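The malfunction model can be sketched like this (an illustration of the idea only, not Flatland's internal code; names and parameters are assumptions):

```python
# A Poisson process has exponentially distributed gaps between events,
# so we draw the number of steps until the next malfunction from an
# exponential distribution and give each malfunction a random duration.
import random

def next_malfunction(rate_per_step, min_duration, max_duration, rng):
    steps_until_failure = rng.expovariate(rate_per_step)
    duration = rng.randint(min_duration, max_duration)
    return steps_until_failure, duration

rng = random.Random(42)
gap, duration = next_malfunction(1 / 100, 3, 20, rng)
# on average roughly 100 steps between malfunctions,
# each lasting between 3 and 20 steps
```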
@@ -529,12 +529,13 @@ You can introduce stochasticity by simply creating the env as follows:
 env = RailEnv(
     ...
     stochastic_data=stochastic_data,  # Malfunction data generator
-    ...    
+    ...
 )
+env.reset()
 ```
-In your controller, you can check whether an agent is malfunctioning: 
+In your controller, you can check whether an agent is malfunctioning:
 ```python
-obs, rew, done, info = env.step(actions) 
+obs, rew, done, info = env.step(actions)
 ...
 action_dict = dict()
 for a in range(env.get_num_agents()):
@@ -566,11 +567,12 @@ env = RailEnv(width=50,
               number_of_agents=10,
               stochastic_data=stochastic_data,  # Malfunction data generator
               obs_builder_object=tree_observation)
+env.reset()
 ```
 
 
 ### Observation Builders
-Every `RailEnv` has an `obs_builder`. The `obs_builder` has full access to the `RailEnv`. 
+Every `RailEnv` has an `obs_builder`. The `obs_builder` has full access to the `RailEnv`.
 The `obs_builder` is called in the `step()` function to produce the observations.
 
 ```python
@@ -580,8 +582,9 @@ env = RailEnv(
         max_depth=2,
        predictor=ShortestPathPredictorForRailEnv(max_depth=10)
     ),
-    ...                   
+    ...
 )
+env.reset()
 ```
 
 The two principal observation builders provided are global and tree.
@@ -597,7 +600,7 @@ The two principal observation builders provided are global and tree.
   - second channel containing the other agents positions and diretions
   - third channel containing agent malfunctions
   - fourth channel containing agent fractional speeds
-            
+
 #### Tree Observation Builder
 `TreeObsForRailEnv` computes the current observation for each agent.
 
@@ -682,8 +685,8 @@ the branches of an agent's future moves to detect future conflicts.
 
 The general call structure is as follows:
 ```python
-RailEnv.step() 
-               -> ObservationBuilder.get_many() 
+RailEnv.step()
+               -> ObservationBuilder.get_many()
                                                 ->  self.predictor.get()
                                                     self.get()
                                                     self.get()
diff --git a/docs/tutorials/01_gettingstarted.rst b/docs/tutorials/01_gettingstarted.rst
index e3a2f41a..c8187421 100644
--- a/docs/tutorials/01_gettingstarted.rst
+++ b/docs/tutorials/01_gettingstarted.rst
@@ -21,7 +21,8 @@ The basic usage of RailEnv environments consists in creating a RailEnv object
 endowed with a rail generator, that generates new rail networks on each reset,
 and an observation generator object, that is supplied with environment-specific
 information at each time step and provides a suitable observation vector to the
-agents.
+agents. After the RailEnv environment is created, one needs to call reset() on the
+environment in order to fully initialize it.
 
 The simplest rail generators are envs.rail_generators.rail_from_manual_specifications_generator
 and envs.rail_generators.random_rail_generator.
@@ -43,6 +44,7 @@ For example,
                   rail_generator=rail_from_manual_specifications_generator(specs),
                   number_of_agents=1,
                   obs_builder_object=TreeObsForRailEnv(max_depth=2))
+    env.reset()
 
 Alternatively, a random environment can be generated (optionally specifying
 weights for each cell type to increase or decrease their proportion in the
@@ -71,6 +73,7 @@ generated rail networks).
                             ),
                   number_of_agents=3,
                   obs_builder_object=TreeObsForRailEnv(max_depth=2))
+    env.reset()
 
 Environments can be rendered using the utils.rendertools utilities, for example:
 
@@ -147,6 +150,7 @@ Next we configure the difficulty of our task by modifying the complex_rail_gener
                                         max_dist=99999,
                                         seed=1),
                     number_of_agents=5)
+    env.reset()
 
 The difficulty of a railway network depends on the dimensions (`width` x `height`) and the number of agents in the network.
 By varying the number of start and goal connections (nr_start_goal) and the number of extra railway elements added (nr_extra)
diff --git a/docs/tutorials/02_observationbuilder.rst b/docs/tutorials/02_observationbuilder.rst
index 8afcf710..b9ccbcd2 100644
--- a/docs/tutorials/02_observationbuilder.rst
+++ b/docs/tutorials/02_observationbuilder.rst
@@ -45,6 +45,7 @@ We can pass an instance of our custom observation builder :code:`SimpleObs` to t
                   rail_generator=random_rail_generator(),
                   number_of_agents=3,
                   obs_builder_object=SimpleObs())
+    env.reset()
 
 Anytime :code:`env.reset()` or :code:`env.step()` is called, the observation builder will return the custom observation of all agents initialized in the env.
 In the next example we highlight how to derive from existing observation builders and how to access internal variables of **Flatland**.
@@ -121,6 +122,7 @@ Note that this simple strategy fails when multiple agents are present, as each a
                     min_dist=8, max_dist=99999, seed=1),
                   number_of_agents=2,
                   obs_builder_object=SingleAgentNavigationObs())
+    env.reset()
 
     obs, all_rewards, done, _ = env.step({0: 0, 1: 1})
     for i in range(env.get_num_agents()):
@@ -136,6 +138,7 @@ navigation to target, and shows the path taken as an animation.
                   rail_generator=random_rail_generator(),
                   number_of_agents=1,
                   obs_builder_object=SingleAgentNavigationObs())
+    env.reset()
 
     obs, all_rewards, done, _ = env.step({0: 0})
 
@@ -270,6 +273,7 @@ We can then use this new observation builder and the renderer to visualize the o
                   rail_generator=complex_rail_generator(nr_start_goal=5, nr_extra=1, min_dist=8, max_dist=99999, seed=1),
                   number_of_agents=3,
                   obs_builder_object=CustomObsBuilder)
+    env.reset()
 
     obs, info = env.reset()
     env_renderer = RenderTool(env, gl="PILSVG")
diff --git a/docs/tutorials/03_rail_and_schedule_generator.md b/docs/tutorials/03_rail_and_schedule_generator.md
index 5a236a6d..96080e4b 100644
--- a/docs/tutorials/03_rail_and_schedule_generator.md
+++ b/docs/tutorials/03_rail_and_schedule_generator.md
@@ -7,13 +7,13 @@ Let's have a look at the `sparse_rail_generator`.
 ## Sparse Rail Generator
 ![Example_Sparse](https://i.imgur.com/DP8sIyx.png)
 
-The idea behind the sparse rail generator is to mimic classic railway structures where dense nodes (cities) are sparsely connected to each other and where you have to manage traffic flow between the nodes efficiently. 
+The idea behind the sparse rail generator is to mimic classic railway structures where dense nodes (cities) are sparsely connected to each other and where you have to manage traffic flow between the nodes efficiently.
-The cities in this level generator are much simplified in comparison to real city networks but it mimics parts of the problems faced in daily operations of any railway company.
+The cities in this level generator are much simplified in comparison to real city networks, but they mimic parts of the problems faced in the daily operations of any railway company.
 
-There are a few parameters you can tune to build your own map and test different complexity levels of the levels. 
-**Warning** some combinations of parameters do not go well together and will lead to infeasible level generation. 
-In the worst case, the level generator currently issues a warning when it cannot build the environment according to the parameters provided. 
-This will lead to a crash of the whole env. 
+There are a few parameters you can tune to build your own map and to test levels of different complexity.
+**Warning:** some combinations of parameters do not go well together and will lead to infeasible level generation.
+In the worst case, the level generator currently only issues a warning when it cannot build the environment according to the parameters provided.
+This will then lead to a crash of the whole env.
 We are currently working on improvements here and are **happy for any suggestions from your side**.
 
 To build an environment you instantiate a `RailEnv` as follows:
@@ -40,6 +40,8 @@ env = RailEnv(
     number_of_agents=10,
     obs_builder_object=TreeObsForRailEnv(max_depth=3,predictor=shortest_path_predictor)
 )
+# Call reset on the environment
+env.reset()
 ```
 
 You can see that you now need both a `rail_generator` and a `schedule_generator` to generate a level. These need to work nicely together. The `rail_generator` will only generate the railway infrastructure and provide hints to the `schedule_generator` about where to place agents. The `schedule_generator` will then generate a schedule, meaning it places agents at different train stations and gives them tasks by providing individual targets.
diff --git a/docs/tutorials/04_stochasticity.md b/docs/tutorials/04_stochasticity.md
index 201e359b..c118319e 100644
--- a/docs/tutorials/04_stochasticity.md
+++ b/docs/tutorials/04_stochasticity.md
@@ -1,6 +1,6 @@
 # Stochasticity Tutorial
 
-Another area where we improved **Flat**land 2.0 are stochastic events added during the episodes. 
+Another area where we improved **Flat**land 2.0 is stochastic events added during the episodes.
-This is very common for railway networks where the initial plan usually needs to be rescheduled during operations as minor events such as delayed departure from trainstations, malfunctions on trains or infrastructure or just the weather lead to delayed trains.
+This is very common for railway networks, where the initial plan usually needs to be rescheduled during operations as minor events such as delayed departures from train stations, malfunctions of trains or infrastructure, or simply the weather lead to delayed trains.
 
-We implemted a poisson process to simulate delays by stopping agents at random times for random durations. The parameters necessary for the stochastic events can be provided when creating the environment.
+We implemented a Poisson process to simulate delays by stopping agents at random times for random durations. The parameters necessary for the stochastic events can be provided when creating the environment.
@@ -28,12 +28,12 @@ You can introduce stochasticity by simply creating the env as follows:
 env = RailEnv(
     ...
     stochastic_data=stochastic_data,  # Malfunction data generator
-    ...    
+    ...
 )
 ```
-In your controller, you can check whether an agent is malfunctioning: 
+In your controller, you can check whether an agent is malfunctioning:
 ```python
-obs, rew, done, info = env.step(actions) 
+obs, rew, done, info = env.step(actions)
 ...
 action_dict = dict()
 for a in range(env.get_num_agents()):
@@ -65,6 +65,7 @@ env = RailEnv(width=50,
               number_of_agents=10,
               stochastic_data=stochastic_data,  # Malfunction data generator
               obs_builder_object=tree_observation)
+env.reset()
 ```
 
-You will quickly realize that this will lead to unforeseen difficulties which means that **your controller** needs to observe the environment at all times to be able to react to the stochastic events.
+You will quickly realize that this leads to unforeseen difficulties, which means that **your controller** needs to observe the environment at all times to be able to react to the stochastic events.
-- 
GitLab
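The lifecycle change these hunks document (construct the env, then call `reset()` explicitly before stepping) can be sketched as follows. Note that `EnvSketch` is a hypothetical stand-in used only to illustrate the pattern; it is not the real `RailEnv` API from flatland-rl:

```python
class EnvSketch:
    """Hypothetical stand-in mimicking the construct-then-reset lifecycle of RailEnv."""

    def __init__(self):
        # As of this change, the constructor no longer resets the environment.
        self.ready = False

    def reset(self):
        # Explicit reset builds the initial state and returns (obs, info).
        self.ready = True
        return {}, {}

    def step(self, actions):
        # Stepping before reset() is an error under the new lifecycle.
        if not self.ready:
            raise RuntimeError("call reset() before step()")
        return {}, {}, {"__all__": False}, {}


env = EnvSketch()        # construction alone is no longer enough
obs, info = env.reset()  # required before the first step()
obs, rewards, done, info = env.step({0: 0})
```

With the real `RailEnv`, the pattern is the same: create the object, then call `env.reset()` once before the first `env.step()`.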