From 651ce77b8d6441f78a6bb7b74a3c9785a75f4785 Mon Sep 17 00:00:00 2001
From: hagrid67 <jdhwatson@gmail.com>
Date: Mon, 29 Jul 2019 20:30:30 +0100
Subject: [PATCH] made rst http link to Wikipedia Pareto Optimality

---
 docs/getting_start_with_observationbuilder.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/getting_start_with_observationbuilder.rst b/docs/getting_start_with_observationbuilder.rst
index 5f655989..c74e198f 100644
--- a/docs/getting_start_with_observationbuilder.rst
+++ b/docs/getting_start_with_observationbuilder.rst
@@ -58,11 +58,13 @@ Example 2 : Single-agent navigation
 Observation builder objects can of course derive from existing concrete subclasses of ObservationBuilder.
 For example, it may be useful to extend the TreeObsForRailEnv_ observation builder.
-A feature of this class is that on :code:`reset()`, it pre-computes the length of the shortest paths from all
+A feature of this class is that on :code:`reset()`, it pre-computes the lengths of the shortest paths from all
 cells and orientations to the target of each agent, i.e. a distance map for each agent.
+
 In this example we exploit these distance maps by implementing an observation builder that shows the current
 shortest path for each agent as a one-hot observation vector of length 3, whose components represent
 the possible directions an agent can take (LEFT, FORWARD, RIGHT). All values of the observation vector are set to :code:`0` except for the shortest direction where it is set to :code:`1`.
-Using this observation with highly engineered features indicating the agent's shortest path, an agent can then learn to take the corresponding action at each time-step, or we could even hardcode the optimal policy.
-Please note, however, that this simple strategy fails when multiple agents are present, as each agent would only attempt its greedy solution, which is not usually Pareto-optimal in this context.
+
+Using this observation with highly engineered features indicating the agent's shortest path, an agent can then learn to take the corresponding action at each time-step; or we could even hardcode the optimal policy.
+Note that this simple strategy fails when multiple agents are present, as each agent would only attempt its greedy solution, which is not usually `Pareto-optimal <https://en.wikipedia.org/wiki/Pareto_efficiency>`_ in this context.
 
 .. _TreeObsForRailEnv: https://gitlab.aicrowd.com/flatland/flatland/blob/master/flatland/envs/observations.py#L14
--
GitLab
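
For reference, the kind of builder the patched paragraph describes might be sketched roughly as below. This is a minimal sketch, not code taken from the docs or the patch: the class name, the distance-map layout :code:`[agent, row, col, direction]`, and the direction-to-offset table are assumptions about the Flatland API of that era and may need adjusting against the current library::

    import numpy as np

    from flatland.envs.observations import TreeObsForRailEnv


    class SingleAgentNavigationObs(TreeObsForRailEnv):
        """Sketch of the builder described above: a length-3 one-hot vector
        over (LEFT, FORWARD, RIGHT) marking the direction that follows the
        agent's shortest path to its target."""

        # Offsets for the grid directions 0=North, 1=East, 2=South, 3=West.
        DIRECTION_TO_OFFSET = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

        def __init__(self):
            # max_depth=0: we only need the distance map, not the observation tree.
            super().__init__(max_depth=0)

        def reset(self):
            # Let the parent class (re-)compute the per-agent distance map.
            super().reset()

        def get(self, handle):
            agent = self.env.agents[handle]
            # Allowed transitions out of the agent's cell, given its orientation.
            possible_transitions = self.env.rail.get_transitions(*agent.position, agent.direction)

            # For LEFT, FORWARD, RIGHT (relative to the agent), look up the remaining
            # distance to the target from the neighbouring cell in that direction.
            min_distances = []
            for direction in [(agent.direction + offset) % 4 for offset in (-1, 0, 1)]:
                if possible_transitions[direction]:
                    row, col = agent.position
                    d_row, d_col = self.DIRECTION_TO_OFFSET[direction]
                    # Assumed distance-map layout: [agent, row, col, direction].
                    min_distances.append(self.distance_map[handle, row + d_row, col + d_col, direction])
                else:
                    min_distances.append(np.inf)

            observation = np.zeros(3)
            observation[int(np.argmin(min_distances))] = 1  # mark the shortest direction
            return observation

An instance of such a builder would be passed to the environment at construction time (e.g. via the :code:`obs_builder_object` argument of :code:`RailEnv`), so that each call to :code:`step()` returns the one-hot shortest-direction vector per agent.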