# TorchBeast NetHackChallenge Benchmark
This is a baseline model for the NetHack Challenge based on [TorchBeast](https://github.com/facebookresearch/torchbeast) - FAIR's implementation of IMPALA for PyTorch.
It comes with all the code you need to train, run and submit a model that is based on the results published in the original NLE paper.
This implementation runs with 2 GPUs (one for acting and one for learning), and runs many simultaneous environments with dynamic batching.
## Installation
To get this running, all you need to do is follow the TorchBeast installation instructions on the repo page, and then install the packages from `requirements.txt`.
A Dockerfile that installs TorchBeast is also provided.
## Running The Baseline
Once installed, in this directory run:
`python polyhydra.py`
To change parameters, edit `config.yaml`, or to override parameters from the command line, run:
`python polyhydra.py embedding_dim=16`
Training will save checkpoints to a new directory (`outputs`) and, should the environments create any outputs, they will be saved to `nle_data` (by default, recordings of episodes are switched off to save space).
The default polybeast runs on 2 GPUs, one for the learner and one for the actors. However, with only one GPU you can still run polybeast - just override the `actor_device` argument:
`python polyhydra.py actor_device=cpu`
## Making a submission
Take the output directory of your trained model, add the `checkpoint.tar` and `config.yaml` to the git repo. Then change the `SUBMISSION` variable in `rollout.py` at the base of this repository to point to that directory.
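For example, assuming a run wrote its outputs to `outputs/2021-06-17/14-30-00` (a made-up path for illustration), the change in `rollout.py` would look something like:

```python
# rollout.py (at the base of the repository)
# Hypothetical path: point SUBMISSION at the directory holding your trained
# checkpoint.tar and config.yaml.
SUBMISSION = "baselines/torchbeast/outputs/2021-06-17/14-30-00"
```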
After that, tag the submission and push the branch and tag to AIcrowd's GitLab!
## Repo Structure
```
baselines/torchbeast
├── core/
├── models/                 # <- Models HERE
├── util/
├── config.yaml             # <- Flags HERE
├── polybeast_env.py        # <- Training Env HERE
├── polybeast_learner.py    # <- Training Loop HERE
└── polyhydra.py            # <- Entry point HERE
```
The structure is simple, compartmentalising the environment setup, training loop and models into different files. You can tweak any of these separately, and add parameters to the flags (which are passed around).
## About the Model
The model we provide (`BaselineNet`) is simple and lives entirely in `models/baseline.py`.
* It encodes the dungeon into a fixed-size representation (`GlyphEncoder`)
* It encodes the topline message into a fixed-size representation (`MessageEncoder`)
* It encodes the bottom-line statistics (e.g. armour class, health) into a fixed-size representation (`BLStatsEncoder`)
* It concatenates all these outputs into a single fixed-size vector, runs this through a fully connected layer, and then into an LSTM.
* The outputs of the LSTM go through policy and baseline heads (since this is an actor-critic algorithm)
As you can see there is a lot of data to play with in this game, and plenty to try, both in modelling and in the learning algorithms used.
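A minimal sketch of that data flow, with stand-in encoders and made-up dimensions (this is not the actual `BaselineNet` code), looks roughly like this:

```python
import torch
import torch.nn as nn


class SketchNet(nn.Module):
    """Illustrative only: mirrors the overall shape of BaselineNet, not its code."""

    def __init__(self, num_actions, glyph_dim=64, msg_dim=64, blstats_dim=32,
                 hidden_dim=128):
        super().__init__()
        # Stand-ins for GlyphEncoder / MessageEncoder / BLStatsEncoder.
        self.glyph_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(glyph_dim), nn.ReLU())
        self.message_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(msg_dim), nn.ReLU())
        self.blstats_encoder = nn.Sequential(nn.LazyLinear(blstats_dim), nn.ReLU())
        self.fc = nn.Linear(glyph_dim + msg_dim + blstats_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # actor
        self.baseline_head = nn.Linear(hidden_dim, 1)           # critic

    def forward(self, obs, core_state=None):
        # 1. Encode each observation stream into a fixed-size vector.
        g = self.glyph_encoder(obs["glyphs"].float())     # dungeon map
        m = self.message_encoder(obs["message"].float())  # topline message
        b = self.blstats_encoder(obs["blstats"].float())  # bottom-line stats
        # 2. Concatenate, project through a fully connected layer, then an LSTM.
        x = torch.relu(self.fc(torch.cat([g, m, b], dim=-1)))
        core_output, core_state = self.lstm(x.unsqueeze(0), core_state)
        core_output = core_output.squeeze(0)
        # 3. Separate policy and baseline (value) heads: actor-critic.
        return self.policy_head(core_output), self.baseline_head(core_output), core_state
```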
## Improvement Ideas
### Model Improvements (`baseline.py`)
* The model is currently not using the terminal observations (`tty_chars`, `tty_colors`, `tty_cursor`), so it has no idea about menus - could we make use of these somehow?
* The bottom-line stats are very informative, but very simply encoded in `BLStatsEncoder` - is there a better way to do this? (One direction is sketched after this list.)
* The `GlyphEncoder` builds an embedding for the glyphs, and then takes a crop of these centered around the player icon coordinates (`@`). Should the crop be reusing the same embedding matrix?
* The current model constrains the vast action space to a smaller subset of actions. Is it too constrained? Or not constrained enough?
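For the bottom-line stats idea above, one hedged starting point is to normalise the raw stats and pass them through a small MLP; the sizes and the scaling constant below are illustrative, not tuned:

```python
import torch
import torch.nn as nn


class MLPBLStatsEncoder(nn.Module):
    """Sketch of an alternative bottom-line encoder: normalise, then a small MLP."""

    def __init__(self, num_stats=25, hidden_dim=128, out_dim=128):
        super().__init__()
        # Rough per-feature scale so raw stats land in a friendlier range
        # (a single illustrative constant; real per-stat scales would differ).
        self.register_buffer("scale", torch.full((num_stats,), 0.01))
        self.mlp = nn.Sequential(
            nn.Linear(num_stats, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, blstats):
        # blstats: (batch, num_stats) tensor of raw bottom-line statistics.
        return self.mlp(blstats.float() * self.scale)
```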
### Environment Improvements (`polybeast_env.py`)
* Opening menus (such as when spellcasting) do not advance the in-game timer. However, models can also get stuck in menus, as the agent has to learn what buttons to press to close the menu. Can changing the penalty for not advancing the in-game timer improve the result? (A possible starting point is sketched after this list.)
* The NetHackChallenge assesses the score on random character assignments. Might it be easier to learn on just a few of these at the beginning of training?
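For the timer-penalty idea above, one hedged starting point is a small reward-shaping wrapper. The sketch below assumes a Gym-style step API and takes a `time_fn` callable that you supply to read the in-game turn counter from an observation:

```python
import gym


class TimePenaltyWrapper(gym.Wrapper):
    """Sketch: penalise steps that do not advance the in-game clock (e.g. menus)."""

    def __init__(self, env, time_fn, penalty=-0.01):
        super().__init__(env)
        self.time_fn = time_fn  # callable: observation -> in-game turn counter
        self.penalty = penalty
        self._last_time = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._last_time = self.time_fn(obs)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        now = self.time_fn(obs)
        if now == self._last_time:
            reward += self.penalty  # extra cost for "wasted" steps
        self._last_time = now
        return obs, reward, done, info
```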
### Algorithm/Optimisation Improvements (`polybeast_learner.py`)
* Can we add some intrinsic rewards to help our agents learn?
* Should we add penalties to disincentivise pathological behaviour we observe?
* Can we improve the model by using a different optimizer?
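For the optimizer question, a hedged sketch of making the choice configurable (the flag names here are illustrative, not the repo's actual ones):

```python
import torch


def make_optimizer(model, flags):
    # Hypothetical helper: pick the optimizer from a flag instead of hard-coding it.
    # `flags.optimizer` and `flags.learning_rate` are illustrative names.
    if getattr(flags, "optimizer", "rmsprop") == "adam":
        return torch.optim.Adam(model.parameters(), lr=flags.learning_rate)
    return torch.optim.RMSprop(model.parameters(), lr=flags.learning_rate)
```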