# TorchBeast NetHackChallenge Benchmark
This is a baseline model for the NetHack Challenge based on
[TorchBeast](https://github.com/facebookresearch/torchbeast) - FAIR's
implementation of IMPALA for PyTorch.
It comes with all the code you need to train, run and submit a model
that is based on the results published in the original NLE paper.
This implementation can run with 2 GPUs (one for acting and one for
learning), and runs many simultaneous environments with dynamic
batching. Currently it is configured to run with only 1 GPU.
## Installation
**[Native Installation]**
To get this running you'll need to follow the TorchBeast installation instructions for PolyBeast from the [TorchBeast repo](https://github.com/facebookresearch/torchbeast#faster-version-polybeast).
**[Docker Installation]**
You can fast-track the installation of PolyBeast by using the competition's own Dockerfile. Prebuilt images are also hosted on Docker Hub. These commands should open an image that allows you to run the baseline.
**To Run Existing Docker Image**
`docker pull fairnle/challenge:dev`
```docker run -it -v `pwd`:/home/aicrowd --gpus='all' fairnle/challenge:dev```
**To Build Your Own Image**
*Dev Image* - runs as the root user and doesn't copy your files into the image
`docker build -t competition --target nhc-dev .`
*or Submission Image* - runs as the aicrowd user and copies all your files into the image
`docker build -t competition --target nhc-submit .`
*Run Image*
```docker run -it -v `pwd`:/home/aicrowd --gpus='all' competition```
## Running The Baseline
Once installed, in this directory run:
`python polyhydra.py`
To change parameters, edit `config.yaml`, or to override parameters
from the command-line run:
`python polyhydra.py embedding_dim=16`
Training will save checkpoints to a new directory (`outputs`), and any
outputs created by the environments will be saved to `nle_data` (by
default, recordings of episodes are switched off to save space).
The default polybeast run uses 2 GPUs, one for the learner and one for
the actors. However, you can still run polybeast with only one GPU -
just override the `actor_device` argument:
`python polyhydra.py actor_device=cpu`
NOTE: if you get a "Too many open files" error, try: `ulimit -Sn 10000`.
## Making a submission
In the output directory of your trained model, you should find two files: `checkpoint.tar` and `config.yaml`. Add both of them to your submission repo. Then change the `MODEL_DIR` variable in `agents/torchbeast_agent.py` to point to the directory where these files are located. Finally, set `AGENT` in `submission_config.py` to `'TorchBeastAgent'` so that your TorchBeast agent variation is used for the submission.
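Concretely, the two edits might look roughly like the sketch below. This is illustrative only: the path is a placeholder, and the exact shape of `agents/torchbeast_agent.py` and `submission_config.py` in your copy of the starter kit may differ.

```python
# agents/torchbeast_agent.py (sketch): point MODEL_DIR at your trained run,
# i.e. the directory containing checkpoint.tar and config.yaml.
MODEL_DIR = "./saved_models/torchbeast/my_run"  # placeholder path

# submission_config.py (sketch): select the TorchBeast agent for evaluation.
AGENT = "TorchBeastAgent"
```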
After that, follow [these instructions](/docs/SUBMISSION.md) to submit your model to AIcrowd!
## Repo Structure
```
baselines/torchbeast
├── core/
├── models/ # <- Models HERE
├── util/
├── config.yaml # <- Flags HERE
├── polybeast_env.py # <- Training Env HERE
├── polybeast_learner.py # <- Training Loop HERE
└── polyhydra.py # <- main() HERE
```
The structure is simple, compartmentalising the environment setup,
training loop and models into different files. You can tweak any of
these separately, and add parameters to the flags (which are passed
around).
## About the Model
The model we provide (`BaselineNet`) is simple and lives entirely in
`models/baseline.py`.
* It encodes the dungeon into a fixed-size representation
(`GlyphEncoder`)
* It encodes the topline message into a fixed-size representation
(`MessageEncoder`)
* It encodes the bottom-line statistics (e.g. armour class, health) into
a fixed-size representation (`BLStatsEncoder`)
* It concatenates all these outputs into a single fixed-size vector, runs this
through a fully connected layer, and then into an LSTM.
* The outputs of the LSTM go through policy and baseline heads (since
this is an actor-critic algorithm).
As you can see there is a lot of data to play with in this game, and
plenty to try, both in modelling and in the learning algorithms used.
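Schematically, this composition looks roughly like the sketch below. It is a simplified illustration rather than the code in `models/baseline.py`: the three encoder outputs stand in for `GlyphEncoder`, `MessageEncoder` and `BLStatsEncoder`, the dimensions are illustrative, and the reshaping between `(T, B, ...)` and `(T*B, ...)` as well as the episode-boundary masking of the LSTM state are omitted.

```python
import torch
from torch import nn


class TinyBaseline(nn.Module):
    """Sketch of the BaselineNet data flow: encoders -> MLP -> LSTM -> heads."""

    def __init__(self, glyph_dim=512, msg_dim=64, blstats_dim=32, h_dim=256, num_actions=23):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(glyph_dim + msg_dim + blstats_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, h_dim), nn.ReLU(),
        )
        self.core = nn.LSTM(h_dim, h_dim, num_layers=1)  # recurrent core
        self.policy = nn.Linear(h_dim, num_actions)      # actor head
        self.baseline = nn.Linear(h_dim, 1)              # critic head

    def forward(self, glyphs_rep, msg_rep, blstats_rep, core_state):
        # Concatenate the fixed-size encoder outputs and mix them with an MLP.
        st = self.fc(torch.cat([glyphs_rep, msg_rep, blstats_rep], dim=-1))
        # Run one LSTM step (time dimension of 1), then the two heads.
        out, core_state = self.core(st.unsqueeze(0), core_state)
        out = out.squeeze(0)
        return self.policy(out), self.baseline(out), core_state
```

The real `BaselineNet.forward` additionally resets `core_state` to zero wherever `done` is set, so episodes do not leak state into each other.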
## Improvement Ideas
*Here are some ideas we haven't tried yet, but might be easy places to start. Happy tinkering!*
### Model Improvements (`baseline.py`)
* The model is currently not using the terminal observations
(`tty_chars`, `tty_colors`, `tty_cursor`), so it has no idea about
menus - could we make use of these somehow?
* The bottom-line stats are very informative, but very simply encoded
in `BLStatsEncoder` - is there a better way to do this?
* The `GlyphEncoder` builds an embedding for the glyphs, and then takes
a crop of these centered around the player icon (`@`) coordinates.
Should the crop reuse the same embedding matrix?
* The current model constrains the vast action space to a smaller
subset of actions. Is it too constrained? Or not constrained enough?
### Environment Improvements (`polybeast_env.py`)
* Opening menus (such as when spellcasting) does not advance the in-game
timer. However, models can get stuck in menus, since the agent has to
learn which buttons to press to close them. Can changing the penalty for
not advancing the in-game timer improve the result? (An example command
follows this list.)
* The NetHackChallenge assesses the score on random character
assignments. Might it be easier to learn on just a few of these at
the beginning of training?
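For example, the step and time penalties are already exposed as flags in `config.yaml`, so a first experiment could be launched directly from the command line (the values here are purely illustrative):

`python polyhydra.py penalty_step=0.0 penalty_time=-0.0005`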
### Algorithm/Optimisation Improvements (`polybeast_learner.py`)
* Can we add some intrinsic rewards to help our agents learn?
* Should we add penalties to disincentivise pathological behaviour we
observe?
* Can we improve the model by using a different optimizer? (A sketch of
swapping the optimizer follows below.)
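On the last point, the baseline constructs an RMSprop optimizer inside `train()` in `polybeast_learner.py`. A minimal, hedged sketch of making the optimizer switchable is below; the helper name `build_optimizer`, the `use_adam` switch and the Adam betas are hypothetical, while the RMSprop hyperparameters mirror the defaults in `config.yaml`.

```python
import torch
from torch import nn


def build_optimizer(model: nn.Module, learning_rate: float, use_adam: bool = True):
    """Sketch: choose between the baseline's RMSprop and an Adam alternative."""
    if use_adam:
        # Hypothetical alternative; betas are illustrative and could be exposed as flags.
        return torch.optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
    # Mirrors the baseline's settings: alpha=0.99, momentum=0, epsilon=1e-6.
    return torch.optim.RMSprop(
        model.parameters(), lr=learning_rate, alpha=0.99, momentum=0, eps=1e-6
    )
```

In `train()` this would simply replace the existing `torch.optim.RMSprop(...)` call; the LR scheduler and checkpointing code can stay as they are.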
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
defaults:
  - hydra/job_logging: colorlog
  - hydra/hydra_logging: colorlog
  # - hydra/launcher: submitit_slurm
# # To Be Used With hydra submitit_slurm if you have SLURM cluster
# # pip install hydra-core hydra_colorlog
# # can set these on the commandline too, e.g. `hydra.launcher.partition=dev`
# hydra:
#   launcher:
#     timeout_min: 4300
#     cpus_per_task: 20
#     gpus_per_node: 2
#     tasks_per_node: 1
#     mem_gb: 20
#     nodes: 1
#     partition: dev
#     comment: null
#     max_num_timeout: 5  # will requeue on timeout or preemption
name: null # can use this to have multiple runs with same params, eg name=1,2,3,4,5
## WANDB settings
wandb: false # Enable wandb logging.
project: nethack_challenge # The wandb project name.
entity: user1 # The wandb user to log to.
group: group1 # The wandb group for the run.
# POLYBEAST ENV settings
mock: false # Use mock environment instead of NetHack.
single_ttyrec: true # Record ttyrec only for actor 0.
num_seeds: 0 # If larger than 0, samples a fixed number of environment seeds to be used.
write_profiler_trace: false # Collect and write a profiler trace for chrome://tracing/.
fn_penalty_step: constant # Function to accumulate penalty.
penalty_time: 0.0 # Penalty per time step in the episode.
penalty_step: -0.01 # Penalty per step in the episode.
reward_lose: 0 # Reward for losing (dying before finding the staircase).
reward_win: 100 # Reward for winning (finding the staircase).
state_counter: none # Method for counting state visits. Default none.
character: 'mon-hum-neu-mal' # Specification of the NetHack character.
## typical characters we use
# 'mon-hum-neu-mal'
# 'val-dwa-law-fem'
# 'wiz-elf-cha-mal'
# 'tou-hum-neu-fem'
# '@' # random (used in Challenge assessment)
# RUN settings.
mode: train # Training or test mode.
env: challenge # Name of Gym environment to create.
# # env (task) names: challenge, staircase, pet,
# eat, gold, score, scout, oracle
# TRAINING settings.
num_actors: 256 # Number of actors.
total_steps: 1e9 # Total environment steps to train for. Will be cast to int.
batch_size: 32 # Learner batch size.
unroll_length: 80 # The unroll length (time dimension).
num_learner_threads: 1 # Number learner threads.
num_inference_threads: 1 # Number inference threads.
disable_cuda: false # Disable CUDA.
learner_device: cuda:0 # Set learner device.
actor_device: cuda:0 # Set actor device.
# OPTIMIZER settings. (RMS Prop)
learning_rate: 0.0002 # Learning rate.
grad_norm_clipping: 40 # Global gradient norm clip.
alpha: 0.99 # RMSProp smoothing constant.
momentum: 0 # RMSProp momentum.
epsilon: 0.000001 # RMSProp epsilon.
# LOSS settings.
entropy_cost: 0.001 # Entropy cost/multiplier.
baseline_cost: 0.5 # Baseline cost/multiplier.
discounting: 0.999 # Discounting factor.
normalize_reward: true # Normalizes reward by dividing by running stdev from mean.
# MODEL settings.
model: baseline # Name of model to build (see models/__init__.py).
use_lstm: true # Use LSTM in agent model.
hidden_dim: 256 # Size of hidden representations.
embedding_dim: 64 # Size of glyph embeddings.
layers: 5 # Number of ConvNet Layers for Glyph Model
crop_dim: 9 # Size of crop (c x c)
use_index_select: true # Whether to use index_select instead of embedding lookup (for speed reasons).
restrict_action_space: True # Use a restricted ACTION SPACE (only nethack.USEFUL_ACTIONS)
msg:
  hidden_dim: 64 # Hidden dimension for message encoder.
  embedding_dim: 32 # Embedding dimension for characters in message encoder.
# TEST settings.
load_dir: null # Path to load a model from for testing
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import csv
import datetime
import json
import logging
import os
import time
import weakref
def _save_metadata(path, metadata):
metadata["date_save"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
with open(path, "w") as f:
json.dump(metadata, f, indent=4, sort_keys=True)
def gather_metadata():
metadata = dict(
date_start=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
env=os.environ.copy(),
successful=False,
)
# Git metadata.
try:
import git
except ImportError:
logging.warning(
"Couldn't import gitpython module; install it with `pip install gitpython`."
)
else:
try:
repo = git.Repo(search_parent_directories=True)
metadata["git"] = {
"commit": repo.commit().hexsha,
"is_dirty": repo.is_dirty(),
"path": repo.git_dir,
}
if not repo.head.is_detached:
metadata["git"]["branch"] = repo.active_branch.name
except git.InvalidGitRepositoryError:
pass
if "git" not in metadata:
logging.warning("Couldn't determine git data.")
# Slurm metadata.
if "SLURM_JOB_ID" in os.environ:
slurm_env_keys = [k for k in os.environ if k.startswith("SLURM")]
metadata["slurm"] = {}
for k in slurm_env_keys:
d_key = k.replace("SLURM_", "").replace("SLURMD_", "").lower()
metadata["slurm"][d_key] = os.environ[k]
return metadata
class FileWriter:
def __init__(self, xp_args=None, rootdir="~/palaas"):
if rootdir == "~/palaas":
# make unique id in case someone uses the default rootdir
xpid = "{proc}_{unixtime}".format(
proc=os.getpid(), unixtime=int(time.time())
)
rootdir = os.path.join(rootdir, xpid)
self.basepath = os.path.expandvars(os.path.expanduser(rootdir))
self._tick = 0
# metadata gathering
if xp_args is None:
xp_args = {}
self.metadata = gather_metadata()
# we need to copy the args, otherwise when we close the file writer
# (and rewrite the args) we might have non-serializable objects (or
# other nasty stuff).
self.metadata["args"] = copy.deepcopy(xp_args)
formatter = logging.Formatter("%(message)s")
self._logger = logging.getLogger("palaas/out")
# to stdout handler
shandle = logging.StreamHandler()
shandle.setFormatter(formatter)
self._logger.addHandler(shandle)
self._logger.setLevel(logging.INFO)
# to file handler
if not os.path.exists(self.basepath):
self._logger.info("Creating log directory: %s", self.basepath)
os.makedirs(self.basepath, exist_ok=True)
else:
self._logger.info("Found log directory: %s", self.basepath)
self.paths = dict(
msg="{base}/out.log".format(base=self.basepath),
logs="{base}/logs.csv".format(base=self.basepath),
fields="{base}/fields.csv".format(base=self.basepath),
meta="{base}/meta.json".format(base=self.basepath),
)
self._logger.info("Saving arguments to %s", self.paths["meta"])
if os.path.exists(self.paths["meta"]):
self._logger.warning(
"Path to meta file already exists. " "Not overriding meta."
)
else:
self.save_metadata()
self._logger.info("Saving messages to %s", self.paths["msg"])
if os.path.exists(self.paths["msg"]):
self._logger.warning(
"Path to message file already exists. " "New data will be appended."
)
fhandle = logging.FileHandler(self.paths["msg"])
fhandle.setFormatter(formatter)
self._logger.addHandler(fhandle)
self._logger.info("Saving logs data to %s", self.paths["logs"])
self._logger.info("Saving logs' fields to %s", self.paths["fields"])
self.fieldnames = ["_tick", "_time"]
if os.path.exists(self.paths["logs"]):
self._logger.warning(
"Path to log file already exists. " "New data will be appended."
)
# Override default fieldnames.
with open(self.paths["fields"], "r") as csvfile:
reader = csv.reader(csvfile)
lines = list(reader)
if len(lines) > 0:
self.fieldnames = lines[-1]
# Override default tick: use the last tick from the logs file plus 1.
with open(self.paths["logs"], "r") as csvfile:
reader = csv.reader(csvfile)
lines = list(reader)
# Need at least two lines in order to read the last tick:
# the first is the csv header and the second is the first line
# of data.
if len(lines) > 1:
self._tick = int(lines[-1][0]) + 1
self._fieldfile = open(self.paths["fields"], "a")
self._fieldwriter = csv.writer(self._fieldfile)
self._fieldfile.flush()
self._logfile = open(self.paths["logs"], "a")
self._logwriter = csv.DictWriter(self._logfile, fieldnames=self.fieldnames)
# Auto-close (and save) on destruction.
weakref.finalize(self, _save_metadata, self.paths["meta"], self.metadata)
def log(self, to_log, tick=None, verbose=False):
if tick is not None:
raise NotImplementedError
else:
to_log["_tick"] = self._tick
self._tick += 1
to_log["_time"] = time.time()
old_len = len(self.fieldnames)
for k in to_log:
if k not in self.fieldnames:
self.fieldnames.append(k)
if old_len != len(self.fieldnames):
self._fieldwriter.writerow(self.fieldnames)
self._fieldfile.flush()
self._logger.info("Updated log fields: %s", self.fieldnames)
if to_log["_tick"] == 0:
self._logfile.write("# %s\n" % ",".join(self.fieldnames))
if verbose:
self._logger.info(
"LOG | %s",
", ".join(["{}: {}".format(k, to_log[k]) for k in sorted(to_log)]),
)
self._logwriter.writerow(to_log)
self._logfile.flush()
def close(self, successful=True):
self.metadata["successful"] = successful
self.save_metadata()
for f in [self._logfile, self._fieldfile]:
f.close()
def save_metadata(self):
_save_metadata(self.paths["meta"], self.metadata)
# This file taken from
# https://github.com/deepmind/scalable_agent/blob/
# cd66d00914d56c8ba2f0615d9cdeefcb169a8d70/vtrace.py
# and modified.
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Functions to compute V-trace off-policy actor critic targets.
For details and theory see:
"IMPALA: Scalable Distributed Deep-RL with
Importance Weighted Actor-Learner Architectures"
by Espeholt, Soyer, Munos et al.
See https://arxiv.org/abs/1802.01561 for the full paper.
"""
import collections
import torch
import torch.nn.functional as F
VTraceFromLogitsReturns = collections.namedtuple(
"VTraceFromLogitsReturns",
[
"vs",
"pg_advantages",
"log_rhos",
"behavior_action_log_probs",
"target_action_log_probs",
],
)
VTraceReturns = collections.namedtuple("VTraceReturns", "vs pg_advantages")
def action_log_probs(policy_logits, actions):
return -F.nll_loss(
F.log_softmax(torch.flatten(policy_logits, 0, 1), dim=-1),
torch.flatten(actions, 0, 1),
reduction="none",
).view_as(actions)
def from_logits(
behavior_policy_logits,
target_policy_logits,
actions,
discounts,
rewards,
values,
bootstrap_value,
clip_rho_threshold=1.0,
clip_pg_rho_threshold=1.0,
):
"""V-trace for softmax policies."""
target_action_log_probs = action_log_probs(target_policy_logits, actions)
behavior_action_log_probs = action_log_probs(behavior_policy_logits, actions)
log_rhos = target_action_log_probs - behavior_action_log_probs
vtrace_returns = from_importance_weights(
log_rhos=log_rhos,
discounts=discounts,
rewards=rewards,
values=values,
bootstrap_value=bootstrap_value,
clip_rho_threshold=clip_rho_threshold,
clip_pg_rho_threshold=clip_pg_rho_threshold,
)
return VTraceFromLogitsReturns(
log_rhos=log_rhos,
behavior_action_log_probs=behavior_action_log_probs,
target_action_log_probs=target_action_log_probs,
**vtrace_returns._asdict(),
)
@torch.no_grad()
def from_importance_weights(
log_rhos,
discounts,
rewards,
values,
bootstrap_value,
clip_rho_threshold=1.0,
clip_pg_rho_threshold=1.0,
):
"""V-trace from log importance weights."""
with torch.no_grad():
rhos = torch.exp(log_rhos)
if clip_rho_threshold is not None:
clipped_rhos = torch.clamp(rhos, max=clip_rho_threshold)
else:
clipped_rhos = rhos
cs = torch.clamp(rhos, max=1.0)
# Append bootstrapped value to get [v1, ..., v_t+1]
values_t_plus_1 = torch.cat(
[values[1:], torch.unsqueeze(bootstrap_value, 0)], dim=0
)
deltas = clipped_rhos * (rewards + discounts * values_t_plus_1 - values)
acc = torch.zeros_like(bootstrap_value)
result = []
for t in range(discounts.shape[0] - 1, -1, -1):
acc = deltas[t] + discounts[t] * cs[t] * acc
result.append(acc)
result.reverse()
vs_minus_v_xs = torch.stack(result)
# Add V(x_s) to get v_s.
vs = torch.add(vs_minus_v_xs, values)
# Advantage for policy gradient.
vs_t_plus_1 = torch.cat([vs[1:], torch.unsqueeze(bootstrap_value, 0)], dim=0)
if clip_pg_rho_threshold is not None:
clipped_pg_rhos = torch.clamp(rhos, max=clip_pg_rho_threshold)
else:
clipped_pg_rhos = rhos
pg_advantages = clipped_pg_rhos * (rewards + discounts * vs_t_plus_1 - values)
# Make sure no gradients backpropagated through the returned values.
return VTraceReturns(vs=vs, pg_advantages=pg_advantages)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from nle.env import tasks
from nle.env.base import DUNGEON_SHAPE
from .baseline import BaselineNet
from omegaconf import OmegaConf
import torch
ENVS = dict(
staircase=tasks.NetHackStaircase,
score=tasks.NetHackScore,
pet=tasks.NetHackStaircasePet,
oracle=tasks.NetHackOracle,
gold=tasks.NetHackGold,
eat=tasks.NetHackEat,
scout=tasks.NetHackScout,
challenge=tasks.NetHackChallenge,
)
def create_model(flags, device):
model_string = flags.model
if model_string == "baseline":
model_cls = BaselineNet
else:
raise NotImplementedError("model=%s" % model_string)
action_space = ENVS[flags.env](savedir=None, archivefile=None)._actions
model = model_cls(DUNGEON_SHAPE, action_space, flags, device)
model.to(device=device)
return model
def load_model(load_dir, device):
flags = OmegaConf.load(load_dir + "/config.yaml")
flags.checkpoint = load_dir + "/checkpoint.tar"
model = create_model(flags, device)
checkpoint_states = torch.load(flags.checkpoint, map_location=device)
model.load_state_dict(checkpoint_states["model_state_dict"])
return model
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import collections
import torch
from torch import nn
from torch.nn import functional as F
from einops import rearrange
from nle import nethack
from .util import id_pairs_table
import numpy as np
NUM_GLYPHS = nethack.MAX_GLYPH
NUM_FEATURES = 25
PAD_CHAR = 0
NUM_CHARS = 256
def get_action_space_mask(action_space, reduced_action_space):
mask = np.array([int(a in reduced_action_space) for a in action_space])
return torch.Tensor(mask)
def conv_outdim(i_dim, k, padding=0, stride=1, dilation=1):
"""Return the dimension after applying a convolution along one axis"""
return int(1 + (i_dim + 2 * padding - dilation * (k - 1) - 1) / stride)
def select(embedding_layer, x, use_index_select):
"""Use index select instead of default forward to possible speed up embedding."""
if use_index_select:
out = embedding_layer.weight.index_select(0, x.view(-1))
# handle reshaping x to 1-d and output back to N-d
return out.view(x.shape + (-1,))
else:
return embedding_layer(x)
class NetHackNet(nn.Module):
"""This base class simply provides a skeleton for running with torchbeast."""
AgentOutput = collections.namedtuple("AgentOutput", "action policy_logits baseline")
def __init__(self):
super(NetHackNet, self).__init__()
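# Running reward statistics: a running sum, sum of squared deviations (M2) and
# count, used by the learner to normalise rewards by their running standard
# deviation (see update_running_moments / get_running_std below).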
self.register_buffer("reward_sum", torch.zeros(()))
self.register_buffer("reward_m2", torch.zeros(()))
self.register_buffer("reward_count", torch.zeros(()).fill_(1e-8))
def forward(self, inputs, core_state):
raise NotImplementedError
def initial_state(self, batch_size=1):
return ()
@torch.no_grad()
def update_running_moments(self, reward_batch):
"""Maintains a running mean of reward."""
new_count = len(reward_batch)
new_sum = torch.sum(reward_batch)
new_mean = new_sum / new_count
curr_mean = self.reward_sum / self.reward_count
new_m2 = torch.sum((reward_batch - new_mean) ** 2) + (
(self.reward_count * new_count)
/ (self.reward_count + new_count)
* (new_mean - curr_mean) ** 2
)
self.reward_count += new_count
self.reward_sum += new_sum
self.reward_m2 += new_m2
@torch.no_grad()
def get_running_std(self):
"""Returns standard deviation of the running mean of the reward."""
return torch.sqrt(self.reward_m2 / self.reward_count)
class BaselineNet(NetHackNet):
"""This model combines the encodings of the glyphs, top line message and
blstats into a single fixed-size representation, which is then passed to
an LSTM core, before policy and baseline (value) heads produce outputs for an
IMPALA-like architecture.
This model is based on the 'neurips2020release' tag of the NLE repo, itself
based on Küttler et al., 2020:
The NetHack Learning Environment
https://arxiv.org/abs/2006.13760
"""
def __init__(self, observation_shape, action_space, flags, device):
super(BaselineNet, self).__init__()
self.flags = flags
self.observation_shape = observation_shape
self.num_actions = len(action_space)
self.H = observation_shape[0]
self.W = observation_shape[1]
self.use_lstm = flags.use_lstm
self.h_dim = flags.hidden_dim
# GLYPH + CROP MODEL
self.glyph_model = GlyphEncoder(flags, self.H, self.W, flags.crop_dim, device)
# MESSAGING MODEL
self.msg_model = MessageEncoder(
flags.msg.hidden_dim, flags.msg.embedding_dim, device
)
# BLSTATS MODEL
self.blstats_model = BLStatsEncoder(NUM_FEATURES, flags.embedding_dim)
out_dim = (
self.blstats_model.hidden_dim
+ self.glyph_model.hidden_dim
+ self.msg_model.hidden_dim
)
self.fc = nn.Sequential(
nn.Linear(out_dim, self.h_dim),
nn.ReLU(),
nn.Linear(self.h_dim, self.h_dim),
nn.ReLU(),
)
if self.use_lstm:
self.core = nn.LSTM(self.h_dim, self.h_dim, num_layers=1)
self.policy = nn.Linear(self.h_dim, self.num_actions)
self.baseline = nn.Linear(self.h_dim, 1)
if flags.restrict_action_space:
reduced_space = nethack.USEFUL_ACTIONS
logits_mask = get_action_space_mask(action_space, reduced_space)
self.policy_logits_mask = nn.parameter.Parameter(
logits_mask, requires_grad=False
)
def initial_state(self, batch_size=1):
return tuple(
torch.zeros(self.core.num_layers, batch_size, self.core.hidden_size)
for _ in range(2)
)
def forward(self, inputs, core_state, learning=False):
T, B, H, W = inputs["glyphs"].shape
reps = []
# -- [B' x K] ; B' == (T x B)
glyphs_rep = self.glyph_model(inputs)
reps.append(glyphs_rep)
# -- [B' x K]
char_rep = self.msg_model(inputs)
reps.append(char_rep)
# -- [B' x K]
features_emb = self.blstats_model(inputs)
reps.append(features_emb)
# -- [B' x K]
st = torch.cat(reps, dim=1)
# -- [B' x K]
st = self.fc(st)
if self.use_lstm:
core_input = st.view(T, B, -1)
core_output_list = []
notdone = (~inputs["done"]).float()
for input, nd in zip(core_input.unbind(), notdone.unbind()):
# Reset core state to zero whenever an episode ended.
# Make `done` broadcastable with (num_layers, B, hidden_size)
# states:
nd = nd.view(1, -1, 1)
core_state = tuple(nd * t for t in core_state)
output, core_state = self.core(input.unsqueeze(0), core_state)
core_output_list.append(output)
core_output = torch.flatten(torch.cat(core_output_list), 0, 1)
else:
core_output = st
# -- [B' x A]
policy_logits = self.policy(core_output)
# -- [B' x 1]
baseline = self.baseline(core_output)
if self.flags.restrict_action_space:
policy_logits = policy_logits * self.policy_logits_mask + (
(1 - self.policy_logits_mask) * -1e10
)
if self.training:
action = torch.multinomial(F.softmax(policy_logits, dim=1), num_samples=1)
else:
# Don't sample when testing.
action = torch.argmax(policy_logits, dim=1)
policy_logits = policy_logits.view(T, B, -1)
baseline = baseline.view(T, B)
action = action.view(T, B)
output = dict(policy_logits=policy_logits, baseline=baseline, action=action)
return (output, core_state)
class GlyphEncoder(nn.Module):
"""This glyph encoder first breaks the glyphs (integers up to 6000) to a
more structured representation based on the qualities of the glyph: chars,
colors, specials, groups and subgroup ids..
Eg: invisible hell-hound: char (d), color (red), specials (invisible),
group (monster) subgroup id (type of monster)
Eg: lit dungeon floor: char (.), color (white), specials (none),
group (dungeon) subgroup id (type of dungeon)
An embedding is provided for each of these, and the embeddings are
concatenated, before encoding with a number of CNN layers. This operation
is repeated with a crop of the structured reprentations taken around the
characters position, and the two representations are concatenated
before returning.
"""
def __init__(self, flags, rows, cols, crop_dim, device=None):
super(GlyphEncoder, self).__init__()
self.crop = Crop(rows, cols, crop_dim, crop_dim, device)
K = flags.embedding_dim # number of input filters
L = flags.layers # number of convnet layers
assert (
K % 8 == 0
), "This glyph embedding format needs embedding dim to be multiple of 8"
unit = K // 8
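# The K embedding channels are split into 8 "units": chars get 2, colors 1,
# specials 1, glyph groups 1 and glyph ids 3, so the concatenated per-cell
# embedding below has K channels in total.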
self.chars_embedding = nn.Embedding(256, 2 * unit)
self.colors_embedding = nn.Embedding(16, unit)
self.specials_embedding = nn.Embedding(256, unit)
self.id_pairs_table = nn.parameter.Parameter(
torch.from_numpy(id_pairs_table()), requires_grad=False
)
num_groups = self.id_pairs_table.select(1, 1).max().item() + 1
num_ids = self.id_pairs_table.select(1, 0).max().item() + 1
self.groups_embedding = nn.Embedding(num_groups, unit)
self.ids_embedding = nn.Embedding(num_ids, 3 * unit)
F = 3 # filter dimensions
S = 1 # stride
P = 1 # padding
M = 16 # number of intermediate filters
self.output_filters = 8
in_channels = [K] + [M] * (L - 1)
out_channels = [M] * (L - 1) + [self.output_filters]
h, w, c = rows, cols, crop_dim
conv_extract, conv_extract_crop = [], []
for i in range(L):
conv_extract.append(
nn.Conv2d(
in_channels=in_channels[i],
out_channels=out_channels[i],
kernel_size=(F, F),
stride=S,
padding=P,
)
)
conv_extract.append(nn.ELU())
conv_extract_crop.append(
nn.Conv2d(
in_channels=in_channels[i],
out_channels=out_channels[i],
kernel_size=(F, F),
stride=S,
padding=P,
)
)
conv_extract_crop.append(nn.ELU())
# Keep track of output shapes
h = conv_outdim(h, F, P, S)
w = conv_outdim(w, F, P, S)
c = conv_outdim(c, F, P, S)
self.hidden_dim = (h * w + c * c) * self.output_filters
self.extract_representation = nn.Sequential(*conv_extract)
self.extract_crop_representation = nn.Sequential(*conv_extract_crop)
self.select = lambda emb, x: select(emb, x, flags.use_index_select)
def glyphs_to_ids_groups(self, glyphs):
T, B, H, W = glyphs.shape
ids_groups = self.id_pairs_table.index_select(0, glyphs.view(-1).long())
ids = ids_groups.select(1, 0).view(T, B, H, W).long()
groups = ids_groups.select(1, 1).view(T, B, H, W).long()
return [ids, groups]
def forward(self, inputs):
T, B, H, W = inputs["glyphs"].shape
ids, groups = self.glyphs_to_ids_groups(inputs["glyphs"])
glyph_tensors = [
self.select(self.chars_embedding, inputs["chars"].long()),
self.select(self.colors_embedding, inputs["colors"].long()),
self.select(self.specials_embedding, inputs["specials"].long()),
self.select(self.groups_embedding, groups),
self.select(self.ids_embedding, ids),
]
glyphs_emb = torch.cat(glyph_tensors, dim=-1)
glyphs_emb = rearrange(glyphs_emb, "T B H W K -> (T B) K H W")
coordinates = inputs["blstats"].view(T * B, -1).float()[:, :2]
crop_emb = self.crop(glyphs_emb, coordinates)
glyphs_rep = self.extract_representation(glyphs_emb)
glyphs_rep = rearrange(glyphs_rep, "B C H W -> B (C H W)")
assert glyphs_rep.shape[0] == T * B
crop_rep = self.extract_crop_representation(crop_emb)
crop_rep = rearrange(crop_rep, "B C H W -> B (C H W)")
assert crop_rep.shape[0] == T * B
st = torch.cat([glyphs_rep, crop_rep], dim=1)
return st
class MessageEncoder(nn.Module):
"""This model encodes the the topline message into a fixed size representation.
It works by using a learnt embedding for each character before passing the
embeddings through 6 CNN layers.
Inspired by Zhang et al, 2016
Character-level Convolutional Networks for Text Classification
https://arxiv.org/abs/1509.01626
"""
def __init__(self, hidden_dim, embedding_dim, device=None):
super(MessageEncoder, self).__init__()
self.hidden_dim = hidden_dim
self.msg_edim = embedding_dim
self.char_lt = nn.Embedding(NUM_CHARS, self.msg_edim, padding_idx=PAD_CHAR)
self.conv1 = nn.Conv1d(self.msg_edim, self.hidden_dim, kernel_size=7)
self.conv2_6_fc = nn.Sequential(
nn.ReLU(),
nn.MaxPool1d(kernel_size=3, stride=3),
# conv2
nn.Conv1d(self.hidden_dim, self.hidden_dim, kernel_size=7),
nn.ReLU(),
nn.MaxPool1d(kernel_size=3, stride=3),
# conv3
nn.Conv1d(self.hidden_dim, self.hidden_dim, kernel_size=3),
nn.ReLU(),
# conv4
nn.Conv1d(self.hidden_dim, self.hidden_dim, kernel_size=3),
nn.ReLU(),
# conv5
nn.Conv1d(self.hidden_dim, self.hidden_dim, kernel_size=3),
nn.ReLU(),
# conv6
nn.Conv1d(self.hidden_dim, self.hidden_dim, kernel_size=3),
nn.ReLU(),
nn.MaxPool1d(kernel_size=3, stride=3),
# fc receives -- [ B x h_dim x 5 ]
Flatten(),
nn.Linear(5 * self.hidden_dim, 2 * self.hidden_dim),
nn.ReLU(),
nn.Linear(2 * self.hidden_dim, self.hidden_dim),
) # final output -- [ B x h_dim ]
def forward(self, inputs):
T, B, *_ = inputs["message"].shape
messages = inputs["message"].long().view(T * B, -1)
# [ T * B x E x 256 ]
char_emb = self.char_lt(messages).transpose(1, 2)
char_rep = self.conv2_6_fc(self.conv1(char_emb))
return char_rep
class BLStatsEncoder(nn.Module):
"""This model encodes the bottom line stats into a fixed size representation.
It works by simply using two fully-connected layers with ReLU activations.
"""
def __init__(self, num_features, hidden_dim):
super(BLStatsEncoder, self).__init__()
self.num_features = num_features
self.hidden_dim = hidden_dim
self.embed_features = nn.Sequential(
nn.Linear(self.num_features, self.hidden_dim),
nn.ReLU(),
nn.Linear(self.hidden_dim, self.hidden_dim),
nn.ReLU(),
)
def forward(self, inputs):
T, B, *_ = inputs["blstats"].shape
features = inputs["blstats"][:,:, :NUM_FEATURES]
# -- [B' x F]
features = features.view(T * B, -1).float()
# -- [B x K]
features_emb = self.embed_features(features)
assert features_emb.shape[0] == T * B
return features_emb
class Crop(nn.Module):
def __init__(self, height, width, height_target, width_target, device=None):
super(Crop, self).__init__()
self.width = width
self.height = height
self.width_target = width_target
self.height_target = height_target
width_grid = self._step_to_range(2 / (self.width - 1), self.width_target)
self.width_grid = width_grid[None, :].expand(self.height_target, -1)
height_grid = self._step_to_range(2 / (self.height - 1), height_target)
self.height_grid = height_grid[:, None].expand(-1, self.width_target)
if device is not None:
self.width_grid = self.width_grid.to(device)
self.height_grid = self.height_grid.to(device)
def _step_to_range(self, step, num_steps):
return torch.tensor([step * (i - num_steps // 2) for i in range(num_steps)])
def forward(self, inputs, coordinates):
"""Calculates centered crop around given x,y coordinates.
Args:
inputs [B x H x W] or [B x C x H x W]
coordinates [B x 2] x,y coordinates
Returns:
[B x C x H' x W'] inputs cropped and centered around x,y coordinates.
"""
if inputs.dim() == 3:
inputs = inputs.unsqueeze(1).float()
assert inputs.shape[2] == self.height, "expected %d but found %d" % (
self.height,
inputs.shape[2],
)
assert inputs.shape[3] == self.width, "expected %d but found %d" % (
self.width,
inputs.shape[3],
)
x = coordinates[:, 0]
y = coordinates[:, 1]
x_shift = 2 / (self.width - 1) * (x.float() - self.width // 2)
y_shift = 2 / (self.height - 1) * (y.float() - self.height // 2)
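# The shifts are expressed in grid_sample's normalised [-1, 1] coordinate
# system; adding them to the precomputed width/height grids recentres the
# sampling grid on the player's (x, y) coordinates.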
grid = torch.stack(
[
self.width_grid[None, :, :] + x_shift[:, None, None],
self.height_grid[None, :, :] + y_shift[:, None, None],
],
dim=3,
)
crop = torch.round(F.grid_sample(inputs, grid, align_corners=True)).squeeze(1)
return crop
class Flatten(nn.Module):
def forward(self, input):
return input.view(input.size(0), -1)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import enum
import numpy as np
from nle.nethack import * # noqa: F403
# flake8: noqa: F405
# TODO: import this from NLE again
NUM_OBJECTS = 453
MAXEXPCHARS = 9
class GlyphGroup(enum.IntEnum):
# See display.h in NetHack.
MON = 0
PET = 1
INVIS = 2
DETECT = 3
BODY = 4
RIDDEN = 5
OBJ = 6
CMAP = 7
EXPLODE = 8
ZAP = 9
SWALLOW = 10
WARNING = 11
STATUE = 12
def id_pairs_table():
"""Returns a lookup table for glyph -> NLE id pairs."""
table = np.zeros([MAX_GLYPH, 2], dtype=np.int16)
num_nle_ids = 0
for glyph in range(GLYPH_MON_OFF, GLYPH_PET_OFF):
table[glyph] = (glyph, GlyphGroup.MON)
num_nle_ids += 1
for glyph in range(GLYPH_PET_OFF, GLYPH_INVIS_OFF):
table[glyph] = (glyph - GLYPH_PET_OFF, GlyphGroup.PET)
for glyph in range(GLYPH_INVIS_OFF, GLYPH_DETECT_OFF):
table[glyph] = (num_nle_ids, GlyphGroup.INVIS)
num_nle_ids += 1
for glyph in range(GLYPH_DETECT_OFF, GLYPH_BODY_OFF):
table[glyph] = (glyph - GLYPH_DETECT_OFF, GlyphGroup.DETECT)
for glyph in range(GLYPH_BODY_OFF, GLYPH_RIDDEN_OFF):
table[glyph] = (glyph - GLYPH_BODY_OFF, GlyphGroup.BODY)
for glyph in range(GLYPH_RIDDEN_OFF, GLYPH_OBJ_OFF):
table[glyph] = (glyph - GLYPH_RIDDEN_OFF, GlyphGroup.RIDDEN)
for glyph in range(GLYPH_OBJ_OFF, GLYPH_CMAP_OFF):
table[glyph] = (num_nle_ids, GlyphGroup.OBJ)
num_nle_ids += 1
for glyph in range(GLYPH_CMAP_OFF, GLYPH_EXPLODE_OFF):
table[glyph] = (num_nle_ids, GlyphGroup.CMAP)
num_nle_ids += 1
for glyph in range(GLYPH_EXPLODE_OFF, GLYPH_ZAP_OFF):
id_ = num_nle_ids + (glyph - GLYPH_EXPLODE_OFF) // MAXEXPCHARS
table[glyph] = (id_, GlyphGroup.EXPLODE)
num_nle_ids += EXPL_MAX
for glyph in range(GLYPH_ZAP_OFF, GLYPH_SWALLOW_OFF):
id_ = num_nle_ids + (glyph - GLYPH_ZAP_OFF) // 4
table[glyph] = (id_, GlyphGroup.ZAP)
num_nle_ids += NUM_ZAP
for glyph in range(GLYPH_SWALLOW_OFF, GLYPH_WARNING_OFF):
table[glyph] = (num_nle_ids, GlyphGroup.SWALLOW)
num_nle_ids += 1
for glyph in range(GLYPH_WARNING_OFF, GLYPH_STATUE_OFF):
table[glyph] = (num_nle_ids, GlyphGroup.WARNING)
num_nle_ids += 1
for glyph in range(GLYPH_STATUE_OFF, MAX_GLYPH):
table[glyph] = (glyph - GLYPH_STATUE_OFF, GlyphGroup.STATUE)
return table
def id_pairs_func(glyph):
result = glyph_to_mon(glyph)
if result != NO_GLYPH:
return result
if glyph_is_invisible(glyph):
return NUMMONS
if glyph_is_body(glyph):
return glyph - GLYPH_BODY_OFF
offset = NUMMONS + 1
# CORPSE handled by glyph_is_body; STATUE handled by glyph_to_mon.
result = glyph_to_obj(glyph)
if result != NO_GLYPH:
return result + offset
offset += NUM_OBJECTS
# I don't understand glyph_to_cmap and/or the GLYPH_EXPLODE_OFF definition
# with MAXPCHARS - MAXEXPCHARS.
if GLYPH_CMAP_OFF <= glyph < GLYPH_EXPLODE_OFF:
return glyph - GLYPH_CMAP_OFF + offset
offset += MAXPCHARS - MAXEXPCHARS
if GLYPH_EXPLODE_OFF <= glyph < GLYPH_ZAP_OFF:
return (glyph - GLYPH_EXPLODE_OFF) // MAXEXPCHARS + offset
offset += EXPL_MAX
if GLYPH_ZAP_OFF <= glyph < GLYPH_SWALLOW_OFF:
return ((glyph - GLYPH_ZAP_OFF) >> 2) + offset
offset += NUM_ZAP
if GLYPH_SWALLOW_OFF <= glyph < GLYPH_WARNING_OFF:
return offset
offset += 1
result = glyph_to_warning(glyph)
if result != NO_GLYPH:
return result + offset
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import multiprocessing as mp
import logging
import os
import threading
import time
import torch
import libtorchbeast
from models import ENVS
logging.basicConfig(
format=(
"[%(levelname)s:%(process)d %(module)s:%(lineno)d %(asctime)s] " "%(message)s"
),
level=0,
)
# Helper functions for NethackEnv.
def _format_observation(obs):
obs = torch.from_numpy(obs)
return obs.view((1, 1) + obs.shape) # (...) -> (T,B,...).
def create_folders(flags):
# Creates some of the folders that would be created by the filewriter.
logdir = os.path.join(flags.savedir, "archives")
if not os.path.exists(logdir):
logging.info("Creating archive directory: %s" % logdir)
os.makedirs(logdir, exist_ok=True)
else:
logging.info("Found archive directory: %s" % logdir)
def create_env(flags, env_id=0, lock=threading.Lock()):
# commenting out these options for now because they use too much disk space
# archivefile = "nethack.%i.%%(pid)i.%%(time)s.zip" % env_id
# if flags.single_ttyrec and env_id != 0:
# archivefile = None
# logdir = os.path.join(flags.savedir, "archives")
with lock:
env_class = ENVS[flags.env]
kwargs = dict(
savedir=None,
archivefile=None,
character=flags.character,
max_episode_steps=flags.max_num_steps,
observation_keys=(
"glyphs",
"chars",
"colors",
"specials",
"blstats",
"message",
"tty_chars",
"tty_colors",
"tty_cursor",
"inv_glyphs",
"inv_strs",
"inv_letters",
"inv_oclasses",
),
penalty_step=flags.penalty_step,
penalty_time=flags.penalty_time,
penalty_mode=flags.fn_penalty_step,
)
if flags.env in ("staircase", "pet", "oracle"):
kwargs.update(reward_win=flags.reward_win, reward_lose=flags.reward_lose)
elif env_id == 0: # print warning once
print("Ignoring flags.reward_win and flags.reward_lose")
if flags.state_counter != "none":
kwargs.update(state_counter=flags.state_counter)
env = env_class(**kwargs)
if flags.seedspath is not None and len(flags.seedspath) > 0:
raise NotImplementedError("seedspath > 0 not implemented yet.")
return env
def serve(flags, server_address, env_id):
env = lambda: create_env(flags, env_id)
server = libtorchbeast.Server(env, server_address=server_address)
server.run()
def main(flags):
if flags.num_seeds > 0:
raise NotImplementedError("num_seeds > 0 not currently implemented.")
create_folders(flags)
if not flags.pipes_basename.startswith("unix:"):
raise Exception("--pipes_basename has to be of the form unix:/some/path.")
processes = []
for i in range(flags.num_servers):
p = mp.Process(
target=serve, args=(flags, f"{flags.pipes_basename}.{i}", i), daemon=True
)
p.start()
processes.append(p)
try:
# We are only here to listen to the interrupt.
while True:
time.sleep(10)
except KeyboardInterrupt:
pass
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Run with OMP_NUM_THREADS=1.
#
import collections
import logging
import os
import threading
import time
import timeit
import traceback
import wandb
import omegaconf
import nest
import torch
import libtorchbeast
from core import file_writer
from core import vtrace
from models import create_model
from models.baseline import NetHackNet
from torch import nn
from torch.nn import functional as F
logging.basicConfig(
format=(
"[%(levelname)s:%(process)d %(module)s:%(lineno)d %(asctime)s] " "%(message)s"
),
level=0,
)
def compute_baseline_loss(advantages):
return 0.5 * torch.sum(advantages ** 2)
def compute_entropy_loss(logits):
policy = F.softmax(logits, dim=-1)
log_policy = F.log_softmax(logits, dim=-1)
entropy_per_timestep = torch.sum(-policy * log_policy, dim=-1)
return -torch.sum(entropy_per_timestep)
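# Note: this returns the *negative* summed entropy, so adding
# flags.entropy_cost * compute_entropy_loss(...) to the total loss in learn()
# encourages higher-entropy (more exploratory) policies.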
def compute_policy_gradient_loss(logits, actions, advantages):
cross_entropy = F.nll_loss(
F.log_softmax(torch.flatten(logits, 0, 1), dim=-1),
target=torch.flatten(actions, 0, 1),
reduction="none",
)
cross_entropy = cross_entropy.view_as(advantages)
policy_gradient_loss_per_timestep = cross_entropy * advantages.detach()
return torch.sum(policy_gradient_loss_per_timestep)
def inference(
inference_batcher, model, flags, actor_device, lock=threading.Lock()
): # noqa: B008
with torch.no_grad():
for batch in inference_batcher:
batched_env_outputs, agent_state = batch.get_inputs()
observation, reward, done, *_ = batched_env_outputs
# Observation is a dict with keys 'features' and 'glyphs'.
observation["done"] = done
observation, agent_state = nest.map(
lambda t: t.to(actor_device, non_blocking=True),
(observation, agent_state),
)
with lock:
outputs = model(observation, agent_state)
core_outputs, agent_state = nest.map(lambda t: t.cpu(), outputs)
# Restructuring the output in the way that is expected
# by the functions in actorpool.
outputs = (
tuple(
(
core_outputs["action"],
core_outputs["policy_logits"],
core_outputs["baseline"],
)
),
agent_state,
)
batch.set_outputs(outputs)
# TODO(heiner): Given that our nest implementation doesn't support
# namedtuples, using them here doesn't seem like a good fit. We
# probably want to nestify the environment server and deal with
# dictionaries?
EnvOutput = collections.namedtuple(
"EnvOutput", "frame rewards done episode_step episode_return"
)
AgentOutput = NetHackNet.AgentOutput
Batch = collections.namedtuple("Batch", "env agent")
def learn(
learner_queue,
model,
actor_model,
optimizer,
scheduler,
stats,
flags,
plogger,
learner_device,
lock=threading.Lock(), # noqa: B008
):
for tensors in learner_queue:
tensors = nest.map(lambda t: t.to(learner_device), tensors)
batch, initial_agent_state = tensors
env_outputs, actor_outputs = batch
observation, reward, done, *_ = env_outputs
observation["reward"] = reward
observation["done"] = done
lock.acquire() # Only one thread learning at a time.
output, _ = model(observation, initial_agent_state, learning=True)
# Use last baseline value (from the value function) to bootstrap.
learner_outputs = AgentOutput._make(
(output["action"], output["policy_logits"], output["baseline"])
)
# At this point, the environment outputs at time step `t` are the inputs
# that lead to the learner_outputs at time step `t`. After the following
# shifting, the actions in `batch` and `learner_outputs` at time
# step `t` are what lead to the environment outputs at time step `t`.
batch = nest.map(lambda t: t[1:], batch)
learner_outputs = nest.map(lambda t: t[:-1], learner_outputs)
# Turn into namedtuples again.
env_outputs, actor_outputs = batch
# Note that the env_outputs.frame is now a dict with 'features' and 'glyphs'
# instead of actually being the frame itself. This is currently not a problem
# because we never use actor_outputs.frame in the rest of this function.
env_outputs = EnvOutput._make(env_outputs)
actor_outputs = AgentOutput._make(actor_outputs)
learner_outputs = AgentOutput._make(learner_outputs)
rewards = env_outputs.rewards
if flags.normalize_reward:
model.update_running_moments(rewards)
rewards /= model.get_running_std()
total_loss = 0
# STANDARD EXTRINSIC LOSSES / REWARDS
if flags.entropy_cost > 0:
entropy_loss = flags.entropy_cost * compute_entropy_loss(
learner_outputs.policy_logits
)
total_loss += entropy_loss
discounts = (~env_outputs.done).float() * flags.discounting
# This could be in C++. In TF, this is actually slower on the GPU.
vtrace_returns = vtrace.from_logits(
behavior_policy_logits=actor_outputs.policy_logits,
target_policy_logits=learner_outputs.policy_logits,
actions=actor_outputs.action,
discounts=discounts,
rewards=rewards,
values=learner_outputs.baseline,
bootstrap_value=learner_outputs.baseline[-1],
)
# Compute loss as a weighted sum of the baseline loss, the policy
# gradient loss and an entropy regularization term.
pg_loss = compute_policy_gradient_loss(
learner_outputs.policy_logits,
actor_outputs.action,
vtrace_returns.pg_advantages,
)
baseline_loss = flags.baseline_cost * compute_baseline_loss(
vtrace_returns.vs - learner_outputs.baseline
)
total_loss += pg_loss + baseline_loss
# BACKWARD STEP
optimizer.zero_grad()
total_loss.backward()
if flags.grad_norm_clipping > 0:
nn.utils.clip_grad_norm_(model.parameters(), flags.grad_norm_clipping)
optimizer.step()
scheduler.step()
actor_model.load_state_dict(model.state_dict())
# LOGGING
episode_returns = env_outputs.episode_return[env_outputs.done]
stats["step"] = stats.get("step", 0) + flags.unroll_length * flags.batch_size
stats["mean_episode_return"] = torch.mean(episode_returns).item()
stats["mean_episode_step"] = torch.mean(env_outputs.episode_step.float()).item()
stats["total_loss"] = total_loss.item()
stats["pg_loss"] = pg_loss.item()
stats["baseline_loss"] = baseline_loss.item()
if flags.entropy_cost > 0:
stats["entropy_loss"] = entropy_loss.item()
stats["learner_queue_size"] = learner_queue.size()
if not len(episode_returns):
# Hide the mean-of-empty-tuple NaN as it scares people.
stats["mean_episode_return"] = None
# Only logging if at least one episode was finished
if len(episode_returns):
# TODO: log also SPS
plogger.log(stats)
if flags.wandb:
wandb.log(stats, step=stats["step"])
lock.release()
def train(flags):
logging.info("Logging results to %s", flags.savedir)
if isinstance(flags, omegaconf.DictConfig):
flag_dict = omegaconf.OmegaConf.to_container(flags)
else:
flag_dict = vars(flags)
plogger = file_writer.FileWriter(xp_args=flag_dict, rootdir=flags.savedir)
if not flags.disable_cuda and torch.cuda.is_available():
logging.info("Using CUDA.")
learner_device = torch.device(flags.learner_device)
actor_device = torch.device(flags.actor_device)
else:
logging.info("Not using CUDA.")
learner_device = torch.device("cpu")
actor_device = torch.device("cpu")
if flags.max_learner_queue_size is None:
flags.max_learner_queue_size = flags.batch_size
# The queue the learner threads will get their data from.
# Setting `minimum_batch_size == maximum_batch_size`
# makes the batch size static. We could make it dynamic, but that
# requires a loss (and learning rate schedule) that's batch size
# independent.
learner_queue = libtorchbeast.BatchingQueue(
batch_dim=1,
minimum_batch_size=flags.batch_size,
maximum_batch_size=flags.batch_size,
check_inputs=True,
maximum_queue_size=flags.max_learner_queue_size,
)
# The "batcher", a queue for the inference call. Will yield
# "batch" objects with `get_inputs` and `set_outputs` methods.
# The batch size of the tensors will be dynamic.
inference_batcher = libtorchbeast.DynamicBatcher(
batch_dim=1,
minimum_batch_size=1,
maximum_batch_size=512,
timeout_ms=100,
check_outputs=True,
)
addresses = []
connections_per_server = 1
pipe_id = 0
while len(addresses) < flags.num_actors:
for _ in range(connections_per_server):
addresses.append(f"{flags.pipes_basename}.{pipe_id}")
if len(addresses) == flags.num_actors:
break
pipe_id += 1
logging.info("Using model %s", flags.model)
model = create_model(flags, learner_device)
plogger.metadata["model_numel"] = sum(
p.numel() for p in model.parameters() if p.requires_grad
)
logging.info("Number of model parameters: %i", plogger.metadata["model_numel"])
actor_model = create_model(flags, actor_device)
# The ActorPool that will run `flags.num_actors` many loops.
actors = libtorchbeast.ActorPool(
unroll_length=flags.unroll_length,
learner_queue=learner_queue,
inference_batcher=inference_batcher,
env_server_addresses=addresses,
initial_agent_state=model.initial_state(),
)
def run():
try:
actors.run()
except Exception as e:
logging.error("Exception in actorpool thread!")
traceback.print_exc()
print()
raise e
actorpool_thread = threading.Thread(target=run, name="actorpool-thread")
optimizer = torch.optim.RMSprop(
model.parameters(),
lr=flags.learning_rate,
momentum=flags.momentum,
eps=flags.epsilon,
alpha=flags.alpha,
)
def lr_lambda(epoch):
return (
1
- min(epoch * flags.unroll_length * flags.batch_size, flags.total_steps)
/ flags.total_steps
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
stats = {}
if flags.checkpoint and os.path.exists(flags.checkpoint):
logging.info("Loading checkpoint: %s" % flags.checkpoint)
checkpoint_states = torch.load(
flags.checkpoint, map_location=flags.learner_device
)
model.load_state_dict(checkpoint_states["model_state_dict"])
optimizer.load_state_dict(checkpoint_states["optimizer_state_dict"])
scheduler.load_state_dict(checkpoint_states["scheduler_state_dict"])
stats = checkpoint_states["stats"]
logging.info(f"Resuming preempted job, current stats:\n{stats}")
# Initialize actor model like learner model.
actor_model.load_state_dict(model.state_dict())
learner_threads = [
threading.Thread(
target=learn,
name="learner-thread-%i" % i,
args=(
learner_queue,
model,
actor_model,
optimizer,
scheduler,
stats,
flags,
plogger,
learner_device,
),
)
for i in range(flags.num_learner_threads)
]
inference_threads = [
threading.Thread(
target=inference,
name="inference-thread-%i" % i,
args=(inference_batcher, actor_model, flags, actor_device),
)
for i in range(flags.num_inference_threads)
]
actorpool_thread.start()
for t in learner_threads + inference_threads:
t.start()
def checkpoint(checkpoint_path=None):
if flags.checkpoint:
if checkpoint_path is None:
checkpoint_path = flags.checkpoint
logging.info("Saving checkpoint to %s", checkpoint_path)
torch.save(
{
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"scheduler_state_dict": scheduler.state_dict(),
"stats": stats,
"flags": vars(flags),
},
checkpoint_path,
)
def format_value(x):
return f"{x:1.5}" if isinstance(x, float) else str(x)
try:
train_start_time = timeit.default_timer()
train_time_offset = stats.get("train_seconds", 0) # used for resuming training
last_checkpoint_time = timeit.default_timer()
dev_checkpoint_intervals = [0, 0.25, 0.5, 0.75]
loop_start_time = timeit.default_timer()
loop_start_step = stats.get("step", 0)
while True:
if loop_start_step >= flags.total_steps:
break
time.sleep(5)
loop_end_time = timeit.default_timer()
loop_end_step = stats.get("step", 0)
stats["train_seconds"] = round(
loop_end_time - train_start_time + train_time_offset, 1
)
if loop_end_time - last_checkpoint_time > 10 * 60:
# Save every 10 min.
checkpoint()
last_checkpoint_time = loop_end_time
if len(dev_checkpoint_intervals) > 0:
step_percentage = loop_end_step / flags.total_steps
i = dev_checkpoint_intervals[0]
if step_percentage > i:
checkpoint(flags.checkpoint[:-4] + "_" + str(i) + ".tar")
dev_checkpoint_intervals = dev_checkpoint_intervals[1:]
logging.info(
"Step %i @ %.1f SPS. Inference batcher size: %i."
" Learner queue size: %i."
" Other stats: (%s)",
loop_end_step,
(loop_end_step - loop_start_step) / (loop_end_time - loop_start_time),
inference_batcher.size(),
learner_queue.size(),
", ".join(
f"{key} = {format_value(value)}" for key, value in stats.items()
),
)
loop_start_time = loop_end_time
loop_start_step = loop_end_step
except KeyboardInterrupt:
pass # Close properly.
else:
logging.info("Learning finished after %i steps.", stats["step"])
checkpoint()
# Done with learning. Let's stop all the ongoing work.
inference_batcher.close()
learner_queue.close()
actorpool_thread.join()
for t in learner_threads + inference_threads:
t.join()
def test(flags):
test_checkpoint = os.path.join(flags.savedir, "test_checkpoint.tar")
checkpoint = os.path.join(flags.load_dir, "checkpoint.tar")
if not os.path.exists(os.path.dirname(test_checkpoint)):
os.makedirs(os.path.dirname(test_checkpoint))
logging.info("Creating test copy of checkpoint '%s'", checkpoint)
checkpoint = torch.load(checkpoint)
for d in checkpoint["optimizer_state_dict"]["param_groups"]:
d["lr"] = 0.0
d["initial_lr"] = 0.0
checkpoint["scheduler_state_dict"]["last_epoch"] = 0
checkpoint["scheduler_state_dict"]["_step_count"] = 0
checkpoint["scheduler_state_dict"]["base_lrs"] = [0.0]
checkpoint["stats"]["step"] = 0
checkpoint["stats"]["_tick"] = 0
flags.checkpoint = test_checkpoint
flags.learning_rate = 0.0
logging.info("Saving test checkpoint to %s", test_checkpoint)
torch.save(checkpoint, test_checkpoint)
train(flags)
def main(flags):
if flags.wandb:
wandb.init(
project=flags.project,
config=vars(flags),
group=flags.group,
entity=flags.entity,
)
if flags.mode == "train":
if flags.write_profiler_trace:
logging.info("Running with profiler.")
with torch.autograd.profiler.profile() as prof:
train(flags)
filename = "chrome-%s.trace" % time.strftime("%Y%m%d-%H%M%S")
logging.info("Writing profiler trace to '%s.gz'", filename)
prof.export_chrome_trace(filename)
os.system("gzip %s" % filename)
else:
train(flags)
elif flags.mode.startswith("test"):
test(flags)
if flags.wandb:
wandb.finish()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Installation for hydra:
pip install hydra-core hydra_colorlog --upgrade
Runs like polybeast, but uses key=value syntax to set flags:
python polyhydra.py learning_rate=0.001 batch_size=16
Run a hydra sweep (multirun) by adding -m before the overrides:
python polyhydra.py -m learning_rate=0.01,0.001,0.0001,0.00001 momentum=0,0.5
Baseline should run with:
python polyhydra.py
"""
from pathlib import Path
import logging
import os
import multiprocessing as mp
import hydra
import numpy as np
from omegaconf import OmegaConf, DictConfig
import torch
import polybeast_env
import polybeast_learner
if torch.__version__.startswith("1.5") or torch.__version__.startswith("1.6"):
    # pytorch 1.5.* and 1.6.* need this for some reason on the cluster
os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
logging.basicConfig(
format=(
"[%(levelname)s:%(process)d %(module)s:%(lineno)d %(asctime)s] " "%(message)s"
),
level=0,
)
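# Derive a unique unix-socket basename from the hydra run directory so the
# environment servers and the learner communicate over the same pipes.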
def pipes_basename():
logdir = Path(os.getcwd())
name = ".".join([logdir.parents[1].name, logdir.parents[0].name, logdir.name])
return "unix:/tmp/poly.%s" % name
def get_common_flags(flags):
flags = OmegaConf.to_container(flags)
flags["pipes_basename"] = pipes_basename()
flags["savedir"] = os.getcwd()
return OmegaConf.create(flags)
def get_learner_flags(flags):
lrn_flags = OmegaConf.to_container(flags)
lrn_flags["checkpoint"] = os.path.join(flags["savedir"], "checkpoint.tar")
lrn_flags["entropy_cost"] = float(lrn_flags["entropy_cost"])
return OmegaConf.create(lrn_flags)
def run_learner(flags: DictConfig):
polybeast_learner.main(flags)
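# Environment-side flags: one server per actor, plus an episode step cap
# (1e6 by default, 1000 for the simple "staircase" and "pet" tasks).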
def get_environment_flags(flags):
env_flags = OmegaConf.to_container(flags)
env_flags["num_servers"] = flags.num_actors
max_num_steps = 1e6
if flags.env in ("staircase", "pet"):
max_num_steps = 1000
env_flags["max_num_steps"] = int(max_num_steps)
env_flags["seedspath"] = ""
return OmegaConf.create(env_flags)
def run_env(flags):
np.random.seed() # Get new random seed in forked process.
polybeast_env.main(flags)
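# Convenience symlink: points "latest" in the original working directory at the
# current hydra output directory.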
def symlink_latest(savedir, symlink):
try:
if os.path.islink(symlink):
os.remove(symlink)
if not os.path.exists(symlink):
os.symlink(savedir, symlink)
logging.info("Symlinked log directory: %s" % symlink)
except OSError:
# os.remove() or os.symlink() raced. Don't do anything.
pass
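# Hydra entry point: resumes from an existing config.yaml (or one in load_dir) with CLI
# overrides applied, then launches the environment server process and the learner.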
@hydra.main(config_name="config")
def main(flags: DictConfig):
if os.path.exists("config.yaml"):
        # this ignores the local config.yaml and replaces it completely with the saved one
logging.info("loading existing configuration, we're continuing a previous run")
new_flags = OmegaConf.load("config.yaml")
cli_conf = OmegaConf.from_cli()
        # however, you can still override parameters from the CLI
# this is useful e.g. if you did total_steps=N before and want to increase it
flags = OmegaConf.merge(new_flags, cli_conf)
if flags.load_dir and os.path.exists(os.path.join(flags.load_dir, "config.yaml")):
new_flags = OmegaConf.load(os.path.join(flags.load_dir, "config.yaml"))
cli_conf = OmegaConf.from_cli()
flags = OmegaConf.merge(new_flags, cli_conf)
logging.info(flags.pretty(resolve=True))
OmegaConf.save(flags, "config.yaml")
flags = get_common_flags(flags)
# set flags for polybeast_env
env_flags = get_environment_flags(flags)
env_processes = []
for _ in range(1):
p = mp.Process(target=run_env, args=(env_flags,))
p.start()
env_processes.append(p)
symlink_latest(
flags.savedir, os.path.join(hydra.utils.get_original_cwd(), "latest")
)
lrn_flags = get_learner_flags(flags)
run_learner(lrn_flags)
for p in env_processes:
p.kill()
p.join()
print('Training Done!')
if __name__ == "__main__":
main()
notebooks/example_annotated.png (292 KiB)
notebooks/example_standalone.png (262 KiB)
notebooks/model.png (238 KiB)
#!/usr/bin/env python
# This file is the entrypoint for your submission
# You can modify this file to include your code or directly call your functions/modules from here.
import aicrowd_gym
import nle
################################################################
## Ideally you shouldn't need to change this file at all ##
## ##
## This file generates the rollouts, with the specific agent, ##
## batch_size and wrappers specified in submission_config.py  ##
################################################################
from tqdm import tqdm
import numpy as np
from envs.batched_env import BatchedEnv
from envs.wrappers import create_env
from submission_config import SubmissionConfig

# AIcrowd will cut the assessment early during the dev phase
NUM_ASSESSMENTS = 4096


def run_batched_rollout(num_episodes, batched_env, agent):
    """
    This function will generate a series of rollouts in a batched manner.
    """
    num_envs = batched_env.num_envs

    # This part can be left as is
    observations = batched_env.batch_reset()
    rewards = [0.0 for _ in range(num_envs)]
    dones = [False for _ in range(num_envs)]
    infos = [{} for _ in range(num_envs)]

    # We mark at the start of each episode if we are 'counting it'
    active_envs = [i < num_episodes for i in range(num_envs)]
    num_remaining = num_episodes - sum(active_envs)

    episode_count = 0
    pbar = tqdm(total=num_episodes)

    ascension_count = 0
    all_returns = []
    returns = [0.0 for _ in range(num_envs)]
    # The evaluator will automatically stop after the episodes based on the development/test phase
    while episode_count < num_episodes:
        actions = agent.batched_step(observations, rewards, dones, infos)
        observations, rewards, dones, infos = batched_env.batch_step(actions)
        for i, r in enumerate(rewards):
            returns[i] += r

        for done_idx in np.where(dones)[0]:
            if active_envs[done_idx]:
                # We were 'counting' this episode
                all_returns.append(returns[done_idx])
                episode_count += 1
            active_envs[done_idx] = (num_remaining > 0)
            num_remaining -= 1
            ascension_count += int(infos[done_idx]["is_ascended"])
            pbar.update(1)
            returns[done_idx] = 0.0

    pbar.close()
    return ascension_count, all_returns


def main():
    env_make_fn = SubmissionConfig.MAKE_ENV_FN
    num_envs = SubmissionConfig.NUM_ENVIRONMENTS
    Agent = SubmissionConfig.AGENT

    batched_env = BatchedEnv(env_make_fn=env_make_fn, num_envs=num_envs)
    agent = Agent(num_envs, batched_env.num_actions)

    run_batched_rollout(NUM_ASSESSMENTS, batched_env, agent)


if __name__ == "__main__":
    main()
#!/bin/bash
python rollout.py
File added
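# Hydra config for the TorchBeast baseline; loaded by polyhydra.py and saved to each run's output directory.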
name: 5
wandb: true
project: nethack_challenge
entity: nethack
group: baseline
mock: false
single_ttyrec: true
num_seeds: 0
write_profiler_trace: false
fn_penalty_step: constant
penalty_time: 0.0
penalty_step: -0.01
reward_lose: 0
reward_win: 100
state_counter: none
character: '@'
mode: train
env: challenge
num_actors: 256
total_steps: 1000000000.0
batch_size: 32
unroll_length: 80
num_learner_threads: 1
num_inference_threads: 1
disable_cuda: false
learner_device: cuda:1
actor_device: cuda:0
learning_rate: 0.0002
grad_norm_clipping: 40
alpha: 0.99
momentum: 0
epsilon: 1.0e-06
entropy_cost: 0.001
baseline_cost: 0.5
discounting: 0.999
normalize_reward: true
model: baseline
use_lstm: true
hidden_dim: 256
embedding_dim: 64
layers: 5
crop_dim: 9
use_index_select: true
restrict_action_space: true
msg:
hidden_dim: 64
embedding_dim: 32
load_dir: null
File added