nethack issues
https://gitlab.aicrowd.com/groups/nethack/-/issues
2021-07-13T09:51:40Z

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/17
Using more than 2 GPUs with polybeast
2021-07-13T09:51:40Z
Batom

Hi,
First of all, thanks a lot for the instructions and the great competition.
We want to use polybeast on a single machine with more than 2 GPUs, with the goal of getting as much throughput as possible. What is the best way to achieve this (e.g. more learner GPUs), and what would we have to add or change?

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/16
Error running polyhydra
2021-07-12T18:11:02Z
CireNeikual

Hi,
I am trying to get the torchbeast agent to run in order to generate some trajectories that I can use to initialize my own agents. I installed torchbeast as per the README in the torchbeast repository (for polybeast). I was able to compile it, although I did have to modify one of the dependencies (abseil-cpp) to fix an incompatibility with GCC 11 (a missing limits header).
However, now when I try to run the baseline (nethack_baselines/torchbeast/polyhydra.py) I get the following:
```
python3 polyhydra.py
Traceback (most recent call last):
  File "/home/.../neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polyhydra.py", line 40, in <module>
    import polybeast_env
  File "/home/.../neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polybeast_env.py", line 23, in <module>
    import libtorchbeast
  File "/usr/lib/python3.9/site-packages/libtorchbeast-0.0.20-py3.9-linux-x86_64.egg/libtorchbeast/__init__.py", line 18, in <module>
    from ._C import (
ImportError: /usr/lib/python3.9/site-packages/libtorchbeast-0.0.20-py3.9-linux-x86_64.egg/libtorchbeast/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```
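The undefined symbol can be read directly: the `NSt7__cxx1112basic_string` fragment is the Itanium mangling of `std::__cxx11::basic_string`, so the compiled `_C` extension expects a libtorch built with the new libstdc++ string ABI. One possible cause (an assumption, not confirmed in this thread) is that the installed PyTorch was built with the old ABI, or is a different version from the one libtorchbeast was compiled against; `torch.compiled_with_cxx11_abi()` reports which ABI the installed PyTorch uses. A small sketch of the symbol check, with `references_cxx11_abi` as a hypothetical helper name:

```python
def references_cxx11_abi(mangled_symbol: str) -> bool:
    """Return True if a mangled C++ symbol mentions the std::__cxx11 namespace.

    In Itanium name mangling, a namespace component is encoded as its length
    followed by its name, so `__cxx11` (7 characters) appears as `7__cxx11`.
    """
    return "7__cxx11" in mangled_symbol


# Symbol copied from the ImportError above.
missing = (
    "_ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_string"
    "IcSt11char_traitsIcESaIcEEE"
)
print(references_cxx11_abi(missing))  # True
```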
Any ideas? I am using Python 3.9 on Manjaro Linux. Note that I do not have an Nvidia GPU and intend to run inference on the CPU.

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/15
the median score from evaluation server seems to be much lower than local test
2021-09-24T10:13:47Z
Ghost User

@eric_hammy I trained torchbeast_agent from scratch for 1B steps and tested the model using test_submission.py. I did 5 test runs of 512 episodes each, and the resulting median score was **400 +- 25**. Then I submitted exactly the same model and code to the evaluation server, but the computed median score was only **322**, which is much lower than my local test result.
What could make the difference between local and remote test results? The submission ID is `149676`, and config.yaml is as below.
```
name: null
wandb: false
project: nethack_challenge
entity: user1
group: group1
mock: false
single_ttyrec: true
num_seeds: 0
write_profiler_trace: false
fn_penalty_step: constant
penalty_time: 0.0
penalty_step: -0.01
reward_lose: 0
reward_win: 100
state_counter: none
character: '@'
mode: train
env: challenge
num_actors: 256
total_steps: 1000000000.0
batch_size: 32
unroll_length: 80
num_learner_threads: 1
num_inference_threads: 1
disable_cuda: false
learner_device: cuda:1
actor_device: cuda:0
max_learner_queue_size: null
learning_rate: 0.0002
grad_norm_clipping: 40
alpha: 0.99
momentum: 0
epsilon: 1.0e-06
entropy_cost: 0.001
baseline_cost: 0.5
discounting: 0.999
normalize_reward: true
model: baseline
use_lstm: true
hidden_dim: 256
embedding_dim: 64
layers: 5
crop_dim: 9
use_index_select: true
restrict_action_space: true
msg:
  hidden_dim: 64
  embedding_dim: 32
load_dir: null
savedir: ../../NetHackChallenge-v0-random-char
```

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/14
loading pre-trained checkpoint failed
2021-09-24T10:12:17Z
Ghost User

@eric_hammy I set `AGENT = TorchBeastAgent` in submission_config.py, and `MODEL_DIR = "./saved_models/torchbeast/pretrained_0.5B"` in agents/torchbeast_agent.py.
Then I ran `$ python test_submission.py`, but it gave me the following error message (commit id: 3f9ef7f14a):
```
Traceback (most recent call last):
  File "test_submission.py", line 36, in <module>
    evaluate()
  File "test_submission.py", line 25, in evaluate
    agent = Agent(num_envs, batched_env.num_actions)
  File "/data/private/research/AgentLearning/nethack_challenge/agents/torchbeast_agent.py", line 26, in __init__
    self.model = load_model(MODEL_DIR, self.device)
  File "/data/private/research/AgentLearning/nethack_challenge/nethack_baselines/torchbeast/models/__init__.py", line 54, in load_model
    checkpoint_states = torch.load(flags.checkpoint, map_location=device)
  File "/root/anaconda3/envs/nle_challenge/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/root/anaconda3/envs/nle_challenge/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/root/anaconda3/envs/nle_challenge/lib/python3.8/site-packages/omegaconf/basecontainer.py", line 108, in __setstate__
    key_type = d["_metadata"].key_type
KeyError: '_metadata'
```

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/13
baseline doesn't support hydra 1.1
2021-07-05T23:53:37Z
max_reeder

Lots of errors here. The instructions inside [nethack_baselines/torchbeast/polyhydra.py](https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/blob/master/nethack_baselines/torchbeast/polyhydra.py#L17) say to install hydra with `--upgrade`, but that installs **hydra-core==1.1**, and now I am sad. Please consolidate all your instructions in one place, maybe with two conda (or requirements) files, one for the vanilla setup and one for the baseline, so that I am not chasing my tail.
```
Installation for hydra:
pip install hydra-core hydra_colorlog --upgrade
```
Then I get a warning about overrides, so I added this to config.yaml:
```
defaults:
- override hydra/job_logging: colorlog
- override hydra/hydra_logging: colorlog
```
Trying again, I get the error below. Note that `Be aware that cfg.pretty() is now deprecated and you should use OmegaConf.to_yaml(cfg) instead.` (via https://github.com/facebookresearch/hydra/blob/2808e71248ac5a04a1b1a770d3a60f8ec9a38569/NEWS.md#L220)
```
(nethack) 88665a14b754:torchbeast maxreede$ python polyhydra.py
[DEBUG:79067 cmd:814 2021-07-05 16:16:49,799] Popen(['git', 'version'], cwd=/Volumes/workplace/ml/aicrowd/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast, universal_newlines=False, shell=None, istream=None)
[DEBUG:79067 cmd:814 2021-07-05 16:16:49,836] Popen(['git', 'version'], cwd=/Volumes/workplace/ml/aicrowd/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast, universal_newlines=False, shell=None, istream=None)
polyhydra.py:108: UserWarning:
config_path is not specified in @hydra.main().
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_hydra_main_config_path for more information.
@hydra.main(config_name="config")
[DEBUG:79067 utils:252 2021-07-05 16:16:50,090] Setting JobRuntime:name=UNKNOWN_NAME
[DEBUG:79067 utils:252 2021-07-05 16:16:50,091] Setting JobRuntime:name=polyhydra
Error executing job with overrides: []
Traceback (most recent call last):
  File "polyhydra.py", line 123, in main
    logging.info(flags.pretty(resolve=True))
omegaconf.errors.ConfigAttributeError: Key 'pretty' is not in struct
    full_key: pretty
    object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(nethack) 88665a14b754:torchbeast maxreede$
```

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/12
Crash Missing key max_learner_queue_size
2021-06-13T14:42:51Z
christophe_cerisara

Run
```
python polyhydra.py actor_device=cpu
```
and the baseline crashes immediately with error:
```
Traceback (most recent call last):
Server listening on unix:/tmp/poly.outputs.2021-06-13.16-27-24.56
  File "/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polyhydra.py", line 141, in main
    run_learner(lrn_flags)
  File "/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polyhydra.py", line 77, in run_learner
    polybeast_learner.main(flags)
  File "/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polybeast_learner.py", line 515, in main
    train(flags)
  File "/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polybeast_learner.py", line 254, in train
    if flags.max_learner_queue_size is None:
omegaconf.errors.ConfigAttributeError: Missing key max_learner_queue_size
    full_key: max_learner_queue_size
    object_type=dict
```

https://gitlab.aicrowd.com/nethack/neurips-2021-the-nethack-challenge/-/issues/11
crash with hydra colorlog error
2021-07-18T10:07:02Z
christophe_cerisara

I first installed torchbeast as instructed (on a Red Hat Linux machine with one high-end GPU), then pip-installed the requirements from this repo into the same conda environment, and finally ran
```
HYDRA_FULL_ERROR=1 python polyhydra.py actor_device=cpu
```
It crashed with this error:
```
[DEBUG:49231 cmd:817 2021-06-13 15:48:15,820] Popen(['git', 'version'], cwd=/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast, universal_newlines=False, shell=None, istream=None)
/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polyhydra.py:108: UserWarning:
config_path is not specified in @hydra.main().
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_hydra_main_config_path for more information.
@hydra.main(config_name="config")
[DEBUG:49231 utils:252 2021-06-13 15:48:16,605] Setting JobRuntime:name=UNKNOWN_NAME
[DEBUG:49231 utils:252 2021-06-13 15:48:16,606] Setting JobRuntime:name=polyhydra
/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py:389: UserWarning: In config: Invalid overriding of hydra/job_logging:
Default list overrides requires 'override' keyword.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/defaults_list_override for more information.
warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "/gpfsdswork/projects/rech/knb/uyr14tk/home/xtofNLE/neurips-2021-the-nethack-challenge/nethack_baselines/torchbeast/polyhydra.py", line 149, in <module>
    main()
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/main.py", line 49, in decorated_main
    _run_hydra(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 87, in run
    cfg = self.compose_config(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 564, in compose_config
    cfg = self.config_loader.load_configuration(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 146, in load_configuration
    return self._load_configuration_impl(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 239, in _load_configuration_impl
    defaults_list = create_defaults_list(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 719, in create_defaults_list
    defaults, tree = _create_defaults_list(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 689, in _create_defaults_list
    defaults_tree = _create_defaults_tree(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 337, in _create_defaults_tree
    ret = _create_defaults_tree_impl(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 420, in _create_defaults_tree_impl
    return _expand_virtual_root(repo, root, overrides, skip_missing)
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 262, in _expand_virtual_root
    subtree = _create_defaults_tree_impl(
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 476, in _create_defaults_tree_impl
    _update_overrides(defaults_list, overrides, parent, interpolated_subtree)
  File "/gpfswork/rech/knb/uyr14tk/home/.conda/envs/torchbeast/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 367, in _update_overrides
    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: In config: Override 'hydra/job_logging : colorlog' is defined before 'hydra/hydra_logging: colorlog'.
Overrides must be at the end of the defaults list
```
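The final exception above states two constraints that Hydra 1.1 places on a defaults list: entries that override an existing group need the `override` keyword, and overrides must come at the end of the list. A minimal sketch of those two rules in plain Python, operating on a defaults list as it would look after YAML parsing. The helper name `normalize_defaults` and the choice to treat every `hydra/` group as an override are assumptions made for illustration; nothing here is part of Hydra's API.

```python
from typing import Dict, List, Union

DefaultsEntry = Union[str, Dict[str, str]]


def normalize_defaults(defaults: List[DefaultsEntry]) -> List[DefaultsEntry]:
    """Prefix hydra/ group entries with 'override ' and move overrides last.

    Hypothetical helper: it only mirrors the two rules quoted in the error
    message and does not call into Hydra itself.
    """
    regular: List[DefaultsEntry] = []
    overrides: List[DefaultsEntry] = []
    for entry in defaults:
        if isinstance(entry, dict) and len(entry) == 1:
            (group, option), = entry.items()
            if group.startswith("hydra/"):
                # Assumption: hydra/ groups already exist in Hydra's own
                # defaults, so selecting an option for them is an override.
                group = "override " + group
            if group.startswith("override "):
                overrides.append({group: option})
                continue
        # Plain config names and non-override groups keep their order.
        regular.append(entry)
    return regular + overrides


legacy = [{"hydra/job_logging": "colorlog"}, {"hydra/hydra_logging": "colorlog"}]
fixed = normalize_defaults(legacy)
print(fixed)
```

Applied to the two colorlog entries, the result corresponds to the `defaults:` block with `override` entries quoted in the hydra 1.1 issue.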