Median score from the evaluation server is much lower than in local tests
@eric_hammy I trained torchbeast_agent from scratch for 1B steps and tested the model with test_submission.py. I did 5 test runs, each evaluating 512 episodes; the resulting median score was 400 +- 25. Then I submitted exactly the same model and code to the evaluation server, but the computed median score was only 322, well below my local result of 400 +- 25.
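For clarity, this is roughly how I summarize the five runs as "400 +- 25" (a minimal sketch; the `run_medians` values below are hypothetical stand-ins for the per-run medians reported by test_submission.py):

```python
import statistics

# Hypothetical per-run medians, one per 512-episode run.
# The real values come from test_submission.py's output.
run_medians = [375.0, 390.0, 400.0, 410.0, 425.0]

center = statistics.median(run_medians)
spread = (max(run_medians) - min(run_medians)) / 2
print(f"{center:.0f} +- {spread:.0f}")  # 400 +- 25
```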
What could explain the difference between local and remote test results? The submission ID is 149676, and config.yaml is below.
```yaml
name: null
wandb: false
project: nethack_challenge
entity: user1
group: group1
mock: false
single_ttyrec: true
num_seeds: 0
write_profiler_trace: false
fn_penalty_step: constant
penalty_time: 0.0
penalty_step: -0.01
reward_lose: 0
reward_win: 100
state_counter: none
character: '@'
mode: train
env: challenge
num_actors: 256
total_steps: 1000000000.0
batch_size: 32
unroll_length: 80
num_learner_threads: 1
num_inference_threads: 1
disable_cuda: false
learner_device: cuda:1
actor_device: cuda:0
max_learner_queue_size: null
learning_rate: 0.0002
grad_norm_clipping: 40
alpha: 0.99
momentum: 0
epsilon: 1.0e-06
entropy_cost: 0.001
baseline_cost: 0.5
discounting: 0.999
normalize_reward: true
model: baseline
use_lstm: true
hidden_dim: 256
embedding_dim: 64
layers: 5
crop_dim: 9
use_index_select: true
restrict_action_space: true
msg:
  hidden_dim: 64
  embedding_dim: 32
load_dir: null
savedir: ../../NetHackChallenge-v0-random-char
```
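One thing I tried to rule out is pure sampling noise: NetHack episode scores are heavy-tailed, so a 512-episode median wanders between runs. This sketch simulates that (it assumes a log-normal score distribution centered near 400, which is an assumption for illustration, not my agent's real distribution):

```python
import random
import statistics

random.seed(0)

def sample_median(n_episodes=512):
    # One simulated 512-episode evaluation run; log-normal is a stand-in
    # for a heavy-tailed NetHack score distribution (assumption).
    scores = [random.lognormvariate(6.0, 0.8) for _ in range(n_episodes)]
    return statistics.median(scores)

# Repeat the simulated evaluation many times and look at how far the
# per-run median moves from run to run.
medians = [sample_median() for _ in range(100)]
print(f"typical median: {statistics.median(medians):.0f}")
print(f"range over 100 runs: {min(medians):.0f}..{max(medians):.0f}")
```

Under this toy model the run-to-run spread is on the order of a few tens of points, so it is unclear to me whether variance alone can account for a drop from ~400 to 322.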