Median score from the evaluation server is much lower than in local tests
@eric_hammy I trained torchbeast_agent from scratch for 1B steps and tested the model with test_submission.py. I did 5 test runs, each evaluating 512 episodes; the resulting median score was 400 +- 25. Then I submitted exactly the same model and code to the evaluation server, but the computed median score was only 322, well below my local result of 400 +- 25.
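For clarity, this is roughly how I summarize the five runs as "400 +- 25" (a minimal sketch; the `run_medians` values below are hypothetical stand-ins for the per-run medians reported by test_submission.py):

```python
import statistics

# Hypothetical per-run medians, one per 512-episode run.
# The real values come from test_submission.py's output.
run_medians = [375.0, 390.0, 400.0, 410.0, 425.0]

center = statistics.median(run_medians)
spread = (max(run_medians) - min(run_medians)) / 2
print(f"{center:.0f} +- {spread:.0f}")  # 400 +- 25
```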
What could explain the difference between local and remote test results? The submission ID is 149676, and config.yaml is below.
```yaml
name: null
wandb: false
project: nethack_challenge
entity: user1
group: group1
mock: false
single_ttyrec: true
num_seeds: 0
write_profiler_trace: false
fn_penalty_step: constant
penalty_time: 0.0
penalty_step: -0.01
reward_lose: 0
reward_win: 100
state_counter: none
character: '@'
mode: train
env: challenge
num_actors: 256
total_steps: 1000000000.0
batch_size: 32
unroll_length: 80
num_learner_threads: 1
num_inference_threads: 1
disable_cuda: false
learner_device: cuda:1
actor_device: cuda:0
max_learner_queue_size: null
learning_rate: 0.0002
grad_norm_clipping: 40
alpha: 0.99
momentum: 0
epsilon: 1.0e-06
entropy_cost: 0.001
baseline_cost: 0.5
discounting: 0.999
normalize_reward: true
model: baseline
use_lstm: true
hidden_dim: 256
embedding_dim: 64
layers: 5
crop_dim: 9
use_index_select: true
restrict_action_space: true
msg:
  hidden_dim: 64
  embedding_dim: 32
load_dir: null
savedir: ../../NetHackChallenge-v0-random-char
```
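One thing I tried to rule out is pure sampling noise: NetHack episode scores are heavy-tailed, so a 512-episode median wanders between runs. This sketch simulates that (it assumes a log-normal score distribution centered near 400, which is an assumption for illustration, not my agent's real distribution):

```python
import random
import statistics

random.seed(0)

def sample_median(n_episodes=512):
    # One simulated 512-episode evaluation run; log-normal is a stand-in
    # for a heavy-tailed NetHack score distribution (assumption).
    scores = [random.lognormvariate(6.0, 0.8) for _ in range(n_episodes)]
    return statistics.median(scores)

# Repeat the simulated evaluation many times and look at how far the
# per-run median moves from run to run.
medians = [sample_median() for _ in range(100)]
print(f"typical median: {statistics.median(medians):.0f}")
print(f"range over 100 runs: {min(medians):.0f}..{max(medians):.0f}")
```

Under this toy model the run-to-run spread is on the order of a few tens of points, so it is unclear to me whether variance alone can account for a drop from ~400 to 322.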