Wrong rewards after done __all__
If the environment terminates because the maximum number of time steps is reached and we keep calling env.step(), the global reward is returned to all agents as if they had finished.
This behavior is not intended. We should remove it and update the reward handling accordingly. Ideally, we would refactor how the environment terminates when the maximum number of time steps is reached.
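
To make the expected behavior concrete, here is a minimal sketch of a toy environment that separates "all agents finished" from "max time steps reached". It is not the real environment's API: `ToyEnv`, its constructor parameters, the reward values, and the meaning of action `1` are all hypothetical. Returning zero reward on calls made after termination is also an assumption about what the fix should do.

```python
from typing import Dict, Tuple, Union

Key = Union[int, str]


class ToyEnv:
    """Hypothetical stand-in for the real environment, for illustration only."""

    def __init__(self, n_agents: int, max_steps: int,
                 global_reward: float = 1.0, step_penalty: float = -1.0) -> None:
        self.n_agents = n_agents
        self.max_steps = max_steps
        self.global_reward = global_reward  # assumed value
        self.step_penalty = step_penalty    # assumed value
        self.elapsed = 0
        self.finished = [False] * n_agents

    def _dones(self, episode_over: bool) -> Dict[Key, bool]:
        dones: Dict[Key, bool] = {
            a: self.finished[a] or episode_over for a in range(self.n_agents)
        }
        dones["__all__"] = episode_over
        return dones

    def step(self, actions: Dict[int, int]) -> Tuple[Dict[int, float], Dict[Key, bool]]:
        timed_out = self.elapsed >= self.max_steps
        if timed_out or all(self.finished):
            # Episode is already over: further calls yield no reward (assumption).
            return {a: 0.0 for a in range(self.n_agents)}, self._dones(True)

        self.elapsed += 1
        for agent, action in actions.items():
            if action == 1:  # toy convention: action 1 means "reach the goal"
                self.finished[agent] = True

        if all(self.finished):
            # Only a genuine finish by every agent earns the global reward.
            rewards = {a: self.global_reward for a in range(self.n_agents)}
            return rewards, self._dones(True)

        # Regular step, including the one that hits max_steps: no global reward.
        rewards = {
            a: 0.0 if self.finished[a] else self.step_penalty
            for a in range(self.n_agents)
        }
        return rewards, self._dones(self.elapsed >= self.max_steps)
```

In this sketch, a timeout sets `__all__` in the done dictionary but never pays out `global_reward`, and any `step()` call made after the episode has ended returns zero reward for every agent.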