Group scores in the evaluation service by test group
At the moment, the evaluation service simply takes the mean of rewards across all episodes. We want to compute the final score by grouping the envs by test group instead: first take the mean score within each test group, then take the mean of those per-group means. This way every test group contributes equally to the final score, regardless of how many episodes it contains.
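
A rough sketch of the intended computation, to make the difference from the current flat mean concrete. The function names and the `(test_group, reward)` pair shape are assumptions made for illustration, not the evaluation service's actual data model:

```python
from collections import defaultdict
from statistics import mean


def flat_mean_score(episodes):
    """Current behaviour: mean of rewards across all episodes."""
    return mean(reward for _, reward in episodes)


def grouped_mean_score(episodes):
    """Proposed behaviour: mean of the per-test-group mean rewards.

    `episodes` is an iterable of (test_group, reward) pairs. This shape is
    hypothetical; the real service may carry the test group elsewhere.
    """
    rewards_by_group = defaultdict(list)
    for test_group, reward in episodes:
        rewards_by_group[test_group].append(reward)
    if not rewards_by_group:
        raise ValueError("no episodes to score")

    # Mean within each test group, then mean across the groups.
    group_means = {g: mean(rs) for g, rs in rewards_by_group.items()}
    return mean(list(group_means.values())), group_means


# Three episodes in group "a", one in group "b".
episodes = [("a", 1.0), ("a", 1.0), ("a", 1.0), ("b", 0.0)]
final_score, per_group = grouped_mean_score(episodes)
print(flat_mean_score(episodes))  # 0.75
print(final_score)                # 0.5
```

With the flat mean, group "a" dominates because it has more episodes. With grouping, both test groups carry the same weight.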