0bba18c5
Add several improvements to actor-critic agents.
John Aslanides authored
- Use a simple shared sequence buffer for building the trajectories on which we compute TD(lambda); this should make the agent code more readable (see the sketch after this list).
- Use dynamic rather than fixed/static unroll in actor_critic_rnn, allowing us to learn from sequences of unknown length.
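
A minimal sketch of the kind of shared sequence buffer and TD(lambda) target computation described above; all class, method, and parameter names here are hypothetical and illustrative, not bsuite's actual implementation:

```python
# Hypothetical sketch only; names and signatures are illustrative,
# not the actual bsuite API.
import numpy as np


class SequenceBuffer:
  """Accumulates transitions for one trajectory at a time."""

  def __init__(self, max_sequence_length: int):
    self._max_sequence_length = max_sequence_length
    self.reset()

  def reset(self):
    self._observations, self._actions = [], []
    self._rewards, self._discounts = [], []

  def append(self, observation, action, reward, discount):
    self._observations.append(observation)
    self._actions.append(action)
    self._rewards.append(reward)
    self._discounts.append(discount)

  def full(self) -> bool:
    return len(self._rewards) >= self._max_sequence_length

  def drain(self):
    """Returns the buffered trajectory as arrays and empties the buffer."""
    trajectory = (np.stack(self._observations),
                  np.asarray(self._actions),
                  np.asarray(self._rewards, dtype=np.float32),
                  np.asarray(self._discounts, dtype=np.float32))
    self.reset()
    return trajectory


def lambda_returns(rewards, discounts, values, bootstrap_value, lambda_=0.9):
  """Computes TD(lambda) targets backwards over a (possibly short) sequence."""
  returns = np.zeros_like(rewards)
  next_return = next_value = bootstrap_value
  for t in reversed(range(len(rewards))):
    # G_t = r_t + gamma_t * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1})
    returns[t] = rewards[t] + discounts[t] * (
        (1. - lambda_) * next_value + lambda_ * next_return)
    next_return, next_value = returns[t], values[t]
  return returns
```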
    
    This second point introduces a slight change in the learning algorithm for actor_critic_rnn:
    - Previously: concatenate transitions until we have a sequence of length `sequence_length`, possibly spanning episodes, and use a mask to reset the RNN state.
- Now: allow sequences to have dynamic length and truncate them at the episode boundary (see the loop sketch below).
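
A sketch of that truncation logic under the dm_env interface, reusing the SequenceBuffer sketched above; `env` is any dm_env.Environment, and the agent methods (`select_action`, `update`) are hypothetical placeholders rather than the actual agent API:

```python
# Hypothetical interaction loop: drain whenever the buffer is full *or* the
# episode terminates, so a sequence never spans an episode boundary.
buffer = SequenceBuffer(max_sequence_length=32)
timestep = env.reset()
while True:
  action = agent.select_action(timestep)  # hypothetical agent method
  next_timestep = env.step(action)
  buffer.append(timestep.observation, action,
                next_timestep.reward, next_timestep.discount)
  if buffer.full() or next_timestep.last():
    # Sequences may be shorter than max_sequence_length here; the learner
    # uses dynamic unroll, so no padding or state-reset masking is needed.
    agent.update(*buffer.drain())  # hypothetical agent method
  timestep = env.reset() if next_timestep.last() else next_timestep
```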
    
For episodes shorter than `max_sequence_length`, this results in a smaller effective batch size and therefore noisier gradients; I have reduced the learning rate to compensate.
    
    This change also includes some minor maintenance/gardening:
    - Modernise baselines/utils (remove Python 2 support).
    - Import dm_env.specs directly in all agents.
    
    PiperOrigin-RevId: 304057738
    Change-Id: If559ab6467ecd1a4094d1c1eceb1d969aaf413b2