John Aslanides authored
- Use a simple shared sequence buffer for building the trajectories on which TD(lambda) is computed. This should make the agent code more readable.
- Use a dynamic rather than fixed/static unroll in actor_critic_rnn, allowing us to learn from sequences of unknown length.

This second point introduces a slight change in the learning algorithm for actor_critic_rnn:
- Previously: concatenate transitions until we have a sequence of length `sequence_length`, possibly spanning episodes, and use a mask to reset the RNN state.
- Now: allow sequences to have dynamic length, and truncate them at the episode boundary. For episodes shorter than `max_sequence_length` this gives a smaller effective batch size and hence noisier gradients; I have reduced the learning rate to compensate.

This change also includes some minor maintenance/gardening:
- Modernise baselines/utils (remove Python 2 support).
- Import dm_env.specs directly in all agents.

PiperOrigin-RevId: 304057738
Change-Id: If559ab6467ecd1a4094d1c1eceb1d969aaf413b2
Commit 0bba18c5
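A minimal Python sketch of the two ideas in this commit message follows. It is an illustration under stated assumptions, not the code from this change. It shows a sequence buffer that flushes either when it holds `max_sequence_length` transitions or when an episode ends, so no sequence spans an episode boundary, and a backward recursion for TD(lambda) returns over one such sequence. The names `SequenceBuffer`, `Trajectory`, `td_lambda_returns` and the `is_last` flag are hypothetical. The bootstrap values are assumed to come from the critic.

```python
from typing import List, NamedTuple

import numpy as np


class Trajectory(NamedTuple):
  """Hypothetical container for one sequence of length T."""
  observations: np.ndarray  # [T + 1, ...]: includes the final next-observation.
  actions: np.ndarray       # [T]
  rewards: np.ndarray       # [T]
  discounts: np.ndarray     # [T]


class SequenceBuffer:
  """Hypothetical sketch: collects transitions into variable-length sequences.

  A sequence is complete when it reaches max_sequence_length or when the
  episode ends, so sequences are truncated at episode boundaries.
  """

  def __init__(self, max_sequence_length: int):
    self._max_sequence_length = max_sequence_length
    self._reset()

  def _reset(self) -> None:
    self._observations: List[np.ndarray] = []
    self._actions: List[int] = []
    self._rewards: List[float] = []
    self._discounts: List[float] = []
    self._episode_ended = False

  def append(self, observation, action, reward, discount, next_observation,
             is_last: bool) -> None:
    """Adds one transition; is_last marks the final step of an episode."""
    if not self._observations:
      self._observations.append(observation)  # First observation of sequence.
    self._actions.append(action)
    self._rewards.append(reward)
    self._discounts.append(discount)
    self._observations.append(next_observation)
    self._episode_ended = is_last

  def full(self) -> bool:
    return (len(self._actions) == self._max_sequence_length
            or self._episode_ended)

  def drain(self) -> Trajectory:
    """Returns the current sequence and starts a new one."""
    trajectory = Trajectory(
        observations=np.stack(self._observations),
        actions=np.asarray(self._actions),
        rewards=np.asarray(self._rewards, dtype=np.float32),
        discounts=np.asarray(self._discounts, dtype=np.float32),
    )
    last_observation = self._observations[-1]
    episode_ended = self._episode_ended
    self._reset()
    if not episode_ended:
      # The next sequence continues the same episode from the last observation.
      self._observations.append(last_observation)
    return trajectory


def td_lambda_returns(rewards: np.ndarray, discounts: np.ndarray,
                      values: np.ndarray, lambda_: float) -> np.ndarray:
  """Computes lambda-returns backwards; values[T] is the bootstrap value.

  G_t = r_t + gamma_t * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1}),
  starting from G_T = V(s_T).
  """
  num_steps = len(rewards)
  if len(values) != num_steps + 1:
    raise ValueError('values must have length len(rewards) + 1')
  returns = np.zeros(num_steps, dtype=np.float64)
  g = values[num_steps]
  for t in reversed(range(num_steps)):
    g = rewards[t] + discounts[t] * ((1 - lambda_) * values[t + 1] + lambda_ * g)
    returns[t] = g
  return returns
```

If termination is signalled with a zero discount, as `dm_env.termination` does, then a sequence that ends an episode does not bootstrap past the final step. A sequence cut at `max_sequence_length` bootstraps from `values[-1]`. Sequences truncated early by an episode end are simply shorter, which is where the smaller effective batch size mentioned above comes from.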