0bba18c5
Add several improvements to actor-critic agents.
John Aslanides authored
- Use a simple shared sequence buffer for building the trajectories on which we compute TD(lambda); this should make the agent code more readable (see the sketch after this list).
- Use dynamic rather than fixed/static unroll in actor_critic_rnn, allowing us to learn from sequences of unknown length.
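
A minimal sketch of the kind of shared sequence buffer and TD(lambda) target computation described above; all class, method, and parameter names here are hypothetical and illustrative, not bsuite's actual implementation:

```python
# Hypothetical sketch only; names and signatures are illustrative,
# not the actual bsuite API.
import numpy as np


class SequenceBuffer:
  """Accumulates transitions for one trajectory at a time."""

  def __init__(self, max_sequence_length: int):
    self._max_sequence_length = max_sequence_length
    self.reset()

  def reset(self):
    self._observations, self._actions = [], []
    self._rewards, self._discounts = [], []

  def append(self, observation, action, reward, discount):
    self._observations.append(observation)
    self._actions.append(action)
    self._rewards.append(reward)
    self._discounts.append(discount)

  def full(self) -> bool:
    return len(self._rewards) >= self._max_sequence_length

  def drain(self):
    """Returns the buffered trajectory as arrays and empties the buffer."""
    trajectory = (np.stack(self._observations),
                  np.asarray(self._actions),
                  np.asarray(self._rewards, dtype=np.float32),
                  np.asarray(self._discounts, dtype=np.float32))
    self.reset()
    return trajectory


def lambda_returns(rewards, discounts, values, bootstrap_value, lambda_=0.9):
  """Computes TD(lambda) targets backwards over a (possibly short) sequence."""
  returns = np.zeros_like(rewards)
  next_return = next_value = bootstrap_value
  for t in reversed(range(len(rewards))):
    # G_t = r_t + gamma_t * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1})
    returns[t] = rewards[t] + discounts[t] * (
        (1. - lambda_) * next_value + lambda_ * next_return)
    next_return, next_value = returns[t], values[t]
  return returns
```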
    
    This second point introduces a slight change in the learning algorithm for actor_critic_rnn:
    - Previously: concatenate transitions until we have a sequence of length `sequence_length`, possibly spanning episodes, and use a mask to reset the RNN state.
- Now: allow sequences to have dynamic length and truncate them at the episode boundary (see the loop sketch below).
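
A sketch of that truncation logic under the dm_env interface, reusing the SequenceBuffer sketched above; `env` is any dm_env.Environment, and the agent methods (`select_action`, `update`) are hypothetical placeholders rather than the actual agent API:

```python
# Hypothetical interaction loop: drain whenever the buffer is full *or* the
# episode terminates, so a sequence never spans an episode boundary.
buffer = SequenceBuffer(max_sequence_length=32)
timestep = env.reset()
while True:
  action = agent.select_action(timestep)  # hypothetical agent method
  next_timestep = env.step(action)
  buffer.append(timestep.observation, action,
                next_timestep.reward, next_timestep.discount)
  if buffer.full() or next_timestep.last():
    # Sequences may be shorter than max_sequence_length here; the learner
    # uses dynamic unroll, so no padding or state-reset masking is needed.
    agent.update(*buffer.drain())  # hypothetical agent method
  timestep = env.reset() if next_timestep.last() else next_timestep
```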
    
For episodes shorter than `max_sequence_length`, this results in a smaller effective batch size and therefore noisier gradients; I have reduced the learning rate to compensate.
    
    This change also includes some minor maintenance/gardening:
    - Modernise baselines/utils (remove Python 2 support).
    - Import dm_env.specs directly in all agents.
    
    PiperOrigin-RevId: 304057738
    Change-Id: If559ab6467ecd1a4094d1c1eceb1d969aaf413b2