...

  1. Policy Optimization
  2. Q-Learning

| Policy Optimization | Q-Learning |
| --- | --- |
| Optimizes the policy parameters directly, by gradient ascent on the performance objective, or indirectly, by maximizing local approximations of it | Learns an approximator for the optimal action-value function |
| Performed on-policy: each update only uses data collected while acting according to the most recent version of the policy | Performed off-policy: each update can use data collected at any point during training |
| Directly optimizes for the thing you want (agent performance) | Only indirectly optimizes for agent performance |
| Tends to be more stable and reliable | Tends to be less stable |
| Less sample efficient: the learning data available at each iteration is limited, so training takes longer | Substantially more sample efficient when it does work, because it can reuse data more effectively |
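
In the usual textbook notation (a sketch, not taken from this page: theta are the policy parameters, alpha the step size, gamma the discount factor, R(tau) the return of trajectory tau), the two approaches boil down to:

```latex
% Policy optimization: gradient ascent on the performance objective J
\theta_{k+1} = \theta_k + \alpha \,\nabla_\theta J(\pi_\theta)\big|_{\theta_k},
\qquad
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]

% Q-learning: push Q toward the Bellman optimality target
Q^*(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big]
```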



  • Value-based methods
    • (Q-Learning, Deep Q-Learning): learn a value function that maps each state-action pair to a value.
    • Find the best action to take for each state: the action with the biggest value.
    • Work well when there is a finite set of actions (a minimal tabular sketch follows this list).
  • Policy-based methods
    • REINFORCE with Policy Gradients
    • We directly optimize the policy without using a value function.
    • Useful when the action space is continuous, or when we want a stochastic policy.
    • Use the total reward of the episode.
    • The problem is finding a good score function to compute how good a policy is (see the REINFORCE sketch after this list).
  • Hybrid Method
    • Actor-Critic Method
      • Policy Learning + Value Learning
      • Policy Function → Actor: chooses which action to take
      • Value Function → Critic: judges how well the agent is performing
      • We make an update at each step (TD Learning).
      • Because we update at every time step, we can’t use the total reward R(t).
      • Both learn in parallel, like GANs.
      • Not stable by itself, but several stable variations exist (see the actor-critic sketch after this list).
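
Below is a minimal tabular Q-learning sketch of the value-based update described above. The toy chain environment, reward, and hyperparameters are illustrative assumptions, not anything specified on this page; only the epsilon-greedy action choice and the max-bootstrap update are the algorithm itself.

```python
import random

# Toy chain environment (an assumption for illustration): 5 states in a row,
# action 1 moves right, action 0 moves left; reaching the last state pays +1.
N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # made-up hyperparameters

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

# Q-table: one value per (state, action) pair
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability EPSILON, otherwise exploit
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Off-policy Q-learning update: bootstrap from the best next action
        target = reward + GAMMA * max(Q[next_state]) * (not done)
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# The greedy policy should now move right (action 1) in every non-terminal state
print([max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)])
```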
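A matching REINFORCE sketch: a tabular softmax policy updated by Monte Carlo policy gradients using the total discounted return of the episode. The environment and step sizes are the same illustrative toy as above, not anything prescribed by this page.

```python
import numpy as np

# Same toy chain as in the Q-learning sketch (an illustrative assumption)
N_STATES, N_ACTIONS = 5, 2
LR, GAMMA = 0.1, 0.9                      # made-up step size and discount
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

theta = np.zeros((N_STATES, N_ACTIONS))   # tabular policy: logits per state

def policy(state):
    p = np.exp(theta[state] - theta[state].max())   # numerically stable softmax
    return p / p.sum()

for episode in range(500):
    # 1. Roll out one full episode with the current policy (on-policy)
    trajectory, state, done = [], 0, False
    while not done:
        action = rng.choice(N_ACTIONS, p=policy(state))
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    # 2. Walk backwards, accumulate the discounted return G, and ascend
    #    G * grad(log pi(a|s)) -- the REINFORCE policy gradient
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        grad_log = -policy(s)             # d log softmax / d logits ...
        grad_log[a] += 1.0                # ... = one_hot(a) - probs
        theta[s] += LR * G * grad_log
```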
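Finally, a one-step actor-critic sketch: the critic's TD error replaces the total reward R(t) in the actor's update, so both learn at every step. Again, the environment and learning rates are illustrative assumptions.

```python
import numpy as np

# Same toy chain again (illustrative); one-step actor-critic with TD learning
N_STATES, N_ACTIONS = 5, 2
GAMMA, LR_ACTOR, LR_CRITIC = 0.9, 0.1, 0.2   # made-up learning rates
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy logits
V = np.zeros(N_STATES)                    # critic: state-value estimates

def policy(state):
    p = np.exp(theta[state] - theta[state].max())
    return p / p.sum()

for episode in range(500):
    state, done = 0, False
    while not done:
        probs = policy(state)
        action = rng.choice(N_ACTIONS, p=probs)
        next_state, reward, done = step(state, action)
        # TD error: the critic's one-step surprise, used in place of R(t)
        td_error = reward + GAMMA * V[next_state] * (not done) - V[state]
        V[state] += LR_CRITIC * td_error          # critic learns state values
        grad_log = -probs
        grad_log[action] += 1.0
        theta[state] += LR_ACTOR * td_error * grad_log   # actor follows the critic
        state = next_state
```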

Algorithms

| Name | Comments on Applicability | Reference |
| --- | --- | --- |
| Q Learning | | |
...