...

  1. Policy Optimization
  2. Q-Learning

| Policy Optimization | Q-Learning |
| --- | --- |
| Optimizes the policy parameters directly, by gradient ascent on the performance objective, or indirectly, by maximizing local approximations of it | Learns an approximator for the optimal action-value function |
| Performed on-policy: each update only uses data collected while acting according to the most recent version of the policy | Performed off-policy: each update can use data collected at any point during training |
| Directly optimizes for the thing you want (agent performance) | Only indirectly optimizes for agent performance |
| Tends to be more stable and reliable | Tends to be less stable |
| Less sample efficient: the learning data available at each iteration is limited, so training takes longer | Substantially more sample efficient when it does work, because it can reuse data more effectively |
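
In the usual textbook notation (a sketch, not taken from this page: theta are the policy parameters, alpha the step size, gamma the discount factor, R(tau) the return of trajectory tau), the two approaches boil down to:

```latex
% Policy optimization: gradient ascent on the performance objective J
\theta_{k+1} = \theta_k + \alpha \,\nabla_\theta J(\pi_\theta)\big|_{\theta_k},
\qquad
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]

% Q-learning: push Q toward the Bellman optimality target
Q^*(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big]
```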



  • Value-based methods
    • (Q-Learning, Deep Q-Learning): learn a value function that maps each state-action pair to a value.
    • Find the best action to take for each state: the action with the biggest value.
    • Work well when there is a finite set of actions (a minimal tabular sketch follows this list).
  • Policy-based methods
    • REINFORCE with Policy Gradients
    • We directly optimize the policy without using a value function.
    • Useful when the action space is continuous, or when we want a stochastic policy.
    • Use the total reward of the episode.
    • The problem is finding a good score function to compute how good a policy is (see the REINFORCE sketch after this list).
  • Hybrid Method
    • Actor-Critic Method
      • Policy Learning + Value Learning
      • Policy Function → Actor: chooses which action to take
      • Value Function → Critic: judges how well the agent is performing
      • We make an update at each step (TD Learning).
      • Because we update at every time step, we can’t use the total reward R(t).
      • Both learn in parallel, like GANs.
      • Not stable by itself, but several stable variations exist (see the actor-critic sketch after this list).
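
Below is a minimal tabular Q-learning sketch of the value-based update described above. The toy chain environment, reward, and hyperparameters are illustrative assumptions, not anything specified on this page; only the epsilon-greedy action choice and the max-bootstrap update are the algorithm itself.

```python
import random

# Toy chain environment (an assumption for illustration): 5 states in a row,
# action 1 moves right, action 0 moves left; reaching the last state pays +1.
N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # made-up hyperparameters

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

# Q-table: one value per (state, action) pair
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability EPSILON, otherwise exploit
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Off-policy Q-learning update: bootstrap from the best next action
        target = reward + GAMMA * max(Q[next_state]) * (not done)
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# The greedy policy should now move right (action 1) in every non-terminal state
print([max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)])
```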
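A matching REINFORCE sketch: a tabular softmax policy updated by Monte Carlo policy gradients using the total discounted return of the episode. The environment and step sizes are the same illustrative toy as above, not anything prescribed by this page.

```python
import numpy as np

# Same toy chain as in the Q-learning sketch (an illustrative assumption)
N_STATES, N_ACTIONS = 5, 2
LR, GAMMA = 0.1, 0.9                      # made-up step size and discount
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

theta = np.zeros((N_STATES, N_ACTIONS))   # tabular policy: logits per state

def policy(state):
    p = np.exp(theta[state] - theta[state].max())   # numerically stable softmax
    return p / p.sum()

for episode in range(500):
    # 1. Roll out one full episode with the current policy (on-policy)
    trajectory, state, done = [], 0, False
    while not done:
        action = rng.choice(N_ACTIONS, p=policy(state))
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    # 2. Walk backwards, accumulate the discounted return G, and ascend
    #    G * grad(log pi(a|s)) -- the REINFORCE policy gradient
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        grad_log = -policy(s)             # d log softmax / d logits ...
        grad_log[a] += 1.0                # ... = one_hot(a) - probs
        theta[s] += LR * G * grad_log
```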
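Finally, a one-step actor-critic sketch: the critic's TD error replaces the total reward R(t) in the actor's update, so both learn at every step. Again, the environment and learning rates are illustrative assumptions.

```python
import numpy as np

# Same toy chain again (illustrative); one-step actor-critic with TD learning
N_STATES, N_ACTIONS = 5, 2
GAMMA, LR_ACTOR, LR_CRITIC = 0.9, 0.1, 0.2   # made-up learning rates
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy logits
V = np.zeros(N_STATES)                    # critic: state-value estimates

def policy(state):
    p = np.exp(theta[state] - theta[state].max())
    return p / p.sum()

for episode in range(500):
    state, done = 0, False
    while not done:
        probs = policy(state)
        action = rng.choice(N_ACTIONS, p=probs)
        next_state, reward, done = step(state, action)
        # TD error: the critic's one-step surprise, used in place of R(t)
        td_error = reward + GAMMA * V[next_state] * (not done) - V[state]
        V[state] += LR_CRITIC * td_error          # critic learns state values
        grad_log = -probs
        grad_log[action] += 1.0
        theta[state] += LR_ACTOR * td_error * grad_log   # actor follows the critic
        state = next_state
```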

Algorithms

| Name | Comments on Applicability | Reference |
| --- | --- | --- |
| Q Learning | | |
...