Proximal Policy Optimization (PPO)
Adaptive KL Penalty
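PPO variant that adds a KL-divergence penalty to the policy objective and adjusts the penalty coefficient after each update, based on the measured divergence between the old and new policies. The coefficient is increased when the divergence exceeds a target and decreased when it falls below it, which keeps each policy update controlled.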
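The adjustment rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `update_kl_coef` and `kl_penalized_objective` are hypothetical, and the default tolerance (1.5) and scaling factor (2.0) are the heuristic values suggested in the original PPO paper, which can be tuned.

```python
def kl_penalized_objective(ratios, advantages, mean_kl, beta):
    """Surrogate objective with a KL penalty (to be maximized).

    ratios:     per-sample probability ratios pi_new(a|s) / pi_old(a|s)
    advantages: per-sample advantage estimates
    mean_kl:    mean KL divergence between old and new policy over the batch
    beta:       current KL penalty coefficient
    """
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    return surrogate - beta * mean_kl


def update_kl_coef(beta, observed_kl, target_kl, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty coefficient after a policy update.

    Assumed heuristic: if the observed KL overshoots the target by more than
    `tolerance`, strengthen the penalty; if it undershoots by more than
    `tolerance`, weaken it; otherwise leave it unchanged.
    """
    if observed_kl > target_kl * tolerance:
        return beta * factor  # policy moved too far: penalize harder
    if observed_kl < target_kl / tolerance:
        return beta / factor  # policy barely moved: relax the penalty
    return beta


# Example: the update overshot the target, so the coefficient doubles.
beta = update_kl_coef(beta=1.0, observed_kl=0.05, target_kl=0.01)
print(beta)  # 2.0
```

In a training loop, `update_kl_coef` would typically be called once per policy update, and the returned coefficient used in `kl_penalized_objective` for the next update.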