AI Glossary
The Complete Dictionary of Artificial Intelligence
Clipping Function
PPO mechanism that limits the magnitude of policy updates by clipping the probability ratio between the new and old policy to avoid overly drastic changes.
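A minimal NumPy sketch of this clipping step (the function name and the default epsilon of 0.2 are illustrative, not from a specific library):

```python
import numpy as np

def clipped_ratio(new_logprob, old_logprob, eps=0.2):
    # Probability ratio r = pi_new(a|s) / pi_old(a|s),
    # computed stably from log-probabilities.
    ratio = np.exp(new_logprob - old_logprob)
    # Clip the ratio to [1 - eps, 1 + eps] so a single update
    # cannot move the policy too far from the old one.
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)
```

With identical policies the ratio is exactly 1; large shifts in either direction are capped at 1 ± eps.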
Trust Region
Confidence region in policy space where updates are considered safe, defined by a constraint on KL divergence between successive policies.
Surrogate Objective
Modified objective function used in PPO that approximates the original objective while incorporating stability constraints like clipping to prevent performance degradation.
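The clipped surrogate can be sketched as follows (a simplified batch-mean version; the helper name is hypothetical):

```python
import numpy as np

def ppo_surrogate(new_logprob, old_logprob, advantage, eps=0.2):
    # r_t = pi_new / pi_old via the log-probability difference.
    ratio = np.exp(new_logprob - old_logprob)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic bound: taking the elementwise minimum means the
    # objective never rewards pushing the ratio outside the clip zone.
    return np.minimum(unclipped, clipped).mean()
```

When new and old policies coincide, the surrogate reduces to the mean advantage; for large ratios the contribution is capped at (1 + eps) times the advantage.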
KL Divergence Penalty
Penalty added to PPO's objective function to control divergence between successive policies, adaptively adjusted to maintain updates within an acceptable region.
Mini-batch Updates
PPO optimization process where collected data is divided into small batches to perform multiple gradient passes, improving computational efficiency and stability.
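A sketch of the shuffle-and-split loop (the function and `update_fn` callback are hypothetical scaffolding, not a library API):

```python
import numpy as np

def minibatch_epochs(data, batch_size, n_epochs, update_fn, seed=0):
    # Reuse the same rollout for several epochs, reshuffling and
    # splitting it into mini-batches each time, then apply one
    # gradient step per batch via update_fn.
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            update_fn(batch)
```

With 10 transitions, a batch size of 4, and 3 epochs, `update_fn` runs 9 times and every transition is seen exactly 3 times.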
Clip Range Parameter
Epsilon hyperparameter in PPO that defines the width of the clipping zone for the probability ratio, directly controlling the conservatism of policy updates.
Value Function Clipping
PPO variant that also applies clipping to the value function to stabilize learning and prevent large variations in value estimates.
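A sketch of the clipped value loss as it appears in common PPO implementations (the 0.5 factor and pessimistic elementwise max are conventional choices, not mandated by the definition above):

```python
import numpy as np

def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    # Restrict the new value estimate to within eps of the old one.
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    # Pessimistic bound: keep the larger of the two squared errors,
    # so clipping cannot hide a genuinely bad value prediction.
    loss_unclipped = (v_new - returns) ** 2
    loss_clipped = (v_clipped - returns) ** 2
    return 0.5 * np.maximum(loss_unclipped, loss_clipped).mean()
```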
Epoch Optimization
PPO process where the same experience data is reused for multiple optimization passes, improving the utilization of collected data.
Normalized Advantage
Technique for normalizing advantage estimates to stabilize training by maintaining a consistent gradient scale between updates.
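The standard per-batch normalization can be sketched in a few lines (the small epsilon guards against division by zero for a constant batch):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    # Standardize advantage estimates to zero mean and unit
    # standard deviation within the batch.
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)
```

After normalization the gradient scale is roughly constant across updates regardless of the raw reward magnitude.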
Experience Collection
PPO phase where the agent interacts with the environment following the current policy to collect transitions (state, action, reward) used for optimization.
Adaptive KL Penalty
PPO variant that dynamically adjusts the KL penalty strength based on the observed divergence between policies, ensuring controlled updates.
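The adaptation rule can be sketched with the heuristic from the original PPO paper: if the observed KL exceeds 1.5x the target, double the penalty coefficient; if it falls below the target divided by 1.5, halve it (the factors 2, 0.5, and 1.5 are the paper's choices; the target of 0.01 here is illustrative):

```python
def adapt_kl_coef(beta, observed_kl, target_kl=0.01):
    # Strengthen the penalty when updates diverge too much,
    # weaken it when they are overly conservative.
    if observed_kl > 1.5 * target_kl:
        beta *= 2.0
    elif observed_kl < target_kl / 1.5:
        beta *= 0.5
    return beta
```

Inside the band [target/1.5, 1.5*target] the coefficient is left unchanged, so beta settles once updates stay near the target divergence.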