AI Glossary
The complete dictionary of Artificial Intelligence
Conservative Q-Learning (CQL)
Offline reinforcement learning method that penalizes the Q-values of out-of-distribution actions, keeping the learned policy close to the behavioral data distribution and preventing divergence.
Offline data distribution
Fixed and predefined dataset collected from a behavioral policy, serving as the sole source of information for offline RL training.
Conservative penalty
Regularization term added to the loss function to penalize high Q-values for state-action pairs absent from training data, thus preventing overestimation.
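A minimal NumPy sketch of this penalty as it appears in the common CQL(H) variant, assuming a discrete action space; the function and array names are illustrative:

```python
import numpy as np

def conservative_penalty(q_values, dataset_actions):
    """CQL(H)-style penalty for a batch of states with discrete actions.

    q_values:        (batch, n_actions) Q-estimates for every action.
    dataset_actions: (batch,) actions actually taken in the offline data.
    """
    # log-sum-exp pushes down Q-values across ALL actions (a soft maximum)...
    logsumexp = np.log(np.exp(q_values).sum(axis=1))
    # ...while the Q-values of actions seen in the data are pushed back up.
    data_q = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return (logsumexp - data_q).mean()

q = np.array([[1.0, 3.0, 0.5], [2.0, 2.0, 4.0]])
print(conservative_penalty(q, np.array([0, 2])))  # larger when unseen actions carry inflated Q-values
```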
Q-value overestimation
Inherent problem in offline RL where Q-values are artificially inflated for unobserved actions, leading to suboptimal and unstable policies.
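A small numeric illustration of why this happens: taking a maximum over unbiased but noisy Q-estimates is itself biased upward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(10)                    # every action is truly worth 0
noisy_q = true_q + rng.normal(0, 1, size=(100_000, 10))  # unbiased, noisy estimates
# Acting greedily on noisy estimates systematically overestimates the value:
print(noisy_q.max(axis=1).mean())        # ~1.5, far above the true value of 0
```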
Conservative policy
Action strategy that intentionally stays close to behaviors observed in the dataset, minimizing the risk of divergence due to extrapolation on unseen data.
Distribution correction
Mechanism in CQL that adjusts Q-estimations to correct the mismatch between the behavioral distribution and the target policy distribution.
Policy gap
Measure of divergence between the learned policy and the behavioral policy, crucial for ensuring stability in offline reinforcement learning.
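One common way to quantify this gap is a KL divergence between the two action distributions; a minimal sketch for a single state, with illustrative probabilities:

```python
import numpy as np

def policy_gap_kl(pi, beta, eps=1e-8):
    """KL(pi || beta) between discrete action distributions at one state."""
    pi, beta = np.asarray(pi), np.asarray(beta)
    return float(np.sum(pi * np.log((pi + eps) / (beta + eps))))

beta = np.array([0.4, 0.4, 0.2])   # behavioral policy
pi   = np.array([0.1, 0.1, 0.8])   # learned policy leaning on a rarely seen action
print(policy_gap_kl(pi, beta))     # a large gap signals risky extrapolation
```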
CQL loss function
Objective function combining the standard Q-Learning loss with a conservative term that minimizes Q-values for out-of-distribution actions; in the discrete case the penalty takes the form log Σ_a exp(Q(s,a)) − E_{a∼D}[Q(s,a)], a soft maximum over all actions minus the Q-values of dataset actions.
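A PyTorch sketch of this objective for discrete actions, assuming hypothetical q_net and target_net modules mapping states to per-action Q-values; names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_net, batch, alpha=1.0, gamma=0.99):
    """One CQL loss evaluation for discrete actions.

    batch: dict of tensors  s (B, obs), a (B,) long, r (B,), s2 (B, obs), done (B,)
    """
    q_all = q_net(batch["s"])                                  # (B, n_actions)
    q_data = q_all.gather(1, batch["a"].unsqueeze(1)).squeeze(1)

    # Standard TD target from the target network.
    with torch.no_grad():
        target = batch["r"] + gamma * (1 - batch["done"]) * target_net(batch["s2"]).max(1).values
    td_loss = F.mse_loss(q_data, target)

    # Conservative term: push down a soft maximum over all actions,
    # push back up the Q-values of actions present in the dataset.
    penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * penalty
```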
Importance Sampling Ratio
Coefficient weighting each logged transition by π(a|s)/β(a|s), the probability of the action under the target policy relative to the behavioral policy, essential for correcting evaluation bias.
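A minimal sketch of computing these ratios for a batch of logged transitions (array names are illustrative):

```python
import numpy as np

def importance_ratio(pi_probs, beta_probs, actions):
    """rho = pi(a|s) / beta(a|s) for a batch of logged transitions."""
    idx = np.arange(len(actions))
    return pi_probs[idx, actions] / beta_probs[idx, actions]

# E_beta[rho * r] gives an unbiased estimate of E_pi[r] from behavioral data.
rho = importance_ratio(
    np.array([[0.7, 0.3]]), np.array([[0.5, 0.5]]), np.array([0])
)
print(rho)  # [1.4] : this transition counts more under the target policy
```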
Distributional Shift
Fundamental difference between the distribution of available data and that required to accurately evaluate the learned policy, main challenge of offline RL.
Learning Stabilization
Objective of CQL aiming to promote stable convergence by avoiding the oscillations and divergence caused by extrapolation on limited data.
Conservative Safeguard
Safety mechanism built into CQL limiting Q-value optimization for state-action pairs that are infrequent or absent from the training dataset.
Conservative Q-update
Iterative process modifying Q-values by penalizing overestimations while preserving reliable estimates based on observed data.
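A hypothetical tabular sketch: the conservative step below follows the gradient of the log-sum-exp penalty, lowering all Q-values at a state by their softmax weights and restoring the observed action:

```python
import numpy as np

def conservative_q_update(Q, s, a, r, s2, lr=0.1, alpha_cql=0.5, gamma=0.99):
    """One tabular update: a TD step plus a pessimism correction.

    Q: (n_states, n_actions) table; (s, a, r, s2) comes from the offline dataset.
    """
    # Standard TD step on the observed transition.
    td_target = r + gamma * Q[s2].max()
    Q[s, a] += lr * (td_target - Q[s, a])
    # Conservative step: gradient of logsumexp(Q[s]) - Q[s, a] is
    # softmax(Q[s]) minus an indicator on the observed action.
    soft_max = np.log(np.exp(Q[s]).sum())
    Q[s] -= lr * alpha_cql * np.exp(Q[s] - soft_max)  # softmax weights
    Q[s, a] += lr * alpha_cql
    return Q
```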
Extrapolation Error
Inaccuracy introduced when a model makes predictions for states or actions not represented in the training dataset, major problem in offline RL.
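An illustration outside RL makes the failure mode concrete: a polynomial regression fit on [0, 1] behaves well in-range and collapses when queried far outside it.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 50)            # training data only covers [0, 1]
y_train = np.sin(2 * np.pi * x_train)
coeffs = np.polyfit(x_train, y_train, deg=5)

print(np.polyval(coeffs, 0.5))   # in-distribution: ~0.0, matches sin(pi)
print(np.polyval(coeffs, 3.0))   # out-of-distribution: a huge, meaningless value
```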
Conservative Critic
CQL component evaluating actions with a conservative bias, assigning lower scores to actions potentially overestimated due to lack of data.
Constrained Action Space
Subset of possible actions limited to those observed in the dataset, reducing the risk of policies exploiting extrapolation artifacts.
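A minimal sketch of support-constrained action selection for a discrete action space, where the mask is assumed to come from counting which actions appear in the dataset:

```python
import numpy as np

def greedy_in_support(q_row, support_mask):
    """Pick the greedy action among actions observed in the dataset only."""
    masked = np.where(support_mask, q_row, -np.inf)  # hide unseen actions
    return int(np.argmax(masked))

q_row = np.array([5.0, 1.0, 2.5])        # action 0 looks best but was never logged
support = np.array([False, True, True])  # only actions 1 and 2 appear in the data
print(greedy_in_support(q_row, support)) # -> 2, the best *supported* action
```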
Behavior Sampling
Process of collecting transitions (state, action, reward, next state) according to a fixed behavioral policy, constituting the offline dataset.
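A minimal sketch of this collection loop on a toy MDP; the dynamics, reward, and policy here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 5, 2

def behavior_policy(s):                      # fixed logging policy (uniform-random)
    return int(rng.integers(n_actions))

def step(s, a):                              # toy MDP dynamics
    s2 = (s + a + 1) % n_states
    return s2, float(s2 == 0)                # reward for reaching state 0

dataset, s = [], 0
for _ in range(1000):
    a = behavior_policy(s)
    s2, r = step(s, a)
    dataset.append((s, a, r, s2))            # the frozen offline dataset
    s = s2
print(len(dataset), dataset[:2])
```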
Policy Divergence
Phenomenon where the learned policy dangerously deviates from the data distribution, leading to degraded performance or total learning collapse.