Conservative Q-Learning (CQL)

Offline reinforcement learning method that penalizes Q-values for actions outside the dataset, countering overestimation so that the learned policy stays close to the behavioral data distribution and training does not diverge.

Offline data distribution

Distribution of states and actions in the fixed, pre-collected dataset gathered by a behavioral policy; this dataset is the sole source of information for offline RL training.

Conservative penalty

Regularization term added to the loss function to penalize high Q-values for state-action pairs absent from training data, thus preventing overestimation.

Q-value overestimation

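Inherent problem in offline RL where Q-values are artificially inflated for unobserved actions, because the maximization in the Bellman backup favors actions whose estimates err upward and the fixed dataset offers no new experience to correct them; this leads to suboptimal and unstable policies.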
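
The upward bias can be illustrated with a small simulation. This is only a sketch under assumed values (ten actions that are all truly worth 0, unit Gaussian estimation noise); the variable names are illustrative and not part of CQL.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10      # hypothetical number of actions in one state
n_trials = 10_000   # independent noisy estimates of the same Q-values
true_q = np.zeros(n_actions)  # assumption: every action is truly worth 0

# Each row is one noisy, zero-mean estimate of the true Q-values.
noisy_q = true_q + rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# A Bellman backup takes max_a Q(s, a); averaging that max exposes the bias.
mean_of_max = noisy_q.max(axis=1).mean()
# Averaging first and maximizing afterwards stays close to the true value.
max_of_mean = noisy_q.mean(axis=0).max()

print(f"true max Q:            {true_q.max():.3f}")
print(f"mean of max estimates: {mean_of_max:.3f}")
print(f"max of mean estimates: {max_of_mean:.3f}")
```

Although every individual estimate is unbiased, the maximum over noisy estimates is systematically above the true maximum, which is the effect the conservative penalty is meant to offset.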

Conservative policy

Action strategy that intentionally stays close to behaviors observed in the dataset, minimizing the risk of divergence due to extrapolation on unseen data.

Distribution correction

Mechanism in CQL that adjusts Q-value estimates to correct the mismatch between the behavioral distribution and the target policy distribution.

Policy gap

Measure of divergence between the learned policy and the behavioral policy, crucial for ensuring stability in offline reinforcement learning.
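
One common way to quantify such a gap is the KL divergence between the two action distributions in a given state. The sketch below assumes discrete actions and uses made-up probabilities; the choice of KL and the name `kl_divergence` are illustrative assumptions, not something the definition above prescribes.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); it is a numerical convenience only.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical action probabilities in a single state.
behavior_policy = np.array([0.7, 0.2, 0.1])
close_policy = np.array([0.6, 0.3, 0.1])    # stays near the data
far_policy = np.array([0.05, 0.05, 0.9])    # prefers a rarely seen action

gap_close = kl_divergence(close_policy, behavior_policy)
gap_far = kl_divergence(far_policy, behavior_policy)

print(f"gap (close policy): {gap_close:.4f}")
print(f"gap (far policy):   {gap_far:.4f}")
```

A policy that concentrates on actions the behavioral policy rarely took produces a much larger gap than one that only reweights frequently observed actions.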

CQL loss function

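Objective function combining the standard Q-learning (Bellman error) loss with a conservative term that pushes down Q-values for out-of-distribution actions. The conservative term has the form log(Σ_a exp(Q(s, a))) - Q(s, a_data), where the sum runs over all actions in state s and a_data is the action recorded in the dataset for that state; it is averaged over dataset states and scaled by a weight α.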
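
A minimal NumPy sketch of this loss for discrete actions follows. The function names, the weight `alpha`, the 0.5 factor on the squared error, and the toy batch are all assumptions made for illustration; TD targets are taken as given rather than computed from a target network.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """q_values: (batch, n_actions); actions: (batch,) ints; td_targets: (batch,)."""
    rows = np.arange(len(actions))
    q_data = q_values[rows, actions]          # Q(s, a) for the dataset actions
    td_loss = 0.5 * np.mean((q_data - td_targets) ** 2)
    # Conservative term: log-sum-exp over all actions minus the dataset action's Q.
    conservative = np.mean(logsumexp(q_values, axis=1) - q_data)
    return td_loss + alpha * conservative, td_loss, conservative

# Toy batch of two transitions with three actions each (made-up numbers).
q_values = np.array([[1.0, 2.0, 0.5],
                     [0.0, 0.0, 3.0]])
actions = np.array([1, 0])
td_targets = np.array([1.5, 0.5])

total, td, penalty = cql_loss(q_values, actions, td_targets, alpha=1.0)
print(f"total={total:.4f}  td={td:.4f}  conservative={penalty:.4f}")
```

Because log-sum-exp is never smaller than any single Q-value, the conservative term is non-negative, and it grows when an action that is not the dataset action receives a high Q-value.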

Importance Sampling Ratio

Coefficient weighting transitions according to their probability of occurrence under the target policy relative to the behavioral policy, used to correct the bias of estimates computed from off-policy data.
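
As a rough sketch, the per-transition ratio is the target policy's probability of the logged action divided by the behavioral policy's probability of the same action. The function name, the optional clipping, and the numbers below are illustrative assumptions.

```python
import numpy as np

def importance_ratios(target_probs, behavior_probs, clip=None):
    """Ratio pi_target(a|s) / pi_behavior(a|s) for each logged transition."""
    target_probs = np.asarray(target_probs, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    if np.any(behavior_probs <= 0.0):
        raise ValueError("logged actions must have positive behavior probability")
    ratios = target_probs / behavior_probs
    if clip is not None:
        # Clipping trades some bias for lower variance (an optional assumption).
        ratios = np.minimum(ratios, clip)
    return ratios

# Probabilities of three logged actions under each policy (made-up values).
target_probs = np.array([0.5, 0.1, 0.9])
behavior_probs = np.array([0.25, 0.5, 0.9])
rewards = np.array([1.0, 0.0, 2.0])

ratios = importance_ratios(target_probs, behavior_probs)
clipped = importance_ratios(target_probs, behavior_probs, clip=1.5)
# Importance-weighted estimate of the expected reward under the target policy.
weighted_reward = np.mean(ratios * rewards)
print(ratios, clipped, weighted_reward)
```

Ratios above 1 mark actions the target policy favors more than the behavioral policy did; large ratios are also where such estimates become high-variance.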

Distributional Shift

Fundamental difference between the distribution of the available data and the distribution required to accurately evaluate the learned policy; the main challenge of offline RL.

Learning Stabilization

Objective of CQL: keeping training stable and helping it converge by avoiding the oscillations and divergence caused by extrapolating from limited data.

Conservative Safeguard

Safety mechanism built into CQL limiting Q-value optimization for state-action pairs that are infrequent or absent from the training dataset.

Conservative Q-update

Iterative process modifying Q-values by penalizing overestimations while preserving reliable estimates based on observed data.
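
For a tabular Q-function, one such update can be sketched as a gradient step on the squared TD error plus a log-sum-exp penalty. The function name, the hyperparameters, and the semi-gradient treatment of the target (held constant during the step) are assumptions of this sketch.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def conservative_q_update(Q, s, a, r, s_next, done,
                          gamma=0.99, alpha=1.0, lr=0.1):
    """One in-place update of the table Q from a single logged transition."""
    # Bootstrapped target, treated as a constant (semi-gradient).
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # Gradient of alpha * logsumexp(Q[s]) pushes every action's value down.
    grad = alpha * softmax(Q[s])
    # For the dataset action: add the TD error and subtract alpha (push it up).
    grad[a] += (Q[s, a] - target) - alpha
    Q[s] -= lr * grad
    return Q

# Toy table with 2 states and 3 actions, and one made-up transition.
Q = np.zeros((2, 3))
conservative_q_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
print(Q)
```

After this step the dataset action's value in state 0 rises toward its target while the two unobserved actions are pushed slightly below zero, and the row for state 1 is untouched.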

Extrapolation Error

Inaccuracy introduced when a model makes predictions for states or actions not represented in the training dataset; a major problem in offline RL.

Conservative Critic

CQL component evaluating actions with a conservative bias, assigning lower scores to actions potentially overestimated due to lack of data.

Constrained Action Space

Subset of possible actions limited to those observed in the dataset, reducing the risk of policies exploiting extrapolation artifacts.
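
In a small tabular setting, this restriction can be sketched as a boolean mask over (state, action) pairs that appear in the dataset, applied before the greedy action is chosen. The helper names and the toy values are hypothetical.

```python
import numpy as np

def seen_action_mask(states, actions, n_states, n_actions):
    """True where the (state, action) pair appears in the dataset."""
    mask = np.zeros((n_states, n_actions), dtype=bool)
    mask[states, actions] = True
    return mask

def constrained_greedy(Q, mask):
    """Greedy action per state, restricted to actions seen in that state.

    States with no logged action fall back to action 0, since argmax over
    a row of -inf returns the first index.
    """
    masked_q = np.where(mask, Q, -np.inf)
    return masked_q.argmax(axis=1)

# Logged (state, action) pairs and a made-up Q-table.
states = np.array([0, 0, 1])
actions = np.array([0, 2, 1])
Q = np.array([[1.0, 5.0, 2.0],
              [9.0, 0.0, 8.0]])

mask = seen_action_mask(states, actions, n_states=2, n_actions=3)
unconstrained = Q.argmax(axis=1)
constrained = constrained_greedy(Q, mask)
print(unconstrained, constrained)
```

Here the unconstrained greedy policy picks the actions with the highest Q-values even though they were never observed, while the constrained version can only choose among logged actions.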

Behavior Sampling

Process of collecting transitions (state, action, reward, next state) according to a fixed behavioral policy, constituting the offline dataset.
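
The collection process can be sketched with a toy environment. Everything here is hypothetical: a five-state chain, a behavioral policy that moves right with a fixed probability, and a `done` flag stored alongside each transition.

```python
import numpy as np

def step(state, action, n_states=5):
    """Hypothetical chain: action 1 moves right, action 0 moves left.

    Reaching the right end gives reward 1 and ends the episode.
    """
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def collect_dataset(n_episodes, p_right=0.7, max_steps=20, seed=0):
    """Roll out a fixed stochastic behavioral policy and log every transition."""
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            action = int(rng.random() < p_right)   # fixed behavioral policy
            next_state, reward, done = step(state, action)
            dataset.append((state, action, reward, next_state, done))
            if done:
                break
            state = next_state
    return dataset

dataset = collect_dataset(n_episodes=100)
print(len(dataset), dataset[0])
```

Once collected, the list is never extended during training; an offline algorithm sees only these logged transitions.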

Policy Divergence

Phenomenon where the learned policy dangerously deviates from the data distribution, leading to degraded performance or total learning collapse.
