AI Glossary
A Complete Dictionary of Artificial Intelligence
Batch Constrained Q-learning (BCQ)
Offline reinforcement learning algorithm that constrains the learned policy to stay close to actions observed in the training dataset, avoiding extrapolation errors. BCQ uses a conditional generative model (a VAE in the original paper) to produce actions similar to those in the batch, then explores slight variations around them.
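As a minimal sketch of BCQ's action-selection step, assuming pretrained networks `vae` (candidate generator), `perturb` (perturbation model), and `q_net` (critic) with the interfaces used below; names and shapes are illustrative, not the reference implementation.

```python
import torch

def select_action(state, vae, perturb, q_net, n_candidates=10, phi=0.05, max_action=1.0):
    # Repeat the state so the generative model proposes several candidates.
    states = state.unsqueeze(0).repeat(n_candidates, 1)
    # Sample actions close to the behavior policy from the generative model.
    candidates = vae.decode(states)
    # Apply a small learned perturbation, bounded to [-phi, phi] per dimension.
    adjusted = candidates + phi * max_action * torch.tanh(perturb(states, candidates))
    adjusted = adjusted.clamp(-max_action, max_action)
    # Keep the candidate with the highest estimated Q-value.
    q_values = q_net(states, adjusted).squeeze(-1)
    return adjusted[q_values.argmax()]
```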
Distribution Shift
Phenomenon where the distribution of state-action pairs visited by the learned policy differs significantly from that of the offline dataset. This shift can lead to biased value estimates and degraded performance at deployment.
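One illustrative formalization (the notation here is an assumption, not part of the glossary): compare the discounted state-action occupancy of the learned policy with that of the behavior policy that generated the data.

```latex
% d^{\pi}: discounted state-action occupancy of policy \pi
d^{\pi}(s,a) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr(s_t = s,\, a_t = a \mid \pi)
% The shift is severe when d^{\pi} diverges from the data
% distribution d^{\beta}, e.g. as measured by
D_{\mathrm{KL}}\!\left(d^{\pi}\,\big\|\,d^{\beta}\right)
```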
Offline Reinforcement Learning
Learning paradigm in which the agent learns exclusively from a fixed set of previously collected data, without any interaction with the environment. This approach is essential when real-time exploration is costly or dangerous.
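A minimal sketch of the offline training loop: the agent samples only from a fixed dataset of logged transitions and never calls the environment. `dataset.sample`, `agent.update`, and the field layout are assumptions for illustration.

```python
def train_offline(agent, dataset, n_steps=100_000, batch_size=256):
    for step in range(n_steps):
        # Sample logged transitions; no new environment interaction occurs.
        batch = dataset.sample(batch_size)  # (s, a, r, s', done)
        agent.update(batch)
    return agent
```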
Behavior Cloning
Supervised learning technique that directly imitates expert actions from demonstration data, without using any reward signal. Although simple, this approach can suffer from compounding errors at deployment: small mistakes drive the agent into states absent from the demonstrations, where its predictions degrade further.
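A minimal behavior cloning sketch for continuous actions, assuming `policy` is a deterministic network mapping states to actions; the loss is pure supervised regression.

```python
import torch.nn as nn

def bc_loss(policy, states, expert_actions):
    # Mean-squared error between predicted and demonstrated actions;
    # no reward signal is used anywhere.
    return nn.functional.mse_loss(policy(states), expert_actions)
```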
Implicit Q-learning
Method that learns the Q-function implicitly, avoiding direct evaluation of out-of-distribution actions. IQL formulates value learning as an expectile regression problem to better handle the uncertainty in offline data.
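A sketch of IQL's expectile regression for the value function V: with tau above 0.5 the loss weights positive errors more heavily, so V approaches an upper expectile of Q over dataset actions only, never querying out-of-distribution actions.

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    u = q_values - v_values
    # Weight is tau for positive errors, (1 - tau) for negative ones.
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```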
Out-of-Distribution Actions
Actions produced by the learned policy that were rarely or never observed in the training dataset. These actions pose a major risk in offline RL because their values cannot be reliably estimated.
Policy Constraint
Mechanism that restricts the learned policy to actions similar to those present in the offline data batch. This constraint can be implemented via penalties, divergences, or conditional generative models.
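A sketch of a penalty-based constraint in the style of TD3+BC: maximize the critic's value while penalizing deviation from batch actions. `policy` and `q_net` are assumed learnable networks; the weighting follows the TD3+BC normalization trick.

```python
import torch.nn as nn

def constrained_policy_loss(policy, q_net, states, batch_actions, alpha=2.5):
    pi_actions = policy(states)
    q = q_net(states, pi_actions)
    # Normalizing by the Q scale keeps the two terms comparable.
    lam = alpha / q.abs().mean().detach()
    # First term pushes toward high value; second keeps actions near the data.
    return -(lam * q).mean() + nn.functional.mse_loss(pi_actions, batch_actions)
```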
Perturbation Model
Component of BCQ that generates small variations around behavior-policy actions to explore the action space locally. The model adds a bounded, learned perturbation to observed actions while keeping the result within valid action bounds.
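A sketch of a BCQ-style perturbation model: a small network emits a correction bounded to [-phi, phi], capping how far the final action may drift from the data. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Perturbation(nn.Module):
    def __init__(self, state_dim, action_dim, phi=0.05, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.phi, self.max_action = phi, max_action

    def forward(self, state, action):
        # tanh bounds the raw output; phi scales it into a small range.
        delta = self.phi * self.max_action * torch.tanh(self.net(torch.cat([state, action], dim=-1)))
        return (action + delta).clamp(-self.max_action, self.max_action)
```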
Value Function Estimation
Process of estimating Q-values from offline data while accounting for the bias induced by the lack of exploration. Modern methods deliberately underestimate values (conservatism) so the policy cannot exploit overestimated Q-values.
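A simplified sketch of a conservative penalty in the spirit of CQL (the full method uses a log-sum-exp over sampled actions): push Q down on policy-preferred actions and up on dataset actions, so unobserved actions end up systematically underestimated.

```python
def conservative_penalty(q_net, states, batch_actions, policy, alpha=1.0):
    q_pi = q_net(states, policy(states))   # actions the learned policy prefers
    q_data = q_net(states, batch_actions)  # actions actually in the batch
    # Added to the standard TD loss; positive when the policy's actions
    # are valued above the data's actions.
    return alpha * (q_pi.mean() - q_data.mean())
```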
Batch RL
Reinforcement learning framework where the agent has a fixed batch of transitions and must learn an optimal policy without additional interactions. This context imposes specific constraints on algorithms to prevent divergence.
Safety Constraint
Restriction imposed on offline policies to ensure that generated actions remain in safe regions of the state-action space. These constraints are crucial in applications such as robotics or medicine.
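A minimal sketch of a hard safety constraint: project any proposed action onto a known safe box before execution. The bounds `low` and `high` are assumed to come from domain knowledge.

```python
import torch

def enforce_safe_action(action, low, high):
    # Componentwise projection onto the safe interval [low, high].
    return torch.max(torch.min(action, high), low)
```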
Action Repetition
Strategy used in offline RL to improve stability by repeating actions similar to those observed in the data. This technique reduces the risk of generating completely new and potentially dangerous actions.
Uncertainty Estimation
Quantification of uncertainty associated with value estimates of actions not observed in the batch. Accurate uncertainty estimation allows penalizing out-of-distribution actions and improves robustness.
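A sketch of ensemble-based uncertainty: train several independent Q-networks and treat their disagreement as an epistemic-uncertainty signal that penalizes out-of-distribution actions. Each critic is assumed to return a value per batch element.

```python
import torch

def penalized_q(q_ensemble, states, actions, beta=1.0):
    # Stack predictions from each critic: shape (ensemble, batch).
    qs = torch.stack([q(states, actions) for q in q_ensemble], dim=0)
    # Mean value minus a multiple of the disagreement (standard deviation).
    return qs.mean(dim=0) - beta * qs.std(dim=0)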
Model-Based RL
Approach that learns a model of the environment dynamics from offline data to generate synthetic experiences. In an offline context, this model must be used cautiously to avoid error propagation.
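A sketch of cautious model use in the style of MOPO: synthetic transitions come from a learned dynamics ensemble, and the reward is penalized by the models' disagreement so the agent distrusts regions where the model is likely wrong. The ensemble interface is an assumption.

```python
import torch

def model_rollout_step(model_ensemble, state, action, lam=1.0):
    preds = [m(state, action) for m in model_ensemble]  # each returns (next_state, reward)
    next_states = torch.stack([p[0] for p in preds])
    rewards = torch.stack([p[1] for p in preds])
    # Disagreement between models approximates model error.
    penalty = next_states.std(dim=0).mean(dim=-1)
    # Use one model's prediction, but a pessimistic reward.
    return next_states[0], rewards.mean(dim=0) - lam * penalty
```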
Policy Evaluation
Evaluation of a policy's performance using only offline data, without interacting with the environment (often called off-policy evaluation, OPE). This step is crucial for validating a policy before deployment.
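A sketch of one classic OPE estimator, per-trajectory importance sampling: reweight logged returns by how likely the evaluated policy is to have produced the logged actions. `pi_prob` and `beta_prob` are assumed action-probability functions; note the estimate's variance grows quickly with horizon.

```python
import numpy as np

def importance_sampling_value(trajectories, pi_prob, beta_prob, gamma=0.99):
    estimates = []
    for traj in trajectories:  # traj: list of (state, action, reward)
        ratio, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Cumulative likelihood ratio between evaluated and behavior policy.
            ratio *= pi_prob(s, a) / beta_prob(s, a)
            ret += gamma**t * r
        estimates.append(ratio * ret)
    return np.mean(estimates)
```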
Policy Improvement
Process of iteratively improving the policy using value estimates computed from the offline data batch. Each improvement step must respect distribution constraints so that the underlying value estimates remain trustworthy.
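A sketch of constrained improvement via advantage-weighted regression (used, for example, as IQL's policy-extraction step): the policy is fit only to dataset actions, weighted by their estimated advantage, so improvement never leaves the data distribution. `policy.log_prob` is an assumed interface.

```python
import torch

def awr_loss(policy, states, batch_actions, advantages, beta=3.0):
    # Exponentiated advantages favor good dataset actions; clamping
    # keeps the weights numerically stable.
    weights = torch.clamp(torch.exp(beta * advantages), max=100.0)
    log_probs = policy.log_prob(states, batch_actions)
    return -(weights * log_probs).mean()
```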
Bootstrapping Error
Error that accumulates when a policy bootstraps from its own value estimates to improve itself, with targets queried at actions that may lie outside the data support, leading to divergence. Offline methods use specific techniques to control this bias.
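A sketch of where the error enters: the temporal-difference target evaluates the current Q estimate at the policy's next action, which may be out-of-distribution; any overestimation there feeds back into training. Network interfaces are assumptions.

```python
def td_target(q_target, policy, rewards, next_states, dones, gamma=0.99):
    next_actions = policy(next_states)            # possibly unseen in the batch
    next_q = q_target(next_states, next_actions)  # unreliable if (s', a') is OOD
    return rewards + gamma * (1.0 - dones) * next_q
```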