
AI Glossary

The complete dictionary of Artificial Intelligence

162 categories · 2,032 subcategories · 23,060 terms

Batch Constrained Q-learning (BCQ)

Offline reinforcement learning algorithm that constrains the learned policy to stay close to actions observed in the training dataset, avoiding extrapolation error. BCQ uses a generative action model to propose actions similar to those in the batch and a perturbation network to explore slight variations around them.
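As an illustrative sketch, BCQ's action-selection step can be written as follows. The `sample_action`, `perturb`, and `q_value` callables are hypothetical stand-ins for the learned generative model, perturbation network, and critic:

```python
import numpy as np

def bcq_select_action(state, sample_action, perturb, q_value, n_candidates=10):
    """BCQ-style action selection: generate candidates near the batch
    distribution, add a small clipped perturbation, pick the best by Q.
    All three callables are stand-ins for learned networks."""
    candidates = np.stack([sample_action(state) for _ in range(n_candidates)])
    candidates = candidates + np.clip(perturb(state, candidates), -0.05, 0.05)
    values = q_value(state, candidates)  # one value per candidate
    return candidates[np.argmax(values)]
```

Because every candidate starts from the generator's output, the selected action stays near the batch distribution even when the Q-function is unreliable elsewhere.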

Distribution Shift

Phenomenon in which the distribution of state-action pairs visited by the learned policy differs significantly from that of the offline dataset. This shift can bias value estimates and degrade performance at deployment.

Offline Reinforcement Learning

Learning paradigm where the agent learns exclusively from a fixed set of previously collected data, without interaction with the environment. This approach is essential when real-time exploration is costly or dangerous.

Behavior Cloning

Supervised learning technique that directly imitates expert actions from demonstration data, without using reward signals. Although simple, this approach can suffer from compounding errors at deployment time: small mistakes drive the policy into states absent from the demonstrations, where its predictions are unreliable.
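A minimal sketch of behavior cloning as plain supervised regression — here a linear least-squares fit stands in for the neural-network regressor typically used:

```python
import numpy as np

def behavior_clone(states, actions):
    """Fit a linear policy a ≈ s @ W to expert demonstrations by least
    squares; no rewards are involved, only (state, action) pairs."""
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return lambda s: s @ W
```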

Implicit Q-Learning (IQL)

Method that learns the Q-function implicitly, avoiding direct evaluation of out-of-distribution actions. IQL formulates value learning as an expectile regression problem to better handle uncertainty in the offline data.
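The expectile loss at the heart of IQL is short enough to state directly — a sketch of the asymmetric squared loss, with `diff` denoting the fitting residual:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """IQL's asymmetric squared loss |tau - 1{u<0}| * u^2: for tau > 0.5,
    positive residuals are weighted more heavily, so the fit approaches
    an upper expectile of the value distribution rather than its mean."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff ** 2).mean()
```

At tau = 0.5 this reduces to ordinary mean-squared error; pushing tau toward 1 approximates a max over actions without ever querying unseen ones.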

Out-of-Distribution Actions

Actions produced by the learned policy that were rarely or never observed in the training dataset. Such actions pose a major risk in offline RL because their values cannot be reliably estimated.

Policy Constraint

Mechanism that limits the learned policy to produce actions similar to those present in the offline data batch. This constraint can be implemented via penalties, divergences, or conditional generative models.
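One way to sketch the penalty form of such a constraint, in the style of TD3+BC (the MSE term here stands in for a divergence; `alpha` is an assumed trade-off coefficient):

```python
import numpy as np

def constrained_policy_loss(q_values, policy_actions, batch_actions, alpha=2.5):
    """Penalized policy objective: maximize Q while an MSE term keeps the
    policy's actions near the batch actions. Minimizing this loss trades
    off value against deviation from the data."""
    bc_penalty = ((policy_actions - batch_actions) ** 2).mean()
    return -q_values.mean() + alpha * bc_penalty
```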

Perturbation Model

Component of BCQ that generates variations around behavior actions to locally explore the action space. This model adds controlled noise to observed actions while ensuring their feasibility.
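A minimal sketch of the double clipping this implies — the correction `xi` (a hypothetical network output) is bounded by `phi`, and the result is clipped to the valid action range:

```python
import numpy as np

def perturb_action(action, xi, phi=0.05, low=-1.0, high=1.0):
    """Apply a correction clipped to [-phi, phi], then clip the result to
    the valid action range: the noise stays controlled and feasible."""
    return np.clip(action + np.clip(xi, -phi, phi), low, high)
```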

Value Function Estimation

Process of estimating Q-values from offline data while accounting for the bias introduced by the lack of exploration. Modern methods deliberately underestimate values (conservatism) to avoid exploiting overestimated Q-values.
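One simple conservatism mechanism is the clipped double-Q target; a sketch assuming two critics' next-state value predictions are available:

```python
import numpy as np

def conservative_target(rewards, next_q1, next_q2, gamma=0.99):
    """Clipped double-Q target: taking the elementwise min of two critics'
    next-state values biases the target downward, one common form of
    conservative value estimation."""
    return rewards + gamma * np.minimum(next_q1, next_q2)
```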

Batch RL

Reinforcement learning framework where the agent has a fixed batch of transitions and must learn an optimal policy without additional interactions. This context imposes specific constraints on algorithms to prevent divergence.

Safety Constraint

Restriction imposed on offline policies to ensure that generated actions remain in safe regions of the state-action space. These constraints are crucial in applications such as robotics or medicine.

Action Repetition

Strategy used in offline RL to improve stability by repeating actions similar to those observed in the data. This technique reduces the risk of generating completely new and potentially dangerous actions.

Uncertainty Estimation

Quantification of uncertainty associated with value estimates of actions not observed in the batch. Accurate uncertainty estimation allows penalizing out-of-distribution actions and improves robustness.
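A common implementation sketch uses an ensemble of critics and treats their disagreement as the uncertainty signal (`beta` is an assumed penalty coefficient):

```python
import numpy as np

def penalized_value(q_ensemble, beta=1.0):
    """Penalize the ensemble-mean Q by its standard deviation: critics
    tend to disagree on out-of-distribution actions, so high disagreement
    lowers the effective value estimate."""
    q = np.asarray(q_ensemble)  # shape: (n_critics, n_actions)
    return q.mean(axis=0) - beta * q.std(axis=0)
```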

Model-Based RL

Approach that learns a model of the environment dynamics from offline data to generate synthetic experiences. In an offline context, this model must be used cautiously to avoid error propagation.

Policy Evaluation

Phase of evaluating policy performance using only offline data without interaction with the environment. This step is crucial for validating learning before deployment.
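One classical technique for this is importance sampling over logged trajectories; a sketch assuming per-step action probabilities under both the target policy (`pi_probs`) and the behavior policy that collected the data (`beta_probs`):

```python
import numpy as np

def is_estimate(rewards, pi_probs, beta_probs):
    """Trajectory-wise importance-sampling estimate of a target policy's
    value from logged data. All arrays have shape (n_traj, horizon); each
    trajectory's return is reweighted by the product of probability ratios."""
    weights = np.prod(pi_probs / beta_probs, axis=1)
    return (weights * rewards.sum(axis=1)).mean()
```

When the target and behavior policies coincide, every weight is 1 and the estimate reduces to the average logged return.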

Policy Improvement

Process of iteratively improving the policy using value estimates calculated from the offline data batch. The improvement must respect distribution constraints to maintain validity.
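One constraint-respecting improvement scheme is advantage-weighted regression, where batch actions are reweighted rather than replaced; a sketch with assumed temperature `lam` and clipping threshold:

```python
import numpy as np

def awr_weights(advantages, lam=1.0, max_weight=20.0):
    """Advantage-weighted regression weights exp(A / lam), clipped for
    stability: better-than-average batch actions get upweighted, so the
    improved policy never leaves the data support."""
    return np.minimum(np.exp(advantages / lam), max_weight)
```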

Bootstrapping Error

Error accumulated when a policy uses its own value estimates to improve itself, leading to divergence from the data support. Offline methods use specific techniques to control this bias.
