
AI Glossary

The complete dictionary of Artificial Intelligence

162 categories · 2,032 subcategories · 23,060 terms

Batch Constrained Q-learning (BCQ)

Offline reinforcement learning algorithm that constrains the learned policy to stay close to actions observed in the training dataset, avoiding extrapolation errors. BCQ uses a generative model (a variational autoencoder in the original formulation) to propose actions similar to those in the batch, then explores slight variations around them.
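The selection step can be sketched in a few lines. In this toy Python illustration (the `generator` and `q_value` stand-ins are hypothetical, not the paper's trained networks), Q is evaluated only on actions proposed by a generative model, never on arbitrary out-of-distribution actions:

```python
import random

def bcq_select_action(state, q_value, generator, n_candidates=10):
    """BCQ-style action selection: score only candidate actions sampled
    from a generative model fit to the batch, and pick the best one."""
    candidates = [generator(state) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: q_value(state, a))

# Hypothetical stand-ins: the generator proposes actions near 0.5
# (mimicking the batch), while Q would prefer an action near 0.6.
random.seed(0)
gen = lambda s: 0.5 + random.uniform(-0.1, 0.1)
q = lambda s, a: -(a - 0.6) ** 2
best = bcq_select_action(0.0, q, gen)
```

Because candidates come only from the generator, `best` stays inside the region the batch supports, even though Q alone would push further.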

Distribution Shift

Phenomenon where the distribution of state-actions visited by the learned policy significantly differs from the distribution of the offline dataset. This shift can lead to biased value estimates and degraded performance during deployment.

Offline Reinforcement Learning

Learning paradigm where the agent learns exclusively from a fixed set of previously collected data, without interaction with the environment. This approach is essential when real-time exploration is costly or dangerous.

Behavior Cloning

Supervised learning technique that directly imitates expert actions from demonstration data, without using reward signals. Although simple, this approach suffers from compounding errors at deployment: small mistakes drift the agent into states absent from the demonstrations, where its predictions degrade further.
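As a minimal sketch (assuming a toy 1-D task where the expert acts as a = 2s), behavior cloning reduces to ordinary supervised regression on the demonstrations, with no reward term anywhere:

```python
# Behavior cloning on a toy 1-D task: fit a linear policy a = w*s by
# minimizing squared imitation error against expert demonstrations.
demos = [(s / 10.0, 2.0 * s / 10.0) for s in range(10)]  # (state, expert action)

w = 0.0   # policy parameter
lr = 0.5  # learning rate
for _ in range(200):
    # Gradient of the mean squared imitation loss; no rewards are used.
    grad = sum(2 * (w * s - a) * s for s, a in demos) / len(demos)
    w -= lr * grad
```

After training, `w` recovers the expert's coefficient of 2.0 on the demonstrated states; nothing in the loss, however, controls what the policy does off those states.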

Implicit Q-learning

Method that learns the Q function implicitly, avoiding direct evaluation of out-of-distribution actions. IQL casts value learning as an expectile regression problem, estimating an upper expectile of the value distribution to better handle uncertainty in offline data.
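The asymmetric loss behind expectile regression can be sketched with plain gradient descent (a toy illustration, not the paper's implementation): with tau > 0.5, errors where the sample exceeds the estimate are weighted more heavily, so the estimate settles above the mean.

```python
def expectile_loss_grad(v, samples, tau):
    """Gradient of the asymmetric (expectile) squared loss: errors where
    sample > v get weight tau, the rest get weight 1 - tau."""
    g = 0.0
    for x in samples:
        weight = tau if x > v else 1.0 - tau
        g += weight * 2 * (v - x)
    return g / len(samples)

samples = [0.0, 1.0, 2.0, 10.0]  # toy "return" samples; mean is 3.25
v = 0.0
for _ in range(2000):
    v -= 0.05 * expectile_loss_grad(v, samples, tau=0.9)
```

For these samples the tau = 0.9 expectile works out to 7.75, well above the mean of 3.25: the estimate tracks the better outcomes in the data without ever querying actions outside it.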

Out-of-Distribution Actions

Actions generated by the learned policy that were rarely or never observed in the training dataset. These actions pose a major risk in offline RL because their values cannot be reliably estimated.

Policy Constraint

Mechanism that limits the learned policy to produce actions similar to those present in the offline data batch. This constraint can be implemented via penalties, divergences, or conditional generative models.
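One common penalty-based implementation (in the style of TD3+BC, named here as one example of this mechanism, not prescribed by the entry) has the actor maximize Q minus a squared distance to the dataset action. A toy 1-D sketch with a hypothetical quadratic Q:

```python
def constrained_objective(q_value, action, data_action, alpha=2.5):
    """Penalty-style policy constraint: reward high Q-values but
    penalize deviation from the action seen in the batch."""
    return q_value(action) - alpha * (action - data_action) ** 2

q = lambda a: -(a - 1.0) ** 2   # Q alone would prefer a = 1.0
data_action = 0.0               # the batch only contains a = 0.0

# Grid search over the action range stands in for the actor update.
actions = [i / 100.0 for i in range(101)]
best = max(actions, key=lambda a: constrained_objective(q, a, data_action))
```

The optimum lands between the greedy action (1.0) and the dataset action (0.0), with `alpha` trading off return against staying on the data support.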

Perturbation Model

Component of BCQ that generates variations around behavior actions to locally explore the action space. This model adds controlled noise to observed actions while ensuring their feasibility.
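The clipping logic can be sketched directly (the parameter name `phi` and the ranges here are illustrative assumptions): the correction is bounded to a small interval before being added, and the result is clipped back into the valid action range.

```python
def perturb(action, xi, phi=0.05, low=-1.0, high=1.0):
    """BCQ-style perturbation: add a bounded correction xi (clipped to
    [-phi, phi]) to a generated action, keeping the result feasible."""
    xi = max(-phi, min(phi, xi))            # bound the correction
    return max(low, min(high, action + xi)) # keep the action in range
```

Because `phi` is small, the perturbed action can only drift slightly from what the generator (and hence the batch) supports.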

Value Function Estimation

Process of estimating Q-values from offline data while accounting for potential bias due to lack of exploration. Modern methods use conservative underestimation techniques to avoid over-optimization.

Batch RL

Reinforcement learning framework where the agent has a fixed batch of transitions and must learn an optimal policy without additional interactions. This context imposes specific constraints on algorithms to prevent divergence.

Safety Constraint

Restriction imposed on offline policies to ensure that generated actions remain in safe regions of the state-action space. These constraints are crucial in applications such as robotics or medicine.

Action Repetition

Strategy used in offline RL to improve stability by repeating actions similar to those observed in the data. This technique reduces the risk of generating completely new and potentially dangerous actions.

Uncertainty Estimation

Quantification of uncertainty associated with value estimates of actions not observed in the batch. Accurate uncertainty estimation allows penalizing out-of-distribution actions and improves robustness.
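One widely used recipe (ensemble disagreement, named here as an example rather than the only option) trains several Q estimators and penalizes the mean estimate by their standard deviation, so actions the data covers poorly receive lower values. A minimal sketch with hypothetical ensemble outputs:

```python
import statistics

def pessimistic_value(q_estimates, beta=1.0):
    """Ensemble-based pessimism: mean Q minus beta times the ensemble's
    standard deviation, down-weighting high-disagreement actions."""
    return statistics.mean(q_estimates) - beta * statistics.stdev(q_estimates)

in_dist = [1.0, 1.1, 0.9]   # ensemble agrees: action well covered by data
ood = [1.0, 3.0, -1.0]      # ensemble disagrees: likely out-of-distribution
```

Both action candidates have the same mean estimate (1.0), but the pessimistic value prefers the in-distribution one, which is exactly the behavior offline RL needs.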

Model-Based RL

Approach that learns a model of the environment dynamics from offline data to generate synthetic experiences. In an offline context, this model must be used cautiously to avoid error propagation.

Policy Evaluation

Phase of evaluating policy performance using only offline data without interaction with the environment. This step is crucial for validating learning before deployment.

Policy Improvement

Process of iteratively improving the policy using value estimates calculated from the offline data batch. The improvement must respect distribution constraints to maintain validity.

Bootstrapping Error

Error accumulated when a policy uses its own value estimates to improve itself, leading to divergence from the data support. Offline methods use specific techniques to control this bias.
