Batch Constrained Q-learning (BCQ)
Policy Constraint
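A mechanism that restricts the learned policy to actions similar to those present in the offline data batch. This keeps the agent from exploiting overestimated Q-values for actions the batch never covers. The constraint can be implemented through penalties added to the training objective, through divergence measures between the learned policy and the behavior policy (for example, KL divergence), or through conditional generative models that only propose in-distribution actions.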
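As a rough illustration, the following Python sketch shows one way a policy constraint can work in a discrete-action, tabular setting. It loosely follows the thresholding idea used in discrete variants of BCQ: only actions whose estimated behavior probability is close enough to that of the most likely batch action are eligible, and the greedy choice is made among those. The count-based behavior estimate, the function names `estimate_behavior_policy` and `constrained_greedy_action`, the threshold `tau`, and the toy batch and Q-values are all hypothetical and chosen for illustration. A full BCQ implementation uses learned neural networks rather than counts and lookup tables.

```python
from collections import Counter, defaultdict


def estimate_behavior_policy(batch):
    """Hypothetical count-based estimate of the behavior policy pi_b(a|s).

    `batch` is an iterable of (state, action) pairs taken from the offline data.
    Returns {state: {action: probability}} using empirical frequencies.
    """
    counts = defaultdict(Counter)
    for state, action in batch:
        counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


def constrained_greedy_action(state, q_values, behavior, tau=0.3):
    """Greedy action restricted to actions the batch supports (illustrative sketch).

    An action is eligible only if pi_b(a|s) / max_a' pi_b(a'|s) > tau,
    so actions that are rare or absent in the batch are never chosen,
    however large their (possibly overestimated) Q-value is.
    """
    probs = behavior.get(state, {})
    if not probs:
        raise ValueError(f"state {state!r} never appears in the batch")
    max_prob = max(probs.values())
    allowed = [a for a, p in probs.items() if p / max_prob > tau]
    return max(allowed, key=lambda a: q_values[state][a])


# Toy data: "down" never occurs in the batch, yet has the largest Q-value.
batch = [("s0", "left")] * 6 + [("s0", "right")] * 3 + [("s0", "up")]
q_values = {"s0": {"left": 1.0, "right": 2.0, "up": 5.0, "down": 9.0}}

behavior = estimate_behavior_policy(batch)
print(constrained_greedy_action("s0", q_values, behavior))  # → right
print(max(q_values["s0"], key=q_values["s0"].get))          # unconstrained → down
```

With `tau = 0.3`, "up" (ratio about 0.17) is filtered out and "right" wins among the supported actions, while an unconstrained greedy policy would pick the unseen action "down". Setting `tau = 0` admits every action that appears in the batch, and values just below 1 collapse the choice onto the most frequent batch action, which amounts to behavior cloning.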