AI Glossary

The Complete Dictionary of Artificial Intelligence

162 categories
2,032 subcategories
23,060 terms

Conservative Q-Learning (CQL)

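Offline reinforcement learning method that learns a deliberately pessimistic Q-function: it pushes down Q-values for actions outside the dataset and pushes up Q-values for actions in it, which keeps the learned policy close to the behavioral data distribution and prevents divergence.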


Offline data distribution

Distribution of states and actions in the fixed, pre-collected dataset gathered under a behavioral policy; this dataset is the sole source of information for offline RL training.


Conservative penalty

Regularization term added to the loss function to penalize high Q-values for state-action pairs absent from training data, thus preventing overestimation.


Q-value overestimation

Problem inherent to offline RL in which Q-values for unobserved actions are inflated by extrapolation and cannot be corrected by collecting new data, leading to suboptimal and unstable policies.
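One common source of this inflation is the max over actions in the Q-learning target: taking the maximum of noisy estimates is biased upward even when each estimate is unbiased. The sketch below illustrates this under simple assumptions (all true Q-values are 0, the noise is Gaussian); the function name and parameters are hypothetical.

```python
import random


def mean_max_of_noisy_estimates(num_actions, noise_std, trials, seed=0):
    """Average of max_a Q_hat(s, a) when every true Q(s, a) equals 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is unbiased: the true value (0) plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


bias = mean_max_of_noisy_estimates(num_actions=10, noise_std=1.0, trials=5000)
print(f"true max = 0.0, average estimated max = {bias:.2f}")
```

The true maximum is 0, yet the averaged estimate is clearly positive, and the gap grows with the number of actions.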


Conservative policy

Action strategy that intentionally stays close to behaviors observed in the dataset, minimizing the risk of divergence due to extrapolation on unseen data.


Distribution correction

Mechanism in CQL that adjusts Q-value estimates to correct for the mismatch between the behavioral distribution and the target policy distribution.


Policy gap

Measure of divergence between the learned policy and the behavioral policy, crucial for ensuring stability in offline reinforcement learning.
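The entry does not name a specific measure. As one possible choice (an assumption here, not part of the definition), the gap in a single state can be computed as the KL divergence between the two action distributions. A minimal sketch for discrete actions, with a hypothetical function name:

```python
import math


def kl_divergence(learned, behavior):
    """KL(learned || behavior) for two discrete action distributions."""
    total = 0.0
    for p, q in zip(learned, behavior):
        if p == 0.0:
            continue  # by convention, 0 * log(0 / q) contributes 0
        if q == 0.0:
            # The learned policy uses an action the behavioral policy never takes.
            return math.inf
        total += p * math.log(p / q)
    return total


print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```

An infinite value flags the case that matters most in offline RL: the learned policy puts probability on an action with no support in the data.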


CQL loss function

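Objective function combining the standard Q-learning (Bellman error) loss with a conservative term that minimizes Q-values for out-of-distribution actions. For discrete actions the conservative term takes the form log Σ_a exp(Q(s, a)) − Q(s, a′), where the sum runs over all actions a in state s and a′ is the action recorded in the dataset.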
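The loss for a single transition can be sketched as follows for a discrete action space. The squared TD error, the weight `alpha`, and the function names are illustrative assumptions; real implementations average over mini-batches and compute the target with a separate target network.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_loss(q_values, action, td_target, alpha=1.0):
    """Squared TD error plus alpha times the conservative term.

    q_values  : Q(s, a) for every action a in state s
    action    : index of the dataset action a'
    td_target : r + gamma * max_a Q_target(s', a), treated as a constant
    """
    td_error = q_values[action] - td_target
    conservative = logsumexp(q_values) - q_values[action]
    return 0.5 * td_error ** 2 + alpha * conservative


# The dataset action is index 0; index 1 was never observed in this state.
print(cql_loss([1.0, 5.0], action=0, td_target=1.0))
print(cql_loss([1.0, -5.0], action=0, td_target=1.0))
```

The first call incurs a larger loss because the unobserved action carries a high Q-value; minimizing the loss pushes that value down.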


Importance Sampling Ratio

Ratio π(a|s) / π_β(a|s) that weights each transition by its probability under the target policy π relative to the behavioral policy π_β; used to correct off-policy bias.
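A minimal sketch of how the ratio corrects bias, assuming a one-step (bandit) setting with known action probabilities for both policies. The function names and the sample layout are hypothetical.

```python
def importance_ratio(target_prob, behavior_prob):
    """Probability of the action under the target policy over the behavioral one."""
    if behavior_prob <= 0.0:
        raise ValueError("behavioral policy must give the action nonzero probability")
    return target_prob / behavior_prob


def is_estimate(samples, target_policy, behavior_policy):
    """Ordinary importance-sampling estimate of the target policy's expected reward.

    samples : list of (action, reward) pairs drawn under the behavioral policy
    """
    weighted = [
        importance_ratio(target_policy[a], behavior_policy[a]) * r
        for a, r in samples
    ]
    return sum(weighted) / len(weighted)


# Behavioral policy is uniform over two actions; the target always picks action 1.
samples = [(0, 0.0), (1, 1.0)]
print(is_estimate(samples, target_policy=[0.0, 1.0], behavior_policy=[0.5, 0.5]))
```

The unweighted average reward in `samples` is 0.5, while the weighted estimate recovers the reward the target policy would actually obtain.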


Distributional Shift

Mismatch between the distribution of the available data and the distribution needed to accurately evaluate the learned policy; the main challenge of offline RL.


Learning Stabilization

Objective of CQL: promoting stable convergence by avoiding the oscillations and divergence caused by extrapolating from limited data.


Conservative Safeguard

Safety mechanism built into CQL limiting Q-value optimization for state-action pairs that are infrequent or absent from the training dataset.


Conservative Q-update

Iterative process modifying Q-values by penalizing overestimations while preserving reliable estimates based on observed data.
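For a tabular Q-function, one such update can be sketched as a gradient step on the squared TD error plus a log-sum-exp penalty. The table layout (a dict mapping each state to a list of Q-values), the hyperparameters, and the function names are assumptions made for illustration.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def conservative_q_update(q, s, a, r, s_next, done,
                          alpha=1.0, gamma=0.99, lr=0.1):
    """One in-place step on 0.5 * TD^2 + alpha * (logsumexp_b Q(s, b) - Q(s, a))."""
    target = r if done else r + gamma * max(q[s_next])  # held fixed (semi-gradient)
    td_error = q[s][a] - target
    probs = softmax(q[s])  # gradient of logsumexp with respect to each Q(s, b)
    for b in range(len(q[s])):
        grad = alpha * probs[b]        # penalty pushes every action's value down
        if b == a:
            grad += td_error - alpha   # TD term, plus the push-up on the data action
        q[s][b] -= lr * grad
    return q


q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
conservative_q_update(q, s=0, a=0, r=1.0, s_next=1, done=True)
print(q[0])
```

After the step, the value of the observed action rises toward its target while the value of the unobserved action falls below its initial estimate.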


Extrapolation Error

Inaccuracy introduced when a model makes predictions for states or actions not represented in the training dataset; a major problem in offline RL.


Conservative Critic

CQL component evaluating actions with a conservative bias, assigning lower scores to actions potentially overestimated due to lack of data.


Constrained Action Space

Subset of possible actions limited to those observed in the dataset, reducing the risk of policies exploiting extrapolation artifacts.
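In a small discrete setting, the restriction can be sketched by recording which actions appear in the dataset for each state and taking the greedy action only among those. The tuple layout (state, action, reward, next state) and the function names are assumptions.

```python
def build_allowed_actions(dataset):
    """Map each state to the set of actions observed for it in the dataset."""
    allowed = {}
    for state, action, *_ in dataset:
        allowed.setdefault(state, set()).add(action)
    return allowed


def constrained_greedy_action(q_values, state, allowed):
    """Greedy action restricted to actions seen in the data for this state."""
    candidates = allowed.get(state)
    if not candidates:
        raise KeyError(f"state {state!r} does not appear in the dataset")
    return max(candidates, key=lambda a: q_values[state][a])


dataset = [(0, 0, 1.0, 1), (0, 1, 0.0, 1)]
q_values = {0: [0.2, 0.5, 9.0]}  # action 2 was never observed; its value is an artifact
allowed = build_allowed_actions(dataset)
print(constrained_greedy_action(q_values, 0, allowed))
```

An unconstrained argmax would select action 2 on the strength of an extrapolated value; the constrained version ignores it.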


Behavior Sampling

Process of collecting transitions (state, action, reward, next state) according to a fixed behavioral policy, constituting the offline dataset.
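As an illustration, the sketch below collects such transitions from a toy environment. The five-state chain, its reward, and the "move right with probability 0.7" behavioral policy are all hypothetical choices made for the example.

```python
import random

N_STATES = 5  # hypothetical chain: states 0..4, with the goal at state 4


def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1.0 on reaching the goal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state


def behavior_policy(state, rng, p_right=0.7):
    """Fixed stochastic policy; it ignores the state and never changes."""
    return 1 if rng.random() < p_right else 0


def collect_dataset(num_transitions, seed=0):
    rng = random.Random(seed)
    dataset, state = [], 0
    for _ in range(num_transitions):
        action = behavior_policy(state, rng)
        reward, next_state = step(state, action)
        dataset.append((state, action, reward, next_state))
        # Start a new episode once the goal is reached.
        state = 0 if next_state == N_STATES - 1 else next_state
    return dataset


dataset = collect_dataset(100)
print(dataset[:3])
```

Once collected, the list is frozen: offline training reads from it but never calls `step` again.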


Policy Divergence

Phenomenon where the learned policy dangerously deviates from the data distribution, leading to degraded performance or total learning collapse.
