
AI Glossary

A complete dictionary of Artificial Intelligence

162
categories
2,032
subcategories
23,060
terms

Policy Gradient

Direct optimization method that adjusts policy parameters by following the gradient of the expected return, enabling learning of stochastic policies without requiring an environment model.


REINFORCE Algorithm

Basic policy gradient algorithm using a Monte Carlo estimate of the gradient to update policy parameters based on fully observed episodes.
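As a sketch, assuming a softmax policy over action logits (the function names here are illustrative, not from any particular library), one REINFORCE-style gradient term ∇θ log π(a) · G can be computed as:

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution over actions."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, action, ret):
    """One Monte Carlo policy gradient term for a softmax policy:
    grad of log pi(action) w.r.t. each logit, scaled by the return G.
    d/d logit_i of log pi(a) = (1 if i == a else 0) - pi(i)."""
    probs = softmax(logits)
    return [((1.0 if i == action else 0.0) - p) * ret
            for i, p in enumerate(probs)]

# With uniform logits and return 2.0 for action 0, the update pushes
# the taken action's logit up and the other action's logit down.
g = reinforce_gradient([0.0, 0.0], action=0, ret=2.0)
```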


Actor-Critic Methods

Hybrid approach combining an actor that learns the policy and a critic that estimates the value function, reducing the variance of policy gradient estimates.


Advantage Function

Measure of how much better an action is than the average action in a given state, computed as the difference A(s, a) = Q(s, a) − V(s); using the advantage instead of the raw return reduces gradient variance.
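For a discrete action space, the advantage can be sketched directly from the definition (plain-Python lists, illustrative names):

```python
def advantage(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s), where V(s) = sum_a pi(a|s) * Q(s, a)."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]

# Under a uniform policy over two actions with Q = [1.0, 3.0],
# V(s) = 2.0, so the advantages are [-1.0, 1.0].
adv = advantage([1.0, 3.0], [0.5, 0.5])
```

Note that the policy-weighted sum of advantages is zero by construction, which is what centers the gradient estimate.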


Proximal Policy Optimization (PPO)

Algorithm optimizing the policy by constraining updates to stay close to the previous policy, using a clipped objective function to ensure learning stability.
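The clipped surrogate can be sketched per sample as follows; `eps` is the clip range (0.2 is a common default, but an assumption here, as is the function name):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    Taking the min makes the objective pessimistic, so the policy
    gains nothing by pushing the probability ratio r far outside
    [1 - eps, 1 + eps]."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A positive advantage's contribution is capped once the ratio
# exceeds 1 + eps; a negative advantage is penalized at least as
# much as the clipped ratio implies.
gain = ppo_clip_objective(1.5, 1.0)    # capped at 1.2 * 1.0
loss = ppo_clip_objective(0.5, -1.0)   # floored at 0.8 * -1.0
```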


Trust Region Policy Optimization (TRPO)

Method ensuring monotonic performance improvements by optimizing the policy within a trust region defined by the KL divergence between successive policies.


Natural Policy Gradient

Variant of policy gradient using the Fisher metric to perform parameterization-invariant updates, ensuring more stable and efficient convergence.


Policy Network

Parameterized neural network that represents the policy π(a|s; θ), generating a probability distribution over actions conditioned on the current state.
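A minimal sketch of such a network with a single linear layer, assuming plain-Python lists for the weights θ and state features (real implementations would use a deep-learning framework):

```python
import math

def policy(theta, state):
    """Linear-softmax policy pi(a | s; theta): theta holds one weight
    vector per action; each is scored against the state features, and
    the scores are normalized into a probability distribution."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in theta]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Symmetric weights and a symmetric state give a uniform distribution.
probs = policy([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```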


Monte Carlo Policy Gradient

A gradient estimation technique that uses complete trajectories to calculate returns, providing an unbiased but high-variance estimate.


Baseline Function

A function subtracted from the return to reduce the variance of the gradient estimate without introducing bias, typically the state-value function.


Importance Sampling

A technique that allows using data collected with an old policy to update a new policy, by weighting samples according to the probability ratio of the policies.
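As a sketch of the reweighting (illustrative name; each sample's probability under both policies is assumed known):

```python
def off_policy_return_estimate(returns, new_probs, old_probs):
    """Estimate the expected return under a new policy from samples
    drawn with an old one: each sample is reweighted by the ratio
    pi_new(a|s) / pi_old(a|s) of its action's probabilities."""
    weighted = [g * (pn / po)
                for g, pn, po in zip(returns, new_probs, old_probs)]
    return sum(weighted) / len(weighted)

# If the new policy makes the sampled actions twice as likely,
# their returns count twice as much in the estimate.
est = off_policy_return_estimate([1.0, 3.0], [0.5, 0.5], [0.25, 0.25])
```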


Entropy Regularization

Adding an entropy term to the objective function to encourage exploration by penalizing overly deterministic policies, improving the robustness of learning.
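A minimal sketch of the entropy bonus for a discrete policy (the coefficient `beta` and the function names are illustrative assumptions):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def regularized_objective(expected_return, probs, beta=0.01):
    """Objective with an entropy bonus, J + beta * H(pi): at equal
    expected return, a more exploratory (higher-entropy) policy
    scores higher than a deterministic one."""
    return expected_return + beta * entropy(probs)
```

A fully deterministic policy has zero entropy, so it receives no bonus and is penalized relative to any stochastic policy with the same return.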


Deterministic Policy Gradient

An extension of policy gradient to continuous action spaces where the policy is deterministic, particularly effective in high-dimensional environments.


Stochastic Policy

A policy represented by a probability distribution π(a|s) over actions, allowing intrinsic exploration; stochastic policies are essential for standard policy gradient methods.


KL Divergence Constraint

A constraint that limits the Kullback-Leibler divergence between successive policies to ensure stable updates and avoid overly drastic changes in behavior.
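For discrete action distributions, the check can be sketched as follows (the threshold `max_kl` and the function names are illustrative assumptions):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete action distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def within_trust_region(old_probs, new_probs, max_kl=0.01):
    """Accept a candidate policy update only if it stays close to the
    old policy in KL; otherwise the step would be rejected or scaled
    back, preventing drastic behavior changes."""
    return kl_divergence(old_probs, new_probs) <= max_kl

# An identical policy has zero divergence; a large probability shift
# (0.5/0.5 -> 0.9/0.1) exceeds a tight trust region.
ok = within_trust_region([0.5, 0.5], [0.5, 0.5])
too_far = within_trust_region([0.5, 0.5], [0.9, 0.1])
```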


Generalized Advantage Estimation (GAE)

An advantage estimation method that trades off bias against variance through an exponentially weighted average of multi-step estimators, offering a tunable compromise for learning.
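The estimator can be sketched as a single backward pass over one trajectory (illustrative function name; `gamma` and `lam` defaults are common but assumed):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.
    `values` has one more entry than `rewards` (a bootstrap value for
    the final state). lam=0 recovers one-step TD errors (low variance,
    more bias); lam=1 recovers Monte Carlo advantages (unbiased,
    high variance)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```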


Policy Gradient Theorem

Fundamental theorem providing an analytical expression for the gradient of the expected return with respect to the policy parameters, forming the theoretical basis of policy gradient methods.
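In its standard form, the theorem states that the gradient of the expected return is an expectation of the score function weighted by the action value:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```

where $d^{\pi_\theta}$ is the discounted state-visitation distribution induced by the policy.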


Return-to-Go

Sum of discounted future rewards from a given time step, used to weight the policy's log-probability gradient in policy gradient algorithms.
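The quantity can be sketched as a single backward pass over an episode's rewards (illustrative function name):

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t = sum over k >= t of gamma^(k - t) * r_k, computed right to
    left in one pass so each step reuses the next step's total."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# With no discounting, each entry is simply the remaining reward sum.
g = returns_to_go([1.0, 1.0, 1.0], gamma=1.0)
```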
