
AI Glossary

A complete dictionary of Artificial Intelligence

162 categories, 2,032 subcategories, 23,060 terms

Clipping Function

PPO mechanism that limits the size of each policy update by clipping the probability ratio between the new and old policies to the interval [1 - ε, 1 + ε], which removes the incentive for overly drastic changes.
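
As a rough illustration of the clipping step, here is a minimal sketch in plain Python. The function names and the default `epsilon=0.2` are assumptions made for this example; they are not taken from the glossary or from any particular library.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_log_prob - old_log_prob)


def clip_ratio(ratio: float, epsilon: float = 0.2) -> float:
    """Clamp the ratio to the interval [1 - epsilon, 1 + epsilon]."""
    # epsilon=0.2 is an assumed default for illustration only.
    return max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
```

With the assumed default, `clip_ratio(1.5)` returns 1.2 and `clip_ratio(0.5)` returns 0.8, while a ratio already inside the interval, such as 1.05, is returned unchanged.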

Trust Region

Region of policy space within which an update is considered safe, typically defined by a constraint on the KL divergence between successive policies.
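
To make the constraint concrete, the sketch below checks whether a new discrete action distribution stays within a KL budget of the old one. The helper names and the threshold `max_kl=0.01` are illustrative assumptions, not values given by the glossary.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def within_trust_region(old_probs, new_probs, max_kl=0.01):
    """True if the new policy stays within the assumed KL budget of the old one."""
    return kl_divergence(old_probs, new_probs) <= max_kl
```

A small shift such as `[0.5, 0.5]` to `[0.52, 0.48]` passes the check under this threshold, while a jump to `[0.9, 0.1]` does not.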

Surrogate Objective

Modified objective function used in PPO that approximates the original objective while incorporating stability constraints like clipping to prevent performance degradation.
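
One common form of this objective takes, for each sample, the minimum of the unclipped and clipped terms. The sketch below is written under that assumption; the function name and the default `epsilon` are hypothetical.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Taking the minimum keeps the more pessimistic of the two estimates.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)
```

For a single sample with ratio 1.5 and advantage 1.0, the value is capped at 1.2. With advantage -1.0 the unclipped term -1.5 is kept, so the objective never rewards moving further in a harmful direction.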

KL Divergence Penalty

Penalty term, proportional to the KL divergence between successive policies, that is subtracted from PPO's objective function; its coefficient can be adjusted adaptively to keep updates within an acceptable region.
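
A minimal sketch of a KL-penalized objective, assuming per-sample probability ratios, advantages, and KL values have already been computed. The function name and the coefficient name `beta` are assumptions for this example.

```python
def kl_penalized_objective(ratios, advantages, kl_values, beta=1.0):
    """Mean of r * A minus beta times the mean KL divergence."""
    policy_term = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
    kl_term = sum(kl_values) / len(kl_values)
    # A larger beta penalizes divergence from the old policy more strongly.
    return policy_term - beta * kl_term
```

Unlike clipping, this form leaves the ratio untouched and relies on `beta` alone to discourage large policy changes.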

Mini-batch Updates

PPO optimization process in which the collected data is split into small batches and a gradient step is taken on each one, improving computational efficiency and stability.
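
The splitting step can be sketched as a generator over shuffled sample indices. The name `iter_minibatches` and its parameters are hypothetical, and real implementations usually operate on tensors rather than index lists.

```python
import random


def iter_minibatches(num_samples, batch_size, rng=None):
    """Yield lists of shuffled sample indices, batch_size at a time."""
    rng = rng or random.Random()
    indices = list(range(num_samples))
    rng.shuffle(indices)
    for start in range(0, num_samples, batch_size):
        # The final batch may be smaller when batch_size does not divide evenly.
        yield indices[start:start + batch_size]
```

Each yielded list would then index into the stored states, actions, and advantages for one gradient step.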

Clip Range Parameter

Epsilon (ε) hyperparameter in PPO that sets the width of the clipping interval [1 - ε, 1 + ε] for the probability ratio; smaller values make policy updates more conservative.

Value Function Clipping

PPO implementation variant that also clips the change in the value function's predictions relative to the previous estimates, to stabilize learning and prevent large swings in value estimates.
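
The glossary does not specify a formula, so the sketch below follows one form found in some implementations: the new prediction is kept within `clip_range` of the old one, and the larger of the clipped and unclipped squared errors is used. All names and the default `clip_range` are assumptions.

```python
def clipped_value_loss(value, old_value, target_return, clip_range=0.2):
    """Squared-error value loss with the prediction change clipped."""
    # Limit how far the new prediction may move from the old one.
    delta = max(-clip_range, min(value - old_value, clip_range))
    clipped_value = old_value + delta
    unclipped_loss = (value - target_return) ** 2
    clipped_loss = (clipped_value - target_return) ** 2
    # Keep the larger (more pessimistic) of the two losses.
    return max(unclipped_loss, clipped_loss)
```

For example, if the prediction jumps from 1.0 to 2.0 with a target of 2.0, the unclipped error is zero but the clipped prediction of 1.2 yields a loss of about 0.64, so the large jump is not rewarded.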

Epoch Optimization

PPO process in which the same batch of experience is reused for several optimization passes (epochs), making better use of the collected data.

Normalized Advantage

Technique that rescales advantage estimates, typically to zero mean and unit standard deviation within each batch, to stabilize training by keeping the gradient scale consistent between updates.
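
A minimal sketch of batch-level advantage normalization, assuming the common zero-mean, unit-standard-deviation form. The function name and the small constant `eps`, added to avoid division by zero, are choices made for this example.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)  # population standard deviation
    return [(a - mean) / (std + eps) for a in advantages]
```

When all advantages in a batch are equal, the standard deviation is zero and the function returns zeros rather than raising an error.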

Experience Collection

PPO phase where the agent interacts with the environment following the current policy to collect transitions (state, action, reward) used for optimization.
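
The collection phase can be sketched as a simple loop. `CoinFlipEnv` is a hypothetical toy environment invented for this example, and its `reset`/`step` interface is an assumption rather than the API of any specific library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for `horizon` steps."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the step index

    def step(self, action):
        reward = 1.0 if action == self.rng.randint(0, 1) else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def collect_rollout(env, policy, num_steps):
    """Run the current policy and record (state, action, reward) transitions."""
    transitions = []
    state = env.reset()
    for _ in range(num_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward))
        # Start a new episode when the current one ends.
        state = env.reset() if done else next_state
    return transitions
```

Calling `collect_rollout(CoinFlipEnv(horizon=3), lambda s: 0, 7)` returns seven transitions spanning more than two episodes, which would then be passed to the optimization phase.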

Adaptive KL Penalty

PPO variant that dynamically adjusts the KL penalty strength based on the observed divergence between policies, ensuring controlled updates.
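
One widely used heuristic doubles the coefficient when the observed KL overshoots the target and halves it when the KL falls well below. The sketch below assumes that rule; the factors 1.5 and 2, like the function and parameter names, are illustrative choices not stated in the glossary.

```python
def update_kl_coefficient(beta, observed_kl, target_kl):
    """Return an adjusted penalty coefficient based on the observed KL."""
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0  # policy moved too far: penalize more
    if observed_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: penalize less
    return beta  # within the tolerated band: leave unchanged
```

The updated coefficient would be used for the next policy update, so the penalty strength tracks how far recent updates actually moved the policy.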
