AI Terminology
A complete dictionary of Artificial Intelligence
Multi-Armed Bandit
Fundamental reinforcement learning problem where an agent must sequentially select among multiple options (arms) to maximize the sum of obtained rewards.
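A minimal simulation sketch, assuming NumPy is available; the arm means, horizon, and uniform-random policy are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]   # hypothetical arm means, unknown to the agent
horizon = 1000

total_reward = 0
for t in range(horizon):
    arm = rng.integers(len(true_means))               # naive uniform policy
    total_reward += rng.binomial(1, true_means[arm])  # stochastic binary reward
print(f"Total reward over {horizon} pulls: {total_reward}")
```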
Exploration-Exploitation Dilemma
Central conflict between exploring new options to discover their potential rewards and exploiting the options currently believed to be the most profitable.
Regret
Performance measure quantifying the cumulative gap between the rewards an optimal policy would have collected and those actually obtained, evaluating the effectiveness of the learning strategy.
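In one common formulation, with optimal mean reward $\mu^*$ and the reward $r_t$ collected at step $t$, the cumulative regret after $T$ steps is:

```latex
R_T = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
```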
UCB Algorithm
Optimistic strategy that selects the arm with the highest upper confidence bound, balancing exploration and exploitation through statistical confidence intervals.
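A minimal sketch of the classic UCB1 index, assuming rewards lie in [0, 1]; the exploration constant 2 follows the textbook formulation:

```python
import math

def ucb1_select(counts, means, t):
    """Return the arm maximizing empirical mean plus confidence radius (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:                 # pull each arm once before trusting the index
            return arm
    scores = [m + math.sqrt(2 * math.log(t) / n) for m, n in zip(means, counts)]
    return scores.index(max(scores))
```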
ε-greedy Algorithm
Simple policy choosing the optimal arm with probability (1-ε) and exploring randomly with probability ε, controlling the exploration-exploitation trade-off.
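A minimal sketch, where `means` holds the current reward estimates and ε = 0.1 is an illustrative choice:

```python
import random

def epsilon_greedy_select(means, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(means))   # explore a random arm
    return means.index(max(means))            # exploit the current best
```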
Stochastic Reward
Random return following an unknown probability distribution associated with each arm, modeling the inherent uncertainty in real environments.
Action Policy
Rule or algorithm determining the choice of arm at each step based on accumulated information, defining the agent's behavior.
Bernoulli Distribution
Binary reward model (success/failure) frequently used in bandit problems, characterized by a single success probability parameter.
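A minimal sketch of sampling such binary rewards; the success probability 0.3 is an arbitrary assumption:

```python
import random

def bernoulli_arm(p):
    """Binary reward: 1 with probability p, 0 otherwise."""
    return 1 if random.random() < p else 0

rewards = [bernoulli_arm(0.3) for _ in range(10)]   # ten pulls of one arm
```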
Bayesian Update
Iterative process of updating beliefs about reward distribution parameters by combining prior information and new observations.
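For Bernoulli rewards the Beta distribution is the conjugate prior, so the update reduces to incrementing two counters; the uniform Beta(1, 1) prior below is an assumption:

```python
class BetaPosterior:
    """Beta(alpha, beta) belief over a Bernoulli arm's success probability."""
    def __init__(self, alpha=1.0, beta=1.0):   # uniform prior
        self.alpha, self.beta = alpha, beta

    def update(self, reward):
        """Combine the current belief with one binary observation."""
        if reward:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

belief = BetaPosterior()
for r in [1, 0, 1, 1]:          # observed rewards
    belief.update(r)
print(belief.mean())            # posterior mean estimate, here 4/6
```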
Non-Stationary Bandit
Variant where reward distributions change over time, requiring adaptive strategies capable of tracking these variations.
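One common adaptive estimator is the exponential recency-weighted average, which replaces the sample mean with a constant step size so that old observations are gradually forgotten; the step size 0.1 is an assumption:

```python
def ewa_update(estimate, reward, step_size=0.1):
    """Constant step size weights recent rewards more, tracking drifting means."""
    return estimate + step_size * (reward - estimate)
```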
Optimism in the Face of Uncertainty
Algorithmic principle favoring arms with high uncertainty and high reward potential, ensuring efficient exploration.
Convergence Rate
Speed at which the algorithm approaches the optimal policy, measuring the asymptotic efficiency of the learning strategy.
Adversarial Bandit
Scenario where rewards are chosen by an adversary rather than following stochastic distributions, requiring robust strategies.
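A sketch of EXP3, a standard algorithm for this setting; `pull` is a hypothetical callback returning a reward in [0, 1], and gamma = 0.1 is an illustrative exploration rate:

```python
import math
import random

def exp3(num_arms, horizon, pull, gamma=0.1):
    """EXP3: exponential weights driven by importance-weighted reward estimates."""
    weights = [1.0] * num_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = pull(arm)                      # adversarial reward in [0, 1]
        estimate = reward / probs[arm]          # unbiased importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / num_arms)
```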
Optimistic Initialization
Technique initializing reward estimates to high values to encourage early exploration of all available arms.
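A minimal sketch; the initial value 1.0 is assumed to be at least the maximum possible reward, and a pseudo-count of one keeps the optimism from vanishing on the first real update:

```python
num_arms = 5
counts = [1] * num_arms      # one pseudo-observation per arm
means = [1.0] * num_arms     # inflated initial estimates

def update(arm, reward):
    """Incremental sample mean; untried arms keep looking attractive to a greedy policy."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```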
Linear Bandit
Generalization where the expected reward is a linear function of contextual features, allowing for more complex structures.
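A sketch of the score and update for a single arm under disjoint LinUCB, assuming NumPy; the identity-matrix ridge regularization and the exploration strength `alpha` are conventional assumptions:

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression plus a confidence bonus (disjoint LinUCB)."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)     # regularized Gram matrix of seen contexts
        self.b = np.zeros(dim)
        self.alpha = alpha       # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                    # ridge estimate of the weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

At each step the agent scores every arm on its context vector and pulls the argmax; the bonus term shrinks as the Gram matrix grows, reflecting the optimism principle above.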
Variance Reduction
Technique aimed at decreasing the uncertainty of reward estimates to accelerate convergence to the optimal policy.