AI Terminology
A complete dictionary of Artificial Intelligence
Multi-Armed Bandit
Fundamental reinforcement learning problem where an agent must sequentially select among multiple options (arms) to maximize the sum of obtained rewards.
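A minimal simulation sketch, assuming NumPy is available; the arm means, horizon, and uniform-random policy are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]   # hypothetical arm means, unknown to the agent
horizon = 1000

total_reward = 0
for t in range(horizon):
    arm = rng.integers(len(true_means))               # naive uniform policy
    total_reward += rng.binomial(1, true_means[arm])  # stochastic binary reward
print(f"Total reward over {horizon} pulls: {total_reward}")
```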
Exploration-Exploitation Dilemma
Central conflict between exploring new options to discover their potential rewards and exploiting the options currently believed to be the most profitable.
Regret
Performance measure quantifying the cumulative gap between the rewards an optimal policy would have collected and those actually obtained, evaluating the effectiveness of the learning strategy.
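In one common formulation, with optimal mean reward $\mu^*$ and the reward $r_t$ collected at step $t$, the cumulative regret after $T$ steps is:

```latex
R_T = T\,\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
```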
UCB Algorithm
Optimistic strategy that selects the arm with the highest upper confidence bound, balancing exploration and exploitation through statistical confidence intervals.
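A minimal sketch of the classic UCB1 index, assuming rewards lie in [0, 1]; the exploration constant 2 follows the textbook formulation:

```python
import math

def ucb1_select(counts, means, t):
    """Return the arm maximizing empirical mean plus confidence radius (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:                 # pull each arm once before trusting the index
            return arm
    scores = [m + math.sqrt(2 * math.log(t) / n) for m, n in zip(means, counts)]
    return scores.index(max(scores))
```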
ε-greedy Algorithm
Simple policy choosing the optimal arm with probability (1-ε) and exploring randomly with probability ε, controlling the exploration-exploitation trade-off.
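A minimal sketch, where `means` holds the current reward estimates and ε = 0.1 is an illustrative choice:

```python
import random

def epsilon_greedy_select(means, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(means))   # explore a random arm
    return means.index(max(means))            # exploit the current best
```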
Stochastic Reward
Random return following an unknown probability distribution associated with each arm, modeling the inherent uncertainty in real environments.
Action Policy
Rule or algorithm determining the choice of arm at each step based on accumulated information, defining the agent's behavior.
Bernoulli Distribution
Binary reward model (success/failure) frequently used in bandit problems, characterized by a single success probability parameter.
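A minimal sketch of sampling such binary rewards; the success probability 0.3 is an arbitrary assumption:

```python
import random

def bernoulli_arm(p):
    """Binary reward: 1 with probability p, 0 otherwise."""
    return 1 if random.random() < p else 0

rewards = [bernoulli_arm(0.3) for _ in range(10)]   # ten pulls of one arm
```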
Bayesian Update
Iterative process of updating beliefs about reward distribution parameters by combining prior information and new observations.
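For Bernoulli rewards the Beta distribution is the conjugate prior, so the update reduces to incrementing two counters; the uniform Beta(1, 1) prior below is an assumption:

```python
class BetaPosterior:
    """Beta(alpha, beta) belief over a Bernoulli arm's success probability."""
    def __init__(self, alpha=1.0, beta=1.0):   # uniform prior
        self.alpha, self.beta = alpha, beta

    def update(self, reward):
        """Combine the current belief with one binary observation."""
        if reward:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

belief = BetaPosterior()
for r in [1, 0, 1, 1]:          # observed rewards
    belief.update(r)
print(belief.mean())            # posterior mean estimate, here 4/6
```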
Non-Stationary Bandit
Variant where reward distributions change over time, requiring adaptive strategies capable of tracking these variations.
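One common adaptive estimator is the exponential recency-weighted average, which replaces the sample mean with a constant step size so that old observations are gradually forgotten; the step size 0.1 is an assumption:

```python
def ewa_update(estimate, reward, step_size=0.1):
    """Constant step size weights recent rewards more, tracking drifting means."""
    return estimate + step_size * (reward - estimate)
```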
Optimism in the Face of Uncertainty
Algorithmic principle favoring arms with high uncertainty and high reward potential, ensuring efficient exploration.
Convergence Rate
Speed at which the algorithm approaches the optimal policy, measuring the asymptotic efficiency of the learning strategy.
Adversarial Bandit
Scenario where rewards are chosen by an adversary rather than following stochastic distributions, requiring robust strategies.
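A sketch of EXP3, a standard algorithm for this setting; `pull` is a hypothetical callback returning a reward in [0, 1], and gamma = 0.1 is an illustrative exploration rate:

```python
import math
import random

def exp3(num_arms, horizon, pull, gamma=0.1):
    """EXP3: exponential weights driven by importance-weighted reward estimates."""
    weights = [1.0] * num_arms
    for _ in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = pull(arm)                      # adversarial reward in [0, 1]
        estimate = reward / probs[arm]          # unbiased importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / num_arms)
```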
Optimistic Initialization
Technique initializing reward estimates to high values to encourage early exploration of all available arms.
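A minimal sketch; the initial value 1.0 is assumed to be at least the maximum possible reward, and a pseudo-count of one keeps the optimism from vanishing on the first real update:

```python
num_arms = 5
counts = [1] * num_arms      # one pseudo-observation per arm
means = [1.0] * num_arms     # inflated initial estimates

def update(arm, reward):
    """Incremental sample mean; untried arms keep looking attractive to a greedy policy."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```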
Linear Bandit
Generalization where the expected reward is a linear function of contextual features, allowing for more complex structures.
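A sketch of the score and update for a single arm under disjoint LinUCB, assuming NumPy; the identity-matrix ridge regularization and the exploration strength `alpha` are conventional assumptions:

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression plus a confidence bonus (disjoint LinUCB)."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)     # regularized Gram matrix of seen contexts
        self.b = np.zeros(dim)
        self.alpha = alpha       # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                    # ridge estimate of the weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

At each step the agent scores every arm on its context vector and pulls the argmax; the bonus term shrinks as the Gram matrix grows, reflecting the optimism principle above.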
Variance Reduction
Technique aimed at decreasing the uncertainty of reward estimates to accelerate convergence to the optimal policy.