AI Glossary
The complete dictionary of artificial intelligence
Contextual Bandit
Reinforcement learning algorithm that, at each round, selects an action based on the observed context in order to maximize cumulative reward.
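As a sketch of the interaction loop, here is a minimal Python illustration with a made-up linear-reward environment (the environment, its weights, and the `pull` helper are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_features, n_rounds = 3, 5, 1000

# Hypothetical environment: each arm has unknown linear weights.
true_weights = rng.normal(size=(n_arms, n_features))

def pull(arm, context):
    """Reward = linear signal plus noise (assumed for illustration)."""
    return true_weights[arm] @ context + rng.normal(scale=0.1)

total_reward = 0.0
for t in range(n_rounds):
    context = rng.normal(size=n_features)   # 1. observe the context
    arm = rng.integers(n_arms)              # 2. choose an arm (random policy here)
    total_reward += pull(arm, context)      # 3. receive reward; a learner would update here
```

A real algorithm replaces the random choice in step 2 with a policy that learns from the observed (context, arm, reward) triples.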
Exploration vs Exploitation
Fundamental dilemma in which the algorithm must balance trying new options (exploration) against choosing options already known to perform well (exploitation).
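One common way to strike this balance is the epsilon-greedy rule: explore at random with a small probability, otherwise exploit the current best estimate. A minimal sketch (the epsilon value and arm count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon, n_arms = 0.1, 3
counts = np.zeros(n_arms)        # pulls per arm
values = np.zeros(n_arms)        # running mean reward per arm

def select_arm():
    if rng.random() < epsilon:           # explore with probability epsilon
        return int(rng.integers(n_arms))
    return int(np.argmax(values))        # otherwise exploit the best estimate

def update(arm, reward):
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```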
Upper Confidence Bound (UCB)
Strategy that selects arms based on an upper confidence bound on their expected reward, favoring the exploration of uncertain actions.
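For the classic UCB1 variant (Auer et al., 2002), the score for arm a after t rounds is

```latex
\mathrm{UCB}_t(a) = \hat{\mu}_a + \sqrt{\frac{2 \ln t}{n_a}}
```

where \hat{\mu}_a is the empirical mean reward of arm a and n_a the number of times it has been pulled; the bonus term shrinks as an arm is sampled more often, so uncertain arms keep getting tried.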
Thompson Sampling
Bayesian algorithm that samples reward parameters from their posterior distribution and acts greedily on the samples, so each arm is chosen in proportion to its probability of being optimal.
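For Bernoulli rewards with a Beta prior, the algorithm reduces to a few lines; a minimal sketch (the prior parameters and arm count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_arms = 3
alpha = np.ones(n_arms)   # Beta posterior: 1 + successes per arm
beta = np.ones(n_arms)    # Beta posterior: 1 + failures per arm

def ts_select():
    # Sample a plausible mean reward for each arm from its posterior,
    # then act greedily on the samples.
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def ts_update(arm, reward):  # reward in {0, 1}
    alpha[arm] += reward
    beta[arm] += 1 - reward
```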
LinUCB
Extension of UCB that models the expected reward as a linear function of the context features, making it well suited to high-dimensional context spaces.
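A sketch of the disjoint-arms variant, in which each arm keeps its own ridge-regression statistics (an illustration under the usual LinUCB formulation, not a reference implementation):

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression with a UCB exploration bonus."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)        # accumulates X^T X + I
        self.b = np.zeros(d)      # accumulates X^T y
        self.alpha = alpha        # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                          # ridge estimate
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

At each round the learner computes `ucb(x)` for every arm and plays the argmax. Recomputing the inverse on every call is the simple choice; production code would typically maintain A_inv incrementally (e.g. via the Sherman-Morrison formula).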
Context Features
Descriptive variables that characterize the current state of the environment and influence the optimal choice of action in contextual bandits.
Regret Minimization
Objective of minimizing the gap between the cumulative reward obtained and that of the optimal policy; the standard measure of a bandit algorithm's performance.
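Formally, under the usual definition, the cumulative (pseudo-)regret after T rounds is

```latex
R_T = \sum_{t=1}^{T} \left( \mu(x_t, a_t^{*}) - \mu(x_t, a_t) \right),
\qquad
a_t^{*} = \arg\max_{a} \, \mu(x_t, a),
```

where \mu(x, a) = \mathbb{E}[r \mid x, a]. An algorithm whose regret grows sublinearly in T therefore plays the optimal action almost all of the time in the long run.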
Multi-armed Bandits
Fundamental problem where an agent must select among several options (arms) with unknown reward distributions to maximize gain.
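A self-contained simulation of a three-armed Bernoulli bandit using the UCB1 rule shown above (the arm means are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
true_means = np.array([0.2, 0.5, 0.7])    # unknown to the learner
n_arms = len(true_means)
counts = np.zeros(n_arms)
values = np.zeros(n_arms)

for t in range(1, 2001):
    if t <= n_arms:                        # pull every arm once first
        arm = t - 1
    else:                                  # then apply the UCB1 score
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(values + bonus))
    reward = rng.binomial(1, true_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(counts)   # pulls should concentrate on the best arm (mean 0.7)
```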
Reward Function
Mathematical function that quantifies the immediate reward obtained after taking an action in a given context, guiding the algorithm's learning.
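A common modelling assumption (not the only one) writes the observed reward as a mean term plus zero-mean noise:

```latex
r_t = f(x_t, a_t) + \varepsilon_t, \qquad \mathbb{E}[\varepsilon_t] = 0
```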
Arm Selection
Process of choosing the optimal action among available options based on current reward estimates and the observed context.
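With value estimates in hand, the purely greedy rule is

```latex
a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}(x_t, a)
```

and in practice it is combined with an exploration mechanism such as epsilon-greedy, UCB, or Thompson Sampling.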
Expected Reward
Anticipated average value of the reward for a given action in a specific context, calculated from historical observations.
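For a finite set of contexts, the straightforward estimator is the empirical mean of the rewards observed for that context-action pair:

```latex
\hat{\mu}(x, a) = \frac{1}{n_{x,a}} \sum_{t \,:\, x_t = x,\; a_t = a} r_t
```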
Action-Value Function
Function Q(x,a) that estimates the expected reward of taking action 'a' in context 'x', fundamental for policy evaluation.
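In the bandit setting the definition reduces to the expected immediate reward:

```latex
Q(x, a) = \mathbb{E}\left[ r_t \mid x_t = x,\; a_t = a \right]
```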
Online Learning
Learning paradigm where the model continuously adjusts as new data arrives, without requiring full retraining from scratch.
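A minimal illustration with a linear model refined one example at a time by stochastic gradient descent (the dimensions, step size, and synthetic data stream are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
true_w = rng.normal(size=5)   # ground truth, unknown to the learner
w = np.zeros(5)
lr = 0.05

for _ in range(5000):
    x = rng.normal(size=5)                  # a new observation arrives...
    y = true_w @ x + rng.normal(scale=0.1)
    error = w @ x - y
    w -= lr * error * x                     # ...and the model updates immediately

# w now approximates true_w without any batch retraining pass.
```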
Stochastic Contextual Bandits
Variant in which, for each context-action pair, rewards are drawn independently and identically from a fixed distribution.
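In symbols, the assumption is that each observed reward is an independent draw from a distribution attached to the context-action pair:

```latex
r_t \sim \mathcal{D}_{x_t, a_t} \quad \text{independently across rounds } t
```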
Neural Bandits
Approach using neural networks to approximate the value function or policy, capable of capturing complex non-linear relationships.
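A toy sketch, assuming PyTorch is available: one small network per arm predicts the expected reward from the context, with epsilon-greedy exploration standing in for the more involved NeuralUCB-style confidence bonuses (the architecture and hyperparameters are arbitrary):

```python
import torch
import torch.nn as nn

n_arms, d, epsilon = 3, 5, 0.1

# One small reward-prediction network per arm.
nets = [nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
        for _ in range(n_arms)]
opts = [torch.optim.Adam(net.parameters(), lr=1e-3) for net in nets]

def select(context: torch.Tensor) -> int:
    if torch.rand(1).item() < epsilon:                 # explore at random
        return int(torch.randint(n_arms, (1,)).item())
    with torch.no_grad():                              # exploit the predictions
        preds = torch.stack([net(context) for net in nets])
    return int(preds.argmax().item())

def update(arm: int, context: torch.Tensor, reward: float) -> None:
    pred = nets[arm](context)
    loss = (pred - reward).pow(2).mean()               # squared-error loss
    opts[arm].zero_grad()
    loss.backward()
    opts[arm].step()
```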