AI Glossary
The complete glossary of Artificial Intelligence
First-Visit Monte Carlo
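State-value estimation method that, in each episode, averages only the return following the first visit to each state. Because each episode contributes at most one return per state, the averaged returns are independent, which makes the estimate unbiased and guarantees convergence to the true state value as the number of episodes visiting the state grows.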
Every-Visit Monte Carlo
Algorithm that averages the returns following every visit to a state in an episode, rather than only the first visit. It uses more of the data in each episode; the estimate is biased for a finite number of episodes because returns from the same episode are correlated, but it converges to the same value as First-Visit MC.
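The two variants differ only in which visits contribute a return. The following is a minimal sketch, assuming each episode is a list of (state, reward) pairs where the reward is the one received after leaving that state; the function name and data layout are illustrative and not part of the glossary.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=1.0, first_visit=True):
    """Average sampled returns per state (First-Visit or Every-Visit MC)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Work backwards so that G_t = r_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # First-Visit MC ignores later visits in the same episode
            seen.add(state)
            totals[state] += returns[t]
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}

# State "A" is visited twice in this single episode.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
print(mc_state_values(episodes, first_visit=True))
print(mc_state_values(episodes, first_visit=False))
```

With the first visit only, "A" is credited with the full return from the start of the episode; with every visit, the shorter return from its second visit is averaged in as well.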
Exploring Starts
Assumption guaranteeing that every state-action pair has a non-zero probability of being chosen as the starting point of an episode. This condition ensures sufficient exploration for the convergence of MC Control methods.
Monte Carlo Control
Class of algorithms that use Monte Carlo estimates to learn an optimal policy by alternating between policy evaluation and policy improvement. These methods learn from sampled episodes and do not require a model of the environment.
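The policy improvement half of that loop can be sketched as follows, assuming action values are stored as a nested dictionary q_values[state][action]; this layout and the function name are assumptions made for illustration.

```python
def greedy_policy(q_values):
    """Policy improvement: in each state, pick the action with the highest estimate."""
    return {
        state: max(actions, key=actions.get)
        for state, actions in q_values.items()
    }

# Hypothetical action-value estimates produced by a policy evaluation step.
q_values = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.0, "right": -1.0},
}
print(greedy_policy(q_values))
```

Working with action values rather than state values is what lets the improvement step run without a model: no lookahead through the transition dynamics is needed to compare actions.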
Off-Policy Monte Carlo
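Learning approach where the policy being learned (target policy) differs from the policy used to generate data (behavior policy). This separation enables learning from expert data or past experiences; the returns are typically reweighted with importance sampling to correct for the difference between the two policies.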
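The usual correction for the mismatch between the two policies is the importance ratio of a trajectory: the product, over its steps, of the target policy's action probability divided by the behavior policy's. A minimal sketch, assuming both policies are given as hypothetical nested dictionaries policy[state][action] of probabilities:

```python
def importance_ratio(trajectory, target_policy, behavior_policy):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    rho = 1.0
    for state, action in trajectory:
        rho *= target_policy[state][action] / behavior_policy[state][action]
    return rho

# Illustrative policies: a deterministic target and a uniform behavior policy.
target = {"s": {"a": 1.0, "b": 0.0}}
behavior = {"s": {"a": 0.5, "b": 0.5}}
print(importance_ratio([("s", "a"), ("s", "a")], target, behavior))
print(importance_ratio([("s", "a"), ("s", "b")], target, behavior))
```

The behavior policy must give non-zero probability to every action the target policy can take, otherwise the ratio is undefined. A trajectory containing an action the target policy never takes gets ratio zero and contributes nothing.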
Weighted Importance Sampling
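Importance sampling variant that uses normalized weights to reduce variance compared to ordinary importance sampling. Each return is weighted by its importance ratio and the total is divided by the sum of the ratios rather than by the number of returns; the resulting estimate is biased for finite samples, with a bias that vanishes asymptotically, but its variance is typically much lower.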
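The two estimators differ only in the denominator. A minimal sketch, assuming the returns and their importance ratios have already been computed (the function name and the sample numbers are illustrative):

```python
def importance_sampling_estimates(returns, ratios):
    """Return the (ordinary, weighted) importance-sampling estimates of a value."""
    weighted_sum = sum(rho * g for rho, g in zip(ratios, returns))
    ordinary = weighted_sum / len(returns)   # divide by the number of returns
    total = sum(ratios)
    weighted = weighted_sum / total if total > 0 else 0.0  # divide by the sum of ratios
    return ordinary, weighted

ordinary, weighted = importance_sampling_estimates(
    returns=[1.0, 5.0], ratios=[3.0, 1.0]
)
print(ordinary, weighted)
```

The weighted estimate is a convex combination of the observed returns, so it can never fall outside their range; the ordinary estimate can, which is one source of its higher variance.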
GLIE Algorithm
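Property of an exploration strategy that is Greedy in the Limit with Infinite Exploration: every state-action pair is tried infinitely often, and the policy converges to the greedy policy. An epsilon-greedy policy whose epsilon decays as 1/k over episodes k satisfies both conditions, which supports asymptotic convergence to the optimal policy. Exploration gradually decreases while exploitation increases over time.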
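One common schedule that satisfies the GLIE conditions is epsilon-greedy action selection with epsilon = 1/k, where k is the episode number. A minimal sketch, assuming action values for the current state are given as a list indexed by action (the function names are illustrative):

```python
import random

def glie_epsilon(episode_index):
    """Exploration rate 1/k for episode k = 1, 2, 3, ... (decays towards zero)."""
    return 1.0 / episode_index

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.9, 0.3]
for k in (1, 10, 100):
    eps = glie_epsilon(k)
    print(k, eps, epsilon_greedy_action(q_values, eps))
```

In the first episode every action is chosen at random; as k grows the choice becomes almost always greedy, while the probability of exploring never reaches exactly zero.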
Monte Carlo ES
Monte Carlo Control algorithm using Exploring Starts to guarantee exploration of all state-action pairs. It maintains action value estimates and iteratively improves the policy towards optimality.
Return Discounting
Calculation of return in MC methods by applying a discount factor gamma, between 0 and 1, to future rewards; values below 1 give more importance to immediate rewards. The return is the sum of future rewards weighted by successive powers of gamma: G_t = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ...
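The discounted sum can be computed in a single backward pass over the rewards of an episode, using G_t = R_{t+1} + gamma * G_{t+1}. A minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by successive powers of gamma, computed backwards."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```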
Trajectory Sampling
Process of generating complete episodes by following a given policy until reaching a terminal state. The collected trajectories serve as the basis for Monte Carlo estimates of state or action values.
Incremental MC Update
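Efficient update of Monte Carlo value estimates of the form V(s) <- V(s) + alpha (G - V(s)), where G is the newly observed return and alpha is a learning rate. This approach avoids storing all past returns. With alpha = 1/N(s), where N(s) counts the returns seen so far for s, it reproduces the sample mean exactly and keeps its convergence guarantee; a constant alpha gives an exponentially weighted moving average that favors recent returns, which suits non-stationary problems but does not settle on the exact mean.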
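The update moves the current estimate a fraction of the way towards the newly observed return. A minimal sketch, with illustrative names, showing that a step size of 1/n reproduces the running mean:

```python
def incremental_update(value, target, step_size):
    """Move the estimate towards the target: V <- V + alpha * (G - V)."""
    return value + step_size * (target - value)

observed_returns = [2.0, 4.0, 6.0]

# Step size 1/n: the estimate equals the mean of the returns seen so far.
value = 0.0
for n, g in enumerate(observed_returns, start=1):
    value = incremental_update(value, g, 1.0 / n)
print(value)

# Constant step size: recent returns weigh more than old ones.
value = 0.0
for g in observed_returns:
    value = incremental_update(value, g, 0.5)
print(value)
```

Only the current estimate (and, for the 1/n variant, a visit counter) has to be kept per state.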
Monte Carlo Policy Evaluation
Process of estimating the value function of a policy by sampling complete episodes and averaging the observed returns. Unlike dynamic programming (DP), this method requires no knowledge of the environment dynamics.
Stochastic Policy Estimation
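Use of Monte Carlo methods to estimate values of stochastic policies, where actions are selected according to probabilities. When episodes are generated by the policy itself, the averaged returns reflect its action probabilities automatically; when they come from a different policy, the probability distribution over actions must be accounted for explicitly, for example through importance sampling.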
Bootstrapping-Free Methods
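Distinctive feature of Monte Carlo methods: their update targets are complete sampled returns and do not rely on the estimated values of successor states, unlike TD methods. This absence of bootstrapping eliminates the bias introduced by inaccurate value estimates but may increase variance, since a full return depends on every random action, transition, and reward until the end of the episode.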