Trust Region Policy Optimization (TRPO)
Reward-to-go
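Return estimate that, for each timestep, sums only the rewards received from that timestep onward. Because an action cannot influence rewards collected before it was taken, dropping those past rewards reduces the variance of the policy gradient estimate without biasing it.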
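As a rough illustration, the reward-to-go for every timestep of a single episode can be computed in one backward pass. This is a minimal sketch, not a reference implementation: the function name `rewards_to_go`, the optional discount factor `gamma`, and the sample rewards are assumptions made for the example.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma**(k - t) * rewards[k], for every t.

    `gamma` is a hypothetical optional discount; gamma=1.0 gives the plain
    undiscounted sum of future rewards.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for t + 1.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative rewards from one three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
print(rewards_to_go(episode_rewards))  # [3.0, 2.0, 2.0]
```

In a policy gradient update, each log-probability term for timestep t would be weighted by `returns[t]` rather than by the total episode return.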