Transformer Optimization
Layer-wise Learning Rate Decay
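Optimization strategy that assigns each layer its own learning rate according to its depth. Upper layers (those closest to the output) typically get higher rates, and the rate is commonly scaled down by a fixed decay factor for each layer closer to the input.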
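As a minimal sketch of how such a schedule might be computed, the snippet below assumes a multiplicative rule: the top layer keeps the base rate, and each layer below it is multiplied by a constant decay factor once more. The function name `layerwise_learning_rates` and the parameters `base_lr`, `num_layers`, and `decay` are hypothetical names chosen for illustration, not part of any particular library.

```python
from typing import List


def layerwise_learning_rates(base_lr: float, num_layers: int, decay: float) -> List[float]:
    """Return one learning rate per layer; index 0 is the layer closest to the input.

    Assumed rule (illustrative): lr_i = base_lr * decay ** (num_layers - 1 - i).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in the interval (0, 1]")
    # The top layer (index num_layers - 1) gets base_lr unchanged;
    # every step toward the input multiplies the rate by `decay` again.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Example: a 4-layer model with base rate 1e-3 and decay factor 0.5.
rates = layerwise_learning_rates(base_lr=1e-3, num_layers=4, decay=0.5)
for i, lr in enumerate(rates):
    print(f"layer {i}: lr = {lr:.6f}")
```

In a framework such as PyTorch, rates like these would usually be attached to per-layer parameter groups passed to the optimizer. With `decay` set to 1.0 the schedule reduces to a single uniform learning rate.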