Transformer Optimization
Layer-wise Learning Rate Decay
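Optimization strategy that assigns each layer its own learning rate according to its depth. The layers nearest the output typically get the highest rate, and the rate is scaled down by a decay factor for each layer closer to the input.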
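The depth-dependent schedule can be illustrated with a minimal sketch. It assumes one common formulation, in which the top layer keeps a base rate and every layer below it is multiplied by one more factor of a constant decay. The function name `layerwise_learning_rates` and its parameters `num_layers`, `base_lr`, and `decay` are hypothetical names chosen for illustration, not part of any library.

```python
def layerwise_learning_rates(num_layers, base_lr, decay):
    """Return one learning rate per layer; index 0 is the layer nearest the input.

    Assumed formulation (illustrative only): the top layer keeps base_lr,
    and each layer below it is scaled by one more factor of `decay`.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in the interval (0, 1]")
    top = num_layers - 1
    return [base_lr * decay ** (top - layer) for layer in range(num_layers)]


# Example: 4 layers, base rate 1e-3, decay 0.5 (all values are arbitrary).
rates = layerwise_learning_rates(num_layers=4, base_lr=1e-3, decay=0.5)
for layer, lr in enumerate(rates):
    print(f"layer {layer}: lr = {lr:.6f}")
```

In a training framework, each of these rates would usually be attached to the parameters of the corresponding layer, for example as a separate optimizer parameter group. A decay of 1.0 reduces the schedule to a single uniform learning rate.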