Transformer Optimization
Optimizer State Sharding
Memory distribution method that partitions optimizer states (e.g., Adam's momentum and variance buffers) across the data-parallel GPUs so that each device stores only its own 1/N slice, significantly reducing the per-GPU memory footprint during training.
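A minimal sketch of the idea, assuming a contiguous split of the parameter list across ranks and Adam-style optimizer states (two fp32 buffers per parameter); the helper names `shard_params` and `optimizer_state_bytes` are illustrative, not from any particular library:

```python
# Sketch: optimizer state sharding across data-parallel ranks.
# Each rank keeps Adam moment buffers only for its shard of the
# parameters; full parameters remain replicated on every rank.

def shard_params(num_params, world_size, rank):
    """Return the parameter indices owned by `rank` (contiguous split)."""
    per_rank = (num_params + world_size - 1) // world_size
    start = rank * per_rank
    return list(range(start, min(start + per_rank, num_params)))

def optimizer_state_bytes(owned_indices, param_sizes, bytes_per_elem=4):
    # Adam keeps two fp32 buffers (exp_avg, exp_avg_sq) per parameter.
    return sum(2 * param_sizes[i] * bytes_per_elem for i in owned_indices)

param_sizes = [1024] * 8   # eight parameter tensors of 1024 elements each
world_size = 4             # four data-parallel GPUs

full = optimizer_state_bytes(range(len(param_sizes)), param_sizes)
per_rank = [
    optimizer_state_bytes(shard_params(len(param_sizes), world_size, r),
                          param_sizes)
    for r in range(world_size)
]
```

With four ranks, each device holds one quarter of the optimizer state; after each local update, an all-gather of the updated parameter shards restores the full replicated parameters on every rank.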