Transformer Optimization
Tensor Parallelism
A parallelism technique that splits individual weight tensors (for example, the matrices of a linear layer) across multiple GPUs, so that layers too large for a single device's memory can still be trained. Each device holds only its shard of the weights, computes a partial result, and the shards' outputs are combined with a collective operation such as an all-gather or all-reduce.
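A minimal sketch of the idea using column-wise sharding, written with NumPy for illustration. The function name and shapes are assumptions; in a real setup each weight shard would live on a different GPU and the per-shard outputs would be combined with a collective rather than a local concatenation.

```python
import numpy as np

def column_parallel_linear(x, weight, num_shards):
    """Split `weight` column-wise into `num_shards` pieces, run one partial
    matmul per shard, then concatenate the partial outputs.

    x:      (batch, d_in) activations, replicated on every "device"
    weight: (d_in, d_out) full weight matrix (shown only for demonstration;
            with true tensor parallelism no single device holds all of it)
    """
    shards = np.split(weight, num_shards, axis=1)           # each: (d_in, d_out / num_shards)
    partial_outputs = [x @ w_shard for w_shard in shards]   # one matmul per simulated device
    return np.concatenate(partial_outputs, axis=1)          # stand-in for an all-gather

# The sharded computation matches the unsharded one.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 16))
assert np.allclose(column_parallel_linear(x, w, num_shards=4), x @ w)
```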