AI Glossary
The complete dictionary of Artificial Intelligence
Sequence Parallelism
A form of parallelism that splits the sequence dimension of input tensors across multiple accelerators; used for Transformer-style models with long sequences, where activation memory grows with sequence length.
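A minimal sketch of the idea in PyTorch, simulated in a single process: the sequence dimension of an activation tensor is split into chunks, each of which a separate accelerator would own (the shapes and `world_size` here are illustrative).

```python
import torch

# Illustrative shapes: batch=2, sequence length=8, hidden size=4.
x = torch.randn(2, 8, 4)

# Split along the sequence dimension (dim=1); in a real setup each
# chunk would live on a different accelerator.
world_size = 4
chunks = torch.chunk(x, world_size, dim=1)

for rank, chunk in enumerate(chunks):
    # Each rank holds only the sequence positions it owns: (2, 2, 4).
    print(f"rank {rank}: {tuple(chunk.shape)}")
```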
Expert Parallelism
A technique specific to sparse mixture-of-experts (MoE) models, where the different expert networks are distributed across separate accelerators and tokens are routed to the device hosting their assigned expert, balancing the computational load.
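A single-process sketch of top-1 routing, assuming PyTorch; the expert count, sizes, and the simple argmax router are illustrative. In a real deployment each expert in the loop below would sit on its own accelerator, and tokens would be exchanged between devices (an all-to-all) rather than indexed locally.

```python
import torch
import torch.nn as nn

num_experts, hidden = 4, 8  # illustrative sizes
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
router = nn.Linear(hidden, num_experts)

tokens = torch.randn(16, hidden)            # 16 tokens
expert_ids = router(tokens).argmax(dim=-1)  # top-1 routing

out = torch.empty_like(tokens)
for e in range(num_experts):
    mask = expert_ids == e
    # Only the tokens routed to expert e are sent to (and computed on)
    # the device that hosts that expert.
    out[mask] = experts[e](tokens[mask])
```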
Sharded Data Parallelism
A combination of data parallelism and the ZeRO strategy, in which model states (parameters, gradients, and optimizer states) are partitioned (sharded) across workers while each worker still processes its own slice of the data.
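A minimal sketch using PyTorch's FullyShardedDataParallel, which implements this ZeRO-style sharding; it assumes the script is launched with `torchrun` on CUDA devices, and the model and optimizer here are placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=N train.py`, which sets
# the environment variables that init_process_group reads.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 1024)).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks
# while each rank still consumes its own data batch.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```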
Activation Checkpointing
A memory-saving technique that stores only selected intermediate activations during the forward pass and recomputes the rest during the backward pass, trading extra computation for reduced GPU memory.
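A minimal runnable example with `torch.utils.checkpoint`, PyTorch's built-in implementation of this technique; the block and shapes are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # The intermediate activations of this block are not stored; they
    # are recomputed during the backward pass.
    return torch.relu(x @ x.t())

x = torch.randn(64, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # re-runs block(x) internally before backprop
```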
Hybrid Parallelism
An approach combining multiple parallelism strategies (e.g., tensor, pipeline, and data) to maximize resource utilization and scale training across thousands of accelerators.
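A plain-Python sketch of how ranks might be mapped onto a hybrid grid; the 2 x 2 x 4 layout (data x pipeline x tensor) is purely illustrative.

```python
# 16 accelerators arranged as 2 data-parallel replicas, each a
# 2-stage pipeline whose stages are split 4 ways with tensor parallelism.
DP, PP, TP = 2, 2, 4

for rank in range(DP * PP * TP):
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    print(f"rank {rank}: data replica {dp}, pipeline stage {pp}, tensor shard {tp}")
```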
All-Reduce Communication
A collective communication operation essential to data parallelism, where local gradients from each accelerator are aggregated and redistributed to synchronize model weights.
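A minimal example with `torch.distributed.all_reduce`; it assumes a `torchrun` launch (the CPU-friendly `gloo` backend keeps the sketch hardware-agnostic).

```python
import torch
import torch.distributed as dist

# Assumes launch via `torchrun --nproc_per_node=N all_reduce_demo.py`.
dist.init_process_group(backend="gloo")

# Each rank starts with a different local "gradient".
grad = torch.tensor([float(dist.get_rank() + 1)])

# After all_reduce with SUM, every rank holds the same aggregate,
# which is then divided by world size to get the average gradient.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()
print(f"rank {dist.get_rank()}: {grad.item()}")
```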
Tensor Slicing
A fundamental operation in tensor parallelism: dividing a tensor along a chosen dimension (e.g., rows or columns) so the pieces can be distributed across multiple devices.
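A single-process sketch in PyTorch showing both slicing directions and verifying that column slicing reproduces the full matrix multiply; all sizes are illustrative.

```python
import torch

w = torch.randn(8, 8)   # an illustrative weight matrix
x = torch.randn(3, 8)

# Column slicing: each device would hold an (8, 4) shard and compute a
# partial output that is concatenated along the feature dimension.
col_shards = torch.chunk(w, 2, dim=1)

# Row slicing: each device would hold a (4, 8) shard and produce a
# partial sum that is combined with an all-reduce.
row_shards = torch.chunk(w, 2, dim=0)

full = x @ w
sharded = torch.cat([x @ s for s in col_shards], dim=1)
assert torch.allclose(full, sharded, atol=1e-6)
```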
GPipe
A pipeline parallelism implementation that uses micro-batching and activation checkpointing to efficiently train very large neural networks.
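A single-process sketch of the micro-batching idea; in GPipe the two stages below would live on different devices, so while stage 1 works on micro-batch i, stage 0 can already start micro-batch i+1.

```python
import torch
import torch.nn as nn

# Two pipeline stages; sizes and the 4-way micro-batch split are illustrative.
stage0 = nn.Linear(16, 16)
stage1 = nn.Linear(16, 16)

batch = torch.randn(8, 16)
micro_batches = torch.chunk(batch, 4)  # 4 micro-batches of size 2

outputs = []
for mb in micro_batches:
    # Micro-batching keeps both stages busy: each stage can move on to
    # the next micro-batch instead of idling on one large batch.
    outputs.append(stage1(stage0(mb)))

out = torch.cat(outputs)
```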
Megatron-LM
A tensor-parallel training framework developed by NVIDIA, designed to train massive language models by partitioning each layer's weight matrices (and the corresponding gradients) across accelerators.
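A single-process sketch of a Megatron-style column-parallel linear layer; it only illustrates the partitioning arithmetic (the real implementation adds communication and fused kernels), and all sizes are made up.

```python
import torch
import torch.nn as nn

hidden, world_size = 8, 2
full = nn.Linear(hidden, hidden, bias=False)

# nn.Linear stores its weight as (out_features, in_features), so
# chunking along dim=0 splits the layer's output columns across devices.
shards = torch.chunk(full.weight, world_size, dim=0)

x = torch.randn(4, hidden)
# Each "device" computes hidden / world_size output features...
partials = [x @ s.t() for s in shards]
# ...and gathering along the feature dimension recovers the full output.
assert torch.allclose(torch.cat(partials, dim=1), full(x), atol=1e-6)
```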
DeepSpeed
Microsoft's optimization library implementing advanced techniques like ZeRO, hybrid parallelism, and memory compression for large-scale model training.
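A minimal sketch of wiring a model into DeepSpeed with a ZeRO stage-2 configuration; it assumes the `deepspeed` package is installed and the script is started with the DeepSpeed launcher, and all config values are illustrative.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder model

# A minimal ZeRO stage-2 config; values are illustrative.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine whose backward()/step()
# handle gradient partitioning and synchronization.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```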
Offloading
A memory-management strategy in which data (weights, gradients, activations) is dynamically moved between fast GPU memory and slower but larger CPU memory.
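A minimal sketch of manual weight offloading in PyTorch, assuming a CUDA device is available; the tensor sizes and the `forward_with_offload` helper are hypothetical.

```python
import torch

# Keep the master copy of a large weight in (pinned) CPU memory so the
# host-to-device copy can overlap with computation.
weights_cpu = torch.randn(4096, 4096, pin_memory=True)

def forward_with_offload(x: torch.Tensor) -> torch.Tensor:
    w = weights_cpu.to("cuda", non_blocking=True)  # fetch on demand
    y = x @ w
    del w  # free the GPU copy; the master copy stays on the CPU
    return y

out = forward_with_offload(torch.randn(8, 4096, device="cuda"))
```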