AI Glossary
The complete dictionary of Artificial Intelligence
Sequence Parallelism
A form of parallelism that splits the sequence dimension of input tensors across multiple accelerators; used for Transformer-style models with long sequences, where activation memory grows with sequence length.
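A minimal sketch of the idea in PyTorch, simulated in a single process: the sequence dimension of an activation tensor is split into chunks, each of which a separate accelerator would own (the shapes and `world_size` here are illustrative).

```python
import torch

# Illustrative shapes: batch=2, sequence length=8, hidden size=4.
x = torch.randn(2, 8, 4)

# Split along the sequence dimension (dim=1); in a real setup each
# chunk would live on a different accelerator.
world_size = 4
chunks = torch.chunk(x, world_size, dim=1)

for rank, chunk in enumerate(chunks):
    # Each rank holds only the sequence positions it owns: (2, 2, 4).
    print(f"rank {rank}: {tuple(chunk.shape)}")
```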
Expert Parallelism
A technique specific to sparse mixture-of-experts (MoE) models, where the different expert networks are distributed across separate accelerators and tokens are routed to the device hosting their assigned expert, balancing the computational load.
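A single-process sketch of top-1 routing, assuming PyTorch; the expert count, sizes, and the simple argmax router are illustrative. In a real deployment each expert in the loop below would sit on its own accelerator, and tokens would be exchanged between devices (an all-to-all) rather than indexed locally.

```python
import torch
import torch.nn as nn

num_experts, hidden = 4, 8  # illustrative sizes
experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
router = nn.Linear(hidden, num_experts)

tokens = torch.randn(16, hidden)            # 16 tokens
expert_ids = router(tokens).argmax(dim=-1)  # top-1 routing

out = torch.empty_like(tokens)
for e in range(num_experts):
    mask = expert_ids == e
    # Only the tokens routed to expert e are sent to (and computed on)
    # the device that hosts that expert.
    out[mask] = experts[e](tokens[mask])
```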
Sharded Data Parallelism
A combination of data parallelism and the ZeRO strategy, in which model states (parameters, gradients, and optimizer states) are partitioned (sharded) across workers while each worker still processes its own slice of the data.
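A minimal sketch using PyTorch's FullyShardedDataParallel, which implements this ZeRO-style sharding; it assumes the script is launched with `torchrun` on CUDA devices, and the model and optimizer here are placeholders.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=N train.py`, which sets
# the environment variables that init_process_group reads.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                      nn.Linear(1024, 1024)).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks
# while each rank still consumes its own data batch.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```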
Activation Checkpointing
A memory-saving technique that stores only selected intermediate activations during the forward pass and recomputes the rest during the backward pass, trading extra computation for reduced GPU memory.
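A minimal runnable example with `torch.utils.checkpoint`, PyTorch's built-in implementation of this technique; the block and shapes are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # The intermediate activations of this block are not stored; they
    # are recomputed during the backward pass.
    return torch.relu(x @ x.t())

x = torch.randn(64, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # re-runs block(x) internally before backprop
```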
Hybrid Parallelism
An approach combining multiple parallelism strategies (e.g., tensor, pipeline, and data) to maximize resource utilization and scale training across thousands of accelerators.
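A plain-Python sketch of how ranks might be mapped onto a hybrid grid; the 2 x 2 x 4 layout (data x pipeline x tensor) is purely illustrative.

```python
# 16 accelerators arranged as 2 data-parallel replicas, each a
# 2-stage pipeline whose stages are split 4 ways with tensor parallelism.
DP, PP, TP = 2, 2, 4

for rank in range(DP * PP * TP):
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    print(f"rank {rank}: data replica {dp}, pipeline stage {pp}, tensor shard {tp}")
```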
All-Reduce Communication
A collective communication operation essential to data parallelism, where local gradients from each accelerator are aggregated and redistributed to synchronize model weights.
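A minimal example with `torch.distributed.all_reduce`; it assumes a `torchrun` launch (the CPU-friendly `gloo` backend keeps the sketch hardware-agnostic).

```python
import torch
import torch.distributed as dist

# Assumes launch via `torchrun --nproc_per_node=N all_reduce_demo.py`.
dist.init_process_group(backend="gloo")

# Each rank starts with a different local "gradient".
grad = torch.tensor([float(dist.get_rank() + 1)])

# After all_reduce with SUM, every rank holds the same aggregate,
# which is then divided by world size to get the average gradient.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()
print(f"rank {dist.get_rank()}: {grad.item()}")
```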
Tensor Slicing
A fundamental operation in tensor parallelism: dividing a tensor along a chosen dimension (e.g., rows or columns) so the pieces can be distributed across multiple devices.
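A single-process sketch in PyTorch showing both slicing directions and verifying that column slicing reproduces the full matrix multiply; all sizes are illustrative.

```python
import torch

w = torch.randn(8, 8)   # an illustrative weight matrix
x = torch.randn(3, 8)

# Column slicing: each device would hold an (8, 4) shard and compute a
# partial output that is concatenated along the feature dimension.
col_shards = torch.chunk(w, 2, dim=1)

# Row slicing: each device would hold a (4, 8) shard and produce a
# partial sum that is combined with an all-reduce.
row_shards = torch.chunk(w, 2, dim=0)

full = x @ w
sharded = torch.cat([x @ s for s in col_shards], dim=1)
assert torch.allclose(full, sharded, atol=1e-6)
```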
GPipe
A pipeline parallelism implementation that uses micro-batching and activation checkpointing to efficiently train very large neural networks.
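A single-process sketch of the micro-batching idea; in GPipe the two stages below would live on different devices, so while stage 1 works on micro-batch i, stage 0 can already start micro-batch i+1.

```python
import torch
import torch.nn as nn

# Two pipeline stages; sizes and the 4-way micro-batch split are illustrative.
stage0 = nn.Linear(16, 16)
stage1 = nn.Linear(16, 16)

batch = torch.randn(8, 16)
micro_batches = torch.chunk(batch, 4)  # 4 micro-batches of size 2

outputs = []
for mb in micro_batches:
    # Micro-batching keeps both stages busy: each stage can move on to
    # the next micro-batch instead of idling on one large batch.
    outputs.append(stage1(stage0(mb)))

out = torch.cat(outputs)
```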
Megatron-LM
A tensor-parallel training framework developed by NVIDIA, designed to train massive language models by partitioning each layer's weight matrices (and the corresponding gradients) across accelerators.
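A single-process sketch of a Megatron-style column-parallel linear layer; it only illustrates the partitioning arithmetic (the real implementation adds communication and fused kernels), and all sizes are made up.

```python
import torch
import torch.nn as nn

hidden, world_size = 8, 2
full = nn.Linear(hidden, hidden, bias=False)

# nn.Linear stores its weight as (out_features, in_features), so
# chunking along dim=0 splits the layer's output columns across devices.
shards = torch.chunk(full.weight, world_size, dim=0)

x = torch.randn(4, hidden)
# Each "device" computes hidden / world_size output features...
partials = [x @ s.t() for s in shards]
# ...and gathering along the feature dimension recovers the full output.
assert torch.allclose(torch.cat(partials, dim=1), full(x), atol=1e-6)
```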
DeepSpeed
Microsoft's optimization library implementing advanced techniques like ZeRO, hybrid parallelism, and memory compression for large-scale model training.
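A minimal sketch of wiring a model into DeepSpeed with a ZeRO stage-2 configuration; it assumes the `deepspeed` package is installed and the script is started with the DeepSpeed launcher, and all config values are illustrative.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder model

# A minimal ZeRO stage-2 config; values are illustrative.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine whose backward()/step()
# handle gradient partitioning and synchronization.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```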
Offloading
A memory-management strategy in which data (weights, gradients, activations) is dynamically moved between fast GPU memory and slower but larger CPU memory.
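A minimal sketch of manual weight offloading in PyTorch, assuming a CUDA device is available; the tensor sizes and the `forward_with_offload` helper are hypothetical.

```python
import torch

# Keep the master copy of a large weight in (pinned) CPU memory so the
# host-to-device copy can overlap with computation.
weights_cpu = torch.randn(4096, 4096, pin_memory=True)

def forward_with_offload(x: torch.Tensor) -> torch.Tensor:
    w = weights_cpu.to("cuda", non_blocking=True)  # fetch on demand
    y = x @ w
    del w  # free the GPU copy; the master copy stays on the CPU
    return y

out = forward_with_offload(torch.randn(8, 4096, device="cuda"))
```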