Model Parallelism - AI Glossary

📖

terms

Sequence Parallelism

A form of parallelism that divides the sequence dimension of input tensors across multiple accelerators, used for Transformer-type models with long sequences.
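
As a rough illustration of the definition above, the sketch below splits a token sequence into contiguous, near-equal chunks, one per device. The function name `split_sequence` and the contiguous-chunk policy are assumptions made for this example; real systems also have to exchange information between chunks (for attention), which is not shown.

```python
def split_sequence(tokens, num_devices):
    """Split a token sequence into contiguous, near-equal chunks, one per device."""
    base, extra = divmod(len(tokens), num_devices)
    chunks, start = [], 0
    for rank in range(num_devices):
        # The first `extra` devices receive one additional token.
        size = base + (1 if rank < extra else 0)
        chunks.append(tokens[start:start + size])
        start += size
    return chunks


tokens = list(range(10))            # stand-in for a sequence of 10 token ids
shards = split_sequence(tokens, 4)  # hypothetical setup with 4 devices
print(shards)
```

Concatenating the chunks in rank order gives back the original sequence, so no token is dropped or duplicated.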

Expert Parallelism

A technique specific to sparse mixture-of-experts (MoE) models, in which different expert networks are placed on separate accelerators and each token is sent to the accelerator hosting the expert it is routed to, spreading the computational load.
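
A minimal sketch of the idea, assuming a round-robin placement of experts on devices and a router output given as a plain list. The names `place_experts`, `dispatch`, and `routing` are hypothetical and do not come from any particular library.

```python
from collections import defaultdict


def place_experts(num_experts, num_devices):
    """Assign expert e to device e % num_devices (round-robin, an assumed policy)."""
    return {e: e % num_devices for e in range(num_experts)}


def dispatch(token_to_expert, placement):
    """Group token indices by the device that hosts their routed expert."""
    per_device = defaultdict(list)
    for token, expert in enumerate(token_to_expert):
        per_device[placement[expert]].append(token)
    return dict(per_device)


placement = place_experts(num_experts=4, num_devices=2)
routing = [0, 3, 1, 2, 0]  # hypothetical router output: chosen expert per token
batches = dispatch(routing, placement)
print(batches)
```

In a real system the grouping step corresponds to an all-to-all exchange between accelerators; here it is only simulated with a dictionary.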

Sharded Data Parallelism

A combination of data parallelism and the ZeRO strategy, in which model states (parameters, gradients, and optimizer states) are partitioned (sharded) among workers while each worker still processes its own portion of the data.
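
The sharding step can be sketched as follows, assuming a simple strided split of a flat parameter list. The helper names `shard_params` and `all_gather` are illustrative only, and actual implementations typically shard contiguous flattened buffers rather than strided elements.

```python
def shard_params(params, num_workers):
    """Give worker r every num_workers-th parameter, starting at index r."""
    return [params[r::num_workers] for r in range(num_workers)]


def all_gather(shards):
    """Reassemble the full parameter list from the strided shards."""
    total = sum(len(s) for s in shards)
    n = len(shards)
    return [shards[i % n][i // n] for i in range(total)]


params = [10, 20, 30, 40, 50]  # stand-in for a flat list of model parameters
shards = shard_params(params, num_workers=2)
full = all_gather(shards)      # what a worker would rebuild before using a layer
print(shards, full)
```

Each worker permanently holds only its shard and temporarily gathers the rest when it needs the full parameters.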

Activation Checkpointing

A memory-saving technique that stores only a subset of intermediate activations during the forward pass and recomputes the rest during the backward pass, trading extra computation for lower GPU memory use.
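
The trade-off can be sketched in plain Python, assuming layers are simple functions and a checkpoint is kept at the input of every `every`-th layer. The function names and the checkpoint interval are assumptions for illustration; no gradient computation is shown.

```python
def forward_with_checkpoints(x, layers, every):
    """Run layers in order, storing only the input of every `every`-th layer."""
    checkpoints = {}
    for i, layer in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x  # stored activation (start of a segment)
        x = layer(x)
    return x, checkpoints


def recompute_activation(i, layers, checkpoints, every):
    """Rebuild the input to layer i from the nearest earlier checkpoint."""
    start = (i // every) * every
    x = checkpoints[start]
    for j in range(start, i):
        x = layers[j](x)  # extra forward work paid during the backward pass
    return x


layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
out, saved = forward_with_checkpoints(2, layers, every=2)
print(out, saved, recompute_activation(3, layers, saved, every=2))
```

Only two of the four layer inputs are stored; the other two are recomputed on demand from the closest checkpoint.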

Hybrid Parallelism

An approach combining multiple parallelism strategies (e.g., tensor, pipeline, and data) to maximize resource utilization and scale training across thousands of accelerators.
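
One way to picture the combination is as a three-dimensional grid of accelerators. The sketch below maps a global rank to (data, pipeline, tensor) coordinates, assuming the tensor-parallel index varies fastest. That ordering and the function name `rank_to_coords` are assumptions for this example; frameworks differ in how they lay out ranks.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Assumed layout: tensor index varies fastest, then pipeline, then data.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the tp * pp * dp grid")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return data, pipeline, tensor


# Hypothetical 8-accelerator job: 2-way tensor, 2-way pipeline, 2-way data.
coords = [rank_to_coords(r, tp=2, pp=2, dp=2) for r in range(8)]
print(coords)
```

The total number of accelerators is the product of the three degrees, which is why combining strategies scales further than any single one.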

All-Reduce Communication

A collective communication operation essential to data parallelism, in which the local gradients from each accelerator are aggregated (typically summed or averaged) and the result is redistributed to every accelerator so that all model replicas stay synchronized.
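
The effect of the operation can be simulated in a single process, assuming each worker's gradient is a list of floats and the reduction is a mean. The function `all_reduce_mean` is a hypothetical helper, and the actual communication pattern (ring, tree, and so on) is not modeled.

```python
def all_reduce_mean(local_grads):
    """Average per-worker gradient vectors and give every worker the same result."""
    num_workers = len(local_grads)
    summed = [sum(vals) for vals in zip(*local_grads)]  # element-wise sum
    mean = [s / num_workers for s in summed]
    return [list(mean) for _ in range(num_workers)]     # one copy per worker


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # gradients from 3 simulated workers
synced = all_reduce_mean(grads)
print(synced)
```

After the call every worker holds an identical gradient, so identical optimizer steps keep the replicas in sync.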

Tensor Slicing

A fundamental operation in tensor parallelism involving dividing a tensor along a specific dimension (e.g., row, column) to distribute it across multiple devices.
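
A small worked example of a column split, using nested lists as matrices. It assumes the number of columns is divisible by the number of devices, and the helper names are illustrative. Each simulated device multiplies the input by its own column block, and concatenating the partial results reproduces the full product.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split matrix m into `parts` column blocks of equal width."""
    if len(m[0]) % parts != 0:
        raise ValueError("column count must be divisible by parts")
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def concat_columns(blocks):
    """Join column blocks back together, row by row."""
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
shards = split_columns(w, 2)              # one column block per simulated device
partial = [matmul(x, s) for s in shards]  # each device computes its own slice
y = concat_columns(partial)
print(y)
```

A row split works similarly, except that the partial results must be summed rather than concatenated.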

GPipe

A pipeline parallelism implementation that uses micro-batching and activation checkpointing to efficiently train very large neural networks.
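
The micro-batching idea can be sketched as a forward-pass schedule, assuming every stage takes one time step per micro-batch. The function names are hypothetical, and the backward pass and checkpointing are omitted. Stage `s` handles micro-batch `m` at step `s + m`, and the idle share of the pipeline is `(stages - 1) / (microbatches + stages - 1)`.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return, per time step, the (stage, microbatch) pairs that run in parallel."""
    steps = num_stages + num_microbatches - 1
    return [
        [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_microbatches]
        for t in range(steps)
    ]


def bubble_fraction(num_stages, num_microbatches):
    """Fraction of stage-time slots left idle while the pipeline fills and drains."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


schedule = forward_schedule(num_stages=2, num_microbatches=3)
for step, work in enumerate(schedule):
    print(step, work)
print(bubble_fraction(2, 3))
```

Increasing the number of micro-batches relative to the number of stages shrinks the idle fraction, which is the motivation for splitting each mini-batch.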

Megatron-LM

A framework developed by NVIDIA for training very large Transformer language models; it implements tensor parallelism by partitioning the weight matrices of each layer, and their gradients, across accelerators.

DeepSpeed

Microsoft's optimization library implementing advanced techniques like ZeRO, hybrid parallelism, and memory compression for large-scale model training.

Offloading

A memory management strategy in which data (weights, gradients, activations) is dynamically moved between fast GPU memory and slower but larger CPU memory.
