Machine Learning at Scale

Distributed Machine Learning

Paradigm for training ML models where computations are distributed across multiple machines to process massive datasets and reduce training time.

Parameter Server

Distributed architecture that stores model parameters on dedicated server nodes; workers pull the current parameters, compute gradients on their own data, and push updates back, either synchronously or asynchronously.
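A minimal single-process sketch of the parameter server pattern may make the pull/push cycle concrete. Everything here is illustrative: the `ParameterServer` class, the `worker_gradient` helper, the 1-D model y = w * x, and the learning rate are assumptions for the example, not part of any particular framework. Workers are simulated by looping over batches rather than running on separate machines.

```python
class ParameterServer:
    """Holds the authoritative copy of the model parameters (hypothetical class)."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def pull(self):
        # Workers fetch a snapshot of the current parameters.
        return list(self.params)

    def push(self, grads):
        # Apply one worker's gradient as soon as it arrives (asynchronous style).
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]


def worker_gradient(params, batch):
    # Gradient of mean squared error for the 1-D model y = w * x.
    w = params[0]
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]


server = ParameterServer([0.0], lr=0.05)
# Two batches drawn from y = 2x; each batch stands in for one worker's data.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

for step in range(50):
    for batch in batches:
        grads = worker_gradient(server.pull(), batch)
        server.push(grads)

final_w = server.params[0]  # approaches the true slope 2.0
```

In a real deployment the pull and push calls would be network requests, and several workers could push gradients computed from slightly stale parameters.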

AllReduce

Collective communication operation that reduces values (for example, sums gradients) across all nodes and delivers the identical result to every node, keeping model replicas synchronized during distributed training.
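One common way to implement AllReduce is the ring variant, sketched below as an in-memory simulation. The function name `ring_allreduce` is hypothetical, and the sketch assumes, for simplicity, that each worker's vector has exactly one element per worker so that each chunk is a single number. Real implementations split large tensors into chunks and send them over the network.

```python
def ring_allreduce(vectors):
    """Simulate a sum-AllReduce over a ring of len(vectors) workers."""
    n = len(vectors)
    # Simplifying assumption: one element (chunk) per worker.
    assert all(len(v) == n for v in vectors)
    data = [list(v) for v in vectors]

    # Phase 1 (reduce-scatter): partial sums travel around the ring.
    # Afterwards worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values so all sends in a step happen "at once".
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value

    # Phase 2 (all-gather): the reduced chunks are circulated to everyone.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
# Every worker ends up with the element-wise sum [111, 222, 333].
```

Each worker only ever talks to its ring neighbor, which is why this layout tends to use bandwidth evenly as the number of workers grows.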

Data Parallelism

Parallelization strategy where the dataset is partitioned across multiple workers, each training an identical copy of the model on different batches and combining gradients so that the copies stay in sync.
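As a rough illustration of synchronous data parallelism, the sketch below splits a toy dataset into two equal shards, computes a gradient per shard, averages the gradients, and applies the same update to the shared weight. The helper names (`gradient`, `data_parallel_step`), the 1-D model y = w * x, and the learning rate are assumptions chosen for the example. With equal-sized shards, the averaged gradient matches the gradient over the full dataset.

```python
def gradient(w, batch):
    # Gradient of mean squared error for the 1-D model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    # Each worker computes a gradient on its own shard...
    local_grads = [gradient(w, shard) for shard in shards]
    # ...the gradients are averaged (the role AllReduce plays in practice)...
    avg_grad = sum(local_grads) / len(local_grads)
    # ...and every replica applies the same update.
    return w - lr * avg_grad


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # samples of y = 3x
shards = [data[0::2], data[1::2]]  # two equal-sized shards, one per worker

w = 0.0
for _ in range(100):
    w = data_parallel_step(w, shards, lr=0.05)
# w approaches the true slope 3.0
```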

Spark MLlib

Scalable machine learning library built on Apache Spark, offering distributed implementations of classical ML algorithms.

TensorFlow Distributed

TensorFlow's distributed training API (tf.distribute), which uses strategies such as MirroredStrategy and MultiWorkerMirroredStrategy to scale training across devices and machines.

Horovod

Open-source framework originally developed by Uber that uses ring-allreduce over backends such as MPI, NCCL, or Gloo for efficient distributed training of deep learning models.

Ray

Distributed computing framework optimized for machine learning and AI, providing primitives for parallel execution and large-scale state management.

Petastorm

Library enabling efficient access to large datasets stored in Apache Parquet for distributed deep learning model training.

Dask-ML

Dask extension integrating scalable machine learning algorithms and parallelization tools for ML workflows on clusters.

Kubeflow

Open-source platform based on Kubernetes for deploying and managing complex ML pipelines at scale with containerized orchestration.

MLflow

Open-source platform for managing the complete lifecycle of ML projects, including experiment tracking, model management, and reproducibility at scale.

Feast

Open-source feature store providing an abstraction layer for managing, versioning, and serving features at scale.

Vertex AI

Google Cloud's unified platform for training, deploying, and managing ML models at scale with integrated AutoML and MLOps.

SageMaker

Fully managed AWS service for distributed training, deployment, and monitoring of ML models with automatic resource optimization.

Sharding

Horizontal partitioning of a dataset or model across multiple nodes to enable parallel processing and reduce the load on each machine.
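For data sharding, one simple scheme is to hash each record key and take the result modulo the number of shards. The sketch below is an assumption-laden example, not a reference implementation: `shard_for` and `partition` are hypothetical helpers, and the `user-<i>` keys are made up. It uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` vary between interpreter runs, which would send the same key to different shards on different machines.

```python
import hashlib


def shard_for(key, num_shards):
    # Stable hash: the same key maps to the same shard on every machine.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    # Group keys into num_shards lists according to their shard index.
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


records = [f"user-{i}" for i in range(1000)]  # made-up record keys
shards = partition(records, num_shards=4)
shard_sizes = [len(s) for s in shards]  # roughly balanced across shards
```

A drawback of plain modulo hashing is that changing the shard count reassigns most keys; schemes such as consistent hashing aim to reduce that movement.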

Elastic Training

Ability to dynamically adjust the number of workers during training to optimize resource utilization and reduce costs.
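One piece of elastic training is redistributing the data when the worker set changes. The sketch below shows only that piece, under simplifying assumptions: `assign_samples` is a hypothetical helper that deals sample indices round-robin to whichever workers are currently present, and the worker names are invented. Real systems also need to handle checkpointing and re-forming the communication group.

```python
def assign_samples(num_samples, workers):
    # Deal sample indices round-robin across the current set of workers.
    return {
        worker: list(range(rank, num_samples, len(workers)))
        for rank, worker in enumerate(workers)
    }


workers = ["w0", "w1"]
before = assign_samples(8, workers)
# {'w0': [0, 2, 4, 6], 'w1': [1, 3, 5, 7]}

# Two more workers join mid-training; the assignment is recomputed.
workers.extend(["w2", "w3"])
after = assign_samples(8, workers)
# {'w0': [0, 4], 'w1': [1, 5], 'w2': [2, 6], 'w3': [3, 7]}
```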
