AI Glossary
The Complete Dictionary of Artificial Intelligence
Distributed Machine Learning
Paradigm for training ML models where computations are distributed across multiple machines to process massive datasets and reduce training time.
Parameter Server
Distributed training architecture that centralizes model parameters on dedicated server nodes; workers pull the current parameters and push gradient updates, often asynchronously.
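A minimal single-process sketch of the pull/push cycle; the ParameterServer class and its methods are illustrative, not a real library API:

```python
import numpy as np

class ParameterServer:
    """Toy parameter server holding the canonical model weights."""

    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current parameters before computing gradients.
        return self.weights.copy()

    def push(self, grad):
        # Workers send gradients; the server applies each one as it arrives,
        # without waiting for the others (asynchronous updates).
        self.weights -= self.lr * grad

server = ParameterServer(dim=4)
for _ in range(3):                 # each iteration stands in for one worker update
    w = server.pull()
    grad = np.random.randn(4)      # placeholder for a gradient on local data
    server.push(grad)
```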
AllReduce
Collective communication operation that reduces values across all nodes (typically summing or averaging gradients) and broadcasts the result back to every node in a distributed training job.
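A toy illustration of the reduce-then-broadcast semantics in plain Python; real implementations (ring or tree AllReduce) exchange chunks peer-to-peer instead of gathering centrally:

```python
import numpy as np

def allreduce_mean(per_node_grads):
    # Reduction: combine the gradients contributed by every node.
    total = np.sum(per_node_grads, axis=0)
    mean = total / len(per_node_grads)
    # Broadcast: every node receives the same reduced result.
    return [mean.copy() for _ in per_node_grads]

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(allreduce_mean(grads))  # both "nodes" end up with [2.0, 3.0]
```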
Data Parallelism
Parallelization strategy where data is partitioned across multiple machines, each training an identical copy of the model with different batches.
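A single-process sketch of one data-parallel step with NumPy; the shard count and learning rate are arbitrary:

```python
import numpy as np

def worker_gradient(w, X, y):
    # Mean-squared-error gradient for a linear model on one worker's shard.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
w = np.zeros(3)  # every worker holds an identical copy of these weights

# Partition the data across four "workers", compute local gradients,
# then average them (the step AllReduce performs in a real cluster).
shards = np.array_split(np.arange(len(y)), 4)
grads = [worker_gradient(w, X[idx], y[idx]) for idx in shards]
w -= 0.01 * np.mean(grads, axis=0)
```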
Spark MLlib
Scalable machine learning library built on Apache Spark, offering distributed implementations of classical ML algorithms.
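A minimal PySpark example training a logistic regression; the tiny inline dataset is just for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# In practice the DataFrame would be read from distributed storage.
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0), (Vectors.dense([2.0, 1.0]), 1.0)],
    ["features", "label"],
)
model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
```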
TensorFlow Distributed
TensorFlow's distributed training framework using strategies like MirroredStrategy and MultiWorkerMirroredStrategy to scale training.
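A sketch of synchronous multi-GPU training with MirroredStrategy; the model is a placeholder:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per local GPU
with strategy.scope():
    # Variables created here are mirrored across all replicas.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(...) then splits each batch across the replicas and
# aggregates gradients with AllReduce.
```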
Horovod
Open-source framework developed by Uber that implements ring-AllReduce (over MPI, Gloo, or NCCL) for efficient distributed training of deep learning models.
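A condensed sketch of the standard Horovod/PyTorch setup; the model and hyperparameters are placeholders:

```python
import torch
import horovod.torch as hvd

hvd.init()  # one process per worker, launched e.g. with `horovodrun -np 4 ...`
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via AllReduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
# Start every worker from the same initial weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
```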
Ray
Distributed computing framework optimized for machine learning and AI, providing simple primitives (remote tasks and actors) for parallel execution and distributed state management.
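A minimal example of Ray's task primitive:

```python
import ray

ray.init()  # starts a local runtime; connects to a cluster if one is configured

@ray.remote
def square(x):
    return x * x

# Tasks are scheduled in parallel across available workers; ray.get blocks.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```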
Petastorm
Library enabling efficient access to large datasets stored in Apache Parquet format for distributed deep learning model training.
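A minimal sketch, assuming a Petastorm-materialized dataset already exists at the placeholder path:

```python
from petastorm import make_reader

# The URL is a placeholder; S3 or HDFS URLs work the same way.
with make_reader("file:///tmp/my_dataset") as reader:
    for row in reader:  # rows stream in without loading the dataset into memory
        print(row)
        break
```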
Dask-ML
Dask extension integrating scalable machine learning algorithms and parallelization tools for ML workflows on clusters.
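A short example clustering a larger-than-memory array with Dask-ML; the shapes and chunk sizes are arbitrary:

```python
import dask.array as da
from dask_ml.cluster import KMeans

# A 100k x 10 array split into chunks that can live on different workers.
X = da.random.random((100_000, 10), chunks=(10_000, 10))
km = KMeans(n_clusters=4)
km.fit(X)  # the computation is distributed across the Dask cluster
```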
Kubeflow
Open-source platform based on Kubernetes for deploying and managing complex ML pipelines at scale with containerized orchestration.
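A sketch of a two-step pipeline using the Kubeflow Pipelines SDK (kfp, v2-style API); the component logic is a stand-in:

```python
from kfp import dsl, compiler

@dsl.component
def preprocess(rows: int) -> int:
    return rows * 2  # stand-in for real preprocessing

@dsl.component
def train(rows: int) -> str:
    return f"trained on {rows} rows"

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(rows: int = 100):
    prep = preprocess(rows=rows)
    train(rows=prep.output)  # each step runs as a container on Kubernetes

# Produces a YAML spec that can be uploaded to a Kubeflow Pipelines cluster.
compiler.Compiler().compile(demo_pipeline, "pipeline.yaml")
```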
MLflow
Open-source platform for managing the complete lifecycle of ML projects, including tracking, model management, and reproducibility at scale.
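A minimal tracking example:

```python
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)  # hyperparameters
    mlflow.log_metric("val_accuracy", 0.93)  # metrics, per step or final
    # mlflow.sklearn.log_model(model, "model") would also version the artifact
```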
Feast
Open-source feature store providing an abstraction layer for managing, versioning, and serving features at scale.
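A sketch of online feature retrieval; the feature view "driver_stats", its field, and the entity key are hypothetical names assumed to be registered in the repository:

```python
from feast import FeatureStore

# Assumes a Feast repo (feature_store.yaml) in the current directory.
store = FeatureStore(repo_path=".")
features = store.get_online_features(
    features=["driver_stats:avg_daily_trips"],  # hypothetical feature view
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```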
Vertex AI
Google Cloud's unified platform for training, deploying, and managing ML models at scale with integrated AutoML and MLOps.
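A sketch of launching a custom training job with the Vertex AI SDK; the project, region, script, and container URI are all placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.CustomTrainingJob(
    display_name="demo-training",
    script_path="train.py",                     # local training script
    container_uri="gcr.io/my-project/trainer",  # placeholder training image
)
job.run(machine_type="n1-standard-4", replica_count=1)
```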
SageMaker
Fully managed AWS service for distributed training, deployment, and monitoring of ML models with automatic resource optimization.
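A sketch using the SageMaker Python SDK's PyTorch estimator; the IAM role, S3 path, versions, and instance settings are placeholders:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # local training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_count=2,                                     # two machines => distributed
    instance_type="ml.g4dn.xlarge",
    framework_version="2.1",
    py_version="py310",
)
estimator.fit({"training": "s3://my-bucket/train-data"})  # placeholder S3 input
```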
Sharding
Horizontal partitioning of data or model state across multiple nodes to enable parallel processing and reduce the load per machine.
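A toy illustration of assigning data shards to nodes; the round-robin scheme is just one possible layout:

```python
def shard(data, num_nodes):
    # Round-robin partitioning: shard i is processed by node i.
    return [data[i::num_nodes] for i in range(num_nodes)]

print(shard(list(range(10)), num_nodes=3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]] -- each node handles ~1/3 of the data
```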
Elastic Training
Ability to dynamically adjust the number of workers during training to optimize resource utilization and reduce costs.