Distributed Matrix Factorization

📖

Terms

Distributed Matrix Factorization

Family of algorithms that decompose a very large matrix into a product of smaller factor matrices, distributing both data and computation across a cluster of machines to overcome the memory and computing power limitations of a single node.

Distributed Alternating Least Squares (ALS)

Parallelized matrix factorization algorithm that alternates between the two matrix factors: it keeps one fixed, solves a (typically regularized) least squares problem for the other, then swaps their roles. With one factor fixed, each row of the other has its own independent least squares problem, which is why the method adapts naturally to distributed environments like Spark MLlib.
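
The alternation can be illustrated with a minimal sketch, assuming a rank-1 model so that each least squares problem has a one-line closed form. The toy `ratings` list, the helper `solve_side`, and the regularization value `lam` are all hypothetical; a real implementation would solve a small linear system per row.

```python
# Observed entries as (row, col, value); rank-1 model: r_ij ~ u[i] * v[j].
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0), (2, 1, 3.0)]
n_rows, n_cols, lam = 3, 2, 0.01  # lam is an assumed L2 regularization weight


def solve_side(ratings, fixed, n_free, free_axis, lam):
    """Solve the regularized least squares problem for one factor.

    With `fixed` held constant, every entry of the free factor has its own
    closed-form solution, so each could be computed on a different worker.
    """
    num = [0.0] * n_free
    den = [lam] * n_free
    for entry in ratings:
        free_idx = entry[free_axis]
        fixed_idx = entry[1 - free_axis]
        num[free_idx] += entry[2] * fixed[fixed_idx]
        den[free_idx] += fixed[fixed_idx] ** 2
    return [n / d for n, d in zip(num, den)]


def squared_error(ratings, u, v):
    return sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)


u = [1.0] * n_rows
v = [1.0] * n_cols
initial_error = squared_error(ratings, u, v)
for _ in range(20):
    u = solve_side(ratings, v, n_rows, 0, lam)  # update rows, columns fixed
    v = solve_side(ratings, u, n_cols, 1, lam)  # update columns, rows fixed
final_error = squared_error(ratings, u, v)
```

Because `solve_side` never reads the factor it is writing, the loop over entries can be split by row (or column) without any coordination inside a half-step.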

Distributed Stochastic Gradient Descent (SGD)

Parallel variant of stochastic gradient descent where the factorization parameters are updated asynchronously or synchronously across multiple data partitions, which requires a mechanism for handling conflicting or stale updates so that the algorithm still converges in a distributed context.
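
One way to avoid conflicting updates is to process, at any given time, only blocks that share no rows and no columns. The sketch below assumes that stratified scheme and simulates the workers sequentially; the toy data, the learning rate, and all function names are illustrative assumptions, not a reference implementation.

```python
import random

random.seed(0)
rank, n_users, n_items, n_workers = 2, 4, 4, 2

# Toy data: every (user, item) pair observed, ratings follow a rank-1 pattern.
ratings = [(u, i, float((u + 1) * (i + 1)) / 4.0)
           for u in range(n_users) for i in range(n_items)]

P = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
Q = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_items)]


def block_of(index, n_total):
    """Map a row or column index to one of n_workers contiguous blocks."""
    return index * n_workers // n_total


def sgd_on_block(block, lr=0.05, reg=0.01):
    """Plain SGD over one block; it only touches that block's rows and columns."""
    for u, i, r in block:
        err = r - sum(p * q for p, q in zip(P[u], Q[i]))
        for k in range(rank):
            p, q = P[u][k], Q[i][k]
            P[u][k] += lr * (err * q - reg * p)
            Q[i][k] += lr * (err * p - reg * q)


def rmse():
    se = sum((r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


rmse_before = rmse()
for epoch in range(200):
    for shift in range(n_workers):
        # Blocks (w, (w + shift) % n_workers) share no rows or columns, so the
        # n_workers calls below could run concurrently; here they run in turn.
        for w in range(n_workers):
            col_block = (w + shift) % n_workers
            block = [(u, i, r) for u, i, r in ratings
                     if block_of(u, n_users) == w
                     and block_of(i, n_items) == col_block]
            sgd_on_block(block)
rmse_after = rmse()
```

A fully asynchronous variant would drop the block schedule and let workers overwrite shared factors, trading this conflict-free guarantee for fewer synchronization points.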

MapReduce for Factorization

Programming model that splits each iteration of a matrix factorization algorithm into two main stages: a 'Map' stage for local computations on data fragments and a 'Reduce' stage that aggregates the partial results and updates the matrix factors, used notably in Hadoop implementations.
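
As a rough illustration of the two stages, the sketch below expresses one least squares half-step for a rank-1 model as a map over data fragments followed by a reduce by key. It runs in a single process; `map_phase`, `reduce_phase`, and the fragment layout are hypothetical names chosen for this example, not part of Hadoop's API.

```python
from collections import defaultdict


def map_phase(fragment, v):
    """Emit (row, (numerator, denominator)) pairs for one data fragment."""
    for i, j, r in fragment:
        yield i, (r * v[j], v[j] ** 2)


def reduce_phase(pairs, lam):
    """Sum the partial results per row and produce the updated row factor."""
    sums = defaultdict(lambda: [0.0, lam])
    for key, (num, den) in pairs:
        sums[key][0] += num
        sums[key][1] += den
    return {key: num / den for key, (num, den) in sums.items()}


fragments = [
    [(0, 0, 4.0), (1, 0, 2.0)],  # fragment held by the first mapper
    [(0, 1, 2.0), (1, 1, 1.0)],  # fragment held by the second mapper
]
v = [2.0, 1.0]  # column factor, fixed during this half-step

emitted = [pair for fragment in fragments for pair in map_phase(fragment, v)]
u = reduce_phase(emitted, lam=0.0)
```

Each mapper only needs its own fragment plus the fixed factor `v`, and the reducer only sees small per-row sums, which is what keeps the shuffle volume manageable.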

Spark MLlib ALS

Optimized, distributed implementation of the Alternating Least Squares algorithm within Spark's machine learning library, designed for large-scale matrix factorization. It is available through both the RDD-based and the DataFrame-based APIs and benefits from Spark keeping data in memory across iterations.

Matrix Partitioning

Strategy for splitting a massive matrix into sub-blocks (by rows, by columns, or by square blocks) distributed across cluster nodes, a crucial choice that directly impacts workload, inter-node communication, and overall performance of factorization algorithms.
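
A minimal sketch of block partitioning, assuming the matrix is stored as sparse `(row, col, value)` triples and the cluster is arranged as a `grid_rows x grid_cols` grid. The function name and the contiguous-range assignment are illustrative choices; hashing the indices is another common option.

```python
from collections import defaultdict


def partition_by_block(entries, n_rows, n_cols, grid_rows, grid_cols):
    """Assign each entry to the grid block that owns its row and column range."""
    blocks = defaultdict(list)
    for i, j, value in entries:
        block_row = i * grid_rows // n_rows
        block_col = j * grid_cols // n_cols
        blocks[(block_row, block_col)].append((i, j, value))
    return dict(blocks)


entries = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
           (2, 2, 4.0), (3, 0, 5.0), (3, 3, 6.0)]
blocks = partition_by_block(entries, n_rows=4, n_cols=4,
                            grid_rows=2, grid_cols=2)
```

Setting `grid_cols=1` gives a pure row partitioning and `grid_rows=1` a pure column partitioning, so the same helper covers the three layouts mentioned above.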

Consistency Model

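Rules defining when updates to the matrix factors become visible to the other cluster nodes. The options range from strong consistency (the Bulk Synchronous Parallel, or BSP, model), where all nodes see the same factors at each iteration so the algorithm behaves like its sequential counterpart at the cost of synchronization latency, to weak consistency (asynchronous model), which speeds up iterations but lets nodes work on stale values and may compromise stability.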

Online Matrix Factorization

Distributed approach suitable for continuous data streams, where the factorization model is updated incrementally as new observations arrive without requiring complete retraining on historical data, often implemented with distributed variants of SGD.
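
The incremental update can be sketched as a small class that applies one SGD step per incoming observation and creates factor vectors lazily for users or items it has not seen before. This is a single-process illustration under assumed hyperparameters; the class name `OnlineMF` and the toy stream are hypothetical.

```python
import random


class OnlineMF:
    """Keeps factor vectors in dictionaries and updates them one event at a time."""

    def __init__(self, rank=2, lr=0.05, reg=0.01, seed=0):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.P, self.Q = {}, {}

    def _vector(self, table, key):
        # New users and items get a small random vector on first sight.
        if key not in table:
            table[key] = [self.rng.uniform(0.1, 0.5) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        p, q = self._vector(self.P, user), self._vector(self.Q, item)
        return sum(a * b for a, b in zip(p, q))

    def update(self, user, item, rating):
        """Apply one SGD step for a single observation and return its error."""
        p, q = self._vector(self.P, user), self._vector(self.Q, item)
        err = rating - sum(a * b for a, b in zip(p, q))
        for k in range(self.rank):
            pk, qk = p[k], q[k]
            p[k] += self.lr * (err * qk - self.reg * pk)
            q[k] += self.lr * (err * pk - self.reg * qk)
        return err


model = OnlineMF()
stream = [("ann", "book", 4.0), ("bob", "book", 2.0), ("ann", "film", 2.0)] * 100
first_error = abs(model.update(*stream[0]))
for event in stream[1:]:
    model.update(*event)
last_error = abs(4.0 - model.predict("ann", "book"))
```

No pass over historical data is needed: the model state after each event is simply the two dictionaries, which a distributed variant would shard by key.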

Parametric Distributed Matrix Factorization

Advanced method where matrix factors are not learned directly but are generated by shared and distributed parametric functions (e.g., neural networks), thereby reducing the amount of data to communicate between nodes and improving generalization capability.

Stragglers (Slow Nodes)

Phenomenon in distributed systems where some machines execute their computation tasks much more slowly than others, delaying the entire synchronous factorization process; techniques like speculative execution (relaunching slow tasks on other nodes) or delay-tolerant algorithms are designed to mitigate their impact.

Distributed Non-Negative Matrix Factorization (NMF)

Distributed extension of non-negative matrix factorization, where the non-negativity constraints on the factors are enforced through update rules (multiplicative updates or projected gradient steps) adapted for parallel execution, often used for large-scale text clustering.
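
The multiplicative variant can be sketched as follows for the squared reconstruction error, written in plain Python on a tiny matrix. The helper names, the rank, and the small constant `eps` guarding against division by zero are assumptions of this example.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def multiplicative_step(V, W, H, eps=1e-9):
    """One round of multiplicative updates; factors stay non-negative because
    every update multiplies a non-negative entry by a non-negative ratio."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(H, num, den)]
    Ht = transpose(H)
    # Row i of W only needs row i of V plus the small matrix H * Ht,
    # so this update can be split across workers by row block.
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num, den)]
    return W, H


rng = random.Random(0)
V = [[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [1.0, 1.0, 5.0], [0.0, 1.0, 4.0]]
rank = 2
W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in V]
H = [[rng.uniform(0.1, 1.0) for _ in V[0]] for _ in range(rank)]

error_before = frobenius_error(V, W, H)
for _ in range(50):
    W, H = multiplicative_step(V, W, H)
error_after = frobenius_error(V, W, H)
```

In the update of `H`, the products `Wt * V` and `Wt * W` are sums over rows, so each worker can compute them on its own row block and a reduction adds the partial results.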

Checkpointing in Iterative Algorithms

Technique of periodically saving the state of matrix factors to reliable storage (e.g., HDFS) during the iterations of a distributed algorithm, allowing the computation to resume from an intermediate point in case of node failure and avoiding restarting from scratch.
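
A minimal sketch of the save-and-resume cycle, assuming a local JSON file stands in for reliable storage such as HDFS and a trivial scaling step stands in for a real factor update. The function names, the file name, and the `fail_at` parameter used to simulate a node failure are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, iteration, factors):
    """Write to a temporary file, then rename, so a crash never leaves a
    half-written checkpoint at `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"iteration": iteration, "factors": factors}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["iteration"], state["factors"]


def run(path, total_iterations, checkpoint_every, fail_at=None):
    """Resume from the last checkpoint if there is one, else start fresh."""
    start, factors = load_checkpoint(path)
    if factors is None:
        factors = {"u": [1.0, 1.0], "v": [1.0, 1.0]}
    for iteration in range(start + 1, total_iterations + 1):
        if iteration == fail_at:
            raise RuntimeError("simulated node failure")
        # Stand-in for a real factor update.
        factors = {name: [0.9 * x for x in values]
                   for name, values in factors.items()}
        if iteration % checkpoint_every == 0:
            save_checkpoint(path, iteration, factors)
    return start, factors


checkpoint_path = os.path.join(tempfile.mkdtemp(), "factors.json")
try:
    run(checkpoint_path, total_iterations=10, checkpoint_every=3, fail_at=8)
except RuntimeError:
    pass  # the first attempt dies before iteration 8
resumed_from, final_factors = run(checkpoint_path, total_iterations=10,
                                  checkpoint_every=3)
```

The second call restarts from iteration 6, the last checkpoint written before the failure, so only iteration 7 is recomputed. A larger `checkpoint_every` lowers the I/O cost but increases the work lost per failure.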

Distributed Tensor Factorization

Generalization of matrix factorization to tensors (multi-dimensional arrays) in a distributed context, used to model data with more than two modes (e.g., users, items, time) and requiring specific parallel algorithms such as distributed PARAFAC (also called CP) or Tucker decomposition.
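
For the PARAFAC case, a three-mode tensor entry is modeled as a sum over components of the product of one value from each mode's factor. The sketch below shows only that reconstruction step on made-up rank-2 factors; the matrices `A`, `B`, `C` and the function name are illustrative.

```python
def cp_entry(A, B, C, i, j, k):
    """Reconstruct entry (i, j, k) of a three-mode tensor from its factors."""
    return sum(a * b * c for a, b, c in zip(A[i], B[j], C[k]))


A = [[1.0, 2.0], [0.5, 1.0]]  # user factors
B = [[1.0, 0.0], [2.0, 1.0]]  # item factors
C = [[1.0, 1.0], [3.0, 0.5]]  # time factors

value = cp_entry(A, B, C, i=0, j=1, k=1)  # 1*2*3 + 2*1*0.5
```

Each entry depends on exactly one row from each factor matrix, so observed entries can be partitioned across nodes in the same way as matrix entries, with one extra index to account for.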

Distributed Loss Function

Computation of the matrix factorization reconstruction error in a partitioned manner: each node evaluates the loss on its own data subset, and a global reduction step then sums the partial losses into the total loss, which guides model updates in either a centralized or a decentralized manner.
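
The two steps can be sketched as a per-partition function followed by a sum. The partitions, the factor values, and the name `local_loss` are assumptions for illustration; on a cluster the final `sum` would be an all-reduce or a reduction to the driver.

```python
def local_loss(partition, U, V):
    """Squared reconstruction error over the entries held by one node."""
    return sum((r - sum(a * b for a, b in zip(U[i], V[j]))) ** 2
               for i, j, r in partition)


U = [[1.0, 0.0], [0.0, 1.0]]  # row factors
V = [[2.0, 1.0], [1.0, 3.0]]  # column factors

partitions = [
    [(0, 0, 3.0), (0, 1, 1.0)],  # entries on the first node
    [(1, 0, 0.0), (1, 1, 5.0)],  # entries on the second node
]

partial_losses = [local_loss(p, U, V) for p in partitions]  # local step
total_loss = sum(partial_losses)                            # global reduction
```

Only one number per node crosses the network, so the cost of evaluating the loss is dominated by the local pass over the data.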

Distributed Regularization

Application of penalties (such as L2 norm) on matrix factors to prevent overfitting, where the regularization term is computed locally on each node and aggregated during global parameter updates, ensuring consistent regularization across the cluster.

Spark GraphX for Factorization

Use of Spark's GraphX graph processing API to model the matrix as a bipartite graph (users-items) and execute factorization algorithms based on message passing between graph nodes, offering an alternative to DataFrame-based implementations.

AI Glossary