Distributed Computing Models

MapReduce

Parallel programming model for processing large datasets on clusters. Processing is split into two main phases: Map, which filters and transforms input records into key-value pairs, and Reduce, which aggregates the values grouped under each key.
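As a rough illustration, the two phases can be sketched in plain Python on a single machine with the classic word-count example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the intermediate grouping step is an assumption about how a framework would route map output to reducers; a real cluster runs each phase on many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, grouping, and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both lines.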

Lambda Architecture

Data processing architecture combining a batch layer for comprehensive analysis and a speed layer for real-time results, with a serving layer that merges both views to answer queries.
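The merge step can be pictured with a minimal sketch, assuming both views hold additive counters keyed by page. The names `batch_view`, `speed_view`, and `serving_layer_query` are hypothetical, and the sample numbers are made up for illustration.

```python
def serving_layer_query(batch_view, speed_view, key):
    """Answer a query by adding recent increments to the precomputed total."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Totals computed by the last full batch run (illustrative values).
batch_view = {"page_a": 1000, "page_b": 250}
# Increments seen since that batch run (illustrative values).
speed_view = {"page_a": 7, "page_c": 3}
```

Summation only works because counts are additive; other metrics, such as distinct users, would need a different merge rule.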

Kappa Architecture

Simplification of the Lambda architecture that uses a single stream processing pipeline: data is processed in real time, and historical queries are answered by replaying events from the log.
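A minimal sketch of the replay idea, assuming the events are kept in an ordered, append-only list: the same function serves live processing and historical queries, differing only in the offset it starts from. The names `event_log` and `build_view` and the event fields are hypothetical.

```python
def build_view(event_log, from_offset=0):
    """Fold events into per-user totals, starting at the given log offset."""
    totals = {}
    for event in event_log[from_offset:]:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

# Append-only log of events (illustrative values).
event_log = [
    {"user": "ann", "amount": 5},
    {"user": "bob", "amount": 2},
    {"user": "ann", "amount": 1},
]
```

Replaying from offset 0 rebuilds the full history, while a later offset yields a view of recent events only.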

Batch Processing

Processing mode where data is collected and processed in batches at predefined intervals, optimized for throughput rather than latency, typical of traditional ETL jobs.

Stream Processing

Continuous processing of data in motion as it is generated, enabling real-time analysis with minimal latency between capture and processing.
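To make the contrast with batch processing concrete, here is a small sketch in which a result is emitted after every incoming record instead of once per batch. The generator name `running_average` is hypothetical, and an in-memory iterable stands in for a real data stream.

```python
def running_average(readings):
    """Yield the updated average after every incoming reading."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        # A result is available immediately, without waiting for the input to end.
        yield total / count
```

Because it is a generator, the function works on unbounded inputs and keeps only two numbers of state.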

Distributed File System

File system that stores data across multiple servers while appearing as a single system to users, relying on replication to provide fault tolerance and reliability.

HDFS

Hadoop Distributed File System: distributed file system designed to store petabytes of data on commodity hardware, achieving high fault tolerance through block replication.
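Block replication can be sketched as follows. The 128 MB block size and replication factor of 3 match common HDFS defaults, but the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware), and `place_blocks` is a hypothetical helper, not part of any Hadoop API.

```python
import math

def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into blocks and assign each block to distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Simplified round-robin: consecutive nodes, so replicas never share a node.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement
```

With this layout, losing any single node still leaves two copies of every block it held.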

YARN

Yet Another Resource Negotiator: Hadoop's resource manager, which separates resource management from data processing so that multiple frameworks can run on the same cluster.

RDD

Resilient Distributed Dataset: fundamental data structure of Spark, representing an immutable, partitioned collection of objects that can be processed in parallel, with automatic fault tolerance through recomputation from lineage.
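The idea can be illustrated with a toy model in plain Python. `ToyRDD` is a hypothetical class written for this glossary, not Spark's API: it only shows how an immutable, partitioned collection that records the function it was derived with can rebuild any single partition on demand.

```python
class ToyRDD:
    """Toy partitioned collection that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # list of partitions, set only on the root
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # function applied to each parent element

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, fn):
        """Return a new dataset; the original is left untouched."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Build one partition by walking back through the derivation chain."""
        if self._parent is None:
            return list(self._source[index])
        return [self._fn(x) for x in self._parent.compute(index)]

    def collect(self):
        """Compute every partition and concatenate the results."""
        return [x for i in range(self.num_partitions) for x in self.compute(i)]
```

For example, `ToyRDD(source=[[1, 2], [3, 4]]).map(lambda x: x * 2)` can recompute partition 1 alone via `compute(1)`, which is the property a lost partition relies on.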

Data Locality

Distributed computing principle where tasks are scheduled on the nodes that already hold the data they need, minimizing network transfer and improving performance.
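A locality-aware scheduler might look roughly like the sketch below, which prefers a free node that already stores the task's input block and falls back to any free node otherwise. The function `schedule` and its argument shapes are hypothetical; real schedulers also weigh rack distance and waiting time.

```python
def schedule(tasks, block_locations, free_nodes):
    """Assign each task to a free node, preferring one that holds its block."""
    free = list(free_nodes)
    assignment = {}
    for task, block in tasks.items():
        if not free:
            break  # no capacity left; remaining tasks stay unscheduled
        local = [n for n in free if n in block_locations.get(block, ())]
        node = local[0] if local else free[0]
        # Record the chosen node and whether the read will be local.
        assignment[task] = (node, bool(local))
        free.remove(node)
    return assignment
```

A task whose block lives only on a busy or unknown node is still scheduled, but its input must then travel over the network.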

Speculative Execution

Mechanism for mitigating slow tasks (stragglers): copies of a slow task are launched on other nodes and the first result to complete is used, reducing the impact of faulty or overloaded nodes.
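The behavior can be approximated on one machine with threads standing in for nodes. In this sketch a backup attempt is launched only if the primary has not finished within `slow_after` seconds, and whichever attempt finishes first wins. All names and the threshold value are assumptions for illustration; real frameworks decide slowness by comparing a task's progress with that of its peers.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_task(node, delay):
    """Simulate a task that takes `delay` seconds on the given node."""
    time.sleep(delay)
    return f"result from {node}"

def run_with_speculation(primary_delay, backup_delay, slow_after=0.05):
    """Return the first result, launching a backup if the primary is slow."""
    pool = ThreadPoolExecutor(max_workers=2)
    attempts = {pool.submit(run_task, "primary", primary_delay)}
    done, _ = wait(attempts, timeout=slow_after)
    if not done:
        # The primary looks slow: start a speculative copy and race the two.
        attempts.add(pool.submit(run_task, "backup", backup_delay))
        done, _ = wait(attempts, return_when=FIRST_COMPLETED)
    # Do not block on the losing attempt; a real scheduler would kill it.
    pool.shutdown(wait=False)
    return next(iter(done)).result()
```

With a primary that takes 0.5 s and a backup that takes 0.01 s, the backup's result is returned; a fast primary never triggers a backup at all.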

DAG

Directed Acyclic Graph: representation of a Spark workflow in which transformations form a graph with directed edges and no cycles, letting the scheduler optimize and parallelize the execution of steps.
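How a DAG exposes parallelism can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The workflow below and the helper `execution_stages` are hypothetical; this does not reproduce Spark's scheduler, it only groups steps whose dependencies are all satisfied.

```python
from graphlib import TopologicalSorter

def execution_stages(dependencies):
    """Group steps into stages; steps in the same stage can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages

# Each step maps to the set of steps it depends on (illustrative workflow).
workflow = {
    "filter": {"read"},
    "map": {"read"},
    "join": {"filter", "map"},
}
```

Here `filter` and `map` land in the same stage because both depend only on `read`, while `join` must wait for both.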

Fault Tolerance

Ability of a distributed system to continue functioning correctly in case of component failures, typically through redundancy, replication, and automatic recovery mechanisms.

Consistency Model

Contract defining the data consistency guarantees of a distributed system, ranging from strong consistency to eventual consistency, chosen according to application needs.

Combiner

MapReduce optimization function executed locally on each mapper's output to reduce the volume of data transferred during the shuffle, applying pre-aggregation before the reduce phase.
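The saving can be seen in a small sketch of one mapper in a word-count job: without a combiner it would send every emitted pair across the network, whereas with local pre-aggregation it sends one pair per distinct word. The function `map_with_combiner` is hypothetical and returns both forms so they can be compared.

```python
from collections import Counter

def map_with_combiner(lines):
    """Run the map step on one mapper and pre-aggregate its output locally."""
    # Raw map output: one (word, 1) pair per word occurrence.
    emitted = [(word, 1) for line in lines for word in line.split()]
    # Combiner: sum the counts per word before anything leaves the mapper.
    combined = Counter()
    for word, count in emitted:
        combined[word] += count
    return emitted, sorted(combined.items())
```

This works for sums because addition is associative and commutative; an operation such as a plain average cannot be combined this way without carrying extra state.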
