Apache Spark

Open-source distributed processing framework that keeps working data in memory where possible and runs computations in parallel across a cluster to speed up Big Data analytics.

RDD (Resilient Distributed Dataset)

Spark's fundamental data abstraction: an immutable, partitioned collection whose lost partitions can be recomputed from their lineage (the recorded chain of transformations), which is how it provides fault tolerance.
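
The recovery idea can be illustrated with a toy model in plain Python. This is a sketch only and not the Spark API: `ToyRDD`, `compute_partition`, and the sample data are hypothetical names invented for the illustration. It shows an immutable dataset that records the functions applied to it and rebuilds a single lost partition by replaying them.

```python
# Toy model of lineage-based recovery. NOT the Spark API: all names are hypothetical.
class ToyRDD:
    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions   # immutable input: one tuple per partition
        self._lineage = lineage            # functions applied to every element, in order

    def map(self, fn):
        # Immutability: return a new ToyRDD with an extended lineage.
        return ToyRDD(self._source, self._lineage + (fn,))

    def compute_partition(self, index):
        # Rebuild one partition from the source by replaying the lineage.
        values = self._source[index]
        for fn in self._lineage:
            values = tuple(fn(v) for v in values)
        return values


source = ((1, 2), (3, 4), (5, 6))
transformed = ToyRDD(source).map(lambda x: x * 2).map(lambda x: x + 1)

# Materialize all partitions, then simulate the loss of partition 1.
materialized = {i: transformed.compute_partition(i) for i in range(3)}
del materialized[1]

# Only the lost partition is recomputed; the others are left untouched.
materialized[1] = transformed.compute_partition(1)
```

Because the source data and the lineage never change, recomputing a partition always yields the same values as the original computation.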

DataFrame

Distributed data collection organized into named columns, similar to a database table, optimized for structured queries.

Spark SQL

Spark module integrating SQL queries and DataFrame operations with automatic optimization via the Catalyst Optimizer.

Spark Streaming

Spark extension for processing live data streams as a series of small batches (micro-batches), giving near-real-time latency.

MLlib

Spark's distributed machine learning library providing classification, regression, clustering, and recommendation algorithms.

GraphX

Spark API for distributed graph processing, built on RDDs: a graph is represented as distributed collections of vertices and edges and processed with graph-parallel operations.

DAG (Directed Acyclic Graph)

Representation of a Spark job's execution plan: a graph of transformations that the scheduler splits into stages and runs in parallel wherever dependencies allow.
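
As a rough illustration of how a plan is cut into stages, the sketch below starts a new stage at every operation that needs a shuffle. It is a deliberate simplification in plain Python, not Spark's scheduler: the real scheduler works on a dependency graph rather than a linear list, and `WIDE_OPERATIONS` and `split_into_stages` are hypothetical names chosen for this example.

```python
# Simplified stage splitting over a linear pipeline. NOT Spark's scheduler.
# Assumption: these operation names stand in for "wide" (shuffle-requiring) steps.
WIDE_OPERATIONS = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(operations):
    stages, current = [], []
    for op in operations:
        # A wide operation reads shuffled data, so it begins a new stage.
        if op in WIDE_OPERATIONS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(pipeline)
```

Here the narrow steps before `reduceByKey` are grouped into one stage, and the steps from `reduceByKey` onward form the next.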

Spark Driver

Main process of a Spark application: it creates the SparkContext, divides jobs into stages and tasks, and schedules those tasks on the executors.

Spark Executor

Worker process, launched on the cluster's nodes, that runs the tasks assigned by the Driver and holds the data partitions it works on or caches in its own memory and local disk.

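SparkContext

Entry point to Spark's core (RDD) functionality: it represents the application's connection to the cluster and coordinates access to distributed resources. Since Spark 2.0, SparkSession wraps it as the entry point for DataFrame and SQL work.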

Partition

Logical unit of data distribution in Spark, enabling parallelism by dividing RDDs/DataFrames into independent fragments.
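
A minimal plain-Python sketch of the idea, assuming a round-robin split and a thread pool standing in for executors (neither is how Spark itself assigns or schedules partitions; `split_into_partitions` and `sum_partition` are hypothetical helpers):

```python
# Toy illustration of partition-level parallelism. NOT the Spark API.
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    # Round-robin split: element i goes to partition i % num_partitions.
    return [data[i::num_partitions] for i in range(num_partitions)]


def sum_partition(partition):
    # Each partition is processed independently of the others.
    return sum(partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# The thread pool plays the role of executors running one task per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum_partition, partitions))

total = sum(partial_sums)
```

Each partial result depends only on its own fragment, which is what lets the work be spread across workers and combined at the end.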

Shuffle

Costly redistribution of data across partitions, usually over the network, required by operations such as aggregations, joins, and groupings that need all records with the same key in one place.
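
The sketch below models a shuffle in plain Python: records are re-bucketed by a hash of their key so that every record for a given key lands in the same output partition, after which a per-key aggregation can run locally. It is a toy under stated assumptions, not Spark's implementation; `partition_for` and `shuffle` are hypothetical names, and `zlib.crc32` is used only to get a hash that is stable between runs.

```python
# Toy hash-partitioned shuffle. NOT Spark's implementation.
import zlib


def partition_for(key, num_partitions):
    # Deterministic hash of the key, mapped onto the output partitions.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_partitions):
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            # In a cluster, this append is the expensive step: a network transfer.
            output[partition_for(key, num_partitions)].append((key, value))
    return output


input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
shuffled = shuffle(input_partitions, 2)

# After the shuffle, each key lives in exactly one partition,
# so sums per key can be computed partition by partition.
totals = {}
for partition in shuffled:
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
```

Every input record may have to move to a different partition, which is why shuffles dominate the cost of many jobs.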

Catalyst Optimizer

Spark SQL's query optimization engine, which rewrites and reorganizes query plans before execution to improve performance.

Tungsten

Spark execution backend that improves memory and CPU efficiency by storing data in a compact binary format, managing memory explicitly, and generating bytecode at runtime.

Cache/Persist

Mechanism for persisting RDDs/DataFrames in memory or on disk for fast reuse and to avoid costly recalculations.

Broadcast Variable

Read-only variable shipped once to each executor instead of with every task, reducing network transfer, for example when joining a large dataset against a small lookup table.
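
The join use case can be sketched in plain Python as follows. This is a toy, not the Spark API: the small lookup table `country_names` plays the role of the broadcast value, and `join_partition` and the sample data are hypothetical. The large side stays in its partitions and is never moved.

```python
# Toy "broadcast join": each partition is joined locally against a
# small read-only lookup table. NOT the Spark API.
country_names = {"IT": "Italy", "FR": "France"}  # small table, copied to every worker

order_partitions = [
    [(1, "IT"), (2, "FR")],
    [(3, "IT"), (4, "DE")],
]


def join_partition(partition, lookup):
    # The lookup is only read, never modified, so sharing it is safe.
    return [(order_id, lookup.get(code, "unknown")) for order_id, code in partition]


joined = [join_partition(p, country_names) for p in order_partitions]
```

No record of the large dataset changes partition, so the join avoids a shuffle at the cost of holding a copy of the small table on every worker.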

Accumulator

Shared variable that tasks can only add to, used to aggregate counters or sums from parallel tasks safely; only the driver reads its value.

Transformation

Lazy operation that defines a new RDD/DataFrame from an existing one; nothing is computed until an action triggers execution.

Action

Operation that triggers execution of the DAG to produce a result, forcing computation of all the transformations it depends on.
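
The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is a sketch, not the Spark API: `LazyDataset` and `traced_square` are hypothetical, and `collect` here merely mimics the role of an action. The `calls` list records when the mapped function actually runs.

```python
# Toy model of lazy transformations and an eager action. NOT the Spark API.
class LazyDataset:
    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded plan: ("map" | "filter", function) pairs

    def map(self, fn):
        # Transformation: only extends the plan, computes nothing.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: only extends the plan, computes nothing.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: runs every recorded step and returns the result.
        result = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def traced_square(x):
    calls.append(x)  # record each invocation to show when work happens
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet

result = pipeline.collect()       # the action triggers the whole plan
```

Building `pipeline` does no work at all; the squares are only computed once `collect` is called.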
