
AI Glossary

The complete dictionary of Artificial Intelligence

162 categories · 2,032 subcategories · 23,060 terms

Apache Spark

Open-source distributed processing framework that accelerates Big Data analytics through in-memory computation and optimized parallel execution.

RDD (Resilient Distributed Dataset)

Fundamental data structure of Spark, immutable and partitioned, providing fault tolerance by reconstructing lost partitions from their recorded lineage rather than by replicating data.

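The lineage idea can be illustrated with a plain-Python sketch (an analogy, not Spark's actual implementation): instead of backing up a derived partition, we record the chain of transformations that produced it and replay that chain to rebuild it.

```python
# Conceptual sketch (plain Python, not Spark): a lost partition is not
# restored from a replica; the recorded lineage is replayed instead.

def rebuild(source_partition, lineage):
    """Recompute a partition by replaying its recorded transformations."""
    data = source_partition
    for transform in lineage:
        data = [transform(x) for x in data]
    return data

source = [1, 2, 3, 4]
lineage = [lambda x: x * 10, lambda x: x + 1]

derived = rebuild(source, lineage)      # [11, 21, 31, 41]

# If `derived` is lost (e.g. an executor dies), replaying the same
# lineage over the surviving source reproduces it exactly.
recovered = rebuild(source, lineage)
assert recovered == derived
```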

DataFrame

Distributed data collection organized into named columns, similar to a database table, optimized for structured queries.


Spark SQL

Spark module integrating SQL queries and DataFrame operations with automatic optimization via the Catalyst Optimizer.


Spark Streaming

Spark extension enabling real-time data stream processing with micro-batches for near-real-time latency.


MLlib

Spark's distributed machine learning library providing classification, regression, clustering, and recommendation algorithms.


GraphX

Spark API for distributed graph processing, combining graph abstractions (vertices, edges, and graph algorithms) with the scalability and fault tolerance of RDDs.

DAG (Directed Acyclic Graph)

Representation of a Spark job's execution plan as a graph of transformations, optimized to eliminate redundant work and to parallelize processing.

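The acyclic-dependency idea can be sketched with Python's standard-library `graphlib`; the stage names below are hypothetical, chosen only to mimic a small Spark plan:

```python
# Tiny hypothetical stage graph: each key depends on the stages in its set.
from graphlib import TopologicalSorter

deps = {
    "map":     {"read"},
    "filter":  {"map"},
    "join":    {"filter", "read2"},
    "collect": {"join"},
}

# A topological order schedules every stage after all of its dependencies,
# which is exactly what a DAG scheduler needs to parallelize safely.
order = list(TopologicalSorter(deps).static_order())

assert order.index("join") > order.index("filter")
assert order[-1] == "collect"
```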

Spark Driver

Main process coordinating Spark task execution, creating the SparkContext and dividing operations into stages.


Spark Executor

Worker process executing tasks assigned by the Driver on each cluster node, managing memory and partitioned data.


Spark Context

Main entry point of a Spark application (now usually obtained through a SparkSession), managing the cluster connection and coordinating access to distributed resources.

Partition

Logical unit of data distribution in Spark, enabling parallelism by dividing RDDs/DataFrames into independent fragments.

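A minimal plain-Python sketch of the idea (not Spark code): split a dataset into independent fragments, then process each fragment in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal, independent fragments."""
    k, r = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        parts.append(data[start:end])
        start = end
    return parts

data = list(range(10))
parts = partition(data, 3)      # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Each partition is processed independently, here by a thread pool
# standing in for Spark tasks on executors.
with ThreadPoolExecutor() as pool:
    sums = list(pool.map(sum, parts))

assert sum(sums) == sum(data)
```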

Shuffle

Costly data redistribution operation between partitions, necessary during aggregations, joins, or groupings in Spark.

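Why a shuffle moves data can be sketched in plain Python (an analogy, not Spark internals): a groupBy needs all records with the same key in the same partition, so records are rerouted by a hash of their key.

```python
# Conceptual sketch: records start in arbitrary input partitions and must
# be redistributed so each key lives in exactly one output partition --
# that network-heavy rerouting is the shuffle.

def shuffle(input_partitions, num_output_partitions):
    """Redistribute (key, value) records by hash of key."""
    out = [[] for _ in range(num_output_partitions)]
    for part in input_partitions:
        for key, value in part:
            out[hash(key) % num_output_partitions].append((key, value))
    return out

inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
outputs = shuffle(inputs, 2)

# After the shuffle, all records for a given key share one partition.
for key in ("a", "b", "c"):
    homes = {i for i, p in enumerate(outputs) for k, _ in p if k == key}
    assert len(homes) == 1
```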

Catalyst Optimizer

Spark query optimization engine transforming and reorganizing execution plans to improve performance.


Tungsten

Spark execution backend that optimizes memory and CPU usage through off-heap binary data management and runtime bytecode generation.

Cache/Persist

Mechanism for persisting RDDs/DataFrames in memory or on disk so they can be reused quickly, avoiding costly recomputation.

Broadcast Variable

Read-only variable efficiently distributed to all executors to minimize network transfers during joins.

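The map-side join enabled by broadcasting can be sketched in plain Python (a conceptual analogy; the country codes are made up for illustration): a small lookup table is copied to every worker, so each partition of the large dataset joins locally without any shuffle.

```python
# Conceptual sketch: `small` plays the broadcast variable, copied to
# every task; the large dataset stays put and is joined in place.

small = {"FR": "France", "DE": "Germany"}            # broadcast side
large_partitions = [[("FR", 10)], [("DE", 20), ("FR", 5)]]

def local_join(partition, lookup):
    # No data moves between partitions: each task reads its own copy.
    return [(code, qty, lookup[code]) for code, qty in partition]

joined = [local_join(p, small) for p in large_partitions]
assert joined[0] == [("FR", 10, "France")]
```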

Accumulator

Additive shared variable used to aggregate information from parallel tasks in a thread-safe manner.

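The add-only, safely merged semantics can be sketched with threads in plain Python (a lock stands in for Spark's internal merging of per-task values; this is an analogy, not Spark's implementation):

```python
import threading

# Conceptual sketch: tasks only ever add to the accumulator; the
# driver reads the final aggregated value.

class Accumulator:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, n):
        with self._lock:          # keeps concurrent additions safe
            self.value += n

acc = Accumulator()
tasks = [threading.Thread(target=acc.add, args=(i,)) for i in range(5)]
for t in tasks: t.start()
for t in tasks: t.join()

assert acc.value == 10            # 0 + 1 + 2 + 3 + 4
```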

Transformation

Lazy operation creating a new RDD/DataFrame without immediate execution, deferred until a triggering action.


Action

Operation triggering the execution of the DAG plan to produce a result, forcing the computation of all previous transformations.

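The transformation/action split has a close plain-Python analogy (not Spark code): a generator pipeline is lazy like a chain of transformations, and materializing it plays the role of an action that finally triggers execution.

```python
# Conceptual analogy: building the pipeline does no work; consuming
# it (the "action") forces every deferred step to run.

log = []

def traced(x):
    log.append(x)            # records when work actually happens
    return x * 2

pipeline = (traced(x) for x in range(3))   # "transformation": lazy
assert log == []                            # nothing has run yet

result = list(pipeline)                     # "action": triggers execution
assert result == [0, 2, 4]
assert log == [0, 1, 2]
```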