Apache Spark

Open-source distributed processing framework that keeps working data in memory where possible, accelerating big data analytics through optimized parallel execution.

RDD (Resilient Distributed Dataset)

Fundamental data structure of Spark: an immutable, partitioned collection that achieves fault tolerance by recomputing lost partitions from the recorded chain of transformations (its lineage).
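
As a rough illustration of how immutability and reconstruction fit together, here is a toy model in plain Python. It is not Spark's API: `ToyRDD` and `compute_partition` are hypothetical names chosen for this sketch. The only idea taken from the definition is that each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Immutable toy dataset: either source partitions, or a parent plus a function."""
    source: Optional[tuple] = None      # tuple of tuples, one inner tuple per partition
    parent: Optional["ToyRDD"] = None   # dataset this one was derived from
    fn: Optional[Callable] = None       # element-wise function applied to the parent

    def map(self, fn):
        # Returns a new object; the original is never modified.
        return ToyRDD(parent=self, fn=fn)

    def compute_partition(self, index):
        # Rebuild a single partition by walking the lineage back to the source.
        if self.source is not None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]


base = ToyRDD(source=((1, 2), (3, 4)))
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

# Pretend the node holding partition 1 failed: only that partition is recomputed.
recovered = plus_one.compute_partition(1)
```

Nothing here stores intermediate results, which is why the lineage alone is enough to recover the data in this simplified model.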

DataFrame

Distributed data collection organized into named columns, similar to a database table, optimized for structured queries.

Spark SQL

Spark module integrating SQL queries and DataFrame operations with automatic optimization via the Catalyst Optimizer.

Spark Streaming

Spark extension that processes live data streams as a sequence of micro-batches, achieving near-real-time latency.

MLlib

Spark's distributed machine learning library providing classification, regression, clustering, and recommendation algorithms.

GraphX

Spark API for distributed graph processing, representing graphs as RDDs of vertices and edges so that graph algorithms can be combined with ordinary RDD operations.

DAG (Directed Acyclic Graph)

Representation of a Spark job's execution plan as a graph of transformations, which Spark analyzes to avoid redundant work and to run independent parts in parallel.
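
To make the idea of planning over a chain of operations concrete, the sketch below groups a linear list of operation names into stages. It is a deliberate simplification, not Spark's scheduler: the rule "a shuffle-requiring operation starts a new stage" is an assumption of this sketch, and `split_into_stages` and `WIDE_OPERATIONS` are hypothetical names.

```python
# Operations that need data from many input partitions (assumed set for this sketch).
WIDE_OPERATIONS = {"reduceByKey", "groupByKey", "join"}


def split_into_stages(operations):
    """Group a linear chain of operation names into stages.

    Simplified rule: every wide operation begins a new stage, and the
    narrow operations that follow it stay in that same stage.
    """
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([op])
        else:
            stages[-1].append(op)
    return stages


plan = ["textFile", "flatMap", "map", "reduceByKey", "filter"]
stages = split_into_stages(plan)
```

Within one stage of this model, every operation can run on each partition independently, which is what makes the per-partition work parallelizable.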

Spark Driver

Main process coordinating a Spark application: it creates the SparkContext, divides the work into stages and tasks, and schedules those tasks on executors.

Spark Executor

Worker process executing tasks assigned by the Driver on each cluster node, managing memory and partitioned data.

SparkContext

Main entry point of the Spark application, managing cluster connections and coordinating access to distributed resources.

Partition

Logical unit of data distribution in Spark, enabling parallelism by dividing RDDs/DataFrames into independent fragments.
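
A small plain-Python sketch of why independent fragments enable parallelism: the data is split into slices, each slice is summed by a separate worker, and the partial results are combined. This is a toy model using the standard library, not Spark code; `split_into_partitions` and `partition_sum` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partition_sum(partition):
    # Each partition is processed without looking at any other partition.
    return sum(partition)


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(partition_sum, partitions))

total = sum(partial_sums)
```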

Shuffle

Costly data redistribution operation between partitions, necessary during aggregations, joins, or groupings in Spark.
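
The following toy word count hints at why this redistribution is costly: before values can be aggregated per key, every record has to be routed to the output partition that owns its key. It is a plain-Python model, not Spark's shuffle implementation; `target_partition` and `shuffle` are hypothetical names, and the CRC32-based routing is an assumption made so the example is deterministic.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    # Stable hash, so the same key always lands in the same output partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_output_partitions):
    """Route every (key, value) record to the output partition owning its key."""
    output = [defaultdict(list) for _ in range(num_output_partitions)]
    records_moved = 0
    for partition in input_partitions:
        for key, value in partition:
            output[target_partition(key, num_output_partitions)][key].append(value)
            records_moved += 1
    return output, records_moved


input_partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]
shuffled, records_moved = shuffle(input_partitions, 2)

# After the shuffle, each key lives in exactly one partition and can be summed locally.
counts = {key: sum(values) for part in shuffled for key, values in part.items()}
```

In this model every record is touched once during routing; on a real cluster that step would also involve serialization and network transfer.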

Catalyst Optimizer

Spark query optimization engine transforming and reorganizing execution plans to improve performance.

Tungsten

Spark execution backend optimizing memory and CPU through binary data management and bytecode generation.

Cache/Persist

Mechanism for persisting RDDs/DataFrames in memory or on disk for fast reuse and to avoid costly recalculations.
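
A minimal sketch of the effect, assuming a dataset whose contents are produced by an expensive function. `CachedDataset` is a hypothetical class written for this illustration, not part of Spark; it only counts how often the computation runs with and without caching.

```python
class CachedDataset:
    """Toy dataset that recomputes on every use unless cache() was called."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.compute_calls = 0

    def cache(self):
        # Only marks the dataset; the result is stored on first use.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive():
    return [x * x for x in range(5)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()

cached = CachedDataset(expensive).cache()
cached.collect()
cached.collect()
```

Here `uncached.compute_calls` ends at 2 while `cached.compute_calls` stays at 1, which is the saving the definition refers to.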

Broadcast Variable

Read-only variable shipped once to each executor rather than with every task, minimizing network transfers, for example when joining a large dataset with a small lookup table.
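
The join use case can be sketched in plain Python: a small lookup table is made read-only and handed to the function that processes each partition, so the large side never has to be regrouped by key. This is a toy model rather than Spark code; `join_partition` and the sample data are hypothetical.

```python
from types import MappingProxyType

# Small table, wrapped in a read-only view to mimic a read-only shared variable.
country_names = MappingProxyType({"PL": "Poland", "FR": "France"})


def join_partition(rows, lookup):
    """Join each (user, country_code) row against the small lookup table."""
    return [(user, lookup.get(code, "unknown")) for user, code in rows]


partitions = [
    [("ann", "PL"), ("bob", "FR")],
    [("eve", "DE")],
]

# Each partition is joined locally; no rows move between partitions.
joined = [row for part in partitions for row in join_partition(part, country_names)]
```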

Accumulator

Additive shared variable used to aggregate information from parallel tasks in a thread-safe manner.
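
As an illustration of the pattern, the sketch below counts malformed records while several workers parse partitions in parallel. `ToyAccumulator` is a hypothetical stand-in built on a lock from the standard library; it is meant to show the add-only, thread-safe behavior described above, not Spark's implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyAccumulator:
    """Add-only counter that is safe to update from several threads."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


bad_records = ToyAccumulator()


def parse_partition(lines):
    parsed = []
    for line in lines:
        try:
            parsed.append(int(line))
        except ValueError:
            bad_records.add(1)  # tasks only add; they never read the total
    return parsed


partitions = [["1", "x", "3"], ["4", "5"], ["oops", "", "8"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(parse_partition, partitions))
```

The total is read once, by the coordinating code, after all workers have finished.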

Transformation

Lazy operation that defines a new RDD/DataFrame from an existing one; nothing is executed until an action triggers the computation.

Action

Operation triggering the execution of the DAG plan to produce a result, forcing the computation of all previous transformations.
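
The split between lazy transformations and triggering actions can be sketched with a small plain-Python class. `LazyDataset` is a hypothetical toy, not Spark's API; the method names `map`, `filter`, and `collect` are borrowed only because they match familiar RDD operations. Transformations merely record a step, and `collect` runs the whole recorded plan.

```python
class LazyDataset:
    """Toy dataset: transformations record steps, an action executes them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # the recorded plan; nothing has run yet

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: execute every recorded step and materialize the result.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def traced_square(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * x


dataset = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the plan exists
result = dataset.collect()        # the action forces all previous steps
```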
