Apache Spark

Open-source distributed processing framework built around in-memory computation, which accelerates big data analytics through optimized parallel execution.

RDD (Resilient Distributed Dataset)

Fundamental data structure of Spark: an immutable, partitioned collection that achieves fault tolerance by recomputing lost partitions from their lineage rather than by replicating data.
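
The recovery idea can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's API: `ToyRDD`, `compute_partition`, and the sample data are hypothetical names chosen for illustration. Each dataset remembers its parent and the function that produced it, so a single lost partition can be rebuilt on demand.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, with lineage."""
    source: Optional[List[List[int]]] = None    # base partitions (root dataset only)
    parent: Optional["ToyRDD"] = None           # lineage: the dataset this one derives from
    fn: Optional[Callable[[int], int]] = None   # function applied to each parent element

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Returns a new dataset; the original is never modified.
        return ToyRDD(parent=self, fn=fn)

    def compute_partition(self, index: int) -> List[int]:
        # Rebuild one partition by walking the lineage back to the source.
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute_partition(index)]


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
derived = base.map(lambda x: x * 2).map(lambda x: x + 1)

# Pretend partition 1 was lost with its node: only that partition is recomputed.
recovered = derived.compute_partition(1)
print(recovered)  # [7, 9]
```

Nothing is stored for `derived` in this sketch; the lineage alone is enough to reproduce any partition.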

DataFrame

Distributed data collection organized into named columns, similar to a database table, optimized for structured queries.

Spark SQL

Spark module integrating SQL queries and DataFrame operations with automatic optimization via the Catalyst Optimizer.

Spark Streaming

Spark extension that processes data streams as a series of small micro-batches, giving near-real-time latency.
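
Micro-batching itself can be sketched in a few lines of plain Python. This is a toy model rather than Spark's streaming API: the function `micro_batches`, the `(timestamp, payload)` event shape, and the one-second interval are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def micro_batches(events: List[Tuple[float, str]], interval: float) -> List[List[str]]:
    # Assign each event to the batch covering its timestamp, then return the
    # non-empty batches in time order (intervals with no events are skipped).
    batches: Dict[int, List[str]] = {}
    for timestamp, payload in events:
        batches.setdefault(int(timestamp // interval), []).append(payload)
    return [batches[key] for key in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1.0)
counts = [len(batch) for batch in batches]

print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
print(counts)   # [2, 1, 2]
```

Each batch is then processed like an ordinary small dataset, which is why latency is bounded below by the batch interval.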

MLlib

Spark's distributed machine learning library providing classification, regression, clustering, and recommendation algorithms.

GraphX

Spark API for distributed graph processing, built on top of RDDs, that combines graph-parallel and data-parallel computation in one system.

DAG (Directed Acyclic Graph)

Representation of a Spark job's execution plan as a graph of transformations, which lets Spark eliminate redundant work and parallelize processing.

Spark Driver

Main process coordinating Spark task execution, creating the SparkContext and dividing operations into stages.
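
How a chain of operations might be divided into stages can be sketched as follows. This is a simplified toy, not the driver's actual scheduler: `split_into_stages` and the `WIDE_OPS` set are hypothetical, and the sketch simply assumes that any operation needing a shuffle starts a new stage.

```python
from typing import List

# Assumption for this sketch: these operations need a shuffle ("wide");
# everything else is treated as "narrow" and stays in the current stage.
WIDE_OPS = {"groupByKey", "reduceByKey", "join"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # a shuffle boundary starts a new stage
        stages[-1].append(op)
    return stages


job = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(job)
print(stages)
# [['map', 'filter'], ['reduceByKey', 'map'], ['join', 'filter']]
```

Operations inside one stage can be pipelined on each partition without moving data between nodes.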

Spark Executor

Worker process executing tasks assigned by the Driver on each cluster node, managing memory and partitioned data.

SparkContext

Main entry point of the Spark application, managing cluster connections and coordinating access to distributed resources.

Partition

Logical unit of data distribution in Spark, enabling parallelism by dividing RDDs/DataFrames into independent fragments.
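
A minimal sketch of partition-level parallelism, using only the Python standard library. The helper names (`split_into_partitions`, `partial_sum`) and the round-robin split are assumptions for illustration; Spark chooses partitioning by its own rules.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    # Round-robin assignment: element i goes to partition i % num_partitions.
    return [data[i::num_partitions] for i in range(num_partitions)]


def partial_sum(partition: List[int]) -> int:
    # Each partition is processed on its own, with no access to the others.
    return sum(partition)


data = list(range(1, 11))  # 1..10
partitions = split_into_partitions(data, 3)

# One worker per partition computes a partial result in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)
print(partitions)  # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
print(total)       # 55
```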

Shuffle

Costly data redistribution operation between partitions, necessary during aggregations, joins, or groupings in Spark.
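
The redistribution step can be illustrated with a toy grouping by key. This is a plain-Python sketch, not Spark's shuffle implementation: `shuffle_by_key`, `group_within_partition`, and the `key % num_out` placement rule are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (key, value)


def shuffle_by_key(partitions: List[List[Record]], num_out: int) -> List[List[Record]]:
    # Every input partition may send records to every output partition,
    # which is what makes a shuffle expensive.
    out: List[List[Record]] = [[] for _ in range(num_out)]
    for partition in partitions:
        for key, value in partition:
            out[key % num_out].append((key, value))
    return out


def group_within_partition(partition: List[Record]) -> Dict[int, List[str]]:
    # After the shuffle, all records for a key sit in one partition,
    # so grouping needs no further data movement.
    grouped: Dict[int, List[str]] = defaultdict(list)
    for key, value in partition:
        grouped[key].append(value)
    return dict(grouped)


before = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
    [(2, "e"), (3, "f")],
]
after = shuffle_by_key(before, num_out=2)
groups = [group_within_partition(p) for p in after]
print(groups)  # [{2: ['b', 'e']}, {1: ['a', 'c'], 3: ['d', 'f']}]
```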

Catalyst Optimizer

Spark query optimization engine transforming and reorganizing execution plans to improve performance.
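
Constant folding is one kind of rewrite such an optimizer can apply to a plan. The sketch below is a toy in plain Python: the tuple-based expression tree and `fold_constants` are hypothetical and do not reflect Catalyst's own classes.

```python
def fold_constants(expr):
    # An expression is an int literal, a column name (str),
    # or a tuple (op, left, right) with op in {"+", "*"}.
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        # Both sides are known up front, so evaluate once at planning time.
        return left + right if op == "+" else left * right
    return (op, left, right)


plan = ("*", "price", ("+", 1, ("*", 2, 3)))
optimized = fold_constants(plan)
print(optimized)  # ('*', 'price', 7)
```

The rewritten plan computes the same result but no longer re-evaluates `1 + 2 * 3` for every row.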

Tungsten

Spark execution backend optimizing memory and CPU through binary data management and bytecode generation.

Cache/Persist

Mechanism for persisting RDDs/DataFrames in memory or on disk for fast reuse and to avoid costly recalculations.
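
The effect of caching on recomputation can be shown with a toy dataset that counts how often it is computed. `ToyDataset`, `cache`, `collect`, and `compute_calls` are hypothetical names for this plain-Python sketch and only mimic the idea, not Spark's storage levels.

```python
from typing import Callable, List, Optional


class ToyDataset:
    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._cached: Optional[List[int]] = None
        self._persist = False
        self.compute_calls = 0  # how many times the full computation ran

    def cache(self) -> "ToyDataset":
        # Only marks the dataset; the result is stored on first use.
        self._persist = True
        return self

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


def expensive() -> List[int]:
    return [x * x for x in range(5)]


uncached = ToyDataset(expensive)
uncached.collect()
uncached.collect()

cached = ToyDataset(expensive).cache()
cached.collect()
cached.collect()

print(uncached.compute_calls)  # 2
print(cached.compute_calls)    # 1
```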

Broadcast Variable

Read-only variable shipped once to each executor instead of with every task, minimizing network transfers, for example when joining a large dataset against a small lookup table.
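
A join against a small read-only table can be sketched in plain Python. This is a toy model and not Spark's broadcast API: `country_names`, `join_partition`, and the sample records are hypothetical, and `MappingProxyType` merely stands in for the read-only guarantee.

```python
from types import MappingProxyType
from typing import List, Mapping, Tuple

# Small lookup table, wrapped read-only; in Spark this would be the broadcast value.
country_names: Mapping[str, str] = MappingProxyType(
    {"FR": "France", "JP": "Japan", "BR": "Brazil"}
)

# Large dataset, already split into partitions of (user, country_code) records.
partitions: List[List[Tuple[str, str]]] = [
    [("ana", "FR"), ("bo", "JP")],
    [("cem", "BR"), ("dee", "XX")],
]


def join_partition(partition, lookup):
    # Each task reads its local copy of the lookup table,
    # so no record of the large dataset has to move between partitions.
    return [(user, lookup.get(code, "unknown")) for user, code in partition]


joined = [join_partition(p, country_names) for p in partitions]
print(joined)
# [[('ana', 'France'), ('bo', 'Japan')], [('cem', 'Brazil'), ('dee', 'unknown')]]
```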

Accumulator

Additive shared variable used to aggregate information from parallel tasks in a thread-safe manner.
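
An additive, thread-safe counter can be sketched with the Python standard library. `ToyAccumulator`, `parse_partition`, and the sample data are hypothetical; the sketch uses threads in one process, whereas Spark merges updates from tasks running on separate executors.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ToyAccumulator:
    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount: int) -> None:
        # Tasks may only add; the lock makes concurrent updates safe.
        with self._lock:
            self._value += amount

    @property
    def value(self) -> int:
        return self._value


bad_records = ToyAccumulator()


def parse_partition(partition: List[str]) -> List[int]:
    parsed = []
    for raw in partition:
        try:
            parsed.append(int(raw))
        except ValueError:
            bad_records.add(1)  # side channel: count unparseable records
    return parsed


partitions = [["1", "x", "3"], ["4", "5"], ["oops", "7", "n/a"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(parse_partition, partitions))

print(results)            # [[1, 3], [4, 5], [7]]
print(bad_records.value)  # 3
```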

Transformation

Lazy operation creating a new RDD/DataFrame without immediate execution, deferred until a triggering action.

Action

Operation triggering the execution of the DAG plan to produce a result, forcing the computation of all previous transformations.
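
The split between lazy transformations and triggering actions can be illustrated with a toy pipeline in plain Python. `LazyDataset` and its methods are hypothetical and only borrow the method names for readability; the `calls` list exists solely to show when work actually happens.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable]


class LazyDataset:
    def __init__(self, source: Iterable[int], steps: Optional[List[Step]] = None) -> None:
        self._source = source
        self._steps: List[Step] = steps or []

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: only records the step, runs nothing.
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        # Transformation: only records the step, runs nothing.
        return LazyDataset(self._source, self._steps + [("filter", pred)])

    def collect(self) -> List[int]:
        # Action: runs every recorded step over the source data.
        out = []
        for x in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    x = fn(x)
                elif not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each invocation
    return x * x


dataset = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)
result = dataset.collect()

print(calls_before_action)  # 0
print(result)               # [1, 9, 25]
print(len(calls))           # 5
```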
