AI Glossary
The complete dictionary of Artificial Intelligence
Apache Spark
Open-source distributed processing framework that keeps intermediate data in memory where possible, accelerating Big Data analytics through optimized parallel execution.
RDD (Resilient Distributed Dataset)
Fundamental data structure of Spark, immutable and partitioned, that achieves fault tolerance by recomputing lost partitions from their lineage (the recorded chain of transformations that produced them).
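The recovery idea can be sketched in plain Python. This is a toy model, not Spark's API: the class name ToyRDD and its methods are hypothetical, and it assumes each derived partition depends only on the parent partition at the same index.

```python
# Toy model of lineage-based recovery; ToyRDD is hypothetical, not Spark's API.
class ToyRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # lineage: the dataset this one was derived from
        self.fn = fn                  # lineage: the per-element function that was applied

    def map(self, fn):
        # Build a new dataset instead of modifying this one.
        derived = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(derived, parent=self, fn=fn)

    def recover(self, index):
        # Rebuild only the lost partition by re-applying fn to the parent's partition.
        self.partitions[index] = [self.fn(x) for x in self.parent.partitions[index]]


base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

squared.partitions[1] = None  # simulate losing one partition
squared.recover(1)            # recompute it from lineage, leaving partition 0 untouched
```

Only the missing partition is recomputed; nothing is restored from a replica.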
DataFrame
Distributed data collection organized into named columns, similar to a database table, optimized for structured queries.
Spark SQL
Spark module integrating SQL queries and DataFrame operations with automatic optimization via the Catalyst Optimizer.
Spark Streaming
Spark extension that processes live data streams as a sequence of small micro-batches, giving near-real-time latency.
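A minimal plain-Python sketch of the micro-batch idea follows. It is not Spark's API: the helper micro_batches is hypothetical, and it groups records by count for simplicity, whereas a streaming engine would typically cut batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive small batches from a (possibly unbounded) stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


events = [3, 1, 4, 1, 5, 9, 2]

# Each batch is processed as an ordinary small dataset.
batch_sums = [sum(batch) for batch in micro_batches(events, 3)]
```

The per-batch results arrive with a delay bounded by the batch size, which is why the latency is near-real-time rather than per-record.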
MLlib
Spark's distributed machine learning library providing classification, regression, clustering, and recommendation algorithms.
GraphX
Spark API for distributed graph processing, built on top of RDDs, that represents a graph as collections of vertices and edges and supports graph-parallel computation.
DAG (Directed Acyclic Graph)
Representation of a Spark job's execution plan as a graph of transformations with no cycles, which lets the scheduler remove redundant work and run independent parts in parallel.
Spark Driver
Main process coordinating Spark task execution, creating the SparkContext and dividing operations into stages.
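The division into stages can be illustrated with a deliberately simplified sketch. It assumes a linear chain of operations and cuts a new stage wherever an operation needs a shuffle. The set WIDE and the function split_into_stages are hypothetical, and a real scheduler works on a graph rather than a list.

```python
# Operations assumed, for this sketch, to require a shuffle.
WIDE = {"groupByKey", "reduceByKey", "join"}


def split_into_stages(ops):
    """Split a linear chain of operation names into stages at shuffle boundaries."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


plan = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(plan)
```

Operations inside one stage can be pipelined on the same partition; only the stage boundaries force data to move.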
Spark Executor
Worker process executing tasks assigned by the Driver on each cluster node, managing memory and partitioned data.
SparkContext
Main entry point of a Spark application, managing the connection to the cluster and coordinating access to distributed resources.
Partition
Logical unit of data distribution in Spark, enabling parallelism by dividing RDDs/DataFrames into independent fragments.
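As a rough illustration in plain Python (not Spark's API), a dataset can be split into independent fragments that are processed in parallel and then combined. The helper split and the round-robin assignment are assumptions made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, num_partitions):
    """Assign elements to partitions round-robin."""
    return [data[i::num_partitions] for i in range(num_partitions)]


data = list(range(10))
partitions = split(data, 3)

# Each partition is processed independently; the pool stands in for cluster workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)
```

Because no partition needs data from another, the degree of parallelism is limited only by the number of partitions and workers.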
Shuffle
Costly data redistribution operation between partitions, necessary during aggregations, joins, or groupings in Spark.
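The reason a shuffle is costly can be seen in a small plain-Python sketch, which is not Spark's API. To aggregate by key, every record must be routed to the output partition that owns its key, so data from all input partitions is touched. The function shuffle_by_key and the hash-based routing are assumptions for illustration.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_out):
    """Route every (key, value) record to the output partition that owns its key."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out][key].append(value)
    return out


input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]

shuffled = shuffle_by_key(input_partitions, 2)

# After the shuffle, all values of a key sit together and can be reduced locally.
totals = {key: sum(values) for part in shuffled for key, values in part.items()}
```

In a cluster the routing step means writing, transferring, and reading records across the network, which is where the cost comes from.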
Catalyst Optimizer
Spark query optimization engine transforming and reorganizing execution plans to improve performance.
Tungsten
Spark execution backend that optimizes memory and CPU usage by storing data in a compact binary format and generating bytecode at runtime.
Cache/Persist
Mechanism for keeping RDDs/DataFrames in memory or on disk so they can be reused quickly without costly recomputation.
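The effect of persisting can be sketched with a toy class in plain Python. ToyDataset, persist, and collect are hypothetical names chosen for this example and do not reproduce Spark's API or its storage levels.

```python
calls = {"n": 0}


def expensive_load():
    calls["n"] += 1  # count how often the data is actually computed
    return [x * x for x in range(5)]


class ToyDataset:
    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._cached = None

    def persist(self):
        self._persist = True  # only marks the dataset; nothing is computed yet
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persist:
            self._cached = result  # keep the result for later reuse
        return result


ds = ToyDataset(expensive_load).persist()
first = ds.collect()   # computes and stores
second = ds.collect()  # served from the stored copy
```

Without the call to persist, the second collect would run expensive_load again.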
Broadcast Variable
Read-only variable sent once to each executor and cached there, rather than shipped with every task, to minimize network transfers, for example when joining against a small lookup table.
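A minimal plain-Python sketch of the lookup-table case, not Spark's API: the small table is made available to every partition's task, so the large dataset never has to be redistributed by key. All names and data here are hypothetical.

```python
# Small read-only lookup table; in this sketch it plays the role of the broadcast value.
country_names = {"IT": "Italy", "FR": "France"}

# The large dataset stays where it is, split across partitions.
order_partitions = [
    [("o1", "IT"), ("o2", "FR")],
    [("o3", "IT")],
]


def join_partition(partition, lookup):
    """Join one partition against the local copy of the lookup table."""
    return [(order_id, lookup[code]) for order_id, code in partition]


joined = [
    row
    for partition in order_partitions
    for row in join_partition(partition, country_names)
]
```

Each partition is joined locally, so no order record has to move to another partition.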
Accumulator
Additive shared variable that parallel tasks can only add to, in a thread-safe manner, and whose aggregated value is read by the driver.
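The add-only pattern can be sketched in plain Python with a lock-protected counter. ToyAccumulator is a hypothetical class, not Spark's API, and threads stand in for parallel tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyAccumulator:
    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:  # serialize concurrent updates
            self._value += amount

    @property
    def value(self):
        return self._value


bad_records = ToyAccumulator()


def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)  # tasks only add; they never read the total
        return None


lines = ["1", "x", "3", "", "5"]
with ThreadPoolExecutor(max_workers=4) as pool:
    parsed = list(pool.map(parse, lines))
```

The total is read once, after all tasks have finished, which mirrors the usual use for counters such as malformed-record counts.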
Transformation
Lazy operation creating a new RDD/DataFrame without immediate execution, deferred until a triggering action.
Action
Operation that triggers execution of the DAG plan to produce a result, forcing the computation of every preceding transformation the result depends on.
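The split between lazy transformations and triggering actions can be modeled in a few lines of plain Python. LazyDataset is a hypothetical class written for this sketch, not Spark's API: map and filter only record what should happen, and collect is the action that runs the recorded plan.

```python
class LazyDataset:
    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # recorded plan; nothing has run yet

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: execute every recorded step and materialize the result.
        rows = iter(self._source)
        for kind, fn in self._ops:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


log = []


def double(x):
    log.append(x)  # records when the function really runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
ran_before_action = list(log)  # snapshot taken before any action is called

result = ds.collect()
```

The snapshot taken before collect is empty, showing that building the chain of transformations did no work by itself.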