AI Glossary
The Complete Dictionary of Artificial Intelligence
MapReduce
Parallel programming model for processing large datasets on clusters, dividing processing into two main phases: Map for filtering and transforming, and Reduce for aggregating results.
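The two phases can be illustrated with a single-process word count. This is only a sketch of the model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the explicit `shuffle` step stands in for the grouping that a real framework performs between Map and Reduce.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each document (or split) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both inputs.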
Lambda Architecture
Data processing architecture combining a batch layer for comprehensive, periodically recomputed views with a speed layer for low-latency results on recent data, plus a serving layer that merges both views to answer queries.
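A minimal sketch of how the two paths might be merged at query time, assuming page-view events represented as dictionaries with a `page` key. The event shape and the function names (`batch_view`, `speed_view`, `serve`) are illustrative assumptions, and the sketch assumes the recent events are not yet included in the batch view.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch path: recompute counts over the full historical dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed path: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serve(batch, speed):
    """Merge both views so a query sees historical plus recent counts."""
    return dict(batch + speed)


history = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
recent = [{"page": "home"}, {"page": "blog"}]
merged = serve(batch_view(history), speed_view(recent))
```

Here `merged` contains the sum of both views per page, which is the role the merging layer plays in the definition above.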
Kappa Architecture
Simplification of the Lambda Architecture that uses a single stream processing pipeline: data is processed in real time, and historical queries are answered by replaying events from the log.
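Replaying can be sketched with an in-memory list standing in for the event log. The event format `(account, amount)` and the names `apply_event` and `replay` are assumptions made for this illustration; a real deployment would read from a durable log.

```python
def apply_event(state, event):
    """Single processing path: the same logic handles live and replayed events."""
    account, amount = event
    state[account] = state.get(account, 0) + amount
    return state


def replay(log, upto=None):
    """Rebuild state by re-reading the log from the start (optionally only a prefix)."""
    state = {}
    for event in log[:upto]:
        apply_event(state, event)
    return state


log = [("a", 10), ("b", 5), ("a", -3)]
current = replay(log)             # state after all events
historical = replay(log, upto=2)  # state as of the first two events
```

The historical query needs no separate batch job: it is the same code run over an earlier portion of the log.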
Batch Processing
Processing mode where data is collected and processed in batches at predefined intervals, optimized for throughput rather than latency, typical of traditional ETL jobs.
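As a rough illustration, the sketch below groups already-collected records into fixed-size batches and processes each batch as a unit. The helper names `batches` and `process_batch` are hypothetical, and batching by count here stands in for batching by time interval.

```python
def batches(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def process_batch(batch):
    """Process one whole batch at once; here, simply sum its values."""
    return sum(batch)


collected = [1, 2, 3, 4, 5]
results = [process_batch(batch) for batch in batches(collected, 2)]
```

No result is available for a record until its whole batch has been processed, which is the latency cost the definition mentions.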
Stream Processing
Continuous processing of data in motion as it is generated, enabling real-time analysis with minimal latency between capture and processing.
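A minimal sketch of the idea using a Python generator: each incoming value updates the result immediately instead of waiting for a complete dataset. The function name `running_mean` is an assumption for this example, and any iterable can play the role of the stream.

```python
def running_mean(stream):
    """Emit an updated mean after every event, keeping only constant-size state."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


updates = list(running_mean([2, 4, 6]))
```

Each element of `updates` corresponds to one event, so a consumer sees a fresh result as soon as that event is processed.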
Distributed File System
File system storing data across multiple servers while appearing as a single system to users, ensuring replication and fault tolerance for reliability.
HDFS
Hadoop Distributed File System, a distributed file system designed to store petabytes of data on commodity hardware, achieving high fault tolerance through block replication.
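Block replication can be sketched as follows. The round-robin placement is an assumption chosen for simplicity and is not HDFS's actual placement policy, which also takes rack topology into account; the function name `place_blocks` and the node names are hypothetical.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Simplified round-robin placement (assumption, not the real HDFS policy).
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


# A 300 MB file with 128 MB blocks occupies three blocks.
layout = place_blocks(300, 128, ["n1", "n2", "n3", "n4"])
```

With three replicas per block, losing any single node still leaves two copies of every block on other nodes.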
YARN
Yet Another Resource Negotiator, Hadoop's cluster resource manager, which separates resource management and scheduling from data processing so that multiple frameworks can run on the same cluster.
RDD
Resilient Distributed Dataset, fundamental data structure of Spark representing an immutable and partitioned collection of objects that can be computed in parallel with automatic fault tolerance.
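The properties named above (immutability, partitioning, and recovery by recomputation) can be illustrated with a toy class. `ToyRDD` is a hypothetical stand-in written for this glossary, not the Spark API: it records which parent and function produced it, so any partition can be rebuilt from that lineage on demand.

```python
class ToyRDD:
    """Toy model of an RDD: partitioned, never mutated, rebuilt from lineage."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source = source_partitions  # tuple of tuples for a root dataset
        self.parent = parent             # lineage: the dataset this one derives from
        self.fn = fn                     # lineage: the function applied to the parent

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, fn):
        # Transformations return a new dataset and leave the original untouched.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        # A lost partition is recovered by recomputing it from its lineage.
        if self.parent is None:
            return list(self.source[index])
        return [self.fn(x) for x in self.parent.compute(index)]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source_partitions=((1, 2), (3, 4)))
doubled = base.map(lambda x: x * 2)
```

Calling `doubled.compute(1)` rebuilds only the second partition, which mirrors how a single lost partition can be recovered without recomputing the whole dataset.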
Data Locality
Distributed computing principle where tasks are executed on nodes containing the necessary data, minimizing network transfer and significantly improving performance.
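A scheduler that honors this principle might look roughly like the sketch below. The function name `pick_node` and its fallback rule are assumptions for illustration; real schedulers usually distinguish further levels such as same-rack placement.

```python
def pick_node(block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise fall back to any free node.

    Returns (node, is_local), where is_local tells whether a network transfer is avoided.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in block_locations:
        if node in free_nodes:
            return node, True
    # No replica holder is free: the data will have to travel over the network.
    return sorted(free_nodes)[0], False


local_choice = pick_node(["n2", "n3"], {"n1", "n3"})
remote_choice = pick_node(["n2"], {"n1", "n4"})
```

In the first call the task lands on `n3`, which holds a replica; in the second no replica holder is free, so the task runs remotely.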
Speculative Execution
Fault tolerance mechanism that launches backup copies of slow tasks on other nodes and uses the result of whichever copy finishes first, reducing the impact of faulty or overloaded nodes.
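The effect on completion time can be modeled with simple arithmetic. This is a deterministic simulation under stated assumptions, not a real scheduler: the function name `finish_time` and the fixed threshold `speculate_after` are hypothetical, and durations are in arbitrary time units.

```python
def finish_time(primary_duration, backup_duration, speculate_after):
    """Completion time of a task when a backup copy starts once the primary runs too long."""
    if primary_duration <= speculate_after:
        # The primary finished before it was considered slow; no backup is launched.
        return primary_duration
    # The backup starts at `speculate_after`; the first copy to finish wins.
    return min(primary_duration, speculate_after + backup_duration)


straggler = finish_time(primary_duration=100, backup_duration=10, speculate_after=20)
healthy = finish_time(primary_duration=15, backup_duration=10, speculate_after=20)
```

For the straggler, the backup finishes at time 30 instead of the primary's 100, at the cost of running the task twice.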
DAG
Directed Acyclic Graph, a graph with directed edges and no cycles. Spark represents a workflow as a DAG in which transformations are nodes and data dependencies are edges, which lets the scheduler run independent steps in parallel.
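To show how a DAG exposes parallelism, the sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to group steps into waves whose members have no dependencies on each other. The step names and the helper `parallel_waves` are hypothetical, and this is not how Spark's scheduler is implemented.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group steps into waves; every step in a wave can run concurrently.

    `dependencies` maps each step to the set of steps it depends on.
    A cycle raises graphlib.CycleError, since the graph must be acyclic.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all complete
        waves.append(ready)
        sorter.done(*ready)
    return waves


workflow = {
    "clean_a": {"load_a"},
    "clean_b": {"load_b"},
    "join": {"clean_a", "clean_b"},
    "report": {"join"},
}
waves = parallel_waves(workflow)
```

The two load steps share the first wave and the two clean steps the second, while `join` has to wait for both branches.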
Fault Tolerance
Ability of a distributed system to continue functioning correctly in case of component failures, typically through redundancy, replication, and automatic recovery mechanisms.
Consistency Model
Contract defining data consistency guarantees in a distributed system, ranging from strong consistency to eventual consistency based on application needs.
Combiner
MapReduce optimization function run locally on each mapper's output to pre-aggregate values before the shuffle, reducing the volume of data transferred to the reduce phase.
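The saving can be made concrete with a word count in which each input split is handled by one mapper. The function names (`map_words`, `combine`, `reduce_counts`) are hypothetical and the example runs in a single process; it only counts how many pairs would cross the network with and without the combiner.

```python
from collections import Counter


def map_words(split):
    """Mapper: emit (word, 1) for each word in one input split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_counts(pairs):
    """Reducer: sum the values per key across all mappers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["a a a b", "b b a"]
without_combiner = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]
```

Both lists reduce to the same totals, but the combined one holds fewer pairs. This works because addition is associative and commutative; an operation without those properties could not be pre-aggregated this way.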