AI Glossary
The complete dictionary of Artificial Intelligence
HDFS
Hadoop's primary distributed file system, designed to store petabytes of data across clusters of commodity machines, with automatic block replication for fault tolerance.
MapReduce
Programming model and implementation for distributed processing of large datasets on clusters, which splits each job into a map phase and a reduce phase.
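The two phases can be illustrated with a word count run in a single process. This is only a sketch of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step that a real framework performs across the cluster is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, standing in for the framework's shuffle step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values seen for one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data on hadoop", "hadoop stores big data"])
print(counts)
```

On a cluster, each map call would run on the node holding its input block and each reduce call would receive the values for its keys over the network; the logic of the two functions stays the same.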
YARN
Hadoop's resource manager that orchestrates the allocation of CPU and memory resources to applications while managing task lifecycles in the cluster.
HBase
Distributed, column-oriented NoSQL database built on HDFS, offering real-time random read and write access to very large tables with strong consistency.
Hive
Data warehouse infrastructure on Hadoop that lets users query large datasets in a SQL-like language (HiveQL); queries are compiled into MapReduce jobs, or into Tez or Spark jobs in later versions.
Pig
High-level data analysis platform whose language, Pig Latin, expresses complex data transformations that are compiled into jobs executed on Hadoop.
Spark
Unified, in-memory processing engine for Big Data, offering APIs in Scala, Java, Python, and R, with support for SQL, streaming, machine learning, and graph processing.
ZooKeeper
Centralized coordination service for distributed applications that maintains configuration information and naming, and provides distributed synchronization and group services.
Flume
Distributed, reliable, and available service with an agent-based architecture for collecting, aggregating, and moving large amounts of streaming data, such as logs, into HDFS.
Sqoop
Tool designed to efficiently transfer bulk data between Hadoop and structured data stores such as relational databases.
Oozie
Workflow and coordinator system for managing and executing complex Hadoop data processing pipelines with time-based and conditional dependencies.
Mahout
Library of distributed machine learning and data mining algorithms implemented on Hadoop MapReduce for processing large datasets.
Ambari
Web-based platform for provisioning, managing, and monitoring Hadoop clusters and the services of the Hadoop ecosystem.
HCatalog
Metadata and table management service for the Hadoop ecosystem, providing a unified view of data for tools like Pig, Hive, and MapReduce.
Avro
Data serialization system with support for schema evolution, providing a compact, fast binary format for exchanging data between Hadoop services.
Parquet
Columnar file format optimized for analytical query performance on Hadoop, with efficient compression and support for complex types.
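Why a columnar layout helps analytical queries and compression can be shown with a toy in-memory model. This sketch does not read or write Parquet files and uses no Parquet library; the sample records and the helper names `to_columns` and `run_length_encode` are hypothetical, and run-length encoding stands in for the family of encodings a columnar format can apply.

```python
rows = [
    {"user": "ana", "country": "IT", "amount": 12.5},
    {"user": "bo", "country": "IT", "amount": 7.0},
    {"user": "cy", "country": "FR", "amount": 3.5},
]


def to_columns(records):
    # Pivot row-oriented records into one list of values per column.
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    # Collapse consecutive repeated values into [value, count] pairs.
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total = sum(columns["amount"])

# Values of the same column tend to repeat, so they compress well together.
country_rle = run_length_encode(columns["country"])
print(total, country_rle)
```

The aggregate never touches the `user` or `country` values, which is the main reason columnar storage suits queries that scan a few columns of a wide table.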
Impala
Massively parallel SQL query engine for Hadoop providing low-latency interactive query performance on data stored in HDFS and HBase.
Tez
Execution framework for Hadoop YARN that runs a job as a general directed acyclic graph (DAG) of tasks, speeding up complex processing by eliminating unnecessary MapReduce phases.
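The idea of expressing a job as one acyclic graph of steps, rather than a chain of separate map and reduce jobs, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is not the Tez API: the vertex names and the `dag` dictionary are hypothetical and only show how dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Each key is a processing step; its value is the set of steps it depends on.
dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The two scans have no dependency on each other, so a scheduler is free to run them in parallel and feed both directly into the join without writing an intermediate job's output to disk first.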
Storm
Distributed real-time stream processing system that integrates with the Hadoop ecosystem, capable of processing massive volumes of data with millisecond-level latencies.
Kafka
High-throughput, highly available distributed messaging platform for collecting and processing real-time data streams, commonly used to feed data into the Hadoop ecosystem.