Data Lakes - Artificial Intelligence Glossary

Data Lake

Centralized storage repository designed to hold large amounts of raw data in its native format. It stores structured, semi-structured, and unstructured data at petabyte scale.

Data Swamp

Data Lake that has lost its governance and organization, making data difficult to access and use. It results from a lack of metadata management and proper documentation.

Data Lakehouse

Hybrid architecture combining the advantages of Data Lakes and data warehouses to provide unified data management. It enables direct analysis of data stored in an open, performance-optimized format.

📖

용어

Data Ingestion

Process of collecting and transferring data from various sources to a centralized storage system such as a Data Lake. It can run in scheduled batches or as continuous real-time streaming, depending on business needs.

📖

용어

Schema-on-Read

Approach where data structure is applied at the time of reading rather than writing. It offers maximum flexibility for storing heterogeneous data without defining a schema in advance.
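
As a minimal sketch of the idea, assume raw records are kept as JSON lines and that the reader supplies a hypothetical schema (`READ_SCHEMA`, a mapping of field names to Python types) only when querying. The names and sample records are illustrative assumptions, not part of any specific product.

```python
import json

# Raw records are stored exactly as they arrived; nothing was enforced on write.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.90", "country": "FR"}',
    '{"user_id": "7", "amount": "5", "extra_field": true}',
]

# Hypothetical schema chosen by the reader at query time: field name -> type.
READ_SCHEMA = {"user_id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema while reading."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; present fields are cast to the requested type.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
```

Fields that the schema does not mention (such as `extra_field`) stay in storage untouched and are simply ignored by this particular read.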

📖

용어

Schema-on-Write

Traditional methodology where data schema must be defined before writing into the system. It ensures data quality and consistency but reduces storage flexibility.
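
A small sketch can make the trade-off concrete. It assumes a hypothetical in-memory table and a hypothetical `WRITE_SCHEMA`; real systems enforce this in the database engine, so treat the code as an illustration only.

```python
# Hypothetical schema that must be satisfied before a record is accepted.
WRITE_SCHEMA = {"user_id": int, "amount": float}


class SchemaError(ValueError):
    """Raised when a record does not match the declared schema."""


def validate(record, schema):
    """Reject records with missing, extra, or wrongly typed fields."""
    if set(record) != set(schema):
        raise SchemaError(f"expected fields {sorted(schema)}, got {sorted(record)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise SchemaError(f"{field!r} must be {expected.__name__}")


def write(table, record, schema=WRITE_SCHEMA):
    """Append the record only if it passes validation."""
    validate(record, schema)
    table.append(record)


table = []
write(table, {"user_id": 1, "amount": 9.5})
try:
    write(table, {"user_id": "2", "amount": 3.0})  # wrong type for user_id
    rejected = False
except SchemaError:
    rejected = True
```

The second record never reaches storage, which is the quality guarantee the definition describes, at the cost of refusing data that does not fit the declared shape.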

📖

용어

Data Catalog

Organized and indexed metadata describing available data in a Data Lake. It facilitates data discovery, understanding, and governance through a centralized interface.

📖

용어

Data Governance

Set of policies, procedures, and standards defining data management within the organization. It ensures quality, security, compliance, and appropriate use of Data Lake data.

📖

용어

Data Partitioning

Technique for dividing data into smaller segments based on specific criteria such as date or category. It optimizes query performance by limiting reads to relevant partitions.
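
The effect of limiting reads to relevant partitions can be sketched as follows, assuming in-memory records partitioned by a `date` field. The sample events and function names are hypothetical.

```python
from collections import defaultdict

events = [
    {"date": "2024-01-01", "value": 10},
    {"date": "2024-01-01", "value": 5},
    {"date": "2024-01-02", "value": 7},
    {"date": "2024-01-03", "value": 1},
]


def partition_by(records, key):
    """Group records into partitions keyed by the value of one field."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def total_for_date(partitions, date):
    """Scan only the partition matching the filter, skipping all others."""
    scanned = partitions.get(date, [])
    return sum(record["value"] for record in scanned), len(scanned)


partitions = partition_by(events, "date")
total, rows_scanned = total_for_date(partitions, "2024-01-01")
```

Here the query touches two rows instead of four; on a real Data Lake the same idea lets an engine skip entire directories or files.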

📖

용어

Data Sharding

Horizontal partitioning of data distributed across multiple servers to improve scalability and performance. Each shard contains a unique subset of the total data.
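
One common way to assign records to shards is to hash a key, sketched below under the assumption of a fixed, hypothetical shard count (`NUM_SHARDS`). Other schemes, such as range-based or directory-based sharding, exist as well.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of servers


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each key lands in exactly one shard, so shards hold disjoint subsets.
user_ids = ["alice", "bob", "carol", "dave", "erin"]
shards = {index: [] for index in range(NUM_SHARDS)}
for user_id in user_ids:
    shards[shard_for(user_id)].append(user_id)
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would route the same key to different shards on different runs.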

📖

용어

Data Replication

Process of copying data from one location to another to ensure high availability and fault tolerance. It can be synchronous or asynchronous depending on consistency requirements.
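
The difference between the two modes can be sketched with a toy in-memory model. The `Primary` and `Replica` classes are hypothetical and ignore networking and failures; they only show when the copy becomes visible.

```python
from collections import deque


class Replica:
    """A copy of the data held at another location."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Accepts writes and forwards them to replicas (hypothetical design)."""

    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = deque()  # changes not yet shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: replicas are updated before the write returns.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Asynchronous: the write returns first; replicas catch up later.
            self.pending.append((key, value))

    def flush(self):
        """Ship queued changes to every replica, in write order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.apply(key, value)


sync_replica = Replica()
Primary([sync_replica], synchronous=True).write("k", 1)

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("k", 1)
lagging = dict(async_replica.data)  # still empty: replica has not caught up
async_primary.flush()
```

The synchronous replica is consistent as soon as the write returns, while the asynchronous one is briefly stale, which is the consistency trade-off the definition refers to.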

📖

용어

Data Versioning

Mechanism for tracking and managing data changes over time in a Data Lake. It facilitates auditing, error recovery, and temporal trend analysis.
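
A minimal sketch of the mechanism, assuming an append-only table where every commit produces a new immutable snapshot. The `VersionedTable` class is hypothetical and stores full snapshots for simplicity, whereas real systems usually store only the changes.

```python
class VersionedTable:
    """Append-only table that keeps every committed snapshot."""

    def __init__(self):
        self._snapshots = [()]  # version 0 is the empty table

    def commit(self, rows):
        """Append rows as a new snapshot and return its version number."""
        self._snapshots.append(self._snapshots[-1] + tuple(rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one when a version is given."""
        if version is None:
            version = len(self._snapshots) - 1
        return list(self._snapshots[version])


table = VersionedTable()
v1 = table.commit(["order-1"])
v2 = table.commit(["order-2"])
latest = table.read()
as_of_v1 = table.read(v1)  # recover the state before the second commit
```

Reading an earlier version is what makes auditing and error recovery possible: a bad commit can be inspected or bypassed without losing the earlier state.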

📖

용어

Data Lineage

Complete traceability of the data lifecycle from source to final destination. It documents transformations, movements, and relationships between different data entities.
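
Lineage is often modeled as a graph from each dataset to the inputs it was derived from. The sketch below assumes a hypothetical `LINEAGE` mapping with made-up dataset and transformation names.

```python
# Hypothetical lineage records: dataset -> (transformation, input datasets).
LINEAGE = {
    "sales_report": ("aggregate_by_month", ["clean_orders"]),
    "clean_orders": ("deduplicate", ["raw_orders", "raw_customers"]),
}


def upstream(dataset, lineage):
    """Return every dataset that the given dataset ultimately derives from."""
    seen = []
    stack = [dataset]
    while stack:
        current = stack.pop()
        _, inputs = lineage.get(current, (None, []))
        for parent in inputs:
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


sources = upstream("sales_report", LINEAGE)
```

Walking the graph upstream answers "where did this number come from?"; walking it in the other direction would show which downstream datasets a change affects.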

📖

용어

Data Mesh

Decentralized data management architecture in which domain teams own and serve their data as products. It reduces bottlenecks around central data teams by promoting the autonomy of functional domains.

📖

용어

Delta Lake

Open-source storage layer bringing ACID transactions to Data Lakes built on distributed file systems and object stores. It enables updates, deletions, and time travel queries on Parquet data.

📖

용어

Apache Iceberg

Open-source table format for large analytical tables in Data Lakes, supporting schema evolution, hidden partitioning, and snapshot-based time travel. It tracks data files in table metadata, so query planning does not depend on listing directories.

📖

용어

Apache Hudi

Open-source data lake platform providing record-level upserts, deletes, and incremental processing on Data Lakes. It supports both batch and near-real-time streaming ingestion with transactional consistency guarantees.

📖

용어

Data Virtualization

Data integration approach allowing access to and manipulation of data without physically moving it from its sources. It creates a unified, abstract view of distributed data.

📖

용어

Data Fabric

Unified and intelligent data management architecture facilitating data access wherever it resides. It seamlessly combines data integration, governance, and orchestration.

📖

용어

Medallion Architecture

Data Lake architecture organizing data into three zones: Bronze (raw), Silver (cleaned), and Gold (aggregated). It progressively structures data for analysis and decision-making.
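
The progression through the three zones can be sketched with plain Python lists. The sample orders and the cleaning rules (drop unparseable amounts, drop duplicate order IDs, normalize country codes) are hypothetical choices made for illustration.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested, including duplicates and bad values.
bronze = [
    {"order_id": "1", "country": "fr", "amount": "10.0"},
    {"order_id": "1", "country": "fr", "amount": "10.0"},  # duplicate
    {"order_id": "2", "country": "DE", "amount": "oops"},  # invalid amount
    {"order_id": "3", "country": "FR", "amount": "5.5"},
]


def to_silver(rows):
    """Clean the raw rows: cast types, normalize values, drop bad or duplicate rows."""
    seen, cleaned = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicates of an order already kept
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return cleaned


def to_gold(rows):
    """Aggregate cleaned rows into revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping Bronze untouched means the Silver and Gold zones can always be rebuilt if the cleaning or aggregation rules change.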
