Advanced Data Pipeline Design

#data-engineering #stream-processing #big-data #etl

Design scalable and fault-tolerant data processing pipelines

📝 Prompt Content

You are a Lead Data Engineer at a global e-commerce company that needs to redesign its data pipeline to handle real-time analytics. The current batch system takes 12 hours to process data, resulting in stale dashboards and delayed business decisions. Your task is to design a modern data architecture that supports both real-time and batch processing using a Lambda or Kappa architecture. Your design should handle data from 500+ services producing 50TB of data daily with varying schemas and data quality.

Specifically design:
(1) Data ingestion layer with schema evolution handling
(2) Stream processing for real-time metrics using technologies like Apache Flink or Kafka Streams
(3) Batch processing for historical analysis using Apache Spark
(4) Data storage strategy for both serving and analytical workloads
(5) Data quality and monitoring framework

Provide a detailed architecture diagram and code samples for key components, and explain how your design handles schema evolution, late-arriving data, and exactly-once processing semantics. Include a migration strategy from the existing system and estimate the infrastructure costs of operating your solution at scale.
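To make two of the terms in the prompt concrete, here is a minimal, framework-free sketch of how late-arriving data and duplicate deliveries might be handled in an event-time window. It is an illustration under stated assumptions, not the Flink or Kafka Streams API: `Event` and `TumblingWindowSum` are hypothetical names, the watermark is simplified to "highest event time seen so far", and the in-memory `seen_ids` set only approximates exactly-once behavior (production systems typically rely on checkpointed state and transactional or idempotent sinks instead).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape, assumed for this sketch."""
    event_id: str    # unique ID, used to drop redelivered duplicates
    event_time: int  # seconds since epoch, assigned by the producer
    amount: float


@dataclass
class TumblingWindowSum:
    """Sums `amount` per fixed-size event-time window (illustrative only)."""
    window_size: int = 60       # window length in seconds
    allowed_lateness: int = 30  # grace period after a window's end
    watermark: int = 0          # simplified: highest event time seen so far
    sums: dict = field(default_factory=lambda: defaultdict(float))
    seen_ids: set = field(default_factory=set)
    dropped: list = field(default_factory=list)

    def process(self, event: Event) -> None:
        # Idempotent handling: a redelivered event must not be counted twice.
        if event.event_id in self.seen_ids:
            return
        self.seen_ids.add(event.event_id)

        self.watermark = max(self.watermark, event.event_time)
        window_start = event.event_time - event.event_time % self.window_size
        window_end = window_start + self.window_size

        # A window is closed once the watermark reaches its end plus the
        # allowed lateness; events for closed windows go to a side output.
        if window_end + self.allowed_lateness <= self.watermark:
            self.dropped.append(event)
            return
        self.sums[window_start] += event.amount


agg = TumblingWindowSum()
stream = [
    Event("a", 10, 5.0),
    Event("b", 70, 2.0),
    Event("c", 50, 1.0),   # out of order, but within the allowed lateness
    Event("a", 10, 5.0),   # redelivered duplicate, ignored
    Event("d", 130, 3.0),  # advances the watermark past window [0, 60) + 30s
    Event("e", 20, 4.0),   # too late for window [0, 60), dropped
]
for ev in stream:
    agg.process(ev)

print(dict(agg.sums))                     # {0: 6.0, 60: 2.0, 120: 3.0}
print([e.event_id for e in agg.dropped])  # ['e']
```

The design choice worth noting is the trade-off behind `allowed_lateness`: a larger value accepts more stragglers but delays when a window's result can be treated as final, which is why dropped events are kept in a side list here so they could be reconciled later by a batch job.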

General
