6.2.2. Key Concepts Review: Data Engineering for ML
First Principle: Effective data engineering for ML fundamentally establishes a robust and scalable data pipeline, ensuring data is ingested, stored, processed, and governed in a way that is fit for purpose, enabling accurate model training and reliable inference.
This review consolidates the key concepts and AWS services for data engineering in ML.
Core Concepts & AWS Services for Data Engineering in ML:
- Data Ingestion (a Kinesis producer sketch follows this list):
  - Operational Databases: AWS DMS (change data capture), DynamoDB Streams.
  - Real-time: Kinesis Data Streams, Kinesis Data Firehose, Amazon MSK (managed Apache Kafka).
  - Batch: Amazon S3 (direct upload), AWS DataSync, AWS Snow Family.
- Data Storage and Persistence:
  - Data Lakes: Amazon S3 (raw data, processed data, model artifacts).
  - Data Warehouses: Amazon Redshift (structured analytics, Redshift ML).
  - NoSQL Databases: DynamoDB (real-time features, online feature store), Amazon DocumentDB.
- Data Transformation and Processing (Athena and SageMaker Processing sketches follow this list):
  - Batch ETL: AWS Glue (serverless Spark/Python), Amazon EMR (managed Hadoop/Spark), Amazon Athena (serverless SQL on S3).
  - Streaming Processing: Kinesis Data Analytics (SQL/Apache Flink), Spark Streaming on EMR.
  - Data Prep for SageMaker: SageMaker Data Wrangler (visual data preparation), SageMaker Processing Jobs (managed processing containers).
- Data Catalogs and Governance (a Lake Formation sketch follows this list):
  - AWS Glue Data Catalog: central metadata repository of databases, tables, and schemas.
  - AWS Lake Formation: fine-grained (table- and column-level) access control on data lakes.
  - Data Access Controls: IAM policies, resource policies.
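The real-time ingestion path can be made concrete with a small producer. The sketch below writes events to a Kinesis Data Stream with boto3; the stream name, region, and event fields are illustrative assumptions, not names from the review above.

```python
# Minimal sketch, assuming a Kinesis Data Stream named "clickstream-events"
# already exists in us-east-1 (both names are illustrative placeholders).
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def put_click_event(user_id: str, page: str) -> None:
    """Send one click event to the stream.

    PartitionKey controls shard routing, so keying on user_id keeps each
    user's events ordered within a single shard.
    """
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps({"user_id": user_id, "page": page}).encode("utf-8"),
        PartitionKey=user_id,
    )


put_click_event("user-123", "/pricing")
```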
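"Serverless SQL on S3" with Athena looks like the following sketch: a query is started against a Glue Data Catalog database whose tables point at S3 data. The database, table, and results bucket names are placeholders.

```python
# Minimal sketch: run a serverless SQL query with Athena over catalog tables
# backed by S3. Database, table, and results bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT label, COUNT(*) AS n "
        "FROM training_events GROUP BY label"
    ),
    QueryExecutionContext={"Database": "ml_feature_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Athena runs asynchronously: poll get_query_execution with this id
# until the query succeeds, then read the results from S3.
print(response["QueryExecutionId"])
```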
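For the SageMaker data-prep bullet, a Processing Job runs a preprocessing script in a managed container against S3 inputs and outputs. This is a sketch using the SageMaker Python SDK; the role ARN, S3 prefixes, and preprocess.py script are assumed placeholders.

```python
# Minimal sketch of a SageMaker Processing job for batch feature preparation.
# The role ARN, S3 prefixes, and preprocess.py script are placeholders.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Runs preprocess.py in a managed container: reads raw data from the data
# lake's raw/ prefix and writes engineered features to processed/.
processor.run(
    code="preprocess.py",
    inputs=[
        ProcessingInput(
            source="s3://example-ml-data-lake/raw/",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination="s3://example-ml-data-lake/processed/",
        )
    ],
)
```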
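Finally, fine-grained governance with Lake Formation can be sketched as a column-level grant: a training role is allowed to SELECT only non-sensitive columns of a catalog table. The role ARN, database, table, and column names are hypothetical.

```python
# Minimal sketch: column-level access control via Lake Formation.
# Role ARN, database, table, and column names are hypothetical placeholders.
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/MLTrainingRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "ml_feature_db",
            "Name": "customer_features",
            # Only non-sensitive columns are exposed to the training role.
            "ColumnNames": ["customer_id", "tenure_days", "avg_spend"],
        }
    },
    Permissions=["SELECT"],
)
```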
Scenario: You need to design a scalable data pipeline to prepare large volumes of diverse data for an ML project, involving ingestion from streaming sources, batch transformations, and secure storage with fine-grained access control.
Reflection Question: How do data engineering principles and services (e.g., Kinesis for streaming, S3 for data lake, Glue for ETL, Lake Formation for governance) fundamentally ensure that data is ingested, stored, processed, and governed in a way that is fit for purpose, enabling accurate model training and reliable inference?