2.1. Data Sources and Ingestion
First Principle: Data ingestion for ML is the efficient, reliable collection of raw data from diverse sources so that it is ready for subsequent processing, storage, and analysis in model training and inference.
Before any machine learning can occur, data must be brought into a usable environment. This involves connecting to various data sources and ingesting data into AWS.
Key Concepts of Data Ingestion for ML:
- Diverse Sources: Data can originate from transactional databases, application logs, streaming events, third-party APIs, on-premises systems, etc.
- Batch vs. Streaming:
  - Batch Ingestion: For large volumes of historical data that can be processed periodically (e.g., daily, hourly).
  - Streaming Ingestion: For real-time data that needs to be processed continuously with low latency (e.g., sensor data, clickstreams).
- Incremental vs. Full Load:
  - Full Load: Transferring the entire dataset on every run; simple, but wasteful for large, slowly changing data.
  - Incremental Load: Transferring only new or changed data since the last run, typically tracked with a timestamp watermark or change data capture (CDC); see the sketch after this list.
- Data Formats: Data can arrive in many formats (CSV, JSON, Parquet, Avro, ORC); columnar formats such as Parquet and ORC are generally more efficient for analytics.
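To make the incremental-load idea concrete, here is a minimal sketch of the high-watermark pattern. It assumes pandas and pyarrow are installed, a local SQLite file stands in for the source system, and the `transactions` table and `updated_at` column are hypothetical names.

```python
import sqlite3
import pandas as pd

# Watermark persisted from the previous run (hypothetical value).
last_watermark = "2024-01-01T00:00:00Z"

conn = sqlite3.connect("source.db")  # hypothetical source database

# A full load would simply be: SELECT * FROM transactions
# An incremental load pulls only rows newer than the stored watermark.
df = pd.read_sql_query(
    "SELECT * FROM transactions WHERE updated_at > ?",
    conn,
    params=(last_watermark,),
)

# Write the increment in a columnar format (Parquet) for efficient analytics.
df.to_parquet(f"transactions_since_{last_watermark[:10]}.parquet", index=False)

# Advance the watermark so the next run picks up where this one left off.
if not df.empty:
    last_watermark = df["updated_at"].max()
```

In a real pipeline the watermark would be stored durably (e.g., in a metadata table or parameter store) rather than in a local variable, but the pattern is the same.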
AWS Services for Data Ingestion:
- Batch: Amazon S3 (direct uploads), AWS DataSync (online transfer), AWS Snow Family (offline transfer), AWS Glue (crawlers for discovery, jobs for ETL), AWS DMS (database migration).
- Streaming: Amazon Kinesis Data Streams (real-time capture), Amazon Managed Streaming for Apache Kafka (MSK) for fully managed Kafka clusters. Both the batch and streaming calls are illustrated in the sketch below.
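As a concrete illustration of both paths, here is a minimal boto3 sketch, assuming AWS credentials are configured; the bucket name, stream name, and file names are hypothetical placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# Batch path: upload a historical extract directly to Amazon S3.
s3.upload_file(
    Filename="transactions_2024.parquet",  # hypothetical local extract
    Bucket="my-ml-raw-data",               # hypothetical bucket
    Key="batch/transactions/transactions_2024.parquet",
)

# Streaming path: put a single clickstream event onto a Kinesis data stream.
event = {"user_id": "u-123", "page": "/checkout", "ts": "2025-01-01T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream-events",       # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],         # distributes records across shards
)
```

A production producer would batch events with `put_records` and handle retries and throttling; this only shows the shape of each call.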
Scenario: You need to ingest historical customer transaction data from an on-premises SQL Server database and real-time clickstream data from your website into AWS for an ML project.
Reflection Question: How do different data ingestion strategies and AWS services (e.g., batch ingestion with AWS DMS for databases, streaming ingestion with Kinesis Data Streams for real-time events) ensure that raw data is efficiently and reliably collected from diverse sources and made ready for ML workloads?