2.1.1. Ingestion from Operational Databases (RDS, DynamoDB)
First Principle: Ingesting data from operational databases means securely extracting structured data for ML, through batch snapshots or real-time change data capture (CDC), while keeping the load on OLTP workloads to a minimum.
Operational databases (like Amazon RDS for relational or Amazon DynamoDB for NoSQL) are common sources for machine learning data. Strategies for extracting this data must consider performance impact on the live database.
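One common way to limit load on a live database during batch extraction is to read in small, bounded pages ordered by a key (keyset pagination) rather than issuing a single large scan. The sketch below is illustrative only: `export_in_pages` is a hypothetical helper, the standard-library `sqlite3` module stands in for an RDS connection so the example runs locally, and the `?` placeholder style is SQLite-specific (other drivers may use a different style). Uploading each chunk to S3 is left out.

```python
import csv
import io
import sqlite3  # stand-in for an RDS driver; an assumption made for local runs


def export_in_pages(conn, table, key_column, columns, page_size=1000):
    """Yield CSV text chunks, reading at most page_size rows per query.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    col_list = ", ".join(columns)
    key_index = columns.index(key_column)
    last_key = None
    while True:
        if last_key is None:
            sql = f"SELECT {col_list} FROM {table} ORDER BY {key_column} LIMIT ?"
            params = (page_size,)
        else:
            # Resume after the last key seen instead of using OFFSET,
            # so each query touches only one page of rows.
            sql = (
                f"SELECT {col_list} FROM {table} "
                f"WHERE {key_column} > ? ORDER BY {key_column} LIMIT ?"
            )
            params = (last_key, page_size)
        rows = conn.execute(sql, params).fetchall()
        if not rows:
            return
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        yield buf.getvalue()
        last_key = rows[-1][key_index]
```

In practice such a job would more likely point at a read replica or a snapshot than at the primary instance, and each yielded chunk would be written to an S3 object.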
Key Concepts & Strategies:
- Batch Export:
  - Method: Periodically export data to Amazon S3 in a chosen format (e.g., CSV, Parquet).
  - Tools:
    - AWS Glue ETL jobs: Connect directly to RDS or DynamoDB (as data sources), read data, transform if needed, and write to S3.
    - Amazon Athena: Can connect to RDS databases via federated queries for ad-hoc extraction.
    - Custom scripts on EC2: Use database connectors to extract and upload.
- Change Data Capture (CDC):
  - Method: Capture and deliver changes (inserts, updates, deletes) from the source database in real time or near real time.
  - Tools:
    - AWS Database Migration Service (DMS): (Migrates databases with minimal downtime.) Supports continuous replication (CDC) from many source databases (including RDS and on-premises engines) to various targets, including S3 (for data lakes) or Kinesis Data Streams (for streaming). DMS supports DynamoDB as a target, not as a source; for CDC from DynamoDB tables, use DynamoDB Streams.
    - DynamoDB Streams: (Captures changes to DynamoDB tables.) A time-ordered sequence of item-level modifications in a DynamoDB table, retained for 24 hours. Can trigger AWS Lambda functions. To deliver changes through Kinesis Data Firehose, enable Kinesis Data Streams for DynamoDB, a separate option that Firehose can read from.
- Direct Integration:
Scenario: You need to get customer data from an Amazon RDS for PostgreSQL instance into an Amazon S3 data lake for daily batch ML training. Additionally, you want to capture real-time updates to a DynamoDB table to update an online feature store for low-latency inference.
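The real-time half of this scenario can be sketched as a Lambda-style handler that applies DynamoDB Streams records to a feature store. This is a minimal illustration under stated assumptions: `FEATURE_STORE` is a hypothetical in-memory dictionary standing in for a real online store, `from_ddb` handles only the `S`, `N`, `BOOL`, and `NULL` attribute types, and the stream is assumed to be configured with a view type that includes the new image (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`).

```python
from decimal import Decimal

# Hypothetical in-memory stand-in for an online feature store.
FEATURE_STORE = {}


def from_ddb(attr):
    """Convert one DynamoDB-JSON attribute value to a Python value.

    Only S, N, BOOL, and NULL are handled in this sketch.
    """
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers arrive as strings in stream records
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    raise ValueError(f"unsupported attribute type: {type_tag}")


def handler(event, context=None):
    """Apply each stream record (INSERT, MODIFY, REMOVE) to the store."""
    applied = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        # Build a hashable key from the item's primary key attributes.
        key = tuple(sorted((name, from_ddb(v)) for name, v in change["Keys"].items()))
        if record["eventName"] == "REMOVE":
            FEATURE_STORE.pop(key, None)
        else:  # INSERT or MODIFY: overwrite with the latest item image
            FEATURE_STORE[key] = {
                name: from_ddb(v) for name, v in change["NewImage"].items()
            }
        applied += 1
    return {"applied": applied}
```

Because each write overwrites the stored features with the latest item image, reprocessing the same record after a retry leaves the store in the same state, which is a useful property when a Lambda invocation is retried.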
Reflection Question: How do batch export using AWS Glue (for RDS) and change data capture (CDC) using DynamoDB Streams enable secure and efficient extraction of structured data from operational databases for ML training and real-time inference, while limiting the impact on OLTP performance?