Copyright (c) 2025 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.1.1. Ingestion from Operational Databases (RDS, DynamoDB)

First Principle: Ingesting data from operational databases means securely extracting structured data for ML, through either batch snapshots or real-time change data capture (CDC), without degrading OLTP performance.

Operational databases, such as Amazon RDS (relational) and Amazon DynamoDB (NoSQL), are common sources of machine learning data. Any extraction strategy must account for its performance impact on the live database.

Key Concepts & Strategies:
  • Batch Export:
    • Method: Periodically export data to Amazon S3 in a chosen format (e.g., CSV, Parquet).
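    • Tools: AWS Glue (for RDS).
  • Change Data Capture (CDC):
    • Method: Capture changes in real time as they occur, rather than re-exporting full snapshots.
    • Tools: DynamoDB Streams (for DynamoDB).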
  • Direct Integration:
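
As a rough illustration of the Batch Export method, the sketch below serializes a batch of rows to CSV and derives a date-partitioned S3 object key. It is a minimal sketch under stated assumptions, not a definitive pipeline: the function name build_export_object, the sample rows, and the key layout (table/dt=YYYY-MM-DD/part-0000.csv) are all hypothetical, and the rows are assumed to have already been fetched in bulk (for example from a read replica, so the primary OLTP instance is not queried). The actual upload is left out; with boto3 it could be done through the S3 client's put_object call.

```python
import csv
import io
from datetime import date


def build_export_object(rows, table_name, run_date):
    """Serialize rows to CSV and derive a date-partitioned S3 object key.

    rows: list of dicts that all share the same keys (assumed to come from
    a bulk query result). The key layout is a hypothetical convention.
    """
    if not rows:
        raise ValueError("nothing to export")
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    # One partition per daily run, so a training job can read a single day.
    key = f"{table_name}/dt={run_date.isoformat()}/part-0000.csv"
    return key, buffer.getvalue()


# Hypothetical sample rows standing in for a query result.
sample_rows = [
    {"customer_id": 1, "plan": "basic", "monthly_spend": 19.99},
    {"customer_id": 2, "plan": "pro", "monthly_spend": 49.0},
]
key, body = build_export_object(sample_rows, "customers", date(2025, 1, 15))
print(key)
print(body)
```

Partitioning the key by run date is one possible design choice: each daily export lands under its own prefix, so reruns overwrite only that day's object.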

Scenario: You need to get customer data from an Amazon RDS for PostgreSQL instance into an Amazon S3 data lake for daily batch ML training. Additionally, you want to capture real-time updates to a DynamoDB table to update an online feature store for low-latency inference.
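
The real-time half of this scenario can be sketched as a handler that applies DynamoDB Streams records to a feature store. This is a minimal sketch under assumptions: the feature store is a plain Python dict standing in for a real online store, the function names and the customer_id and orders_30d attributes are hypothetical, and the stream is assumed to be configured with a view type that includes the new image (NEW_IMAGE or NEW_AND_OLD_IMAGES). Only the record shape (eventName, dynamodb.Keys, dynamodb.NewImage, and typed attribute values such as {"S": ...} and {"N": ...}) follows the DynamoDB Streams format.

```python
from decimal import Decimal


def decode_attribute(av):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (type_tag, value), = av.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers arrive as strings
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [decode_attribute(v) for v in value]
    if type_tag == "M":
        return {k: decode_attribute(v) for k, v in value.items()}
    raise ValueError(f"unsupported attribute type: {type_tag}")


def apply_stream_records(event, feature_store, key_name):
    """Apply INSERT/MODIFY/REMOVE records to a dict-like store (hypothetical)."""
    for record in event["Records"]:
        change = record["dynamodb"]
        entity_id = decode_attribute(change["Keys"][key_name])
        if record["eventName"] == "REMOVE":
            feature_store.pop(entity_id, None)
        else:
            # Assumes the stream view type includes NewImage.
            feature_store[entity_id] = {
                name: decode_attribute(av)
                for name, av in change["NewImage"].items()
            }
    return feature_store


def _record(event_name, customer_id, orders=None):
    """Build a hypothetical stream record for the example below."""
    change = {"Keys": {"customer_id": {"S": customer_id}}}
    if orders is not None:
        change["NewImage"] = {
            "customer_id": {"S": customer_id},
            "orders_30d": {"N": str(orders)},
        }
    return {"eventName": event_name, "dynamodb": change}


event = {"Records": [
    _record("INSERT", "c-1", 3),
    _record("MODIFY", "c-1", 4),
    _record("INSERT", "c-2", 1),
    _record("REMOVE", "c-2"),
]}
store = apply_stream_records(event, {}, "customer_id")
print(store)
```

Because the handler only consumes change records, it never queries the source table, which is what keeps this path off the OLTP workload.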

Reflection Question: How do batch export with AWS Glue (for RDS) and Change Data Capture (CDC) with DynamoDB Streams enable secure, efficient extraction of structured data from operational databases for ML training and real-time inference without impacting OLTP performance?