2.2.3. NoSQL Databases (DynamoDB, DocumentDB)
First Principle: NoSQL databases fundamentally provide flexible schema, high scalability, and low-latency access patterns, making them ideal for storing and serving real-time features or rapidly changing operational data for ML inference.
For data that doesn't fit a relational model, requires extreme scalability, or needs very low-latency access (e.g., for real-time feature lookups during inference), NoSQL databases are often the preferred choice.
Key Concepts of NoSQL Databases for ML:
- Flexible Schema: Unlike relational databases, NoSQL databases can handle semi-structured or unstructured data and allow schema to evolve over time, which is beneficial for dynamic feature sets.
- High Scalability: Designed to scale horizontally to handle massive amounts of data and high request rates.
- Low Latency: Optimized for fast read and write operations, crucial for real-time inference feature lookups.
- Data Models: Several types exist (key-value, document, wide-column, graph), each optimized for different access patterns.
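To make the data-model distinctions concrete, the sketch below contrasts key-value, document, and wide-column layouts using plain Python structures (all keys and values are hypothetical; real databases add partitioning, indexing, and replication on top). Graph models are covered under Amazon Neptune below.

```python
# Illustrative-only comparison of NoSQL data models as Python structures.

# Key-value: one opaque value per key (e.g., DynamoDB in its simplest form).
key_value = {"user#123": b'{"plan": "premium"}'}

# Document: nested, self-describing records whose fields can vary per item
# (e.g., DocumentDB / MongoDB).
document = {
    "_id": "user#123",
    "name": "Ana",
    "preferences": {"genres": ["sci-fi", "drama"]},  # nested structure
}

# Wide-column: rows keyed by a partition key, each holding its own sparse
# set of columns (e.g., Cassandra-style stores).
wide_column = {
    "user#123": {"clicks:2024-01-01": 14, "clicks:2024-01-02": 9},
    "user#456": {"purchases:2024-01-01": 2},  # different columns per row
}

print(document["preferences"]["genres"][0])
```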
AWS Services for NoSQL Data Storage in ML:
- Amazon DynamoDB: (Key-value and document database.)
- What it is: A fully managed, serverless, key-value and document database that delivers single-digit millisecond performance at any scale.
- Use Cases for ML:
- Real-time Feature Store: Storing and serving features for online inference (SageMaker Feature Store can use DynamoDB as an online store).
- Model Metadata: Storing model versioning, training run details, or endpoint configurations.
- Inference Results: Storing lightweight, high-volume inference results for downstream applications.
- Key Features: Global tables for multi-Region replication, DynamoDB Streams for change data capture.
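A minimal sketch of the online feature-store pattern: DynamoDB stores items as typed attribute values (`S`, `N`, `BOOL`, `L`, `M`, ...). In practice boto3's `TypeSerializer` performs this conversion and `table.get_item` does the lookup; the hand-rolled serializer and the feature names below are stdlib-only illustrations, not the production API.

```python
# Sketch: serialize a feature dict into DynamoDB attribute-value notation.
# Real code would use boto3 (boto3.dynamodb.types.TypeSerializer plus
# Table.put_item / Table.get_item); this stand-in is for illustration only.
from decimal import Decimal

def to_dynamodb(value):
    """Convert a Python value to DynamoDB attribute-value notation."""
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, bool):          # check bool before int/float
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        return {"N": str(value)}         # DynamoDB transmits numbers as strings
    if isinstance(value, list):
        return {"L": [to_dynamodb(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value)!r}")

# Hypothetical real-time features for one user.
features = {"user_id": "u-123", "avg_session_min": 12.5, "segments": ["a", "b"]}
item = {k: to_dynamodb(v) for k, v in features.items()}
```

At inference time, the corresponding low-latency read with boto3 would be roughly `table.get_item(Key={"user_id": "u-123"})` against the feature table.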
- Amazon DocumentDB (with MongoDB compatibility): (Managed MongoDB-compatible database.)
- What it is: A fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads.
- Use Cases for ML: Storing semi-structured data like user profiles, product catalogs, or nested JSON documents that serve as input for ML models.
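To make the flexible-schema benefit concrete, the sketch below (plain Python; the field names are hypothetical, and a real deployment would read the documents through a MongoDB-compatible client such as pymongo) extracts a fixed feature vector from user-profile documents whose schema has evolved over time:

```python
# Two user-profile documents written at different times: the newer one
# carries fields the older one predates. All field names are hypothetical.
profiles = [
    {"_id": "u-1", "age": 34, "country": "DE"},
    {"_id": "u-2", "age": 27, "country": "US",
     "loyalty_tier": "gold", "recent_searches": ["gpu", "ssd"]},
]

def to_features(doc):
    """Map a document to a fixed feature vector, defaulting absent fields."""
    return {
        "age": doc.get("age", -1),
        "country": doc.get("country", "unknown"),
        "loyalty_tier": doc.get("loyalty_tier", "none"),      # newer field
        "n_recent_searches": len(doc.get("recent_searches", [])),
    }

rows = [to_features(p) for p in profiles]
```

Because documents are self-describing, old records need no migration when new fields are introduced; the feature-extraction layer simply supplies defaults.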
- Amazon Neptune: (Managed graph database.) For graph data models, useful for ML problems involving relationships (e.g., social networks, fraud detection, recommendation engines).
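A toy version of the graph use case, assuming a co-purchase graph held as an in-memory adjacency map (Neptune itself is queried with Gremlin, openCypher, or SPARQL rather than Python dicts): a one-hop traversal from a user's purchases yields "customers also bought" candidates.

```python
# Sketch of a recommendation query over a purchase graph. The users and
# products are hypothetical; a real system would run this as a graph query.
from collections import Counter

# Bipartite purchase graph: user -> set of products bought.
purchases = {
    "alice": {"laptop", "mouse"},
    "bob":   {"laptop", "keyboard"},
    "carol": {"mouse", "monitor"},
}

def recommend(user):
    """Rank products bought by users who share a purchase with `user`."""
    mine = purchases[user]
    counts = Counter()
    for other, prods in purchases.items():
        if other != user and prods & mine:   # shares at least one product
            counts.update(prods - mine)      # candidate recommendations
    return [p for p, _ in counts.most_common()]
```

The same relationship-following shape underlies fraud detection (accounts linked by shared devices or cards) and social-network features, which is why a graph model fits those problems.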
Scenario: You are building a real-time recommendation system. You need to store frequently updated user preferences and product attributes that your model will use for immediate inference. You also need to store historical customer interaction data with a flexible schema for future, evolving ML models.
Reflection Question: How do NoSQL databases (e.g., DynamoDB for low-latency feature serving, DocumentDB for flexible schema historical data) fundamentally provide the flexibility, high scalability, and low-latency access patterns ideal for storing and serving real-time features and rapidly changing operational data for ML inference?