2.2. Data Storage and Persistence
First Principle: Data storage and persistence for ML come down to selecting the right storage solution for each data type, access pattern, scale, and cost constraint, so that data remains durably and efficiently available throughout the ML lifecycle.
Once data is ingested into AWS, it needs to be stored in a way that supports its lifecycle within the ML workflow—from raw input to processed features to model artifacts. Different data types and access patterns require different storage solutions.
Key Concepts of Data Storage & Persistence for ML:
- Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale. It can store data as is, without needing to first structure the data.
- Data Warehouse: A system for reporting and data analysis, and a core component of business intelligence. It is a central repository of integrated, structured data from one or more disparate sources.
- NoSQL Database: Provides storage and retrieval of data modeled by means other than the tabular relations of relational databases (e.g., key-value, document, or graph models).
- Data Format: Choosing appropriate formats (e.g., Parquet and ORC for columnar storage; Avro for row-oriented record serialization; JSON for semi-structured data; CSV for simple tabular data). Columnar formats are often preferred for analytical workloads because queries can read only the columns they need.
- Access Patterns: How often and in what way data will be accessed (e.g., streaming reads, random key lookups, large sequential scans).
- Cost: Balancing storage cost with access performance and durability.
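To make the columnar-format point above concrete, here is a minimal pure-Python sketch (no Parquet library assumed) contrasting row-oriented and column-oriented layouts. An analytical aggregate over a single column only has to touch that column's values in the columnar layout, which is the core reason formats like Parquet and ORC scan faster:

```python
# Illustrative sketch: why columnar layouts (Parquet, ORC) suit analytical scans.

# Row-oriented layout: each record stored together, like a CSV row.
rows = [
    {"user_id": i, "country": "US", "spend": float(i)} for i in range(1_000)
]

# Column-oriented layout: each column stored contiguously,
# like a Parquet column chunk.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "spend":   [r["spend"] for r in rows],
}

# Analytical query: total spend. The row layout must walk every record
# (touching all fields); the columnar layout reads only the "spend" list.
total_from_rows = sum(r["spend"] for r in rows)
total_from_columns = sum(columns["spend"])

assert total_from_rows == total_from_columns
```

Real columnar files add compression and encoding per column on top of this layout, which compounds the scan advantage.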
AWS Services for Data Storage & Persistence in ML:
- Amazon S3 (Simple Storage Service): (Scalable object storage.) Ideal for building data lakes. Highly durable, scalable, cost-effective, and supports various storage classes for different access patterns. Used for raw data, processed data, and model artifacts.
- Amazon Redshift: (Cloud data warehouse.) For structured, analytical data that requires complex SQL queries and aggregations. Well-suited for serving as a source for feature engineering or for storing aggregated results of ML predictions.
- Amazon DynamoDB: (NoSQL database service.) For key-value and document data models, offering single-digit-millisecond latency at any scale. Ideal for serving real-time features to models (a common online feature-store pattern) or for storing low-latency inference results.
- Amazon DocumentDB (with MongoDB compatibility): (Managed MongoDB-compatible database.) For document data that requires flexible schema.
- Amazon ElastiCache: (Managed in-memory cache.) For caching frequently accessed features for extremely low-latency inference.
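As an illustrative summary of the service list above, the selection logic can be sketched as a small helper. The function name `pick_storage`, its parameters, and the rule set are inventions of this sketch (not an AWS API); they simply encode the guidance in this section:

```python
def pick_storage(data_type: str, access_pattern: str) -> str:
    """Toy decision rule mapping data characteristics to an AWS storage
    service, following the guidance in this section. Illustrative only."""
    if access_pattern == "low_latency_key_lookup":
        return "Amazon DynamoDB"       # real-time features, ms latency
    if access_pattern == "in_memory_cache":
        return "Amazon ElastiCache"    # hottest features, sub-ms reads
    if data_type == "structured" and access_pattern == "complex_sql":
        return "Amazon Redshift"       # analytical warehouse queries
    if data_type == "document":
        return "Amazon DocumentDB"     # flexible-schema document data
    return "Amazon S3"                 # default: durable data-lake storage

print(pick_storage("unstructured", "large_scans"))          # Amazon S3
print(pick_storage("structured", "complex_sql"))            # Amazon Redshift
print(pick_storage("key_value", "low_latency_key_lookup"))  # Amazon DynamoDB
```

In practice the choice is rarely either/or: the same dataset often lives in S3 as the durable source of truth while derived features are pushed to DynamoDB or ElastiCache for serving.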
Scenario: You need to store vast amounts of raw, unstructured log data for future analysis, historical customer transaction data for complex SQL queries, and real-time user features for low-latency model inference.
Reflection Question: How does selecting the right data storage solution (e.g., Amazon S3 for a data lake, Amazon Redshift for a data warehouse, Amazon DynamoDB for real-time features) based on data type, access patterns, scalability, and cost fundamentally ensure data is durably and efficiently available throughout the ML lifecycle?
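The three needs in the scenario above map naturally onto S3 (raw logs), Redshift (transaction analytics), and DynamoDB (real-time features). A minimal boto3 sketch for the S3 and DynamoDB halves might look like the following; the bucket name, table name, and Hive-style `dt=` key layout are illustrative assumptions, not fixed conventions:

```python
def log_object_key(source: str, date: str, filename: str) -> str:
    """Build a Hive-style partitioned S3 key for raw logs so downstream
    engines can prune partitions. Layout is an assumption of this sketch."""
    return f"raw/{source}/dt={date}/{filename}"

def store_raw_log(bucket: str, source: str, date: str,
                  filename: str, body: bytes) -> None:
    """Land a raw log object in the S3 data lake."""
    import boto3  # imported lazily so the key helper works without AWS deps
    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket,
                  Key=log_object_key(source, date, filename),
                  Body=body)

def get_user_features(table_name: str, user_id: str) -> dict:
    """Fetch real-time user features for low-latency inference from DynamoDB."""
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    return table.get_item(Key={"user_id": user_id}).get("Item", {})

print(log_object_key("clickstream", "2024-01-15", "part-0001.json.gz"))
```

Historical transactions would instead be loaded into Redshift (typically via `COPY` from S3) and queried with standard SQL, keeping each workload on the storage engine built for its access pattern.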