
1.4.4. Other Related ML/Analytics Services

First Principle: Complementary AWS analytics and database services provide the essential data infrastructure for ML workflows, enabling ingestion, processing, and storage of diverse data types at scale.

Beyond SageMaker and the AI Services, a robust machine learning solution on AWS often leverages a suite of complementary analytics and database services for managing data throughout its lifecycle.

Key Related AWS Analytics and Database Services for ML:

Data Ingestion & Streaming:
- Amazon Kinesis: (Real-time data streaming.) Collects, processes, and analyzes real-time streaming data (e.g., clickstreams, IoT telemetry).
- AWS DataSync: (Online data transfer service.) For automated, fast, online transfer of large amounts of data between on-premises storage and AWS storage services (like S3).
- AWS Snow Family: (Offline data transfer devices.) For petabyte-scale data transfers into/out of AWS when internet transfer is impractical.
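
For the streaming case, a minimal sketch of sending one clickstream event to a Kinesis data stream might look like the following. It assumes a boto3 Kinesis client is passed in (for example `boto3.client("kinesis")`); the stream name and the event fields such as `session_id` are hypothetical and chosen only for illustration.

```python
import json


def build_put_record_args(stream_name, event):
    """Build keyword arguments for the Kinesis Data Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis expects the payload as bytes; JSON is one common choice.
        "Data": json.dumps(event).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard,
        # which keeps events from one session in order (assumed field name).
        "PartitionKey": str(event["session_id"]),
    }


def send_click_event(client, stream_name, event):
    """Send one event. `client` is assumed to be boto3.client("kinesis")."""
    return client.put_record(**build_put_record_args(stream_name, event))
```

Separating the argument builder from the call keeps the payload logic checkable without AWS credentials. For higher throughput, batching several events into one request is usually preferable to one call per event.
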
Data Warehousing & Querying:
- Amazon Redshift: (Cloud data warehouse.) A fully managed, petabyte-scale data warehouse service for analytical queries on structured data.
- Amazon Athena: (Interactive query service.) Serverless interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.
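
As an illustration of SQL-based feature engineering over data in S3, the sketch below starts an Athena query through a boto3 Athena client that is passed in (for example `boto3.client("athena")`). The table `clickstream_events`, its columns, the database name, and the S3 output location are all assumptions made for this example, and `event_date` is assumed to be a DATE column.

```python
# Hypothetical feature query: clicks per customer over the last 7 days.
FEATURE_SQL = """
SELECT customer_id, COUNT(*) AS clicks_7d
FROM clickstream_events
WHERE event_date >= date_add('day', -7, current_date)
GROUP BY customer_id
"""


def build_athena_query_args(sql, database, output_s3_uri):
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def start_feature_query(client, database, output_s3_uri):
    """Start the query and return its execution ID.

    `client` is assumed to be boto3.client("athena"). The query runs
    asynchronously, so the caller still has to poll for completion.
    """
    response = client.start_query_execution(
        **build_athena_query_args(FEATURE_SQL, database, output_s3_uri)
    )
    return response["QueryExecutionId"]
```
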
ETL & Data Processing:
- AWS Glue: (Serverless data integration service.) Discovers, prepares, and combines data for analytics, machine learning, and application development. Includes the Glue Data Catalog (a central metadata repository) and Glue ETL jobs.
- Amazon EMR (Elastic MapReduce): (Managed Hadoop framework.) A managed cluster platform that simplifies running big data frameworks like Apache Spark, Hadoop, and Hive on AWS.
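
To show how an ETL step might be triggered from a pipeline, the sketch below starts a run of an existing Glue job through a boto3 Glue client that is passed in (for example `boto3.client("glue")`). The job name and the `--source_uri` and `--target_uri` arguments are hypothetical; a real job script would need to read whichever arguments it actually defines.

```python
def build_job_run_args(job_name, source_uri, target_uri):
    """Build keyword arguments for the Glue StartJobRun call."""
    return {
        "JobName": job_name,
        # Job arguments are passed as strings; by convention the keys
        # start with "--". These two names are made up for this example.
        "Arguments": {
            "--source_uri": source_uri,
            "--target_uri": target_uri,
        },
    }


def start_etl_job(client, job_name, source_uri, target_uri):
    """Start a job run and return its run ID.

    `client` is assumed to be boto3.client("glue"), and the job named
    `job_name` is assumed to exist already.
    """
    response = client.start_job_run(
        **build_job_run_args(job_name, source_uri, target_uri)
    )
    return response["JobRunId"]
```
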
Databases:
- Amazon RDS (Relational Database Service): (Managed relational databases.) For operational relational databases that can serve as data sources for ML.
- Amazon DynamoDB: (NoSQL database service.) A fast, flexible NoSQL database service that delivers single-digit millisecond latency at any scale, often used to serve real-time features or store inference data.
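
As one possible illustration of the real-time feature use case, the sketch below looks up a customer's precomputed features at inference time using a low-level boto3 DynamoDB client that is passed in (for example `boto3.client("dynamodb")`). The table layout, including the `customer_id` string key and the numeric feature attributes, is assumed for this example.

```python
def build_get_item_args(table_name, customer_id):
    """Build keyword arguments for the DynamoDB GetItem call."""
    return {
        "TableName": table_name,
        # The low-level client uses typed attribute values; "S" means string.
        "Key": {"customer_id": {"S": customer_id}},
    }


def fetch_features(client, table_name, customer_id):
    """Return the customer's numeric features as floats, or None if absent.

    `client` is assumed to be boto3.client("dynamodb").
    """
    response = client.get_item(**build_get_item_args(table_name, customer_id))
    item = response.get("Item")
    if item is None:
        return None
    # Numeric attributes arrive as strings under the "N" type tag.
    return {
        name: float(value["N"])
        for name, value in item.items()
        if "N" in value
    }
```

Returning `None` for a missing key lets the caller decide whether to fall back to default feature values.
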

Scenario: You are building an ML pipeline that needs to ingest real-time customer clickstream data, store large volumes of raw and processed data for analysis, perform complex ETL operations, and then query this data using SQL for feature engineering.

Reflection Question: How do complementary AWS analytics and database services (e.g., Kinesis for streaming ingestion, S3 for data lake storage, Glue for ETL, Athena for SQL querying) together provide the data infrastructure that ML workflows depend on, enabling ingestion, processing, and storage of diverse data types at scale?