3.1.5.1. Data Ingestion Patterns and Services (Kinesis, DataSync)
💡 First Principle: Data ingestion efficiently and reliably collects and transfers data from diverse sources into AWS for storage and processing, supporting real-time analytics or batch operations.
Data ingestion is the process of collecting and transferring data from various sources into a storage system for processing and analysis. The choice of ingestion pattern depends on the data's volume, velocity, and desired processing latency.
- Real-time/Streaming Ingestion: For high-velocity, continuous data streams that need immediate processing.
- "Amazon Kinesis": A suite of services for processing streaming data. Includes Kinesis Data Streams (captures and processes large streams for real-time applications) and Amazon Data Firehose, formerly Kinesis Data Firehose (buffers streaming data and delivers it to destinations like S3, Redshift, or Splunk, often for downstream batch analytics).
- Batch Ingestion: For large volumes of data that can be transferred periodically or in bulk.
- "AWS DataSync": An online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage and AWS storage services (e.g., S3, EFS, FSx), or between AWS storage services.
- "AWS Snow Family": A collection of physical devices that help migrate petabytes of data into and out of AWS. Used for offline, very large-scale data transfers where moving the data over the network would be impractical.
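To make the streaming pattern concrete, here is a minimal Python sketch of a producer that writes events to Kinesis Data Streams with the boto3 `put_records` call. The `session_id` key field, the helper names, and the stream name passed by the caller are illustrative assumptions, not part of any AWS API. The sketch splits events into groups of at most 500 because a single PutRecords request accepts at most 500 records.

```python
import json
from typing import Iterator

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_record(event: dict, key_field: str = "session_id") -> dict:
    """Convert one event into the record shape that PutRecords expects.

    key_field is a hypothetical field name chosen for this sketch.
    """
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def batches(records: list, size: int = MAX_RECORDS_PER_CALL) -> Iterator[list]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send_events(events: list, stream_name: str, client=None) -> int:
    """Send events to the stream; return how many records were reported as failed."""
    if client is None:
        import boto3  # imported lazily so the helpers above work without AWS access
        client = boto3.client("kinesis")
    failed = 0
    records = [to_record(event) for event in events]
    for batch in batches(records):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response["FailedRecordCount"]
    return failed
```

PutRecords can partially succeed, so a production producer would inspect the per-record results and retry the failed entries; this sketch only counts them.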
Key Data Ingestion Services:
- "Kinesis (Streams/Firehose)": Real-time, streaming data.
- "DataSync": Online batch/incremental transfer for files.
- "Snow Family": Offline/massive batch transfer.
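The summary above can be read as a simple decision rule. The following Python sketch encodes it under two simplifying assumptions: the workload is described by only two yes/no properties, and the `Workload` type and `suggest_service` function are hypothetical names introduced for illustration. It is a study aid that mirrors this section, not official AWS selection guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Data arrives continuously and must be processed in near real time.
    continuous_stream: bool
    # The available network link can move the data within the required window.
    network_transfer_ok: bool


def suggest_service(workload: Workload) -> str:
    """Map a workload to the service family named in the summary list."""
    if workload.continuous_stream:
        return "Kinesis (Streams/Firehose)"
    if workload.network_transfer_ok:
        return "DataSync"
    return "Snow Family"


# Example: a one-time bulk migration that is too large for the network link.
print(suggest_service(Workload(continuous_stream=False, network_transfer_ok=False)))
```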
Scenario: Imagine using Amazon Kinesis Data Streams to ingest real-time website clickstream data for immediate analytics, or AWS DataSync to securely migrate terabytes of on-premises historical logs to Amazon S3 for archival and batch processing.
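For the DataSync half of this scenario, a migration run might be started as in the Python sketch below. It assumes a DataSync task already exists (with the on-premises source location and the S3 destination configured) and uses the boto3 `start_task_execution` call with an optional include filter. The task ARN and the folder patterns in the usage comment are hypothetical placeholders.

```python
def build_task_execution_request(task_arn: str, include_patterns: list) -> dict:
    """Build keyword arguments for the DataSync StartTaskExecution call.

    Include patterns are joined with "|" into a single SIMPLE_PATTERN filter.
    """
    request = {"TaskArn": task_arn}
    if include_patterns:
        request["Includes"] = [
            {"FilterType": "SIMPLE_PATTERN", "Value": "|".join(include_patterns)}
        ]
    return request


def start_migration(task_arn: str, include_patterns=(), client=None) -> str:
    """Start one execution of an existing task and return its execution ARN."""
    if client is None:
        import boto3  # imported lazily so the request builder works without AWS access
        client = boto3.client("datasync")
    request = build_task_execution_request(task_arn, list(include_patterns))
    response = client.start_task_execution(**request)
    return response["TaskExecutionArn"]


# Hypothetical usage (placeholder ARN and folder names):
# start_migration(
#     "arn:aws:datasync:us-east-1:111122223333:task/task-EXAMPLE",
#     include_patterns=["/logs/2022", "/logs/2023"],
# )
```

Because DataSync transfers incrementally, running the same task again later would copy only data that changed since the previous execution, which suits periodic batch ingestion of new log files.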
[Diagram: Data Ingestion Patterns and AWS Services]
⚠️ Common Pitfall: Choosing an offline transfer (Snow Family) for data that needs to be processed in near real-time, or using Kinesis for a one-time, petabyte-scale data migration.
Key Trade-Offs:
- Latency (Kinesis) vs. Cost/Simplicity (DataSync/Snow Family): Kinesis provides low-latency streaming but is more complex and can be more expensive. DataSync and Snow Family are cost-effective for large batch transfers but introduce higher latency.
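The online-versus-offline side of this trade-off often comes down to simple arithmetic: how long would the transfer take over the available link? The Python sketch below estimates that, assuming decimal units (1 TB = 10^12 bytes) and a sustained link utilization of 80%. Both the `transfer_days` function and the 80% default are illustrative assumptions, not AWS figures.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimate days needed to move data_tb terabytes over a link_gbps link."""
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("expected data_tb >= 0, link_gbps > 0, 0 < utilization <= 1")
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization   # usable bits per second
    return bits / effective_bps / 86_400            # seconds -> days


# 100 TB over a 1 Gbps link at 80% utilization takes roughly 11.6 days.
print(round(transfer_days(100, 1), 1))
```

Under the same assumptions, 1 PB over that link would take about 116 days, which illustrates why an offline transfer can be the more practical option at petabyte scale.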
Reflection Question: How do ingestion pattern choices (real-time streaming vs. batch transfer) impact data latency, cost, and scalability for different use cases, and how do Kinesis and DataSync address these distinct needs?