2.1.2. Real-time Data Ingestion (Kinesis, Kafka)
First Principle: Real-time data ingestion makes continuous data streams available for processing as they arrive, which is essential for applications that need low-latency insights or real-time model updates.
Many modern ML applications require continuous data input for real-time predictions, dashboards, or quick model retraining. Real-time ingestion services handle high volumes of streaming data.
Key Concepts of Real-time Data Ingestion:
- Streaming Data: Data generated continuously by many sources (often thousands), which typically send small records (on the order of kilobytes) at the same time.
- Low Latency: Processing data with minimal delay from its generation to its availability for consumption.
- Scalability: Ability to handle varying and large volumes of incoming data.
- Durability: Records are retained by the stream for a configured period, so data is not lost when consumers fail or fall behind.
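To make the durability idea concrete, here is a minimal in-memory sketch. It is not how Kinesis or Kafka is implemented; `ToyStream`, `read`, and `commit` are hypothetical names used only for illustration. The point it shows is that records stay in the stream after being read, and a consumer advances its checkpoint only after it has finished processing, so a crashed consumer can re-read the same records.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStream:
    """Hypothetical in-memory stand-in for a durable stream (illustration only)."""
    records: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)  # consumer name -> next index to read

    def append(self, record):
        # Records are appended and never removed when they are read.
        self.records.append(record)

    def read(self, consumer, limit=10):
        # Return up to `limit` unread records without moving the checkpoint.
        start = self.checkpoints.get(consumer, 0)
        return self.records[start:start + limit]

    def commit(self, consumer, count):
        # Advance the checkpoint only after processing has succeeded.
        self.checkpoints[consumer] = self.checkpoints.get(consumer, 0) + count


stream = ToyStream()
for i in range(5):
    stream.append({"sensor": "s1", "reading": i})

# First attempt: the consumer reads a batch but "crashes" before committing.
batch = stream.read("dashboard", limit=3)

# After a restart the same records are still available.
replayed = stream.read("dashboard", limit=3)
stream.commit("dashboard", len(replayed))

# Only now does the consumer move on to the remaining records.
remaining = stream.read("dashboard", limit=3)
```

Because reading and committing are separate steps, a failure between them leads to reprocessing rather than data loss, which is why consumers of such streams are usually written to tolerate duplicate records.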
AWS Services for Real-time Data Ingestion:
- Amazon Kinesis: A family of services for working with streaming data.
- Kinesis Data Streams: Captures, processes, and stores large streams of data records. Records are retained for 24 hours by default, and retention can be extended up to 365 days. Suited to custom applications that need direct access to shards.
- Kinesis Firehose (Kinesis Data Firehose, since renamed Amazon Data Firehose): An ETL service that delivers streaming data to destinations. It buffers incoming records and automatically loads them into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, or Splunk, so delivery is near real-time rather than instantaneous. It can transform, compress, and encrypt data before delivery.
- Kinesis Data Analytics: Processes and analyzes streaming data in real time using SQL or Apache Flink. (AWS has since renamed the Flink offering Amazon Managed Service for Apache Flink.)
- Amazon Managed Streaming for Apache Kafka (MSK): A fully managed service for running Apache Kafka clusters on AWS. It is compatible with existing Kafka applications and ecosystem tools and offers high throughput and low latency.
- Producers: Applications or devices that send data to Kinesis or MSK.
- Consumers: Applications that read and process data from the streams (e.g., Lambda functions, EC2 instances, Kinesis Data Analytics).
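The reference to "direct access to shards" is easier to follow with a sketch of how a record reaches a shard. Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below assumes the hash space is split evenly across shards, which is a simplification; `shard_ranges` and `shard_for_key` are hypothetical helper names, not AWS API calls.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized ranges (assumption for this sketch)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the full space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the range that contains the MD5 hash of the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside all shard ranges")


ranges = shard_ranges(4)
first = shard_for_key("device-1", ranges)
second = shard_for_key("device-1", ranges)
```

The same partition key always lands on the same shard, which preserves per-key ordering. The trade-off is that a few very active keys can overload a single shard, so producers usually pick a key with many distinct values, such as a device ID or session ID.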
Scenario: Your IoT devices continuously send sensor data, and your web application generates clickstream events, all of which need to be ingested in real-time for immediate analysis and potential real-time model inference.
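A producer for this scenario might be sketched as follows. The stream name, the event fields, and the `StubKinesisClient` class are hypothetical and exist only so the example runs without AWS credentials. The `put_records` keyword arguments and the response fields (`FailedRecordCount`, and `ErrorCode` on failed entries) follow the Kinesis PutRecords API, which accepts at most 500 records per request.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords limit per request


def to_entry(event, key_field):
    """Convert an event dict into a PutRecords entry keyed by one of its fields."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[key_field]),
    }


def send_events(client, stream_name, events, key_field):
    """Send events in batches and return the entries the service reported as failed."""
    entries = [to_entry(event, key_field) for event in events]
    failed = []
    for i in range(0, len(entries), MAX_RECORDS_PER_CALL):
        batch = entries[i:i + MAX_RECORDS_PER_CALL]
        response = client.put_records(StreamName=stream_name, Records=batch)
        if response.get("FailedRecordCount", 0):
            # Results are returned in the same order as the submitted records.
            for entry, result in zip(batch, response["Records"]):
                if "ErrorCode" in result:
                    failed.append(entry)
    return failed


class StubKinesisClient:
    """Hypothetical stand-in for a real Kinesis client; always reports success."""

    def __init__(self):
        self.calls = []

    def put_records(self, StreamName, Records):
        self.calls.append((StreamName, len(Records)))
        results = [
            {"SequenceNumber": str(n), "ShardId": "shardId-000000000000"}
            for n in range(len(Records))
        ]
        return {"FailedRecordCount": 0, "Records": results}


# Hypothetical sensor readings from three devices.
events = [{"device_id": f"sensor-{i % 3}", "temperature": 20 + i} for i in range(1200)]
client = StubKinesisClient()
failed = send_events(client, "sensor-stream", events, key_field="device_id")
```

Against a real stream, the stub would be replaced by `boto3.client("kinesis")`, which exposes `put_records` with the same keyword arguments. Entries returned in `failed` would normally be retried with backoff, since a partial failure does not raise an exception.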
Reflection Question: How do real-time ingestion services (e.g., Kinesis Data Streams for custom stream processing, Kinesis Firehose for managed delivery to storage) enable immediate processing of continuous data streams for applications that need low-latency insights and real-time model updates?