2.1.2. Amazon Kinesis Data Firehose
💡 First Principle: Firehose is the "load" service for streaming data: it captures, optionally transforms, and delivers data to a destination. If Kinesis Data Streams is the highway, Firehose is the delivery truck that picks up from the highway and drops off at S3, Redshift, OpenSearch, or an HTTP endpoint.
Firehose eliminates the need to write consumer code. You configure a source (direct PUT, Kinesis Data Streams, or MSK), an optional transformation (Lambda function), and a destination, and Firehose handles buffering, batching, compression, encryption, and retry logic automatically.
The critical distinction: Firehose is near-real-time, not real-time. It buffers incoming records and flushes whenever either buffering hint is reached first: buffer size (1–128 MB) or buffer interval (60–900 seconds). The minimum delivery latency is therefore 60 seconds. When an exam question requires sub-second processing, Firehose alone is insufficient; you need KDS or MSK with a custom consumer or Managed Flink.
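The "whichever hint is hit first" rule can be sketched as a simple predicate. This is illustrative only: the function name and defaults below are hypothetical (5 MB / 300 s are common S3-destination defaults), and the real buffering is internal to the service.

```python
# Illustrative sketch of Firehose's buffering-hint rule: a batch is
# delivered when EITHER the accumulated size or the elapsed time hits
# its threshold. Names and defaults are hypothetical, not an AWS API.

def should_flush(buffered_bytes: int, elapsed_seconds: float,
                 size_hint_mb: int = 5, interval_hint_s: int = 300) -> bool:
    """Return True when the buffered batch would be delivered."""
    return (buffered_bytes >= size_hint_mb * 1024 * 1024
            or elapsed_seconds >= interval_hint_s)

print(should_flush(2 * 1024 * 1024, 60))   # 2 MB, 60 s: keep buffering -> False
print(should_flush(6 * 1024 * 1024, 60))   # size hint reached -> True
print(should_flush(1 * 1024 * 1024, 300))  # interval hint reached -> True
```

Note that even an empty buffer flushes once the interval elapses, which is why the interval hint bounds the worst-case delivery delay.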
Transformation with Lambda. Firehose can invoke a Lambda function to transform each batch of records before delivery. Common uses: converting JSON to Parquet, enriching records with lookup data, filtering irrelevant events, or reformatting timestamps. The Lambda function receives a batch of records, processes them, and returns results with a status per record (Ok, Dropped, or ProcessingFailed).
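A minimal sketch of such a transform function, assuming JSON log records: it drops DEBUG-level events, renames a `ts` field to `timestamp`, and marks unparseable records as failed. The `records`/`recordId`/`data`/`result` field names follow the Firehose record-transformation contract; the filtering and renaming logic is an illustrative example, not a required shape.

```python
import base64
import json


def handler(event, context):
    """Firehose record-transformation Lambda (sketch).

    Each incoming record's data is base64-encoded; the function must
    return one result per recordId with a status of Ok, Dropped, or
    ProcessingFailed.
    """
    output = []
    for record in event["records"]:
        data = record.get("data", "")
        try:
            payload = json.loads(base64.b64decode(data))
        except (ValueError, TypeError):
            # Unparseable record: let Firehose route it to error output.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": data})
            continue
        if payload.get("level") == "DEBUG":
            # Filter irrelevant events by marking them Dropped.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": data})
            continue
        # Example enrichment: normalize the timestamp field name.
        if "ts" in payload:
            payload["timestamp"] = payload.pop("ts")
        transformed = base64.b64encode(
            json.dumps(payload).encode("utf-8")).decode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": transformed})
    return {"records": output}
```

Because Firehose matches results to inputs by `recordId`, the function must return exactly one result per incoming record, even for the ones it drops or fails.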
Delivery destinations: S3 (most common), Amazon Redshift (via S3 staging + COPY), Amazon OpenSearch Service, HTTP endpoints (Splunk, Datadog, custom APIs), and third-party destinations. For Redshift delivery, Firehose first writes to S3, then issues a COPY command; understanding this two-step process is exam-relevant.
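The two-step Redshift path is visible directly in the delivery-stream configuration: you supply both an S3 staging location and the COPY parameters. The sketch below follows the shape of the boto3 `create_delivery_stream` Redshift destination; all ARNs, names, and option values are placeholders.

```python
# Sketch of a Firehose -> Redshift destination configuration
# (boto3 create_delivery_stream shape; all values are placeholders).
# Step 1: records are staged in the S3 bucket below.
# Step 2: Firehose issues the COPY command against the cluster.
redshift_destination = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "ClusterJDBCURL": "jdbc:redshift://example-cluster:5439/analytics",
    "Username": "firehose_user",  # credentials (or a secret) are also required
    "CopyCommand": {
        "DataTableName": "app_events",
        "CopyOptions": "FORMAT AS JSON 'auto' GZIP",  # how COPY parses staged files
    },
    "S3Configuration": {  # the staging location, written first
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-staging-bucket",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
}
```

One exam-relevant consequence of this design: the staged objects remain in S3, so Redshift delivery failures can be diagnosed and replayed from the staging bucket.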
Error handling. Failed records can be sent to a backup S3 bucket or prefix, separate from the main delivery path, ensuring no data loss even when the Lambda transform or destination fails.
⚠️ Exam Trap: When a question says "near-real-time delivery to S3" with "minimal operational overhead," Firehose is almost always the answer, not a Lambda consumer writing to S3 (more code to maintain) or a Glue streaming job (more complex). But if the question says "real-time processing with sub-second latency," you need Kinesis Data Streams with a custom consumer or Managed Apache Flink.
Reflection Question: A company wants to stream application logs to S3 in Parquet format for Athena queries. They want the simplest solution with no custom consumer code. What's the architecture?