1.1.1. Why Data Pipelines Exist: The Source-to-Insight Problem
💡 First Principle: Data has no value sitting at its source. Value is created when data reaches the people and systems that can act on it: in the right format, at the right time, with the right guarantees.
Organizations generate data everywhere: transactional databases record purchases, IoT sensors stream temperatures, web applications log every click, and third-party APIs deliver market data. But the systems that create data are rarely the systems that analyze data. A production PostgreSQL database is optimized for fast writes and ACID transactions, not for running complex analytical queries across millions of rows. Running a heavy report against it would slow down the application for every customer.
This is the source-to-insight problem: data is born in one shape, in one place, optimized for one purpose, and it needs to reach a different place, in a different shape, optimized for a different purpose. Data pipelines solve this by decoupling data creation from data consumption.
On the exam, this manifests as scenario questions where you must recognize why a pipeline is needed and which AWS services bridge the gap between source and destination. A question might describe an application writing to DynamoDB and analysts needing aggregate reports; the answer involves DynamoDB Streams, a transformation layer, and an analytics store such as Redshift or Athena over S3.
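To make the transformation layer in that scenario concrete, here is a minimal sketch of the kind of logic a Lambda function consuming DynamoDB Streams might run: it flattens the DynamoDB-typed record images into plain JSON lines that an analytics engine like Athena could query from S3. The table attributes (`order_id`, `amount`) are illustrative assumptions, and the actual S3 write is omitted; this only shows the reshaping step.

```python
# Hypothetical sketch of a DynamoDB Streams transformation step.
# Attribute names and the event shape shown here are assumptions for
# illustration; a real handler would also write the output to S3.
import json

def deserialize(attr):
    """Convert one DynamoDB-typed attribute (e.g. {"N": "42"}) to a plain value."""
    (dtype, value), = attr.items()
    if dtype == "N":
        return float(value)          # DynamoDB numbers arrive as strings
    if dtype == "M":
        return {k: deserialize(v) for k, v in value.items()}
    if dtype == "L":
        return [deserialize(v) for v in value]
    return value                     # S, BOOL, etc. passed through for brevity

def handler(event, context=None):
    """Flatten INSERT/MODIFY stream records into analytics-ready JSON lines."""
    rows = []
    for record in event.get("Records", []):
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            rows.append({k: deserialize(v) for k, v in image.items()})
    # A real pipeline would land these lines in S3 so Athena or Redshift
    # Spectrum can query them; here we just return the payload.
    return "\n".join(json.dumps(r) for r in rows)
```

The point for the exam is not the code itself but the shape of the flow: the transactional store keeps serving writes while the stream feeds a separate, query-optimized copy of the data.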
⚠️ Exam Trap: Don't confuse the need for a pipeline with the need for a more powerful database. If a question describes slow analytics queries on a transactional database, the answer is almost never "upgrade the database"; it's "build a pipeline to an analytics store."
Reflection Question: If data is being generated in Amazon RDS and your analytics team needs daily aggregate reports, what's wrong with just running the reports directly against RDS?