1.1. The Observability Problem: Why You Can't Manage What You Can't See
💡 First Principle: You cannot reliably operate a system you cannot inspect. Before AWS existed, operators SSHed into servers to check if things were running. That worked for ten servers. It doesn't work for ten thousand. Observability is the discipline of instrumenting systems so their internal state can be inferred from external outputs, without touching the system directly.
Imagine running a fleet of 500 EC2 instances spread across three regions. If one instance starts throwing errors at 2 AM, how do you know? If CPU spikes on 20 instances simultaneously, is it an attack, a deployment gone wrong, or legitimate traffic? Without observability, your only option is reactive firefighting: you discover problems when customers complain.
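The 2 AM scenario is exactly what a CloudWatch alarm automates: the system watches a metric and pages you, instead of you watching the system. Below is a minimal sketch of defining such an alarm with boto3 (the AWS SDK for Python); the alarm name, metric, thresholds, and SNS topic ARN are all illustrative assumptions, not values from this text.

```python
# Hypothetical alarm: notify an on-call SNS topic when a load balancer's
# 5XX error count stays elevated. All names and numbers are illustrative.
alarm = {
    "AlarmName": "high-5xx-errors-example",
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "HTTPCode_Target_5XX_Count",
    "Statistic": "Sum",
    "Period": 60,            # evaluate the metric in 60-second windows
    "EvaluationPeriods": 5,  # require 5 consecutive breaching windows
    "Threshold": 10,         # more than 10 errors per window
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
}

def create_alarm(params, client=None):
    """Create or update the alarm. Needs AWS credentials when run for real;
    a client can be injected for testing."""
    if client is None:
        import boto3  # only required when actually calling AWS
        client = boto3.client("cloudwatch")
    return client.put_metric_alarm(**params)
```

The key design point is the `EvaluationPeriods` setting: requiring several consecutive breaches filters out one-off blips, so the alarm fires on sustained problems rather than noise.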
The three pillars of observability map directly to AWS services:
| Pillar | What It Captures | Primary AWS Service |
|---|---|---|
| Metrics | Numerical measurements over time (CPU %, request count, latency) | Amazon CloudWatch |
| Logs | Timestamped text records of discrete events | CloudWatch Logs, CloudTrail |
| Traces | End-to-end request journeys across services | AWS X-Ray |
Each pillar answers a different question. Metrics tell you something is wrong (latency spiked). Logs tell you what happened (specific error messages). Traces tell you where it broke in a distributed system (which microservice).
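The metrics-versus-logs split above can be sketched in instrumentation code: for each request, emit a numeric datum (shaped like CloudWatch `put_metric_data` input) and a structured log line (suitable for CloudWatch Logs). The function name and field names are illustrative assumptions, not an official API.

```python
import json
import time

def record_request(status_code, latency_ms):
    """Hypothetical per-request instrumentation: the metric answers
    'is something wrong?', the log line answers 'what happened?'."""
    # Pillar 1: a metric datum, shaped like CloudWatch put_metric_data input
    metric = {
        "MetricName": "RequestLatency",
        "Value": latency_ms,
        "Unit": "Milliseconds",
        "Timestamp": time.time(),
    }
    # Pillar 2: a structured (JSON) log event for CloudWatch Logs
    log_line = json.dumps({
        "level": "ERROR" if status_code >= 500 else "INFO",
        "status": status_code,
        "latency_ms": latency_ms,
    })
    return metric, log_line
```

Note what each artifact can and cannot answer: the metric aggregates cleanly over thousands of instances but carries no detail, while the log line carries the detail (status code, severity) you need once a metric has told you to look.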
⚠️ Exam Trap: The exam will ask you to pick the right observability tool for a scenario. Remember: CloudWatch metrics for numbers over time, CloudWatch Logs for event text, X-Ray for cross-service request tracing. A question about "identifying which microservice is causing latency" = X-Ray, not CloudWatch.
Reflection Question: If an application's error rate doubles but response time stays the same, which pillar would you consult first, and why?