3.2.2.2. Common CloudWatch Metrics and Logs (EC2 CPU, RDS Queue, ALB 5xx)
First Principle: Gaining actionable insight into system health and performance enables rapid identification of operational issues and informed troubleshooting.
Amazon CloudWatch provides the essential metrics and logs to quickly assess core AWS resources. Understanding common metrics is crucial for maintaining robust and responsive applications.
Key metrics to monitor include:
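- EC2 CPU Utilization (CPUUtilization, namespace AWS/EC2): Percentage of the instance's allocated compute capacity currently in use.
  - Interpretation: Sustained high values (for example, above 80%) suggest resource contention or inefficient code; persistently low values suggest over-provisioning.
  - Practical Relevance: Alarm on sustained high CPU, not on brief spikes, and use the alarm to trigger scaling or an investigation.
- RDS Queue Depth (DiskQueueDepth, namespace AWS/RDS): Number of outstanding read/write I/O requests waiting to access the disk.
  - Interpretation: A growing depth signifies a storage backlog, typically caused by slow or unindexed queries, insufficient I/O capacity, or a traffic surge.
  - Practical Relevance: Monitor for spikes to diagnose database performance problems and storage I/O bottlenecks.
- ALB 5xx Errors (HTTPCode_Target_5XX_Count and HTTPCode_ELB_5XX_Count, namespace AWS/ApplicationELB): Server-side errors returned by the targets or generated by the load balancer itself (for example, 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable).
  - Interpretation: A sustained non-zero count or a sudden increase points to failures in the application or its backends. Comparing the target count with the load balancer count shows which side is producing the errors.
  - Practical Relevance: Alarms are vital for immediate detection of outages and misconfigurations.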
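The alarm guidance above can be sketched as plain data. The following is a minimal sketch, not a definitive configuration: it builds dictionaries whose keys follow the CloudWatch PutMetricAlarm request parameters, and it assumes that "RDS Queue Depth" refers to the DiskQueueDepth metric. The helper name `build_alarm`, the resource identifiers, and every threshold other than the 80% CPU figure from the text are hypothetical placeholders.

```python
def build_alarm(name, namespace, metric, dimension, statistic,
                threshold, period=60, periods=3, missing="missing"):
    """Return keyword arguments shaped like a PutMetricAlarm request."""
    dim_name, dim_value = dimension
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dim_name, "Value": dim_value}],
        "Statistic": statistic,
        "Period": period,              # seconds per datapoint
        "EvaluationPeriods": periods,  # consecutive breaching datapoints required
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": missing,
    }


# All identifiers below are hypothetical examples.
ALARMS = [
    # Sustained high CPU: 3 x 5-minute averages above 80%.
    build_alarm("ec2-cpu-high", "AWS/EC2", "CPUUtilization",
                ("InstanceId", "i-0123456789abcdef0"),
                "Average", 80.0, period=300),
    # Growing storage backlog; the threshold of 10 is an assumed value.
    build_alarm("rds-disk-queue-high", "AWS/RDS", "DiskQueueDepth",
                ("DBInstanceIdentifier", "example-db"),
                "Average", 10.0),
    # Target 5xx count per minute; no datapoints is treated as healthy here.
    build_alarm("alb-target-5xx", "AWS/ApplicationELB",
                "HTTPCode_Target_5XX_Count",
                ("LoadBalancer", "app/example-alb/0123456789abcdef"),
                "Sum", 5.0, periods=1, missing="notBreaching"),
]

for alarm in ALARMS:
    print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

With boto3 installed and credentials configured, each dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. A real alarm would normally also set `AlarmActions` (for example, an SNS topic) so that someone is notified.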
Key Common CloudWatch Metrics & Their Interpretation:
- EC2 CPU Utilization (CPUUtilization): Server load.
- RDS Queue Depth (DiskQueueDepth): Database storage I/O backlog.
- ALB 5xx Errors (HTTPCode_Target_5XX_Count, HTTPCode_ELB_5XX_Count): Application/backend failures.
Scenario: A DevOps team manages a web application behind an Application Load Balancer (ALB) with an Amazon RDS backend. Users are reporting occasional "service unavailable" errors, and the team needs to quickly identify the source of the problem.
Reflection Question: How would you use common CloudWatch metrics like ALB 5xx Errors, EC2 CPU Utilization, and RDS Queue Depth to gain actionable insights and rapidly pinpoint the operational issue?
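One way to reason through the scenario is to start from the ALB 5xx counts and then work backward through the tiers. The sketch below is only an illustration of that order of checks: the `Snapshot` type, the `triage` function, and the thresholds are hypothetical, and it assumes the ALB publishes separate counts for errors returned by targets and errors generated by the load balancer itself.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metric values for one time window (hypothetical structure)."""
    target_5xx: int          # 5xx responses returned by the targets
    elb_5xx: int             # 5xx responses generated by the load balancer
    cpu_percent: float       # average EC2 CPU utilization
    disk_queue_depth: float  # average RDS disk queue depth


def triage(s, cpu_limit=80.0, queue_limit=10.0):
    """Return a list of findings, ordered from the edge toward the database."""
    if s.target_5xx == 0 and s.elb_5xx == 0:
        return ["No 5xx responses in this window: widen the time range."]
    findings = []
    if s.elb_5xx > 0 and s.target_5xx == 0:
        findings.append("Load balancer is generating the 5xx: check target health and capacity.")
    if s.target_5xx > 0:
        findings.append("Targets are returning 5xx: check application logs.")
    if s.cpu_percent > cpu_limit:
        findings.append("EC2 CPU is saturated: consider scaling out.")
    if s.disk_queue_depth > queue_limit:
        findings.append("RDS I/O is backing up: check slow queries and storage throughput.")
    return findings


for line in triage(Snapshot(target_5xx=0, elb_5xx=12,
                            cpu_percent=35.0, disk_queue_depth=42.0)):
    print(line)
```

In this example the errors come from the load balancer while CPU is moderate and the disk queue is deep, which would point the investigation toward the database tier instead of toward adding more web servers.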
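These metrics are published to CloudWatch by the services themselves and are complemented by underlying logs (e.g., EC2 system logs, RDS Enhanced Monitoring logs, ALB access logs), which provide the granular detail needed for deeper analysis once a metric points to a problem.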
Tip: Consider how a sudden, significant change in any of these metrics (e.g., a sharp increase in ALB 5xx errors or RDS Queue Depth) would immediately signal a potential incident requiring urgent investigation.
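To make "sudden, significant change" concrete, the sketch below flags a datapoint that sits well above its recent baseline. It is a simple statistical check under stated assumptions, not a CloudWatch feature: the function name `is_spike`, the three-sigma rule, and the `min_delta` floor are all illustrative choices, and `history` must contain at least one datapoint.

```python
from statistics import mean, pstdev


def is_spike(history, latest, sigmas=3.0, min_delta=1.0):
    """Return True if `latest` exceeds the baseline by a clear margin.

    history   -- recent datapoints for one metric (non-empty)
    latest    -- the newest datapoint
    sigmas    -- how many standard deviations count as significant
    min_delta -- absolute floor, so a perfectly flat baseline does not
                 turn every tiny increase into a spike
    """
    baseline = mean(history)
    spread = pstdev(history)
    return latest - baseline > max(sigmas * spread, min_delta)


recent_5xx = [0, 1, 0, 2, 1]      # per-minute ALB 5xx counts (made-up data)
print(is_spike(recent_5xx, 40))   # a sharp jump over the baseline
print(is_spike(recent_5xx, 2))    # within normal variation
```

A fixed threshold is easier to explain, while a baseline-relative check like this adapts to metrics whose normal level differs between workloads. Which one fits depends on how stable the metric usually is.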