3.3.3.1. AWS Metrics and Logging Services for Troubleshooting (CloudWatch, X-Ray)
When something breaks in production, the first five minutes of troubleshooting often determine whether you resolve the issue in 15 minutes or in 4 hours. Having the right metrics and logs configured in advance makes the difference.
Troubleshooting workflow:
- Detect: CloudWatch alarm fires or user reports issue
- Assess scope: Check CloudWatch dashboard — is it one instance or fleet-wide?
- Identify pattern: Check metrics (error rate, latency, CPU) for the onset time
- Find root cause: Query logs (Logs Insights) and traces (X-Ray) from that time window
- Resolve: Apply fix and verify metrics return to normal
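The "identify pattern" and "find root cause" steps above can be sketched in code. This is a minimal example, assuming you have already read an onset time off the metrics graph; the log group name, the window size, and the helper name `build_logs_insights_query` are hypothetical. It only builds the parameters for the CloudWatch Logs `start_query` call, so it runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def build_logs_insights_query(log_group, onset, window_minutes=15, pattern="ERROR"):
    """Build keyword arguments for the CloudWatch Logs start_query call.

    The time window is centred on the onset time seen in the metrics.
    """
    half = timedelta(minutes=window_minutes / 2)
    query = (
        "fields @timestamp, @logStream, @message\n"
        f"| filter @message like /{pattern}/\n"
        "| sort @timestamp asc\n"
        "| limit 50"
    )
    return {
        "logGroupName": log_group,
        "startTime": int((onset - half).timestamp()),  # epoch seconds
        "endTime": int((onset + half).timestamp()),    # epoch seconds
        "queryString": query,
    }


# Hypothetical log group and onset time, for illustration only.
onset = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)
params = build_logs_insights_query("/app/orders-service", onset)
print(params["queryString"])

# With credentials configured, the query could then be submitted like this:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until status is "Complete"
```

Keeping the window narrow around the onset time limits the amount of log data scanned, which is what Logs Insights charges for.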
X-Ray for distributed tracing: The X-Ray SDK instruments your application so that X-Ray can capture traces across microservices. Each trace shows the full request path, with the latency of each service hop.
# X-Ray SDK instrumentation (Python)
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all

patch_all()  # Auto-instruments supported libraries (boto3, requests, sqlite3, etc.)

@xray_recorder.capture('process_order')
def process_order(order_id):
    # X-Ray records this function as a subsegment with timing.
    # query_database and call_payment_api stand in for your own application code.
    db_result = query_database(order_id)
    api_result = call_payment_api(order_id)
    return db_result, api_result
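The X-Ray service map visualizes the services in your architecture. Each node is color-coded by health (green for successful calls, yellow for client errors, red for server faults, purple for throttling) and shows average latency and error rates for that service. This usually shows at a glance which service is causing the problem.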
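The same per-service numbers can be ranked programmatically. The sketch below assumes the data has the shape of the `Services` list returned by the X-Ray `GetServiceGraph` API (a `Name` plus `SummaryStatistics` with `TotalCount`, `TotalResponseTime`, and `FaultStatistics`), as best understood here; check the field names against the API reference before relying on them. The sample data and the helper name `rank_services_by_fault_rate` are made up for illustration.

```python
def rank_services_by_fault_rate(services):
    """Return (name, fault_rate, avg_response_time) tuples, worst fault rate first."""
    ranked = []
    for svc in services:
        stats = svc.get("SummaryStatistics", {})
        total = stats.get("TotalCount", 0)
        if total == 0:
            continue  # no traffic in the window, nothing to rank
        faults = stats.get("FaultStatistics", {}).get("TotalCount", 0)
        avg_response_time = stats.get("TotalResponseTime", 0.0) / total
        ranked.append((svc["Name"], faults / total, avg_response_time))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Made-up sample data in the assumed shape of the "Services" list.
sample_services = [
    {"Name": "orders-api", "SummaryStatistics": {
        "TotalCount": 1000, "TotalResponseTime": 120.0,
        "FaultStatistics": {"TotalCount": 5}}},
    {"Name": "payment-service", "SummaryStatistics": {
        "TotalCount": 400, "TotalResponseTime": 900.0,
        "FaultStatistics": {"TotalCount": 80}}},
]

ranked = rank_services_by_fault_rate(sample_services)
for name, fault_rate, avg_response_time in ranked:
    print(f"{name}: fault rate {fault_rate:.1%}, average response time {avg_response_time:.2f}")
```

With real data, the input would come from something like `boto3.client("xray").get_service_graph(StartTime=..., EndTime=...)["Services"]`.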
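CloudWatch Contributor Insights analyzes log data to identify the top contributors to a metric, for example the top 10 IP addresses generating errors, the top 5 API endpoints by latency, or the top 3 Lambda functions consuming the most concurrency. The ranking is only possible when the relevant fields appear in the logs that the rule analyzes.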
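A Contributor Insights rule is defined as a JSON document. The sketch below builds one for the "top IP addresses generating errors" case, assuming JSON-formatted access logs with `ip` and `status` fields; the log group name, field names, rule name, and the helper `top_error_ips_rule` are all hypothetical.

```python
import json


def top_error_ips_rule(log_group, ip_field="$.ip", status_field="$.status"):
    """Build a Contributor Insights rule that counts 5xx responses per client IP."""
    return {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group],
        "LogFormat": "JSON",
        "Contribution": {
            "Keys": [ip_field],  # the field whose values are ranked
            "Filters": [{"Match": status_field, "GreaterThan": 499}],
        },
        "AggregateOn": "Count",  # rank by number of matching log events
    }


# Hypothetical log group, for illustration only.
rule = top_error_ips_rule("/app/api-access-logs")
rule_json = json.dumps(rule, indent=2)
print(rule_json)

# With credentials configured, the rule could be created like this:
#   import boto3
#   boto3.client("cloudwatch").put_insight_rule(
#       RuleName="top-error-ips", RuleState="ENABLED", RuleDefinition=rule_json)
```

Changing `Keys` to an endpoint field and aggregating on a latency field with `"AggregateOn": "Sum"` and a `ValueOf` entry would give the "top endpoints by latency" variant.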
Exam Trap: X-Ray uses sampling to control costs. By default, it records the first request each second and 5% of additional requests. If you're investigating an intermittent error, the failing request may not have been sampled. Increase the sampling rate temporarily during the investigation so that more requests are recorded. X-Ray's filter expressions can then search the recorded traces for specific error codes, but they cannot surface a request that was never sampled.
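The arithmetic behind this trap is worth working through once. The sketch below estimates how many requests the default rule records per second and roughly how likely any single request is to be traced, assuming sampling is spread evenly over the traffic (a simplification). It also builds a higher-rate rule in the shape expected by the X-Ray `CreateSamplingRule` API; the service name, rate, and helper names are hypothetical.

```python
def expected_sampled_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Expected traces per second: the reservoir first, then a fixed share of the rest."""
    guaranteed = min(requests_per_second, reservoir)
    extra = max(requests_per_second - reservoir, 0) * fixed_rate
    return guaranteed + extra


def approx_chance_request_sampled(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Rough chance that one given request was traced, assuming even sampling."""
    if requests_per_second <= 0:
        return 0.0
    sampled = expected_sampled_per_second(requests_per_second, reservoir, fixed_rate)
    return min(sampled / requests_per_second, 1.0)


def temporary_sampling_rule(service_name, fixed_rate=0.5, reservoir=10):
    """Build a higher-rate rule to use while investigating; delete it afterwards."""
    return {
        "RuleName": f"debug-{service_name}",
        "Priority": 1,  # lower numbers are evaluated first
        "FixedRate": fixed_rate,
        "ReservoirSize": reservoir,
        "ServiceName": service_name,
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }


for rps in (1, 10, 100):
    chance = approx_chance_request_sampled(rps)
    print(f"{rps} req/s: about {chance:.1%} of requests traced under the default rule")

# Hypothetical service name, for illustration only.
debug_rule = temporary_sampling_rule("orders-api")
# With credentials configured:
#   import boto3
#   boto3.client("xray").create_sampling_rule(SamplingRule=debug_rule)
```

At high request rates the share of traced requests approaches the 5% fixed rate, which is why a rare failure on a busy service is easy to miss.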
