Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.3.1. AWS Metrics and Logging Services for Troubleshooting (CloudWatch, X-Ray)

When something breaks in production, the first 5 minutes of troubleshooting often determine whether you resolve the incident in 15 minutes or 4 hours. Having the right metrics and logs configured in advance makes the difference.

Troubleshooting workflow:
  1. Detect: CloudWatch alarm fires or user reports issue
  2. Assess scope: Check CloudWatch dashboard — is it one instance or fleet-wide?
  3. Identify pattern: Check metrics (error rate, latency, CPU) for the onset time
  4. Find root cause: Query logs (Logs Insights) and traces (X-Ray) from that time window
  5. Resolve: Apply fix and verify metrics return to normal
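Steps 3 and 4 of this workflow can be sketched in a few lines: find the first datapoint where the error rate crosses a threshold, then build a Logs Insights request for a window around that onset. This is a minimal sketch under stated assumptions: the helper names (`find_onset`, `build_insights_request`), the log group `/app/orders`, the 5% threshold, and the sample datapoints are all hypothetical. The returned dict is shaped to match the keyword arguments of boto3's CloudWatch Logs `start_query` call, which is not invoked here.

```python
from datetime import datetime, timedelta, timezone

def find_onset(datapoints, threshold):
    """Return the timestamp of the first datapoint whose value exceeds threshold.

    datapoints: iterable of (datetime, float) pairs, in any order.
    Returns None if the metric never crosses the threshold.
    """
    for ts, value in sorted(datapoints):
        if value > threshold:
            return ts
    return None

def build_insights_request(log_group, onset, minutes_before=5, minutes_after=15):
    """Build keyword arguments for logs.start_query around the onset time."""
    start = onset - timedelta(minutes=minutes_before)
    end = onset + timedelta(minutes=minutes_after)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": (
            "fields @timestamp, @message "
            "| filter @message like /ERROR/ "
            "| sort @timestamp asc "
            "| limit 50"
        ),
    }

# Hypothetical error-rate datapoints (percent), one per minute.
t0 = datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc)
error_rate = [
    (t0 + timedelta(minutes=i), v)
    for i, v in enumerate([0.2, 0.3, 0.2, 7.5, 9.1])
]

onset = find_onset(error_rate, threshold=5.0)
request = build_insights_request("/app/orders", onset)
# With AWS credentials configured, this could be submitted as:
#   boto3.client("logs").start_query(**request)
```

Starting the window a few minutes before the onset is a deliberate choice: the log lines that explain a failure often precede the first visible spike in the metric.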

X-Ray for distributed tracing: The X-Ray SDK instruments your application to capture traces as requests cross microservices. Each trace shows the full request path, with the latency at each service hop.

# X-Ray SDK instrumentation (Python)
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # Auto-instruments supported libraries such as boto3, requests, and sqlite3

@xray_recorder.capture('process_order')
def process_order(order_id):
    # Recorded as a subsegment with timing. A subsegment needs an active segment,
    # which Lambda or the SDK's web framework middleware creates for you.
    # query_database and call_payment_api are placeholders for your own functions.
    db_result = query_database(order_id)
    api_result = call_payment_api(order_id)
    return db_result, api_result

The X-Ray service map visualizes the services in your architecture, with health status (green/yellow/red), average latency, and error rates for each service. This usually shows at a glance which service is causing the problem.

CloudWatch Contributor Insights analyzes log data to identify the top contributors to a pattern: for example, the top 10 IP addresses generating errors, the top 5 API endpoints by latency, or the top 3 Lambda functions consuming the most concurrency.
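As an illustration of the "top IP addresses generating errors" case, the sketch below builds a Contributor Insights rule definition for JSON-formatted logs. It rests on assumptions: the log group name and the field paths `$.clientIp` and `$.status` are hypothetical and depend on your log format, and the rule body follows my understanding of the Contributor Insights rule syntax, so check it against the current documentation before relying on it.

```python
import json

def build_error_ip_rule(log_group, ip_field="$.clientIp", status_field="$.status"):
    """Return a rule definition (JSON string) that ranks client IPs by 5xx count.

    ip_field and status_field are hypothetical JSON paths; adjust them to
    match the structure of your own log events.
    """
    rule = {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group],
        "LogFormat": "JSON",
        "Contribution": {
            # Each distinct value of this key is one contributor.
            "Keys": [ip_field],
            # Only count log events with a 5xx status code.
            "Filters": [{"Match": status_field, "GreaterThan": 499}],
        },
        "AggregateOn": "Count",
    }
    return json.dumps(rule)

rule_definition = build_error_ip_rule("/app/api-access-logs")
# With AWS credentials configured, the rule could be created with:
#   boto3.client("cloudwatch").put_insight_rule(
#       RuleName="top-error-ips",
#       RuleState="ENABLED",
#       RuleDefinition=rule_definition,
#   )
```

Keeping the rule as a plain dict until the last step makes it easy to review or version-control before it is submitted.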

Exam Trap: X-Ray uses sampling to control costs. By default, it records the first request each second and 5% of additional requests. If you're investigating an intermittent error, the failing request may not have been sampled. Temporarily increase the sampling rate during the investigation, then use X-Ray filter expressions to search the recorded traces for specific error codes. Filter expressions can only match traces that were actually sampled.
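To see why a failing request can be missed, it helps to work through the arithmetic of the default rule, and then look at what a temporary higher-rate rule might contain. The following is a rough sketch, not a definitive configuration: the function names, the service name `order-service`, and the chosen rate, reservoir, and priority are illustrative assumptions. The dicts are shaped to match boto3's X-Ray `create_sampling_rule` and `get_trace_summaries` parameters, and neither call is made here.

```python
def expected_sampled_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Approximate traces recorded per second under reservoir plus fixed-rate sampling."""
    guaranteed = min(requests_per_second, reservoir)
    extra = max(requests_per_second - reservoir, 0) * fixed_rate
    return guaranteed + extra

def build_investigation_rule(service_name, fixed_rate=0.5, reservoir=10):
    """Keyword arguments for xray.create_sampling_rule (values are illustrative)."""
    return {
        "SamplingRule": {
            "RuleName": f"investigate-{service_name}",
            "ResourceARN": "*",
            "Priority": 100,          # lower number is evaluated first
            "FixedRate": fixed_rate,  # share of requests beyond the reservoir
            "ReservoirSize": reservoir,
            "ServiceName": service_name,
            "ServiceType": "*",
            "Host": "*",
            "HTTPMethod": "*",
            "URLPath": "*",
            "Version": 1,
        }
    }

# At 100 requests/second with the defaults: 1 + 99 * 0.05 = 5.95 traces/second,
# so only about 6% of requests are recorded.
default_traces = expected_sampled_per_second(100)
default_share = default_traces / 100

rule_kwargs = build_investigation_rule("order-service")

# Filter expression to find recorded traces with a specific status code,
# usable as FilterExpression in xray.get_trace_summaries.
STATUS_FILTER = "http.status = 500"
```

If a rule like this is created for an investigation, delete it afterwards so that trace volume and cost return to their normal level.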

Written by Alvin Varughese • Founder • 15 professional certifications