2.2.3. AWS X-Ray and Distributed Tracing
💡 First Principle: In a distributed system, a single user request touches multiple services: an API gateway, a Lambda function, a DynamoDB table, an SQS queue. When that request fails or runs slowly, you need to follow it across all those hops to find the culprit. X-Ray does this by assigning each request a unique trace ID, propagating that ID to downstream calls in the X-Amzn-Trace-Id header, and collecting timing data at each service boundary.
Without distributed tracing, debugging latency in a microservices architecture is like diagnosing a headache by asking "does your whole body hurt?" The answer is too broad to be useful. X-Ray narrows the question to "which exact service added 3 seconds to this request?"
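To make the trace ID idea concrete, here is a minimal Python sketch of how one could be created and handed from one service to the next. It assumes the documented `X-Amzn-Trace-Id` header layout (`Root=...;Parent=...;Sampled=...`) and the trace ID format `1-<8 hex digits of epoch seconds>-<24 random hex digits>`. The helper functions are hypothetical stand-ins for what the X-Ray SDK and AWS services do for you.

```python
"""Sketch: create a trace ID and carry it between two services.

Assumption: mimics the X-Amzn-Trace-Id header format; these helpers are
hypothetical and are not the X-Ray SDK's own code.
"""
import os
import time


def new_trace_id(now=None):
    """Version 1, 8 hex digits of epoch seconds, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """A 64-bit segment ID rendered as 16 hex digits."""
    return os.urandom(8).hex()


def build_trace_header(root, parent, sampled):
    """Serialize the fields into an X-Amzn-Trace-Id header value."""
    return f"Root={root};Parent={parent};Sampled={1 if sampled else 0}"


def parse_trace_header(value):
    """Parse 'Key=Value;Key=Value' pairs into a dict."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


# Service A receives a request without a header, so it starts a new trace...
root = new_trace_id()
a_segment = new_segment_id()
outgoing = {"X-Amzn-Trace-Id": build_trace_header(root, a_segment, sampled=True)}

# ...and service B keeps the same Root, recording A's segment as its parent.
incoming = parse_trace_header(outgoing["X-Amzn-Trace-Id"])
b_segment = new_segment_id()
print(incoming["Root"] == root, incoming["Parent"] == a_segment)
```

Because every hop reuses the same `Root`, X-Ray can later stitch the segments from A and B into one trace.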
X-Ray Concepts:
| Concept | What It Is |
|---|---|
| Trace | The complete record of a request's journey from start to finish |
| Segment | One service's portion of the trace (e.g., the Lambda function's processing time) |
| Subsegment | A breakdown within a segment (e.g., time spent on a DynamoDB call within Lambda) |
| Service Map | Visual graph of how services connect, showing each node's latency and error/fault rates |
| Annotation | Key-value pair indexed for filtering (e.g., user_id, order_id) |
| Metadata | Key-value pair for additional context (not indexed, not searchable) |
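The concepts in the table map onto fields of a segment document, the JSON record X-Ray stores for each service's part of a trace. The sketch below builds one by hand: a hypothetical `order-service` segment containing a DynamoDB subsegment, with two annotations and one metadata entry. Field names follow the segment document format as assumed here; the IDs, timings, and helper functions are illustrative, and in practice the SDK fills these in.

```python
"""Sketch: one segment document with a subsegment, annotations, and metadata.

Service names, IDs, and timings are hypothetical illustration values.
"""
import json
import re

# Annotation keys must be alphanumeric or underscore to be usable in filters.
_ANNOTATION_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def make_subsegment(name, seg_id, start, end, namespace=None):
    sub = {"name": name, "id": seg_id, "start_time": start, "end_time": end}
    if namespace:
        sub["namespace"] = namespace  # "aws" marks a call to an AWS service
    return sub


def make_segment(name, seg_id, trace_id, start, end, annotations=None,
                 metadata=None, subsegments=()):
    annotations = annotations or {}
    for key, value in annotations.items():
        if not _ANNOTATION_KEY.match(key):
            raise ValueError(f"annotation key {key!r} is not filterable")
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError("annotation values must be string, number, or boolean")
    return {
        "name": name,
        "id": seg_id,
        "trace_id": trace_id,
        "start_time": start,
        "end_time": end,
        "annotations": annotations,               # indexed: usable in filter expressions
        "metadata": {"default": metadata or {}},  # not indexed: any JSON value, by namespace
        "subsegments": list(subsegments),
    }


ddb = make_subsegment("DynamoDB", "70de5b6f19ff9a0a", 1700000000.10, 1700000000.85, "aws")
segment = make_segment(
    "order-service", "6b55dcc497934f1a", "1-6553f100-a006649127e371903a2de979",
    1700000000.00, 1700000001.00,
    annotations={"customer_id": "C-1042", "order_total": 59.90},
    metadata={"request_body": {"items": 3, "coupon": None}},
    subsegments=[ddb],
)

# Time the service spent on its own work, outside any downstream call.
own_time = (segment["end_time"] - segment["start_time"]) - sum(
    s["end_time"] - s["start_time"] for s in segment["subsegments"])
print(json.dumps(segment, indent=2))
print(f"time outside subsegments: {own_time:.2f}s")
```

Subtracting subsegment durations from the segment duration is the same reasoning the X-Ray console uses visually: most of this request's second was spent waiting on DynamoDB.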
Sampling: X-Ray doesn't record 100% of requests by default (that would be prohibitively expensive at scale). The default sampling rule records the first request per second per host and 5% of additional requests. You can configure custom sampling rules to increase coverage for specific paths or decrease it for noisy, low-value requests.
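The reservoir-plus-rate behavior can be illustrated with a small simulation. This is a simplified sketch, not the SDK's implementation: the `Rule` and `Sampler` classes are hypothetical, matching looks only at the URL path, and the rule fields borrow the names `fixed_target` (per-second reservoir) and `rate` from the SDK's local sampling-rules format.

```python
"""Simplified sketch of X-Ray-style sampling: per-second reservoir plus fixed rate.

Hypothetical classes for illustration; real rules also match service name,
host, HTTP method, and more.
"""
import fnmatch
import random
import time


class Rule:
    def __init__(self, url_path, fixed_target, rate):
        self.url_path = url_path          # glob pattern, e.g. "/health*"
        self.fixed_target = fixed_target  # requests always sampled per second (reservoir)
        self.rate = rate                  # fraction sampled once the reservoir is used up
        self._second = None
        self._used = 0

    def should_sample(self, now, rng):
        second = int(now)
        if second != self._second:        # a new second refills the reservoir
            self._second, self._used = second, 0
        if self._used < self.fixed_target:
            self._used += 1
            return True
        return rng.random() < self.rate


class Sampler:
    def __init__(self, rules, default):
        self.rules = rules      # checked in order; the first match wins
        self.default = default  # mirrors the default rule: 1 per second + 5%

    def should_sample(self, url_path, now=None, rng=random):
        now = time.time() if now is None else now
        for rule in self.rules:
            if fnmatch.fnmatchcase(url_path, rule.url_path):
                return rule.should_sample(now, rng)
        return self.default.should_sample(now, rng)


sampler = Sampler(
    rules=[
        Rule("/health*", fixed_target=0, rate=0.0),     # never trace health checks
        Rule("/checkout/*", fixed_target=5, rate=0.5),  # trace checkout heavily
    ],
    default=Rule("*", fixed_target=1, rate=0.05),
)

rng = random.Random(42)
decisions = [sampler.should_sample("/products/7", now=1000.0, rng=rng) for _ in range(1000)]
print(sum(decisions), "of 1000 requests sampled in one second")
```

With the default rule, a burst of 1,000 requests in one second yields the first request plus roughly 5% of the remaining 999, which is why low-traffic endpoints still show up in traces while high-traffic ones stay affordable.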
X-Ray Daemon: On EC2, you run the X-Ray daemon yourself as a local UDP proxy. Your application's X-Ray SDK sends trace data to localhost:2000 over UDP, and the daemon batches it and forwards it to the X-Ray service. On Lambda, the daemon runs automatically; you only need to enable Active Tracing in the function's configuration.
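Under the hood, the SDK-to-daemon hop is just UDP datagrams. The sketch below assumes each datagram is a JSON header line (`{"format": "json", "version": 1}`) followed by one segment document, and that the daemon address comes from the `AWS_XRAY_DAEMON_ADDRESS` environment variable in its simple `host:port` form, defaulting to `127.0.0.1:2000`. To stay runnable without a real daemon, the example sends to a local UDP socket standing in for it; the segment itself is a hypothetical minimal example.

```python
"""Sketch: what an instrumented app on EC2 sends to the X-Ray daemon over UDP."""
import json
import os
import socket
import time

HEADER = {"format": "json", "version": 1}


def daemon_address():
    # Handles only the simple "host:port" form of AWS_XRAY_DAEMON_ADDRESS.
    host, _, port = os.environ.get("AWS_XRAY_DAEMON_ADDRESS", "127.0.0.1:2000").rpartition(":")
    return host, int(port)


def encode_datagram(segment):
    # The header and the segment document are separated by a newline.
    return (json.dumps(HEADER) + "\n" + json.dumps(segment)).encode("utf-8")


def send_segment(segment, address=None):
    # UDP is fire-and-forget: the app never waits on the X-Ray service itself.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_datagram(segment), address or daemon_address())


now = time.time()
segment = {
    "name": "checkout-api",  # hypothetical service name
    "id": os.urandom(8).hex(),
    "trace_id": f"1-{int(now):08x}-{os.urandom(12).hex()}",
    "start_time": now,
    "end_time": now + 0.120,
}

# Stand-in for the daemon: a local UDP socket on an OS-assigned port.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as fake_daemon:
    fake_daemon.bind(("127.0.0.1", 0))
    fake_daemon.settimeout(2.0)
    send_segment(segment, fake_daemon.getsockname())
    payload, _ = fake_daemon.recvfrom(65535)
    header_line, body = payload.decode("utf-8").split("\n", 1)
    print(header_line, json.loads(body)["name"])
```

Because UDP is connectionless, a missing or slow daemon does not block the request path; the trade-off is that segments can be dropped silently.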
Integration: X-Ray integrates natively with Lambda, API Gateway, ECS, EKS, EC2 (via the SDK and daemon), Elastic Beanstalk, and calls made through instrumented AWS SDK clients. For ECS, you run the X-Ray daemon as a sidecar container in the task definition.
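A sidecar setup might look like the following task definition, built as a Python dict so it can be printed as JSON. It assumes `awsvpc` network mode (containers in one task share `localhost`, so the app can keep the default daemon address), the AWS-published daemon image on ECR Public (verify the current image reference before use), and a task role that already permits segment uploads, for example through the `AWSXRayDaemonWriteAccess` managed policy. The family name, image URI, and role ARN are placeholders.

```python
"""Sketch: ECS task definition with the X-Ray daemon as a sidecar container."""
import json


def xray_task_definition(app_image, task_role_arn):
    return {
        "family": "orders-with-xray",  # hypothetical family name
        "networkMode": "awsvpc",       # containers in the task share localhost
        "taskRoleArn": task_role_arn,  # must allow the daemon to upload segments
        "containerDefinitions": [
            {
                "name": "app",
                "image": app_image,
                "essential": True,
                "memoryReservation": 512,
                "environment": [
                    # Explicit for clarity; 127.0.0.1:2000 is also the SDK default.
                    {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"},
                ],
            },
            {
                "name": "xray-daemon",
                "image": "public.ecr.aws/xray/aws-xray-daemon:latest",  # assumed image
                "essential": False,  # losing tracing should not stop the app
                "cpu": 32,
                "memoryReservation": 256,
                "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
            },
        ],
    }


task_def = xray_task_definition(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders:1.0",  # placeholder image
    "arn:aws:iam::123456789012:role/orders-task-role",          # placeholder role
)
print(json.dumps(task_def, indent=2))
```

The printed JSON could be saved to a file and registered with `aws ecs register-task-definition --cli-input-json file://taskdef.json`. Marking the daemon as non-essential is a design choice: if it stops, the application container keeps serving requests, just without traces.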
⚠️ Exam Trap: X-Ray annotations are indexed and searchable: you can filter traces by annotation. Metadata is not indexed and cannot be searched. The exam will describe a scenario where you need to find all traces for a specific customer; the correct answer involves annotations, not metadata.
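In code, the annotation-versus-metadata distinction shows up as a filter expression. The sketch below assumes the application recorded a hypothetical `customer_id` annotation on its segments, and that boto3 is installed with credentials allowing `xray:GetTraceSummaries`. It builds an expression such as `annotation.customer_id = "C-1042"` and pages through the matching trace summaries. There is no equivalent filter for metadata, because metadata is not indexed.

```python
"""Sketch: find every trace for one customer by filtering on an annotation."""
import re
from datetime import datetime, timedelta, timezone

_KEY = re.compile(r"^[A-Za-z0-9_]+$")


def annotation_filter(key, value):
    """Build an X-Ray filter expression such as annotation.customer_id = "C-1042"."""
    if not _KEY.match(key):
        raise ValueError(f"{key!r} is not a filterable annotation key")
    if isinstance(value, bool) or not isinstance(value, (str, int, float)):
        raise TypeError("this sketch only handles string and numeric values")
    if isinstance(value, str):
        if '"' in value:
            raise ValueError("this sketch does not escape embedded quotes")
        return f'annotation.{key} = "{value}"'
    return f"annotation.{key} = {value}"


def traces_for_customer(customer_id, hours=6):
    import boto3  # imported lazily so annotation_filter works without boto3

    client = boto3.client("xray")
    end = datetime.now(timezone.utc)
    paginator = client.get_paginator("get_trace_summaries")
    pages = paginator.paginate(
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        FilterExpression=annotation_filter("customer_id", customer_id),
    )
    return [summary["Id"] for page in pages for summary in page["TraceSummaries"]]


print(annotation_filter("customer_id", "C-1042"))
```

The same expression can be pasted into the trace filter box in the X-Ray console; the helper only saves you from hand-quoting it.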
Reflection Question: A microservices application has intermittent latency spikes. CloudWatch metrics show nothing unusual at the aggregate level. Which X-Ray feature would you use to identify the specific service responsible, and how would you filter the results?