2.1.1. Standard Metrics, Custom Metrics, and Namespaces
💡 First Principle: Metrics are organized by namespace, and the namespace tells you who published the data. AWS services publish into their own namespaces (AWS/EC2, AWS/RDS, AWS/Lambda). Your application publishes into a custom namespace you define. Understanding namespaces is the first step to finding and querying any metric in CloudWatch.
Standard Metrics are automatically published by AWS services at no additional cost. However, there's a critical gap: EC2 publishes CPU, network, and disk I/O by default, but it does not publish memory utilization or disk space utilization. Why? Because AWS runs the hypervisor, not your operating system. Memory is managed inside the OS, which AWS can't see without an agent.
| EC2 Metric | Published By Default? | Why / Why Not |
|---|---|---|
| CPU Utilization | ✅ Yes | Hypervisor can measure this |
| Network In/Out | ✅ Yes | Hypervisor can measure this |
| Disk Read/Write Ops | ✅ Yes | For instance store; EBS is separate |
| Memory Utilization | ❌ No | Inside the OS; requires CloudWatch agent |
| Disk Space Used | ❌ No | Inside the OS; requires CloudWatch agent |
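To close that gap, you install the CloudWatch agent on the instance and tell it which OS-level metrics to collect. A minimal sketch of the agent's JSON configuration file, collecting memory and root-volume disk usage (the measurement names follow the agent's documented conventions; `CWAgent` is the agent's default namespace):

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  }
}
```

With this in place, memory and disk-space metrics appear in CloudWatch under the `CWAgent` namespace rather than `AWS/EC2`, which reinforces the namespace rule above: the agent, not the hypervisor, is the publisher.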
Metric Resolution: By default, EC2 publishes metrics at 5-minute intervals (basic monitoring). You can enable detailed monitoring for 1-minute intervals; this costs extra and is required if you want faster Auto Scaling reactions.
Custom Metrics are published by your own code using the PutMetricData API. Examples: number of items in a processing queue, user login failures, cache hit rate. You define the namespace, metric name, unit, and value. Custom metrics are billed per metric per month.
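A minimal sketch of publishing one of those examples (queue depth) with boto3. The namespace, metric name, and dimension are illustrative, not AWS-defined; the payload builder is a hypothetical helper so the request shape is visible without calling AWS:

```python
def build_metric_data(queue_depth: int) -> dict:
    """Build a PutMetricData request payload for a hypothetical
    'queue depth' custom metric. Namespace/names are illustrative."""
    return {
        "Namespace": "MyApp/Processing",  # custom namespace you define
        "MetricData": [
            {
                "MetricName": "QueueDepth",
                "Dimensions": [{"Name": "Environment", "Value": "prod"}],
                "Unit": "Count",
                "Value": float(queue_depth),
            }
        ],
    }

# To actually publish (requires AWS credentials):
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_data(**build_metric_data(42))
```

Note that the namespace cannot begin with `AWS/`; that prefix is reserved for AWS service namespaces.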
High-Resolution Custom Metrics can be published at 1-second intervals (vs. the standard 1-minute). These are useful for high-frequency monitoring such as Lambda invocation latency or API Gateway response times.
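The switch to high resolution is made per datum via the `StorageResolution` field of PutMetricData. A minimal sketch (metric name is illustrative):

```python
def build_high_res_datum(latency_ms: float) -> dict:
    """One metric datum at 1-second resolution.
    StorageResolution=1 marks it high-resolution; the default of 60
    gives standard 1-minute resolution."""
    return {
        "MetricName": "ResponseLatency",  # illustrative name
        "Unit": "Milliseconds",
        "Value": latency_ms,
        "StorageResolution": 1,  # 1 = high resolution (seconds)
    }
```

High-resolution metrics also enable high-resolution alarms, which can evaluate at 10-second or 30-second periods.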
Metric Statistics: When you query a metric over a period, you choose a statistic:
| Statistic | Use Case | Example |
|---|---|---|
| Average | Typical utilization | Average CPU over 5 minutes |
| Sum | Totals | Total number of requests |
| Maximum | Peak detection | Highest latency spike |
| Minimum | Low-water mark | Lowest available memory |
| SampleCount | Count of data points | Number of API calls |
| p99, p95, p50 | Latency percentiles | 99th percentile response time |
⚠️ Exam Trap: For latency monitoring, the exam expects you to know that Average is misleading: it hides tail latency. A p99 of 5 seconds means 1% of users wait 5+ seconds even if the average is 200ms. The correct statistic for SLA monitoring is a percentile.
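The trap is easy to see with a tiny computation. This sketch uses the simple nearest-rank percentile definition (CloudWatch's internal percentile algorithm may differ) on a synthetic sample where 2 of 100 requests are slow:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the sample is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# 98 fast requests (200ms) and 2 slow outliers (5000ms):
# the average looks healthy, but p99 exposes the tail.
latencies_ms = [200] * 98 + [5000] * 2
avg = sum(latencies_ms) / len(latencies_ms)  # 296.0 ms
p99 = percentile(latencies_ms, 99)           # 5000 ms
```

An average of 296ms would pass most SLA dashboards, while the p99 of 5000ms shows that 1 in 100 users is waiting 5 seconds.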
Reflection Question: Your application publishes custom metrics at 1-minute resolution. A new requirement asks you to detect anomalies within 10 seconds. What metric configuration change is needed?