5.1.2.1. Application Insights and End-to-End Transaction Tracing
Application Insights is Azure's APM (Application Performance Management) service, collecting request telemetry, dependency traces, exceptions, and custom events from instrumented applications. End-to-end transaction search shows the full distributed trace for any request: which dependency failed, what exception was thrown, and how long each step took. This pinpoints whether an issue originates in application code, a downstream service, or a database query. For incident diagnosis, KQL (Kusto Query Language) queries on the exceptions table joined with requests reveal failure patterns: grouping by exception type, dependency target, and time distribution shows whether failures cluster around a specific cause. Azure Monitor alert rules define conditions (for example, error rate above 5% for 5+ minutes) that trigger notifications or automated responses.
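As an illustration, a query of the shape described above might look like the following sketch. The tables and columns (exceptions, requests, operation_Id, success) are the standard Application Insights log schema; the one-hour window and 5-minute bin are assumptions.

```kusto
// Join exceptions to their failed requests, then group by exception type
// and request name to see where failures cluster over time.
exceptions
| where timestamp > ago(1h)
| join kind=inner (
    requests
    | where success == false
    | project operation_Id, requestName = name
  ) on operation_Id
| summarize failureCount = count() by type, requestName, bin(timestamp, 5m)
| order by failureCount desc
```

A spike in one (type, requestName) pair across consecutive bins points at a single cause; failures spread evenly across many pairs suggest an infrastructure-level problem instead.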
Application Insights operates through auto-instrumentation (SDK-less) or SDK-based instrumentation. Auto-instrumentation — available for Azure App Service, AKS, and Azure Functions — collects standard telemetry (requests, dependencies, exceptions) without code changes. SDK-based instrumentation adds custom events, metrics, and distributed tracing correlation.
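One way to confirm which instrumentation mode each component is actually running is to group recent telemetry by SDK version, since auto-instrumentation agents and the SDK report different sdkVersion strings. A sketch (sdkVersion and cloud_RoleName are standard columns; the time window is an assumption):

```kusto
// Telemetry volume per application role and emitting SDK/agent version.
union requests, dependencies, exceptions
| where timestamp > ago(1h)
| summarize items = count() by cloud_RoleName, sdkVersion
| order by cloud_RoleName asc
```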
The diagnostic workflow follows a funneling pattern: Dashboards (overview health) → Alerts (what changed?) → Application Map (which component is affected?) → Transaction Search (specific failing requests) → End-to-end trace (full distributed call chain) → Logs (KQL deep analysis). Each step narrows focus from "something is wrong" to "this specific database query is timing out because of a missing index."
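The "Logs (KQL deep analysis)" endpoint of that funnel often starts with an error-rate query; the sketch below echoes the error rate > 5% for 5+ minutes condition mentioned earlier (window and bin size are assumptions):

```kusto
// Error rate per 5-minute bin over the last hour; bins above 5% are
// the windows worth drilling into via Transaction Search.
requests
| where timestamp > ago(1h)
| summarize errorRate = 100.0 * countif(success == false) / count()
    by bin(timestamp, 5m)
| where errorRate > 5
| order by timestamp asc
```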
Smart Detection automatically identifies anomalies: sudden increases in failure rates, response time degradation, and dependency failures. Alerts fire without manual threshold configuration, catching issues that rule-based alerts miss because the team didn't anticipate the failure mode.
Custom events and metrics extend beyond standard telemetry. Tracking business-relevant events (for example, telemetryClient.TrackEvent("OrderPlaced", properties, metrics)) enables business dashboards alongside technical monitoring. Custom availability tests (URL ping or multi-step web tests) verify user-facing functionality from multiple global locations.
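Once custom events are flowing, they can feed a business dashboard directly from the customEvents table; a minimal sketch using the "OrderPlaced" event from the text (the seven-day window and hourly bin are assumptions):

```kusto
// Hourly order volume derived from the custom "OrderPlaced" events.
customEvents
| where name == "OrderPlaced"
| where timestamp > ago(7d)
| summarize orders = count() by bin(timestamp, 1h)
| render timechart
```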
Sampling reduces telemetry volume and cost without losing diagnostic value. Adaptive sampling (the default) adjusts the sampling rate based on traffic volume — high-traffic applications keep a representative sample while low-traffic applications retain 100% of telemetry. Fixed-rate sampling provides consistent behavior for compliance scenarios that require deterministic data retention.
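Sampling has one query-side consequence worth noting: each stored row carries an itemCount column recording how many original items it represents, so totals should sum itemCount rather than count rows. A sketch:

```kusto
// Correct request totals under sampling: sum(itemCount) restores the
// pre-sampling volume that a plain count() would understate.
requests
| where timestamp > ago(1d)
| summarize totalRequests = sum(itemCount),
            failedRequests = sumif(itemCount, success == false)
```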
Distributed tracing with Application Insights correlates requests across microservices using operation IDs. When a user request flows through API Gateway → Order Service → Payment Service → Notification Service, the entire chain appears as one trace in Transaction Search. Each hop shows duration, success/failure, and dependency details. W3C Trace Context headers propagate correlation automatically when services use the Application Insights SDK.
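Given an operation ID taken from Transaction Search, the same hop-by-hop chain can be reconstructed in Logs; in this sketch the operation ID placeholder is hypothetical:

```kusto
// All requests and dependency calls sharing one operation_Id, ordered
// by time: each row is a hop with its role, duration, and outcome.
union requests, dependencies
| where operation_Id == "<operation-id>"
| project timestamp, cloud_RoleName, itemType, name, duration, success
| order by timestamp asc
```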
Live Metrics Stream provides real-time telemetry with less than one second of latency — essential during active deployments or incident investigation. Unlike standard Application Insights queries (which have ingestion delay), Live Metrics shows current requests, failures, and dependency calls as they happen. This is the first tool to open during an incident to understand whether the problem is ongoing or resolved.