Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

7.2.2. Performance Troubleshooting

💡 First Principle: Performance troubleshooting for GenAI applications requires isolating the problem to one of two fundamentally different performance domains: retrieval performance (database/search problem) and generation performance (model inference problem). The diagnostic tools, root causes, and fixes are completely different for each.
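One way to make that isolation concrete is to time each stage separately and attribute end-to-end latency to the dominant one. A minimal sketch — the function name and the 60% dominance threshold are illustrative choices, not an AWS API:

```python
def classify_latency(retrieval_ms: float, generation_ms: float) -> str:
    """Attribute end-to-end latency to the dominant stage.

    retrieval_ms covers vector-store / search calls; generation_ms
    covers the model-inference call (e.g. Bedrock InvokeModel).
    """
    total = retrieval_ms + generation_ms
    if total == 0:
        return "no-latency"
    if retrieval_ms / total >= 0.6:
        return "retrieval-bound"   # tune index, caching, query fan-out
    if generation_ms / total >= 0.6:
        return "generation-bound"  # tune model choice, max tokens, streaming
    return "mixed"

# Example: 300 ms retrieval vs 2400 ms generation
print(classify_latency(300, 2400))  # → generation-bound
```

In practice the two timings would come from X-Ray subsegments or timers around each call; the point is that the answer routes you to entirely different runbooks.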

Latency troubleshooting decision tree: first isolate whether the slow stage is retrieval or generation, then work through the Lambda-side checks below.

Lambda cold start analysis for GenAI functions:
# Detect cold starts vs warm invocations via Lambda init duration
# X-Ray shows 'Initialization' subsegment only on cold starts
# Cold start adds 200-800ms for Python Lambda + boto3 import

# Mitigation: Lambda Provisioned Concurrency for latency-sensitive functions
import boto3

lambda_client = boto3.client('lambda')
lambda_client.put_provisioned_concurrency_config(
    FunctionName='genai-query-handler',
    Qualifier='production',             # Lambda alias (or version) to keep warm
    ProvisionedConcurrentExecutions=10  # Keep 10 warm execution environments
)
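Cold starts can also be flagged from inside the function itself: module-level code runs once per execution environment (the init phase), so a module-level flag distinguishes the first, cold invocation from warm ones. A common pattern, sketched here with an illustrative handler (not from the source):

```python
# Runs once per execution environment, during the init phase.
_cold_start = True

def handler(event, context):
    """Illustrative Lambda handler that reports whether it was a cold start."""
    global _cold_start
    was_cold, _cold_start = _cold_start, False
    # Emit the flag so cold starts are countable in logs or custom metrics.
    return {"cold_start": was_cold}
```

Logging this flag alongside request latency makes it easy to confirm whether tail-latency spikes line up with cold starts.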
Memory and timeout sizing for Bedrock invocation Lambdas:
# Correct sizing for Bedrock-invoking Lambda functions
# Under-provisioning memory = slow CPU = slow request serialization/deserialization
# Memory: 512MB–1GB is typically appropriate (Bedrock calls are I/O bound, not CPU)
# Timeout: up to 15 minutes for asynchronous long-context generation jobs
#          ~29s for user-facing interactive queries (API Gateway's default
#          integration timeout)

# Check Lambda duration metrics in CloudWatch:
# If p99 Duration approaches the timeout → increase timeout
# If max memory used (from REPORT log lines or the Lambda Insights
#   memory_utilization metric) > 80% → increase memory
# If Init Duration appears in REPORT lines (cold starts) → add Provisioned Concurrency
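The three checks above reduce to a simple decision rule. A sketch — the function and thresholds such as "p99 within 90% of timeout" are illustrative assumptions, not AWS guidance:

```python
def triage(p99_ms: float, timeout_ms: float,
           max_mem_pct: float, saw_init: bool) -> list[str]:
    """Map duration, memory, and cold-start signals to sizing actions."""
    actions = []
    if p99_ms >= 0.9 * timeout_ms:       # p99 approaching the timeout
        actions.append("increase timeout")
    if max_mem_pct > 80:                 # max memory used above 80%
        actions.append("increase memory")
    if saw_init:                         # Init Duration seen → cold starts
        actions.append("add Provisioned Concurrency")
    return actions
```

Feeding this from CloudWatch (e.g. GetMetricData on AWS/Lambda Duration with the p99 statistic) turns the checklist into an automatable health check.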

⚠️ Exam Trap: Lambda Provisioned Concurrency keeps functions warm (eliminates cold starts) but incurs cost for every provisioned instance regardless of whether it receives traffic. For applications with highly variable traffic, Provisioned Concurrency can be more expensive than the latency benefit is worth. Evaluate Application Auto Scaling for Provisioned Concurrency as a middle ground — scales warm instances with traffic patterns.
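That middle ground can be wired up with the Application Auto Scaling API. A sketch — the boto3 calls and the LambdaProvisionedConcurrencyUtilization metric are real, while the policy name, capacity bounds, and 70% target are illustrative assumptions:

```python
FUNCTION = "genai-query-handler"   # names carried over from the earlier snippet
ALIAS = "production"
RESOURCE_ID = f"function:{FUNCTION}:{ALIAS}"

# Target tracking: scale warm instances to keep utilization near 70%.
POLICY_CONFIG = {
    "TargetValue": 0.7,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
    },
}

def enable_pc_autoscaling(min_capacity: int = 2, max_capacity: int = 20) -> None:
    """Register the alias as a scalable target and attach the policy."""
    import boto3
    aas = boto3.client("application-autoscaling")
    aas.register_scalable_target(
        ServiceNamespace="lambda",
        ResourceId=RESOURCE_ID,
        ScalableDimension="lambda:function:ProvisionedConcurrency",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    aas.put_scaling_policy(
        PolicyName="pc-target-tracking",
        ServiceNamespace="lambda",
        ResourceId=RESOURCE_ID,
        ScalableDimension="lambda:function:ProvisionedConcurrency",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=POLICY_CONFIG,
    )
```

With this in place you pay for a small warm floor during quiet periods and scale out ahead of sustained traffic, rather than paying for peak provisioned concurrency around the clock.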

Reflection Question: Your GenAI application has excellent P50 latency (2.1s) but terrible P99 latency (22s). This pattern — where most requests are fast but occasional requests are very slow — is characteristic of which specific failure mode, and what CloudWatch metric would confirm your diagnosis?

Written by Alvin Varughese, Founder (15 professional certifications)