4.2.2. CI/CD Pipelines for GenAI Applications
💡 First Principle: GenAI applications require CI/CD pipelines that test not just code correctness but AI behavior correctness — a code change that doesn't break unit tests can still break FM output quality, introduce prompt injection vulnerabilities, or shift response style in ways traditional tests don't catch.
GenAI CI/CD pipeline components:
FM behavior testing in CodeBuild:
# Golden dataset evaluation in CI/CD pipeline
import json
import sys

# invoke_bedrock and evaluate_response are project helpers defined elsewhere
def run_fm_eval(golden_dataset_path, model_id, prompt_arn):
    results = []
    with open(golden_dataset_path) as f:
        test_cases = json.load(f)
    if not test_cases:
        # An empty dataset would otherwise cause a division by zero below
        print("FAIL: golden dataset contains no test cases")
        sys.exit(1)
    for case in test_cases:
        response = invoke_bedrock(case['input'], model_id, prompt_arn)
        score = evaluate_response(response, case['expected'], case['evaluation_criteria'])
        results.append({'case_id': case['id'], 'score': score, 'passed': score >= case['threshold']})
    pass_rate = sum(r['passed'] for r in results) / len(results)
    # Fail the pipeline if pass rate drops below 90%
    if pass_rate < 0.90:
        print(f"FAIL: Pass rate {pass_rate:.1%} below 90% threshold")
        sys.exit(1)  # Non-zero exit code fails CodeBuild
    print(f"PASS: {pass_rate:.1%} of test cases passed")
    return results
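The evaluation loop above calls `evaluate_response` without defining it. As a minimal sketch, assuming each test case's `evaluation_criteria` carries a hypothetical `required_terms` list, a scorer could return the fraction of required terms found in the response. The `SAMPLE_CASE` entry and its field values are illustrative only; real pipelines often use semantic similarity or an LLM-as-judge scorer instead.

```python
# Hypothetical golden-dataset entry; field names mirror those read by run_fm_eval.
SAMPLE_CASE = {
    "id": "refund-policy-01",
    "input": "What is the refund window?",
    "expected": "Refunds are accepted within 30 days of purchase.",
    "evaluation_criteria": {"required_terms": ["30 days", "refund"]},
    "threshold": 0.8,
}


def evaluate_response(response, expected, evaluation_criteria):
    """Return a score between 0.0 and 1.0 for one FM response.

    Assumption: criteria may contain a 'required_terms' list. When it is
    missing or empty, fall back to a case-insensitive exact match on `expected`.
    """
    terms = evaluation_criteria.get("required_terms", [])
    if not terms:
        return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0
    text = response.lower()
    hits = sum(1 for term in terms if term.lower() in text)
    return hits / len(terms)
```

A keyword scorer like this is cheap and deterministic, which suits a build stage, but it cannot detect shifts in tone or subtle factual errors.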
Canary deployment for FM applications: A traditional canary deployment routes a small percentage of traffic to new code. A GenAI canary deployment must also be able to route a percentage of traffic to a new model version or prompt version. Lambda aliases with weighted routing enable this:
# Lambda alias with weighted routing: 95% to current, 5% to new
import boto3

lambda_client = boto3.client('lambda')

lambda_client.update_alias(
    FunctionName='my-fm-handler',
    Name='production',
    RoutingConfig={
        'AdditionalVersionWeights': {
            'new-version-number': 0.05  # 5% canary; the key is the newly published version number
        }
    }
)
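Once the canary is receiving traffic, something has to decide whether to widen it or pull it back. The following is a sketch of one possible policy, not an AWS feature: the function name, the step schedule, and the 2-point tolerance are all assumptions chosen for illustration. It compares the canary's golden-dataset pass rate against the current version's and returns the next traffic weight.

```python
def next_canary_weight(current_weight, canary_pass_rate, baseline_pass_rate,
                       tolerance=0.02, steps=(0.05, 0.25, 0.50, 1.0)):
    """Return the next traffic weight for the canary version.

    Returns 0.0 (roll back) when the canary's pass rate falls more than
    `tolerance` below the baseline; otherwise returns the next larger step.
    All thresholds here are illustrative assumptions.
    """
    if canary_pass_rate < baseline_pass_rate - tolerance:
        return 0.0  # Canary is measurably worse: send all traffic back
    for step in steps:
        if step > current_weight:
            return step  # Advance to the next stage of the rollout
    return 1.0  # Already at the final stage
```

For example, `next_canary_weight(0.05, 0.93, 0.94)` advances the canary to 25%, while a canary pass rate of 0.85 against the same baseline returns 0.0. A weight of 1.0 would correspond to pointing the alias at the new version outright rather than keeping a weighted split.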
⚠️ Exam Trap: Traditional application rollback is usually a single step: redeploy the previous container image. FM application rollback has an additional dimension. If the issue is in the prompt (not the code), you must roll back the prompt version in Bedrock Prompt Management AND redeploy the code that references the previous prompt ARN. Missing either step leaves the system in an inconsistent state.
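One way to avoid a half-finished rollback is to record the code version and prompt version together as a single release, so both are always restored as a pair. The sketch below is a hypothetical bookkeeping structure, not part of Bedrock or Lambda; the `Release` class, its field names, and the helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """A code version and the prompt version it was validated with."""
    lambda_version: str
    prompt_version: str


def rollback_target(history):
    """Return the release immediately before the latest one."""
    if len(history) < 2:
        raise ValueError("no earlier release to roll back to")
    return history[-2]


def is_consistent(live_lambda_version, live_prompt_version, release):
    """True only when BOTH live versions match the intended release."""
    return (live_lambda_version, live_prompt_version) == (
        release.lambda_version,
        release.prompt_version,
    )
```

With a history of `[Release("6", "2"), Release("7", "3")]`, the rollback target is `Release("6", "2")`. If only the code is reverted, `is_consistent("6", "3", target)` returns `False`, flagging exactly the inconsistent state described above.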
Reflection Question: Your CI/CD pipeline deploys a code change that updates how retrieved context is formatted before being injected into the FM prompt. Unit tests pass. How would you detect that this change degraded FM output quality, and what pipeline stage would catch it?