2.7.2. CI/CD and Infrastructure as Code
💡 First Principle: Data pipelines are software. They should be versioned, tested, and deployed through automated pipelines, just like application code. Infrastructure as Code (IaC) ensures that your Glue jobs, Step Functions, S3 buckets, and IAM roles are defined in templates that can be reviewed, version-controlled, and consistently deployed across environments (dev, staging, production).
AWS CloudFormation defines AWS resources in JSON or YAML templates. You describe the desired state (a Glue job with these parameters, an S3 bucket with this lifecycle policy), and CloudFormation creates, updates, or deletes resources to match. Key exam concept: CloudFormation drift detection identifies when a resource's actual configuration has diverged from the template (for example, after a manual change made outside CloudFormation), which makes it useful for compliance auditing.
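To make "desired state" concrete, here is a minimal sketch that builds a CloudFormation template as a Python dictionary and prints it as JSON. The resource types and property names follow the CloudFormation resource reference, but the logical IDs (PipelineBucket, NightlyEtlJob), the job name, the script path, the 90-day expiration, and the GlueRoleArn parameter are all illustrative assumptions, not values from this guide.

```python
import json


def build_template() -> dict:
    """Return a CloudFormation template (the desired state) as a plain dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: one S3 bucket and one Glue job per environment.",
        "Parameters": {
            # Hypothetical parameters; supplied per environment at deploy time.
            "Environment": {
                "Type": "String",
                "AllowedValues": ["dev", "staging", "prod"],
            },
            "GlueRoleArn": {"Type": "String"},
        },
        "Resources": {
            "PipelineBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "LifecycleConfiguration": {
                        "Rules": [
                            {
                                "Id": "ExpireAfter90Days",  # assumed policy
                                "Status": "Enabled",
                                "ExpirationInDays": 90,
                            }
                        ]
                    }
                },
            },
            "NightlyEtlJob": {
                "Type": "AWS::Glue::Job",
                "Properties": {
                    "Name": {"Fn::Sub": "nightly-etl-${Environment}"},
                    "Role": {"Ref": "GlueRoleArn"},
                    "GlueVersion": "4.0",
                    "Command": {
                        "Name": "glueetl",
                        # ${PipelineBucket} resolves to the bucket name.
                        "ScriptLocation": {
                            "Fn::Sub": "s3://${PipelineBucket}/scripts/nightly_etl.py"
                        },
                    },
                    "DefaultArguments": {"--ENV": {"Ref": "Environment"}},
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

The same template deployed with a different Environment parameter value yields a separately named job per environment, which is the usual way one template serves dev, staging, and production.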
AWS CDK (Cloud Development Kit) lets you define CloudFormation resources using familiar programming languages (such as Python, TypeScript, Java, and C#). CDK code is synthesized into a CloudFormation template, which CloudFormation then deploys. CDK is higher-level than raw CloudFormation: constructs abstract common patterns (a data lake with S3 + Glue + Lake Formation in a few lines of code). The v1.1 syllabus explicitly tests IaC with CloudFormation and CDK.
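The idea behind a construct can be illustrated without the CDK library itself. The sketch below is plain Python and deliberately does not use the CDK API: data_lake_zone is a hypothetical helper that expands one high-level call into two low-level CloudFormation resources, which is roughly the kind of abstraction a CDK construct provides. The zone names and database names are assumptions for illustration.

```python
def data_lake_zone(zone: str, database_name: str) -> dict:
    """Hypothetical helper (NOT a CDK API): expand one call into two resources.

    Returns a versioned S3 bucket plus a Glue database that points at it.
    """
    bucket_id = f"{zone.capitalize()}Bucket"
    database_id = f"{zone.capitalize()}Database"
    return {
        bucket_id: {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        },
        database_id: {
            "Type": "AWS::Glue::Database",
            "Properties": {
                "CatalogId": {"Ref": "AWS::AccountId"},
                "DatabaseInput": {
                    "Name": database_name,
                    # Builds e.g. "s3://${RawBucket}/" for Fn::Sub to resolve.
                    "LocationUri": {"Fn::Sub": "s3://${" + bucket_id + "}/"},
                },
            },
        },
    }


# Two short calls produce four CloudFormation resources.
resources: dict = {}
for zone in ("raw", "curated"):
    resources.update(data_lake_zone(zone, f"{zone}_db"))

template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}
```

In real CDK code the equivalent building blocks come from the CDK construct library, and the synthesis step produces the template for you; the loop above only mimics that expansion.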
CI/CD pipeline for data engineering:
CodePipeline orchestrates the deployment workflow. CodeBuild runs tests (unit tests for Lambda functions, integration tests for Glue jobs, SQL syntax validation). CodeDeploy handles deployment strategies. Together, they automate the path from code commit to production deployment.
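As a sketch of the kind of unit test a CodeBuild stage might run, the example below tests a small record-cleaning function with Python's built-in unittest module. The function normalize_record and its field names are hypothetical stand-ins for logic that would live in a Lambda handler or Glue script; a build step could invoke the tests with a command such as `python -m unittest`.

```python
import unittest


def normalize_record(record: dict) -> dict:
    """Hypothetical transformation: trim the order ID and cast amount to float."""
    return {
        "order_id": record["order_id"].strip(),
        "amount": float(record["amount"]),
    }


class NormalizeRecordTest(unittest.TestCase):
    def test_trims_and_casts(self):
        out = normalize_record({"order_id": " A-17 ", "amount": "19.50"})
        self.assertEqual(out, {"order_id": "A-17", "amount": 19.5})

    def test_rejects_non_numeric_amount(self):
        # float("n/a") raises ValueError, so bad input fails loudly.
        with self.assertRaises(ValueError):
            normalize_record({"order_id": "A-18", "amount": "n/a"})


if __name__ == "__main__":
    unittest.main()
```

If any test fails, the test command exits with a non-zero status, which is what lets the pipeline stop before the deployment stage.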
⚠️ Exam Trap: AWS CodeCommit was removed from the v1.1 in-scope list. If a question about source control appears, the answer may reference Git generally or third-party repositories (GitHub, Bitbucket) integrated with CodePipeline, not CodeCommit specifically.
Reflection Question: A data team manages 15 Glue jobs across dev, staging, and production environments. Each job has different configurations (S3 paths, database endpoints) per environment. How does CloudFormation or CDK handle this?