4.3.2. CloudTrail, CloudTrail Lake, and Log Analysis
💡 First Principle: CloudTrail records API activity in your AWS account: who did what, when, from where, and with what result. If CloudWatch answers "is the system healthy?", CloudTrail answers "who touched the system?" This is the foundation of security auditing, compliance, and incident investigation for data pipelines.
CloudTrail captures management events (API calls that create, modify, or delete AWS resources) by default. Data events (S3 object-level operations, Lambda invocations, DynamoDB item-level operations) require explicit configuration and incur additional cost. For data engineering, data events are critical: they reveal who accessed which S3 objects and when.
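As a minimal sketch of that "explicit configuration", the snippet below builds advanced event selectors that keep management events and add S3 object-level data events for one bucket. The helper names, trail name, and bucket name are hypothetical. The boto3 call `put_event_selectors` replaces a trail's existing selectors, which is why the management selector is included here.

```python
from __future__ import annotations

from typing import Any


def s3_data_event_selectors(bucket_name: str, prefix: str = "") -> list[dict[str, Any]]:
    """Build selectors for management events plus S3 object-level data events."""
    # S3 object ARNs look like arn:aws:s3:::bucket/key, so "bucket/" matches every object.
    object_arn_prefix = f"arn:aws:s3:::{bucket_name}/{prefix}"
    return [
        {
            "Name": "Management events",
            "FieldSelectors": [{"Field": "eventCategory", "Equals": ["Management"]}],
        },
        {
            "Name": f"S3 object-level events for {bucket_name}",
            "FieldSelectors": [
                {"Field": "eventCategory", "Equals": ["Data"]},
                {"Field": "resources.type", "Equals": ["AWS::S3::Object"]},
                {"Field": "resources.ARN", "StartsWith": [object_arn_prefix]},
            ],
        },
    ]


def enable_s3_data_events(trail_name: str, bucket_name: str) -> None:
    """Apply the selectors to an existing trail (needs AWS credentials; adds cost)."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    cloudtrail = boto3.client("cloudtrail")
    cloudtrail.put_event_selectors(
        TrailName=trail_name,
        AdvancedEventSelectors=s3_data_event_selectors(bucket_name),
    )
```

Scoping the data-event selector to a single bucket (or prefix) rather than all of S3 is one way to limit the additional cost mentioned above.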
CloudTrail Lake stores CloudTrail events in a managed, queryable data store. Instead of sending trail logs to S3 and querying them with Athena (the traditional approach), CloudTrail Lake provides a built-in SQL query interface with faster performance and simpler setup. Use it for centralized security queries across accounts and regions.
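To illustrate the SQL interface, here is a rough sketch that builds a CloudTrail Lake query listing recent calls to one AWS service and submits it with boto3's `start_query`. The function names and the event data store ID are placeholders, and the values are interpolated directly into the SQL, so this assumes trusted inputs only.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def build_lake_query(
    event_data_store_id: str,
    event_source: str,
    days: int,
    now: datetime | None = None,
) -> str:
    """Return SQL listing who called `event_source` during the last `days` days."""
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    # In CloudTrail Lake, the FROM clause names the event data store ID.
    return (
        "SELECT eventTime, userIdentity.arn, eventName, sourceIPAddress "
        f"FROM {event_data_store_id} "
        f"WHERE eventSource = '{event_source}' "
        f"AND eventTime > '{since}' "
        "ORDER BY eventTime DESC"
    )


def start_lake_query(event_data_store_id: str, event_source: str, days: int) -> str:
    """Submit the query and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the query builder works without boto3 installed

    cloudtrail = boto3.client("cloudtrail")
    response = cloudtrail.start_query(
        QueryStatement=build_lake_query(event_data_store_id, event_source, days)
    )
    return response["QueryId"]
```

Queries run asynchronously: the returned ID would then be passed to `get_query_results` once the query finishes.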
Log analysis patterns tested on the exam:
| Log Source | Analysis Service | Use Case |
|---|---|---|
| Application logs | CloudWatch Logs Insights | Debugging pipeline failures |
| API audit trails | CloudTrail Lake or Athena | Security investigation, compliance |
| High-volume logs | Amazon OpenSearch | Interactive search, dashboards |
| Large-scale log ETL | Amazon EMR | Processing TB+ of log data |
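For the first row of the table, a CloudWatch Logs Insights query for debugging pipeline failures might look like the sketch below. The log group name and the `ERROR` pattern are assumptions about how a pipeline logs; the helper only builds the arguments for boto3's `logs.start_query`.

```python
from __future__ import annotations

from typing import Any


def failure_query_kwargs(
    log_group: str, start_epoch: int, end_epoch: int, limit: int = 20
) -> dict[str, Any]:
    """Build start_query arguments that fetch the most recent ERROR lines."""
    query = (
        "fields @timestamp, @logStream, @message "
        "| filter @message like /ERROR/ "
        "| sort @timestamp desc "
        f"| limit {limit}"
    )
    return {
        "logGroupName": log_group,
        "startTime": start_epoch,  # seconds since the Unix epoch
        "endTime": end_epoch,
        "queryString": query,
    }


def start_failure_query(log_group: str, start_epoch: int, end_epoch: int) -> str:
    """Submit the query and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    logs = boto3.client("logs")
    response = logs.start_query(**failure_query_kwargs(log_group, start_epoch, end_epoch))
    return response["queryId"]
```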
⚠️ Exam Trap: CloudTrail logs API calls, not application-level data access. If a question asks "who queried this Redshift table?", the answer is Redshift audit logging (not CloudTrail). CloudTrail would show who called the Redshift API (created a cluster, modified settings), but not the SQL queries executed within Redshift.
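To make the distinction concrete, the sketch below shows one way Redshift audit logging could be turned on with boto3, assuming a provisioned cluster that exports logs to CloudWatch Logs. The cluster and parameter group names are hypothetical, and the parameter change may need a cluster reboot before it takes effect.

```python
from __future__ import annotations

from typing import Any


def audit_logging_requests(cluster_id: str, parameter_group: str) -> dict[str, dict[str, Any]]:
    """Map each Redshift API operation name to the arguments it would receive."""
    return {
        # The user activity log (the SQL text) is only written when this parameter is true.
        "modify_cluster_parameter_group": {
            "ParameterGroupName": parameter_group,
            "Parameters": [
                {"ParameterName": "enable_user_activity_logging", "ParameterValue": "true"}
            ],
        },
        # Export connection, user, and user activity logs to CloudWatch Logs.
        "enable_logging": {
            "ClusterIdentifier": cluster_id,
            "LogDestinationType": "cloudwatch",
            "LogExports": ["connectionlog", "userlog", "useractivitylog"],
        },
    }


def enable_redshift_audit_logging(cluster_id: str, parameter_group: str) -> None:
    """Send both requests in order (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    redshift = boto3.client("redshift")
    for operation, kwargs in audit_logging_requests(cluster_id, parameter_group).items():
        getattr(redshift, operation)(**kwargs)
```

Note that these two calls are themselves API calls, so CloudTrail would record who enabled the logging, while the resulting user activity log records the SQL that users run.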
Reflection Question: Security needs to investigate whether any IAM user accessed a specific S3 bucket containing PII during the last 30 days. Which service provides this information, and what event type must be enabled?