4.4.1. Validation, Profiling, and Quality Rules
💡 First Principle: Data validation enforces four dimensions of quality: completeness (no missing values), consistency (values follow expected patterns), accuracy (values reflect reality), and integrity (relationships between tables hold). AWS Glue Data Quality and DataBrew rules automate these checks within your ETL pipeline, flagging or quarantining records that fail.
AWS Glue Data Quality integrates quality checks directly into Glue ETL jobs using Data Quality Definition Language (DQDL). Rules like Completeness "email" >= 0.99 (99% of email values must be non-null) and ColumnValues "age" between 0 and 150 run during ETL processing. Failed records can be routed to a "quarantine" path for investigation while clean records proceed.
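For instance, a DQDL ruleset attached to a Glue job's data quality evaluation might look like the sketch below. The rule types shown (Completeness, ColumnValues, IsUnique) are standard DQDL; the column names are illustrative assumptions, not from any particular dataset:

```
Rules = [
    Completeness "email" >= 0.99,
    ColumnValues "age" between 0 and 150,
    IsUnique "order_id",
    ColumnValues "order_total" >= 0
]
```

Each rule produces an individual pass/fail outcome, so a single ruleset can both gate the overall job and identify which specific expectation was violated.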
AWS Glue DataBrew provides visual data profiling: statistical summaries, value distributions, correlation analysis, and missing-value detection. DataBrew quality rules define expectations that data must meet, with pass/fail results. Use DataBrew for profiling (understanding data characteristics) and Glue Data Quality for enforcement (blocking bad data).
Common validation patterns: null checks on required fields, range validation (dates within expected bounds, amounts non-negative), referential integrity (foreign keys exist in the reference table), uniqueness constraints (no duplicate primary keys), and format validation (email patterns, phone number formats).
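The patterns above can be sketched as a plain-Python validator that routes failing records to a quarantine list, in the spirit of the quarantine path described earlier. This is a minimal illustration, not Glue code; the field names (`order_id`, `order_total`, `customer_id`, `email`) are hypothetical:

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_records(records, valid_customer_ids):
    """Return (clean, quarantined) lists; quarantined entries carry a reason.

    Checks: null on required field, range, referential integrity,
    uniqueness, and format validation.
    """
    seen_ids = set()
    clean, quarantined = [], []
    for rec in records:
        reasons = []
        # Null check + uniqueness on the primary key
        if rec.get("order_id") is None:
            reasons.append("null order_id")
        elif rec["order_id"] in seen_ids:
            reasons.append("duplicate order_id")
        else:
            seen_ids.add(rec["order_id"])
        # Range validation: amounts must be non-negative
        if rec.get("order_total") is not None and rec["order_total"] < 0:
            reasons.append("negative order_total")
        # Referential integrity: foreign key must exist in the reference set
        if rec.get("customer_id") not in valid_customer_ids:
            reasons.append("unknown customer_id")
        # Format validation: email pattern (only when a value is present)
        email = rec.get("email")
        if email is not None and not EMAIL_RE.match(email):
            reasons.append("malformed email")
        if reasons:
            quarantined.append((rec, reasons))
        else:
            clean.append(rec)
    return clean, quarantined
```

In a real pipeline the quarantine list would be written to a separate S3 prefix for investigation while clean records proceed downstream.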
⚠️ Exam Trap: Data quality checks add processing time and cost. The exam may present a trade-off: "the pipeline must complete within 30 minutes, but adding quality checks extends it to 45 minutes." The answer is usually sampling (check a statistically significant subset) or moving quality checks to a parallel process that doesn't block the main pipeline, not skipping quality altogether.
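The sampling approach can be sketched in a few lines: check a fixed-size random subset against a predicate and accept the batch if the observed failure rate stays under a threshold. The function name, sample size, and threshold are illustrative assumptions:

```python
import random

def sample_check(records, predicate, sample_size=1000,
                 max_failure_rate=0.005, seed=42):
    """Validate a random sample instead of the full dataset to bound runtime.

    Returns True when the sampled failure rate is within tolerance.
    A fixed seed makes the check reproducible across reruns.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(records))
    sample = rng.sample(records, n)
    failures = sum(1 for rec in sample if not predicate(rec))
    return failures / n <= max_failure_rate
```

Because the sample size is fixed, the check's cost stays constant even as the dataset grows, which is exactly the property the exam scenario is asking for.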
Reflection Question: A Glue ETL job processes 100 million records nightly. You need to ensure that the order_total column contains no negative values and the customer_email column is at least 99.5% complete. How do you implement this without significantly increasing pipeline runtime?