5.3.2. Data Masking, Anonymization, and PII Protection
💡 First Principle: Encryption protects data from unauthorized access; masking protects data from authorized-but-unnecessary access. A data analyst who needs to analyze purchase patterns doesn't need to see credit card numbers: masking replaces a number such as 4532-XXXX-XXXX-1234 with ****-****-****-1234, preserving analytical utility while protecting the sensitive value.
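As a minimal sketch of the card-number example, the function below masks every digit except the last four and leaves separators in place. The function name `mask_card` and the sample number are made up for illustration; real systems would typically apply this in an ETL step or a database masking policy.

```python
def mask_card(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all digits except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)  # hide this digit
            to_mask -= 1
        else:
            out.append(ch)  # keep separators and the trailing digits
    return "".join(out)


# Made-up card number, used only for illustration.
print(mask_card("4532-0000-0000-1234"))  # ****-****-****-1234
```

The analyst can still group or join on the last four digits, but the full number never appears in the output.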
Data masking techniques:
Static masking replaces sensitive values in the data permanently, so the original value is gone. Used for creating non-production datasets (dev, test, analytics sandboxes).
Dynamic masking replaces values at query time: the underlying data is unchanged, but different users see different levels of detail based on their permissions. Lake Formation's column-level security and data filters enable a form of dynamic masking.
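The idea behind dynamic masking can be sketched in plain Python: the stored record never changes, and a read function decides per role which columns come back in the clear. This is not Lake Formation's API; the role names, the `UNMASKED_COLUMNS` policy, and `read_record` are hypothetical and exist only to illustrate the concept.

```python
# Stored record; dynamic masking never modifies it.
RECORD = {"customer_id": 17, "email": "ana@example.com", "amount": 42.5}

# Hypothetical policy: columns each role may see unmasked.
UNMASKED_COLUMNS = {
    "fraud_investigator": {"customer_id", "email", "amount"},
    "analyst": {"customer_id", "amount"},
}


def read_record(record: dict, role: str) -> dict:
    """Return a per-role view of the record, masking disallowed columns."""
    allowed = UNMASKED_COLUMNS.get(role, set())  # unknown roles see nothing
    return {col: (val if col in allowed else "***") for col, val in record.items()}


print(read_record(RECORD, "analyst"))
print(read_record(RECORD, "fraud_investigator"))
```

Because the mask is applied at read time, granting or revoking access is a policy change rather than a data rewrite, which is the key difference from static masking.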
Anonymization removes the ability to re-identify individuals. Techniques include generalization (replace exact age with age range), aggregation (report averages not individual values), and k-anonymity (ensure each individual is indistinguishable from at least k-1 others).
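Generalization and the k-anonymity check can be illustrated with a small sketch. The sample rows, the choice of `age` and `zip` as quasi-identifiers, and the helper names are assumptions made for the example; a production check would also need to consider other attributes that could be linked to outside data.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(rows, quasi_identifiers, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Made-up sample data for illustration.
rows = [
    {"age": 31, "zip": "98101"},
    {"age": 34, "zip": "98101"},
    {"age": 38, "zip": "98101"},
    {"age": 52, "zip": "98102"},
    {"age": 57, "zip": "98102"},
]
generalized = [{"age": generalize_age(r["age"]), "zip": r["zip"]} for r in rows]

print(is_k_anonymous(rows, ["age", "zip"], k=2))         # False: every exact age is unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True: groups of 3 and 2
```

With exact ages every row is unique; after generalization the smallest group has two members, so the dataset is 2-anonymous but not 3-anonymous on these two columns.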
Key salting adds random data (a salt) to each value before hashing, which defeats rainbow table attacks against pseudonymized data. If you hash email addresses for pseudonymization, using a different salt for each dataset ensures that the same email produces different hashes across datasets, so the datasets cannot be joined on the hash.
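A minimal sketch of salted hashing, assuming one random salt per dataset and SHA-256 as the hash function. The `pseudonymize` helper and the lowercase normalization are illustrative choices, not something the text prescribes; a keyed construction such as HMAC is a common alternative.

```python
import hashlib
import secrets


def pseudonymize(email: str, salt: bytes) -> str:
    """Return the SHA-256 hex digest of the salt followed by the normalized email."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(salt + normalized).hexdigest()


# Assumption: each dataset gets its own secret salt, stored separately from the data.
salt_dataset_a = secrets.token_bytes(16)
salt_dataset_b = secrets.token_bytes(16)

hash_a = pseudonymize("ana@example.com", salt_dataset_a)
hash_b = pseudonymize("ana@example.com", salt_dataset_b)

print(hash_a == pseudonymize("Ana@Example.com ", salt_dataset_a))  # same salt: stable within a dataset
print(hash_a == hash_b)  # different salts: hashes differ across datasets
```

Within one dataset the hash stays stable, so joins and counts still work; across datasets the differing salts keep the same email from being linked. The salt must be kept secret, since anyone who holds it can recompute hashes for guessed emails.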
Tokenization replaces sensitive values with non-sensitive tokens that can be reversed by an authorized tokenization service. Unlike encryption (which uses mathematical keys), tokenization uses a lookup table: the token has no mathematical relationship to the original value.
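The lookup-table idea can be sketched as an in-memory vault. The `TokenVault` class, the `tok_` prefix, and the boolean `authorized` flag are hypothetical simplifications; a real tokenization service would persist the mapping securely and authenticate callers properly.

```python
import secrets


class TokenVault:
    """Toy tokenization service backed by an in-memory lookup table."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing the token if already issued."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        while token in self._token_to_value:  # regenerate on the unlikely collision
            token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        """Look up the original value; only authorized callers may do this."""
        if not authorized:
            raise PermissionError("caller may not detokenize")
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("4532-0000-0000-1234")  # made-up card number
print(token)  # random token; not derived from the card number
print(vault.detokenize(token, authorized=True))
```

Because the token is random rather than computed from the value, there is no key that could recover the original; reversal is possible only through the vault's table.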
⚠️ Exam Trap: Anonymization and pseudonymization are different. Pseudonymization (replacing names with IDs) can be reversed if you have the mapping table, so it's still personal data under GDPR. Anonymization (properly done) cannot be reversed, so it's no longer personal data. The exam may test this distinction in compliance-related questions.
Reflection Question: A company needs to share customer transaction data with a third-party analytics firm. The data includes customer names, email addresses, and purchase amounts. What combination of masking and anonymization techniques protects PII while maintaining analytical value?