2.1.1. Normalization and Data Integrity
đź’ˇ First Principle: Normalization organizes data to minimize redundancy and dependency, ensuring that each piece of information exists in exactly one place.
Scenario: A spreadsheet tracks orders with columns: OrderID, CustomerName, CustomerAddress, ProductName, ProductPrice. If a customer places 100 orders, their name and address are stored 100 times. If they move, you must update 100 rows—and if you miss one, data becomes inconsistent.
Why Normalize?
- Eliminate Redundancy: Store each fact once
- Prevent Anomalies:
- Insert Anomaly: Can't add a new customer without an order
- Update Anomaly: Changing customer address requires multiple updates
- Delete Anomaly: Deleting last order loses customer information
- Ensure Consistency: Single source of truth for each data element
Normal Forms (Simplified)
- First Normal Form (1NF): Eliminate repeating groups; each cell contains atomic values
- Second Normal Form (2NF): Remove partial dependencies; every non-key column depends on the entire primary key
- Third Normal Form (3NF): Remove transitive dependencies; non-key columns depend only on the primary key
Relational Keys
- Primary Key: Unique identifier for each row (e.g., CustomerID)
- Foreign Key: Reference to a primary key in another table (creates relationships)
- Composite Key: Primary key made of multiple columns
Common Database Objects
- Tables: The primary storage structures containing rows and columns.
- Views: Virtual tables based on the result set of a SQL query. They don't store data themselves but provide a way to look at data from one or more tables.
- Indexes: Structures used to speed up retrieval of data. Think of it like the index at the back of a book; it allows the database engine to find rows without scanning the entire table.
Visual: Normalization Example
Loading diagram...
⚠️ Common Pitfall: Over-normalizing for OLAP workloads. Normalization is essential for OLTP but actually hurts OLAP performance. Analytical systems deliberately denormalize to avoid expensive JOINs.
Key Trade-Offs:
- Normalization vs. Query Performance: Normalized data requires JOINs to reconstruct complete records. More JOINs = slower queries.
- Data Integrity vs. Write Complexity: Normalization ensures integrity but requires multiple table inserts for a single logical operation.
Reflection Question: Why would a data warehouse designer deliberately violate Third Normal Form (3NF) by duplicating customer data in the sales fact table?