Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.1.1. Normalization and Data Integrity

💡 First Principle: Normalization organizes data to minimize redundancy and dependency, ensuring that each piece of information exists in exactly one place.

Scenario: A spreadsheet tracks orders with columns: OrderID, CustomerName, CustomerAddress, ProductName, ProductPrice. If a customer places 100 orders, their name and address are stored 100 times. If they move, you must update 100 rows—and if you miss one, data becomes inconsistent.
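The scenario above can be reproduced with a small sketch. This uses Python's built-in `sqlite3` module; the column names come from the scenario, while the customer name, addresses, and product values are made-up illustrative data.

```python
import sqlite3

# Single-table design mirroring the spreadsheet from the scenario.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerName TEXT,
        CustomerAddress TEXT,
        ProductName TEXT,
        ProductPrice REAL
    )
""")

# The same customer's name and address are stored once per order: 100 copies.
rows = [(i, "Ada Lovelace", "1 Old Street", "Widget", 9.99) for i in range(1, 101)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", rows)

# The customer moves. An update that misses even one row leaves the data
# inconsistent -- here we deliberately skip OrderID 100.
conn.execute(
    "UPDATE orders SET CustomerAddress = '2 New Road' "
    "WHERE CustomerName = 'Ada Lovelace' AND OrderID <= 99"
)
addresses = conn.execute(
    "SELECT DISTINCT CustomerAddress FROM orders WHERE CustomerName = 'Ada Lovelace'"
).fetchall()
print(addresses)  # two different addresses for the same customer
```

The database has no way to know which of the two addresses is correct: that is the update anomaly in action.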

Why Normalize?

  • Eliminate Redundancy: Store each fact once
  • Prevent Anomalies:
    • Insert Anomaly: Can't add a new customer without an order
    • Update Anomaly: Changing customer address requires multiple updates
    • Delete Anomaly: Deleting last order loses customer information
  • Ensure Consistency: Single source of truth for each data element
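The insert and delete anomalies can be demonstrated on the same kind of denormalized table. This is a minimal sketch with assumed names and data; the schema's `NOT NULL` constraints stand in for the business rule that every row must be a real order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL,
        ProductName TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO orders VALUES (1, 'Bob', 'Gadget')")

# Insert anomaly: there is nowhere to record a customer who has not ordered
# yet, short of inserting a fake order row -- which the schema forbids.

# Delete anomaly: removing Bob's only order erases everything we knew about Bob.
conn.execute("DELETE FROM orders WHERE OrderID = 1")
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE CustomerName = 'Bob'"
).fetchone()[0]
print(remaining)  # 0 -- the customer vanished along with the order
```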

Normal Forms (Simplified)

  • First Normal Form (1NF): Eliminate repeating groups; each cell contains atomic values
  • Second Normal Form (2NF): Remove partial dependencies; every non-key column depends on the entire primary key
  • Third Normal Form (3NF): Remove transitive dependencies; non-key columns depend only on the primary key
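Applied to the order-tracking scenario, a 3NF design splits the one big table into Customers, Products, and Orders, so each fact lives in exactly one place. The sketch below assumes illustrative table names and sample data; the key columns (`CustomerID`, `ProductID`, `OrderID`) follow the conventions used elsewhere in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        CustomerID INTEGER PRIMARY KEY,
        CustomerName TEXT,
        CustomerAddress TEXT
    );
    CREATE TABLE products (
        ProductID INTEGER PRIMARY KEY,
        ProductName TEXT,
        ProductPrice REAL
    );
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES customers(CustomerID),
        ProductID INTEGER REFERENCES products(ProductID)
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace', '1 Old Street')")
conn.execute("INSERT INTO products VALUES (10, 'Widget', 9.99)")
# 100 orders now store only foreign keys, not repeated customer data.
conn.executemany("INSERT INTO orders VALUES (?, 1, 10)", [(i,) for i in range(1, 101)])

# A move is now a single-row update, no matter how many orders exist.
conn.execute("UPDATE customers SET CustomerAddress = '2 New Road' WHERE CustomerID = 1")
addr = conn.execute(
    "SELECT CustomerAddress FROM customers WHERE CustomerID = 1"
).fetchone()[0]
print(addr)  # '2 New Road'
```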

Relational Keys

  • Primary Key: Unique identifier for each row (e.g., CustomerID)
  • Foreign Key: Reference to a primary key in another table (creates relationships)
  • Composite Key: Primary key made of multiple columns
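All three kinds of keys can be seen in one small schema. A caveat for this sketch: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set per connection; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,                          -- primary key
        CustomerID INTEGER NOT NULL
            REFERENCES customers(CustomerID)                  -- foreign key
    );
    CREATE TABLE order_items (
        OrderID INTEGER REFERENCES orders(OrderID),
        ProductID INTEGER,
        Quantity INTEGER,
        PRIMARY KEY (OrderID, ProductID)                      -- composite key
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid: customer 1 exists

# A foreign key rejects rows that reference a nonexistent customer.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no customer 999
    fk_enforced = False
except sqlite3.IntegrityError as e:
    fk_enforced = True
    print("rejected:", e)
```

The failed insert is the relationship doing its job: an order cannot exist for a customer the database has never heard of.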

Common Database Objects

  • Tables: The primary storage structures containing rows and columns.
  • Views: Virtual tables based on the result set of a SQL query. They don't store data themselves but provide a way to look at data from one or more tables.
  • Indexes: Structures used to speed up retrieval of data. Think of it like the index at the back of a book; it allows the database engine to find rows without scanning the entire table.
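Views and indexes can both be sketched on the same two tables. The object names here are made up for illustration; `EXPLAIN QUERY PLAN` is SQLite's way of showing whether a query scans the table or uses an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES customers(CustomerID)
    );
    -- A view stores no data; it is a saved query over the base tables.
    CREATE VIEW customer_orders AS
        SELECT c.CustomerName, o.OrderID
        FROM customers c JOIN orders o ON o.CustomerID = c.CustomerID;
    -- An index lets lookups on CustomerID avoid a full table scan.
    CREATE INDEX idx_orders_customer ON orders(CustomerID);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

# Querying the view reads live data from the underlying tables.
print(conn.execute("SELECT * FROM customer_orders").fetchall())

# The plan shows the lookup going through the index rather than scanning.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE CustomerID = 1"
).fetchall()
print(plan)
```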

Visual: Normalization Example

⚠️ Common Pitfall: Over-normalizing for OLAP workloads. Normalization is essential for OLTP, but it typically hurts OLAP query performance; analytical systems deliberately denormalize to avoid expensive JOINs.

Key Trade-Offs:
  • Normalization vs. Query Performance: Normalized data requires JOINs to reconstruct complete records. More JOINs = slower queries.
  • Data Integrity vs. Write Complexity: Normalization ensures integrity but requires multiple table inserts for a single logical operation.
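Both trade-offs are visible in a normalized schema: writing one logical order touches three tables, and reading the complete record back requires two JOINs. This sketch reuses the assumed three-table design from earlier, with illustrative data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE products  (ProductID  INTEGER PRIMARY KEY, ProductName  TEXT);
    CREATE TABLE orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES customers(CustomerID),
        ProductID  INTEGER REFERENCES products(ProductID)
    );
""")
# Write complexity: one logical order means inserts into three tables.
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO products VALUES (10, 'Widget')")
conn.execute("INSERT INTO orders VALUES (100, 1, 10)")

# Query cost: reconstructing the complete record takes two JOINs.
row = conn.execute("""
    SELECT o.OrderID, c.CustomerName, p.ProductName
    FROM orders o
    JOIN customers c ON c.CustomerID = o.CustomerID
    JOIN products  p ON p.ProductID  = o.ProductID
""").fetchone()
print(row)  # (100, 'Ada', 'Widget')
```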

Reflection Question: Why would a data warehouse designer deliberately violate Third Normal Form (3NF) by duplicating customer data in the sales fact table?