Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.4. Handling Data Quality Issues

💡 First Principle: Real-world data contains duplicates, nulls, and late arrivals. Robust data engineering handles these systematically rather than hoping they don't exist.

Handling Duplicates

# PySpark: Remove exact duplicates by key
df_deduped = df.dropDuplicates(["OrderID"])

# Keep only the most recent row per OrderID
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

window = Window.partitionBy("OrderID").orderBy(col("UpdatedAt").desc())
df_latest = df.withColumn("row_num", row_number().over(window)) \
              .filter(col("row_num") == 1) \
              .drop("row_num")

Handling Missing Data

# Fill nulls with defaults (a sentinel date stands in for missing ShipDate)
df_filled = df.na.fill({
    "Quantity": 0,
    "Discount": 0.0,
    "ShipDate": "1900-01-01"
})

# Drop rows with nulls in critical columns
df_complete = df.na.drop(subset=["CustomerID", "ProductID"])

Handling Late-Arriving Data

  • Challenge: Data arrives after its batch window has already closed
  • Solutions:
    • Watermarks: Define how late events may arrive before results are finalized
    • Reprocessing windows: Periodically reprocess the most recent slice of data
    • Delta Lake merge: Upsert late-arriving records into existing tables
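The Delta Lake merge option boils down to a latest-wins upsert keyed on OrderID: a late record replaces the stored one only if its UpdatedAt is newer. A minimal pure-Python sketch of that semantics (the records and the merge_late_records helper are illustrative stand-ins; in production this would be a DeltaTable MERGE):

```python
# Sketch of latest-wins upsert semantics for late-arriving data.
# Plain dicts stand in for the target table and the late batch (hypothetical data).

def merge_late_records(target, late_batch):
    """Upsert late records, keeping the newest UpdatedAt per OrderID."""
    merged = dict(target)
    for rec in late_batch:
        key = rec["OrderID"]
        existing = merged.get(key)
        # Insert unseen keys; update existing keys only if the late record is newer
        if existing is None or rec["UpdatedAt"] > existing["UpdatedAt"]:
            merged[key] = rec
    return merged

target = {
    1: {"OrderID": 1, "Quantity": 2, "UpdatedAt": "2026-01-05"},
    2: {"OrderID": 2, "Quantity": 1, "UpdatedAt": "2026-01-06"},
}
late = [
    {"OrderID": 1, "Quantity": 3, "UpdatedAt": "2026-01-07"},  # newer: wins
    {"OrderID": 2, "Quantity": 9, "UpdatedAt": "2026-01-04"},  # older: ignored
    {"OrderID": 3, "Quantity": 5, "UpdatedAt": "2026-01-07"},  # new key: inserted
]
result = merge_late_records(target, late)
```

The same condition (`source.UpdatedAt > target.UpdatedAt`) would appear as the `whenMatchedUpdate` predicate in an actual Delta Lake merge, so a stale late record never overwrites a fresher one.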
Written by Alvin Varughese, Founder, 15 professional certifications