Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.
3.3.4. Handling Data Quality Issues
💡 First Principle: Real-world data contains duplicates, nulls, and late arrivals. Robust data engineering handles these systematically rather than hoping they don't exist.
Handling Duplicates
# PySpark: remove duplicates by key (keeps an arbitrary row per OrderID)
df_deduped = df.dropDuplicates(["OrderID"])

# Keep only the most recent record per OrderID
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

window = Window.partitionBy("OrderID").orderBy(col("UpdatedAt").desc())
df_latest = df.withColumn("row_num", row_number().over(window)) \
    .filter(col("row_num") == 1) \
    .drop("row_num")
Handling Missing Data
# Fill nulls with default values
df_filled = df.na.fill({
    "Quantity": 0,
    "Discount": 0.0,
    "ShipDate": "1900-01-01"  # sentinel date for unknown ship dates
})

# Drop rows with nulls in critical columns
df_complete = df.na.drop(subset=["CustomerID", "ProductID"])
Handling Late-Arriving Data
- Challenge: Data arrives after the batch window has closed
- Solutions:
- Watermarks: Track the latest event timestamp processed and accept records within a bounded lateness
- Reprocessing windows: Periodically reprocess recent partitions to pick up stragglers
- Delta Lake merge: Upsert late-arriving records into the existing target table
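The watermark approach above can be sketched in plain Python, independent of Spark's built-in withWatermark: track a high-water mark of event time and accept only records within a bounded lateness. The Watermark class and its names here are illustrative, not a real library API.

```python
# Minimal watermark sketch (illustrative, not Spark's API): keep a
# high-water mark of event time and reject records arriving later
# than the allowed lateness.
from datetime import datetime, timedelta

class Watermark:
    def __init__(self, allowed_lateness: timedelta):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = datetime.min

    def accept(self, event_time: datetime) -> bool:
        """Return True if the record is on time; always advance the high-water mark."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        return event_time >= watermark

wm = Watermark(allowed_lateness=timedelta(minutes=10))
t0 = datetime(2026, 1, 1, 12, 0)
print(wm.accept(t0))                          # True: first event
print(wm.accept(t0 - timedelta(minutes=5)))   # True: within the lateness bound
print(wm.accept(t0 - timedelta(minutes=30)))  # False: too late; route to reprocessing
```

Records rejected here would typically be routed to a reprocessing window or applied later via a Delta Lake merge, as listed above.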
Written by Alvin Varughese
Founder • 15 professional certifications