Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.2. PySpark Transformations

💡 First Principle: PySpark provides distributed data processing for large-scale transformations. When data exceeds the memory of a single machine or transformations require complex logic, Spark distributes the work across the nodes of a cluster.
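The distribution idea can be sketched in plain Python: split the data into partitions, process each partition independently (in Spark, on separate executors), then combine the partial results. The helper names and sample rows below are illustrative, not Spark APIs.

```python
# Minimal sketch of the split-apply-combine pattern that Spark automates.

def partition(rows, num_partitions):
    """Split rows into roughly equal chunks (illustrative helper)."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def partial_sum(chunk):
    """Work done per partition: sum the sales amounts in one chunk."""
    return sum(amount for _region, amount in chunk)

rows = [("West", 100.0), ("East", 250.0), ("West", 75.0), ("East", 50.0)]

# Each partition is processed independently, then the partials are combined.
partials = [partial_sum(chunk) for chunk in partition(rows, 2)]
total = sum(partials)  # combine step -> 475.0
```

Spark applies this same pattern automatically, deciding partitioning and scheduling for you; the snippet only makes the mental model concrete.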

Common PySpark Operations
# Imports for the column functions used below
# (a SparkSession named `spark` is assumed, as in a Fabric notebook)
from pyspark.sql.functions import col, sum, count

# Read a Delta table
df = spark.read.format("delta").load("Tables/sales")

# Filter rows to a single region
df_filtered = df.filter(col("Region") == "West")

# Group and aggregate
df_summary = df.groupBy("Region").agg(
    sum("SalesAmount").alias("TotalSales"),
    count("OrderID").alias("OrderCount")
)

# Handle nulls: fill missing discounts, drop rows with no CustomerID
df_clean = df.na.fill({"Discount": 0.0}).na.drop(subset=["CustomerID"])

# Add a calculated column (8% tax rate used for illustration)
df_enriched = df.withColumn("SalesWithTax", col("SalesAmount") * 1.08)

# Write to a Delta table
df_enriched.write.format("delta").mode("overwrite").save("Tables/sales_enriched")
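The groupBy/agg step above computes a sum and a count per region. Its semantics can be checked with a small plain-Python equivalent; the sample rows below are invented for illustration and are not part of the sales table.

```python
from collections import defaultdict

# Sample rows standing in for the sales table: (Region, SalesAmount, OrderID)
rows = [
    ("West", 100.0, 1),
    ("West", 75.0, 2),
    ("East", 250.0, 3),
]

# Plain-Python equivalent of:
# df.groupBy("Region").agg(sum("SalesAmount"), count("OrderID"))
summary = defaultdict(lambda: {"TotalSales": 0.0, "OrderCount": 0})
for region, amount, _order_id in rows:
    summary[region]["TotalSales"] += amount
    summary[region]["OrderCount"] += 1

# summary["West"] -> {"TotalSales": 175.0, "OrderCount": 2}
```

In Spark the same aggregation runs partially on each partition and the partial results are merged, which is why aggregate functions like sum and count scale well.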
Written by Alvin Varughese, Founder • 15 professional certifications