Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.4.3. Spark Structured Streaming with Delta Tables

💡 First Principle: Spark Structured Streaming treats streams as unbounded tables, using the same DataFrame API as batch processing. Combined with Delta Lake, it provides exactly-once processing and queryable historical data.

Scenario: A retail platform continuously ingests clickstream and device telemetry events. The data must be immediately queryable for real-time dashboards while remaining available for historical analysis.

Streaming to Delta Pattern

# Read stream from Azure Event Hubs
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType
)

stream_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**event_hub_config) \
    .load()

# Define the schema of the JSON payload carried in the event body
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("timestamp", TimestampType())
])

# Parse the JSON body and flatten it into columns
parsed_df = stream_df \
    .select(from_json(col("body").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta table
query = parsed_df \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/temperature") \
    .start("Tables/temperature_readings")

⚠️ Exam Trap: Checkpoints record stream progress (source offsets and state). Without a checkpoint location, most sinks refuse to start at all, and a query restarted without its original checkpoint cannot resume where it left off—it reprocesses source data from the earliest available offset, breaking exactly-once guarantees. Always configure checkpoints in production streaming jobs.

Written by Alvin Varughese, Founder · 15 professional certifications