Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.4.3. Spark Structured Streaming with Delta Tables

đź’ˇ First Principle: Spark Structured Streaming treats streams as unbounded tables, using the same DataFrame API as batch processing. Combined with Delta Lake, it provides exactly-once processing and queryable historical data.
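The "unbounded table" idea means the same transformations apply whether the source is bounded or not. A minimal sketch (assuming an existing SparkSession `spark` and an illustrative Delta table path):

```python
# Batch read: a bounded DataFrame over the table's current contents
batch_df = spark.read.format("delta").load("Tables/clicks")

# Streaming read: an unbounded DataFrame over the same table; new rows
# appended to the Delta table flow into the stream as they arrive
stream_df = spark.readStream.format("delta").load("Tables/clicks")

# The identical DataFrame transformation works on both
batch_counts = batch_df.groupBy("page").count()
stream_counts = stream_df.groupBy("page").count()
```

The only difference is the entry point (`read` vs `readStream`) and how the result is materialized (an action vs `writeStream`); the transformation logic is shared.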

Scenario: An IoT platform ingests device telemetry continuously. The readings must be immediately queryable for real-time dashboards while also remaining available for historical analysis.

Streaming to Delta Pattern

# Read stream from Azure Event Hubs
stream_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**event_hub_config) \
    .load()

# Parse and transform the raw event payload
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("timestamp", TimestampType())
])

parsed_df = stream_df \
    .select(from_json(col("body").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta table
query = parsed_df \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/temperature") \
    .start("Tables/temperature_readings")

⚠️ Exam Trap: Without a checkpoint location, a restarted query has no record of its progress and reprocesses from the configured starting position, which can duplicate or lose data. Checkpoints persist stream offsets and state—always configure them in production streaming jobs.
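Because the streaming sink is a Delta table, the same table serves both scenario requirements: dashboards can query it as an ordinary batch source while the stream is still running, and Delta time travel exposes historical snapshots. A sketch (assuming the table path from the example above; the timestamp is illustrative):

```python
# Dashboards: batch-query the live table for current aggregates
current = spark.read.format("delta").load("Tables/temperature_readings")
current.groupBy("deviceId").avg("temperature").show()

# Historical analysis: Delta time travel to a prior snapshot,
# either by timestamp ("timestampAsOf") or by version ("versionAsOf")
snapshot = spark.read.format("delta") \
    .option("timestampAsOf", "2026-01-01") \
    .load("Tables/temperature_readings")
```

Each batch read sees a consistent snapshot of the table, so dashboard queries are never affected by in-flight streaming writes.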

Written by Alvin Varughese, Founder • 15 professional certifications