Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.
3.4.3. Spark Structured Streaming with Delta Tables
💡 First Principle: Spark Structured Streaming treats streams as unbounded tables, using the same DataFrame API as batch processing. Combined with Delta Lake, it provides exactly-once processing and queryable historical data.
Scenario: A retail platform continuously ingests device telemetry (e.g., in-store temperature sensor readings). The data must be immediately queryable for real-time dashboards while also remaining available for historical analysis.
Streaming to Delta Pattern
# Read stream from Event Hubs (event_hub_config holds the connection settings)
stream_df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**event_hub_config) \
    .load()

# Parse and transform
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
    StructField("timestamp", TimestampType())
])

parsed_df = stream_df \
    .select(from_json(col("body").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta table
query = parsed_df \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/temperature") \
    .start("Tables/temperature_readings")
⚠️ Exam Trap: If no checkpoint location is specified, a restarted stream has no record of its progress and must reprocess all data from the beginning. Checkpoints track stream offsets and state; always configure them in production streaming jobs.
Written by Alvin Varughese
Founder • 15 professional certifications