Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.4.2. Amazon EMR and Apache Spark

💡 First Principle: EMR gives you a full Hadoop/Spark cluster under your control. Unlike Glue where AWS manages everything, EMR lets you choose instance types, install custom libraries, run multiple frameworks simultaneously (Spark, Hive, Presto, HBase, Flink), and maintain persistent clusters for iterative workloads. The trade-off: more power and flexibility, but more operational responsibility.

When does the exam point to EMR over Glue? Look for: custom Spark configurations (specific JARs, non-standard libraries), multiple frameworks in the same workflow (Spark + Hive + Presto), persistent clusters for interactive analysis (EMR notebooks, JupyterHub), or very large-scale processing (petabyte-scale where fine-tuning cluster configuration matters).

EMR deployment modes:

EMR on EC2 (traditional). You specify instance types, cluster size, and scaling policies. Supports transient clusters (launch, process, terminate) and persistent clusters (always running). Transient clusters are cost-effective for batch jobs; persistent clusters suit interactive workloads.

EMR on EKS. Runs Spark jobs on an existing EKS Kubernetes cluster. Useful when you want to consolidate Spark and other containerized workloads on a shared compute platform.

EMR Serverless. Submits Spark or Hive jobs without provisioning clusters. AWS manages compute resources automatically. This closes the gap between EMR and Glue — EMR Serverless offers Glue's operational simplicity with EMR's full Spark compatibility. The exam may test this distinction: if a question mentions "serverless Spark with full open-source compatibility," EMR Serverless is the answer.

FeatureAWS GlueEMR on EC2EMR Serverless
ManagementFully serverlessCluster management requiredServerless
FrameworksSpark, Python Shell, RaySpark, Hive, Presto, HBase, Flink, etc.Spark, Hive
Custom JARsLimitedFull controlFull control
Persistent computeNoYesNo
Cost modelPer DPU-secondPer instance-hourPer vCPU/memory-hour
Best forServerless ETL, data catalog integrationComplex multi-framework workloadsServerless Spark with OSS compat

⚠️ Exam Trap: "Use Apache Spark" doesn't automatically mean EMR. Glue also runs Spark. The differentiator is whether the question requires cluster customization, multiple frameworks, or specific Spark versions. If none of these are mentioned and the question emphasizes "least operational overhead," Glue is preferred.

Reflection Question: A team wants to run PySpark jobs that use a custom machine learning library not available in Glue's environment. They need the jobs to run nightly and terminate after completion. Which EMR deployment mode minimizes cost and operational effort?

Alvin Varughese
Written byAlvin Varughese
Founder15 professional certifications