2.4.2. Amazon EMR and Apache Spark
💡 First Principle: EMR gives you a full Hadoop/Spark cluster under your control. Unlike Glue where AWS manages everything, EMR lets you choose instance types, install custom libraries, run multiple frameworks simultaneously (Spark, Hive, Presto, HBase, Flink), and maintain persistent clusters for iterative workloads. The trade-off: more power and flexibility, but more operational responsibility.
When does the exam point to EMR over Glue? Look for: custom Spark configurations (specific JARs, non-standard libraries), multiple frameworks in the same workflow (Spark + Hive + Presto), persistent clusters for interactive analysis (EMR notebooks, JupyterHub), or very large-scale processing (petabyte-scale where fine-tuning cluster configuration matters).
EMR deployment modes:
EMR on EC2 (traditional). You specify instance types, cluster size, and scaling policies. Supports transient clusters (launch, process, terminate) and persistent clusters (always running). Transient clusters are cost-effective for batch jobs; persistent clusters suit interactive workloads.
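A transient cluster can be requested with boto3's `run_job_flow`; setting `KeepJobFlowAliveWhenNoSteps` to `False` makes the cluster shut down once its steps finish, which is what keeps nightly batch jobs cheap. This is a minimal sketch: the cluster name, S3 path, instance sizes, and IAM role names are placeholders (the `EMR_DefaultRole`/`EMR_EC2_DefaultRole` names are the AWS defaults, but your account may use different ones), and the actual API call is left commented out.

```python
# Sketch: a transient EMR-on-EC2 cluster that runs one Spark step and
# terminates itself. Paths, sizes, and role names are placeholders.
# import boto3  # uncomment along with the run_job_flow call below

cluster_config = {
    "Name": "nightly-spark-batch",
    "ReleaseLabel": "emr-7.1.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behavior: terminate once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],  # placeholder
            },
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",  # EC2 instance profile (default name)
    "ServiceRole": "EMR_DefaultRole",      # EMR service role (default name)
}

# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_config)
```

The two settings doing the transient work are `KeepJobFlowAliveWhenNoSteps: False` and the step's `ActionOnFailure`; a persistent cluster would flip the first to `True` and submit steps to the running cluster instead.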
EMR on EKS. Runs Spark jobs on an existing EKS Kubernetes cluster. Useful when you want to consolidate Spark and other containerized workloads on a shared compute platform.
EMR Serverless. Lets you submit Spark or Hive jobs without provisioning clusters; AWS manages compute resources automatically. This closes the gap between EMR and Glue — EMR Serverless offers Glue's operational simplicity with EMR's full open-source Spark compatibility. The exam may test this distinction: if a question mentions "serverless Spark with full open-source compatibility," EMR Serverless is the answer.
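For contrast with the EC2 mode, an EMR Serverless job is submitted with `start_job_run` against a pre-created application; there is no cluster to size or terminate. The application ID, role ARN, script path, and Spark parameters below are illustrative placeholders, and the boto3 call itself is commented out — a sketch of the request shape, not a production config.

```python
# Sketch: submit a PySpark job to an existing EMR Serverless application.
# Application ID, role ARN, and S3 path are placeholders.
# import boto3  # uncomment along with the start_job_run call below

job_run_request = {
    "applicationId": "00example-app-id",  # placeholder: from create_application
    "executionRoleArn": "arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder
    "jobDriver": {
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",  # placeholder script
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
}

# client = boto3.client("emr-serverless")
# response = client.start_job_run(**job_run_request)
```

Note what is absent: no instance types, no cluster size, no termination setting. Capacity is provisioned and released per job run, which is why the table below lists EMR Serverless as having no persistent compute.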
| Feature | AWS Glue | EMR on EC2 | EMR Serverless |
|---|---|---|---|
| Management | Fully serverless | Cluster management required | Serverless |
| Frameworks | Spark, Python Shell, Ray | Spark, Hive, Presto, HBase, Flink, etc. | Spark, Hive |
| Custom JARs | Limited | Full control | Full control |
| Persistent compute | No | Yes | No |
| Cost model | Per DPU-second | Per instance-hour | Per vCPU/memory-hour |
| Best for | Serverless ETL, data catalog integration | Complex multi-framework workloads | Serverless Spark with OSS compat |
⚠️ Exam Trap: "Use Apache Spark" doesn't automatically mean EMR. Glue also runs Spark. The differentiator is whether the question requires cluster customization, multiple frameworks, or specific Spark versions. If none of these are mentioned and the question emphasizes "least operational overhead," Glue is preferred.
Reflection Question: A team wants to run PySpark jobs that use a custom machine learning library not available in Glue's environment. They need the jobs to run nightly and terminate after completion. Which EMR deployment mode minimizes cost and operational effort?