Spark on Amazon EMR (for CS 205)
Very Important: Please terminate the cluster as soon as you are done. Otherwise, you will continue to be charged!
Prerequisites: You need an AWS Educate account and a working SSH setup.
Obtain an AWS account if you have not done so already. Please visit:
AWS Educate
SSH keypair (one time)
Instructions for quick EMR cluster creation (no Jupyter or IPython notebook)
(You can run Spark jobs with spark-submit and pyspark.)
Instructions for EMR with Jupyter Notebook (takes at least 30 minutes!)
(This is based on the AWS blog: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/)
- Log in to the AWS console:
https://aws.amazon.com/emr/
- Click on Services at the top left and choose EMR from the panel of services
- Click on "Create Cluster" and choose "Advanced Options"
- "Software Configuration": Add Spark to the list and choose "Next" at the bottom
- "Hardware": You can leave the default hardware (m3.xlarge) and choose the number of
core instances (the defaults are 1 master and 2 core instances). You will be charged for
these instances. To see how much you will be charged, visit:
https://aws.amazon.com/emr/pricing/
- General Options:
- choose a name
- uncheck termination protection
- choose Bootstrap Action:
- Add Bootstrap Action --> choose "Custom Action"
--> click "Configure and Add"
--> Choose any name
--> Copy and paste the following in "Script Location"
s3://emr-scripts-kindires/install-jupyter-emr-12212016.sh
- Security ---> if you have already created SSH keys (see "SSH keypair (one time)"), select that key for the EC2 keypair.
You need this to SSH into the cluster.
- Create cluster
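The console steps above can also be expressed as a script. The following is a minimal sketch, not part of the original instructions: it assumes the boto3 library is installed and AWS credentials are configured, and that the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in the account. The function names, the bootstrap action name, and the reading of the default sizes as 1 master plus 2 core instances are illustrative assumptions. The release label, key pair name, and region are left as arguments because the instructions do not specify them.

```python
# Bootstrap script location taken from the instructions above.
JUPYTER_BOOTSTRAP_SCRIPT = "s3://emr-scripts-kindires/install-jupyter-emr-12212016.sh"


def build_cluster_request(name, release_label, key_name,
                          core_instances=2, instance_type="m3.xlarge"):
    """Return keyword arguments for the EMR client's run_job_flow call.

    Mirrors the console choices: Spark added, default hardware,
    termination protection off, one custom bootstrap action.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # assumption: you pick an EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            # Assumption: one master node plus the chosen number of core nodes.
            "InstanceCount": 1 + core_instances,
            "Ec2KeyName": key_name,
            # Keep the cluster up after creation so you can SSH into it.
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        "BootstrapActions": [{
            "Name": "install-jupyter",  # hypothetical name; any name works
            "ScriptBootstrapAction": {"Path": JUPYTER_BOOTSTRAP_SCRIPT},
        }],
        # Assumption: the default EMR roles exist in this account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def create_cluster(request, region_name):
    """Submit the request and return the new cluster ID. This incurs charges."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Building the request separately from submitting it lets you inspect the settings before anything is launched. Calling create_cluster starts billing, so terminate the cluster as soon as you are done.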
Terminating the cluster
You can terminate the cluster by pressing the terminate button on the cluster "dashboard".
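If you prefer to check from a script that nothing is left running, the sketch below shows one possible approach. It is an illustration rather than part of the original instructions: it assumes a boto3 EMR client (for example, boto3.client("emr", region_name=...)) is passed in, and the function name and the idea of matching clusters by name are hypothetical choices. It only looks at the first page of list_clusters results, which is enough for a handful of clusters.

```python
# Cluster states that still accrue charges.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def terminate_clusters_named(emr_client, name):
    """Terminate every active cluster with the given name.

    Returns the list of cluster IDs for which termination was requested.
    """
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    ids = [c["Id"] for c in response["Clusters"] if c["Name"] == name]
    if ids:
        # Fails if termination protection is still enabled on a cluster.
        emr_client.terminate_job_flows(JobFlowIds=ids)
    return ids
```

An empty return value means no active cluster with that name was found. It is still worth confirming in the console that the cluster state has changed to terminated.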
Working with Spark on the EMR cluster
Copyright © 2024 The President and Fellows of Harvard College