Working on the EMR cluster (CS 205)

Very Important: Please terminate the cluster as soon as you are done. Otherwise, you will continue to be charged!

Once the cluster is ready, the dashboard will show its status as "Waiting".

Master public DNS:ec2-##-##-##-##.compute-1.amazonaws.com SSH

ssh:

If you click on "SSH", it will give you instructions on how to ssh.

Copying files to the Hadoop file system (HDFS)

Before you can work with any data (input files), you need to load it into HDFS.

For example, for a file "file.txt" (or directory "data"):

hdfs dfs -put file.txt

hdfs dfs -put data
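
To confirm that an upload worked, you can list the path afterwards. The following is a minimal sketch rather than part of the original instructions: upload_and_list is a hypothetical helper name, and it relies on hdfs dfs -put copying into your HDFS home directory when no destination is given.

```shell
# upload_and_list: hypothetical helper (not a Hadoop command).
# Copies a local file or directory into the HDFS home directory,
# then lists it so you can see that it arrived.
upload_and_list() {
    hdfs dfs -put "$1" && hdfs dfs -ls "$(basename "$1")"
}

# Usage on the cluster (uncomment to run):
# upload_and_list file.txt
# upload_and_list data
```

If the listing fails, the upload most likely failed as well, so check the error message from the put step first.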

Running Spark

You can run Spark from the command line with:

spark-submit <python_script>

or

spark-submit <python_script> <files or other options>

Note that any data file should already be on HDFS.
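
As an illustration of the second form, here is a sketch of a command line with a few options and one input file. The script name my_job.py and the resource values are made up for the example; --master, --num-executors and --executor-memory are standard spark-submit options, but suitable values depend on your cluster.

```shell
# Hypothetical names: replace with your own script and input file.
SCRIPT="my_job.py"
INPUT="file.txt"   # assumed to be on HDFS already (hdfs dfs -put file.txt)

# Example options only; tune the numbers for your cluster size.
SUBMIT_ARGS="--master yarn --num-executors 2 --executor-memory 1g"

# Print the command for review; drop the leading "echo" to submit the job.
# SUBMIT_ARGS is left unquoted on purpose so it splits into separate options.
echo spark-submit $SUBMIT_ARGS "$SCRIPT" "$INPUT"
```

Options must come before the script name; everything after the script name is passed to the script itself.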

Examples

Important:

For examples 2 and 3, you need to load the /usr/lib/spark/data directory to HDFS. To do so, run:

hdfs dfs -put /usr/lib/spark/data

1) wordcount.py
The program is in /usr/lib/spark/examples/src/main/python
You can try it on itself. To do that, load wordcount.py to HDFS and run it from the directory above. The first argument is the local script; the second is the copy on HDFS used as input.

hdfs dfs -put /usr/lib/spark/examples/src/main/python/wordcount.py
spark-submit wordcount.py wordcount.py

2) logistic_regression_with_lbfgs_example.py
Visit: https://spark.apache.org/docs/latest/mllib-linear-methods.html
The program is in /usr/lib/spark/examples/src/main/python/mllib

spark-submit logistic_regression_with_lbfgs_example.py

3) multi_class_metrics_example.py
Visit: https://spark.apache.org/docs/latest/mllib-linear-methods.html
The program is in /usr/lib/spark/examples/src/main/python/mllib

spark-submit multi_class_metrics_example.py

Jupyter access (formerly IPython notebook)

Once the cluster is up:

Jupyter notebook is available on port 8880.

Before you can connect to it, you need to set up an SSH tunnel. In a terminal (on Mac and Linux), type:

ssh -i ~/mykeypair.pem -N -L 8880:ec2-###-##-##-###.compute-1.amazonaws.com:8880 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
where ec2-###-##-##-###.compute-1.amazonaws.com is your machine's DNS displayed on the cluster dashboard. You also need to substitute your own key pair file for mykeypair.pem.

For Windows, follow:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html#emr-ssh-tunnel-win

The notebook will then be available in a browser at http://localhost:8880. The password is "jupyter".
See below [under (2)] for an example Jupyter notebook that can be run on EMR.
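
Because the DNS name appears twice in the tunnel command, it is easy to mistype one of the copies. The sketch below builds the same command from variables; the variable names are illustrative, and the key path and DNS value are placeholders that you must replace with your own.

```shell
# Assumed values: replace with your own key file and master DNS.
KEY="$HOME/mykeypair.pem"
MASTER_DNS="ec2-###-##-##-###.compute-1.amazonaws.com"
PORT=8880

# local_port:remote_host:remote_port, as expected by ssh -L
TUNNEL_SPEC="${PORT}:${MASTER_DNS}:${PORT}"

# Print the command for review; drop the leading "echo" to open the tunnel.
echo ssh -i "$KEY" -N -L "$TUNNEL_SPEC" "hadoop@${MASTER_DNS}"
```

The -N option tells ssh not to run a remote command, so the terminal will appear to hang while the tunnel is open. That is expected; press Ctrl-C to close the tunnel when you are done.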

Files relevant to the class

1) Large files for word count. You can clone this on EMR.

It has text files from the Gutenberg collection of books.

2) IPython notebook with Spark.

The following is an EMR version of a notebook from CS 109:

The original notebook is:

The above is from the GitHub repository https://github.com/cs109/2015lab8.