Working on the EMR cluster (CS 205)
Very Important: Please terminate the cluster as soon as you are done. Otherwise, you will continue to be charged!
Once the cluster is ready
If you click on "SSH", you will get instructions on how to ssh into the cluster.
Copying files to the Hadoop file system (HDFS)
Before you can work with any input files, they need to be loaded into HDFS.
For example, for a file "file.txt" (or directory "data"):
hdfs dfs -put file.txt
hdfs dfs -put data
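To confirm that an upload worked, it can help to list the uploaded item afterwards. The sketch below wraps the two commands in Python. It assumes the hdfs command is on the PATH (as on the cluster's master node) and relies on the fact that, with no destination given, -put stores the item under its base name in your HDFS home directory. The helper names put_command, ls_command, and upload are hypothetical and exist only for illustration.

```python
import os
import subprocess


def put_command(local_path):
    """Build the `hdfs dfs -put` command for a local file or directory."""
    return ["hdfs", "dfs", "-put", local_path]


def ls_command(local_path):
    """Build the `hdfs dfs -ls` command for the uploaded copy.

    With no destination given, -put stores the item under its base name
    in the HDFS home directory, so a relative -ls on that name finds it.
    """
    name = os.path.basename(os.path.normpath(local_path))
    return ["hdfs", "dfs", "-ls", name]


def upload(local_path):
    """Upload to HDFS, then list the result; raises if either step fails."""
    subprocess.run(put_command(local_path), check=True)
    subprocess.run(ls_command(local_path), check=True)


# Example (only meaningful on a machine that has the hdfs command):
# upload("file.txt")
```

Building the commands as lists keeps paths with spaces intact, and check=True makes a failed upload stop the script instead of passing silently.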
Running Spark
You can run Spark from the command line with:
spark-submit <python_script>
or
spark-submit <python_script> <files or other options>
Note that any data file should already be on HDFS.
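As an illustration of what a <python_script> can look like, here is a minimal word-count sketch modeled loosely on Spark's bundled wordcount.py example. The file name my_wordcount.py, the application name, and the helper split_words are hypothetical. The sketch assumes the input file has already been put on HDFS and that PySpark is available, as it is when the script is launched through spark-submit.

```python
import sys
from operator import add


def split_words(line):
    """Split one line of text into words on whitespace."""
    return line.split()


def main(path):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MyWordCount").getOrCreate()
    # Each row of the text DataFrame holds one line of the input file.
    lines = spark.read.text(path).rdd.map(lambda row: row[0])
    counts = (lines.flatMap(split_words)
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: spark-submit my_wordcount.py <file>", file=sys.stderr)
```

Saved as my_wordcount.py, it could be run on a file already on HDFS with: spark-submit my_wordcount.py file.txt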
Examples
Important: for examples 2 and 3 below (the two mllib programs), you first need to load the /usr/lib/spark/data directory into HDFS:
hdfs dfs -put /usr/lib/spark/data
- wordcount.py:
The program is in /usr/lib/spark/examples/src/main/python
You can try it on itself. To do that, load wordcount.py into HDFS, then run it from the program's directory (or give spark-submit the full path to the script):
hdfs dfs -put /usr/lib/spark/examples/src/main/python/wordcount.py
spark-submit wordcount.py wordcount.py
- logistic_regression_with_lbfgs_example.py:
Visit: https://spark.apache.org/docs/latest/mllib-linear-methods.html
The program is in /usr/lib/spark/examples/src/main/python/mllib
spark-submit logistic_regression_with_lbfgs_example.py
- multi_class_metrics_example.py:
Visit: https://spark.apache.org/docs/latest/mllib-linear-methods.html
The program is in /usr/lib/spark/examples/src/main/python/mllib
spark-submit multi_class_metrics_example.py
Jupyter access (formerly IPython notebook)
Once the cluster is up, the Jupyter notebook is available on port 8880.
Before you can connect to it, you need to set up an ssh tunnel. In a terminal (on Mac and Linux), type:
ssh -i ~/mykeypair.pem -N -L 8880:ec2-###-##-##-###.compute-1.amazonaws.com:8880 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com
where ec2-###-##-##-###.compute-1.amazonaws.com is your machine's DNS name displayed on the cluster dashboard. You also need to substitute your own key pair file for mykeypair.pem.
For Windows, follow:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html#emr-ssh-tunnel-win
- The notebook will then be available in a browser at http://localhost:8880. The password is "jupyter".
- See below [under (2)] for an example Jupyter notebook that can be run on EMR.
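The parts of the tunnel command can be easier to see when they are assembled from named pieces. The following sketch builds the same ssh command given above. The function name tunnel_command is hypothetical, and the host name in the example is the placeholder from the instructions, not a real address; you would pass your own key file and the DNS name from the cluster dashboard.

```python
def tunnel_command(key_file, master_dns, port=8880, user="hadoop"):
    """Return the ssh argument list that forwards localhost:port to the master."""
    return [
        "ssh",
        "-i", key_file,   # private key file of your key pair
        "-N",             # run no remote command, only forward the port
        "-L", "%d:%s:%d" % (port, master_dns, port),  # local port:host:remote port
        "%s@%s" % (user, master_dns),
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own key file and DNS name.
    cmd = tunnel_command("~/mykeypair.pem",
                         "ec2-###-##-##-###.compute-1.amazonaws.com")
    print(" ".join(cmd))
```

The printed line can be pasted into a terminal once the placeholders are replaced. While the tunnel is running, the terminal shows no output; stop it with Ctrl-C when you are done with the notebook.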
Files relevant to the class
1) Large files for word count.
The data directory in https://github.com/cs109/2015lab8 has text files from the Gutenberg collection of books. You can clone this repository on EMR.
2) IPython notebook with Spark.
The following is an EMR version of a notebook from CS 109:
Lab8-Apache-Spark-modified-for-emr.ipynb
The original notebook is:
https://github.com/cs109/2015lab8/blob/master/Lab8-Apache-Spark.ipynb
The original notebook is from the GitHub repository https://github.com/cs109/2015lab8.
Copyright © 2024 The President and Fellows of Harvard College