Working on the EMR cluster (CS 205)

Very Important: Please terminate the cluster as soon as you are done. Otherwise, you will continue to be charged!

Once the cluster is ready

Its status will show as "Waiting", and the cluster page will list the address of the master node followed by an "SSH" link:
Master public DNS: ec2-##-##-##-##.compute-1.amazonaws.com  SSH

ssh:

Clicking "SSH" shows instructions for connecting to the master node with ssh.

Copying files to the Hadoop file system (HDFS)

Before you can work with any input files, they need to be loaded into HDFS.

For example, for a file "file.txt" (or directory "data"):

hdfs dfs -put file.txt 

hdfs dfs -put data
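
To confirm that an upload worked, you can list what is now on HDFS. The following is a minimal sketch, assuming it runs on the master node and uses the example names "file.txt" and "data" from above. With no destination given, -put copies into your HDFS home directory.

```shell
# Example names from above; replace them with your own file and directory.
FILE="file.txt"
DIR="data"

if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -ls                          # list your HDFS home directory
    hdfs dfs -ls "$DIR"                   # list the contents of the uploaded directory
    hdfs dfs -cat "$FILE" | head -n 5     # show the first lines of the uploaded file
else
    echo "hdfs not found: run these commands on the EMR master node"
fi
```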

Running Spark

You can run Spark from the command line with:

spark-submit <python_script>

or

spark-submit <python_script> <input files or other arguments>

Note that any data file must already be on HDFS.
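
As an illustration of passing options, the sketch below submits the bundled wordcount.py example (its path is given in the Examples section) to YARN with explicit executor settings. The executor count and memory size are illustrative assumptions, not recommended values, and "file.txt" is assumed to be on HDFS already. Options for spark-submit itself go before the script name; arguments for the script go after it.

```shell
# Bundled example script on EMR; INPUT is assumed to be on HDFS already.
SCRIPT="/usr/lib/spark/examples/src/main/python/wordcount.py"
INPUT="file.txt"

if command -v spark-submit >/dev/null 2>&1; then
    # --master yarn: run on the cluster through YARN
    # --num-executors / --executor-memory: illustrative values, adjust for your cluster
    spark-submit --master yarn --num-executors 2 --executor-memory 1g "$SCRIPT" "$INPUT"
else
    echo "spark-submit not found: run this on the EMR master node"
fi
```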

Examples

Important:

For examples 2 and 3, you need to load the /usr/lib/spark/data directory into HDFS first, because both scripts read their input from data/mllib. So do:

hdfs dfs -put /usr/lib/spark/data

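  1. wordcount.py:
    The program is in: /usr/lib/spark/examples/src/main/python
    You can try it on its own source code. To do that, load wordcount.py into HDFS, then run it from the directory above:

    hdfs dfs -put /usr/lib/spark/examples/src/main/python/wordcount.py
    spark-submit wordcount.py wordcount.py

    The first argument is the local script; the second is the input file on HDFS.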
  2. logistic_regression_with_lbfgs_example.py
    Documentation: https://spark.apache.org/docs/latest/mllib-linear-methods.html
    The program is in /usr/lib/spark/examples/src/main/python/mllib; run it from that directory:

    spark-submit logistic_regression_with_lbfgs_example.py
  3. multi_class_metrics_example.py
    Documentation: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
    The program is in /usr/lib/spark/examples/src/main/python/mllib; run it from that directory:

    spark-submit multi_class_metrics_example.py

Jupyter access (formerly IPython notebook)

Once the cluster is up, the Jupyter notebook is available at port 8880.

Before you can connect to it, you need to set up an ssh tunnel. In a terminal (on Mac and Linux), type:
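
A typical tunnel command looks like the sketch below. It assumes the default EMR login user "hadoop" and that Jupyter listens on port 8880 of the master node, as stated above. The key file path is a hypothetical placeholder, and the host name stands for the Master public DNS of your own cluster.

```shell
# Hypothetical placeholders: use your own key file and your cluster's Master public DNS.
KEY_FILE="$HOME/my-emr-key.pem"
MASTER_DNS="ec2-##-##-##-##.compute-1.amazonaws.com"
PORT=8880    # Jupyter port stated above

if [ -f "$KEY_FILE" ]; then
    # -N: run no remote command, only forward the port
    # -L: forward local port $PORT to port $PORT on the master node
    ssh -i "$KEY_FILE" -N -L "$PORT:localhost:$PORT" "hadoop@$MASTER_DNS"
else
    echo "Key file not found: $KEY_FILE"
fi
```

While the tunnel is open, the notebook should be reachable in your browser at http://localhost:8880. Press Ctrl-C in the terminal to close the tunnel.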

Files relevant to the class

1) Large files for word count. You can clone this repository on EMR.

    The data directory in https://github.com/cs109/2015lab8 has text files from the Gutenberg collection of books.

2) IPython notebook with Spark.

    The following is an EMR version of a notebook from CS 109:

    Lab8-Apache-Spark-modified-for-emr.ipynb

    The original notebook is:

    https://github.com/cs109/2015lab8/blob/master/Lab8-Apache-Spark.ipynb

    The above is from the GitHub repository https://github.com/cs109/2015lab8
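
For item 1, getting the Gutenberg text files onto HDFS could be sketched as follows. This assumes git is installed on the master node. The HDFS directory name "gutenberg" is a hypothetical choice, used so that the upload does not clash with the "data" directory loaded for the examples.

```shell
REPO_URL="https://github.com/cs109/2015lab8"
HDFS_DIR="gutenberg"    # hypothetical destination name on HDFS

if command -v hdfs >/dev/null 2>&1 && command -v git >/dev/null 2>&1; then
    git clone "$REPO_URL"                      # creates ./2015lab8
    hdfs dfs -put 2015lab8/data "$HDFS_DIR"    # upload the text files to HDFS
    hdfs dfs -ls "$HDFS_DIR"                   # confirm the upload
else
    echo "git or hdfs not found: run this on the EMR master node"
fi
```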

 

Copyright © 2024 The President and Fellows of Harvard College