Tips for Cannon

Below are tips that we’ve collected over the 10+ years we’ve spent using, administering, managing, and helping others use supercomputers.

  1. Thank you for using supercomputers. You are making the world a better place by doing your research with less sampling, larger data sets, etc. We’re all in the business of better science!

  2. The best way to raise your lab’s priority on Cannon is to coordinate usage within your lab so that less of the cluster is consumed. The second-best way is to buy hardware for it. Contact me if that is of interest.

  3. Be mindful of other researchers. Try to be precise with your usage. 

  4. Pay attention to the MOTD (the message shown when you log in). Check that the reported last login is one you recognize, and take note of any new changes and downtime.

Interactive Use

  1. Use salloc to get a dynamic allocation on a node and use it in real time.
    salloc -p seas_gpu,gpu --time=0-4 -c 1 --mem=8000 --gres=gpu:4

Note: A policy limits interactive usage on seas_gpu to 1 core and less than six hours.
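
For a CPU-only interactive session, a similar sketch using the CPU partitions described under Batch Use below:
    salloc -p seas_compute,shared --time=0-4 -c 1 --mem=8000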

Batch Use

  1. The following are changes you can make within your SBATCH job files:

    1. Get emails with job statuses:
      #SBATCH --mail-user=your_email@seas.harvard.edu
      #SBATCH --mail-type=ALL

    2. There are many partitions on Cannon that you can use. By far the best way to learn which partitions you have access to is to run spart (see the quick check after this list). Regardless, here are the main ones:

      1. seas_compute: 90 compute nodes with a total of 4,488 cores that are available to all SEAS researchers.

      2. seas_gpu: 43 nodes with 195 GPUs and 2,368 cores that are available to all SEAS researchers who need a GPU.

      3. sapphire: 192 nodes with 21,504 cores available to any researcher at Harvard.

      4. gpu: 36 nodes with 144 GPUs available to any researcher at Harvard who needs a GPU.

      5. gpu_requeue: 172 nodes with 762 GPUs, available to any researcher at Harvard who needs a GPU and wants jobs to start sooner, with the caveat that your jobs can be requeued by researchers with higher priority (i.e. those who use Cannon less frequently, have purchased more hardware, etc.).

      6. shared: 300 nodes with a total of 14,400 cores available to any researcher at Harvard who DOES NOT need a GPU.

      7. serial_requeue: 1,475 nodes with a total of 90,820 cores, available to any researcher who DOES NOT need a GPU and wants jobs to start sooner, with the same requeue caveat as gpu_requeue. If you have longer-running jobs, see DMTCP below.

      8. bigmem: 4 nodes with nearly a terabyte of memory and a total of 448 cores, available to any researcher who does not need a GPU.

      9. Your lab’s partition: Talk with your lab members. You may have one.
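
      A quick check of which partitions you have access to and how busy they are (spart is mentioned above and installed on Cannon; the sinfo line simply lists node states for two of the partitions named above):
      spart
      sinfo -p seas_compute,seas_gpu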

    3. To use these partitions, string them together like so:

      CPU
      #SBATCH --partition=seas_compute,shared,serial_requeue

      GPU
      #SBATCH --partition=seas_gpu,gpu,gpu_requeue

    4. Declare the number of GPUs that you need:
      #SBATCH --gres=gpu:1

    5. Use a specific type of GPU (in this case, the latest available):
      #SBATCH --constraint=a100

    6. More constraints are possible:

      Network (MPI jobs): holyhdr, holyib, bosib

      GPUs (GPU jobs): a100 (2020-), a40 (2020-), rtx2080ti (2018-2020), v100 (2017-2020), titanv (2017-2018), 1080 (2017-2018), titanx (2016-2017), m40 (2015-2017), k80 (2014-2015), k20m (2012-2014)

      Processor: intel, amd

      Processor Family: icelake (Intel 2019-), cascadelake (Intel 2019-), westmere (Intel 2010-), skylake (Intel 2015-2019), broadwell (Intel 2014-2018), haswell (Intel 2013-), ivybridge (Intel 2012-2015), abudhabi (AMD 2012-2017)

      x86 Extensions: avx512 (Intel 2016-), interlagos (AMD 2003-2017), fma4 (AMD 2011-2014), avx2 (Intel/AMD 2011-), avx (Intel/AMD 2011-)

      CUDA Compute Capabilities: cc8.6, cc7.5, cc7.0, cc6.1, cc6.0, cc5.2, cc3.7, cc3.5

    7. For example, if you wanted the latest available processor and a whole node to yourself:
      #SBATCH --constraint=icelake
      #SBATCH -N 1 # one node
      #SBATCH --exclusive # use the entire node

    8. You can also state a preference while still accepting either:
      #SBATCH --constraint="icelake|cascadelake"
      #SBATCH --prefer="icelake"
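
      Features from different rows of the table can be combined with "&" (AND). Whether a node matching a given combination exists depends on the current hardware, so treat this as a sketch:
      #SBATCH --constraint="a100&holyhdr"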

  2. Declare the number of cores, memory, and run time when you run your jobs. This helps the scheduler, SLURM, determine exactly what you need and run your job no matter how low your priority is. (A complete example job file follows the two snippets below.)

    1. For multithreading/GPU:
      #SBATCH --ntasks=1 # One multithreaded program
      #SBATCH --ntasks-per-node=1 # Runs on 1 node
      #SBATCH --cpus-per-task=10 # With 10 CPU cores
      #SBATCH --mem-per-cpu=8G # 8GB per CPU core
      #SBATCH --time=0-05:00:00 # 5 hours

    2. For MPI:
      #SBATCH --ntasks=24 # 24 MPI programs
      #SBATCH --ntasks-per-node=1 # Will use 1 node per task
      #SBATCH --cpus-per-task=10 # 10 cores per node/task
      #SBATCH --mem-per-cpu=8G # 8GB per CPU core
      #SBATCH --time=0-05:00:00 # 5 hours
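
    Putting these pieces together, a minimal sketch of a complete job file for the multithreaded case; the job name, log file names, and my_program.py are placeholders for your own:

      #!/bin/bash
      #SBATCH --job-name=example # placeholder name
      #SBATCH --partition=seas_compute,shared,serial_requeue
      #SBATCH --ntasks=1 # one multithreaded program
      #SBATCH --cpus-per-task=10 # with 10 CPU cores
      #SBATCH --mem-per-cpu=8G # 8GB per CPU core
      #SBATCH --time=0-05:00:00 # 5 hours
      #SBATCH --mail-user=your_email@seas.harvard.edu
      #SBATCH --mail-type=ALL
      #SBATCH -o example_%j.out # stdout log (%j expands to the job ID)
      #SBATCH -e example_%j.err # stderr log

      # Pass the allocated core count to threaded code instead of hard-coding it
      export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

      python my_program.py

    Save it as example.sbatch and submit it with: sbatch example.sbatch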

  3. To determine any of the values above, try a smaller sample size on your PC or as a test run on Cannon. Use the constraint variables above to see whether the code runs just as fast on older hardware, since older hardware has a lower fairshare cost.
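
    After a test run finishes, you can see what it actually used; a quick sketch, assuming the seff utility (part of Slurm’s contributed tools) is available, with sacct as a fallback:
      seff <job_id>
      sacct -j <job_id> --format=JobID,Elapsed,MaxRSS,TotalCPU,State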

  4. The serial_requeue and gpu_requeue partitions will start your jobs sooner than shared, but the jobs can be canceled and automatically requeued. They are generally faster overall and have half the fairshare cost.
    #SBATCH -p serial_requeue
    -or-
    #SBATCH -p gpu_requeue
    Other partitions, like seas_gpu and seas_gpu_requeue, also exist.
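
    If your program writes its own checkpoints, you can make requeue behavior explicit and keep your logs intact across restarts; a sketch using standard sbatch options:
    #SBATCH --requeue # allow this job to be requeued
    #SBATCH --open-mode=append # append to the log instead of truncating it on a restart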

  5. Long-running jobs that are requeued can end up with much longer total run times. If you are using CPUs rather than GPUs (GPU support is a work in progress), consider checkpointing with DMTCP: https://github.com/jrwellshpc/dmtcp_scripts

  6. To check on your jobs:

    1. Check the status of your jobs with sacct rather than squeue; squeue output can be delayed when the system is under heavy load.
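      For example (your_username is a placeholder):
      sacct -u your_username --format=JobID,JobName,Partition,State,Elapsed,ExitCode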

    2. To get a prediction of when your job will run, try:
      scontrol show job JOBID | grep StartTime

    3. To see how full a partition is:
      showq -p partitionname -o

    4. If you happen to be the only person with jobs pending on a queue, you can hold the jobs you would rather run later and release them when you are ready:
      scontrol hold <job_id>
      scontrol release <job_id>

    5. To effectively promote one of your jobs when many are queued, demote the ones you would rather run later (regular users can only decrease a job’s priority):
      scontrol update job <job_id> nice

    6. If you want to look up your fairshare, first get the header:
      sshare -a | head
      Then:
      sshare -a | grep username # where username is your username
      In all likelihood, your fairshare should be the same as your lab’s fairshare.

Testing for Multiprocessing

  1. After a job launches, you can use squeue to see which node it is running on, and then SSH to that node (e.g. ssh holy2a10301). Once there…
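
    A quick sketch (your_username is a placeholder; the node name appears in the NODELIST column):
      squeue -u your_username
      ssh holy2a10301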

  2. Type top to view processes running on the system. Type u and enter your username to see only your processes. Under %CPU, if you see <= 100% you are only using one processor core on a supercomputer with 100,000+ of them. Type q to exit top.

  3. Type lscpu. The CPU(s) line will tell you how many cores you could be using.

  4. Google your programming language followed by multiprocessing to see how to use more processors. E.g. python multiprocessing

  5. Try Googling your libraries too: python openai multiprocessing

Storage

  1. Tier 0 high-speed storage costs $50/TB/year. Please contact FAS RC (see below) if you are interested, but be mindful that this storage is not backed up at all.

  2. Tier 1 Lab storage (/n/pi_lab) costs $250/TB/year. It is snapshotted throughout the day, has disaster recovery coverage, and can be mounted from on-campus computers. The first 4 TB are free. Please contact FAS RC (see below) if you need more than the default.

  3. Slower Tier 2 storage is available at $100/TB/year. It is encrypted at rest, has disaster recovery coverage, and can be mounted from on-campus computers. Please contact FAS RC (see below) if you are interested.

  4. If you have data that you will not use for a while, consider moving it to Tier 3 tape storage at only $5/TB/year. Please contact FAS RC (see below) if you need more than the default.

  5. More information is available here: https://www.rc.fas.harvard.edu/services/data-storage/

Contacts

  1. To contact FAS RC, email rchelp@rc.fas.harvard.edu
