ChIP-seq Practice Exercises
**NOTE: When working on O2, be sure to run this in /n/scratch2/ rather than your home directory.**
# ChIP-Seq Analysis Workflow
1. On /n/scratch2/, create a directory using your eCommons user ID as the directory name. Enter that directory and create a new directory within it called `HCFC1_chipseq`.
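For example (a minimal sketch; `ecommons_id` is a placeholder for your own eCommons user ID):

```bash
# Create your personal directory on the scratch filesystem, then the project
# directory inside it (ecommons_id is a placeholder -- use your own ID)
mkdir -p /n/scratch2/ecommons_id/HCFC1_chipseq
cd /n/scratch2/ecommons_id/HCFC1_chipseq
```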
2. You have strong evidence that HCFC1 is the transcription co-factor that associates with your protein of interest. To confirm this hypothesis, you need to find binding regions for HCFC1 and see if they overlap with your current list of regions. The ENCODE project has ChIP-seq data for HCFC1 in HepG2, a human liver cancer cell line; the dataset contains 32 bp single-end reads. We have downloaded this data and made it available for you on O2. (NOTE: If you are interested in finding out more about the dataset, you can find the ENCODE record here.)
a. Set up a project directory structure within the HCFC1_chipseq directory as shown below, and copy over the raw FASTQ files from /n/groups/hbctraining/ngs-data-analysis-longcourse/chipseq/HCFC1 into the appropriate directory:
...
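A minimal sketch of the setup commands, assuming a layout along the lines of `raw_data/`, `results/fastqc/`, `results/bowtie2/`, `logs/`, and `scripts/` (adjust the directory names to match the structure shown above):

```bash
cd /n/scratch2/ecommons_id/HCFC1_chipseq    # ecommons_id is a placeholder

# Create the project directory structure (names are assumptions -- match the tree above)
mkdir -p raw_data results/fastqc results/bowtie2 logs scripts

# Copy the raw FASTQ files into the raw_data directory
cp /n/groups/hbctraining/ngs-data-analysis-longcourse/chipseq/HCFC1/*.fastq raw_data/
```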
- Run FASTQC
- Align reads with Bowtie2 using the parameters we used in class.
NOTE: For the Bowtie2 index you will need to point to the hg19 index files from: /n/groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/
- Change alignment file format from SAM to BAM (can be done using samtools or sambamba)
- Sort the BAM file by read coordinate locations (can be done using sambamba or with samtools)
- Filter to keep only uniquely mapping reads (this will also remove any unmapped reads and duplicates) using sambamba
- Index the final BAM file. This will be useful for visualization and QC.
**NOTE: The script will require positional parameters, and using basename will help with the naming of output files**
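One possible sketch of such a script, assuming the directory layout above, that the Bowtie2 parameters from class were `-q --local`, and that the index prefix is named `genome`; thread counts, file names, and output paths are also assumptions, so adjust them to match what was used in class:

```bash
#!/bin/bash
# Usage: sh chipseq_analysis_on_input_file.sh <path/to/file.fastq>
# (script name, parameters, and output paths are assumptions -- adjust to your setup)
# Load the required modules for fastqc, bowtie2, samtools, and sambamba as needed.

fq=$1                                  # positional parameter: the input FASTQ file
base=$(basename $fq .fastq)            # strip path and extension for output naming

# Bowtie2 hg19 index (prefix name assumed to be "genome")
index=/n/groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome

# 1. Run FastQC on the raw reads
fastqc -o results/fastqc/ $fq

# 2. Align with Bowtie2
bowtie2 -p 2 -q --local -x $index -U $fq -S results/bowtie2/${base}.sam

# 3. Convert SAM to BAM
samtools view -h -S -b -o results/bowtie2/${base}.bam results/bowtie2/${base}.sam

# 4. Sort the BAM by read coordinate
sambamba sort -t 2 -o results/bowtie2/${base}_sorted.bam results/bowtie2/${base}.bam

# 5. Filter to keep uniquely mapping, non-duplicate, mapped reads
sambamba view -h -t 2 -f bam \
  -F "[XS] == null and not unmapped and not duplicate" \
  results/bowtie2/${base}_sorted.bam > results/bowtie2/${base}_aln_final.bam

# 6. Index the final BAM for visualization and QC
samtools index results/bowtie2/${base}_aln_final.bam
```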
...
c. Create a separate job submission script to run the shell script you created in b) on all four .fastq files. You have the option of running this in serial or in parallel; take a look at the automation lesson to help with setting up the job submission script.
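A serial sketch of a SLURM submission script; the partition, resource requests, paths, and script name are all assumptions to adjust:

```bash
#!/bin/bash
#SBATCH -p short                    # partition (assumption -- use what was covered in class)
#SBATCH -t 0-02:00                  # time limit
#SBATCH -c 2                        # number of cores
#SBATCH --mem 8G                    # memory
#SBATCH --job-name hcfc1_chipseq
#SBATCH -o %j.out                   # standard output
#SBATCH -e %j.err                   # standard error

# Run the analysis script from b) on each FASTQ file, one after the other (serial)
cd /n/scratch2/ecommons_id/HCFC1_chipseq     # ecommons_id is a placeholder

for fq in raw_data/*.fastq
do
  sh scripts/chipseq_analysis_on_input_file.sh $fq
done
```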
...
OPTIONAL: If you are feeling very ambitious, get X11 set up on your personal account and try running ChIPQC to create a report for HCFC1. Discuss the quality of the data using the metrics in the report.
X11 setup instructions are on this page under the "SSH X11 Forwarding" subheader. If you run into problems, please reach out to the HMS Research Computing folks; this setup is known to be tricky!
f. Sort each of the narrowPeak files using:
...
sort -k8,8nr HCFC1-rep2_peaks.narrowPeak > HCFC1-rep2_peaks_sorted.narrowPeak
g. Use IDR to assess reproducibility between replicates using the sorted peaks from 2f. Use the awk command from class to determine how many peaks meet an IDR cutoff of 0.05.
...
**NOTE 2: Just perform the one step we did in class to generate the IDR stats; there is no need to do the remaining two steps in the IDR workflow. (However, you can optionally do them if you are interested.)**
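A sketch of that single IDR step, assuming peaks are ranked by p-value as in class; in the IDR narrowPeak output, column 5 holds the scaled IDR score (-125 * log2(IDR)), so an IDR threshold of 0.05 corresponds to a score of 540:

```bash
# Run IDR on the two sorted replicate peak files
# (--rank p.value is an assumption about what was used in class)
idr --samples HCFC1-rep1_peaks_sorted.narrowPeak HCFC1-rep2_peaks_sorted.narrowPeak \
    --input-file-type narrowPeak \
    --rank p.value \
    --output-file HCFC1-idr \
    --plot \
    --log-output-file HCFC1.idr.log

# Count peaks passing an IDR threshold of 0.05
# (column 5 is the scaled IDR score; 540 = -125 * log2(0.05))
awk '{if($5 >= 540) print $0}' HCFC1-idr | wc -l
```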
h. These high-confidence peaks from IDR can be used to explore downstream tools for functional enrichment and motif discovery. Use GREAT to annotate the peaks and write those annotations to file. Take a look at the binding site locations relative to the TSS; what does this tell us?
Evaluate the GO enrichment analysis from GREAT; what terms do you see over-represented?
i. **OPTIONAL:** Try motif analysis using the MEME suite and comment on the results you find.
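One way to sketch this, assuming bedtools is available to extract the peak sequences and using MEME-ChIP from the MEME suite; the genome FASTA path and file names are placeholders:

```bash
# Extract sequences for the high-confidence peak regions
# (paths and file names are placeholders -- adjust to your files)
bedtools getfasta -fi /path/to/hg19.fa \
    -bed HCFC1_high_confidence_peaks.bed \
    -fo HCFC1_high_confidence_peaks.fa

# Run MEME-ChIP for de novo motif discovery on the peak sequences
meme-chip -oc meme_chip_results HCFC1_high_confidence_peaks.fa
```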
...