Practice Exercises

ChIP-seq Practice Exercises

****NOTE: When working on O2, be sure to run this in /n/scratch2/ rather than your home directory .****

# ChIP-Seq Analysis Workflow

1. Create a directory called your eCommons ID within /n/scratch2/. Enter that directory and create a new directory called HCFC1_chipseq.

2. You have strong evidence that HCFC1 is the transcription co-factor that associates with your protein of interest. To confirm this hypothesis you need to find binding regions for HCFC1 and see if they overlap with your current list of regions. The ENCODE project has ChIP-seq data for HCFC1 using a human liver cancer cell line HepG2 which contains 32 bp single end reads. We have downloaded this data and made it available for you on Orchestra. (NOTE: If you are interested in finding out more about the dataset you can find the ENCODE record here).

a. Setup a project directory structure within the HCFC1_chipseq directory as shown below and copy over the raw FASTQ files from /n/groups/hbctraining/ngs-data-analysis-longcourse/chipseq/HCFC1 into the appropriate directory:

├── HCFC1_chipseq/
      ├── raw_data/
      ├── logs/
      ├── meta/
      ├── scripts/
      ├── results/
         ├── fastQC/
         ├── bowtie2/
         ├── unique_align/
         ├── macs2/
         └── sambamba/

b. Create a shell script that uses positional parameters and takes in a .fastq file as input and will perform the following:

1. 1. Run FASTQC
  2. Align reads with Bowtie2 using the parameters we used in class.
    NOTE: For the Bowtie2 index you will need to point to the hg19 index files from: /n/groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/
  3. Change alignment file format from SAM to BAM (can be done using samtools or sambamba)
  4. Sort the BAM file by read coordinate locations (can be done using sambamba or with samtools)
  5. Filter to keep only uniquely mapping reads (this will also remove any unmapped reads) using sambamba

**NOTE: The script will require positional parameters, and using basename will help with the naming of output files**

#!/bin/bash

# Take the input FASTQ file and create a short name for output files
fq=$1
NAME=`basename $fq .fastq`

**NOTE: make sure you load the required modules for each step

**NOTE: your final filtered BAM files should be located in unique_align

c. Create a separate job submission script to run the shell script you created in b) on all four .fastq files. You have the option of running this in serial or parallel. Take a look at the lessons from the RNA-Seq session to help with setting up the job submission script.

d. Once the job submission script has finished running, use MACS2 to call peaks for each replicate using the input and treat BAM files and setting a q-value threshold of 0.05. How many peaks are being called for each replicate?

e. Use phantompeakqualtools to check the quality of the ChIP signal for HCFC1-rep1 and HCFC1-rep2. Report the NSC, RSC, and QualityTag values. What can you conclude about the quality of the data based on these quality measures?

f. Run ChIPQC and upload the report for HCFC1. Discuss the quality of the data using these metrics.

g. Sort each of the narrowPeak files using:

sort -k8,8nr HCFC1-rep1_peaks.narrowPeak > HCFC1-rep1_peaks_sorted.narrowPeak

sort -k8,8nr HCFC1-rep2_peaks.narrowPeak > HCFC1-rep2_peaks_sorted.narrowPeak

h. Use IDR to assess reproducibility between replicates using the sorted peaks in 2g. Use the awk command from class to determine how many peaks meet an IDR cutoff of 0.05?

**NOTE 1: You would normally set up the IDR run on peaks called using a more liberal threshold, but for this question we will work with what we have.**

**NOTE 2: Just perform the 1 step we did in class to generate the IDR stats; there is no need to do the remaining 2 steps in the IDR workflow. (However, you can do them optionally if you are interested and we are happy to help you troubleshoot if need be.)**

i. These high confidence peaks from IDR can be used to explore downstream tools for functional enrichment and motif discovery. Use ChIPseeker to do the following:

Plot the read count frequency relative to the TSS (use a window of +/- 1000 bp). Upload this figure.

Annotate the peaks and write those annotations to file. Upload the results table.

Plot a pie chart using the genomic region annotations. Which genomic feature is the most represented?

Perform a GO enrichment analysis using the gene annotations (enrichGO) and parameters used in class. If there are any terms enriched output the results to file and visualize using a dotplot. Provide the code you used and upload figures and results table (if any).

h. **OPTIONAL:** Try motif analysis using the MEME suite and comment on the results you find.

Practice Exercises

[data-colorid=rdkoheikar]{color:#cc0000} html[data-color-mode=dark] [data-colorid=rdkoheikar]{color:#ff3333}[data-colorid=ty8u99cj2v]{color:#cc0000} html[data-color-mode=dark] [data-colorid=ty8u99cj2v]{color:#ff3333}ChIP-seq Practice Exercises

# ChIP-Seq Analysis Workflow

ChIP-seq Practice Exercises