Practice Exercises Answer Key

Session V: ChIP-seq and optional questions - Answer Key

****NOTE: The questions for Session V homework are more comprehensive to help you get a better feel for the workflow and will involve working with full datasets. Therefore, for this homework we suggest that you do not work in your home directory. Follow the instructions below to work on /n/scratch2.****

# ChIP-Seq Analysis Workflow

1. Create a directory on /n/scratch2 using your eCommons user ID as the directory name. Within that directory create a directory called HCFC1_chipseq.

$ mkdir -p /n/scratch2/eCommons/HCFC1_chipseq

2. You have strong evidence that HCFC1 is the transcription co-factor that associates with your protein of interest. To confirm this hypothesis you need to find binding regions for HCFC1 and see if they overlap with your current list of regions. The ENCODE project has ChIP-seq data for HCFC1 available for download. The dataset is using a human liver cancer cell line HepG2 and the data contains 32 bp single end reads. We have downloaded this data and made it available for you on Orchestra.

a. Setup a project directory structure within the HCFC1_chipseq directory as shown below and copy over the raw FASTQ files from /groups/hbctraining/ngs-data-analysis2016/chipseq/HCFC1/ into the appropriate directory:

├── HCFC1_chipseq
      ├── data
          └── raw_fastq
      ├── results/
          ├── fastQC
          ├── bowtie2
          ├── unique_align
          ├── macs2
          └── sambamba

$ cd /n/scratch2/eCommons/HCFC1_chipseq

$ mkdir -p data/raw_fastq results/

$ cd results/

$ mkdir fastQC bowtie2 unique_align macs2 sambamba

$ cp /groups/hbctraining/ngs-data-analysis2016/chipseq/HCFC1/* /n/scratch2/eCommons/HCFC1_chipseq/data/raw_fastq

b. Create a shell script that uses positional parameters and takes in a .fastq file as input and will perform the following:

1. 1. Run FASTQC on all FASTQ files
  2. Align reads with bowtie2 using the parameters we used in class. You will need to copy over the hg19 index files from /groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/
  3. Change alignment file format from SAM to BAM (can be done using samtools or sambamba)
  4. Sort the BAM file by read coordinate locations (can be done using sambamba or with samtools)
  5. Filter to keep only uniquely mapping reads (this will also remove any unmapped reads) using sambamba
  6. Index the final BAM file. This will be useful for visualization and QC (deepTools)

**NOTE: The script will require positional parameters, and using basename will help with the naming of output files**

#!/bin/bash

# Take the input FASTQ file and create a short name for output files
fq=$1
NAME=`basename $fq .fastq`

**NOTE: make sure you load the required modules for each step

**NOTE: your final filtered BAM files should be located in unique_align

# USAGE: sh chipseq_align_on_input_file.sh <fastq file>

# This script will take the location and name of a fastq file and perform the following steps on it in a new directory.

## starting with fastqc,

## alignment with Bowtie2,

## SAM to BAM conversion with samtools,

## BAM sorting with sambamba,

## filtering to keep only uniquely mapping reads with sambamba,

## index BAM file.

#!/bin/bash

fq=$1

NAME=`basename $fq .fastq` # shorten the name of the file

echo "****Starting analysis on $NAME****"

# Loading all the modules

module load seq/fastqc/0.11.3

export PATH=/opt/bcbio/centos/bin:$PATH

# Set paths

baseDir=/n/scratch2/eCommons/HCFC1_chipseq

# FastQC on raw file

echo "****FastQC on raw FASTQ****"

fastqc $fq

echo "****Moving fastQC files****"

mv ${baseDir}/data/*fastqc* ${baseDir}/results/fastQC

# Align trimmed file

echo "****Running Bowties2 to align raw reads****"

bowtie2 -p 6 -q -x /groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -U $fq -S ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.sam

# SAM to BAM

echo "****SAM to BAM conversion****"

samtools view -h -S -b ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.sam -o ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.bam

# Sort BAM

echo "****Sort and Index BAM file****"

sambamba sort -t 6 -o ${baseDir}/results/sambamba/${NAME}_aln_sorted.bam ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.bam

# Filter to keep only uniquely aligned reads

echo "****Filtering reads for unique mappings****"

sambamba view -h --nthreads 1 -f bam -F "[XS] == null and not unmapped " ${baseDir}/results/sambamba/${NAME}_aln_sorted.bam > ${baseDir}/results/unique_align/${NAME}_aln.bam

# Index BAM file containing uniquely aligned reads

echo "****Indexing BAM file****"

samtools index ${baseDir}/results/unique_align/${NAME}_aln.bam

c. Create a separate job submission script to run the shell script you created in b) on all four .fastq files. You have the option of running this in serial or parallel. Take a look at the lessons from the RNA-Seq session to help with setting up the job submission script.

#!/bin/bash

for fq in /n/scratch2/eCommons/HCFC1_chipseq/data/raw_fastq/*.fastq

do

bsub -q priority -n 6 -W 1:30 -R "rusage[mem=8000]" -J chipseq_HCFC1 -o %J.out -e %J.err "sh chipseq_align_on_input_file.sh $fq"

sleep 1

done

d. Once the job submission script has finished running, use MACS2 to call peaks for each replicate using the input and treat BAM files and setting a q-value threshold of 0.05. How many peaks are being called for each replicate?

$ cd /n/scratch2/eCommons/HCFC1_chipseq

$ macs2 callpeak -t results/unique_align/HCFC1_treat_rep1_aln.bam -c results/unique_align/HCFC1_input_rep1_aln.bam -f BAM -g hs -q 0.05 --bdg --outdir results/macs2 -n HCFC1-rep1

$ macs2 callpeak -t results/unique_align/HCFC1_treat_rep2_aln.bam -c results/unique_align/HCFC1_input_rep2_aln.bam -f BAM -g hs -q 0.05 --bdg --outdir results/macs2 -n HCFC1-rep2

Replicate1 has 8418 peaks

Replicate 2 has 10899 peaks

e. Sort each of the narrowPeak files using:

sort -k8,8nr HCFC1-rep1_peaks.narrowPeak > HCFC1-rep1_peaks_sorted.narrowPeak

sort -k8,8nr HCFC1-rep2_peaks.narrowPeak > HCFC1-rep2_peaks_sorted.narrowPeak

f. Use phantompeakqualtools to check the quality of the ChIP signal for HCFC1-rep1 and HCFC1-rep2. Report the NSC, RSC, and QualityTag values. What can you conclude about the quality of the data based on these quality measures?

# Downloaded the phantompeakqualtools to my unique_align/ folder, but you could have just used the one we downloaded in class

wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/phantompeakqualtools/ccQualityControl.v.1.1.tar.gz

tar -xzf ccQualityControl.v.1.1.tar.gz

# Organized a directory structure for QC

cd phantompeakqualtools/

mkdir qual logs

module load stats/R/3.2.1 seq/samtools/1.2

# Ran phantompeakqualtools on all ChIP bam files

for bam in ../*treat*; do bam2=`basename $bam _aln.bam`; Rscript run_spp.R -c=$bam -savp -out=qual/${bam2}.qual > logs/${bam2}.Rout; done

HCFC1-rep1: NSC = 1.138398, RSC = 1.316096, and QualityTag = 1

HCFC1-rep2: NSC = 1.383237, RSC = 1.436852, and QualityTag = 1

Values of NSC > 1.1 and RSC > 1 indicate ChIP experiments with good signal to noise for the peak calls. This is reflected in the QualityTag value of 1, which indicates 'High' quality ChIP data.

g. Use deepTools to explore the correlation between samples using a heatmap. How well do the read coverages for the replicate samples correlate? Upload the heatmap to your homework folder.

# Created a 'deeptools' directory within 'unique_align' folder and changed directories into it

# Running mutliBamSummary to create coverage estimates for each sample

multiBamSummary bins --ignoreDuplicates -p 6 --bamfiles ../*aln.bam -out deeptools_multiBAM.out.npz --outRawCounts readCounts.tab

# Creating heatmap

plotCorrelation --corData deeptools_multiBAM.out.npz --plotFile deepTools_heatmap.png --corMethod pearson --whatToPlot heatmap --labels Input_Rep1 Input_Rep2 Treat_Rep1 Treat_Rep2 --plotNumbers

The coverage profiles for the replicate samples correlate very well, and we see stark differences in coverage profiles between the input and ChIP groups.

h. Use IDR to assess reproducibility between replicates using the sorted peaks in 2e. **NOTE: you would normally set up IDR on peaks called using a more liberal threshold, but for this question we will work with what we have.** Use the awk command from class to determine how many peaks meet and IDR of 0.05?

$ mkdir /n/scratch2/eCommons/HCFC1_chipseq/results/idr

$ cd /n/scratch2/eCommons/HCFC1_chipseq/results/

$ idr --samples macs2/HCFC1-rep1_peaks_sorted.narrowPeak macs2/HCFC1-rep2_peaks_sorted.narrowPeak \

--input-file-type narrowPeak \

--rank q.value \

--output-file idr/HCFC1-idr \

--plot \

--log-output-file idr/HCFC1.idr.log

The total number of peaks that pass the threshold of IDR < 0.05 is 2529 peaks.

awk '{if($12 > 1.3) print $0}' idr/HCFC1-idr | wc -l

i. SUGGESTION: Use the resulting peaks from IDR to explore downstream tools for functional enrichment (GREAT) and motif discovery (MEME suite). Explain how these tools might help in drawing conclusions about your original hypothesis.

## OPTIONAL:

3. Use the full dataset available at /groups/hbctraining/ngs-data-analysis2016/rnaseq/full_dataset/ to perform the MOV10 analysis as listed below. Please make sure your working directory is NOT your home directory, and that it is in the group directory for your lab.

a. Using the workflow and submission scripts we generated in class, parallelize the RNA-Seq analysis of the whole files. Start with QC and trimming, perform the alignment with STAR, and obtain gene counts. You will have to modify a few things, like the directories, the number of cores, the location of the index, and any other pertinent things. Use the STAR index available at /groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/starIndex/.

Once you have completed all the steps on Orchestra, transfer the files over to your personal computer and perform the differential expression analysis on the counts data using DESeq2. Make sure you submit your workflow on orchestra asjobs as scripts

Report the DE genes identified for the KD and OE samples, in comparison to the Irrel control, at an FDR threshold of 0.05, by uploading a text file with only significant genes and all the other columns output by DESeq2.

b. Use the raw data to run through Sailfish, followed by tximport and DESeq2. Report the DE genes identified for the KD and OE samples, in comparison to the Irrel control, at an FDR threshold of 0.05, by uploading a text file with only significant genes and all the other columns output by DESeq2.

Note: You can create a Sailfish index for yourself using the transcript fasta file here: /groups/bcbio/bcbio/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.fa

c. Get the overlap between the 2 methods after separating the lists into UP-regulated and DOWN-regulated genes, and report the numbers and lists of genes for the comparisons below:

UP-regulated genes in KD identified at FDR 0.05 from Sailfish (2b) and the standard approach (2a)

Down-regulated genes in KD identified at FDR 0.05 from Sailfish (2b) and the standard approach (2a)

UP-regulated genes in OE identified at FDR 0.05 from Sailfish (2b) and the standard approach (2a)

DOWN-regulated genes in OE identified at FDR 0.05 from Sailfish (2b) and the standard approach (2a)