Practice Exercises Answer Key

ChIP-seq Practice Exercises: Answer Key

# ChIP-Seq Analysis Workflow

1. Create a directory on /n/scratch2 using your eCommons user ID as the directory name. Within that directory create a directory called HCFC1_chipseq.

$ mkdir -p /n/scratch2/your_eCommons_ID/HCFC1_chipseq

2. You have strong evidence that HCFC1 is the transcription co-factor that associates with your protein of interest. To confirm this hypothesis you need to find binding regions for HCFC1 and see if they overlap with your current list of regions. The ENCODE project has ChIP-seq data for HCFC1 available for download. The dataset is using a human liver cancer cell line HepG2 and the data contains 32 bp single end reads. We have downloaded this data and made it available for you on Orchestra.

a. Setup a project directory structure within the HCFC1_chipseq directory as shown below and copy over the raw FASTQ files from /n/groups/hbctraining/ngs-data-analysis2016/chipseq/HCFC1/ into the appropriate directory:

├── HCFC1_chipseq
      ├── data
          └── raw_fastq
      ├── results/
          ├── fastQC
          ├── bowtie2
          ├── unique_align
          ├── macs2
          └── sambamba

$ cd /n/scratch2/eCommons/HCFC1_chipseq

$ mkdir -p data/raw_fastq results/

$ cd results/

$ mkdir fastQC bowtie2 unique_align macs2 sambamba

$ cp /n/groups/hbctraining/ngs-data-analysis2016/chipseq/HCFC1/* /n/scratch2/your_eCommons_ID/HCFC1_chipseq/data/raw_fastq

b. Create a shell script that uses positional parameters and takes in a .fastq file as input and will perform the following:

1. 1. Run FASTQC on all FASTQ files
  2. Align reads with bowtie2 using the parameters we used in class. You will need to copy over the hg19 index files from /n/groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/
  3. Change alignment file format from SAM to BAM (can be done using samtools or sambamba)
  4. Sort the BAM file by read coordinate locations (can be done using sambamba or with samtools)
  5. Filter to keep only uniquely mapping reads (this will also remove any unmapped reads) using sambamba
  6. Index the final BAM file. This will be useful for visualization and QC (deepTools)

**NOTE: The script will require positional parameters, and using basename will help with the naming of output files**

#!/bin/bash

# Take the input FASTQ file and create a short name for output files
fq=$1
NAME=`basename $fq .fastq`

**NOTE: make sure you load the required modules for each step

**NOTE: your final filtered BAM files should be located in unique_align

# USAGE: sh chipseq_align_on_input_file.sh <fastq file>

# This script will take the location and name of a fastq file and perform the following steps on it in a new directory.

## starting with fastqc,

## alignment with Bowtie2,

## SAM to BAM conversion with samtools,

## BAM sorting with sambamba,

## filtering to keep only uniquely mapping reads with sambamba,

## index BAM file.

#!/bin/bash

fq=$1

NAME=`basename $fq .fastq` # shorten the name of the file

echo "****Starting analysis on $NAME****"

# Loading all the modules

module load seq/fastqc/0.11.3

module load gcc/6.2.0 bowtie2/2.2.9

module load samtools/1.3.1

export PATH=/n/app/bcbio/tools/bin:$PATH

# Set paths

baseDir=/n/scratch2/your_eCommons_ID/HCFC1_chipseq

# FastQC on raw file

echo "****FastQC on raw FASTQ****"

fastqc $fq

echo "****Moving fastQC files****"

mv ${baseDir}/data/*fastqc* ${baseDir}/results/fastQC

# Align trimmed file

echo "****Running Bowtie2 to align raw reads****"

bowtie2 -p 6 -q -x /n/groups/shared_databases/igenome/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome -U $fq -S ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.sam

# SAM to BAM

echo "****SAM to BAM conversion****"

samtools view -h -S -b ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.sam -o ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.bam

# Sort BAM

echo "****Sort and Index BAM file****"

sambamba sort -t 6 -o ${baseDir}/results/sambamba/${NAME}_aln_sorted.bam ${baseDir}/results/bowtie2/${NAME}_aln_unsorted.bam

# Filter to keep only uniquely aligned reads

echo "****Filtering reads for unique mappings****"

sambamba view -h --nthreads 1 -f bam -F "[XS] == null and not unmapped and not duplicate" ${baseDir}/results/sambamba/${NAME}_aln_sorted.bam > ${baseDir}/results/unique_align/${NAME}_aln.bam

# Index BAM file containing uniquely aligned reads

echo "****Indexing BAM file****"

samtools index ${baseDir}/results/unique_align/${NAME}_aln.bam

c. Create a separate job submission script to run the shell script you created in b) on all four .fastq files. You have the option of running this in serial or parallel. Take a look at the a utomation lesson to help with setting up the job submission script.

## IN PARALLEL

#!/bin/bash

for fq in /n/scratch2/your_eCommons_ID/HCFC1_chipseq/data/raw_fastq/*.fastq

do

sbatch -p short -t 0-12:00 -n 6 --job-name chipseq-analysis -o %j.out -e %j.err --wrap="sh chipseq_align_on_input_file.sh $fq"

sleep 1

done

d. Once the job submission script has finished running, use MACS2 to call peaks for each replicate using the input and treat BAM files and setting a q-value threshold of 0.05. How many peaks are being called for each replicate?

$ cd /n/scratch2/your_eCommons_ID/HCFC1_chipseq

$ macs2 callpeak -t results/unique_align/HCFC1_treat_rep1_aln.bam -c results/unique_align/HCFC1_input_rep1_aln.bam -f BAM -g hs -q 0.05 --outdir results/macs2 -n HCFC1-rep1

$ macs2 callpeak -t results/unique_align/HCFC1_treat_rep2_aln.bam -c results/unique_align/HCFC1_input_rep2_aln.bam -f BAM -g hs -q 0.05 --outdir results/macs2 -n HCFC1-rep2

Replicate1 has 8418 peaks

Replicate 2 has 10899 peaks

e. Use phantompeakqualtools to check the quality of the ChIP signal for HCFC1-rep1 and HCFC1-rep2. Report the NSC, RSC, and QualityTag values. What can you conclude about the quality of the data based on these quality measures?

# Downloaded the phantompeakqualtools to my unique_align/ folder, but you could have just used the one we downloaded in class

wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/phantompeakqualtools/ccQualityControl.v.1.1.tar.gz

tar -xzf ccQualityControl.v.1.1.tar.gz

# Organized a directory structure for QC

cd phantompeakqualtools/

mkdir qual logs

module load stats/R/3.2.1 seq/samtools/1.2

# Ran phantompeakqualtools on all ChIP bam files

for bam in ../*treat*; do bam2=`basename $bam _aln.bam`; Rscript run_spp.R -c=$bam -savp -out=qual/${bam2}.qual > logs/${bam2}.Rout; done

HCFC1-rep1: NSC = 1.138398, RSC = 1.316096, and QualityTag = 1

HCFC1-rep2: NSC = 1.383237, RSC = 1.436852, and QualityTag = 1

Values of NSC > 1.1 and RSC > 1 indicate ChIP experiments with good signal to noise for the peak calls. This is reflected in the QualityTag value of 1, which indicates 'High' quality ChIP data.

f. Sort each of the narrowPeak files using:

sort -k8,8nr HCFC1-rep1_peaks.narrowPeak > HCFC1-rep1_peaks_sorted.narrowPeak

sort -k8,8nr HCFC1-rep2_peaks.narrowPeak > HCFC1-rep2_peaks_sorted.narrowPeak

g. Use IDR to assess reproducibility between replicates using the sorted peaks in 2f. Use the awk command from class to determine how many peaks meet and IDR of 0.05?

$ mkdir /n/scratch2/your_eCommons_ID/HCFC1_chipseq/results/idr

$ cd /n/scratch2/your_eCommons_ID/HCFC1_chipseq/results/

$ idr --samples macs2/HCFC1-rep1_peaks_sorted.narrowPeak macs2/HCFC1-rep2_peaks_sorted.narrowPeak \

--input-file-type narrowPeak \

--rank q.value \

--output-file idr/HCFC1-idr \

--plot \

--log-output-file idr/HCFC1.idr.log

The total number of peaks that pass the threshold of IDR < 0.05 is 2529 peaks.

awk '{if($12 > 1.3) print $0}' idr/HCFC1-idr | wc -l

h. These high confidence peaks from IDR can be used to explore downstream tools for functional enrichment and motif discovery. Use GREAT to annotate the peaks and write those annotations to file. Take a look at binding site locations relative to the TSS, what does this tell us?

Evaluate the GO enrichment analysis from GREAT, what terms do you see over-represented?

i. **OPTIONAL:** Try motif analysis using the MEME suite and comment on the results you find.

Bioinformatics Training at Harvard Chan Bioinformatics Core

Practice Exercises Answer Key

Analytics

ChIP-seq Practice Exercises: Answer Key

# ChIP-Seq Analysis Workflow

Related content

Practice Exercises Answer Key

[data-colorid=i8xzlv1ji2]{color:#cc0000} html[data-color-mode=dark] [data-colorid=i8xzlv1ji2]{color:#ff3333}[data-colorid=nyhh27myz6]{color:#cc0000} html[data-color-mode=dark] [data-colorid=nyhh27myz6]{color:#ff3333}ChIP-seq Practice Exercises: Answer Key

# ChIP-Seq Analysis Workflow

Related content

ChIP-seq Practice Exercises: Answer Key