
origami's Introduction

origami

Overview

Origami is a pipeline for processing ChIA-PET data and calling high-confidence chromatin loops associated with the ChIPped factor. The pipeline implements a semi-Bayesian two-component mixture model to learn which contacts in the data set appear to represent structured chromatin loops in a cell population, while correcting for important biases in ChIA-PET data (notably, the linear genomic distance between the ends of a potential chromatin loop).

Contact Information

For more information or any questions, please contact [email protected] or [email protected].

Installation

To install origami, first run the configure script, which tests for the presence of most of the dependencies origami needs. After configure completes successfully, run make to compile the C++ code. Then run make install to copy all the necessary files to the installation directory.

(Note: at present, the configure script does not check whether Bamtools is installed correctly, so please make sure it is! The Makefile may need to be adjusted to point at where the Bamtools API files are installed on your system.)
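The three steps above follow the standard autoconf pattern; a minimal sketch (the --prefix path is illustrative):

```shell
# Standard autoconf-style build; --prefix sets where `make install` copies files.
./configure --prefix=$HOME/origami
make          # compiles the C++ code
make install  # copies the necessary files into the prefix directory
```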

Dependencies

  • cutadapt (tested with version 1.8)
  • samtools (tested with version 1.2)
  • bamtools (tested with version 2.2.3)
  • bowtie1
  • MACS1 and MACS2
  • R (also requires the R libraries GenomicRanges and matrixStats)
  • g++
  • bash
  • make
  • autoconf

Running the pipeline

The pipeline has three steps that need to be run:

  • origami-alignment -- aligns the ChIA-PET reads
  • origami-analysis -- estimates the confidence in each ChIA-PET interaction
  • origami-conversion -- converts the origami output to popular genomic file formats for visualization

Aligning ChIA-PET reads (origami-alignment)

Aligning the reads is handled through the origami-alignment executable.

Command: origami-alignment [options] <FASTQ read 1> <FASTQ read 2>

This script takes a bowtie index (must be a Bowtie 1 index) and the paired read files, and handles adapter trimming, alignment, and peak calling on the reads. The output of this step is a BAM file of aligned paired reads plus peak calls from MACS1 and MACS2, named accordingly. The FASTQ files may be gzip- or bzip2-compressed. The optional arguments are below:

  • -o,--output=: The output directory to put all the aligned reads and peak calls in. Defaults to "output"
  • -h: Shows the help menu
  • -v: Turn on verbose mode
  • -p: Activates parallelization on the LSF queue; must be paired with --lsf-queue. It is also recommended that --divide-pets be set appropriately for the data
  • --divide-pets=[NUM]: When -p is active, this will divide the FASTQ reads into subfiles of NUM reads to distribute over a LSF queue for faster processing
  • --lsf-queue=[queue name]: Sets the LSF queue on which to queue all the parallelized jobs, implies -p. There is no default for this option
  • -m=,--min-len=[integer]: Sets the minimum read length to keep post-adapter trimming. Default is 15 bp
  • --mode=[mode]: Sets the linker-trimming mode; acceptable options are long (bridge-linker) and ab (AB-linker)
  • --forward-linker: Set the forward linker to search for in the paired reads. Default is ACGCGATATCTTATCTGACT.
  • --reverse-linker: Set the reverse linker to search for in the paired reads. Default is AGTCAGATAAGATATCGCGT.
  • --ab-linker: Activates the AB-linker trimming mode instead of the default long-read linker mode.
  • --a-linker: Sets the A-linker sequence. Default is CTGCTGTCCG. Implies --ab-linker
  • --b-linker: Sets the B-linker sequence. Default is CTGCTGTCAT. Implies --ab-linker
  • --pp=[executable command]: Preprocess reads. Runs each of the passed FASTQ files through the executable script before trimming and alignment. Default is no preprocessing
  • --macs-gsize=[value]: Set the genome size option for MACS1 and MACS2 (see the MACS documentation). Default is hs
  • --keep-tmp: Don't delete temporary files when the alignment is finished
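A hypothetical invocation following the usage line above (file names are placeholders; note that the prose also mentions a required Bowtie 1 index, so check origami-alignment -h for the exact positional arguments):

```shell
# Hypothetical example; file names are placeholders, not verified usage.
origami-alignment \
    -o chiapet_output \
    --mode=long \
    --min-len=20 \
    --macs-gsize=hs \
    sample_R1.fastq.gz sample_R2.fastq.gz
```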

Estimating the confidence in each interaction (origami-analysis)

Command: origami-analysis [options]

After the alignment step, the next executable estimates the confidence in each identified interaction within the data set. It takes the resulting BAM file and peak file from the alignment step and handles identifying interactions and estimating their confidence. Any paired-end BAM file sorted by read name can be passed to the script, and any BED-type peaks file can be passed to the algorithm. The executable will generate two files: -model-data.Rdata and results.csv. The options are below:

  • -h: Shows the help menu
  • --without-distance: Turns off weighting interactions by their distance
  • --iterations=[positive integer]: Sets the number of iterations for the MCMC algorithm. Default is 10000
  • --burn-in=[integer]: Sets the number of iterations used for the burn-in step of the MCMC algorithm. Default is 100. A value of 0 turns off the burn-in period.
  • --prune=[0 or 2+]: Sets the number of steps at which the MCMC algorithm prunes an interaction. The default is 5. A value of 0 turns off the pruning.
  • --slop-dist=[integer]: Extends each peak in both directions by the given length. Default is 0. This also requires --slop-genome
  • --slop-genome=[file]: A file with two columns: one naming each chromosome in the genome and one giving that chromosome's length. Needs --slop-dist.
  • --save-full-model: Saves all the MCMC data generated. By default it saves a reduced set of essential parameters. (Warning: these files can get very big and can potentially exhaust the RAM while running, so use this for a low number of iterations if possible)
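A hypothetical invocation, assuming the name-sorted BAM and the peak calls from the alignment step are passed as positional arguments (file names and argument order are placeholders; check origami-analysis -h for the exact usage):

```shell
# Hypothetical example; file names and positional-argument order are assumptions.
origami-analysis \
    --iterations=10000 \
    --burn-in=100 \
    --prune=5 \
    chiapet_output/mapped_reads.bam chiapet_output/peaks.bed
```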

Converting the results to a usable format (origami-conversion)

The results CSV file generated by origami-analysis can be converted through the following command:

Command: origami-conversion <bedpe/bed/washu>

This will convert the interactions in the results file, using the chosen column, into the BEDPE, BED12, or WashU file format. Since the BED12 format assumes each entry starts and ends on the same chromosome, that conversion excludes all interchromosomal interactions. Also, all interactions are reported regardless of significance, so one may want to filter the results after the conversion. The script allows the following column indexes (generally one will use 3):

  • 1 -- Reports the PET count of each interaction
  • 2 -- Reports the (uncorrected) hypergeometric p-value of the interaction (based on the Fullwood et al. 2009 method)
  • 3 -- Reports the confidence/belief probability from the MCMC procedure
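A hypothetical conversion call, assuming the results file and the column index are supplied after the format keyword (the usage line above shows only the format; check origami-conversion -h for the exact arguments). Column 3 selects the MCMC confidence:

```shell
# Hypothetical example; the arguments after the format keyword are assumptions.
origami-conversion bedpe results.csv 3 > loops.bedpe
```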

origami's People

Contributors

bja3917, danielsday


origami's Issues

the Bamtools API

When the error "configure: error: Cannot find one or more Bamtools API header files, please set -I to the path to the Bamtools header files within CPPFLAGS=" occurs, could you please show an example of how to modify the makefile? Thank you very much.
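One way that typically satisfies this kind of autoconf check, without editing the Makefile at all, is to pass the Bamtools locations to configure directly (the paths below are illustrative; point them at your own Bamtools install):

```shell
# Illustrative paths only: -I must reach the directory containing api/BamReader.h,
# and -L the directory containing libbamtools.
./configure CPPFLAGS="-I/opt/bamtools/include/bamtools" \
            LDFLAGS="-L/opt/bamtools/lib"
```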

undefined references?

Modules are loaded as required and we get a good response from configure (see notes at the end).

But when I run make, I get undefined reference errors.

[root@server origami-1.1]# make
cd src && g++ -I/binaries/bamtools/2.5.0/include/bamtools/ -c mapped-reads-merge.cpp
cd src && g++ -o mapped-reads-merge mapped-reads-merge.o -lbamtools
/binaries/bamtools/2.5.0/lib64/../lib64/libbamtools.a(BgzfStream_p.cpp.o): In function `BamTools::Internal::BgzfStream::InflateBlock(unsigned long const&)':
BgzfStream_p.cpp:(.text+0x1da): undefined reference to `inflateInit2_'
BgzfStream_p.cpp:(.text+0x1ed): undefined reference to `inflate'
BgzfStream_p.cpp:(.text+0x200): undefined reference to `inflateEnd'
BgzfStream_p.cpp:(.text+0x2a7): undefined reference to `inflateEnd'
BgzfStream_p.cpp:(.text+0x2f5): undefined reference to `inflateEnd'
/binaries/bamtools/2.5.0/lib64/../lib64/libbamtools.a(BgzfStream_p.cpp.o): In function `BamTools::Internal::BgzfStream::DeflateBlock(int)':
BgzfStream_p.cpp:(.text+0xc01): undefined reference to `deflateEnd'
BgzfStream_p.cpp:(.text+0xc7b): undefined reference to `deflateInit2_'
BgzfStream_p.cpp:(.text+0xc92): undefined reference to `deflate'
BgzfStream_p.cpp:(.text+0xca8): undefined reference to `deflateEnd'
BgzfStream_p.cpp:(.text+0xce4): undefined reference to `crc32'
BgzfStream_p.cpp:(.text+0xcf1): undefined reference to `crc32'
collect2: error: ld returned 1 exit status
make: *** [bamexes] Error 1

but a grep -r <reference> /binaries/bamtools/2.5.0/ returns a positive match in each case? E.g.:

[root@server origami-1.1]# grep -r inflate /binaries/bamtools/2.5.0/*                        
Binary file /binaries/bamtools/2.5.0/bin/bamtools matches
Binary file /binaries/bamtools/2.5.0/lib/libbamtools.a matches
Binary file /binaries/bamtools/2.5.0/lib64/libbamtools.a matches

I'm not sure how to fix that problem. Any tips?
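The missing symbols (inflateInit2_, inflate, deflate, crc32) all come from zlib, which the static libbamtools.a archive does not pull in on its own. A likely fix, assuming zlib is installed, is to append -lz to the link rule in the Makefile, e.g.:

```shell
# Same link step as in the failing make output, with zlib appended after
# -lbamtools so the linker can resolve the compression symbols.
g++ -o mapped-reads-merge mapped-reads-merge.o \
    -L/binaries/bamtools/2.5.0/lib64 -lbamtools -lz
```

Library order matters with static archives: -lz must come after -lbamtools so the zlib symbols bamtools references are still unresolved when the linker reaches it.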

Notes

[root@server origami-1.1]# module list
Currently Loaded Modulefiles:
  1) cutadapt/1.15        3) java/1.8.0_144-jre   5) samtools/1.6         7) bamtools/2.5.0
  2) macs/2.1.1           4) R/3.4.0              6) bowtie/1.2.2         8) bedtools/2.26

[root@server origami-1.1]# ./configure --prefix=/binaries/origami/1.1/
configure: Checking for software tools needed by origami
checking for cutadapt... cutadapt
checking for macs2... macs2
checking for bowtie... bowtie
checking for bedtools... bedtools
checking for R... R
checking for Rscript... Rscript
checking for samtools... samtools
checking for perl... perl
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
configure: Checking for bedtools version
configure: Checking for C++ libraries needed by origami
checking how to run the C++ preprocessor... g++ -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking api/BamReader.h usability... yes
checking api/BamReader.h presence... yes
checking for api/BamReader.h... yes
checking api/BamWriter.h usability... yes
checking api/BamWriter.h presence... yes
checking for api/BamWriter.h... yes
configure: Checking for R libraries needed by origami
checking if R library GenomicRanges is installed... yes
checking if R library matrixStats is installed... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating bin/origami-alignment
config.status: creating bin/origami-analysis

Issue: FASTQ data cat into output/tmp/reads produces an error

I tried using both fastq.gz and fastq files with origami-alignment.
The first step, where origami-alignment unzips and cats the FASTQ files into output/tmp/left-reads.fq and output/tmp/right-reads.fq, always ends with an error:

This is cutadapt 1.15 with Python 2.7.6
Command line parameters: -n 3 --overlap 10 -e 0 --discard-untrimmed -m 20 -a CTGCTGTCCG -A CTGCTGTCCG -o output/tmp/l_same_aa.fq -p output/tmp/r_same_aa.fq output/tmp/left_reads.fq output/tmp/right_reads.fq
Running on 1 core
Trimming 2 adapters with at most 0.0% errors in paired-end mode ...
cutadapt: error: In read named 'SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50': length of quality sequence (48) and length of read (50) do not match
This is cutadapt 1.15 with Python 2.7.6
Command line parameters: -n 3 --overlap 10 -e 0 --discard-untrimmed -m 20 -a CTGCTGTCAT -A CTGCTGTCAT -o output/tmp/l_same_bb.fq -p output/tmp/r_same_bb.fq output/tmp/left_reads.fq output/tmp/right_reads.fq
Running on 1 core
Trimming 2 adapters with at most 0.0% errors in paired-end mode ...
cutadapt: error: In read named 'SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50': length of quality sequence (48) and length of read (50) do not match
This is cutadapt 1.15 with Python 2.7.6
Command line parameters: -n 3 --overlap 10 -e 0 --discard-untrimmed -m 20 -a CTGCTGTCCG -A CTGCTGTCAT -o output/tmp/l_diff_ab.fq -p output/tmp/r_diff_ab.fq output/tmp/left_reads.fq output/tmp/right_reads.fq
Running on 1 core
Trimming 2 adapters with at most 0.0% errors in paired-end mode ...
cutadapt: error: In read named 'SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50': length of quality sequence (48) and length of read (50) do not match
This is cutadapt 1.15 with Python 2.7.6
Command line parameters: -n 3 --overlap 10 -e 0 --discard-untrimmed -m 20 -a CTGCTGTCAT -A CTGCTGTCCG -o output/tmp/l_diff_ba.fq -p output/tmp/r_diff_ba.fq output/tmp/left_reads.fq output/tmp/right_reads.fq
Running on 1 core
Trimming 2 adapters with at most 0.0% errors in paired-end mode ...
cutadapt: error: In read named 'SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50': length of quality sequence (48) and length of read (50) do not match
This is cutadapt 1.15 with Python 2.7.6
Command line parameters: -n 3 --overlap 10 -e 0 --discard-trimmed -m 20 -o output/tmp/l_neither.fq -p output/tmp/r_neither.fq SRR6010260_R1.fastq SRR6010260_R2.fastq

I checked the raw FASTQ files, and the quality-string length and read length for 'SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50' are correct. However, in the output/tmp/right-reads.fq file, 2 bases of the quality string were removed, resulting in the error.

fastq raw data

@SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50
GTGCTGGGGTCCATGTGGGCCAGATGCCCTGGGCCCTGGGCAGGGCCAGG
+SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50
@d0@0<<<C/<D<C1<1<<1<@ghhc??CGC1E<@/1CGEFCDHHH@E??

output/tmp/right-reads.fq

@SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50
GTGCTGGGGTCCATGTGGGCCAGATGCCCTGGGCCCTGGGCAGGGCCAGG
+SRR6010260.33 WIGTC-HISEQ:1:1212:1644:2156 length=50
@d0@0<<<C/<D<C1<1<<1<@ghhc??CGC1E<@CGEFCDHHH@E??

It appears that the cat step over the FASTQ files removed "/1" from the quality string?
Please check on this issue. I am not sure how it affects downstream alignment if the cutadapt step does not work properly.

Thanks.
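To locate every damaged record in an intermediate file, a small check can compare sequence and quality lengths per record (a sketch; assumes standard 4-line FASTQ records with no wrapping). The demo file below is fabricated to show the behavior:

```shell
# Demo FASTQ with two records; the second record's quality line is too short.
printf '@ok read\nACGT\n+\nIIII\n@bad read\nACGT\n+\nII\n' > demo.fq

# Print the ID of every record whose sequence and quality lengths differ.
awk 'NR % 4 == 1 { id = $0 }
     NR % 4 == 2 { seqlen = length($0) }
     NR % 4 == 0 { if (length($0) != seqlen) print id }' demo.fq
# prints: @bad read
```

Running this over output/tmp/right-reads.fq would show whether the truncation is limited to one read or systematic.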

Need to correctly set alignment flag bits for the second tag in the joined BAM file

After each tag is aligned separately, I re-merge the two tags into one BAM file (with the correct tag-pair mapping). However, I do not believe the alignment bits for the second part are correctly set in the BAM file.

(The alignment flag bits in the first pair are correctly set since these are taken directly from the SAM/BAM file from the bowtie alignment.)
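For reference, SAM flag values are bitwise ORs of per-property bits (0x1 read paired, 0x40 first in pair, 0x80 second in pair, among others), so after re-merging, the second tag should carry at least 0x1 | 0x80. This is generic SAM-spec arithmetic, not origami code:

```shell
# Minimal SAM flag arithmetic: expected base flags for the two mates of a pair
# (additional bits such as 0x10 reverse strand would be ORed in as applicable).
first=$((0x1 | 0x40))    # read paired + first in pair
second=$((0x1 | 0x80))   # read paired + second in pair
echo "first=$first second=$second"
# prints: first=65 second=129
```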
