felixkrueger / bismark Goto Github PK

View Code? Open in Web Editor NEW

363.0 15.0 101.0 38.91 MB

A tool to map bisulfite converted sequence reads and determine cytosine methylation states

Home Page: http://felixkrueger.github.io/Bismark/

License: GNU General Public License v3.0

Perl 5.68% Smarty 0.16% LilyPond 16.79% HTML 77.33% Python 0.04%

bioinformatics dna methylation

bismark's People

Contributors

Stargazers

Watchers

Forkers

avilella libyarlaylab ewels swingingsimian jsh58 vreuter sdewell shicheng-guo martinjvickers kyounghyoun rifius jdblischak dayedepps diegozavallo hulf025 vesnapajic xuegengwang anran1214 zzygyx9119 kkapuria3 liyanji-code vpbrendel brendelgroup greatfireball jgalazka rakarnik simingk pythseq bwlang yasfoster finallyisnoone tadkeys fjames003 caitlingilbert jeffrey-cs nicolelambert benjamincarlin mmmads barkasn phue chaigsh 1010stone dmitrysigaev zorrodong ewail carbonhp stephenkraemer bmill3r xjyx musketeer-d maggie-wang-bbt isatole17 avkermanov biobit zhsh006 mingleiyang keroguynes jianpingquan nilaycan nebiolabs dadadawei xiaodonghuang-berry ktonosaki konradotto adeslatt humaasif xiaoyantuo maduranga95 ruih-ruih fikriyeolmez wellstein-lab dmsoanes genostack yang-mj snapcolee wentting adesalegn jing-xinxing iromeo semenko chuguoguniang yukieymd bioinformzhang cristein qindan2008 changyiyen alexcpan daniel-ariad-prognomiq kitt-conception lixiang117423 charles-jh trafficone thedawns veroiarrachtai biouser constant-007 meldaw84 aliyurtcicek parthomayo josefoviedo

bismark's Issues

Remove .fastq.gz or .fq.gz from Bismark output filenames

Because file names are getting silly long anyway.

Ambiguous alignments wrongly considered unique

Compared with v0.15.0, the latest version of Bismark (v0.16.1) is giving far fewer reads (~10%) that "did not map uniquely". That is, reads that were previously discarded for having ambiguous alignments are now labeled unique and given in the output BAM file. This may be related to the fix "Bismark: Changed the behaviour of corner cases..." from v0.16.0.

All of the new alignments are to the top strand (CT/CT) and have equivalent alignments to the other strand (CT/GA), but not the same position (or even the same chromosome).

Thanks,

John Gaspar
[email protected]

One example:
// C_to_T converted read
@NS500532:74:HMLCNBGXX:1:11101:16159:1424_1:N:0:AAACATCG
GTTTTATATTGAATTTTGAGGAAGATTGTAGATAGTTTTGATTTTGTTAGAATTTTTTGTGATGTTAATAGTTAAA
+
AAAAAEEEEEEEEEEAEEAEEEEE6EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEAEE<EEEEEEE

// aligned with Bismark to mm9:

v0.15.0 alignment: ambiguous, not given

v0.16.1 alignment: NS500532:74:HMLCNBGXX:1:11101:16159:1424_1:N:0:AAACATCG 0 chr5 2414222442 76M * 0 0 GTTTTATATTGAATTTTGAGGAAGATTGTAGATAGTTTTGATTTTGTTAGAATTTTTTGTGACGTTAATAGTTAAA AAAAAEEEEEEEEEEAEEAEEEEE6EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEAEE<EEEEEEE NM:i:19 MD:Z:1C0C0C0C1C1C6C9C2C3C2C0C0C3C0C3C7C1C14C4 XM:Z:.hhhh.h.x......x.........x..x...x..hhx...hh...h.......h.x.....Z........h.... XR:Z:CT XG:Z:CT

// individual alignments with Bowtie2 confirm two equivalent alignments (AS=0):

// individual alignment with Bowtie2, CT converted:
NS500532:74:HMLCNBGXX:1:11101:16159:1424_1:N:0:AAACATCG 0 chr5_CT_converted 2414222442 76M * 0 0 GTTTTATATTGAATTTTGAGGAAGATTGTAGATAGTTTTGATTTTGTTAGAATTTTTTGTGATGTTAATAGTTAAA AAAAAEEEEEEEEEEAEEAEEEEE6EEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEAEE<EEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU

// individual alignment with Bowtie2, GA converted:
NS500532:74:HMLCNBGXX:1:11101:16159:1424_1:N:0:AAACATCG 16 chr19_GA_converted 3239183042 76M * 0 0 TTTAACTATTAACATCACAAAAAATTCTAACAAAATCAAAACTATCTACAATCTTCCTCAAAATTCAATATAAAAC EEEEEEE<EEAEE/AEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE6EEEEEAEEAEEEEEEEEEEAAAAA AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:76 YT:Z:UU

bam2nuc / bismark2summary to calculate nucleotide fold coverage

The excellent new bam2nuc module apparently adds a very fetching new table to the bismark summary reports:

It looks like we have sample counts and genome counts for each (di)nucleotide - would it be possible to additionally calculate and print the fold coverage for these? Should be minimal work but a useful metric.

sort CpG report

I just discovered that the cytosine report (*.CpG_report.txt.gz) is not sorted. It seems to be sorted by position, but not by chromosome. Is there a reason for that? Can they be sorted in the future?

methylation calling in watson and crick helix chain separately

I think there must some way to calling the methylation frequency in watson and crick helix chain separately. Am I right? Where I can find this information before the methylation frequency calling?

bam2nuc option to generate genomic_nucleotide_frequencies.txt only

We have our Bismark reference genomes in a central location without write privileges for non-admins (eg. me). It would be nice to have the bam2nuc genomic_nucleotide_frequencies.txt written there, but I need to ask someone to generate it for me. Currently, it seems that the only way to get this to happen is to actually analyse some real data.

Please could a command line option be added to bam2nuc to generate just the genomic_nucleotide_frequencies.txt file without processing any data?

Additionally / alternatively, the generation of this file could be added to the bismark_genome_preparation script.

Bismark alignment step: bail if filenames entered do no exist

This came up during our methylation course.

Coverage result

Does bismark already know the number of cytosines within the reference genome? It would be nice to print this, along with the average observation coverage within different contexts if so.

Methylation extractor should die when encountering read pairs mapped to different chromosomes

The methylation extraction in paired-end mode requires Read 1 and Read 2 to follow each other. We do run a check on the first 1M sequences whether someone or some other program interfered with this order (e.g. by merging PE files with samtools merge instead of samtools merge -n or samtools cat) but sometimes the reads are muddled up at a later stage which may lead to methylation extraction for incorrect regions (or chromosomes). Can we add a check that R1 and R2 should always be identical or die horribly with a sensible error message?

Documenation and --help / --version for bismark2summary

It's kind of tricky to figure out how to use bismark2summary currently - some documentation would be awesome! (I've heard that Markdown is a great way to write docs 😉 )

When I struggled to find documentation, I tried using the standard command line flags --help/-h/--version/-v but none of them did anything, instead it just tried to run with those filenames. It would be nice if these could be added in to be consistent with the other bismark tools.

Update manual to include `--multicore` mode

Can we please update the Bismark manual to include the --multicore option which seems to be present only on the Babraham website or with --help.

feature request: explicit --reference/--genome_folder option

This is a feature request for bismark. Currently, the genome_folder is parsed from the command line as an unnamed option:

  ### GENOME FOLDER
  my $genome_folder = shift @ARGV; # mandatory
  unless ($genome_folder){
    warn "Genome folder was not specified!\n";
    print_helpfile();
    exit;
  }

It would be easier to parse the genome folder from the bam header there was an explicit option for it. E.g.:

# Current mode kept for compatibility
bismark --bam --maxins 500 /path/to/fasta/homo_sapiens/dna/GRCh38 -1 SAMPLE_R1.cut.fq.gz -2 SAMPLE_R2.cut.fq.gz

# New explicit mode
bismark --bam --maxins 500 --genome_folder /path/to/fasta/homo_sapiens/dna/GRCh38 -1 SAMPLE_R1.cut.fq.gz -2 SAMPLE_R2.cut.fq.gz

There are programs downstream that rely on a consistent way of parsing the bam header, and it's easier to parse if all the parameters are named.

For your consideration :-)

feature request: cram support for bismark_methylation_extractor

Feature request: ability to read in cram files for bismark_methylation_extractor

Currently assumes it's a bam in gzip format:

gzip: TST36-80-11_S1_L001_R1_001.cutB.sorted.cram.mrgd.cram: not in gzip format
Had trouble closing filehandle BAM in disguise:
Use of uninitialized value $meth_call in split at /data/home/user/avilella/Bismark/bismark_methylation_extractor line 4361, <IN> line 1.
Use of uninitialized value $strand in string eq at /data/home/user/avilella/Bismark/bismark_methylation_extractor line 4420, <IN> line 1.
Use of uninitialized value $strand in string eq at /data/home/user/avilella/Bismark/bismark_methylation_extractor line 4790, <IN> line 1.
Use of uninitialized value $strand in string eq at /data/home/user/avilella/Bismark/bismark_methylation_extractor line 4838, <IN> line 1.
Use of uninitialized value $strand in concatenation (.) or string at /data/home/user/avilella/Bismark/bismark_methylation_extractor line 4888, <IN> line 1.
The strand information was neither + nor -:

error running bismark methylation extractor

How can I find out what's wrong?


[...]

chr HLA-DRB1-15-01-01-04 (11056 bp)
chr HLA-DRB1-15-02-01 (10313 bp)
chr HLA-DRB1-15-03-01-01 (11567 bp)
chr HLA-DRB1-15-03-01-02 (11569 bp)
chr HLA-DRB1-16-02-01 (11005 bp)

Stored sequence information of 3366 chromosomes/scaffolds in total

==============================================================================
Methylation information will now be written into a genome-wide cytosine report
==============================================================================

>>> Writing genome-wide cytosine report to: CEG36-139-7_S7_L001_R1_001.cutB.sorted.mrgd.barcoded.CpG_report.txt.gz <<<

No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!



Finished generating genome-wide cytosine report

Support alignments to large genomes in Bowtie mode

Can we please allow alignments with Bowtie 1 when the genome was > 4 billion bp and Bowtie created ebwtl(arge) indexes. This has recently been an issue for a small New Zealand lizard with a very large genome...

How to merge PE.bam and re-aligned unmapped SE.bam?

Hello, @FelixKrueger
I am a newbie on analysis of WGBS data. I searched a lot using google, but did not get the answer. So i have to come here to ask for help. The following is my problem.

On the advice of this. I did run PE alignments with --unmapped specified in Bismark, and the resulting unmapped Read 1 and Read 2 files then were aligned in single-end mode. As a result, I got three .bam files:

*_bismark_bt2_pe.bam
*_unmapped_reads_1_bismark_bt2.bam
*_unmapped_reads_2_bismark_bt2.bam

Should I merge these three .bam files before bismark_methylation_extractor and how to merge?
or doing the following first and then how to merge the results?

bismark_methylation_extractor -p *_bismark_bt2_pe.bam
bismark_methylation_extractor -s *_unmapped_reads_1_bismark_bt2.bam
bismark_methylation_extractor -s *_unmapped_reads_1_bismark_bt2.bam

Forgive my poor English.
Hope you can understand my question...

bismark shared memory loading of reference

This is a feature request against the master branch of bismark:

Bismark loads the reference genome in memory, then calls bowtie2 to align reads.

The loading of the reference can take between 30 seconds and 5 minutes, depending on the size of the reference and the type of read access.

If the use of bismark is sporadic, e.g. a blast-like service of a few reads against a running bismark server, it takes a long time to load the reference and then only a few seconds to align the reads and produce the output.

Feature request: bismark -shm

Similar to bwa shm, an option to create an index in shared memory, so that bismark can run in "server mode":

bwa shm

Usage: bwa shm [-d|-l] [-f tmpFile] [idxbase]

Options: -d       destroy all indices in shared memory
         -l       list names of indices in shared memory
         -f FILE  temporary file to reduce peak memory

Example use:

index=/opt/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa
bwa shm $index # load the index into shared memory
bwa mem ref.fa reads.fq > results.sam
bwa shm -l # list indices in shared memory
bwa shm -d # unload all indices

bismark_methylation_extractor multicore option

Hi,

I am running bismark_methylation_extractor on a machine instance with 8 CPUs and 15GB of RAM. I set it up so that it runs with --multicore 8 option and 10G of RAM for sorting.

I am looking at the logs (see below), and it's only using 13% CPU. Could it be I am missing a parameter here? Is there a way to make it faster or more efficient?

++ ./bismark_v0.15.0/bismark_methylation_extractor -s --ignore 10 --cytosine_report --bedGraph --gzip --multicore 8 --buffer_size 10G file.bam --genome_folder input_indexes/
 *** Bismark methylation extractor version v0.15.0 ***
Summarising Bismark methylation extractor parameters:
===============================================================
Bismark single-end SAM format specified (default)
Number of cores to be used: 8
First 10 bp will be disregarded when processing the methylation call string
Output will be written to the current directory ('/home/dnanexus')
Summarising bedGraph parameters:
[...]
Stored sequence information of 2580 chromosomes/scaffolds in total
==============================================================================
Methylation information will now be written into a genome-wide cytosine report
==============================================================================
>>> Writing genome-wide cytosine report to: CEG40-111-P027B000idx-1_S1_L001_R1_001.1.cutB_bismark_bt2_se.barcoded.CpG_report.txt.gz <<<
Writing cytosine report for chromosome chr1 (stored 16706 different covered positions)
CPU: 13% (8 cores) * Memory: 4457/15039MB * Storage: 129GB free * Net: 15ￃﾢﾆﾓ/0ￃﾢﾆﾑMBpsJul 1, 2016 8:55 AM
CPU: 13% (8 cores) * Memory: 4456/15039MB * Storage: 129GB free * Net: 15ￃﾢﾆﾓ/0ￃﾢﾆﾑMBps
Writing cytosine report for chromosome chr2 (stored 13873 different covered positions)Jul 1, 2016 8:58 AM
Writing cytosine report for chromosome chr3 (stored 9621 different covered positions)Jul 1, 2016 9:02 AM
CPU: 13% (8 cores) * Memory: 4457/15039MB * Storage: 129GB free * Net: 0MBpsJul 1, 2016 9:05 AM
CPU: 13% (8 cores) * Memory: 4457/15039MB * Storage: 129GB free * Net: 0MBps
Writing cytosine report for chromosome chr4 (stored 8915 different covered positions)Jul 1, 2016 9:06 AM
Writing cytosine report for chromosome chr5 (stored 10201 different covered positions)Jul 1, 2016 9:09 AM
Writing cytosine report for chromosome chr6 (stored 9174 different covered positions)Jul 1, 2016 9:12 AM
CPU: 13% (8 cores) * Memory: 4456/15039MB * Storage: 129GB free * Net: 0MBpsJul 1, 2016 9:15 AM
CPU: 13% (8 cores) * Memory: 4456/15039MB * Storage: 129GB free * Net: 0MBps
[...]

bismark2bedGraph not working when different output directory specified

This fault was reported on SeqAnswers: http://seqanswers.com/forums/showthread.php?p=188553#post188553
bismark2bedGraph -o ./output/chr10.bg ./CpG_chr10.txt

report on trying bismark_genome_preparation with bowtie2-2.2.6 and hs38DH

Downloaded bowtie2 2.2.6 from sourceforge.
Downloaded hs36DH from NCBI using bwakit.

grep '^>' hs38DH.fa  | wc -l
3366

Tried bismark_genome_preparation:

$HOME/avilella/Bismark/bismark_genome_preparation --path_to_bowtie $HOME/bowtie2-2.2.6 --bowtie2 $GROUP/Genomes/Human/hs38DH

Successfully ran:

Bismark/bismark_genome_preparation --path_to_bowtie $HOME/bowtie2-2.2.6 --bowtie2 /data/group/test/Genomes/Human/hs38DH                                                  
[user@compute-1-6 hs38DH]$ ls -lrt
total 3725604
-rw-rw-r-- 1 user test 3263690197 Dec  1 10:19 hs38DH.fa
drwxrwsr-x 4 user test         62 Dec  1 11:02 Bisulfite_Genome
[user@compute-1-6 hs38DH]$ tree -Difts
.
[         62 Dec  1 11:02]  ./Bisulfite_Genome
[        227 Dec  1 12:20]  ./Bisulfite_Genome/CT_conversion
[ 1018869605 Dec  1 13:32]  ./Bisulfite_Genome/CT_conversion/BS_CT.rev.1.bt2
[  760863372 Dec  1 13:32]  ./Bisulfite_Genome/CT_conversion/BS_CT.rev.2.bt2
[ 1018869605 Dec  1 12:19]  ./Bisulfite_Genome/CT_conversion/BS_CT.1.bt2
[  760863372 Dec  1 12:19]  ./Bisulfite_Genome/CT_conversion/BS_CT.2.bt2
[      40751 Dec  1 11:06]  ./Bisulfite_Genome/CT_conversion/BS_CT.3.bt2
[  760863367 Dec  1 11:06]  ./Bisulfite_Genome/CT_conversion/BS_CT.4.bt2
[ 3263425993 Dec  1 11:05]  ./Bisulfite_Genome/CT_conversion/genome_mfa.CT_conversion.fa
[        227 Dec  1 12:17]  ./Bisulfite_Genome/GA_conversion
[ 1018869605 Dec  1 13:26]  ./Bisulfite_Genome/GA_conversion/BS_GA.rev.1.bt2
[  760863372 Dec  1 13:26]  ./Bisulfite_Genome/GA_conversion/BS_GA.rev.2.bt2
[ 1018869605 Dec  1 12:16]  ./Bisulfite_Genome/GA_conversion/BS_GA.1.bt2
[  760863372 Dec  1 12:16]  ./Bisulfite_Genome/GA_conversion/BS_GA.2.bt2
[      40751 Dec  1 11:06]  ./Bisulfite_Genome/GA_conversion/BS_GA.3.bt2
[  760863367 Dec  1 11:06]  ./Bisulfite_Genome/GA_conversion/BS_GA.4.bt2
[ 3263425993 Dec  1 11:05]  ./Bisulfite_Genome/GA_conversion/genome_mfa.GA_conversion.fa
[ 3263690197 Dec  1 10:19]  ./hs38DH.fa

Change of FLAG values in v0.14.4

Hi Felix,

I'm a little confused re the change in v0.14.4 noted in the release notes:

Changed the FLAG values of paired-end alignments to the CTOT or CTOB strands so that reads can
be properly displayed in SeqMonk when imported as BAM files. This change affects only paired-end
alignments in --pbat or --non_directional mode. In detail we simply swapped the Read 1 and Read 2
FLAG values round so reads now resemble exactly concordant read pairs to the OT and OB strands.
Note that results produced by the methylation extractor or further downstream of that are not
affected by this change. FLAG values now look like this:

                                  Read 1       Read 2

                          OT:         99          147

                          OB:         83          163

                          CTOT:      147           99

                          CTOB:      163           83

When a read1 is aligned to CTOT or CTOB its FLAG is set that the read is "second in pair" when it is really the first in the pair. Similarly, when read2 is aligned to CTOT or CTOB its FLAG is set that the read is "first in pair" when it is really the second the pair. Is my understanding correct?

Is the change then just a workaround so Seqmonk can display these in the expected visual orientation or is it something more to do with the definition of first read vs. second read in PBAT data? I'm a little worried that it means I can no longer simply parse the FLAG to infer whether it is read1 or read2 (e.g., will I have to additionally parse the XR/XG tags to infer this information?)

feature request: gzip CpG_report.txt from bismark_methylation_extractor

This is a feature request to lower the disk footprint of a bismark_methylation_extractor run.

I ran master branch including the options --cytosine_report --gzip --bedGraph and I was glad to see that most of the large output files are compressed, except the CpG_report.txt file.

Would it be possible to also gzip that one?

[        256 Dec  1 14:01]  ./TEST_Run001_meth-12345678
[      14382 Dec  1 15:19]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678
[ 1387611650 Dec  1 16:26]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.CpG_report.txt
[    5314308 Dec  1 15:18]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.bedGraph.gz
[    5135404 Dec  1 15:18]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.bismark.cov.gz
[        825 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.bam_splitting_report.txt
[       9204 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.M-bias_R1.png
[      12211 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.M-bias.txt
[   57078031 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CHG_CTOB_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz
[   59273925 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CHG_CTOT_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz
[  118532452 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CHH_CTOB_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz
[  122569601 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CHH_CTOT_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz
[   22190643 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CpG_CTOB_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz
[   22963327 Dec  1 15:17]  ./TEST_Run001_meth-12345678/TST36_80_7_oxBS-12345678/CpG_CTOT_TST36-80-7_S1_L001_R1_001.cutB.sorted.mrgd.dedupled.txt.gz

Thx

bam2nuc CG value?

Hi,

Just a clarification on the values returned from bismark bam2nuc for bisulfite converted data:

Is the CG percent sample value for a bisulfite converted dataset the observed value of a bisulfite converted reference with all the CG positions preserved?

Is so, for a sample with only 50% methylated-CpGs, is it correct to assume that the CG percent sample value should drop by half, with respect to another sample with a 100% methylated-CpGs?

Thx

writes out all the alignments for sequence with multiple alignments

Added new option --ambig_bam --multiple for Bowtie2-mode only, which writes out a single all the alignments for sequences with multiple alignments to a special file ending in .ambiguous.bam.

Specify output filename for bismark2summary

It would be really nice if the filename could be specified for bismark2summary instead of having to always have bismark_summary.html. It looks like the original author actually wrote the code with this in mind 😉

Should be fairly easy to customise with a command line flag I think..?

/usr/bin/env perl ...?

On our clusters, we have a system wide perl but mostly run a different version of perl loaded through the environment module system. However, the bismark scripts have /usr/bin/perl hashbangs at the top so always run on the system version. Currently the sysadmins go through each script editing this every time they load a new version. Any chance you could change it to /usr/bin/env perl in a future release so that it uses whatever the default perl path is?

I have a feeling that you guys had quite strong feelings about this, but I can't remember what they were or why, so I'm preparing myself for a fierce rebuttal 😬

Check for truncated BAM files

Truncated BAM files are super scary - not only is data lost, but because the read pairing gets messed up, incorrect methylation calls can be generated.

BAM files should have EOF tags, making it possible to detect when they have been truncated. Could Bismark deduplication / methextraction steps check for this and either bail or throw a big scary warning if they're not found?

Feature request: Add bismark2summary to the Bismark master

bismark2summary needs to be reworked to also work with data sets for which we don't have mapping report, a deduplication report as well as the M-bias and splitting report (e.g. RRBS data doesn't get deduplicated). Once this is done we can add the file to the Bismark distribution.

M-bias Figure's line could be change

Hi Felix,

In the current M-bias Figure, the color and the width of the line were applied to label different categories. It is really hard to recognize. Do you think we can change it a little bit better?

Thanks.

Look into possibility to drop `--no-discordant` for Bowtie 2 mode to support dovetailing of reads

Can we explore how easy it would be to disable the Bowtie 2 option --no-discordant since this would allow dovetailing reads to align straight away. This would be very useful for PBAT-type runs that require clipping from their 5'-ends.

bismark2summary: No deduplication report present

When I run bismark2report, it runs as expected:

Found 1 alignment reports in current directory. Now trying to figure out whether there are corresponding optional reports

Writing Bismark HTML report to >> AAA_bismark_bt2_PE_report.html <<

==============================================================================================================
Using the following alignment report:       > AAA_bismark_bt2_PE_report.txt <
Using the following deduplication report:   > AAA_bismark_bt2_PE_report.deduplication_report.txt <
No splitting report file specified, skipping this step
No M-bias report file specified, skipping this step
No nucleotide coverage report file specified, skipping this step
==============================================================================================================


Processing alignment report AAA_bismark_bt2_PE_report.txt ...
Complete

Processing deduplication report AAA_bismark_bt2_PE_report.deduplication_report.txt ...
Complete

But then I run bismark2summary and it can't find the deduplication report:

No Bismark/Bowtie2 single-end BAM files detected
Found Bismark/Bowtie2 paired-end files
No Bismark/Bowtie single-end BAM files detected
No Bismark/Bowtie paired-end BAM files detected

Generating Bismark summary report from 1 Bismark BAM file(s)...
>> Reading from Bismark report: AAA_bismark_bt2_PE_report.txt
No deduplication report present, skipping...
No methylation extractor report present, skipping...

Wrote bismark project summary to bismark_summary_report.html

How can that be?

Error with methylation_extractor when plot M-bias plot

Hi,

I am working on bisulfite data, and meet some error when I run Bismark_methylation_extractor from Bismark 0.15.0.

I am wondering whether you could give me some advice about it.

Everything was fine and I got gzip output about methylation location and number or methylated reads. Then when it was going to plot M-bias plot, it just stopped. I paste the error message here.

Determining maximum read lengths for M-Bias plots
Maximum read length of Read 1: 100
Maximum read length of Read 2: 100

/usr/bin/perl: symbol lookup error: /home/groups/apps/perllib/perl-5.16.1/lib/perl5/x86_64-linux-thread-multi/auto/GD/GD.so: undefined symbol: Perl_xs_apiversion_bootcheck
samtools: writing to standard output failed: Broken pipe
samtools: error closing standard output: -1

At first, I think it is some problem with the newest samtools (1.3), since they changed the way we use samtools sort command. But I also tried samtools 1.2 and still got the same error. Is it caused by perl? Do I need to try older version of samtools and check?

The command I used (I didn't sort the bam file form bismark_align, just change the file name):

bismark_methylation_extractor -p --bedGraph --counts --cytosine_report --comprehensive --gzip --genome_folder ../bismark_genome_index/ -o ../bismark_methylation_extract/ C12.bam

bismark --bowtie2 ../bismark_genome_index -1 ../trimmed_fastq/C12.R1.paired.fastq.gz -2 ../trimmed_fastq/C12.R2.paired.fastq.gz -o ./

Thanks,
Guang

feature request: RG entry for bam output

Recent versions of samtools and other toolkits require the RG line in the header.

This is a feature request for bismark to have an option (or default) to include an RG line in the header, something like so:

-R "@RG\tPL:ILLUMINA\tID:ID\tSM:SAMPLE"

Thx

methylation extractor/coverage2cytosine path handling issue

I'd like to report a small bug in methylation_extractor, when one tries to obtain a cytosine_report in the same step as the extraction. If you specify the option -o output_folder, once the bedGraph and coverage files are generated it looks in vain for output_folder/my_library.cov.gz. I assume that the current directory is already output_folder, so that's why it can't find the coverage file.
The misleading bit is that it still goes on to produce a cytosine_report, where all cytosines are covered 0 times. Maybe it could throw a louder error.
Running it in 2 steps, with extraction/bedGraph and then coverage2cytosine works great.

Example of command:
/applications/bismark/bismark_v0.14.5/bismark_methylation_extractor -p --no_overlap --ignore_r2 2 --comprehensive --gzip --report --multicore 4 --bedGraph --ample_memory --cytosine_report --CX --genome_folder ~/genomes/Heinz_pennellii_organelles_F1_genome/ --split_by_chromosome -o methyl_extraction/ P1.deduplicated.bam

it will issue a message of the type 'gzip: cannot find file methyl_extraction/P1.deduplicated.bismark.cov.gz' in the standard output but get on with it anyway.

bismark_methylation_extractor parallelise bedGraph sorting and cytosine report step by chr

A feature request to parallelise bedGraph sorting and cytosine report step by chr.

Quoting Felix:

This step iterates through the genomic sequence, identifies every C and looks up whether or not the C was found in the coverage file. I theory this could be made more parallel, e.g. by doing it for each chromosome separately, but we did't find it very pressing to work on the efficiency of this step.

--cutoff usage

Hi Felix,

Can you make it more explicit:

--cutoff [threshold] The minimum number of times a methylation state has to be seen for that nucleotide before its methylation percentage is reported. Default: 1.

It means the methylation read counts should be >=1 or total reads at the base should be >=1?

Thanks.

I think it should be

--cutoff [threshold] The minimum number of times a methylation state (M or U) has to be seen for that nucleotide before its methylation percentage is reported. Default: 1.

Isn't it?

Thanks.

feature request: explicit --single-ended input file

Samey as with explicit --genome_folder, in case the input is not paired-ended, which has explicit -1 and -2 parameters, it would be useful to have an -se explicit option for single-ended input files.

perl's GetOptlong also allows synonyms via the pipe separator, so maybe add a short 'se' and a longer 'single_end' option would be convenient:

'se|single_end|single_fragment' => \$name_of_variable,

I think that covers all the implicit options that rely on the order of parameters being passed.

bismark2bedGraph fails to sort GZIP compressed files

It appears the bismark2bedGraph produces empty coverage files when using the command line

bismark2bedGraph --dir /Path_to_dir --buffer_size 10G --scaffolds -o CpG_file.bed.gz /path_to_dir/CpG_context_file.fastq.gz_bismark_pe.deduplicated.sorted.txt.gz

Bam file suitable for bismark_methylation_extractor

Do you think bismark_methylation_extractor will be working for all kind of bam files not only created by bismaker? Do you think the bam files created by BWA can also be used for methylation calling by bismark_methylation_extractor?

Bug Report: option `--remove_spaces` appears to cause problems in bismark2bedGraph

There was a post on SeqAnswers reporting a problem:

Changed directory to /media/LTS_33T/SG_LTS33T/monod/haib/methyfreq/
Now replacing whitespaces in the sequence ID field of the Bismark methylation extractor output /media/LTS_33T/SG_LTS33T/monod/haib/methyfreq/CpG_context_ENCFF000LWP_trimmed.fq_bismark_bt2.txt prior to bedGraph conversion

Couldn't write to file /media/LTS_33T/SG_LTS33T/monod/haib/methyfreq/CpG_context_ENCFF000LWP_trimmed.fq_bismark_bt2.txt.spaces_removed.txt: No such file or directory
Finished BedGraph conversion ...

How to understand --ignore --ignore_r2 --ignore_3prime --ignore_3prime_r2

Hi Felix,

Supposing that, in the beginning, my raw read length is 90bp (Fastq). After trim_galore and quality trim, different reads will have different length (maybe from 50-90bp). In such situation, what does it mean for the following parameters:
--ignore 5
--ignore_r2 5
--ignore_3prime 10
--ignore_3prime_r2 10

apply ignore operations to each reads in the input bam, isn't it?

Best regards,

feature request: --cram output option

Feature request to have a .cram/.crai compatibility in bismark, using samtools 1.2 or later versions, so that .cram files are produces if the --cram flag is present, and the files are consumed by bismark methylation report downstream of the alignment step.

bismark_methylation_extractor only for methylation of CpG

the current version of bismark_methylation_extractor could call methylation level for CpG, CHG and CHH. Do you think we can call methylation for CpG only? can we have such option?

the result for CHH and CHG is too large, sometimes we can not output them if we don't need them, therefore, we really need an option to omit these two file.

feature request: --ambiguous-bam

An option to print out the ambiguously mapping reads as a separate bam file file, with MAPQ=0.

there's no output result

I used the following commands to run bismark:
/PUBLIC/software/RNA/bismark/bismark_v0.12.5/bismark -u 50000 -q --bowtie2 -p 4 --prefix map --path_to_bowtie /PUBLIC/software/RNA/bowtie2-2.2.5 -o map_test/default -1 sample_1.clean.fq -2 sample_2.clean.fq
I have waited for eight hours,but there's no any results in the output directory and no error message.I wonder if anyone has meet the problem.
Thanks.

Deduplication step: Add check to see whether input file is empty...

... and bail with a meaningful error message rather than generating an empty deduplicated BAM file. Again this came up during our methylation course when a wrong file name had been entered or one of the previous steps had gone wrong.

Bug in non directional mode

Hello,

I have used Bismark a few times in directional mode, and I have been very happy with the results so far (thanks for that!). However, I believe that I have found a bug in the non-directional mode (I am using bismark_v0.15.0).

Specifically, we are looking at amplicons, and our library contains amplicons targeting the OT and OB strands of the same region. In this case, some amplicons that should be reported as mapping to the OB strand (XG:Z:GA) are reported on the OT strand (XG:Z:CT) and thus the methylation calls are wrong. This problem is likely to affect any type of non-directional library, not just amplicons.

The problem can be seen with the following reference and reads:

Reference:
GTGAAGATTGTCGTGTTTCTTGACTAATTTCCATTGCTTGTTTCATGAGTGCATCCATTAAATCTTGAACATTCTCTGACTTCGTTTCATTTTCGAATATGGTACCAATCTTTACAGACGTTAATGAATTTTTTTTAGCATCATCTGATTTTAATATCTTTGTTTTTTCAATAGTTTTCTTTGCTTTCTTTTTTTGTGATTTTGATTCATTGCTTTCATCACTCGATAATACCATAATATCATCGATTTTTTCTTTTTGTGAAACCATATCGCAAACTTTCGTGATTACATCATCTATTAAAGCTACGACATCATGCATTAATTGCAAATGAGATACATTGCCTAATTTTATTTTCTTTATATTATCTTGAGGTAATTCAGTAGGAGATCTTTTACGTGTTTCAGTTTTTTCTTGTTCATTACGTTGATTTGTTCCAAATCCTGTATTTGTTCTGTCTATAATTTCAATCTCCTCGATACTATTTGAACGTTGAGCAATGGTGACAACATTTCTGTCACTTTCAGTTGTAAATTCCATTTCTTTTCCAATAATTTTATTACTTGTGAAAAGATTTTCTTCCTTAGTTTTCTCATTTAATTTTGAAGTATATTTCATAAAAGGGCCTTTAGCACATTTTTCAAGATGAAATTGAAGCAATTGACCTAAAGATTTTTTCAAGGGTTTTCCTTCGTGCTGCTCAAGATATTTTAATATGCTTGCATGAATTCGATAATG

Forward read:
TTAATTCATTACTTTCATCACTCAATAATACCATAATATCATCAATTTTTTCTTTTTATAAAACCATATCACAAACTTTCATAATTACATCATCTATTAAAACTACGACATCATGCATTAATTACAAATAAAATACATTACCTAATTTTATTTTCTTTATATTATCTTAAAATAATTCAATAAAAAATCTTTTACATATTTCAATTTTTTCTTATTCATTACGTTAATTT

Reverse read:
ATTATTGTTTAACGTTTAAATAGTATCGAGGAGATTGAAATTATAGATAGAATAAATATAGGATTTGGAATAAATTAACGTAATGAATAAGAAAAAATTGAAATATGTAAAAGATTTTTTATTGAATTATTTTAAGATAATATAAAGAAAATAAAATTAGGTAATGTATTTTATTTGTAATTAATGCATGATGTCGTAGTTTTAATAGATGATGTAATTATGAAAGTTTG

bismark command:
bismark --bowtie2 -minins 0 --maxins 1000 --output_dir . --sam --non_directional -N 1 -L 20 -D 20 -R 3 --score_min L,-0.8,-0.8 -p 4 /home/laurawelsh/bio/myBisulf_data/work/mapping/bismark -1 tmp1.fastq -2 tmp2.fastq

I think that I have found a potential solution for this problem.
I'm attaching a diff of my code against bismark_v0.15.0.
The fix is still incomplete: the same thing must be applied to single-end case and to the case where there is ambiguous mapping on the same strand.

Am I missing missing something or is this a real bug?
If it is a real bug, does this fix seem OK?

Thanks,
Sylvain
diff.txt

chromosomal sequence could not be extracted error

Chromosomal sequence could not be extracted for M02800:55:000000000-AMN7M:1:1116:14322:13406_1:N:0:1:TTTTAAAACT chrUn_JTFH01001319v1_decoy 2196
Mar 17, 2016 11:33 AM

This is with 0.14.5 and using a genome reference with decoys and alt haps. Any ideas?

feature request: bismark_methylation_extractor --exclude_bed / --include_bed options

This is a feature request against the head master version of bme.

Would it be possible to add a feature where the CpGs counts are done either excluding or including the regions given in a bed file? E.g.

bme $input.bam --genome_folder /my/genome.fa --exclude_bed /my/excluded.regions.from.genome.bed

or

bme $input.bam --genome_folder /my/genome.fa --includee_bed /my/included.regions.from.genome.bed

I guess one option is not to do this, and do a filtering step posterior to running bme, but then none of the useful metrics would in turn be reported.

Apologies if this is already an option, I couldn't see it in the code myself.

error: No last chromosome was defined

Hi,

I am running bismark methylation extractor (head master branch) and it looks like it's unable to produce the CpG_report.txt files:

cd /bi/group/cegx/CEGX_Run196/CEGX_Run196-12345678/CEG40_90_9-36442469/ && /bi/home/rocks-av/avilella/Bismark/bismark_methylation_extractor --samtools_path=/bi/apps/samtools/0.1.18/samtools --ignore 6 --genome_folder /bi/group/cegx/Genomes/Human/hs38DH_orig --multicore 8 --buffer_size 20G --cytosine_report  --bedGraph --gzip --single-end /bi/group/cegx/CEGX_Run196/CEGX_Run196-12345678/CEG40_90_9-36442469/CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.bam

The counting txt files look correct:

-rw-r--r-- 1 rocks-av cegx 1091252162 Jun  3 10:04 CHG_CTOT_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx 2328942363 Jun  3 10:04 CHH_CTOT_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx  462080614 Jun  3 10:04 CpG_CTOT_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx  957859394 Jun  3 10:04 CHG_CTOB_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx 2067793732 Jun  3 10:04 CHH_CTOB_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx  406603928 Jun  3 10:04 CpG_CTOB_CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.txt.gz
-rw-r--r-- 1 rocks-av cegx        856 Jun  3 10:04 CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded_splitting_report.txt
-rw-r--r-- 1 rocks-av cegx      12492 Jun  3 10:04 CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.M-bias.txt
-rw-r--r-- 1 rocks-av cegx       7399 Jun  3 10:04 CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.M-bias_R1.png

But when it tries to produce the cytosine report part, it fails. The .temp files also look correct. See below. Any ideas?

Stored sequence information of 3366 chromosomes/scaffolds in total

==============================================================================
Methylation information will now be written into a genome-wide cytosine report
==============================================================================

>>> Writing genome-wide cytosine report to: CEG40-90-9_S1_L001_R1_001.cutB.sorted.mrgd.barcoded.CpG_report.txt.gz <<<

No last chromosome was defined, something must have gone wrong while reading the data in (e.g. specified wrong file path for a gzipped coverage file?). Please check your command!



Finished generating genome-wide cytosine report

felixkrueger / bismark Goto Github PK

bismark's People

Contributors

Stargazers

Watchers

Forkers

bismark's Issues

Recommend Projects

Recommend Topics

Recommend Org