
walt's People

Contributors

acgtun, andrewdavidsmith, bdecato, guilhermesena1, jqujqu


walt's Issues

running WALT on Mac and/or optimizing memory

Hi,
I am trying to run WALT on a Mac, or to optimize memory allocation on my Linux machine:

ON MY MAC:
When I run
makedb -c ./split_chr_ref_folder -o hg19.masked.dbindex
I get the following error:
`dyld: lazy symbol binding failed: Symbol not found: __ZNKSt5ctypeIcE13_M_widen_initEv
Referenced from: /Users/antoinechaillon/Documents/walt/bin/makedb (which was built for Mac OS X 10.11)
Expected in: /usr/lib/libstdc++.6.dylib

dyld: Symbol not found: __ZNKSt5ctypeIcE13_M_widen_initEv
Referenced from: /Users/antoinechaillon/Documents/walt/bin/makedb (which was built for Mac OS X 10.11)
Expected in: /usr/lib/libstdc++.6.dylib`

Any chance WALT can work on MacOS?

WITH LINUX:
I've installed WALT on a Linux machine, but when I run
makedb -c ./split_chr_ref_folder -o hg19.masked.dbindex
I run into a memory problem:
[COUNT BUCKET SIZE]
[NOTICE: ERASE THE BUCKET 4194303 SINCE ITS SIZE IS 934420]
[NOTICE: ERASE THE BUCKET 12582911 SINCE ITS SIZE IS 709199]
[NOTICE: ERASE THE BUCKET 13631487 SINCE ITS SIZE IS 850510]
[NOTICE: ERASE THE BUCKET 15728639 SINCE ITS SIZE IS 653272]
[NOTICE: ERASE THE BUCKET 15990783 SINCE ITS SIZE IS 791404]
[NOTICE: ERASE THE BUCKET 16515071 SINCE ITS SIZE IS 610398]
[NOTICE: ERASE THE BUCKET 16580607 SINCE ITS SIZE IS 774223]
[NOTICE: ERASE THE BUCKET 16711679 SINCE ITS SIZE IS 614458]
[NOTICE: ERASE THE BUCKET 16728063 SINCE ITS SIZE IS 759120]
[NOTICE: ERASE THE BUCKET 16760831 SINCE ITS SIZE IS 591405]
[NOTICE: ERASE THE BUCKET 16764927 SINCE ITS SIZE IS 735242]
[NOTICE: ERASE THE BUCKET 16773119 SINCE ITS SIZE IS 580192]
[NOTICE: ERASE THE BUCKET 16774143 SINCE ITS SIZE IS 727159]
[NOTICE: ERASE THE BUCKET 16776191 SINCE ITS SIZE IS 604228]
[NOTICE: ERASE THE BUCKET 16776447 SINCE ITS SIZE IS 725546]
[NOTICE: ERASE THE BUCKET 16776959 SINCE ITS SIZE IS 593534]
[NOTICE: ERASE THE BUCKET 16777023 SINCE ITS SIZE IS 732571]
[NOTICE: ERASE THE BUCKET 16777151 SINCE ITS SIZE IS 577419]
[NOTICE: ERASE THE BUCKET 16777167 SINCE ITS SIZE IS 762077]
[NOTICE: ERASE THE BUCKET 16777199 SINCE ITS SIZE IS 627084]
[NOTICE: ERASE THE BUCKET 16777203 SINCE ITS SIZE IS 834068]
[NOTICE: ERASE THE BUCKET 16777211 SINCE ITS SIZE IS 654082]
[NOTICE: ERASE THE BUCKET 16777212 SINCE ITS SIZE IS 894219]
[NOTICE: ERASE THE BUCKET 16777214 SINCE ITS SIZE IS 749400]
[NOTICE: ERASE THE BUCKET 16777215 SINCE ITS SIZE IS 3860849]
[HASH TO BUCKET]
ERROR: could not allocate memory

Any chance we can optimize the memory allocation?
Thank you so much!

Adding robustness in OptionParser for specifying options more than once

Users can accidentally specify command line options more than once. If "leftover_args" are not properly processed (for example, by checking whether any exist), such accidents can "shift" the values of all later parameters. Example:
> ./walt -i hg19_walt.dbindex -1 a.fq -2 b.fq -o c.out -C GATCGGAAGAGCG -C -t 8 -m 10
In a case like this it appears the -C value will be over-written, and the extra 8 on the command line will be ignored. Consider something like the following:

if (!leftover_args.empty()) {
    // unparsed arguments remain: the command line is malformed, so print
    // the help message rather than silently mis-assigning option values
    cerr << opt_parse.help_message() << endl;
    return EXIT_SUCCESS;
}

This kind of check is used in most programs using the OptionParser class in smithlab_cpp. It does not perfectly solve all types of problems, but it catches many.

Clipping adapter

Like the -C option of rmapbs, removing adapter sequence from the reads will help improve mapping rates. Although this can be done prior to mapping with other tools, we still need to decide which of the following we want it to be (a sketch of option 1 follows the list):

  1. trim the adapter sequence from the reads, then map reads of possibly varying lengths; this might require setting different seed lengths per read.
  2. replace the adapter sequence with "N"; I'm not sure whether this actually solves the clipping problem.
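
For concreteness, here is a minimal sketch of option 1, using a simple exact-match scan for the adapter at the 3' end of the read; the function name, the min_overlap parameter, and the greedy scan are assumptions for illustration, not WALT's actual logic:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

static std::string clip_adapter(const std::string &read,
                                const std::string &adapter,
                                const size_t min_overlap = 5) {
  // at each position, clip if the read suffix matches a prefix of the adapter
  for (size_t i = 0; i + min_overlap <= read.size(); ++i) {
    const size_t n = std::min(adapter.size(), read.size() - i);
    if (read.compare(i, n, adapter, 0, n) == 0)
      return read.substr(0, i);
  }
  return read;
}

int main() {
  // adapter from the -C example above, partially present at the read's end
  std::cout << clip_adapter("ACGTACGTGATCGGAAG", "GATCGGAAGAGCG") << '\n';
  // prints: ACGTACGT
}
```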

-b flag actually sets -k

When we manually set -b (the maximum number of hits) to something greater than 300, the program complains that "top k pairs" (-k) must not be greater than 300. This is because the option parser for -b is actually populating the top_k variable.
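
A self-contained toy (not WALT's code) that reproduces the symptom: when two flags write into the same variable, whichever appears later on the command line wins, so -b 500 trips the -k range check:

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

int main(int argc, const char **argv) {
  int top_k = 50;
  int &b_value = top_k;  // the suspected bug: -b is bound to top_k
  for (int i = 1; i + 1 < argc; i += 2) {
    if (std::strcmp(argv[i], "-k") == 0) top_k = std::atoi(argv[i + 1]);
    if (std::strcmp(argv[i], "-b") == 0) b_value = std::atoi(argv[i + 1]);
  }
  if (top_k > 300) {  // the check the user actually hits
    std::cerr << "top k pairs must not be greater than 300\n";
    return EXIT_FAILURE;
  }
  std::cout << "top_k = " << top_k << '\n';
}
```

The fix is presumably to bind -b to its own variable and validate it separately.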

walt selectively times out

Dear walt,

I have 6 pairs of paired-end read files with sizes shown below.

$ ls -1s data/*fastq
57692256 data/CM1-1.1.fastq
57692320 data/CM1-1.2.fastq
59956216 data/CM1-2.1.fastq
60288296 data/CM1-2.2.fastq
70784204 data/CM3-1.1.fastq
64211772 data/CM3-1.2.fastq
64597448 data/CM3-2.1.fastq
65398276 data/CM3-2.2.fastq
66402944 data/CM6-1.1.fastq
65861148 data/CM6-1.2.fastq
67352880 data/CM6-2.1.fastq
63484840 data/CM6-2.2.fastq

On our campus cluster, I ran walt v1.01 separately on each of the 6 pairs of files, using as index the 24 main chromosomes of GRCh38. Only two of the 6 runs (CM1-1 and CM6-2) completed, in about 2 hr 15 min each, and only the CM1-1 run produced a mapstats output file (see output below). The other 4 runs timed out after 24 hours with reduced output and no diagnostic information.

$ ls -1s mapped-data
total 309101236
64888788 CM1-1.mr
20 CM1-1.mr.mapstats
32358408 CM1-2.mr
42958864 CM3-1.mr
46989324 CM3-2.mr
41807880 CM6-1.mr
80097952 CM6-2.mr

Please advise, and please let me know if you need more info, thank you!

Best,
CB

name?

What do people think about a name for this mapper? The defining features are that it uses a two-letter alphabet, making it well suited to bisulfite-treated reads, and that it uses a hash table of exact-match seeds and their locations in the genome.

The best I can come up with now is bibinmap, which stands for bisulfite binary mapper. It also sounds like bibimbap (http://chichilicious.com/wp-content/uploads/bibimbap-recipe.jpg) just enough to make me think of food.

Other ideas?

average and standard deviation of insert size for paired end reads

Printing the average and standard deviation of insert size for paired-end mapped reads in the .mapstats file would be incredibly helpful: these statistics are required for the GEO submission form, and I just found myself having to compute them from the distribution, so I approximated with the mode fragment length.
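
A minimal sketch of the computation, assuming fragment lengths are accumulated in a vector while pairs are mapped (the toy values are illustrative):

```cpp
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  const std::vector<int> frag_lens = {180, 210, 195, 205, 190};  // toy data
  double sum = 0.0, sum_sq = 0.0;
  for (const int x : frag_lens) {
    sum += x;
    sum_sq += static_cast<double>(x) * x;
  }
  const double n = static_cast<double>(frag_lens.size());
  const double mean = sum / n;
  const double sd = std::sqrt(sum_sq / n - mean * mean);  // population sd
  std::cout << "insert size: mean = " << mean << ", sd = " << sd << '\n';
}
```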

Using a fraction of read length for the maximum allowed mismatch

Currently the -m option is an integer giving the number of mismatches allowed per read. In practice it is better for this number to be determined automatically from the input read length. In rmap, this option can be a float, e.g. 0.1, indicating a fraction, so by default the maximum number of mismatches per read is read_length * 0.1.

For walt it is necessary to compute this number per read during mapping, because the input reads may have been adapter-trimmed and therefore have different lengths.
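
A sketch of the proposed semantics, assuming, as in rmap, that a value below 1 is interpreted as a fraction of the read length and anything else as an absolute count:

```cpp
#include <cstddef>
#include <iostream>

static std::size_t max_mismatches(const double m, const std::size_t read_length) {
  return m < 1.0 ? static_cast<std::size_t>(m * read_length)
                 : static_cast<std::size_t>(m);
}

int main() {
  std::cout << max_mismatches(0.1, 101) << '\n';  // 10 for a 101 bp read
  std::cout << max_mismatches(0.1, 36) << '\n';   // 3 for a trimmed 36 bp read
  std::cout << max_mismatches(6.0, 101) << '\n';  // fixed count: 6
}
```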

support SAM output format

Most bisulfite sequencing mappers support the SAM output format. To support downstream analysis, bsmapper will support SAM format too.
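
A minimal sketch of what the output could look like: one header line, one reference line, and a single alignment record, with illustrative field values only (the real implementation would compute FLAG, POS, CIGAR, etc. from each mapping):

```cpp
#include <iostream>

int main() {
  std::cout << "@HD\tVN:1.6\tSO:unsorted\n"
            << "@SQ\tSN:chr1\tLN:248956422\n"
            // QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
            << "read1\t0\tchr1\t10468\t255\t8M\t*\t0\t0\t"
            << "ACGTTGCA\tIIIIIIII\n";
}
```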

Potential limitation of fragment size

The fragment size of a paired-end alignment is returned by this function as an int. In most cases this is sufficient, but sometimes we might need a very long fragment. Just keeping a record here for future use.

Error when mapping paired-end data

Got this error "The number of reads in paired-end files should be the same" which corresponds to this.

I don't think there is a logic error in the code; instead I suspect it is caused by OpenMP, because the error only occurs when using multiple threads for mapping. It often occurs when the mapping job is close to finishing (after tens of hours), so it is not easy to reproduce.

Here's my guess about the cause; @acgtun please let me know if this is possible.
The function that updates the read-count variable is LoadReadsFromFastqFile (link to code), which takes a reference to num_of_reads. Another variable, n_reads_to_process, tells the program how many reads to process in each loop iteration, usually 1 million or 500k. Could it be that in the last iteration, when some mapping threads started earlier than others and loaded the remaining reads (fewer than 1 million), num_of_reads was updated to this smaller number, but was then compared against a count held by a thread still on the previous iteration?
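
A toy reduction of this guess (an assumption, not a confirmed diagnosis): num_of_reads is shared, and the thread loading the final, smaller batch can overwrite it while another thread still holds the count from the previous batch. Compile with -fopenmp:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
  unsigned num_of_reads = 0;  // shared between threads, as in the loader
#pragma omp parallel num_threads(2)
  {
    // thread 0 "loads" a full batch, thread 1 the short final batch
    const unsigned batch = omp_get_thread_num() == 0 ? 1000000u : 42u;
    num_of_reads = batch;  // unsynchronized write: a data race
#pragma omp barrier
    std::printf("thread %d now sees num_of_reads = %u\n",
                omp_get_thread_num(), num_of_reads);
  }
  return 0;
}
```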

Error with using WALT

Hello,

I'm in the process of running through the MethPipe pipeline using some practice data provided by your lab. I'm using WALT, as recommended, for mapping reads. I tried to create an index for hg38 but got an error saying 'bad index file', so I wasn't able to map my reads. I also used an index available on my lab's cluster and it gave me the same error. Why am I getting this error? Any ideas on how to fix it?

Verda

randomly output one mapped position for ambiguous reads

@mengzhou proposed that we can randomly output one mapped position for ambiguous reads. We may add an option to bsmapper so users can choose how ambiguous reads are reported: for example, -r 0 means the program reports only uniquely mapped reads or read pairs, while -r 1 means it randomly outputs one mapped position for ambiguous reads or read pairs (a sketch follows below).

What do you think @andrewdavidsmith ?
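
A sketch of the proposed -r 1 behavior, choosing uniformly among equally good candidate positions via reservoir sampling; the variable names and RNG choice are assumptions:

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
  const std::vector<uint32_t> ambiguous_hits = {1200, 53007, 991230};  // toy
  std::mt19937 gen(std::random_device{}());
  uint32_t chosen = ambiguous_hits.front();
  for (size_t i = 1; i < ambiguous_hits.size(); ++i)
    // keep candidate i with probability 1/(i+1): uniform overall
    if (std::uniform_int_distribution<size_t>(0, i)(gen) == 0)
      chosen = ambiguous_hits[i];
  std::cout << "reported position: " << chosen << '\n';
}
```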

Potential minor speedup during makedb

I noticed that during makedb, hash values of the genome are computed twice: once in CountBucketSize and again in HashToBucket. We could cache the hash values and save some time.
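
A sketch of the idea, with a toy 2-bit k-mer hash standing in for WALT's real hash function: compute each position's hash once into a cache, then let both passes read the cache:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// toy 2-bit code per base; WALT's real hash differs
static uint32_t hash_kmer(const std::string &g, const size_t pos,
                          const size_t k) {
  uint32_t h = 0;
  for (size_t i = 0; i < k; ++i)
    h = (h << 2) | ((g[pos + i] >> 1) & 3u);  // A,C,G,T -> 0,1,3,2
  return h;
}

int main() {
  const std::string genome = "ACGTACGTTTGGCCAA";
  const size_t k = 4, n = genome.size() - k + 1;
  std::vector<uint32_t> cached(n);
  for (size_t i = 0; i < n; ++i)
    cached[i] = hash_kmer(genome, i, k);  // computed once...
  // ...then CountBucketSize and HashToBucket both read cached[i]
  // instead of rehashing the genome
  std::cout << "cached " << cached.size() << " hash values\n";
}
```

Note the trade-off: for a human-sized genome (~3 billion positions) a uint32_t cache costs roughly 12 GB, so caching per chromosome may be more practical.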

index issue in SortHashTableBucket

In SortHashTableBucket, i starts from 0, but since hash_table.counter[0] is not necessarily zero, the genome positions corresponding to the first hash value never get sorted.
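
A toy illustration of the off-by-one (array names are assumptions): sorting the ranges [counter[i], counter[i+1]) from i = 0 never touches the first bucket, which occupies [0, counter[0]):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  std::vector<uint32_t> index = {9, 3, 7, 1, 8, 2};  // positions, 3 buckets
  const std::vector<uint32_t> counter = {2, 4, 6};   // bucket end offsets
  for (size_t i = 0; i + 1 < counter.size(); ++i)
    std::sort(index.begin() + counter[i], index.begin() + counter[i + 1]);
  // index is now {9, 3, 1, 7, 2, 8}: the first bucket {9, 3} was never
  // touched; the fix is to also sort [0, counter[0])
  std::sort(index.begin(), index.begin() + counter[0]);
  for (const uint32_t p : index) std::cout << p << ' ';
  std::cout << '\n';
}
```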

reduce the runtime for building index

Currently we build four indexes, one for each of four converted genomes:

| Genome | Description |
| --- | --- |
| hg19_C2T_forward.fa | hg19 with all Cs converted to Ts |
| hg19_C2T_reverse.fa | reverse complement of hg19, then all Cs converted to Ts |
| hg19_G2A_forward.fa | hg19 with all Gs converted to As |
| hg19_G2A_reverse.fa | reverse complement of hg19, then all Gs converted to As |

In fact, hg19_G2A_forward.fa is the reverse complement of hg19_C2T_reverse.fa, and hg19_G2A_reverse.fa is the reverse complement of hg19_C2T_forward.fa. So we only need to build two indexes, on hg19_C2T_forward.fa and hg19_C2T_reverse.fa; when mapping mate _2 reads, we can reverse the reads and map them to these two genomes.

This strategy does not improve mapping speed or reduce mapping memory. However, it saves about 28 GB of disk and halves the current indexing time.
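
A quick check of the identity this relies on: applying G->A to a sequence gives the same result as reverse complementing, applying C->T, and reverse complementing back. Toy code for verification only:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

static char comp(const char c) {
  switch (c) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default: return 'N';
  }
}

static std::string revcomp(std::string s) {
  std::reverse(s.begin(), s.end());
  for (char &c : s) c = comp(c);
  return s;
}

static std::string convert(std::string s, const char from, const char to) {
  for (char &c : s) if (c == from) c = to;
  return s;
}

int main() {
  const std::string g = "ACGTTGCATTGCCG";
  // hg19_G2A_forward == revcomp(hg19_C2T_reverse), where
  // hg19_C2T_reverse == C2T(revcomp(hg19))
  std::cout << (convert(g, 'G', 'A') == revcomp(convert(revcomp(g), 'C', 'T'))
                ? "identity holds" : "identity fails")
            << '\n';
}
```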

git clone

Hello, I think there is a typo in the README. When I tried git clone git@github.com:smithlabcode/walt.git, it gave some sort of authentication error. However, when I tried git clone https://github.com/smithlabcode/walt.git instead, it worked just fine...
Thanks!

Need logging messages to indicate stage of what walt is doing

This is pretty important for monitoring performance, and pretty typical for tools whose major stages have different performance characteristics. In particular, walt should have an option to report when certain files are being read.
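
A minimal sketch of what such logging could look like, with timestamps on stderr; the message format and helper name are assumptions:

```cpp
#include <ctime>
#include <iostream>
#include <string>

// hypothetical helper: timestamped stage messages on stderr
static void log_stage(const std::string &msg) {
  const std::time_t t = std::time(nullptr);
  char buf[32];
  std::strftime(buf, sizeof(buf), "%F %T", std::localtime(&t));
  std::cerr << '[' << buf << "] " << msg << std::endl;
}

int main() {
  log_stage("READING INDEX FILE hg19.dbindex_CT00");
  log_stage("MAPPING READS FROM reads_1.fq");
}
```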

Allowing mixed types of input for walt

We expect to have both paired-end and single-end reads from the same library when trim_galore is allowed to retain unpaired reads after QC. Therefore it is necessary for walt to accept this kind of mixed input.

Supporting comma-separated list of fastq files

Sometimes there are multiple FASTQ files for one library, and if we map them separately, those files need to be merged manually (painfully). It would be great if the mapper supported a comma-separated list of input files, such as:
-1 file1_1.fq,file2_1.fq -2 file1_2.fq,file2_2.fq
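
Parsing such an argument is straightforward; a sketch, assuming plain splitting with no escaping needed in file names:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> split_csv(const std::string &arg) {
  std::vector<std::string> files;
  std::istringstream iss(arg);
  std::string token;
  while (std::getline(iss, token, ','))
    if (!token.empty()) files.push_back(token);
  return files;
}

int main() {
  for (const auto &f : split_csv("file1_1.fq,file2_1.fq"))
    std::cout << f << '\n';
}
```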

[HASH TO BUCKET] Killed while building index

Hi, I am getting the following error while building the index:

src/walt/makedb -c ../../data/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa -o ../../local_storage/Homo_sapiens.GRCh38.dna.primary_assembly.dbindex
[IDENTIFYING CHROMS] [DONE]
chromosome files found (approx size):
../../data/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa (2948.00Mbp)
[BIULD INDEX FOR FORWARD STRAND (C->T)]
[READING CHROMOSOMES]
[THERE ARE 24 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 2899289802]
[COUNT BUCKET SIZE]
[NOTICE: ERASE THE BUCKET 1048575 SINCE ITS SIZE IS 525811]
[NOTICE: ERASE THE BUCKET 3355443 SINCE ITS SIZE IS 567042]
[NOTICE: ERASE THE BUCKET 4194303 SINCE ITS SIZE IS 1278425]
[NOTICE: ERASE THE BUCKET 12582911 SINCE ITS SIZE IS 913102]
[NOTICE: ERASE THE BUCKET 13421772 SINCE ITS SIZE IS 563069]
[NOTICE: ERASE THE BUCKET 13631487 SINCE ITS SIZE IS 1138569]
[NOTICE: ERASE THE BUCKET 15728639 SINCE ITS SIZE IS 827276]
[NOTICE: ERASE THE BUCKET 15990783 SINCE ITS SIZE IS 1038965]
[NOTICE: ERASE THE BUCKET 16515071 SINCE ITS SIZE IS 751298]
[NOTICE: ERASE THE BUCKET 16580607 SINCE ITS SIZE IS 1000895]
[NOTICE: ERASE THE BUCKET 16711679 SINCE ITS SIZE IS 745722]
[NOTICE: ERASE THE BUCKET 16728063 SINCE ITS SIZE IS 972454]
[NOTICE: ERASE THE BUCKET 16748543 SINCE ITS SIZE IS 500424]
[NOTICE: ERASE THE BUCKET 16760831 SINCE ITS SIZE IS 716515]
[NOTICE: ERASE THE BUCKET 16764927 SINCE ITS SIZE IS 938897]
[NOTICE: ERASE THE BUCKET 16773119 SINCE ITS SIZE IS 692583]
[NOTICE: ERASE THE BUCKET 16774143 SINCE ITS SIZE IS 930324]
[NOTICE: ERASE THE BUCKET 16775374 SINCE ITS SIZE IS 536685]
[NOTICE: ERASE THE BUCKET 16776191 SINCE ITS SIZE IS 720012]
[NOTICE: ERASE THE BUCKET 16776447 SINCE ITS SIZE IS 926400]
[NOTICE: ERASE THE BUCKET 16776755 SINCE ITS SIZE IS 634760]
[NOTICE: ERASE THE BUCKET 16776959 SINCE ITS SIZE IS 701955]
[NOTICE: ERASE THE BUCKET 16777011 SINCE ITS SIZE IS 528266]
[NOTICE: ERASE THE BUCKET 16777023 SINCE ITS SIZE IS 939339]
[NOTICE: ERASE THE BUCKET 16777151 SINCE ITS SIZE IS 684355]
[NOTICE: ERASE THE BUCKET 16777164 SINCE ITS SIZE IS 518922]
[NOTICE: ERASE THE BUCKET 16777167 SINCE ITS SIZE IS 983925]
[NOTICE: ERASE THE BUCKET 16777199 SINCE ITS SIZE IS 760257]
[NOTICE: ERASE THE BUCKET 16777200 SINCE ITS SIZE IS 506378]
[NOTICE: ERASE THE BUCKET 16777203 SINCE ITS SIZE IS 1116294]
[NOTICE: ERASE THE BUCKET 16777211 SINCE ITS SIZE IS 803074]
[NOTICE: ERASE THE BUCKET 16777212 SINCE ITS SIZE IS 1215358]
[NOTICE: ERASE THE BUCKET 16777214 SINCE ITS SIZE IS 976168]
[NOTICE: ERASE THE BUCKET 16777215 SINCE ITS SIZE IS 5724697]
[HASH TO BUCKET]
Killed

Any suggestions on how to fix it?

Thanks,

Karl

Add options to map a section of reads

I miss the options of rmapbs that set the index of the first read to map and the total number of reads to map. For large FASTQ files (e.g. 100 GB), mapping takes too long even with multi-threading.
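
A sketch of the requested behavior for FASTQ input (the start/count values, file name, and four-lines-per-record assumption are illustrative; rmapbs's actual flags may differ):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
  const std::size_t start = 1000000, count = 500000;  // hypothetical values
  std::ifstream in("reads.fq");                       // hypothetical input
  std::string name, seq, plus, qual;
  std::size_t read_idx = 0, processed = 0;
  while (processed < count && std::getline(in, name)) {
    std::getline(in, seq);   // FASTQ records are four lines each
    std::getline(in, plus);
    std::getline(in, qual);
    if (read_idx++ < start) continue;  // fast-forward past the first reads
    ++processed;  // a real implementation would map the read here
  }
  std::cout << "processed " << processed << " reads\n";
}
```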

indexing for non-human (HIV) genome

Hi,
I am trying to map paired-end FASTQ files from bisulfite-treated HIV LTR. The mapping step therefore uses the HIV LTR region as the reference rather than the human genome.
I tried to create ref.dbindex from my ref.fa file, but the output does not really make sense to me:

makedb -c ./reference -o ref.dbindex
[IDENTIFYING CHROMS] [DONE]
chromosome files found (approx size):
./reference/ref.fa (0.00Mbp)
[BIULD INDEX FOR FORWARD STRAND (C->T)]
[READING CHROMOSOMES]
[THERE ARE 1 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 352]
[COUNT BUCKET SIZE]
[HASH TO BUCKET]
[SORTING BUCKETS FOR HASH TABLE]
[WRITTING INDEX TO ref.dbindex_CT00]
[BIULD INDEX FOR REVERSE STRAND (C->T)]
[READING CHROMOSOMES]
[THERE ARE 1 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 352]
[COUNT BUCKET SIZE]
[HASH TO BUCKET]
[SORTING BUCKETS FOR HASH TABLE]
[WRITTING INDEX TO ref.dbindex_CT01]
[BIULD INDEX FOR FORWARD STRAND (G->A)]
[READING CHROMOSOMES]
[THERE ARE 1 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 352]
[COUNT BUCKET SIZE]
[HASH TO BUCKET]
[SORTING BUCKETS FOR HASH TABLE]
[WRITTING INDEX TO ref.dbindex_GA10]
[BIULD INDEX FOR REVERSE STRAND (G->A)]
[READING CHROMOSOMES]
[THERE ARE 1 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 352]
[COUNT BUCKET SIZE]
[HASH TO BUCKET]
[SORTING BUCKETS FOR HASH TABLE]
[WRITTING INDEX TO ref.dbindex_GA11]
[READING CHROMOSOMES]
[THERE ARE 1 CHROMOSOMES IN THE GENOME]
[THE TOTAL LENGTH OF ALL CHROMOSOMES IS 352]
[WRITTING INDEX HEAD TO ref.dbindex]

It does generate the following files:
ref.dbindex
ref.dbindex_CT00
ref.dbindex_CT01
ref.dbindex_GA10
ref.dbindex_GA11

When I try to map my paired-end reads to this indexed reference, it yields >99.9% unmapped reads... My guess is that it has to do with the indexing.

walt -i ./reference_notconverted/ref.dbindex -1 SAMPLE_R1_trimmed.fq -2 SAMPLE_R2_trimmed.fq -o C1WALT.sam
I am attaching the reference genome (bisulfite converted and not):

HIVLTR_BISUFITE.txt
HIVLTR.txt

All suggestions are welcome!
thanks !
a
