smithlabcode / rmap

Short reads mapper for next-generation sequencing data (DNA-seq, BS-seq, etc)

C++ 85.67% Shell 8.63% Makefile 5.70%

rmap's Introduction

This is the README file for the first release of RMAP version 2. RMAP
is a program for mapping reads from short-read sequencing technology
(such as Solexa/Illumina).


CONTACT INFORMATION:
========================================================================
Andrew D Smith
[email protected]
http://www.cmb.usc.edu/people/andrewds


SYSTEM REQUIREMENTS:
========================================================================
The RMAP software will only run on UNIX-like operating systems, and
was developed on Linux systems. The RMAP software requires a fairly
recent C++ compiler (i.e. it must include tr1 headers). RMAP has been
compiled and tested on Linux and OS X operating systems using GCC v4.1
or greater. Also, RMAP will only run on 64-bit machines.


INSTALLATION:
========================================================================
This should be easy: unpack the archive and change into the archive
directory. Then type 'make install'. A 'bin' directory will be created
in the current directory, and it will contain the program
binaries. These can be moved around, and also do not depend on any
dynamic libraries, so they should simply work when executed.


USAGE EXAMPLES:
========================================================================
Each program included in this software package will print a list of
options if executed without any command line arguments. Many of the
programs use similar options (for example, output files are specified
with '-o'). For the most basic usage of rmap to map reads, use the
command:

     rmap -o output.bed -c chroms_dir input_reads.fa

The output will appear in output.bed, and the output is in BED format
(see the UCSC Genome Browser Help documentation for details of this
format). Each line of the file indicates the mapping location for a
read, and the 'score' field in each line indicates the number of
mismatches.
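
As a rough illustration of reading that output, the sketch below parses one
BED line and takes the 'score' field as the mismatch count. It assumes the
standard six-column BED layout (chrom, start, end, name, score, strand) and
an integer score; the struct and function names are made up for this example
and are not part of RMAP.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type for one mapped read (not an RMAP type).
struct MappedRead {
  std::string chrom;
  unsigned long start = 0;
  unsigned long end = 0;
  std::string name;
  unsigned int mismatches = 0;  // the BED 'score' column
  char strand = '+';
};

// Parses a six-column, whitespace-separated BED line.
// Returns false if any of the six fields is missing or malformed.
bool parse_bed_line(const std::string &line, MappedRead &r) {
  std::istringstream in(line);
  return static_cast<bool>(in >> r.chrom >> r.start >> r.end >> r.name
                              >> r.mismatches >> r.strand);
}
```

A caller could read output.bed line by line with std::getline and pass each
line to parse_bed_line, for example to tabulate reads by mismatch count.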


FEATURES:
========================================================================
The second version of RMAP includes several features that were lacking
in the original version:

* QUALITY SCORES: Full use of quality scores, meaning quality scores
  for each base at each position can be used directly in the mapping.

* PAIRED-END READS: Paired-end reads can be mapped, and the procedure
  considers both ends at the time of initial mapping, rather than
  trying to identify mapping positions for each end separately and
  then evaluating whether they have appropriate distance and
  orientation.

* BISULFITE SEQUENCING: RMAP can map reads from bisulfite sequencing
  to an ordinary reference genome. The algorithm can exploit
  unconverted cytosines at non-CpG bases to add mapping specificity.
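
To make the bisulfite idea concrete: after bisulfite treatment, unmethylated
cytosines are read as T, so a T in a read is compatible with either a T or a
C in the reference. The sketch below counts mismatches under that rule for a
read aligned without gaps to the forward strand. It only illustrates the
matching rule; it is not RMAP's scoring code, and the function name is
invented here.

```cpp
#include <cstddef>
#include <string>

// Counts mismatches between a bisulfite-converted read and an equally long
// reference segment (forward strand, upper-case bases, no gaps).
// A read T is allowed to match a reference C (a converted cytosine).
std::size_t bs_mismatches(const std::string &read, const std::string &ref) {
  std::size_t mism = 0;
  for (std::size_t i = 0; i < read.size() && i < ref.size(); ++i) {
    const bool exact = (read[i] == ref[i]);
    const bool converted = (read[i] == 'T' && ref[i] == 'C');
    if (!exact && !converted) ++mism;
  }
  return mism;
}
```

Under this rule a C in the read still has to sit on a C in the reference,
which is how reads that retain cytosines end up more constrained than fully
converted ones.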


HISTORY
========================================================================
RMAP was originally developed by Andrew D Smith and Zhenyu Xuan at
Cold Spring Harbor Laboratory (in the lab of Michael Q Zhang). The
current (second) version was written by Andrew D Smith (presently
Assistant Professor in the Molecular and Computational Biology
Section, Department of Biological Sciences at University of Southern
California).

PERFORMANCE
========================================================================
RMAP was able to map 30M reads, each of 32nt, allowing up to 2
mismatches in 11.5 hours on a single core, which is more than 2.6M
reads/hour. Reads were simulated and sampled uniformly from the human
genome. This particular run required roughly 11GB of memory. You might
see slightly different performance, depending on computing hardware
and sequencing application.

LICENSE
========================================================================
The RMAP software for mapping reads from short-read sequencers
Copyright (C) 2009 Andrew D Smith and University of Southern California

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

rmap's Issues

Major problem: swapped default settings for parameters dealing with chunk start and size

The variables:

size_t n_reads_to_process = 0;
size_t read_start_index = std::numeric_limits<size_t>::max();

are swapped!
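
If the report is read as saying that the two initial values are exchanged, the intended defaults would presumably be "start at read 0" and "process every read". The sketch below spells out that reading; it is an assumption drawn from the issue text, not from the RMAP source, and the helper reads_in_chunk is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Presumed intent (an assumption based on the issue text): by default,
// start at the first read and place no limit on the number of reads.
const std::size_t default_read_start_index = 0;
const std::size_t default_n_reads_to_process =
    std::numeric_limits<std::size_t>::max();

// Hypothetical helper: number of reads covered by a chunk that begins at
// 'start' and spans at most 'n_to_process' reads, for 'total_reads' reads.
std::size_t reads_in_chunk(std::size_t total_reads, std::size_t start,
                           std::size_t n_to_process) {
  if (start >= total_reads) return 0;
  return std::min(n_to_process, total_reads - start);
}
```

With the values as quoted in the issue (start at the maximum, zero reads to process), the same helper yields an empty chunk, which would match the severity suggested by the title.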

rmapbs-pe input

Hi,

I have compiled rmap from bioconda on macOS X and am trying to run rmapbs-pe specifically on simulated data. The guide mentions that paired-end data should be concatenated into a single input file but the program will not accept this when it is run for example like this:

rmapbs-pe -c genome.fa -o output.bed input.fastq

This simply outputs the -help information.

When the paired-end files are instead given separately the program will run, but even with only 10 reads it never reaches completion.

rmapbs-pe -c genome.fa -o output.bed pe_1.fastq pe_2.fastq

Can anybody provide any ideas for how I could get this to run properly?

http://www.cmb.usc.edu/people/andrewds/rmap link broken

The project URL http://www.cmb.usc.edu/people/andrewds/rmap listed in http://www.ncbi.nlm.nih.gov/pubmed/19736251 is broken.

Could you get it to redirect here (GitHub) or to http://rulai.cshl.edu/rmap/ instead?

Unused common/load_paired_end_reads.*pp

load_paired_end_reads.cpp load_paired_end_reads.hpp
are not used anywhere in rmap.

To trim or not to trim...

I used trim_galore to perform trimming before mapping.
The error was:

[STARTING END ONE]
[LOADING READ SEQUENCES] 
Incorrect read width:
TTATTTTGACGTAAATTTTTGTTTTGTTTTATGTTTTATATTTTTTATTTGTCGTATAAGTATATAGATAGTCTATTTTTTATGTGGTTTATTTTTT

I guess rmap requires all reads to be of equal length. (This read is the 3rd in the *_1.fastq file after trimming, with length 97, whereas the 1st and 2nd are 100 bp.) I know rmap has trimming support. Does that mean I should not trim before running rmap and should use only rmap's trimmer?

rmap running time

Trying to map 4 million reads, every single job of mine has been failing at its 15 hour walltime for this new dataset. The only thing that has changed is that I am using the new version of rmap. Did this substantially slow it down? Do other people use a longer walltime?

Improve the way rmap loads reads

Currently rmap has the option to map only a certain subset of the reads in the input FASTQ file, to avoid physically splitting the file. This is done by the function load_reads_from_fastq_file(), which skips a number of lines according to the parameter. However, this has no performance advantage over splitting, because the program still has to read all the lines before the starting point.

It can be improved by using seek functionality which takes constant time. Say we know the total size of a fastq file, and the user wants to split the file to 10 parts, to only map the 7th part of the file:

set the file pointer to 7/10 of the total bytes;
locate the nearest end of line (and from there the next record boundary) after that point;
load the following reads and count how many bytes have been loaded;
stop after loading the read that exceeds the 1/10 total bytes.

The breaking points can be more accurate if read name length is taken into consideration.

rmap will not compile on os x

The Makefile needs that @pjuren fix that we have included in all other Makefiles for smith lab stuff.

Broken dependencies on GenomicRegion

I tried to compile rmapbs and it failed due to a missing header for GenomicRegion.hpp in rmapbs.cpp. I think this is one of those cases where one must be defensive about headers. In any case, this is a bug.

Name processing for PE reads

From FASTQ Description

The sequence identifier
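@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>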
is not treated properly by rmapbs-pe

e.g.
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
@EAS139:136:FC706VJ:2:5:1000:12850 2:Y:18:ATCACG
would lead to a name like FRAG:EAS139:136:FC706VJ:2:5:1000:1285
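
For comparison, here is a sketch of name handling that would give both mates in the example the same fragment name without dropping the final digit. It is only an illustration with an invented function name, not the code used by rmapbs-pe; the "/1" and "/2" case covers the older identifier style in which the mate number is a suffix.

```cpp
#include <string>

// Derives a name shared by both mates of a pair.
// - Identifiers with a space or tab: keep only the part before it, which
//   drops the "<read>:<is filtered>:<control number>:<index>" block.
// - Identifiers ending in "/1" or "/2": drop that suffix.
// A leading '@' from the FASTQ header line is removed if present.
std::string fragment_name(std::string id) {
  if (!id.empty() && id[0] == '@') id.erase(0, 1);
  const std::string::size_type ws = id.find_first_of(" \t");
  if (ws != std::string::npos) return id.substr(0, ws);
  const std::string::size_type n = id.size();
  if (n >= 2 && id[n - 2] == '/' && (id[n - 1] == '1' || id[n - 1] == '2'))
    id.erase(n - 2);
  return id;
}
```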

make fails with gcc-4.9

$ make
make[1]: Entering directory `/home/saket/Desktop/rmap-2.1/src'
make[2]: Entering directory `/home/saket/Desktop/rmap-2.1/src/mappers'
make[3]: Entering directory `/home/saket/Desktop/rmap-2.1/src/mappers/rmap'
g++ -Wall -fPIC -fmessage-length=50 -O3 -o rmap rmap.cpp ../../common/SeedMaker.o ../../common/FastRead.o ../../common/load_reads.o ../../common/clip_adaptor_from_reads.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//GenomicRegion.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//smithlab_os.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//smithlab_utils.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//OptionParser.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//QualityScore.o -I../../common -I/home/saket/Desktop/rmap-2.1/src/smithlab_cpp/
rmap.cpp: In function ‘void 
   iterate_over_seeds(bool, const U&, const 
   std::vector<long unsigned int>&, const 
   std::vector<std::basic_string<char> >&, 
   std::vector<std::pair<unsigned int, unsigned 
   int> >&, std::vector<std::basic_string<char> 
   >&, std::vector<long unsigned int>&, 
   std::vector<T>&, std::vector<long unsigned 
   int>&, std::vector<unsigned int>&, 
   std::vector<MultiMapResult>&, size_t, 
   size_t)’:
rmap.cpp:328:70: error: there are no arguments to 
   ‘read_fasta_file’ that depend on a 
   template parameter, so a declaration of 
   ‘read_fasta_file’ must be available 
   [-fpermissive]
 chrom_files[i].c_str(), tmp_chrom_names, chroms);
                                                ^
rmap.cpp:328:70: note: (if you use 
   ‘-fpermissive’, G++ will accept your code, 
   but allowing the use of an undeclared name is 
   deprecated)
rmap.cpp: In function ‘void 
   identify_chromosomes(bool, std::string, 
   std::string, 
   std::vector<std::basic_string<char> >&)’:
rmap.cpp:388:31: error: ‘isdir’ was not 
   declared in this scope
   if (isdir(chrom_file.c_str())) 
                               ^
rmap.cpp:389:51: error: ‘read_dir’ was not 
   declared in this scope
  read_dir(chrom_file, fasta_suffix, chrom_files);
                                                ^
rmap.cpp:396:51: error: ‘get_filesize’ was 
   not declared in this scope
 < *i << " (" << roundf(get_filesize(*i)/1e06) << 
                                       ^
rmap.cpp: In function ‘int main(int, const 
   char**)’:
rmap.cpp:559:46: error: ‘strip_path’ was not 
   declared in this scope
 ionParser opt_parse(strip_path(argv[0]), "map Ill
                                       ^
rmap.cpp: In instantiation of ‘void iterate_over_seeds(bool, const U&, const std::vector<long unsigned int>&, const std::vector<std::basic_string<char> >&, std::vector<std::pair<unsigned int, unsigned int> >&, std::vector<std::basic_string<char> >&, std::vector<long unsigned int>&, std::vector<T>&, std::vector<long unsigned int>&, std::vector<unsigned int>&, std::vector<MultiMapResult>&, size_t, size_t) [with T = FastRead; U = wildcard_score; size_t = long unsigned int]’:
rmap.cpp:647:47:   required from here
rmap.cpp:328:70: error: ‘read_fasta_file’ was 
   not declared in this scope
 chrom_files[i].c_str(), tmp_chrom_names, chroms);
                                                ^
make[3]: *** [rmap] Error 1
make[3]: Leaving directory `/home/saket/Desktop/rmap-2.1/src/mappers/rmap'
make[3]: Entering directory `/home/saket/Desktop/rmap-2.1/src/mappers/rmapbs'
g++ -Wall -fPIC -fmessage-length=50 -O3 -o rmapbs rmapbs.cpp ../../common/SeedMaker.o ../../common/FastRead.o ../../common/load_reads.o ../../common/clip_adaptor_from_reads.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//GenomicRegion.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//smithlab_os.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//smithlab_utils.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//OptionParser.o /home/saket/Desktop/rmap-2.1/src/smithlab_cpp//QualityScore.o -I../../common -I/home/saket/Desktop/rmap-2.1/src/smithlab_cpp/
rmapbs.cpp: In function ‘void 
   iterate_over_seeds(bool, bool, const U&, const 
   std::vector<long unsigned int>&, const 
   std::vector<std::basic_string<char> >&, 
   std::vector<std::pair<unsigned int, unsigned 
   int> >&, std::vector<std::basic_string<char> 
   >&, std::vector<long unsigned int>&, 
   std::vector<T>&, std::vector<long unsigned 
   int>&, std::vector<unsigned int>&, 
   std::vector<MultiMapResult>&, size_t, 
   size_t)’:
rmapbs.cpp:428:70: error: there are no arguments 
   to ‘read_fasta_file’ that depend on a 
   template parameter, so a declaration of 
   ‘read_fasta_file’ must be available 
   [-fpermissive]
 chrom_files[i].c_str(), tmp_chrom_names, chroms);
                                                ^
rmapbs.cpp:428:70: note: (if you use 
   ‘-fpermissive’, G++ will accept your code, 
   but allowing the use of an undeclared name is 
   deprecated)
rmapbs.cpp: In function ‘void 
   identify_chromosomes(bool, std::string, 
   std::string, 
   std::vector<std::basic_string<char> >&)’:
rmapbs.cpp:500:31: error: ‘isdir’ was not 
   declared in this scope
   if (isdir(chrom_file.c_str()))
                               ^
rmapbs.cpp:501:51: error: ‘read_dir’ was not 
   declared in this scope
  read_dir(chrom_file, fasta_suffix, chrom_files);
                                                ^
rmapbs.cpp:508:51: error: ‘get_filesize’ was 
   not declared in this scope
 < *i << " (" << roundf(get_filesize(*i)/1e06) << 
                                       ^
rmapbs.cpp: In function ‘int main(int, const 
   char**)’:
rmapbs.cpp:676:46: error: ‘strip_path’ was 
   not declared in this scope
 ionParser opt_parse(strip_path(argv[0]), "map Ill
                                       ^
rmapbs.cpp: In instantiation of ‘void iterate_over_seeds(bool, bool, const U&, const std::vector<long unsigned int>&, const std::vector<std::basic_string<char> >&, std::vector<std::pair<unsigned int, unsigned int> >&, std::vector<std::basic_string<char> >&, std::vector<long unsigned int>&, std::vector<T>&, std::vector<long unsigned int>&, std::vector<unsigned int>&, std::vector<MultiMapResult>&, size_t, size_t) [with T = FastRead; U = wildcard_score; size_t = long unsigned int]’:
rmapbs.cpp:766:47:   required from here
rmapbs.cpp:428:70: error: ‘read_fasta_file’ 
   was not declared in this scope
 chrom_files[i].c_str(), tmp_chrom_names, chroms);
                                                ^
make[3]: *** [rmapbs] Error 1
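
The first error in each log is GCC's diagnostic for an unqualified call inside a template when none of the call's arguments depend on a template parameter: such a name is looked up where the template is defined, so a declaration has to be visible at that point. The remaining "not declared in this scope" errors are also consistent with a missing header, though which header that is cannot be told from the log alone. A minimal sketch of the rule, with made-up names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a helper that would normally be declared in a header.
// This declaration must appear before the template that calls it.
std::vector<std::string> read_names(const char *filename);

template <class T>
std::size_t count_names(const T & /*unused*/, const char *filename) {
  // 'filename' does not depend on T, so read_names is looked up here,
  // at definition time, and not later when the template is instantiated.
  return read_names(filename).size();
}

// Toy definition so the sketch links: returns two copies of the filename.
std::vector<std::string> read_names(const char *filename) {
  return std::vector<std::string>(2, std::string(filename));
}
```

Removing the forward declaration at the top reproduces the "there are no arguments to ... that depend on a template parameter" error with a recent GCC.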

I just realized that src/utils is gone

Was it removed at some time point? I wanted to use sigoverlap and found out that the whole utils directory is gone...