Giter VIP home page Giter VIP logo

minimap2's Introduction

GitHub Downloads BioConda Install PyPI Build Status

Getting Started

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam       # PacBio CLR genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam         # Oxford Nanopore genomic reads
./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
./minimap2 -ax lr:hq ref.fa ont-Q20.fq.gz > aln.sam       # Nanopore Q20 genomic reads (v2.27 or later)
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam       # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam  # noisy Nanopore Direct RNA-seq
./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam    # Final PacBio Iso-seq or traditional cDNA
./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam  # prioritize on annotated junctions
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf             # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf     # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf    # Nanopore read overlap
# man page for detailed command line options
man ./minimap2.1

Table of Contents

Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

Installation

Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from the release page with:

curl -L https://github.com/lh3/minimap2/releases/download/v2.28/minimap2-2.28_x64-linux.tar.bz2 | tar -jxvf -
./minimap2-2.28_x64-linux/minimap2

If you want to compile from the source, you need to have a C compiler, GNU make and zlib development files installed. Then type make in the source code directory to compile. If you see compilation errors, try make sse2only=1 to disable SSE4 code, which will make minimap2 slightly slower.

Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To compile for 32 bit ARM architectures (such as ARMv7), use make arm_neon=1. To compile for for 64 bit ARM architectures (such as ARMv8), use make arm_neon=1 aarch64=1.

Minimap2 can use SIMD Everywhere (SIMDe) library for porting implementation to the different SIMD instruction sets. To compile using SIMDe, use make -f Makefile.simde. To compile for ARM CPUs, use Makefile.simde with the ARM related command lines given above.

General usage

Without any options, minimap2 takes a reference database and a query sequence file as input and produce approximate mapping, without base-level alignment (i.e. coordinates are only approximate and no CIGAR in output), in the PAF format:

minimap2 ref.fa query.fq > approx-mapping.paf

You can ask minimap2 to generate CIGAR at the cg tag of PAF with:

minimap2 -c ref.fa query.fq > alignment.paf

or to output alignments in the SAM format:

minimap2 -a ref.fa query.fq > alignment.sam

Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You don't need to convert between FASTA and FASTQ or decompress gzip'd files first.

For the human reference genome, minimap2 takes a few minutes to generate a minimizer index for the reference before mapping. To reduce indexing time, you can optionally save the index with option -d and replace the reference sequence file with the index file on the minimap2 command line:

minimap2 -d ref.mmi ref.fa                     # indexing
minimap2 -a ref.mmi reads.fq > alignment.sam   # alignment

Importantly, it should be noted that once you build the index, indexing parameters such as -k, -w, -H and -I can't be changed during mapping. If you are running minimap2 for different data types, you will probably need to keep multiple indexes generated with different parameters. This makes minimap2 different from BWA which always uses the same index regardless of query data types.

Use cases

Minimap2 uses the same base algorithm for all applications. However, due to the different data types it supports (e.g. short vs long reads; DNA vs mRNA reads), minimap2 needs to be tuned for optimal performance and accuracy. It is usually recommended to choose a preset with option -x, which sets multiple parameters at the same time. The default setting is the same as map-ont.

Map long noisy genomic reads

minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio CLR reads
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam      # for Oxford Nanopore reads
minimap2 -ax map-iclr ref.fa iclr-reads.fq > aln.sam    # for Illumina Complete Long Reads

The difference between map-pb and map-ont is that map-pb uses homopolymer-compressed (HPC) minimizers as seeds, while map-ont uses ordinary minimizers as seeds. Empirical evaluation suggests HPC minimizers improve performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning Nanopore reads. map-iclr uses an adjusted alignment scoring matrix that accounts for the low overall error rate in the reads, with transversion errors being less frequent than transitions.

Map long mRNA/cDNA reads

minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam       # PacBio Iso-seq/traditional cDNA
minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam        # Nanopore 2D cDNA-seq
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam  # Nanopore Direct RNA-seq
minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa    # mapping against SIRV control

There are different long-read RNA-seq technologies, including tranditional full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq. They produce data of varying quality and properties. By default, -x splice assumes the read orientation relative to the transcript strand is unknown. It tries two rounds of alignment to infer the orientation and write the strand to the ts SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and tranditional full-length cDNAs, it would be desired to apply -u f to force minimap2 to consider the forward transcript strand only. This speeds up alignment with slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is recommended to use a smaller k-mer size for increased sensitivity to the first or the last exons.

Minimap2 rates an alignment by the score of the max-scoring sub-segment, excluding introns, and marks the best alignment as primary in SAM. When a spliced gene also has unspliced pseudogenes, minimap2 does not intentionally prefer spliced alignment, though in practice it more often marks the spliced alignment as the primary. By default, minimap2 outputs up to five secondary alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This can be tuned with option -N.

For long RNA-seq reads, minimap2 may produce chimeric alignments potentially caused by gene fusions/structural variations or by an intron longer than the max intron length -G (200k by default). For now, it is not recommended to apply an excessively large -G as this slows down minimap2 and sometimes leads to false alignments.

It is worth noting that by default -x splice prefers GT[A/G]..[C/T]AG over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering one additional base improves the junction accuracy for noisy reads, but reduces the accuracy when aligning against the widely used SIRV control data. This is because SIRV does not honor the evolutionarily conservative splicing signal. If you are studying SIRV, you may apply --splice-flank=no to let minimap2 only model GT..AG, ignoring the additional base.

Since v2.17, minimap2 can optionally take annotated genes as input and prioritize on annotated splice junctions. To use this feature, you can

paftools.js gff2bed anno.gff > anno.bed
minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam

Here, anno.gff is the gene annotation in the GTF or GFF3 format (gff2bed automatically tests the format). The output of gff2bed is in the 12-column BED format, or the BED12 format. With the --junc-bed option, minimap2 adds a bonus score (tuned by --junc-bonus) if an aligned junction matches a junction in the annotation. Option --junc-bed also takes 5-column BED, including the strand field. In this case, each line indicates an oriented junction.

Find overlaps between long reads

minimap2 -x ava-pb  reads.fq reads.fq > ovlp.paf    # PacBio CLR read overlap
minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf    # Oxford Nanopore read overlap

Similarly, ava-pb uses HPC minimizers while ava-ont uses ordinary minimizers. It is usually not recommended to perform base-level alignment in the overlapping mode because it is slow and may produce false positive overlaps. However, if performance is not a concern, you may try to add -a or -c anyway.

Map short accurate genomic reads

minimap2 -ax sr ref.fa reads-se.fq > aln.sam           # single-end alignment
minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam     # paired-end alignment
minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam  # paired-end alignment

When two read files are specified, minimap2 reads from each file in turn and merge them into an interleaved stream internally. Two reads are considered to be paired if they are adjacent in the input stream and have the same name (with the /[0-9] suffix trimmed if present). Single- and paired-end reads can be mixed.

Minimap2 does not work well with short spliced reads. There are many capable RNA-seq mappers for short reads.

Full genome/assembly alignment

minimap2 -ax asm5 ref.fa asm.fa > aln.sam       # assembly to assembly/ref alignment

For cross-species full-genome alignment, the scoring system needs to be tuned according to the sequence divergence.

Advanced features

Working with >65535 CIGAR operations

Due to a design flaw, BAM does not work with CIGAR strings with >65535 operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2 may align ~1% of read bases with long CIGARs beyond the capability of BAM. If you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an error and abort. Older samtools and other tools may create corrupted BAM.

To avoid this issue, you can add option -L at the minimap2 command line. This option moves a long CIGAR to the CG tag and leaves a fully clipped CIGAR at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and sorting) still work with such BAM records; tools that read CIGAR will effectively ignore these records. It has been decided that future tools will seamlessly recognize long-cigar records generated by option -L.

TL;DR: if you work with ultra-long reads and use tools that only process BAM files, please add option -L.

The cs optional tag

The cs SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular expression /(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/. Like CIGAR, cs consists of series of operations. Each leading character specifies the operation; the following sequence is the one involved in the operation.

The cs tag is enabled by command line option --cs. The following alignment, for example:

CGATCGATAAATAGAGTAG---GAATAGCA
||||||   ||||||||||   |||| |||
CGATCG---AATAGAGTAGGTCGAATtGCA

is represented as :6-ata:10+gtc:4*at:3, where :[0-9]+ represents an identical block, -ata represents a deletion, +gtc an insertion and *at indicates reference base a is substituted with a query base t. It is similar to the MD SAM tag but is standalone and easier to parse.

If --cs=long is used, the cs string also contains identical sequences in the alignment. The above example will become =CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA. The long form of cs encodes both reference and query sequences in one string. The cs tag also encodes intron positions and splicing signals (see the minimap2 manpage for details).

Working with the PAF format

Minimap2 also comes with a (java)script paftools.js that processes alignments in the PAF format. It calls variants from assembly-to-reference alignment, lifts over BED files based on alignment, converts between formats and provides utilities for various evaluations. For details, please see misc/README.md.

Algorithm overview

In the following, minimap2 command line options have a dash ahead and are highlighted in bold. The description may help to tune minimap2 parameters.

  1. Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.

  2. Read -K [=200M] query bases. For each query sequence, do step 3 through 7:

  3. For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its the occurrences in the reference, which are called seeds.

  4. Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do step 5 through 7:

  5. Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.

  6. Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.

  7. If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the accumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.

  8. If there are more query sequences in the input, go to step 2 until no more queries are left.

  9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.

Getting help

Manpage minimap2.1 provides detailed description of minimap2 command line options and optional tags. The FAQ page answers several frequently asked questions. If you encounter bugs or have further questions or requests, you can raise an issue at the issue page. There is not a specific mailing list for the time being.

Citing minimap2

If you use minimap2 in your work, please cite:

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

and/or:

Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

Developers' Guide

Minimap2 is not only a command line tool, but also a programming library. It provides C APIs to build/load index and to align sequences against the index. File example.c demonstrates typical uses of C APIs. Header file minimap.h gives more detailed API documentation. Minimap2 aims to keep APIs in this header stable. File mmpriv.h contains additional private APIs which may be subjected to changes frequently.

This repository also provides Python bindings to a subset of C APIs. File python/README.rst gives the full documentation; python/minimap2.py shows an example. This Python extension, mappy, is also available from PyPI via pip install mappy or from BioConda via conda install -c bioconda mappy.

Limitations

  • Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.

  • Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is possible to add non-SIMD support, but it would make minimap2 slower by several times.

  • Minimap2 does not work with a single query or database sequence ~2 billion bases or longer (2,147,483,647 to be exact). The total length of all sequences can well exceed this threshold.

  • Minimap2 often misses small exons.

minimap2's People

Contributors

1pakch avatar apregier avatar blawrence-ont avatar cjw85 avatar fedarko avatar hasindu2008 avatar iiseymour avatar jmarshall avatar jts avatar junaruga avatar kevinxchan avatar kojix2 avatar lh3 avatar marcus1487 avatar markbicknellont avatar martinghunt avatar mbrcic avatar mcshane avatar mikolmogorov avatar mvdbeek avatar nh13 avatar pesho-ivanov avatar piezoid avatar rikuu avatar rzlim08 avatar swvondeylen-ont avatar torhou avatar tseemann avatar xdu-diagnoa avatar zingdle avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

minimap2's Issues

All but one alignment pieces disappear

According to #16 , I use htsbox samview to convert SAM produced from minimap2 since CIGAR are too long.

However there is only one local alignment on IGV (igv version 2.3.97).

Below is the content in SAM
(CIGAR and subsequent columns are hidden for convinience)
(The final piece of alignment is the only one shown on IGV)

@ PG ID:minimap2 PN:minimap2 VN:2.1.1-r341 CL:minimap2 -a -x splice result/ref.mmi result/out.fa
@ SQ SN:gi|732682658|gb|CP009685.1| LN:4636831
utg000001c 16 gi|732682658|gb|CP009685.1| 143874 60 21M1D34M . . .
utg000001c 2064 gi|732682658|gb|CP009685.1| 2384241 60 2316725H5 . . .
utg000001c 2064 gi|732682658|gb|CP009685.1| 1 60 4645502H . . .

I also use bwa-mem to do alignment , using samtools view to convert SAM , and there is no problem at all.
(You can download mentioned files below)
ref
query
sam : minimap2
sam : bwa-mem

It seems there are some problems in htsbox samview (?) ; are there any other options to handle with long CIGAR?

parallelization with -sr

with short reads, the default parallelization isn't keeping more than 3 of 4 cores busy most of the time.

It seems to improve quite a bit by bumping the batch size. I tried doing this from the command-line with K, but I think maybe it gets overridden by the -x sr opts?

If I modify the source as below, it keeps more cores (tested on 12 cpus) busy for a higher percentage of time.

diff --git a/map.c b/map.c
index ac66e81..2273d32 100644
--- a/map.c
+++ b/map.c
@@ -98,7 +98,7 @@ int mm_set_opt(const char *preset, mm_idxopt_t *io, mm_mapopt_t *mo)
                mo->best_n = 20;
                mo->mid_occ = 1000;
                mo->max_occ = 5000;
-               mo->mini_batch_size = 50000000;
+               mo->mini_batch_size = 250000000;
        } else if (strcmp(preset, "splice") == 0 || strcmp(preset, "cdna") == 0) {
                io->is_hpc = 0, io->k = 15, io->w = 5;
                mo->flag |= MM_F_SPLICE | MM_F_SPLICE_FOR | MM_F_SPLICE_REV;

PAF output

Am I right that the lastest commit does not produce PAF output anymore?

pysam aligned length after aligning with -L

This issue is maybe more appropriate at the pysam repo, but I'm not sure.

I noticed that after aligning reads with the -L option for ultra long reads the aligned read length (read.query_alignment_length) in pysam is 0 for some of those ultra-long reads since it uses the CIGAR to determine the aligned read length. I encounter the issue for 3 reads in the ultra long e.coli dataset from Josh Quick and Nick Loman.

The (clipped) cigar strings are 744361S, 760365S and 408232S derived from reads with 744432bp, 760436bp and 408303bp. This is probably a corner case, but sufficient to crash a script I wrote due to a division by zero error. Due to the very short alignment, I'm inclined to drop this read entirely from the dataset but that's a drastic solution tailored to this scripts.

What would be the recommended workaround here to get the aligned read length, without using the CIGAR?

Is the -L solution for long CIGARs temporary until the bam format is adapted? If so, perhaps no changes have to be made to pysam.

Cheers,
Wouter

Warning: no @SQ lines will be outputted

I am working on Oxford nanopore reads(1D). I am using Minimap2 for aligning it to the human genome. I made an index file for the human genome to be used as a reference. when I started the alignment I got following warning.
"For a multi-part index, no @sq lines will be outputted"
The alignment is finished. The .sam file does not contain any @sq headers, as a result, I could not get any mapping statistics using samtools flagstat.
The command I used for building index:

minimap2 -d ref.mmi ref.fa

The command I used for alignment:

minimap2 -ax map-ont ref.fa ont-reads.fa > aln.sam
I am not sure why I am getting this warning.

Alignment Strings

Dear Heng,

would it be possible to optionally output the reference and query alignment string (potentially without hard and soft clipped sequences) in the PAF format?

Thanks,
Felix

got disagree between single sequence align to single and multi sequences align to multi

Hi Li~
the multil sequences file A contain a sequence ID: 000130F_002|quiver|quiver|pilon ,when align file A to ref,get :

000130F_002|quiver|quiver|pilon 6708 196 6692 + Contig44 6708 196 6692 3781 6496 60 tp:A:P cm:i:329 s1:i:3781 s2:i:1648 , the identity is 3781/6496

but when I extract 000130F_002|quiver|quiver|pilon to single sequence file B,and align file B to ref,get:

000130F_002|quiver|quiver|pilon 6708 9 6692 + Contig44 6708 9 6692 6683 6683 60 tp:A:P cm:i:661 s1:i:6683 s2:i:0, the identity is 6683/6683=1

so the both identity is not equal, so why ?

Thank~
Si

No SQ lines

14b8534 (Heng Li 2017-09-14 22:44:10 -0400 258) while ((mi = mm_idx_reader_read(idx_rdr, n_threads)) != 0) {
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 259) if ((opt.flag & MM_F_OUT_SAM) && idx_rdr->n_parts == 1) {
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 260) if (mm_idx_reader_eof(idx_rdr)) {
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 261) mm_write_sam_hdr(mi, rg, MM_VERSION, argc, argv);
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 262) } else {
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 263) mm_write_sam_hdr(0, rg, MM_VERSION, argc, argv);
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 264) if (mm_verbose >= 2)
f4a5d3a (Heng Li 2017-10-05 15:03:03 -0400 265) fprintf(stderr, "[WARNING]\033[1;31m For a multi-part index, no @
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 266) }
7d50e64 (Heng Li 2017-10-04 17:32:58 -0400 267) }

This suppresses all SQ lines for mappings against references with more than one sequence, right?
Samtools can't work with the output. How is that supposed to be used?

Thanks
Andy

Multimapping reads: no sequence and quality scores for all but first alignment

Hi,

I'm not sure if it's a bug, but for me, this was for sure unexpected behaviour.

I just installed minimap2, so I'm using version 2.0-r301-dirty
I noticed that for a multimapping read (in my example mapping to 6 locations) only the first aligned read gets the sequence and quality in the bam/sam output. The remaining 5 have the CIGAR, followed by the tags. I am using the NA12878 ONT dataset downloaded from here.

A single read provoking this can be found here in my dropbox.

The bam file obtained from this read can be found here in my dropbox.

The command I used was:
minimap2 -ax map10k -t 48 Homo_sapiens.GRCh38.dna.primary_assembly.fa provokingread.fastq.gz > provokingread.sam

Cheers,
Wouter

Cannot open minimap2-created sam file in Geneious

Hello,

I recently attempted to open a sam file I created using Minimap2 in Geneious and was not able to open it. I tried to validate the file using picard ValidateSamFile, and received the following errors:

HISTOGRAM java.lang.String

Error Type Count
ERROR:MISMATCH_CIGAR_SEQ_LENGTH 50105
ERROR:MISSING_READ_GROUP 1
WARNING:ADJACENT_INDEL_IN_CIGAR 1
WARNING:RECORD_MISSING_READ_GROUP 939614

I can provide the files I used if necessary, but is this a known issue or more likely something to do with my specific data? Is there a way to resolve it?

Thank you!

Amelia

minimap2.sh: line 34: 17631 Illegal instruction (core dumped)

Hi,

just can't figure out why minimap2 keeps failing, kindly assist me. The command am running is shown in the last line. Thanks

[M::mm_idx_gen::28.344*1.30] collected minimizers
[M::mm_idx_gen::32.368*2.39] sorted minimizers
[M::main::32.368*2.39] loaded/built the index for 55494 target sequence(s)
[M::mm_mapopt_update::37.048*2.21] mid_occ = 266; max_occ = 1612
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 55494
[M::mm_idx_stat::38.676*2.16] distinct minimizers: 63186547 (65.76% are singletons); average occurrences: 2.262; average spacing: 3.042
minimap2.sh: line 34: 17631 Illegal instruction     (core dumped) minimap2 -ax splice -t 23 olivefly_supernova_v1p1_corrvalid_group_gt10_500000bc_74X_seed0.pseudohap.1.fasta Cc.E.5H_all_rnaseq_combined_pass.trimmed_stranded.fasta > Ccap01172013-genome-Cc.E.5H_all_rnaseq_combined_pass.trimmed_stranded.sam

Invalid Overlap while running canu correction on RNASeq files with minimap2

I'm getting 'INVALID OVERLAP' messages from canu when running read correction on RNASeq files with minimap2. Here's one of the outputs:

$ cat  mmap.000015.out 
Running job 15 based on command line options.
Fetch blocks/000006.fasta
Fetch blocks/000007.fasta
[M::main::1.697*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.170*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.481*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::151.206*11.13] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi ./blocks/000005.fasta
[M::main] Real time: 151.244 sec; CPU: 1682.280 sec
[M::main::1.601*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.029*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.343*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::236.382*11.22] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi queries/000015/000006.fasta
[M::main] Real time: 236.419 sec; CPU: 2652.084 sec
[M::main::1.630*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.084*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.399*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::278.415*11.23] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi queries/000015/000007.fasta
[M::main] Real time: 278.465 sec; CPU: 3126.120 sec
INVALID OVERLAP   775356 (len   2736)   768001 (len   1596) hangs   1205   1300 - 2093699   4806 flip 1

The offending minimap2 line for this overlap step seems to be the following (headers from canu source code, minimap/mmapConvert.C have been added):

aiid    alen     bgn     end bori         biid  blen     bgn     end #match   minimizers   alnlen     cm:i:errori
775356  2736    1205    1436    -       768001  1596    4806    5049    192          259        0          tp:A:I  cm:i:0  s1:i:0  s2:i:0  NM:i:67 ms:i:96 AS:i:96 nn:i:0  cg:Z:19M1D4M1D2M1D12M1I4M2D16M1I5M2D3M1I25M6I3M1I7M1D6M3D9M1I2M1I15M3I5M1I5M1D2M2D5M1D4M2D25M1D7M1D3M2D10M2D2M2D2M1D6M2D7M

I can't easily work out if this is a minimap2 issue or a canu issue. It seems a bit odd that the start and end of a match region exceed the sequence length (i.e. blen 1596, bgn 4806; end 5049), but that could just be an unexpected file format change between minimap and minimap2.

Metadata

Canu v1.6 command:

~/install/canu/canunu-1.6/Linux-amd64/bin/canu -correct -nanopore-corrected fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=100 minOverlapLength=50 genomeSize=500M overlapper=minimap

The minimap2 executable was copied into the canu directory:

$ ~/install/canu/canu-1.6/Linux-amd64/bin/minimap2  -V
2.2-r424-dirty

Running on Debian Linux:
Linux elegans 4.9.0-3-amd64 #1 SMP Debian 4.9.25-1 (2017-05-02) x86_64 GNU/Linux

minimap2 gDNA stress test

Here are three assembled contigs (generated from nanopore reads from a rodent parasite, Nippostrongylus brasiliensis) that have a highly repetitive structure:

16.fa.gz

doing an all-vs-all mapping of these sequences (1.38Mb total) takes about 20 seconds:

$ time ~/install/minimap2/minimap2 -X -x ava-ont 16.fa 16.fa > /dev/null
[M::mm_idx_gen::0.101*0.99] collected minimizers
[M::mm_idx_gen::0.132*1.45] sorted minimizers
[M::main::0.132*1.45] loaded/built the index for 3 target sequence(s)
[M::mm_mapopt_update::0.158*1.39] mid_occ = 246; max_occ = 674
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 3
[M::mm_idx_stat::0.168*1.36] distinct minimizers: 392121 (96.73% are singletons); average occurrences: 1.205; average spacing: 2.919
[M::worker_pipeline::18.873*1.26] mapped 3 sequences
[M::main] Version: 2.0rc1-r274-dirty
[M::main] CMD: /home/gringer/install/minimap2/minimap2 -X -x ava-ont 16.fa 16.fa
[M::main] Real time: 18.883 sec; CPU: 23.788 sec

real    0m18.889s
user    0m23.732s
sys     0m0.064s

I'm not completely sure about the PAF file output, so don't know if what I'm seeing matches what I'm thinking, but this line seems to suggest a match at 16% identity (19567 / 122212), which seems a bit low:

16_tig00006776  116440  410     115255  +       16_tig00006776  116440  1786    116430  19567   122212  0       tp:A:S  cm:i:4232       s1:i:21043

I leave this file here for further exploration.

SAM header interspersed through file

Hi Heng,

Just came across a bizarre issue which I have been debugging all day to try and get at the root cause.

To cut a long story short, it looks like in some circumstances the SAM header gets spread out in chunks throughout the SAM file - almost like there are multiple SAM files concatenated together.

I only see this when aligning to a reference that is effectively a database of fasta files concatenated together.

For example:

minimap2 -a DB.fasta my.fastq > my.sam

I have a file DB.fasta which has 16498 entries in it. When I align my fastq file to this reference the resulting SAM file has 4 separate header locations spread throughout the file. Each section is unique.

To check my sanity I ran the exact same thing with bwa mem and there are no issues.
Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.

When trying to view the resulting SAM file in samtools view I get an error saying the file is truncated. Therefore I was only able to see this in vim.

My current install of minimap2 is 2.0-r191-dirty

Let me know if you want some more information.

Example of --frag=yes

Is -x sr --sr --frag=yes intended for paired-end short reads? Could you please give an example of its use?

No query sequence in sam?

Thanks for the quick fix on the last issue. I am also seeing some mapped reads which don't have either the query sequence or qual, yet they seem to be mapped. I thought the query should always be supplied in the sam format in this case or am I wrong? Here is an example from the sam fastq.gz as in issue #3

b37c6230-9e91-4d7c-aa5d-853a2e6129b2_Basecall_Alignment_template        272     chr14_GL000009v2_random 65906   0       19S27M1D2M2D15M4I19M1D22M1D4M1D49M1I14M3D43M2I8M1D4M1D7M1D7M1D14M1D19M1I11M1D12M2D14M1D7M2D4M3D56M1D17M1I2M1D4M1D7M1D7M5D7M1D26M2D9M1D4M1D4M1D11M4D34M4D24M2D5M2D1M1I24M1D9M1D11M1D2M1D1M1D6M1D5M3D24M2D11M4D1M1D7M1D6M3D13M1I6M1I56M1D8M1D9M2D26M2D15M1D55M1D5M1D17M1D35M1D27M2D28M1D39M2D3M2D2M1D51M2I44M1D3M1D2M1D33M1I27M1D20M1D24M1D9M1D37M2D9M1D4M4D47M1D13M3D73M1D5M1D51M1D17M1D10M2D9M2D12M2D31M2D9M1D42M2I1M7D19M1D6M1D20M2D5M1D3M2D41M1I31M1D15M1D28M3D28M1D6M2D20M1D4M1D3M3I14M2D13M2D13M2D30M3D3M1D9M2D7M2D21M2D8M1D8M1D9M2D4M2D4M1D17M1D1M2D2M1D15M1D15M1I4M1D25M1I33M1D49M1D20M2D8M2D24M2I10M2D30M7D5M1D2M4D11M3D48M1D5M3D15M1I8M1D21M2D28M1D23M1I7M1D16M5D6M1D12M1D9M1D7M2D4M1D5M3D7M4D3M1D4M3D10M3D17M1I20M2D5M1D8M2D15M1I2M3D2M1D5M2D8M2D11M1D22M2D2M1D8M1D5M1D7M1D33M1D18M5D22M2D13M3D3M1D14M1D9M1D19M1D16M2D12M1D66M2D50M1D8M1D22M1D5M2D9M1D4M1D7M3D6M1D43M1D26M1D2M1D2M2I5M1D16M1I50M2D4M1D5M1D22M4D2M3D3M1D6M1D11M1D9M2D9M1D9M3D9M4D40M2D1M2D8M1D12M1D7M2D28M3D6M2D13M2D10M1D14M1D8M3D16M1I1M2D19M1D6M4D4M1D40M2D3M2D18M3D127M1I2M1D8M3D3M4D11M1D10M2D15M1D19M4D4M3D11M1D11M1D66M2I18M1D9M1I20M1D9M2D5M1I7M1I6M1I5M4D1M1I2M1I3M1I21M1D7M1D18M1I20M1D9M1D8M5D6M2D17M1D7M1D22M2D3M1D11M6D2M3D6M1I11M2D4M4D1M2D11M1D6M4D5M1I5M2D2M2D5M1I12M1I12M1D5M2D37M1D6M1I3M3D14M1D19M2D5M1D7M1D14M1D11M1D2M2D20M2D7M1D7M1I8M1D14M2D10M1D5M3D9M1D7M1D5M3D70M2D55M1D8M1I11M1D11M2D16M2D7M4D2M1I10M2D8M2D23M2D3M2D14M2I2M1D15M2D3M2D14M1D32M1D4M1D8M1D10M2I4M1D4M2D42M1D25M1I8M1D13M1D8M2D36M1D16M2D13M1D39M1I25M2D10M3D29M2D1M1D13M2D6M3D11M1D18M2D10M1D12M1D40M1D18M4I8M283I22M9I5M2D54M4D7M1I30M2I12M2D23M2D30M2D20M1D8M1D22M1D5M1D4M1D5M1D98M2D5M1D10M2D21M1D6M1D6M1D25M1D10M2I2M1D15M1D20M3D16M4D5M1I2M2I12M1D3M2D3M1D19M1D32M1D20M1D26M3D39M1D5M1D9M1D18M1I7M1D3M1D8M1D20M1D12M1D11M1D8M1D13M1D16M1D11M1D5M1D11M1D39M1D8M2D24M2D4M1D11M1I48M2I39M1D2M1D6M1D9M2I2M6D55M1D1M1D18M1D10M1D2M2D13M2D24M2D10M2D31M1D17M1D24M6D2M1D26M1D23M1D12M1D7M2D14M1I6M4D44M1D3M2D18M1D24M1I18M5D2M2D22M1D5M1D14M1D1M1D16M3D18M5D6M1I28M1D8M2D7M1D3M2D10M1D8M1D10M1D6M1D20M5D39M1D18M1D28M4I9M2D12M2D19M1D4M2D9M1D11M1D27M2D5M1D11M1D22M1D39M2D20M1D26M2D35M1D9M1D7M3D8M2D7M1D24M2D4M2D13M1D9M1D35M1D8M1D6M2D10M1D53M4D13M1D2M1D22M1D25M1D6M1D46M1I2M2D24M2D8M1D9M1D3M2D12M1D31M1D6M1D12M2D20M1D8M4D25M1D22M1D12M2D15M1D9M1D33M1D8M1D46M1D11M2D27M3D4M1D50M1D8M2D10M1D13M2D2M1I5M3D5M2D30M1I4M1D6M1D17M1D13M2D4M1D15M3D18M1D41M1D8M1D1M1D14M1D25M1D25M3D14M4D28M1D8M1D10M4D9M1D54M2D2M1D10M1D39M1D8M1D8M1I15M2D8M1D12M1D3M1I31M2D3M1D6M1D3M1D4M1D12M1I11M1D50M4D4M1D25M1D5M1D18M1D42M3D19M1D5M1D11M3D10M1D11M1D3M2D5M2D8M3I8M1D7M3D17M1D10M5D3M1D4M1D8M1I20M2D4M1D6M1I6M1D18M3D4M1D11M2D13M1D20M4D6M1D23M25S  *       0       0       *       *       cm:i:195        s1:i:2366       NM:i:1633       ms:i:9876       AS:i:10139      nn:i:0

mappy Python 2 vs Python 3 strings

Is mappy intended for both Python2 and Python3? If so there is a bug with Python 3. It handles byte strings differently, so fastx_read is yielding strings with b' at the start and ' at the end, because of the conversion using str(foo). It probably needs a foo.decode() in there instead for python 3.

This also means that when mapping, the coords are off. eg:

$ cat tmp.ref.fa
>ref
GACCTAGGGTAACGTATCGTCGCCCTCGCGGCCGACTCATGGGCTTGCGTACAAACTTGA
CTAACGGTCATAGTGACAGAATCATATACTTTCTGCTTAAGTGATCTAAATGCACGGTCT
ATCACAAGTGGAACGTGCCTAGGAGGATGTTTGCCGAATGAAGCTTCGAATTCAGCTATT
TATAAGTTACACACTTTTAG
$ cat tmp.read.fq
@read
AGGGTAACGTATCGTCGCCCTCGCGGCCGACTCATGGGCTTGCGTACAAACT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGH

Code:

a = mappy.Aligner('tmp.ref.fa')
for name, seq, qual in mappy.fastx_read('tmp.read.fa'):
    for hit in a.map(seq):
        print(hit)

Python2 output:
0 52 + ref 200 5 57 52 52 6 tp:A:P ts:A:. cg:Z:52M

Python3 output:
2 54 + b'ref' 200 5 57 52 52 6 tp:A:P ts:A:. cg:Z:52M

samtools mpileup segfault

Dear Heng,

I jump a bit ahead to see how me can quickly start working with minimap2. I mapped an assembly on a reference genome (-xasm5 -a), sorted and converted the output to a bam file. Then I passed the bam file to samtools mpileup (-v -u -B) to generate a VCF file. Works for the first to chromosomes in my reference but then fails on the third. The alignment that is causing the fail is 44kb long with about 440 CIGAR edits. Other, way longer alignments with more CIGAR edits have been processed on the first two chromosomes. I am not sure what is happening. Might be related to samtools/htslib (btw. I compiled the latest stable with your "Work-around for supporting CIGARs with >64k operations in BAM files") and I am not even sure the issues belongs here. I uploaded the alignment and bot qry and ref here https://owncloud.tuebingen.mpg.de/index.php/s/44pZgooEWvDgboD. Any idea how and where to start looking for the problem?

Thanks a lot especially for minimap2 👍
Felix

mapping illumina reads *to* minION reads

i would like to map Illumina (query) reads to minION (reference) reads.
Which preset would you recommend: sr or map-ont?

(or some other set of parameters?)

My use-case is metagenomics. I have 0.5X coverage genome skims of 50 plant species (Illumina). Assembling genomes is not currently feasible, but skims are easy and cheapish.

We then use minION to sequence pollen (thus, multiple species). i want to see if Illumina mapped reads can be used to identify each/any of the minION reads to plant species.

Question about SAM output tags

Hi,

I generated "sam" file using "minimap2", and I noticed there are some additional tags in the output compared to the standard "sam" format, e.g. "tp:A:P cm:i:8 s1:i:46 s2:i:0 NM:i:264 ms:i:194 AS:i:194 nn:i:0". I was wondering if you can tell me what do the tags "tp", "cm", "ms" and "nn" mean.
Next, I want to extract the number of matches and mismatches from the cigar alignment. However, the "sam" file contains only the tag M, which stands for both matches and mismatches, so I was wondering if any of these additional tags can help me differentiate between the both. If not, can you please suggest me a way to do that ?

Thank you,
Natasha

--cs=long not accepted; cs='long' not accepted want cs tag operating for sam output

The call:-
% minimap2 -c --cs=long -a -t8 /home/users/allstaff/gouil.q/genomes/Mus_musculus/GRCm38/CAST_EiJ_full_sequence/CAST_EiJ_full_sequence_SNPs_introduced.mmi albacore2.fastq.fa.gz > /home/users/allstaff/woodruff.c/lab_speed/cjw/genomes/Mus_musculus/GRCm38/CAST_EiJ_full_sequence/CAST_EiJ_full_sequence_SNPs_introduced_cslong.sam

The error message:-
minimap2: unrecognized option: cs=long

Assertion `op >= 0 && op <= 2' failed when -S tag passed

Hello, I'm getting an error when passing the -S tag. I can't figure out what exactly is going wrong
My command:
minimap2 -x splice -S -k 14 -N 0 -t 24 ce11.fa 170206_celegans_wtadult_rna.111.fa > albacore_rna_MINIMAP2_ce11_genome_all_rna_k14.paf 2> albacore_rna_MINIMAP2_ce11_genome_all_rna_k14{kmer}.log
And the logfile:

[M::mm_idx_gen::5.1030.97] collected minimizers
[M::mm_idx_gen::5.809
1.72] sorted minimizers
[M::main::5.8091.72] loaded/built the index for 7 target sequence(s)
[M::mm_mapopt_update::6.452
1.65] mid_occ = 162; max_occ = 711
[M::mm_idx_stat] kmer size: 14; skip: 5; is_HPC: 0; #seq: 7
[M::mm_idx_stat::6.694*1.63] distinct minimizers: 16602221 (67.22% are singletons); average occurrences: 2.054; average spacing: 2.941
minimap2: format.c:148: write_cs: Assertion `op >= 0 && op <= 2' failed.

Public/semi-public Direct RNA-seq data?

The minimap2 -x splice preset is tuned for PacBio Iso-seq and Oxford Nanopore cDNA sequencing. The sequencing error rate is relatively low. I need to tune the parameters for long error-prone RNA-seq reads. If you have Nanopore Direct RNA-seq data set for human or mouse and can share with me for testing purpose, please let me know. I will really appreciate. A couple of such data sets should be enough.

Provide Error if reference missing

Hi
Just a quick suggestion as I doubted seriously my sanity today generating constantly empty SAM or PAF files.
If reads are provided but no reference (or the 2nd pairs of reads to map onto) provide a warning or error message.
To me it looked as it succeeded

minimap2 -x ava-pb subreads.fasta -t 10 -a > overlaps.sam

[M::mm_idx_gen::133.043*1.66] collected minimizers
[M::mm_idx_gen::142.990*2.23] sorted minimizers
[M::main::142.990*2.23] loaded/built the index for 292447 target sequence(s)
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 292447
[M::mm_idx_stat::145.867*2.20] distinct minimizers: 196522811 (45.73% are singletons); average occurrences: 3.898; average spacing: 4.220
[M::main] Version: 2.0rc1-r232

But then the files were obviously empty
It may help people who dont know what to expect ;)
Cheers

samtools not working on results

Hi, i am using minimap2 locally with the following settings:

minimap2 -ax map10k reference.fa nanopore.fa | samtools view -Sbt reference.fa | samtools sort - -o nanopore.minimap2.sort.bam

When i look at the sam file (samtools view -h nanopore.minimap2.sort.bam), i see that there are mapped reads, however, when i do the flagstat from samtools, it shows 0 reads. and when i use the sorted bam (or sam) in pysam, i am unable to fetch any read from the data.
Could this be a header error?

when i use the same data with BWA mem -ont2d, it works as expected.

Please reopen issue thread #31

Hello, could you please reopen the issue thread #31, I have run in to additional trouble, which I explain in comments in that issue thread.

Visualizing MINIMAP2 Output

Dear Users:

I have used MINIMAP2 to generate alignments between two genome assemblies. I have a sorted and indexed BAM file. Is there a way to generate plots to visualize the whole genome alignments generated.

ThankS!

segmentation fault when using an index file

When using an index file as the target input I get segmentation faults caused by main.c attempting to check the kmer and minimizer lengths in the index agree with those set on the command line. The seg fault occurs on the second pass through the loop when the index is not loaded. I've sent a pull request with a simple fix, but it may be worth checking this isn't causing any other issues, particulalry if there are multiple query files.

mm_align1: Assertion `qs1 >= 0 && rs1 >= 0' failed

Hi,

I tried using the latest commit (r525) today and have run into issues. Below is the stderr.

[M::mm_idx_gen::82.657*1.82] collected minimizers
[M::mm_idx_gen::91.379*2.39] sorted minimizers
[WARNING]^[[1;31m For a multi-part index, no @SQ lines will be outputted.^[[0m
[M::main::91.381*2.39] loaded/built the index for 527123 target sequence(s)
[M::mm_mapopt_update::98.263*2.29] mid_occ = 229
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 527123
[M::mm_idx_stat::102.392*2.24] distinct minimizers: 273877548 (41.74% are singletons); average occurrences: 3.321; average spacing: 4.398
minimap2: align.c:470: mm_align1: Assertion `qs1 >= 0 && rs1 >= 0' failed.
/var/spool/PBS/mom_priv/jobs/2440842.pbs.SC: line 29: 36242 Aborted                 (core dumped) minimap2 -x ava-pb -t $THREADS -a $FILEDIR/$FILENAME $FILEDIR/$FILENAME > genome_asm2.raw.sam

I don't encounter this issue using the latest release (2.2 r409). I'm running the program on a HPC which uses SUSE as the OS and PBS as job manager. There should be no issues with memory requirements or file space. If you want more information I'd be happy to provide.

Zac.

Mapping cDNA reads

Update: the "cdna" branch has been merged into "master" and subsequently deleted.


Original post:

This is a feature request from twitter users. A very preliminary implementation is available at the cdna branch. Some notes:

  • perform global alignment (not local hits to each exon)
  • introns reported as long deletions (which made possible by the 2-piece gap penalty)
  • allow up to 100kb intron by default (longer threshold reduces performance)
  • deletions longer than 68bp are considered as introns
  • find small exons if nearby exons are long
  • splicing boundaries are not accurate
  • the cDNA mode is slower due to more DP-based alignment

To use it,

git clone https://github.com/lh3/minimap2
git checkout -b cdna remotes/origin/cdna
make
./minimap2 -a -x cdna ref.fa reads.fa
./minimap2 -a -x cdna -k14 -G 50000 ref.fa reads.fa # parameter tuning

cdna is an experimental branch. If it is not totally crap, I will refine it and merge it to the master. Your feedbacks are warmly welcomed and greatly appreciated. I don't have a good sense about what is a good cDNA alignment for now.

Support for RNA

Hi @lh3,

I'd like to add support for direct RNA nanopore reads to minimap2. These reads have Us in their basecalled sequence rather than Ts. I'm happy to provide a PR but wanted to run the proposed solution by you first - is it enough to convert Us to Ts after reading the sequence here or would you prefer I do something else?

Jared

Windows support

Heng,

thank you so much for this enormously fast and good-quality aligner. At Oxford Nanopore Technologies, we are working on adding minimap2 as a shared library. As we are shipping our software for Linux, OS-X and Windows, we need to build on either of these platforms.

Our current approach is to build the standard minimap2 executable and a shared library which we use internally via a small cmake wrapper, see https://github.com/nanoporetech/ont_minimap2.

In order to compile with Windows, we needed to make small adjustments to the code, which might be useful for other users as well. We have made those changes in https://github.com/nanoporetech/minimap2/tree/msvc14, which we would be happy to contribute to your repository, possibly after aligning with your ideas.

Best regards,
Stefan von Deylen

What parameters best resmble blastn

I built a minimap2 index of the NCBI NT database.

I have a series of metagenomic contigs I want to align against said database. Using blast (megablast), around 40% of my sequences have mappings against the nt reference.

Using minimap2 with -x asm10, I only get ~20% contigs with mappings to my reference.

What parameters in minimap2 would best fit this task? I've tried reducing mismatch and gap penalties without success.

chimeric alignment

Great project!

Are there any plans to support chimeric alignment (similar to GMAP) in addition to spliced long-sequence alignment? This would be great, as GMAP (and to some extent STAR) is to my knowledge the only aligner that can do this currently.

minimap2 suitable for Iso-Seq CCS all-vs-all?

Hi,

Thanks for the great tool. Would you recommend using minimap2 for all vs all alignments of CCS reads? In that case, what would be a recommended parameter combination?

If you think minimap2 has potential to do well in this setting, I could test different parameters and give you feedback on this analysis.

Best,
K

Make ks.comment available in mappy.fastx_reader

Enhancement.
Fastest python reader I know.

diff --git a/python/mappy.pyx b/python/mappy.pyx
index 1f17dc0..11e0b11 100644
--- a/python/mappy.pyx
+++ b/python/mappy.pyx
@@ -154,8 +154,9 @@ def fastx_read(fn):
                if ks.qual.l > 0: qual = ks.qual.s if isinstance(ks.qual.s, str) else ks.qual.s.decode()
                else: qual = None
                name = ks.name.s if isinstance(ks.name.s, str) else ks.name.s.decode()
+               comment = ks.comment.s if isinstance(ks.comment.s, str) else ks.comment.s.decode()
                seq = ks.seq.s if isinstance(ks.seq.s, str) else ks.seq.s.decode()
-               yield name, seq, qual
+               yield name, comment, seq, qual
        cmappy.mm_fastx_close(ks)
 
 def verbose(v=None):

ultra long reads mapping to 1 nt

I've mapped some ONT reads using -k15 and discovered that some of the (very long) reads don't align well (see below). minimap2 aligns them to the right place, but with a score of 'inf' and only at one nucleotide (!). Also, the read positions start and stop at 0 (just like when the reads don't map).
Manual inspection (i.e. BLAST) suggests that these reads should map to the chromosomes (albeit at around 82% identity). Other ultra long reads form the same run with similar base calling scores map correctly.
Is this normal/expected behaviour? I tried changing kmer and chaining parameters to no avail.

Sample output:
htsbox samview -p minimap2.srtd.bam | grep inf
9cb8f5d7-ab9c-48ea-88bd-2477e02d7b60 702394 0 0 - chr20 64444167 17799575 17799575 inf 60 mm:125885;ins:0,0;del:0,0;score:718346
eb5d9390-96a6-42ce-ae87-b99e649e6d80 658312 658312 658312 + chr4 190214555 94057298 94057298 inf 60 mm:115078;ins:0,0;del:0,0;score:695835

About strand information in spliced alignment

Hi Heng,

I just used minimap2 (version 2.2) for spliced alignment of PacBio reads to human genome.
I used “-ax splice” option.

For some reason, the minimap2 output alignments do not contain strand information. We expected that spliced reads would contain the XS tag, but my minimap2 output does not contain any XS tag.

I am guessing that “-u b” option may be able to provide the XS tag for the spliced read. But the “-x splice” option already includes the -ub option according to the minimap2 2.2 manual. So maybe I was missing something else here?

I’d greatly appreciate your advice.

Thank you very much!

MD tag

Hi,
are you planing to provide the MD tag at some point? Or should I rely on samtools to reconstruct this information?

Thanks
Fritz

segfault on pacbio data

I'm trying to align pacbio reads to hg38 and I'm experiencing a segfault which appears to happen in almost the same location every time I run it. Out of five runs, all but one occurred during output of the same read: 84377bee_1744_0. Running minimap on this read and the 3 reads before and after it does NOT cause a segfault.

The data I'm running it on was generated with these commands:

samtools view -b ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/sorted_final_merged.bam chr22 >NA12878.hg19.pacbio.chr22.bam
samtools fasta -0 NA12878.pacbio.chr22.fa NA12878.hg19.pacbio.chr22.bam

Then I run minimap2 with this command:

minimap2 -ax map-pb hg38.chr22.fa NA12878.pacbio.chr22.fa >NA12878.hg38.pacbio.chr22.sam

I've made the input data (reference and fasta) available here (I believe that it is publicly available):
s3://minimap2-data/NA12878.pacbio.chr22.fa
s3://minimap2-data/hg38.chr22.fa

I observe this segfault on many of the chromosomes in this MtSinai PacBio dataset, and on the code in master, the r2.3 release, and r2.0 release. For this particular input, the map-ont flag does not segfault, but on some of the other "bad" chromosomes, the nanopore flag segfaults also. I'm running this on a "G4" Azure instance.

Inconsistent sequence and quality string for unaligned reads

In some sam outputs, unaligned read sequence and quality doesn't match in the output sam file like this example:

f3920def-68ae-4956-8900-6398064c4f93_Basecall_Alignment_template	4	*	0	0	*	*	0	0	CTTTTATCTGCCACCTCTTCCACCTCACC	#$$$%&'%%#&''&%$#$$$''&&$&'&&2

The sequence in this case is 29 bases so I think the quality is being output incorrectly (it has the extra 2 on the end that isn't in the input). Here is the corresponding read fastq:

@f3920def-68ae-4956-8900-6398064c4f93_Basecall_Alignment_template llssbzms2p35x_20161128_FNFAB45271_MN17073_mux_scan_Hu_Bir_R94_1Dlig_fc5_13909_ch310_read17_strand
CTTTTATCTGCCACCTCTTCCACCTCACC
+f3920def-68ae-4956-8900-6398064c4f93_Basecall_Alignment_template llssbzms2p35x_20161128_FNFAB45271_MN17073_mux_scan_Hu_Bir_R94_1Dlig_fc5_13909_ch310_read17_strand
#$$$%&'%%#&''&%$#$$$''&&$&'&&

running minimap2 as:

minimap2 -ax map10k -t 16 GRCh38_full_analysis_set_plus_decoy_hla.mmi rel3-nanopore-wgs-152889212-FAB45271.fastq.gz

Mapping just this read works so there might be some buffer overflow or other issue on the full set. This read set is available from the NA12878 consortium: http://s3.amazonaws.com/nanopore-human-wgs/rel3-nanopore-wgs-152889212-FAB45271.fastq.gz mapped to the 1000 genome reference: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/. So far I've only seen this on unmapped reads.

Please make a release (+ ideas to make this easier)

Heng,

I know this annoying, especially in rapid development, but it would help all the packaging teams (Homebrew, Conda, Galaxy) if you could make releases (even labelled as pre-release) regularly, or even just run git tag r189 when you update to your r189 version.

@sjackman or others might have smarter ways to achieve this via the command line in a way that isn't too onerous.

Torsten

"too many CIGAR operations" when converting to sorted BAM - workaround?

Hello,

I was running a test of minimap2 on the ONT NA12878 data and ran into an issue piping the output into "samtools sort -o output.bam":

[E::sam_parse1] too many CIGAR operations

I'm fairly certain it's related to this issue in samtools, so I don't think there is a fix for it at this time. My question is whether you know of a work-around other than using the uncompressed SAM format instead of BAM?

Thanks!

p.s. love how fast minimap2 runs compared to some of the other tools I'm testing!

incremental self-alignment

Is it possible somehow to use minimap2 to "add" a set of reads to an existing self-alignment?

like:

minimap2 file1.fq file1.fq > overlap
minimap2 overlap file2.fq > overlap

Thank you!

Understanding the Minimap2 algorithm

Hello Dr. Li,

I am trying to understand the algorithm of Minimap2. Here, so far I understand follows a typical seed-chain-align procedure. However, I did not understand for seeding how the k-mers are selected and how long are they?
Could you help me understand it?

Thank you very much in advance.

Unalbe to do alignment on Ubuntu

There are three servers in our LAB
(1)Ubuntu 5.4.0-6ubuntu1~16.04.4 (2)Ubuntu 4.8.2-19ubuntu1 (3)Red Hat 4.8.5-11
minimap works sucessfully on the 3rd ,but not on the others.
I'm not sure whether it's due to different OS-version or something else.
Below is the content in core dump


(1)
...
[New LWP 25220]
[New LWP 25224]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./minimap2/minimap2 -a -x map-ont /big7_disk/kuanwei105/homo-sapiens/ref.mmi /b'.
Program terminated with signal SIGILL, Illegal instruction.
#0 _mm_set_epi8 (__q00=, __q01=, __q02=, __q03=,
__q04=, __q05=, __q06=, __q07=, __q08=,
__q09=, __q10=, __q11=, __q12=, __q13=,
__q14=, __q15=) at /usr/lib/gcc/x86_64-linux-gnu/6/include/emmintrin.h:616
616 return extension (__m128i)(__v16qi){
[Current thread is 1 (Thread 0x7ff56e7fc700 (LWP 25223))]


(2)
...
[New LWP 27644]
[New LWP 27653]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./minimap2/minimap2 -a -x map-ont /big7_disk/kuanwei105/homo-sapiens/ref.mmi /b'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x000000000041267e in _mm_set_epi8 (__q00=, __q01=, __q02=, __q03=,
__q04=, __q05=, __q06=, __q07=, __q08=,
__q09=, __q10=, __q11=, __q12=, __q13=,
__q14=, __q15=) at /usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h:616
616 return extension (__m128i)(__v16qi){

SAM header interspersed through file (not a duplicate of #15)

Not a duplicate of issue #15

Similar problem, but different cause (and far more duplicated headers)

When passing multiple FASTQ files on the command line, a new SAM header is inserted in the output stream once (AFAICT) for each input FASTA/FASTQ.

Version: 2.2-r409

Weird self alignment

Hi,

I see the following somewhat weird data (SAM format) coming out of minimap2:

L0/46/0_12879	256	L0/46/0_12879	1	0	8963M116I28M116D203M110I9M1I15M1D22M110D3412M	*	0	0	*	*	tp:A:S	cm:i:15	s1:i:126	NM:i:454	ms:i:24372	AS:i:24744	nn:i:0
L0/46/0_12879	256	L0/46/0_12879	1	0	8963M116D28M116I203M110D9M1D15M1I22M110I3412M	*	0	0	*	*	tp:A:S	cm:i:15	s1:i:127	NM:i:454	ms:i:24372	AS:i:24744	nn:i:0

This is weird for two reasons:

  1. This describes an alignments of a read against itself running from end to end, but clearly not the optimal alignment between the two regions specified.
  2. There are two versions.

minimap2 was run with as

minimap2/minimap2 -ax ava-pb dup.fasta dup.fasta

where dup.fasta contains a single read (synthetic E. coli, though I think this should not matter).

The single read in dup.fasta is

>L0/46/0_12879
AGTCCTGCGGAAAGCGCCAGGGCGAGGATTGCTGTGGCAGGTTTACGTAATTGCATATCCAACTCCTTTATCTCTCTGCG
TTAAGAACGCACTGGAATACCCGTTGTGAGTGTTTTGTGTTGTTACGTCTGCAACTTTATTGTGCAGTGTGTGCCTGTTA
GGGAAGGTGCGAATAAGCTGGGGAAATTCTTCTCGGCTGACTCAGTCATTTCATTTCTTCATGTTTGAGCGATTTTTTCT
CCCGTAAATGCCTTGAATCAGCCTATTTAGACCGTTTCTTCGCCATTTAAGGCGTTATCCCCAGTTTTTAGTGAGATCTC
TCCCACTGACGTATCATTTGGTCCGCCCGAAGACAGGTTGGGCCAGCGTGAATAACATCGCCAGTTGGTTATCGTTTTTC
AGCAACCCCTTCGGTATCTGGCTTTCACGTAAGCCGAACTGTCGCTTGATGATGCGAAATGGGTGCTCCACCCCTGGCCC
GGATGCTGGGCTTTCATGTATTCGATGTTGATGGCCGTTTTGTTCTTGCGTGGAATGCTGTTTCAAGGTACTACCTTGCC
GGGGCCGCTCGGCGATCAGCCAGTCCAATATCCACCTCGGCCAGCTCCTCGCGCTGTGGCGCCCCTTGGTAGCCGGCATC
GGCTTAGACAAATTGCTCCTCTCCATGCAGCCAGATTACCCAGCTGATTGAAGGTCATGCTCGTTGGCCGCGAGTGGTGA
CCAGGCTGTGGGTCAGGCCACTCTTGGCATCGACACCAATGTGGGCCTTCATGCCAAAGTGCCACTGATTGCCTTTCTTG
GTCTGATGCATCTCCGGACTCGCGTTGCTGCTCTTTGTTCTTGGTCGAGCTGGGTGCCTCAATGATGGTGGCATCGACCA
AGGTGCCTTGAGTCATCATGACGCCTGCCTTCGGACCAGCCAGTCGATTGATGGTCTTGAACAATTGGCGGGCCAGTTGG
ATGCTGCTCCAGCAGGTGGCGGAAATTCATGATGGTGGTGCGGTCCGGCAAGGCGCTATCCAGGGATAACCGGGCAAACA
GACGCATGGAGGCGATTTCGTACAGAGCATCTTCCATCGCGCCATCGCTCAGGTTGTATCCAATGCTGCATGCAGTGAAT
GCGTAGCATGGTTTCCAGCGGAAAAGGTCGCCGGTCACATTACCAGCCTTGGGGTAAAACGGCTCGCTGACTTCCACCAT
GTGTTTTGCCATGGCAGAATCTGCTCCATGCGGGACAAGAAAATCTCTTTTCTGGTCTGAACGGCGCTTACTGCTGAATT
CACTGTCGGCGAAGGTAAGTTGATGACTCATGATGAACCCTGTTACTATGGCTCCAGATGACAAACATGATCTCATATCA
GGGACTTGTTCGCACCTTCCTTAGGTAACATTTAGTTTGGCTAAATGTAAAGATATTGCTGTTTTATTGTTTGTTTTTGC
GAGATGCGCCGCACCATTCCGAAGCAAAATTCTTAAAATGCACTCTTTTAGTGCTACCGCTGGATTACTGTGGTGCAACT
AGGTTGTACTGATGCTGTTTCAGGGTTGCCTTGTATAACAAAGCAATAGATCGTGCCAAAGTTGGATAGGAAATATGTTA
TCCGGATAATGCACTGATGCCGCATCCGGTGAGCGTGGCCGAAATATGGGATGTATTCCGGCACGATAAGAAGGGATTAT
TTACGTCGCTGACGGCAGACTCATCAACACAGCAGCAAAACCAAAACAATGCCGTCAGCACCCACAGTCGGACCAGTTGC
CGAGTACGTGCGTGATGGTGTGAGTTACCGGTGGTCGGCGTACGTTAGTGGTTAACACCTCGCGGGTGAACTGCGGGATC
ATCGCCTGAATTTCTCACCCTGCGGGCCAATCACCGCCGTAATGCCGTTGTTGGTGCTGCGCAACAGTGGGCAGCGCCAG
CTCCAGCGCACGCATTCGCGCCATCTGGAAGTGTTGCCATGGACCAATAGGTTTACCAAACCACGCATCCGTTGGAGATA
GTCAGCAGATAGTCGGTATCCGGGCGGAAGTTATCGCGCACTTGCTCGCCGAGAATGATCTCGTAGCAAATAGCCGCAGT
AAGCTCAATACCATTTGCCGACAGCGGCGGCTGGATATATGGCCCACGGCTGAACGACGACATCGGCAGATCAAAGAACG
GATGCTAACGGACGCAGAATCGACTCAGCGGGACAAACTCGCCAAACGGCACCAGATGGTTTTTGTTATAGCGATCGGCT
GAGTTCGTAGCTGTACGGGCGCACCTTTACCCAGCGTGATGATGGTCGTTGTAGGTATCGTAGCGGTTCTGCTTATTGAT
GACGCGCCTCGACAATCCCGGTTACCAGCGAGCTACCTTTATCACGCAACTCACCGTCCAGTTGCTTTGAGGAACGGTTG
CTGGTTAATTTCCAGATGCGGTTGATCGCCGACTCCGGCCAGATAATCAACGATGATTTGGCTCCATCAGCGGTGCCGTT
GACGTTGTAGTAAATCTTCAGCGTATTAAGAAGCTGGCCTTCGTCCCATTTCAGCGATTGCGGAATATCGCCCTGAACCA
TCGAAACCTGAATGGTTTTCTCCGGTTGTGGGGTAAACCCACTGGATGAACGTCAGACGGGAAGGGAGACGGCAAACATG
CACGACGGCCACCACCAGCTGGACGCCAGTTGCGTTTGACCAACGCCAGTGCCAGCAGGCCACTAACCATCATCAGCAGG
AAGTTAATGGCTTCCACGCCCATTATCGGTGCCAGCCCTTTTAACGGGACCATCAATCTGGCTATAGTCCGAACTTGTAA
CCACGGGTGAAGCCGGTCAAGTACCCAACCGCGCACGAAACTCGGTCACTTGCCAGAGGGCAGGGGGCGGCAATCGCTAC
GCGCAGCCAGGTGGTTTTCGGCCACAGACGCGACAGCACGCCAGCAAACAGTCCGGTATACAGCGACAAATACGCCGCCA
GCTGCACCACCAGGAAGATGTTAACCGGGCCAGACATTCCGCCAAAGGTCGCGATGCTGACATAGACCCAGTTAATACCG
CTGCCAAAGAGGCCAAATCCCCAGCAAAAGGCCAATAGCGGCAGACTGGAGTGGACGGCGGTTAAAGGTCAACGCCTGGC
AAGCCCCATCAGCGAAATAATCCGCCGCAGGCCAGACGTTCGTAAGGAGAGAAGGCCCATGCGCTTCCGCAGGCACCGAA
TAATAACGCCAGCAGCAGGCGAATGCGCTGGCGTTTCAATTACATGAGGCAAAAGCCATGTAAGTATATCTATCCAGTTT
CGGTTTATTCATCCAGCTTCGGCTGGGGTGAGTATCCGGGATTTTGACATGAACCTGAATAATACGCCGACTGTCGGCCG
TCGCCACTTTGAACTGGATAACCGTCGATGTCGATAGTTTCGCCACGCGCCGGAAAGATCCCAAATGCCTGCATCACCAG
ACCACCGATAGTCGTCGACTTCTTCATCGCTAAAGTGGGTGCCGAACGCTTCGTTGAAGTCTTCAATGGAAGCCCAGTGC
GCGTACGGTCCAGGTATGACGACTCAGCTGACGGAAGTCGATATCATCTTCTTCGTCATACTCGTCTTCTAATACTCACG
CAACAATCAGTTCCAGGATGTCTTCAATGGTCACCAGACCGGAAACCCCATCGAATTCGTCAATAACGTATCGCCATGTG
GTAACAGCTGAGAGCGAAACTCTTTCAGCATCCGGCTACGCGCTTACTTTCAGGAAGCGACAACCGCCTGACGTAACACT
TTGTCCATGCTGAAGGCTTCAGCATCGCTGCGCATAAACGGCAGCAAGTTCGTTTCGCCATCAGAATCCCTATCAATGTG
ATCTTTGTCTTCGCTAATCATCCGGGAAGACGTAGAGTGGGCGGACTCGTATAGATGACATCAAGACATTCGGTCCACGC
GTCTGGTTGCGTTTCAGGGTAATCATGCCTGGGAGCGGGGGATCATGATGTCGCGAGACGCGTTGGTCTGCGATGTCCAT
CACCCCTCTCGAGCATATCGCGCGTATCTTCGTCGATATAGGTCGTTCTGCCCGGAATCACGGATCAGCGCCAGCGTTCG
TCACGGTTTTTCGGTTCCACCGTTGGATAAAGTTGGCTGCAGTAACAGGGAGAAAAATCCCCTTTCTTGTTGCTTATCGT
GTCACTACTGTGTGAATTGTCGTCGCTCATGGCGTGTATGGGTTCTCATGTTAGTTAATCAAAACGCCGTCGTTAATCAC
CAACGGCGGGGACGTCTGCCAGTCAAATGCCTGGCAATGTATTCTTTCTCGGCAATGTACGGATCCTCATAGCCCAGAGC
AAGCATAATCTCTGTTTCGAGGGCTTCCATTTCTGTCTGCTTCGTCATCTTCGATGTGATCGCTAACCTAACAAATGCAG
ACTGCCGTGCACCACCATATGCGCCCAGTCGCGCCTCCAGTGGTTTGCCTTGCGTCCTGAGCTTCCTTCTCAACCACTGT
ACGGCAGATAACCAGATCGCCCAGTAGCGACATATTCCAGTGCCAGGCGGCACTTCAAACGGGAAGGAGAGCACGTTGGT
CGGCTTATCCTTACCGCGATAGGTCAGATTCAGACTGTGGCTTTCGGCGGTATCGACCACGCTGAATCGTCACTTCCGAT
TCTCCTGAAACTGCGGGATCACCGCATTCAGCCATGTCTGAAACTGGCTCTCTTCGCGGTAACCCGGAATTATCTTCACA
TGCCAGTGCTAAATCGAGGATCACCTGACTCATTTTTGTTCCTCTGTTCTTCGCGCTTGCTTCTGCTGCCAGCGCCGCTT
TTCGTTTTTGTCTCGGCTTCTTCCCATGGCTTCATAGGCGTTAACGATACTGCGCCACCACAGGGTGACGAACCACGTCT
TCGCTGTGGAAGAAGTTAAAGACTGATCTCTTCGAACATCGGCCAGCACTTCGATGGCGTGACGTAAGCCTGATTTAGTA
TTACGCGGCCAGGTCCGATCTGTGTGACGTCGCCGGTGGATAACCGCTTTTGAGTTAAAACCGATACGGGTCAGGAACAT
CTTCATCTGTTCGATGGTGGTGTTGCTGGCTCTCATCGAGAATGATTAAACGCGTCGTTCAGCGTACGACCACGCATATA
GGCCATGCGGTGCGACTTCAACTAACGTTGCCGCTCAATCAGTTTCTCGACTTTCTCAAAGCCCAGCATTTCAAACAGCG
CGTCGTACAGCGGGCGCAGATACGGGTCTACTTTCTGGCTTAAATCGCCAGGCGAGGAAGCCCAGTTTTCACCGGCTTCT
ACTGCCGGACGAGTCAGCAGAATACGGCGAATTTCCTCGACGCTCCAGGGCATCAACTGCCGCAGCCACTGCCAGGTAGG
TTTTACCCGTACCCGCCGGGCCAACGCCGAAGGTAATGTCATGGGTCGAGAATATTGGCGATGTACTGCGCCTAGGTTTG
CGTGCGCGGCTTAAAATTACGCCGCGTTTGGTTTTGATATTGACCGCTTTGCCGTACTCCGGCACGCTCTCCGCGCTGTC
TGCTCCAGGACCACGCGCTTCTGTAATTCGACAACGGTAGGCAATCTCGTTCCGGTTCGATATCCTGAATCTGACCCGCG
CATCGGGGTCAGTGATCGACATACAGGCTACGCCAGAATGTCTGACCGCAGCGGGACGCAAATCGGACGGCCGTGTCAGT
TTAAAGTGGTTATCGCGGCGATTGATCTCGATGCCGAGACGGCGTTCGAGCTGCTTGATGTTGTCATCAAACGGGCCGCA
CAGGCTCAACAGACGCGCATTGTCTGCTGGCTCCAGGGTTGATTTCGCGAGTGTCTATGTTCAAACCGTCCTCTTATCTG
TATGCCGCCGGAAGCTGAACATTCACCGGCCTATAAGGAAATTATTCACGCCACAGGAAAAAGGCGCAAGCGATTGCAAT
ATAAGATGGGGATAAAGAGAGAAAAAACAAGGCCCGACCGGAACGGCAGGCCTGAGAATTACGGCTGATAATAACCCACG
CCAAAGGTCGTTTTCTTTGACGGGTACGGGCAATCACTGATTCCCGGTGTTTCTGCCACGCGCCAGACCCATTTCATCTT
CAGTACGCACCACTTTACCGCGCAGAGAGTTCGGGTAGACGTCGGTAATTTCTACATCGACGAATTTACCGATCATATCC
GGCGTGCCTTCGAAGTTGACCACGCGGTTATTTTCCGTACGCCCGGAAAGCTCGCATGATGCTCTTCACGCGATGTACCT
TCCTACCAGATATACGCGGGTGGTGCCGAGCATCCGGCGGCTCCACGCCATCGCTTGCTGATTAATTGCGCTCTTGCATG
AATATACAGACGCTGCTTCTTCTCTTCTTCCGGAACATCATCAACCATATCGGCGGCCTGGATGTACCCGGACGTGCAGA
TAAGATAAAGTGTAGCTCATGTCGAAATTGACGTCGGCAATCAGCTTCATCGTTTTTCTCGAAGTCTTCGGTGGTTTCGC
CATGGGAAGCCAACGTATGAAATCAGAACTGATCTGAATATCTGGACGCGCCGCACGCAGTTTACGGATGATCGCTTTGT
ACTCCAGCGCCGTAATGGGTACGGCCCATCAGGTTCAGAATGCAGATCGGAACCGCTCTGTACCGGCAGATGCAGGAAGC
TCACCAGCTCCGGCGTGTCGCGATACACTTCGATGATATCGCTCGGTGAATTCGATACGGATGGCTCGGTGGTAAAGCGA
ATACGATCGATCCCGTCGATCGCAGCAACCAGACGCAGCAGATCGGCAAACGATCCGGTGGTGCCGGTCGTAGTTTTCAC
CACGCCAGGCGTTCACGTTCTGACTCGTAGCAGGTTGACTTCACGCACGCCCTGAGCCGCAAGCTGGTGCATATCTCAAA
CAGAATATCGTCGGACGGACGGCTGACCTCTTCACCACGGGTGTGAAGGCACCACGCAGTAGGTGCAATATTTATTGCAG
CCTTCCATGATGGAGACAAACGCGGTCGGCCCTTCGGCGCGCGGTTCCGGTAGGACGGTCAAACTATCTCGATTTCCGGG
AAGCTGATATCTACAACCGGGCTGCGGTCGCCACGCACGGAGTTGATCATCTGCCGGCAGACGGTGCAGCGTGTTGCGGC
CCAAAAATAATATCGACATCAGTGGGCGCGCTGGCGAATGTGCTCGCCTTCTTGCGATGCCACGCAGCCACCGACGCCGA
TAATCAGGTCTGGATTCTTCTCTTTTAACAGTTTCCAGCGACCTCAACTGATGGAAGACTTTTTCCTGAGCCTTCTCGCG
GGTTGAGCAGGTGATTCAGCAGCAGCACATCCGCTTTCTGTCCGCCTACGTCGGTCAGTTGATAGCCGTGGGTGGCATCC
ACAGATCGGCCATCTTCGAATGAATCGTACTCGTTCATCTGACAGCCCCAGGTTTTAATATGGAGTTTTTTGGGTCATCG
ACTTGCTCTTGCGAAATAGTAGCCAGGAATGCAGGGCGTCATAGTGTAATGCTTTGCTGACCGTTGTGACCAGTATGAGC
GTTATCAGCCCTTAGGGGTAAAAATCCTGTAAACTTAAAGCAGTATTGCTAACAGGATGATTGACCATGACAAATCAACC
AACGGAAATTGCCATTGTCGGCGGAGGAATGGTCGGCGGCGCACTGGCGCTGGGGCTGGCACAGCACGGATTTGCGGTAA
CGGTGATCGGAGCACGCAGAACCAGCGCCGTTTGTCGCTGATAGCCAACGGACGTCGGATCTCGGCGATCAGCGCGGCTT
CGGTATACATTGCTTAAAGGGTTAGGGTCTGGGATGCAGTACAGGCTATGCGTTGCCATCCTTACCGCAGACTGGAAACG
TGGGGAGTGGGAAACGGCGCATGTGGTGTTTGACGCCGCTTGAACTTAAGCTACCGCTGCTTGGCTATATGGTGGAAAAC
ACTGTCCTGCAACAGGCGTTGTGGCAGGCGCTGGAAGCCGCATCCGAAAGTAACGTTATCGTCGTGCCAGGCTCGCTGAT
TGCGCTGCATCGCCATGATGATCTTCAGGAGCTGGAGACTGAAAGGCGGGAAGTGATTTCGCGCGAAGCTGGTGATTGGT
ACCGACGGCGCAAATTCGCAGGTGCGGCAGATGGCGGGAATTGGCGTTCATGACATGGCAGTATGGCGCAGTCGTGCATG
GTTGATTAGCGTCCAGTGCGAGAACGATCCCGGCGACAGCACCTGGCAGCAATTTACTCCGGACGGGACCGCGTGCGTTT
CTGCCGTTGTTTGATAACTGGCATACGCTTAGGGATTGGGTATGTGACTCTGCCCGGCGTCGTGTATGTCGCCAGTTGCA
GAATATGAGTATGCGCACAGCTCCAGAGCGGAAATCGCGAAGCATTTCCCGTCGCGTTCTGGGTTACGTTACACCGCTTG
TCCGCTGGTGCGTTTCCGCTGACGCGTCGCCATGCGTAGCAGTACAGTGCAGCAGGGCTTGCGCTGGTGGGCGATGCCGC
GCATACCATCCATCCGCTGGCGGGGCAGGGAGTGAATCTTGGTTATTCGTGATGTCGATGCCCTGACTTGATGTTCTGGT
CAACGCCCGCAGCTACGGCGAAGCGTGGGCCAGTTATCACTGTCCTGCAAGCGGTACCAGATGCGGCGCATGGCGGATAA
CTTCATTATGCAAAGCGGTATGGATCTGTTTATGCACGGATTCAGCAATAATCTGCCACCACTGCGTTTTATGCGTAATC
TACGGGTTAATGGCGGCGGAGCGTGCTGGCGTGTTGAAACGTCAGGCGCTGAAATATGCGTTAAGGGTTGTAGCCTTACA
ACATTGCCGGGATGACGTGCCTAACCGTAGGTCGGATAAGACGCGGCAGCGTCGCATCCGACATTGAAGGATAAGACGTG
TCAACGATCGCATTCGACATTGAATGAACGCAGAAAAGCAAAAAGGCTCGCCAGAAGCGAGCTTTTTTAATGTGGCTGGG
GTACGAGGATTCGAACCTCGGAATGCCGGAATCAGAATCCCGTGCCTTACCGCTTGGCGATACCCCAACTGGGTGCACTT
AACTAAGGTAAGCGTCTTGACATAAATTGGCTGGGGTAGCGAGGATTCGAACCTCGGAATGCCGGAATCAGAATCCGGTG
CCTTACACGCTTGGCGATACCCCAACAAATTGGTTTTGAATTTGCCGAACATATTCGATACATTCAGAATTTGGTGGCTA
CGACGGGATTCGAACCTGTGACCCCATCATTATGAGTGATGTGCATCTAACCAACTGACGCTATCGTAGCCAGATTGTTT
CTTCGATGGCTGGGGTACCTGGATTCGAACCAGGGAATGCCGGTATCAAAAACCGGGTGCCTTACCGCTTGGCGATACCC
CAATAACCGGGTCGGTGAACCGCTTACTCGAAGAAGATGGCTGGGGTACCTGGATTCGAACAGGGAATGCCCGGTATCAA
AAACCGGTGCCTTACCGCTTGGCGATACCCCATCCGGTACAACGCTTTCGTGGTGAATGGTGCGGAGAGGCGAGACTTGG
AACTCGCACACCTTGCGGGCGCCAGAACCTAAATCTGGTGCGTCTACTCAATTTCGCCACTCCCGCAAAAAAAAGATGTG
TGGCTACGACGGGATTCGAACTGTGCACGCCCACCATTATGAGTGATGTGCTCTAACCAACTGAGCTACGTAGCCATCTT
TTTTTTCGCGATACCTTATCGGCGTTGCGGGGGCGCGATTATGCGTCGTAGAGCCTTAGCAGTCGTCAACCGTCTTTTTC
AAGGAAAATTGCTCGAAAGTGACTGTTTGGTTAGGTTGGAACAGCGTGGCGCTATATTCGTCAATTATTGTTTACTTTGT
GTTTGTTTCCAACCCTACAGCCCATTCTTTTGTCATACAGGATGAAATTCGGAATTTAACAATAGTGGTGGTGAAATTAA
TCTATGAAATACTGGCCTACAGTGGATGAGTTGTCAAACAGTGATGTGGCAAACCCGGAACATTTCCTTACTGCATATCC
AGAATCAACAAGCTACCTCAATAACTGTAAACAGCCCCGGATTTCACCGGGGCTGTTTCGCATTTCTTACTTATACGCCG
ACTGAGTGAACCACCAACCGCGCGACCAGACGGATCGTCCATTTTCTTGAACGCTTTCATCCCATTCGACTCGCTTTAGC
GGTAAGAACAAGCGACGGAAGCGGACGCCCGGCACGCACTCAGCGGCGCTCGGAAGCGGGAATAGTCTTCAAAGATCTCC
CGATACAAGTACGCTTCTTTAGAGGTTGGCGGTGTTGTACGGGAAAGCGGAAGCGCGGCAGTTTCGCAGTTGCTGATCAG
AAACCTTGCTGCGCAGCCACTTCTTTCAGGGTGTCGATCCATACTGTAACCAGACGCCATCGGGAGAACTGCTCTTTCTG
CCGCCAGGCCAGCTTGCAGGCAGATACGCTTCAAAACATTCACGCAGGATGTGTTTTTCCATTTTGCCGTTACCGCACAT
TTTATCCTGTGGGTTAATACGCATCAGCCACATCAAGGAATTTGTTTGTCGAGGAACGGAACGCGTGCTTCCACGCACCG
CAGGCTGACTCGCTTTGTTGGCACGCGCGCAGTCATACATATGCAAGGGCCAGCAGTTTACGCACCGTCCTCCTCATGCA
GTTCTTTGGCATTCGGGGCTTTGTGGAAGTAAAGATAACCGCCGAACACTTTCATCAGCAACCTTCACCGGACAGCACCA
TTTTAATGCCCATCGCCTTTGATCTTACGCGACATTAAATACATCGGTGTTGAAGCGCGAATAGTGGTCACATCATAAGG
TTTCGATAGTGGTAAATCACGTACGCGGATGGCATCCAGACCTTCCTGTACAGTGAAGTGAATTTCGTGATGCACCGTGC
CCAGATGGTTTGCCACTTCCTGGGCTGCTTTCAGATCCGGTGAACCCGGCAGACCTACCAGCAAAGGTAGTGTAACTGCC
GGCCACCAGGGCTTCAGAGCGTTCCTGATCTTGCCACGCGACGGGCGCGTATTTCTTGGTGATAGCGGAAATAATTGAGG
AATCCAGACCACCAGAAAGCAGCACACGCGTAAGGCACACAGACATCAGATGGCCTTTTAACTGAATCTTCCAGTGCCGA
AGCACTCGTTTTTGTCAGGTCACGTTATCTTTCACCGCTATCGTAGTCGAACCAAGTCGGCGATGATAGTAAGAACGGAT
TTCGCCGTCCTGCGCTCCACAAATAGCTCCCCGCCGGGAACTCTTTAATCGTGCGGCAAACTGGCACCAGCGCTTTCATT
TCTGAGGCCACATACAGCTGACCGTGTTCGTCATACCCCATACACAGTGGGATGATCCCCAGATGCGTCGCGACCAATCA
GGTAGGCATCTTTTTCGCTCGTCGTACAGTGCAAAGGCAAACATGCCCTGCAAGTCGTCGAGAAATTCCGGCCCTTTCTT
CCTGATACAGCGCGAGGATCACTTCACAGTCAGACCCGGTCTGGAACTGGTAACGAATCGCCATATTCGGCGCGCGAATG
CCTGGTGGTTGTAGATTTCACCGTTTACCTGCCAGTACGTGGGTTTTTTGTTAGGTTGTATGAGAGGTTGCGCCCCCGCG
TTAACGTCAACAATTGACAACCGTTCGTGGGCGAGAATGGCGTTATCGCTGGCATAAATACCGTGACCAGTCCGGGCCAC
GATGACGCATGCAGGCGTGACAGCTCGAGGGGCTTTCTTACGCAGCTAAGTGCGTCTGTTTTGATATCAGAATAGCGCCA
AAAATTGAACACATAACCTTCTCCGTTAACCTGGTATTTGTTGCTTGTTGTGTTTGCTTGTTTAAAAAAATGCCGCAAAG
CAGCACTGTGCGCAGTCCGATTTGGATGGGTGAAAAAATAAAGAAAAAGTAATTGGATAGACTCTTGTGGATTTGGTGCA
TAAAAAGGTCTGGTGTGAGGATATATTTATTGATTGAATCGATAATTTTTAGCGGGTTTTATTGAATGTTATATTTTACT
TGGGGGCCAAATTTGCTGACAAAGTGCGAGTTTGTTCATGCCGGAATGCGGCGTGAACGCCTTATCCGGCCACAAAAGGC
ATGAAAATTCAATATATTAGCAGGAGCTGCGTAGGCCGTGATAAGCGAGCGCCATCAGGCAGTTTGGCGTTTAGTCATCA
GAGCCAACCACGTCCGCAGACGTGGTTGCTATTCGAAACGTCGATTTCAGCGACTGACCGGGTAAATCCAGCTGGGGCGA
AAAGGCATACCTGTCGATATCGTCGAGCGACGAAACACCAGAATGCACCAGAATCGTCTCCAGACCTGCCTGGAAGCCGG
CCAGAATACGGTACGCAGGTTATCGCCGACAATCACCGTTTCTATCCGAATGCGCCTGCATTATGGTTTAATGCTGCGCG
GATGATCCACGGGCTGGGCTTACAAACATAAGAACGGTTCTGCGCCCGGAAGATTTTCTCAATCCCTGCACAACAACGCG
CGCACAAGCGGGATAAAAACCGCGCGCCGTGGGTGATCCGGATTGGTGGCGATAAAACGTGCACCGTTAGCGACGAAATA
GGCTGCTTTATGCAATCATGTCCCAGTTGTAGGAACCGCGTTTCGCCCAACAATCACGAAAATCAAGGGTTCACATCGGT
AATAGTGAAACCGAGCTTTGTACAGTTCATGAATCAGTGCGCCTTCGCCCACCACATACGCTTTTCTTGCCTTCCTGGCG
ACTGGAGGAATCGTGCAGTCGCCATCGGCAGAGGTATAAAACACGCTGTGCAGGTACATCGACACCTGCGGTGGCAAAGC
GGTTCGCCACGATCTTGCCCAGTCTGCGAAGGATAGTTGGTCTAGCGAACAGCAGCGGCAGGCCTTTATCCATAATCCC

Best,
German

Adding Read Group Information

Maybe I missed it in the manual, is there an argument for adding a Read Group tag to reads during alignment? This could be useful because some phasing software that leverages long reads require a read group tag.

I realize samtools has this capability to add in read group information to a sam/bam file, but it's one step less to include it in the aligning step.

bwa has this option -R STR read group header line such as '@RG\tID:foo\tSM:bar' [null], which makes adding read group information easy.

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.