-=- MUMmer3.x README -=-

** NOTE **
A comprehensive HTML user manual is available in the docs/web/manual
subdirectory or at http://mummer.sourceforge.net/manual

MUMmer is now an open source package!  Please contact us if you would like
to contribute to the MUMmer project.  For more information or the latest
release please visit the MUMmer homepage at http://mummer.sourceforge.net

Please refer to the INSTALL file for installation instructions.  This file
contains brief descriptions of all executables in the base directory and
general information about the MUMmer package.



-- DESCRIPTION --
   MUMmer is a system for rapidly aligning entire genomes.  The current
version (release 3.0) can find all 20 base pair maximal exact matches between
two bacterial genomes of ~5 million base pairs each in 20 seconds, using 90 MB
of memory, on a typical 1.8 GHz Linux desktop computer.  MUMmer can also align
incomplete genomes; it handles the 100s or 1000s of contigs from a shotgun
sequencing project with ease, and will align them to another set of contigs or
a genome, using the nucmer utility included with the system.  The promer
utility takes this a step further by generating alignments based upon the
six-frame translations of both input sequences.  promer permits the alignment
of genomes for which the proteins are similar but the DNA sequence is too
divergent to detect similarity.  See the nucmer and promer readme files in the
"docs/" subdirectory for more details.  MUMmer is open source, so all we ask
is that you cite our most recent paper in any publications that use this
system:

        (Version 3.0 described)
  Versatile and open software for comparing large genomes.
  S. Kurtz, A. Phillippy, A.L. Delcher,
  M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg.
  Genome Biology (2004), 5:R12.

        (Version 2.1 described)
  Fast algorithms for large-scale genome alignment and comparison.
  A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzberg.
  Nucleic Acids Research 30:11 (2002), 2478-2483.

        (Version 1.0 described)
  Alignment of Whole Genomes.
  A.L. Delcher, S. Kasif,
  R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg.
  Nucleic Acids Research, 27:11 (1999), 2369-2376.


-- RUNNING MUMmer3.0 --
   MUMmer3.0 comprises many utilities and scripts.  For general
purposes, the scripts "run-mummer1", "run-mummer3", "nucmer", and "promer"
will be all that is needed.  See their descriptions in the "RUNNING THE MUMmer
SCRIPTS" section, or refer to their individual documentation in the "docs/"
subdirectory.  Refer to the "RUNNING THE MUMmer UTILITIES" section for a brief
description of all of the utilities in this directory.

Simple use case:
   Given a file containing a single reference sequence (ref.seq) in
FastA format and another file containing multiple sequences in FastA
format (qry.seq), type the following at the command line:

   './nucmer  -p <prefix>  ref.seq  qry.seq'

   To produce the following files:
        <prefix>.delta

or

   './run-mummer3.csh  ref.seq  qry.seq  <prefix>'

   To produce the following files:
        <prefix>.out
        <prefix>.gaps
        <prefix>.align
        <prefix>.errorsgaps

   Please read the utility-specific documentation in the "docs/" subdirectory
for descriptions of these files and information on how to change the
alignment parameters for the scripts (minimum match length, etc.), or see
the notes below in the "RUNNING THE MUMmer SCRIPTS" section for a brief
explanation.
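
   As a rough illustration of the two invocations above, the following
Python sketch assembles each command line together with the output files
the text says it produces.  The helper name simple_use_case is
hypothetical and nothing is executed here; it only mirrors the commands
quoted above, assuming they are run from the base directory.

```python
from typing import Dict, List


def simple_use_case(prefix: str, ref: str = "ref.seq",
                    qry: str = "qry.seq") -> Dict[str, Dict[str, List[str]]]:
    """Return the argv and expected output files for both simple use cases."""
    return {
        # './nucmer  -p <prefix>  ref.seq  qry.seq'  ->  <prefix>.delta
        "nucmer": {
            "argv": ["./nucmer", "-p", prefix, ref, qry],
            "outputs": [prefix + ".delta"],
        },
        # './run-mummer3.csh  ref.seq  qry.seq  <prefix>'  ->  four files
        "run-mummer3": {
            "argv": ["./run-mummer3.csh", ref, qry, prefix],
            "outputs": [prefix + ext
                        for ext in (".out", ".gaps", ".align", ".errorsgaps")],
        },
    }


plan = simple_use_case("demo")
```

   To actually launch a tool, an argv list such as plan["nucmer"]["argv"]
can be handed to subprocess.run(..., check=True) from Python's standard
library.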

   To see a simple gnuplot output, if you have gnuplot installed, run
the perl script 'mummerplot' on the output files. This script can be run
on mummer output (.out), or nucmer/promer output (.delta). Edit the
<prefix>.gp file that is created to change colors, line thicknesses, etc. or
explore the <prefix>.[fr]plot file to see the data collection.

   './mummerplot  -p <prefix>  <prefix>.out'

   Or you can use the web viewer for completed microbial genomes:
http://www.tigr.org/CMR



-- RUNNING THE MUMmer SCRIPTS --
   Because of MUMmer's modular design, it may be necessary to use a number
of separate programs to produce the desired output.  The MUMmer scripts
attempt to simplify this process by wrapping various utilities into packages
that can perform standard alignment requests.  Listed below are brief
descriptions and usage definitions for these scripts.  Please refer to the
"docs/" subdirectory for a more detailed description of each script.


   ** nucmer **

        DESCRIPTION:
        nucmer is for the all-vs-all comparison of nucleotide sequences
        contained in multi-FastA data files.  It is best used for highly
        similar sequences that may have large rearrangements.  Common use
        cases are: comparing two unfinished shotgun sequencing assemblies,
        mapping an unfinished sequencing assembly to a finished genome, and
        comparing two fairly similar genomes that may have large
        rearrangements and duplications.  Please refer to "docs/nucmer.README"
        for more information regarding this script and its output, or type
        'nucmer -h' for a list of its options.

        USAGE:
        nucmer  [options]  <reference>  <query>

        [options]    type 'nucmer -h' for a list of options.
        <reference>  specifies the multi-FastA sequence file that contains
                     the reference sequences, to be aligned with the queries.
        <query>      specifies the multi-FastA sequence file that contains
                     the query sequences, to be aligned with the references.

        OUTPUT:
        out.delta    the delta encoded alignments between the reference and
                     query sequences.  This file can be parsed with any of
                     the show-* programs which are described in the "RUNNING
                     THE MUMmer UTILITIES" section.

        NOTES:
        All output coordinates reference the forward strand of the involved
        sequence, regardless of the match direction. Also, nucmer now uses
        only matches that are unique in the reference sequence by default;
        use the '--mum' or '--maxmatch' options to change this behavior.


   ** promer **

        DESCRIPTION:
        promer is for the protein level, all-vs-all comparison of nucleotide
        sequences contained in multi-FastA data files.  The nucleotide input
        files are translated in all 6 reading frames and then aligned to one
        another via the same methods as nucmer.  It is best used for highly
        divergent sequences that may have moderate to high similarity on the
        protein level.  Common use cases are: identifying syntenic regions
        between highly divergent genomes, comparative genome annotation (i.e.
        using an already annotated genome to help in the annotation of a
        newly sequenced genome), and the general comparison of two fairly
        divergent genomes that have large rearrangements and may only be
        similar on the protein level. Please refer to "docs/promer.README"
        for more information regarding this script and its output, or type
        'promer -h' for a list of its options.

        USAGE:
        promer  [options]  <reference>  <query>

        [options]    type 'promer -h' for a list of options.
        <reference>  specifies the multi-FastA sequence file that contains
                     the reference sequences, to be aligned with the queries.
        <query>      specifies the multi-FastA sequence file that contains
                     the query sequences, to be aligned with the references.

        OUTPUT:
        out.delta    the delta encoded alignments between the reference and
                     query sequences.  This file can be parsed with any of
                     the show-* programs which are described in the "RUNNING
                     THE MUMmer UTILITIES" section.

        NOTES:
        All output coordinates reference the forward strand of the involved
        sequence, regardless of the match direction, and are measured in
        nucleotides with the exception of the delta integers which are
        measured in amino acids (1 delta int = 3 nucleotides). Also, promer
        now uses only matches that are unique in the reference sequence by
        default; use the '--mum' or '--maxmatch' options to change this
        behavior.


   ** run-mummer1 **

        DESCRIPTION:
        This script is taken directly from MUMmer1.0 and is best used to
        align two sequences in which there is high similarity and no
        rearrangements.  A common use case is aligning two finished bacterial
        chromosomes.  Please refer to "docs/run-mummer1.README" for the
        original documentation for this script and its output.

        USAGE:
        run-mummer1  <seq1>  <seq2>  <tag>  [-r]

        <seq1>  specifies the file with the first sequence in FastA format.
                No more than one sequence is allowed.
        <seq2>  specifies the file with the second sequence in FastA format.
                No more than one sequence is allowed.
        <tag>   specifies the prefix to be used for the output files.
        [-r]    is an optional parameter that will reverse complement the
                second sequence.

        OUTPUT:
        out.align       the out.gaps file interspersed with the alignments
                        of the gaps.
        out.errorsgaps  the out.gaps file with an extra column stating the
                        number of errors contained in each gap.
        out.gaps        an ordered (clustered) list of matches with position
                        information, and gap distances between each match.
        out.out         a list of all maximal unique matches between the two
                        input sequences ordered by their start position in the
                        second sequence.

        NOTES:
        All output coordinates reference their respective strand.  This means
        that if the -r switch is active, coordinates that reference the
        second sequence will be relative to the reverse complement of the
        second sequence.  Please use nucmer or promer if this coordinate
        system is confusing.
            Eventually, this script's components will be rewritten to work
        with the new MUMmer format standards and phased out in favor of the
        new components and wrapping script.


   ** run-mummer3 **

        DESCRIPTION:
        This script is the improved version of the MUMmer1.0 run-mummer1
        script.  It uses a new clustering algorithm that appropriately
        handles multiple sequence rearrangements and inversions.  Because
        of this, it can handle more divergent sequences better than
        run-mummer1.  In addition, it allows a multi-FastA query file for
        1-vs-many sequence comparisons.  Please refer to
        "docs/run-mummer3.README" for more detailed documentation of this
        script and its output.

        USAGE:
        run-mummer3  <reference>  <query>  <prefix>

        <reference>  specifies the file with the reference sequence in FastA
                     format.  No more than one sequence is allowed.
        <query>      specifies the multi-FastA sequence file that contains
                     the query sequences.
        <prefix>     specifies the file prefix for the output files.

        OUTPUT:
        out.align       the out.gaps file interspersed with the alignments
                        of the gaps.
        out.errorsgaps  the out.gaps file with an extra column stating the
                        number of errors contained in each gap.
        out.gaps        an ordered (clustered) list of matches with position
                        information, and gap distances between each match.
        out.out         a list of all maximal unique matches between the two
                        input sequences ordered by their start position in the
                        second sequence.

        NOTES:
        All output coordinates reference their respective strand.  This means
        that for all reverse matches, the coordinates that reference the
        query sequence will be relative to the reverse complement of the
        query sequence.  Please use nucmer or promer if this coordinate
        system is confusing.
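
        EXAMPLE:
        A coordinate that is relative to the reverse complement can be
        mapped back to the forward strand with ordinary reverse-complement
        arithmetic.  The Python sketch below assumes 1-based, inclusive
        coordinates and a known query length; the function names are
        illustrative and are not part of MUMmer.

```python
from typing import Tuple


def rc_to_forward(pos: int, seq_len: int) -> int:
    """Map a 1-based position on the reverse complement to the forward strand."""
    if not 1 <= pos <= seq_len:
        raise ValueError("position lies outside the sequence")
    return seq_len - pos + 1


def forward_span(rc_start: int, match_len: int, seq_len: int) -> Tuple[int, int]:
    """Forward-strand (start, end) of a match reported on the reverse complement.

    The returned start is larger than the end, because the match runs
    backwards along the forward strand.
    """
    rc_end = rc_start + match_len - 1
    return rc_to_forward(rc_start, seq_len), rc_to_forward(rc_end, seq_len)
```

        For a 100 base query, a reverse match of length 20 that starts at
        position 11 of the reverse complement covers forward-strand
        positions 90 down to 71, i.e. forward_span(11, 20, 100) gives
        (90, 71).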


   ** dnadiff **

        DESCRIPTION:
        This script is a wrapper around nucmer that builds an
        alignment using default parameters, and runs many of nucmer's
        helper scripts to process the output and report alignment
        statistics, SNPs, breakpoints, etc. It is designed for
        evaluating the sequence and structural similarity of two
        highly similar sequence sets, e.g. comparing two different
        assemblies of the same organism, or comparing two strains of
        the same species.  Please refer to "docs/dnadiff.README" for
        more information regarding this script and its output, or type
        'dnadiff -h' for a list of its options.

        USAGE: dnadiff  [options]  <reference>  <query>
          or   dnadiff  [options]  -d <delta file>

        <reference>       Set the input reference multi-FASTA filename
        <query>           Set the input query multi-FASTA filename
           or
        <delta file>      Unfiltered .delta alignment file from nucmer

        OUTPUT:
        .report  - Summary of alignments, differences and SNPs
        .delta   - Standard nucmer alignment output
        .1delta  - 1-to-1 alignment from delta-filter -1
        .mdelta  - M-to-M alignment from delta-filter -m
        .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
        .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
        .snps    - SNPs from show-snps -rlTHC .1delta
        .rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
        .qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
        .unref   - Unaligned reference IDs and lengths (if applicable)
        .unqry   - Unaligned query IDs and lengths (if applicable)

        NOTES:
        The report file generated by this script can be useful for
        comparing the differences between two similar genomes or
        assemblies. The other outputs generated by this script are in
        unlabeled tabular format, so please refer to the utility
        specific documentation for interpreting them. A full
        description of the report file is given in "docs/dnadiff.README".


-- RUNNING THE MUMmer UTILITIES --
   The MUMmer package consists of various utilities that can interact with
the 'mummer' program.  'mummer' performs all maximal and maximal unique
matching, and all other utilities were designed to process the input and
output of this program and its related scripts, in order to extract
additional information from the output.  Listed below are the descriptions
and usage definitions for these utilities.


   ** annotate **

        DESCRIPTION:
        This program reads the output of the 'gaps' program and adds alignment
        information to it.  It is part of the original MUMmer1.0 pipeline and
        can only be used on the output of the 'gaps' program.

        USAGE:
        annotate  <gapsfile>  <seq2>

        <gapsfile>  the output of the 'gaps' program.
        <seq2>      the file containing the second sequence in the comparison.

        OUTPUT:
        stdout           the 'gaps' output interspersed with the alignments of
                         the gaps between adjacent MUMs.  An alignment of a
                         gap comes after the second MUM defining the gap, and
                         alignment errors are marked with a '^' character.
        witherrors.gaps  the 'gaps' output with an appended column that lists
                         the number of alignment errors for each gap.

        NOTES:
        This program will eventually be dropped in favor of the combineMUMs
        or nucmer match extenders, but persists for the time being.


   ** combineMUMs **

        DESCRIPTION:
        This program reads the output of the 'mgaps' program and adds alignment
        information to it.  It is part of the MUMmer3.0 pipeline and can only
        be used on the output of the 'mgaps' program.  The -D option alters
        this behavior and only outputs the positions of difference, e.g. SNPs.

        USAGE:
        combineMUMs  [options]  <reference>  <query>  <mgapsfile>

        [options]    type 'combineMUMs -h' for a list of options.
        <reference>  the FastA reference file used in the comparison.
        <query>      the multi-FastA query file used in the comparison.
        <mgapsfile>  the output of the 'mgaps' program run on the match
                     list produced by 'mummer' for the reference and query
                     files.

        OUTPUT:
        stdout           the 'mgaps' output interspersed with the alignments
                         of the gaps between adjacent MUMs.  An alignment of a
                         gap comes after the second MUM defining the gap, and
                         alignment errors are marked with a '^' character.  At
                         the end of each cluster is a summary line (keyword
                         "Region") noting the bounds of the cluster in the
                         reference and query sequences, the total number of
                         errors for the region, the length of the region and
                         the percent error of the region.
        witherrors.gaps  the 'mgaps' output with an appended column that lists
                         the number of alignment errors for each gap.


   ** delta-filter **

        DESCRIPTION:

        This program filters a delta alignment file produced by either
        nucmer or promer, leaving only the desired alignments which
        are output to stdout in the same delta format as the
        input. Its primary function is the LIS algorithm which
        calculates the longest increasing subset of alignments. This
        allows for the calculation of a global set of alignments
        (i.e. 1-to-1 and mutually consistent order) with the -g option
        or locally consistent with -1 or -m. Reference sequences can
        be mapped to query sequences with -r, or queries to references
        with -q. This allows the user to exclude chance and repeat
        induced alignments, leaving only the "best" alignments between
        the two data sets. Filtering can also be performed on length,
        identity, and uniqueness.

        USAGE:
        delta-filter  [options]  <deltafile>

        [options]    type 'delta-filter -h' for a list of options.
        <deltafile>  the .delta output file from either nucmer or promer.

        OUTPUT:
        stdout  The same delta alignment format as output by nucmer and promer.

        NOTES:
        For most cases the -m option is recommended, however -1 is
        useful for applications that require a 1-to-1 mapping, such as
        SNP finding. Use the -q option for mapping query contigs to
        their best reference location.
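
        EXAMPLE:
        The idea behind a longest-increasing-subset filter can be sketched
        in Python as a weighted chaining problem.  This is a simplified
        illustration and not delta-filter's actual implementation: it
        assumes forward-strand alignments only, represents each alignment
        as a hypothetical (ref_start, qry_start, length) tuple, weights
        alignments by length alone, and forbids overlaps.

```python
from typing import List, Tuple

Alignment = Tuple[int, int, int]  # (ref_start, qry_start, length)


def best_colinear_chain(alns: List[Alignment]) -> List[Alignment]:
    """Return the heaviest set of alignments that increase in both sequences."""
    alns = sorted(alns)                      # order by reference start
    n = len(alns)
    if n == 0:
        return []
    score = [a[2] for a in alns]             # best chain weight ending at i
    prev = [-1] * n                          # back-pointer for traceback
    for i in range(n):
        for j in range(i):
            # j may precede i only if it ends before i starts in BOTH sequences
            ref_ok = alns[j][0] + alns[j][2] <= alns[i][0]
            qry_ok = alns[j][1] + alns[j][2] <= alns[i][1]
            if ref_ok and qry_ok and score[j] + alns[i][2] > score[i]:
                score[i] = score[j] + alns[i][2]
                prev[i] = j
    i = max(range(n), key=score.__getitem__)  # end of the best chain
    chain = []
    while i != -1:
        chain.append(alns[i])
        i = prev[i]
    return chain[::-1]
```

        Given the alignments (1,1,50), (60,60,40), (100,400,30) and
        (200,150,80), the chain keeps the first, second and fourth; the
        third is dropped because its query position is out of order with
        respect to the alignment that follows it.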


   ** exact-tandems **

        DESCRIPTION:
        This script finds exact tandem repeats in a specified FastA sequence
        file.  It is a post-processor for 'repeat-match' and provides a simple
        interface and output for tandem repeat detection.

        USAGE:
        exact-tandems  <file>  <min match>

        <file>       the single sequence in FastA format to search for repeats.
        <min match>  the minimum match length for the tandems.

        OUTPUT:
        stdout  4 columns: the start of the tandem repeat, the total extent
                of the repeat region, the length of each repetitive unit, and
                the total copies of the repetitive unit involved.


   ** gaps **

        DESCRIPTION:
        This program reads a list of unique matches between two strings and
        outputs the longest consistent set of matches, followed by all the
        other matches.  It is part of the MUMmer1.0 pipeline.  The output of
        the 'mummer' program must be processed (to strip all non-match lines)
        before it can be passed to this program.

        USAGE:
        gaps  <seq1>  [-r]  <  <matchlist>

        <seq1>       The first sequence file that the match list represents.
        <matchlist>  A simple list of matches and NO header lines or other
                     mumbo jumbo.  The columns of the match list should be
                     start in the reference, start in the query, and length
                     of the match.
        [-r]         Simply puts the string "reverse" on the header of the
                     output so 'annotate' knows to reverse the second
                     sequence.

        OUTPUT:
        stdout  an ordered set of the input matches, separated by headers.
                The first set is the longest consistent set of matches and
                the second set is all other matches.

        NOTES:
        This program will eventually be rewritten to be interchangeable with
        'mgaps', so that it may be plugged into the nucmer or promer
        pipelines.
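
        EXAMPLE:
        One way to strip the non-match lines before calling 'gaps' is
        sketched below in Python.  It assumes that every match line
        consists of exactly three whitespace-separated integers (start in
        the reference, start in the query, length) and that header lines
        never do; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def match_lines_only(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Keep only lines made of three integers; drop headers and blank lines."""
    matches = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            ref_start, qry_start, length = (int(f) for f in fields)
            matches.append((ref_start, qry_start, length))
    return matches


def as_match_list(matches: Iterable[Tuple[int, int, int]]) -> str:
    """Format matches as a plain three-column list, one match per line."""
    return "".join("%d %d %d\n" % m for m in matches)
```

        The string returned by as_match_list can then be written to a
        file and redirected into 'gaps' on standard input.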


   ** mapview **

        DESCRIPTION:
        mapview is a utility program for displaying sequence alignments as
        provided by MUMmer, nucmer or promer. This program takes the output
        from these alignment routines and converts it to a FIG, PDF or PS
        file for visual analysis. It can also break the output into multiple
        files for easier viewing and printing. Please refer to
        "docs/mapview.README" for a more detailed description and explanation.

        USAGE:
        mapview  [options]  <coords file>  [UTR coords]  [CDS coords]

        [options]       type 'mapview -h' for a list of options.
        <coords file>   show-coords output file
        [UTR coords]    UTR coordinate file in GFF format
        [CDS coords]    CDS coordinate file in GFF format

        OUTPUT:
        Default output format is an xfig file, however this can be changed to
        a PostScript or PDF file with the -f option. See 'mapview -h' for a
        list of available formatting options.

        NOTES:
        To produce the coords file input, 'show-coords' must be run with the
        -r -l options. To reduce redundant matches in promer output, run
        show-coords with the -k option. To generate output formats other than
        xfig, the fig2dev utility must be available from the system path. For
        very large reference genomes, FIG format may be the only option that
        will allow the entire display to be stored in one file, as fig2dev has
        problems if the output is too large.


   ** mgaps **

        DESCRIPTION:
        This program reads a list of matches between a single-FastA reference
        and a multi-FastA query file and outputs clusters of matches that lie
        on similar diagonals and within a reasonable distance.  It is part of
        the MUMmer3.0 pipeline.  The output of 'mummer' need not be processed
        before passing it to this program, so long as 'mummer' was run on a
        1-vs-many or 1-vs-1 dataset.

        USAGE:
        mgaps  [options]  <  <matchlist>

        [options]    type 'mgaps -h' for a list of options.
        <matchlist>  A list of matches separated by their sequence FastA tags.
                     The columns of the match list should be start in
                     reference, start in query, and length of the match.

        OUTPUT:
        stdout  An ordered set of the input matches, separated by headers.
                Individual clusters are separated by a '#' character and
                sets of clusters from different sequences are separated by
                the FastA header tag for the query sequence.

        NOTES:
        It is often very helpful to adjust the clustering parameters.  Check
        'mgaps -h' for the list of parameters and check the source for a
        better idea of how each parameter affects the result.  Often, it is
        helpful to run this program a number of times with different
        parameters until the desired result is achieved.
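
        EXAMPLE:
        To build intuition for how clustering parameters interact, the
        Python sketch below groups matches that lie on similar diagonals
        and within a bounded distance of each other.  It is a simplified
        stand-in and not the mgaps algorithm: the parameter names max_gap
        and max_diag_diff and their default values are assumptions made
        for this illustration, and only forward matches are considered.

```python
from typing import List, Tuple

Match = Tuple[int, int, int]  # (ref_start, qry_start, length)


def cluster_matches(matches: List[Match], max_gap: int = 1000,
                    max_diag_diff: int = 5) -> List[List[Match]]:
    """Group matches that are close on the reference and on nearby diagonals."""
    parent = list(range(len(matches)))

    def find(i: int) -> int:
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = sorted(range(len(matches)), key=lambda i: matches[i])
    for pos, a in enumerate(order):
        ref_a, qry_a, len_a = matches[a]
        for b in order[pos + 1:]:
            ref_b, qry_b, _ = matches[b]
            if ref_b - (ref_a + len_a) > max_gap:
                break                       # later matches are even farther away
            # The diagonal of a match is ref_start - qry_start
            if abs((ref_b - qry_b) - (ref_a - qry_a)) <= max_diag_diff:
                parent[find(b)] = find(a)   # merge the two clusters

    clusters = {}
    for i in order:
        clusters.setdefault(find(i), []).append(matches[i])
    return sorted(clusters.values())
```

        Raising max_diag_diff tolerates larger indels inside a cluster,
        while raising max_gap joins matches that are farther apart;
        rerunning with different values shows how the grouping changes.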


   ** mummer **

        DESCRIPTION:
        This is the core program of the MUMmer package.  It is the suffix-tree
        based match finding routine, and the main part of every MUMmer script.
        For a detailed manual describing how to use this program, please refer
        to "docs/maxmat3man.pdf" or in LaTeX format "docs/maxmat3man.tex". By
        default, 'mummer' now finds maximal matches regardless of their
        uniqueness. Limiting the output to only unique matches can be specified
        as a command line switch.

        USAGE:
        mummer  [options]  <reference>  <query> ...

        [options]    type 'mummer -help' for a list of options.
        <reference>  specifies the single or multi-FastA sequence file that
                     contains the reference sequence(s), to be aligned with
                     the queries.
        <query>      specifies the multi-FastA sequence file that contains
                     the query sequences, to be aligned with the references.
                     Multiple query files are allowed, up to 32.

        OUTPUT:
        stdout  a list of exact matches. Varies depending on input, refer to
                the manual specified in the description above.

        NOTES:
        Many thanks to Stefan Kurtz for the latest mummer version. 'mummer'
        now behaves like the old 'mummer2' program by default. The -mum switch
        forces it to behave like 'mummer1', the -mumreference switch forces it
        to behave like 'mummer2' while the -maxmatch switch forces it to behave
        like the old 'max-match' program.


   ** mummerplot **

        DESCRIPTION:
        mummerplot is a perl script that generates gnuplot scripts and data
        collections for plotting with the gnuplot utility.  It can generate
        2-d dotplots and 1-d coverage plots for the output of mummer, nucmer,
        promer or show-tiling. It can also color dotplots with an identity
        color gradient.

        USAGE:
        mummerplot  [options]  <matchfile>

        [options]    type 'mummerplot -h' for a list of options.
        <matchfile>  the output of 'mummer', 'nucmer', 'promer', or
                     'show-tiling'. 'mummerplot' will automatically determine
                     the format of the data it was given and produce the plot
                     accordingly.

        OUTPUT:
        out.gp     The gnuplot script; type 'gnuplot out.gp' to evaluate it.
        out.fplot
        out.rplot
        out.hplot  The forward, reverse and highlighted match information for
                   plotting with gnuplot.

        out.ps
        out.png    The plotted image file, postscript or png depending on the
                   selected terminal type.

        NOTES:
        For alignments with multiple reference or query sequences, be sure to
        use the -r -q or -R -Q options to avoid overlaying multiple plots in
        the same space. For better looking color gradient plots, try the
        postscript terminal and avoid the png terminal.


   ** nucmer2xfig **

        DESCRIPTION:
        Script for plotting nucmer hits against a reference sequence. See top
        of script for more information, or see if 'mummerplot' or 'mapview'
        has the functionality required as they are properly maintained.


   ** repeat-match **

        DESCRIPTION:
        Finds exact repeats within a single sequence.

        USAGE:
        repeat-match  [options]  <seq>

        [options]  type 'repeat-match -h' for a list of options.
        <seq>      the single sequence in FastA format to search for repeats.

        OUTPUT:
        stdout  3 columns, the start of the first copy of the repeat, the
                start of the second copy of the repeat, and the length of the
                repeat respectively.

        NOTES:
        REPuter (freely available for universities) may be better suited for
        most repeat matching, but 'repeat-match' is open-source and has some
        functionality that REPuter does not, so we include it along with the
        MUMmer package.


   ** show-aligns **

        DESCRIPTION:
        This program parses the delta alignment output of nucmer and promer
        and displays all of the pairwise alignments from the two sequences
        specified on the command line.

        USAGE:
        show-aligns  [options]  <deltafile>  <IdR>  <IdQ>

        [options]    type 'show-aligns -h' for a list of options.
        <deltafile>  the .delta output file from either nucmer or promer.
        <IdR>        the FastA header tag of the desired reference sequence.
        <IdQ>        the FastA header tag of the desired query sequence.

        OUTPUT:
        stdout  each alignment header and footer describes the frame of the
                alignment in each sequence, and the start and finish
                (inclusive) of the alignment in each sequence.  At the
                beginning of each line of aligned sequence are two numbers, the
                top is the coordinate of the first reference base on that line
                and the bottom is the coordinate of the first query base on
                that line.  ALL coordinates reference the forward strand of the
                DNA sequence, even if it is a protein alignment.  A gap caused
                by an insertion or deletion is filled with a '.' character.
                Errors in a DNA alignment are marked with a '^' below the
                error.  Errors in an amino acid alignment are marked with a
                whitespace in the middle consensus line, while matches are
                marked with the consensus base and similarities are marked with
                a '+' in the consensus line.


   ** show-coords **

        DESCRIPTION:
        This program parses the delta alignment output of nucmer and promer
        and displays the coordinates, and other useful information about the
        alignments.

        USAGE:
        show-coords  [options]  <deltafile>

        [options]    type 'show-coords -h' for a list of options.
        <deltafile>  the .delta output file from either nucmer or promer.

        OUTPUT:
        stdout  run 'show-coords' without the -H option to see the column
                header tags.  Here is a description of each tag.  Note that
                some of the below tags do not apply to nucmer data, and that
                all coordinates are inclusive and relative to the forward DNA
                strand.

        [S1]    Start of the alignment region in the reference sequence.

        [E1]    End of the alignment region in the reference sequence.

        [S2]    Start of the alignment region in the query sequence.

        [E2]    End of the alignment region in the query sequence.

        [LEN 1] Length of the alignment region in the reference sequence,
        measured in nucleotides.

        [LEN 2] Length of the alignment region in the query sequence, measured
        in nucleotides.

        [% IDY] Percent identity of the alignment, calculated as
        (number of exact matches) / ([LEN 1] + insertions in the query).

        [% SIM] Percent similarity of the alignment, calculated like the above
        value, but counting positive BLOSUM matrix scores instead of exact
        matches.

        [% STP] Percent of stop codons of the alignment, calculated as
        (number of stop codons) / (([LEN 1] + insertions in the query) * 2).

        [LEN R] Length of the reference sequence.

        [LEN Q] Length of the query sequence.

        [COV R] Percent coverage of the alignment on the reference sequence,
        calculated as [LEN 1] / [LEN R].

        [COV Q] Percent coverage of the alignment on the query sequence,
        calculated as [LEN 2] / [LEN Q].

        [FRM]   Reading frame for the reference sequence and the reading frame
        for the query sequence respectively.  This is one of the columns
        absent from the nucmer data; however, match direction can easily be
        determined from the start and end coordinates.

        [TAGS]  The reference FastA ID and the query FastA ID.

                There is also an optional final column (turned on with the -w
        or -o option) that will contain some 'annotations'. The -o option will
        annotate alignments that represent overlaps between two sequences,
        while the -w option is antiquated and should no longer be used.
        Sometimes, nucmer or promer will extend adjacent clusters past one
        another, causing somewhat redundant output; this option will
        notify users of such rare occurrences.
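
        The derived columns above can be reproduced from the raw values.
        The following is a minimal Python sketch, not part of MUMmer; the
        function names are hypothetical, and the multiplication by 100 is
        an assumption based on the columns being described as percentages.

```python
def percent_coverage(aligned_len: int, sequence_len: int) -> float:
    """[COV R] or [COV Q]: aligned length as a percentage of the sequence length."""
    if sequence_len <= 0:
        raise ValueError("sequence length must be positive")
    return 100.0 * aligned_len / sequence_len


def percent_identity(exact_matches: int, len1: int, query_insertions: int) -> float:
    """[% IDY]: exact matches over ([LEN 1] + insertions in the query)."""
    denominator = len1 + query_insertions
    if denominator <= 0:
        raise ValueError("alignment length must be positive")
    return 100.0 * exact_matches / denominator


def match_direction(start: int, end: int) -> int:
    """Return 1 for a forward match and -1 for a reverse match.

    Assumption: since coordinates are relative to the forward strand,
    a reverse match is reported with start greater than end.
    """
    return 1 if start <= end else -1
```

        For example, an alignment with [LEN 1] = 500 on a reference of
        length 2000 gives percent_coverage(500, 2000) == 25.0.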

        NOTES:
        The -c and -l options are useful when comparing two sets of assembly
        contigs, in that these options help determine if an alignment spans an
        entire contig, or is just a partial hit to a different read.  The -b
        option is useful when the user wishes to identify syntenic regions
        between two genomes, but is not particularly interested in the actual
        alignment similarity or appearance.  This option also disregards match
        orientation, so should not be used if this information is needed.
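
        As an illustration of the contig use case, the sketch below labels
        an alignment as a full or partial hit from its [LEN 2] and [LEN Q]
        values.  It is not part of MUMmer: the function name and the 95%
        cutoff are hypothetical, and it assumes the length and coverage
        values were obtained by running 'show-coords' with -c and -l.

```python
def contig_hit_type(len2: int, len_q: int, min_cov_q: float = 95.0) -> str:
    """Classify an alignment by how much of the query contig it covers.

    len2      -- [LEN 2], aligned length in the query contig
    len_q     -- [LEN Q], total length of the query contig
    min_cov_q -- hypothetical cutoff (percent) for calling a hit "full"
    """
    if len_q <= 0:
        raise ValueError("contig length must be positive")
    cov_q = 100.0 * len2 / len_q  # same formula as the [COV Q] column
    return "full" if cov_q >= min_cov_q else "partial"
```

        The cutoff should be tuned to the data; a lower value tolerates
        unaligned contig ends.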


   ** show-diff **

        DESCRIPTION:
        This program classifies alignment breakpoints for the
        quantification of macroscopic differences between two
        genomes. It takes a standard, unfiltered delta file as input,
        determines the best mapping between the two sequence sets, and
        reports on the breaks in that mapping.

        USAGE:
        show-diff  [options]  <deltafile>

        [options]    type 'show-diff -h' for a list of options.
        <deltafile>  the .delta output file from nucmer

        OUTPUT:
        stdout  Classified breakpoints are output one per line with
                the following types and column definitions. The first
                five columns of every row are seq ID, feature type,
                feature start, feature end, and feature length.

        Feature Columns

        IDR GAP gap-start gap-end gap-length-R gap-length-Q gap-diff
        IDR DUP dup-start dup-end dup-length
        IDR BRK gap-start gap-end gap-length
        IDR JMP gap-start gap-end gap-length
        IDR INV gap-start gap-end gap-length
        IDR SEQ gap-start gap-end gap-length prev-sequence next-sequence

        Feature Types

        [GAP] A gap between two mutually consistent ordered and
        oriented alignments. gap-length-R is the length of the
        alignment gap in the reference, gap-length-Q is the length of
        the alignment gap in the query, and gap-diff is the difference
        between the two gap lengths. If gap-diff is positive, sequence
        has been inserted in the reference. If gap-diff is negative,
        sequence has been deleted from the reference. If both
        gap-length-R and gap-length-Q are negative, the indel is a
        tandem duplication copy difference.

        [DUP] A duplicated sequence in the reference that occurs more
        times in the reference than in the query. The coordinate
        columns specify the bounds and length of the
        duplication. These features are often bookended by BRK
        features if there is unique sequence bounding the duplication.

        [BRK] An insertion in the reference of unknown origin, that
        indicates no query sequence aligns to the sequence bounded by
        gap-start and gap-end. Often found around DUP elements or at
        the beginning or end of sequences.

        [JMP] A relocation event, where the consistent ordering of
        alignments is disrupted. The coordinate columns specify the
        breakpoints of the relocation in the reference, and the
        gap-length between them. A negative gap-length indicates the
        relocation occurred around a repetitive sequence, and a
        positive length indicates unique sequence between the
        alignments.

        [INV] The same as a relocation event, except that both the
        ordering and orientation of the alignments are disrupted. Note
        that for JMP and INV, generally two features will be output,
        one for the beginning of the inverted region, and another for
        the end of the inverted region.

        [SEQ] A translocation event that requires jumping to a new
        query sequence in order to continue aligning to the
        reference. If each input sequence is a chromosome, these
        features correspond to inter-chromosomal translocations.
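
        The column layout and the GAP rules above can be sketched in
        Python as follows.  This is an illustration only, not part of
        MUMmer: the names are hypothetical, the columns are assumed to be
        whitespace-delimited, and checking the tandem case before the
        sign of gap-diff is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiffFeature:
    seq_id: str
    kind: str          # GAP, DUP, BRK, JMP, INV or SEQ
    start: int
    end: int
    length: int        # fifth column; gap-length-R for GAP rows
    extra: List[str]   # any type-specific trailing columns


def parse_feature(line: str) -> DiffFeature:
    """Split one show-diff row into its five common columns plus extras."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least five columns: " + line)
    return DiffFeature(fields[0], fields[1], int(fields[2]),
                       int(fields[3]), int(fields[4]), fields[5:])


def describe_gap(feature: DiffFeature) -> str:
    """Interpret a GAP row using the rules given for gap-diff."""
    if feature.kind != "GAP" or len(feature.extra) < 2:
        raise ValueError("not a complete GAP row")
    gap_r = feature.length
    gap_q = int(feature.extra[0])
    gap_diff = int(feature.extra[1])
    if gap_r < 0 and gap_q < 0:
        return "tandem duplication copy difference"
    if gap_diff > 0:
        return "sequence inserted in the reference"
    if gap_diff < 0:
        return "sequence deleted from the reference"
    return "no length difference"
```

        For instance, a made-up row "chr1 GAP 100 250 151 51 100" parses
        to a GAP feature whose positive gap-diff indicates an insertion in
        the reference.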

        NOTES:
        The estimated number of features (take inversions, for example)
        represents the number of breakpoints classified as bordering
        an inversion. Therefore, since there will be a breakpoint at
        both the beginning and the end of an inversion, the feature
        counts are roughly double the number of inversion events. In
        addition, all counts are estimates and do not represent the
        exact number of each evolutionary event.

        Summing the fifth column (ignoring negative values) yields an
        estimate of the total inserted sequence in the
        reference. Summing the fifth column after removing DUP
        features yields an estimate of the total amount of unique
        (unaligned) sequence in the reference. Note that unaligned
        sequences are not counted, and could represent additional
        "unique" sequences. Use the 'dnadiff' script if you must
        recover this information. Finally, the -q option switches
        references for queries, and uses the query coordinates for the
        analysis.
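
        The sums described above can be sketched in Python as follows.
        This is an illustration, not part of MUMmer: the function name is
        hypothetical, the columns are assumed to be whitespace-delimited,
        negative lengths are assumed to be ignored in both sums, and the
        halving of INV rows follows the "roughly double" remark above.

```python
from typing import Dict, Iterable


def summarize_diff(lines: Iterable[str]) -> Dict[str, float]:
    """Estimate inserted and unique reference sequence from show-diff rows."""
    inserted = 0         # fifth column summed, negative values ignored
    unique = 0           # the same sum with DUP features removed
    inv_breakpoints = 0  # each inversion usually contributes two rows
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            length = int(fields[4])
        except ValueError:
            continue  # skip header or malformed rows
        kind = fields[1]
        if kind == "INV":
            inv_breakpoints += 1
        if length > 0:
            inserted += length
            if kind != "DUP":
                unique += length
    return {
        "inserted": inserted,
        "unique": unique,
        "approx_inversions": inv_breakpoints / 2,
    }
```

        All three values are estimates, for the reasons given in the
        notes above.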


   ** show-snps **

        DESCRIPTION:
        This program reports polymorphisms contained in a delta encoded
        alignment file output by either nucmer or promer. It catalogs
        all of the single nucleotide polymorphisms (SNPs) and
        insertions/deletions within the delta file
        alignments. Polymorphisms are reported one per line, in a
        delimited fashion similar to show-coords. Pairing this program
        with the appropriate MUMmer tools can create an easy to use
        SNP pipeline for the rapid identification of putative SNPs
        between any two sequence sets.

        USAGE:
        show-snps  [options]  <deltafile>

        [options]    type 'show-snps -h' for a list of options.
        <deltafile>  the .delta output file from either nucmer or promer.

        OUTPUT:
        stdout  Standard output has column headers with the following
                meanings. Not all columns will be output by default,
                see 'show-snps -h' for the switches that control the output.

        [P1]    SNP position in the reference.

        [SUB]   Character in the reference.

        [SUB]   Character in the query.

        [P2]    SNP position in the query.

        [BUFF]  Distance from this SNP to the nearest mismatch (end of
        alignment, indel, SNP, etc) in the same alignment.

        [DIST]  Distance from this SNP to the nearest sequence end.

        [R]     Number of repeat alignments which cover this reference
        position, >0 means repetitive sequence.

        [Q]     Number of repeat alignments which cover this query
        position, >0 means repetitive sequence.

        [LEN R] Length of the reference sequence.

        [LEN Q] Length of the query sequence.

        [CTX R] Surrounding context sequence in the reference.

        [CTX Q] Surrounding context sequence in the query.

        [FRM]   Reading frame for the reference sequence and the
        reading frame for the query sequence respectively. Simply
        'forward' 1, or 'reverse' -1 for nucmer data.

        [TAGS]  The reference FastA ID and the query FastA ID.

        NOTES:
        It is often helpful to run this with the -C option to assure
        reported SNPs are only reported from uniquely aligned regions.
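
        When the [R], [Q] and [BUFF] columns are present, a similar
        restriction can be applied after the fact.  The Python sketch
        below is an illustration only and not a replacement for the -C
        option: it assumes each SNP row has already been parsed into a
        dictionary keyed by column tag, and the function name and the
        min_buff parameter are hypothetical.

```python
from typing import Dict, Iterable, List


def keep_unambiguous(snps: Iterable[Dict[str, int]], min_buff: int = 0) -> List[Dict[str, int]]:
    """Keep SNPs outside repeats and at least min_buff from the nearest mismatch."""
    kept = []
    for snp in snps:
        # [R] or [Q] greater than zero means repetitive sequence
        if snp["R"] > 0 or snp["Q"] > 0:
            continue
        # [BUFF] is the distance to the nearest mismatch in the same alignment
        if snp["BUFF"] < min_buff:
            continue
        kept.append(snp)
    return kept
```

        Raising min_buff discards SNPs that sit close to indels or
        alignment ends, which are more likely to be alignment artifacts.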


   ** show-tiling **

        DESCRIPTION:
        This program attempts to construct a tiling path out of the query
        contigs as mapped to the reference sequences.  Given the delta
        alignment information of a few long reference sequences and many small
        query contigs, 'show-tiling' will determine the best location on a
        reference for each contig.  Note that each contig may only be tiled
        once, so repetitive regions may cause this program some difficulty.
        This program is useful for aiding in the scaffolding and closure of an
        unfinished set of contigs, if a suitable, high similarity, reference
        genome is available.  Or, if using promer, 'show-tiling' will help
        in the identification of syntenic regions and their contigs' mapping
        to the references.

        USAGE:
        show-tiling  [options]  <deltafile>

        [options]    type 'show-tiling -h' for a list of options.
        <deltafile>  the .delta output file from either nucmer or promer.

        OUTPUT:
        stdout  Standard output has 8 columns: start in reference, end in
                reference, gap between this contig and the next, length of this
                contig, alignment coverage of this contig, average percent
                identity of the alignments for this contig, orientation of this
                contig, contig ID. All matches to a reference are headed by the
                FASTA tag of that reference.  Output with the -a option is the
                same as 'show-coords -cl' when run on nucmer data.
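
        The default 8-column output can be read into records with a short
        Python sketch such as the one below.  It is an illustration, not
        part of MUMmer: the names are hypothetical, columns are assumed to
        be whitespace-delimited, and reference header lines are assumed to
        begin with '>' followed by the FASTA tag.

```python
from typing import Dict, Iterable, List, NamedTuple


class Tile(NamedTuple):
    ref_start: int
    ref_end: int
    gap_to_next: int
    contig_len: int
    coverage: float      # alignment coverage of this contig
    identity: float      # average percent identity
    orientation: str
    contig_id: str


def parse_tiling(lines: Iterable[str]) -> Dict[str, List[Tile]]:
    """Group tiled contigs under the reference that heads them."""
    tiling: Dict[str, List[Tile]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # assumed header form: '>' then the reference FASTA tag
            current = line[1:].split()[0]
            tiling[current] = []
            continue
        f = line.split()
        if current is None or len(f) != 8:
            raise ValueError("unexpected line: " + line)
        tiling[current].append(Tile(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                                    float(f[4]), float(f[5]), f[6], f[7]))
    return tiling
```

        Each reference then maps to its contigs in file order, which is
        convenient when checking gaps between neighbouring tiles.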

        NOTES:
        When run with the -x option, 'show-tiling' will produce an XML output
        format that can be accepted by TIGR's open source scaffolding software
        'Bambus' as contig linking information.


-- CONTACT INFORMATION --

Please address questions and bug reports to: <[email protected]>

Last Revised May 12, 2005


mummer3's Issues

dnadiff question

Hi marbl,

I am trying to identify genomic differences between two genomes using MUMmer. I ran the command 'dnadiff -d test.delta -p testdnadiff'.

I then get the following report in testdnadiff.report:

[Feature Estimates]
Breakpoints 124397 124404
Relocations 2091 1865
Translocations 3575 3606
Inversions 518 505

Insertions 32663 37843
InsertionSum 58149586 59674780
InsertionAvg 1780.29 1576.90

TandemIns 596 612
TandemInsSum 221566 212787
TandemInsAvg 371.76 347.69
...

I think it is a very useful summary, but how can I find where the differences are located, for example the positions of the translocations?

Regards,
JiaMing

64-bit mode by default?

Hi @brianwalenz @aphillippy

Not a high priority issue:

I wonder if it would make sense to have MUMmer3 compile in 64-bit mode by default (i.e. add this to the Makefile) or perhaps add a line in the README recommending that users do this.

For potential users, I think it might save some time/reduce potential confusion.

I ran into the following issue:

suffix tree construction failed: textlen=3209286560 larger than maximal textlen=536870908
ERROR: mummer and/or mgaps returned non-zero

which is solved by just recompiling with make CPPFLAGS="-O3 -DSIXTYFOURBITS"

If it's worth your time, I could make a quick PR.

Best, Evan

mummer and/or mgaps returned non-zero

Hi, I want to compare two genomes using MUMmer3 and ran into an error.

Here is my code:

$MUMMER3/nucmer --maxmatch -c 100 -b 500 -l 50 --prefix Wm82_NN Gmax_275_v2.0.fa NN1138-2.v1.0.fa

Here is my error:

1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "Wm82_NN.ntref" of length 978496462
# construct suffix tree for sequence of length 978496462
# (maximum reference length is 2305843009213693948)
# (maximum query length is 18446744073709551615)
# process 9784964 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /gss1/home/hjb20181119/panyongpeng/NN1138-2/sv_detection/MUMmer3.23/mummer Wm82_NN.ntref 423.83
# reading input file "/gss1/home/hjb20181119/panyongpeng/N24852/01.SV_detection/Wm82_NN/NN1138-2.v1.0.fa" of length 959747921
# matching query-file "/gss1/home/hjb20181119/panyongpeng/N24852/01.SV_detection/Wm82_NN/NN1138-2.v1.0.fa"
# against subject-file "Wm82_NN.ntref"
ERROR: mummer and/or mgaps returned non-zero

The Wm82_NN.mgaps file is 0 bytes, and there is no Wm82_NN.delta file.

Could you help me fix it? Thank you very much.

Run nucmer clustering without reference?

Hi,
When performing clustering of a single file with nucmer, should I use itself as reference?

Edit: I'm trying to follow this (from https://peerj.com/articles/3817/):
Contigs from all samples were clustered with nucmer (Delcher, Salzberg & Phillippy, 2003) at ≥95% ANI across ≥80% of their lengths, as in (Brum et al., 2015; Gregory et al., 2016), to generate a pool of non-redundant “population contigs”.

Note: ANI = average nucleotide identity

What are the nucmer parameters to set in order to use the values mentioned?
Thank you

postnuc: postnuc.cc:1427: long int revC(long int, long int): Assertion `Len - Coord + 1 > 0' failed.

dnadiff sequence.fasta reads.fasta -p 50long
Building alignments
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "50long.ntref" of length 4641653
# construct suffix tree for sequence of length 4641653
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 46416 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /usr/bin/mummer 50long.ntref 2.31
# reading input file "/50long/e.coli.correctedReads.fasta" of length 112457574
# matching query-file "/50long/e.coli.correctedReads.fasta"
# against subject-file "50long.ntref"
# COMPLETETIME /usr/bin/mummer 50long.ntref 126.30
# SPACE /usr/bin/mummer 50long.ntref 112.41
4: FINISHING DATA
postnuc: postnuc.cc:1427: long int revC(long int, long int): Assertion `Len - Coord + 1 > 0' failed.
Aborted (core dumped)
ERROR: postnuc returned non-zero
ERROR: Failed to run nucmer, aborting.

Pseudomolecule length does not match show-tiling

Hello,

The nucleotide length of the output from show-tiling does not match the pseudomolecule length.

With the command wc -c the length of the pseudomolecule is 133,508

However, the alignment to the reference extends to 165,369. I am working with chloroplast genomes

I have attached the show-tiling output.

I hope you may have some insight into this discrepancy.

Thank you for your time.

Enable "-k" option for nucmer

Hi,

would it be possible to enable the -k option for nucmer too?
One of my contigs displays several short alignments to different chromosomes of the reference (probably false positives or repeats). It also shows a couple of long alignments to one chromosome. However, many short alignments overlap (on the contig sequence) with the long ones. That's why I'd like to remove the short ones with the "-k" option, or at least filter them according to alignment length and not similarity (I might have 1000 50bp-alignments with 99% identity, so it wouldn't work).

Is there a way to do this when using nucmer and show-coords?

Thanks

ERROR: Query input does not match delta file ERROR: Failed to run show-snps, aborting.

Excuse me. I ran "dnadiff reference.fasta reads.fasta corrected" and an error occurred. The error is as follows:

Building alignments
Filtering alignments
Extracting alignment coordinates
Analyzing SNPs
ERROR: Query input does not match delta file
ERROR: Failed to run show-snps, aborting.

I searched online but could not find an appropriate answer. Could you help me solve this problem?

Thanks.

ERROR: Could not parse input from 'stdin'.

Hi, I am trying to run nucmer but I get the following error when I try to run it with my data:

1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
# reading input file "out.ntref" of length 32921472
# construct suffix tree for sequence of length 32921472
# (maximum reference length is 536870908)
# (maximum query length is 4294967295)
# process 329214 characters per dot
#....................................................................................................
# CONSTRUCTIONTIME /Users/sophiematthews/software/mummer/MUMmer3.23/mummer out.ntref 25.92
# reading input file "/Users/sophiematthews/Masters/mummer/star10_gamb/nucmer/../chr10" of length 27243750
# matching query-file "/Users/sophiematthews/Masters/mummer/star10_gamb/nucmer/../chr10"
# against subject-file "out.ntref"
# COMPLETETIME /Users/sophiematthews/software/mummer/MUMmer3.23/mummer out.ntref 71.19
# SPACE /Users/sophiematthews/software/mummer/MUMmer3.23/mummer out.ntref 58.33
4: FINISHING DATA
ERROR: Could not parse input from 'stdin'.
Please check the filename and format, or file a bug report
ERROR: postnuc returned non-zero

I have run this many times before on different datasets without a problem. I have checked all my input files, and there does not seem to be anything that would prevent this from running. Could you advise?
