alexandrumeterez / metagraph

This project forked from ratschlab/metagraph

Scalable annotated de Bruijn graphs for DNA indexing, alignment, and assembly

Home Page: http://metagraph.ethz.ch

License: GNU General Public License v3.0

Metagenome Graph Project

MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment.

The default index representations in MetaGraph are extremely scalable and support building graphs with trillions of nodes and millions of annotation labels. At the same time, the provided workflows and their careful implementation, combined with low-level optimizations of the core data structures, enable exceptional query and alignment performance.

Main features:

  • Large-scale indexing of sequences
  • Python API for querying in the server mode
  • Encoding k-mer counts (e.g., expression values) and k-mer coordinates (positions in source sequences)
  • Sequence alignment against very large annotated graphs
  • Scalable cleaning of very large de Bruijn graphs (to remove sequencing errors)
  • Support for custom alphabets (e.g., {A,C,G,T,N} or amino acids)
  • Algorithms for differential assembly

Design choices in MetaGraph:

  • Use of succinct data structures and efficient representation schemes for extremely high scalability
  • Algorithmic choices that work efficiently with succinct data structures (e.g., always prefer batched operations)
  • Modular support of different graph and annotation representations
  • Use of generic and extensible interfaces to support adding custom index representations / algorithms with little code overhead

Documentation

Online documentation is available at https://metagraph.ethz.ch/static/docs/index.html. Offline sources are here.

Install

Conda

Install the latest release on Linux or Mac OS X with Anaconda:

conda install -c bioconda -c conda-forge metagraph

Docker

If Docker is available on the system, you can get started immediately with:

docker pull ghcr.io/ratschlab/metagraph:master
docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
    build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa

Replace ${HOME} with a directory on the host system to map it under /mnt in the container.
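
As a rough illustration of the -v mapping, the sketch below translates a host path into the path the container sees. The container_path helper is hypothetical (it is not part of MetaGraph or Docker). It assumes the host directory is given without a trailing slash and is mounted under /mnt, as in the commands above.

```shell
# Hypothetical helper: map a host file to its path inside the container.
# $1 = host directory passed to `docker run -v <dir>:/mnt` (no trailing slash)
# $2 = host file located somewhere under that directory
container_path() {
    case "$2" in
        "$1"/*)
            # Strip the host directory prefix and re-root the rest under /mnt.
            rel=${2#"$1"/}
            echo "/mnt/$rel"
            ;;
        *)
            echo "error: $2 is not under $1" >&2
            return 1
            ;;
    esac
}

# Example with made-up paths:
container_path /data/project /data/project/reads/transcripts_1000.fa
# prints /mnt/reads/transcripts_1000.fa
```

Files outside the mounted directory are not visible in the container, which is why the helper reports an error for them instead of guessing a path.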

To run the binary compiled for the Protein alphabet, just add --entrypoint metagraph_Protein:

docker run -v ${HOME}:/mnt --entrypoint metagraph_Protein ghcr.io/ratschlab/metagraph:master \
    build -v -k 10 -o /mnt/graph /mnt/protein.fa

Running MetaGraph from Docker containers is straightforward. The following command (or a similar one) is handy for checking which directory is mounted in the container, or for debugging a command in other ways:

docker run -v ${HOME}:/mnt --entrypoint ls ghcr.io/ratschlab/metagraph:master /mnt

All different versions of the container image are listed here.

Install From Sources

To compile from source (e.g., for builds with custom alphabet or other configurations), see documentation online.

Typical workflow

  1. Build de Bruijn graph from Fasta files, FastQ files, or KMC k-mer counters:
    ./metagraph build
  2. Annotate graph using the column compressed annotation:
    ./metagraph annotate
  3. Transform the built annotation to a different annotation scheme:
    ./metagraph transform_anno
  4. Query annotated graph:
    ./metagraph query

Example

DATA="../tests/data/transcripts_1000.fa"

./metagraph build -k 12 -o transcripts_1000 $DATA

./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA

./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA

./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg
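
The commands above rely on the file names derived from the -o prefix: the graph ends up in transcripts_1000.dbg and the column annotation in transcripts_1000.column.annodbg. A small sketch of that convention follows. The helper names are hypothetical, and the annotation suffix is assumed to hold only for the column annotation type used in this example.

```shell
# Hypothetical helpers: derive the file names used in the example
# from the prefix that was passed to -o.
graph_file() { echo "$1.dbg"; }
column_anno_file() { echo "$1.column.annodbg"; }

PREFIX="transcripts_1000"
echo "graph:      $(graph_file "$PREFIX")"
echo "annotation: $(column_anno_file "$PREFIX")"
```

Keeping the prefix in one variable makes it harder to pass a graph and an annotation that were built from different inputs.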

Print usage

./metagraph

Build graph

  • Simple build

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

  • Build with disk swap (use to limit the RAM usage)

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap <GRAPH_DIR> \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

Build from k-mers filtered with KMC

K=20
./KMC/kmc -ci5 -t4 -k$K -m5 -fm <FILE>.fasta.gz <FILE>.cutoff_5 ./KMC
./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph <FILE>.cutoff_5.kmc_pre

Annotate graph

./metagraph annotate -v --anno-type row --fasta-anno \
                           -i primates.dbg \
                           -o primates \
                           ~/fasta_zurich/refs_chimpanzee_primates.fa

Convert annotation to Multi-BRWT

  1. Cluster columns

./metagraph transform_anno -v --linkage --greedy \
                           -o linkage.txt \
                           --subsample R \
                           -p NCORES \
                           primates.column.annodbg

Requires N*R/8 + 6*N^2 bytes of RAM, where N is the number of columns and R is the number of rows subsampled.
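
To get a feel for the formula, here is a minimal sketch that evaluates N*R/8 + 6*N^2 with shell integer arithmetic. The function name and the input values are made up for illustration, and the result is only as accurate as the formula itself.

```shell
# Hypothetical helper: RAM estimate for column clustering, in bytes.
# $1 = N (number of columns), $2 = R (number of subsampled rows)
linkage_ram_bytes() {
    echo $(( $1 * $2 / 8 + 6 * $1 * $1 ))
}

# Made-up example: 1000 columns, 1000000 subsampled rows.
linkage_ram_bytes 1000 1000000
# prints 131000000 (about 131 MB)
```

The quadratic term grows quickly with the number of columns, so for annotations with many labels it can dominate the estimate regardless of how few rows are subsampled.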

  2. Construct Multi-BRWT

./metagraph transform_anno -v --anno-type brwt \
                           --linkage-file linkage.txt \
                           -o primates \
                           --parallel-nodes V \
                           -p NCORES \
                           primates.column.annodbg

Requires M*V/8 + Size(BRWT) bytes of RAM, where M is the number of rows in the annotation and V is the number of nodes merged concurrently.
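
The same kind of back-of-the-envelope check works for this step. The sketch below evaluates M*V/8 + Size(BRWT). The function name and numbers are hypothetical, and since the size of the final BRWT is not known in advance, it has to be supplied as a guess (for example, from an earlier run on similar data).

```shell
# Hypothetical helper: RAM estimate for Multi-BRWT construction, in bytes.
# $1 = M (number of rows in the annotation)
# $2 = V (number of nodes merged concurrently, i.e. --parallel-nodes)
# $3 = assumed size of the resulting BRWT in bytes
brwt_ram_bytes() {
    echo $(( $1 * $2 / 8 + $3 ))
}

# Made-up example: 1e9 rows, 16 parallel nodes, BRWT guessed at 5 GB.
brwt_ram_bytes 1000000000 16 5000000000
# prints 7000000000 (about 7 GB)
```

Lowering --parallel-nodes reduces the first term proportionally, which makes it the natural knob to turn when memory is tight.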

Query graph

./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --discovery-fraction 0.8 --labels-delimiter ", " \
                        query_seq.fa

Align to graph

./metagraph align -v -i <GRAPH_DIR>/graph.dbg query_seq.fa

Assemble sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        -o assembled.fa \
                        --unitigs

Assemble differential sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        --unitigs \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --diff-assembly-rules diff_assembly_rules.json \
                        -o diff_assembled.fa

See metagraph/tests/data/example.diff.json and metagraph/tests/data/example_simple.diff.json for sample files.

Get stats

Stats for graph

./metagraph stats graph.dbg

Stats for annotation

./metagraph stats -a annotation.column.annodbg

Stats for both

./metagraph stats -a annotation.column.annodbg graph.dbg

Developer Notes

Makefile

The Makefile in the top level source directory can be used to build and test metagraph more conveniently. The following arguments are supported:

  • env: environment in which to compile/run ("": on the host, docker: in a docker container)
  • alphabet: compile metagraph for a certain alphabet (e.g. DNA or Protein, default DNA)
  • additional_cmake_args: additional arguments to pass to cmake.

Examples:

# compiles metagraph in a docker container for the `DNA` alphabet
make build-metagraph env=docker alphabet=DNA

License

Metagraph is distributed under the GPLv3 License (see LICENSE). Please find further information in the AUTHORS and COPYRIGHTS files.
