HERRO

HERRO (Haplotype-aware ERRor cOrrection) is a highly accurate, haplotype-aware, deep-learning tool for error correction of Nanopore R10.4.1, Kit 14, Ultra-long (UL) reads.

Requirements

  • Linux OS (tested on RHEL 8.6 and Ubuntu 22.04)
  • Zstandard
  • Python (and conda) for data preprocessing

Compile from source

  • libtorch
  • rustup

Installation

  1. Clone the repository
git clone https://github.com/dominikstanojevic/herro.git
cd herro
  2. Create conda environment
conda env create --file scripts/herro-env.yml
  3. Build herro binary (singularity or compile from source)

    a. Download singularity image:

    1. Set up an AWS profile using these credentials. A guide for aws-cli profile setup can be found here.

    2. Download the image

    aws s3 cp s3://herro.store.genome.sg/sif/herro.sif herro.sif --profile <herro_profile>

    b. Build singularity image (requires sudo)

    sudo singularity build herro.sif herro-singularity.def

    Run the tool (see Usage) with: singularity run --nv --bind <host_path>:<dest_path> herro.sif inference <args>

    c. Compile

    When compiling from source, ensure that libtorch and rustup are downloaded and installed.

    export LIBTORCH=<libtorch_path>
    export LD_LIBRARY_PATH=$LIBTORCH/lib:$LD_LIBRARY_PATH
    RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release

    Path to the resulting binary: target/release/herro
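The environment needed for option c can be checked before running cargo. The following is a minimal sketch, not part of herro: the function name `check_build_env` and its messages are hypothetical, and it only verifies what the text above states (that `LIBTORCH` points to a libtorch directory, that `LD_LIBRARY_PATH` includes `$LIBTORCH/lib`, and that `cargo` from rustup is on the `PATH`).

```python
import os
import shutil
from pathlib import Path


def check_build_env(env=None, which=shutil.which):
    """Return a list of problems likely to break the source build (hypothetical helper)."""
    env = os.environ if env is None else env
    problems = []

    libtorch = env.get("LIBTORCH")
    if not libtorch:
        problems.append("LIBTORCH is not set")
    elif not (Path(libtorch) / "lib").is_dir():
        problems.append(f"{libtorch}/lib does not exist")
    else:
        # The build instructions prepend $LIBTORCH/lib to LD_LIBRARY_PATH.
        lib_dir = Path(libtorch) / "lib"
        entries = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
        if lib_dir not in [Path(entry) for entry in entries if entry]:
            problems.append("LD_LIBRARY_PATH does not include $LIBTORCH/lib")

    # cargo is installed by rustup.
    if which("cargo") is None:
        problems.append("cargo not found on PATH (install rustup)")

    return problems


if __name__ == "__main__":
    for problem in check_build_env():
        print("WARNING:", problem)
```

An empty list means the three preconditions hold; it does not guarantee that the installed libtorch version is compatible with the build.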

Model Download

  1. Set up an AWS profile using these credentials. A guide for aws-cli profile setup can be found here.

  2. Download model:

aws s3 cp s3://herro.store.genome.sg/models/model_v0.1.pt . --profile <herro_profile>

Usage

  1. Preprocess reads
scripts/preprocess.sh <input_fastq> <output_prefix> <number_of_threads> <parts_to_split_job_into>

Note: Porechop loads all reads into memory, so the input may need to be split into multiple parts. Set <parts_to_split_job_into> to 1 if splitting is not needed. Dorado v0.5 added adapter trimming, so adapter trimming and splitting with Porechop and duplex tools will probably be removed in the future.

  2. minimap2 alignment and batching

Although minimap2 can be run from the herro binary (omit --read-alns or use --write-alns to store batched alignments for future use), the preferred approach is to run minimap2 first and then use its output to generate alignment batches. These batches are used as input for the herro binary.

scripts/create_batched_alignments.sh <output_from_reads_preprocessing> <read_ids> <num_of_threads> <directory_for_batches_of_alignments> 

Note: Read ids can be obtained with seqkit: seqkit seq -ni <reads> > <read_ids>
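If seqkit is not available, the read ID list can also be produced with a few lines of Python. This is a rough stand-in for `seqkit seq -ni`, not a replacement for it: the function name `read_ids` is hypothetical, and the sketch assumes plain or gzip-compressed FASTQ with strict four-line records. Other formats or compressions are not handled.

```python
import gzip
import sys


def read_ids(fastq_path):
    """Yield the ID of each FASTQ record (header text up to the first whitespace)."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumes 4-line records: header, sequence, '+', qualities.
            if line_number % 4 != 0:
                continue
            if not line.startswith("@"):
                raise ValueError(f"line {line_number + 1}: expected '@' header")
            fields = line[1:].split()
            yield fields[0] if fields else ""


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python read_ids.py <reads> > <read_ids>
    for read_id in read_ids(sys.argv[1]):
        print(read_id)
```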

  3. Error-correction
herro inference --read-alns <directory_alignment_batches> -t <feat_gen_threads_per_device> -d <gpus> -m <model_path> -b <batch_size> <preprocessed_reads> <fasta_output> 

Note: GPUs are specified using their IDs. For example, if the value of the parameter -d is set to 0,1,3, herro will use the first, second, and fourth GPU cards. Parameter -t is given per device - e.g., if -t is set to 8 and 3 GPUs are used, herro will create 24 feature generation threads in total. Recommended batch size is 64 for GPUs with 40 GB (possibly also for 32 GB) of VRAM and 128 for GPUs with 80 GB of VRAM.
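The arithmetic in the note can be sketched in Python. All function names here are hypothetical helpers, not part of herro. The batch-size function treats the stated VRAM sizes as lower bounds, which is an assumption: the note only gives guidance for 32, 40, and 80 GB cards. The command builder uses only the flags shown in the inference command above.

```python
def parse_device_ids(devices):
    """Parse a -d value such as "0,1,3" into a list of GPU IDs."""
    ids = [int(part) for part in devices.split(",") if part.strip()]
    if not ids:
        raise ValueError("no GPU IDs given")
    return ids


def total_feature_threads(devices, threads_per_device):
    """-t is per device, so the total is threads_per_device * number of GPUs."""
    return threads_per_device * len(parse_device_ids(devices))


def recommended_batch_size(vram_gb):
    """Batch size from the note; None when the note gives no guidance."""
    if vram_gb >= 80:
        return 128
    if vram_gb >= 32:  # 32 GB is described only as "possibly" sufficient for 64
        return 64
    return None


def inference_command(alns_dir, threads_per_device, devices, model_path,
                      batch_size, reads, output_fasta):
    """Build the argument list for `herro inference` as documented above."""
    return [
        "herro", "inference",
        "--read-alns", alns_dir,
        "-t", str(threads_per_device),
        "-d", devices,
        "-m", model_path,
        "-b", str(batch_size),
        reads,
        output_fasta,
    ]
```

For example, `total_feature_threads("0,1,3", 8)` gives 24, matching the case described in the note. The list returned by `inference_command` can be passed to `subprocess.run` without shell quoting concerns.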

Results on HG002 data

HG002 data was assembled using hifiasm and compared to HiFi reads. Results for uncorrected reads are not given since they produce a poor assembly. Currently, the data is not publicly available.

Assembly results and a comparison with HiFi reads and uncorrected UL reads are given in the table below. Assemblies were performed using:

  1. HiFi reads/Duplex ONT reads/Corrected UL reads
  2. Uncorrected Ultra-long ONT reads as UL reads
  3. Parental Illumina data

Hifiasm command used for all experiments:

hifiasm -o <output_prefix> -t <num_threads> --ul <UL_reads> --ul-cut 10000 -1 <parent1_yak> -2 <parent2_yak> <HiFi/Duplex/Corrected UL reads>
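Because the same command is run once per read type, it can help to generate it programmatically. The sketch below only assembles the argument list from the command shown above. The function name `hifiasm_command` and the file names in the example are hypothetical, and no hifiasm options beyond those in the command are used.

```python
import shlex


def hifiasm_command(output_prefix, threads, ul_reads, parent1_yak, parent2_yak, reads):
    """Build the hifiasm argument list used for all experiments."""
    return [
        "hifiasm",
        "-o", output_prefix,
        "-t", str(threads),
        "--ul", ul_reads,
        "--ul-cut", "10000",
        "-1", parent1_yak,
        "-2", parent2_yak,
        reads,  # HiFi, Duplex, or corrected UL reads
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = hifiasm_command("hg002", 32, "ul.fastq", "pat.yak", "mat.yak", "corrected.fasta")
    print(shlex.join(cmd))
```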

Results

[Table image: HG002 Assembly Results]

Results on Error-corrected HG002 experimental, high-accuracy, UL data

Experimental high-accuracy, UL HG002 error-corrected reads can be found in the S3 bucket. The raw data used for error correction can be found here. Assemblies were done in the same way as in the previous section.

Download

  1. Set up an AWS profile using these credentials. A guide for aws-cli profile setup can be found here.

  2. Download error-corrected reads:

aws s3 cp s3://herro.store.genome.sg/data/corrected/HG002.experimentalUL.corrected.fasta.gz . --profile <herro_profile>

Results

Assembly results and a comparison with HiFi reads and uncorrected UL reads are given in the table below. Assemblies were performed using:

  1. HiFi/HQ Uncorrected UL/Corrected UL reads
  2. HQ Uncorrected Ultra-long ONT reads as UL reads
  3. Parental Illumina data

[Table image: HG002 HQ Assembly Results]

Acknowledgements

This work has been supported by the AI Singapore 100 Experiments (100E) Programme under the project AI-driven De Novo Diploid Assembler (AISG2-100E-2021-076), in collaboration with the Agency for Science, Technology and Research (A*STAR) and Oxford Nanopore Technologies plc. (ONT).

Contributors

dominikstanojevic, andrewzhang217, jelber2
