hgvs-rs's Introduction

hgvs-rs

This is a port of biocommons/hgvs to the Rust programming language. The data::cdot::* code is based on a port of SACGF/cdot to Rust.

Running Tests

The tests need a running UTA instance. Either set up a local copy (a minimal dataset is in tests/data/data/*.pgd.gz) or use the public one. In both cases, set the environment variables TEST_UTA_DATABASE_URL and TEST_UTA_DATABASE_SCHEMA appropriately. To use the public database:

export TEST_UTA_DATABASE_URL=postgres://anonymous:[email protected]:/uta
export TEST_UTA_DATABASE_SCHEMA=uta_20210129

Note that seqrepo-rs is used for access to the genome contig sequences. It is inconvenient to provide subsets of sequences in SeqRepo format. Instead, we use a build-cache/read-cache approach that is also used by biocommons/hgvs.

To build the cache, you first need a download of seqrepo as described in the biocommons/biocommons.seqrepo Quickstart. Then configure the hgvs-rs tests as follows:

export TEST_SEQREPO_CACHE_MODE=write
export TEST_SEQREPO_PATH=path/to/seqrepo/instance
export TEST_SEQREPO_CACHE_PATH=tests/data/seqrepo_cache.fasta

When you run the tests with cargo test, the cache file is (re-)written. When writing the cache, you have to use cargo test --release -- --test-threads 1 --include-ignored so that only one test writes to the cache at any time. If you don't want to regenerate the cache, use the following settings instead; with them, the cache is only read.

export TEST_SEQREPO_CACHE_MODE=read
export TEST_SEQREPO_CACHE_PATH=tests/data/seqrepo_cache.fasta

After either setup, you can run the tests.

cargo test
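The two cache setups above differ only in TEST_SEQREPO_CACHE_MODE and in whether TEST_SEQREPO_PATH is needed. As a minimal sketch of how test setup code could interpret these variables (the type and function names are hypothetical and are not taken from the hgvs-rs test code):

```rust
use std::path::PathBuf;

/// Hypothetical representation of the two cache modes described above.
#[derive(Debug, PartialEq, Eq)]
pub enum CacheMode {
    Read,
    Write,
}

/// Hypothetical bundle of the three settings.
#[derive(Debug, PartialEq, Eq)]
pub struct CacheConfig {
    pub mode: CacheMode,
    pub cache_path: PathBuf,
    pub seqrepo_path: Option<PathBuf>,
}

/// Builds a configuration from raw values so the logic is testable
/// without touching the process environment.
pub fn cache_config_from(
    mode: Option<&str>,
    cache_path: Option<&str>,
    seqrepo_path: Option<&str>,
) -> Result<CacheConfig, String> {
    let mode = match mode {
        Some("read") => CacheMode::Read,
        Some("write") => CacheMode::Write,
        Some(other) => return Err(format!("unknown cache mode: {}", other)),
        None => return Err("TEST_SEQREPO_CACHE_MODE is not set".to_string()),
    };
    let cache_path = cache_path
        .map(PathBuf::from)
        .ok_or_else(|| "TEST_SEQREPO_CACHE_PATH is not set".to_string())?;
    let seqrepo_path = seqrepo_path.map(PathBuf::from);
    // Assumption: only write mode needs the full seqrepo instance.
    if mode == CacheMode::Write && seqrepo_path.is_none() {
        return Err("TEST_SEQREPO_PATH is required in write mode".to_string());
    }
    Ok(CacheConfig { mode, cache_path, seqrepo_path })
}

/// Reads the three variables from the environment.
pub fn cache_config_from_env() -> Result<CacheConfig, String> {
    let var = |name: &str| std::env::var(name).ok();
    cache_config_from(
        var("TEST_SEQREPO_CACHE_MODE").as_deref(),
        var("TEST_SEQREPO_CACHE_PATH").as_deref(),
        var("TEST_SEQREPO_PATH").as_deref(),
    )
}
```

Failing early on a missing TEST_SEQREPO_PATH in write mode is a design choice of this sketch; it surfaces a misconfiguration before any test tries to write the cache.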

Creating Reduced UTA Databases

The script tests/data/data/bootstrap.sh lets you build a reduced version of the UTA database from a list of genes. The process is as follows:

  1. You edit bootstrap.sh to include the HGNC gene symbols of the transcripts that you want to use.
  2. You run the bootstrap script. This will download the given UTA dump and reduce it to the information related to these transcripts.
$ bootstrap.sh http://dl.biocommons.org/uta uta_20210129

The *.pgd.gz file is added to the Git repository via git-lfs, and CI uses this minimal database.

Some Timing Results

(I don't want to call it "benchmarks" yet.)

Deserialization of large cdot JSON files.

Host:

  • CPU: Intel(R) Xeon(R) E-2174G CPU @ 3.80GHz
  • Disk: NVME (WDC CL SN720 SDAQNTW-1T00-2000)

Single Running Time Results (no repetitions/warm start etc.)

  • ENSEMBL: 37s
  • RefSeq: 67s

This includes loading and deserialization of the records only.

hgvs-rs's People

Contributors

holtgrewe, github-actions[bot], dependabot[bot], varfish-bot, tedil

hgvs-rs's Issues

Apache License Compliance

We need to fulfill the license by putting back the original Apache license footers and stating clearly that, and how, the Rust code derives from the hgvs Python code.

byte index out of bounds for GRCh37:10:5139789:A:AGTG

Describe the bug
When annotating GRCh37:10:5139789:A:AGTG with mehari, hgvs::mapper crashes.

To Reproduce
Steps to reproduce the behavior:

  1. Annotate said variant.
  2. See the error trace at bottom

Expected behavior
Annotation should work.

Screenshots
N/A

Additional context

thread 'main' panicked at 'byte index 1 is out of bounds of ``', /opt/conda/conda-bld/mehari_1695048327300/_build_env/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hgvs-0.11.0/src/mapper/altseq.rs:701:56
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/str/mod.rs:87:9
   4: hgvs::mapper::altseq::AltSeqToHgvsp::build_hgvsp
   5: hgvs::mapper::variant::Mapper::c_to_p::{{closure}}
   6: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
   7: core::iter::adapters::try_process
   8: hgvs::mapper::variant::Mapper::c_to_p
   9: hgvs::mapper::assembly::Mapper::c_to_p
  10: mehari::annotate::seqvars::csq::ConsequencePredictor::build_ann_field
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
  12: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
  13: core::iter::adapters::try_process
  14: mehari::annotate::seqvars::csq::ConsequencePredictor::predict
  15: mehari::annotate::seqvars::run_with_writer
  16: mehari::annotate::seqvars::run
  17: tracing_core::dispatcher::with_default
  18: mehari::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This happens when calling c_to_p on the following variant:

NM_001253909.2:c.416_417insGTG
CdsVariant {
    accession: Accession {
        value: "NM_001253909.2",
    },
    gene_symbol: None,
    loc_edit: CdsLocEdit {
        loc: Certain(
            CdsInterval {
                start: CdsPos {
                    base: 416,
                    offset: None,
                    cds_from: Start,
                },
                end: CdsPos {
                    base: 417,
                    offset: None,
                    cds_from: Start,
                },
            },
        ),
        edit: Certain(
            Ins {
                alternative: "GTG",
            },
        ),
    },
}
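The message `byte index 1 is out of bounds of ``` is what Rust emits when a `&str` is sliced past its end, here an empty string. The code at altseq.rs:701 is not shown in this issue, so the following is only a sketch of the general non-panicking alternative using `str::get`; the function name is hypothetical.

```rust
/// Returns the suffix of `seq` starting at byte `start`, or `None` when
/// `start` is past the end or not on a character boundary.
/// Indexing with `&seq[start..]` would panic in those cases instead.
pub fn suffix_from(seq: &str, start: usize) -> Option<&str> {
    seq.get(start..)
}
```

With this shape, the caller decides how to handle an empty or too-short sequence, for example by returning an error instead of aborting the whole annotation run.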

Fix intronic dup prediction issue from DNA11-dbSNP.tsv

The following need fixing:

#rs35803309	NC_000007.13:g.21723128_21723129insT	NM_001277115.2:c.5461-273dupT
#rs35803309	NC_000007.13:g.21723128_21723129insT	NM_003777.3:c.5482-273dupT
#rs146960178	NC_000007.13:g.21723129_21723130insC	NM_001277115.2:c.5461-269dupC
#rs146960178	NC_000007.13:g.21723129_21723130insC	NM_003777.3:c.5482-269dupC
thread 'mapper::variant::test::dnah11_db_snp_full' panicked at 'assertion failed: `(left == right)`: NM_001277115.2:c.5461-274_5461-273insT != NM_001277115.2:c.5461-273dupT (g>t; rs35803309; HGVSg=NC_000007.13:g.21723128_21723129insT)

Diff < left / right > :
<NM_001277115.2:c.5461-273dupT
>NM_001277115.2:c.5461-274_5461-273insT

', src/mapper/variant.rs:1723:13

I commented them out in the TSV file for now.
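For context, HGVS describes an insertion as a duplication when the inserted bases copy the sequence directly 5' of the insertion point. The sketch below shows that check on a plain string; it is an illustration under that assumption only, the function name is hypothetical, and the issue does not state where the intronic case goes wrong.

```rust
/// Returns true when inserting `inserted` after the first `left_len` bytes
/// of `reference` repeats the sequence directly preceding the insertion
/// point, i.e. when the insertion could be written as a dup.
pub fn is_dup(reference: &str, left_len: usize, inserted: &str) -> bool {
    !inserted.is_empty()
        && reference
            .get(..left_len)
            .map_or(false, |left| left.ends_with(inserted))
}
```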

Add support for selenoproteins

Is your feature request related to a problem? Please describe.
We currently do not handle selenoproteins well. We should.

Describe the solution you'd like
Add support for this when loading cdot JSON.

Describe alternatives you've considered
N/A

Additional context
N/A

Use features to opt in to sqlite3 and libpostgres dependencies

Is your feature request related to a problem? Please describe.
At the moment, we always have a runtime dependency on libsqlite3 and libpostgres because of the dependency on sqlite3 for seqrepo and the UTA provider.

Describe the solution you'd like
We should move both to separate features which makes these dependencies optional.

Describe alternatives you've considered
N/A

Additional context
N/A

Crash on GRCh37:1:89449508:A:ATTTTTTTTTTTT

Current main crashes:

2023-04-18T14:00:49.545332Z TRACE var = Var { chrom: "1", pos: 89449508, reference: "A", alternative: "ATTTTTTTTTTTT" }
thread 'main' panicked at 'byte index 18446744073709551615 is out of bounds of `MVEADRPGKLFIGGLNTETNEKALETVFGKYGRIVEVLLIKDRETNKSRGFAFVTFESPADAKDAARDMNGKSLDGKAIKVEQATKPSFERGRHGPPPPPRSRGPPRGFGAGRGGSGGTRGPPSRGGHMDDGGYSMNFNMSSSRGPLPVKRGPPPRSGGPSPKRSAPSGLVRSSSGMGGRAPLSRGRDSYGGPPRREPLPSRRDVYLSPRDDGYSTKDSYSSRDYPSSRDTRDYAPPPRDYTYRDYGHSSSRDDYP[...]`', /data/cephfs-1/home/users/holtgrem_c/.cargo/registry/src/github.com-1ecc6299db9ec823/hgvs-0.6.1/src/mapper/altseq.rs:987:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
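The byte index 18446744073709551615 equals usize::MAX on a 64-bit target, which is the value a wrapping `0 - 1` produces. This suggests, but the issue does not confirm, an index underflow in a release build. A minimal sketch of guarding such arithmetic with `checked_sub`, using a hypothetical helper name:

```rust
/// Returns the part of `seq` that ends one byte before `pos`.
/// Yields `None` instead of panicking when `pos` is 0 (where `pos - 1`
/// would underflow) or when the resulting range is not a valid slice.
pub fn prefix_before(seq: &str, pos: usize) -> Option<&str> {
    let end = pos.checked_sub(1)?;
    seq.get(..end)
}
```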

Port over the various extensive test files

assembly mapper

  • test_hgvs_variantmapper_near_discrepancies.py

variant mapper

  • test_variantmapper_cp_real.py
  • test_variantmapper_cp_sanity.py
  • test_variantmapper_gcp.py
  • test_clinvar.py
  • test_grammar_full.py

Tune translate_cds implementation

A lot of time in mehari is consumed in the translate_cds implementation.

We should properly tune and benchmark this function in hgvs-rs.
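As a starting point for such measurements, here is a rough timing sketch. `translate_toy` is a simplified stand-in written for this example, not the crate's translate_cds, and `time_it` is a hypothetical helper; a real benchmark would more likely use a dedicated harness.

```rust
use std::time::{Duration, Instant};

/// Standard genetic code, indexed as 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3.
const CODE: &[u8; 64] = b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

fn base_index(base: u8) -> Option<usize> {
    match base {
        b'T' | b'U' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

/// Toy translation: one amino acid per complete codon, `X` for codons with
/// unrecognized bases; a trailing partial codon is ignored.
pub fn translate_toy(cds: &[u8]) -> String {
    let mut out = String::with_capacity(cds.len() / 3);
    for codon in cds.chunks_exact(3) {
        let aa = match (
            base_index(codon[0]),
            base_index(codon[1]),
            base_index(codon[2]),
        ) {
            (Some(a), Some(b), Some(c)) => CODE[16 * a + 4 * b + c] as char,
            _ => 'X',
        };
        out.push(aa);
    }
    out
}

/// Runs `f` `iterations` times and returns the mean wall-clock time per call.
pub fn time_it<F: FnMut()>(iterations: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed() / iterations.max(1)
}
```

A table lookup per codon avoids string comparisons and hashing; whether that matters for mehari's workload would have to be measured on real transcripts.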

Allow for configuring how Display works

Currently, we can only suppress the reference bases with the NoRef newtype.

Instead, we should allow configuring the display through a wrapper type that carries the configuration and offers options similar to those of the Python module, including:

  • suppress reference nucleotides
  • switch between one and three letter amino acids
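One possible shape for such a wrapper is sketched below. All type and field names are hypothetical, `ToyEdit` only stands in for the crate's real variant types, and the amino acid table is deliberately partial.

```rust
use std::fmt;

/// Hypothetical display options mirroring the two bullets above.
#[derive(Debug, Clone, Copy)]
pub struct DisplayConfig {
    pub show_reference: bool,
    pub three_letter_aa: bool,
}

/// Toy stand-in for real variant types.
pub enum ToyEdit {
    /// Nucleotide deletion on a coding transcript.
    NaDel { start: u32, end: u32, reference: String },
    /// Protein substitution stored with one-letter amino acid codes.
    ProtSubst { reference: char, position: u32, alternative: char },
}

/// Wrapper that pairs a value with the configuration used to display it.
pub struct Configured<'a> {
    pub value: &'a ToyEdit,
    pub config: DisplayConfig,
}

/// Partial one-letter to three-letter table, enough for the example.
fn amino_acid(code: char, three_letter: bool) -> String {
    if !three_letter {
        return code.to_string();
    }
    let name = match code {
        'C' => "Cys",
        'E' => "Glu",
        'L' => "Leu",
        '*' => "Ter",
        _ => "Xaa",
    };
    name.to_string()
}

impl fmt::Display for Configured<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.value {
            ToyEdit::NaDel { start, end, reference } => {
                write!(f, "c.{}_{}del", start, end)?;
                if self.config.show_reference {
                    write!(f, "{}", reference)?;
                }
                Ok(())
            }
            ToyEdit::ProtSubst { reference, position, alternative } => {
                let three = self.config.three_letter_aa;
                write!(
                    f,
                    "p.{}{}{}",
                    amino_acid(*reference, three),
                    position,
                    amino_acid(*alternative, three)
                )
            }
        }
    }
}
```

Keeping the configuration in the wrapper rather than in the variant types means the parsed data stays unchanged and the same value can be rendered in several styles.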

Add proper error handling with enum

We should have a good overview of the anyhow-based errors that we create.

We should create proper enum-based error handling for the library. This will allow downstream software to react to errors more gracefully.
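A dependency-free sketch of what such an error enum could look like follows. The variant names and the `residue_at` helper are hypothetical and only illustrate the pattern.

```rust
use std::fmt;

/// Hypothetical library error type; variants are illustrative only.
#[derive(Debug, PartialEq, Eq)]
pub enum HgvsError {
    InvalidSyntax { input: String },
    UnknownAccession { accession: String },
    PositionOutOfBounds { pos: usize, len: usize },
}

impl fmt::Display for HgvsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HgvsError::InvalidSyntax { input } => {
                write!(f, "invalid HGVS syntax: {}", input)
            }
            HgvsError::UnknownAccession { accession } => {
                write!(f, "unknown accession: {}", accession)
            }
            HgvsError::PositionOutOfBounds { pos, len } => write!(
                f,
                "position {} is out of bounds for sequence of length {}",
                pos, len
            ),
        }
    }
}

impl std::error::Error for HgvsError {}

/// Example of a fallible helper returning the typed error instead of panicking.
pub fn residue_at(seq: &str, pos: usize) -> Result<char, HgvsError> {
    seq.as_bytes()
        .get(pos)
        .map(|b| *b as char)
        .ok_or(HgvsError::PositionOutOfBounds { pos, len: seq.len() })
}
```

Downstream code can then match on the variant, for example to skip a single variant on `PositionOutOfBounds` while still aborting on other errors.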

Split provider interface

The HGVS conversion code and the variant normalization code do not need the full UTA interface. We should split the interface into a basic part and an extended/full part. This will make it easier to implement new providers.
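Such a split could be expressed with a supertrait, as in the sketch below. The trait and method names are hypothetical and much smaller than the real provider interface.

```rust
use std::collections::HashMap;

/// Hypothetical minimal interface: what conversion and normalization need.
pub trait BasicProvider {
    fn get_seq(&self, accession: &str) -> Option<String>;
}

/// Hypothetical extended interface; every full provider is also a basic one.
pub trait FullProvider: BasicProvider {
    fn get_transcripts_for_gene(&self, gene_symbol: &str) -> Vec<String>;
}

/// A provider that only implements the basic part.
pub struct InMemoryProvider {
    pub seqs: HashMap<String, String>,
}

impl BasicProvider for InMemoryProvider {
    fn get_seq(&self, accession: &str) -> Option<String> {
        self.seqs.get(accession).cloned()
    }
}

/// Code that only needs sequences asks for `BasicProvider`, so it works with
/// lightweight providers as well as full ones.
pub fn seq_len(provider: &dyn BasicProvider, accession: &str) -> Option<usize> {
    provider.get_seq(accession).map(|s| s.len())
}
```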

Switch to using biocommons-bioutils-rs crate for assembly info

Is your feature request related to a problem? Please describe.
The assembly info code from biocommons/bioutils is currently in hgvs-rs. This is problematic as hgvs-rs is a bit heavyweight and annonars depends on hgvs-rs although it only needs the assembly enums.

Describe the solution you'd like
Switch over to using the biocommons-bioutils-rs crate.

Describe alternatives you've considered
N/A

Additional context
N/A

Support representing repeated sequences from varnomen

Is your feature request related to a problem? Please describe.
We can parse and represent the most common small variants pretty well. However, repeated sequences are currently out of scope.

Describe the solution you'd like
Add support for parsing and representing the varnomen repeated sequences.

Projecting such variants could be made possible by translating them into sequence representation. This could become another issue, though.

Describe alternatives you've considered
N/A

Additional context
The code for the parser/representation is in the parser module documented here on docs.rs.

Adding support for parsing and representation would be simple enough:

  • extend the representation data structure
  • extend the nom-based parser for parsing
  • add the necessary display code
  • add appropriate tests for all
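The four steps above can be sketched for the edit part of a repeated sequence, written as a unit followed by a count in square brackets (for example CAG[23]). The type and function names are hypothetical, and the parser is hand-written to keep the example dependency-free, whereas the crate's parser is nom-based.

```rust
use std::fmt;

/// Hypothetical representation of a repeated-sequence edit.
#[derive(Debug, PartialEq, Eq)]
pub struct RepeatedSeq {
    pub unit: String,
    pub count: u32,
}

impl fmt::Display for RepeatedSeq {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}[{}]", self.unit, self.count)
    }
}

/// Parses `UNIT[COUNT]`, where UNIT consists of A/C/G/T and COUNT of digits.
/// Returns `None` for anything else.
pub fn parse_repeated_seq(input: &str) -> Option<RepeatedSeq> {
    let (unit, rest) = input.split_once('[')?;
    let digits = rest.strip_suffix(']')?;
    let unit_ok = !unit.is_empty()
        && unit.bytes().all(|b| matches!(b, b'A' | b'C' | b'G' | b'T'));
    let digits_ok = !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit());
    if !(unit_ok && digits_ok) {
        return None;
    }
    let count = digits.parse().ok()?;
    Some(RepeatedSeq { unit: unit.to_string(), count })
}
```

A round trip through `parse_repeated_seq` and `to_string` gives a cheap first test for the parser and the display code together.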

Cleanup for first release

  • remove superfluous debug/dev log messages
  • make depend on released seqrepo-rs
  • make necessary changes to Cargo.toml

Potentially incorrect prediction results

The prediction for GRCh37:17:41197803:AC:A on transcript NM_007294.4 is NP_009230.2:p.Ter700GluextTer35, but VariantValidator.org predicts NP_009225.1:p.Cys1828LeufsTer6.
