The hgvs-rs from varfish-org

Apache License Compliance

We need to comply with the license by restoring the original Apache license footers and stating clearly that, and how, the Rust code is derived from the hgvs Python code.

Port over assembly infos from bioutils

Allow predictions even for transcripts that have more than one stop codon

Is your feature request related to a problem? Please describe.
Similar to varfish-org/mehari#224, we should allow for a sensible prediction if a transcript has more than one stop codon.

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context

varfish-org/mehari#224

Further tune translate_cds code

A lot of time is apparently spent in accessing the lazy_static data structures through the lock.
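One way to avoid taking a lock on every lookup is to initialize the table once and hand out a plain shared reference afterwards. The following is a minimal sketch using `std::sync::OnceLock` from the standard library. The names `codon_table` and `lookup_codon` are hypothetical and not taken from hgvs-rs, and the table holds only a small excerpt of the standard genetic code.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical helper: builds the table on first use and returns a shared
/// `'static` reference afterwards, so readers never go through a mutex.
fn codon_table() -> &'static HashMap<&'static str, char> {
    static TABLE: OnceLock<HashMap<&'static str, char>> = OnceLock::new();
    TABLE.get_or_init(|| {
        // Small excerpt of the standard genetic code, for illustration only.
        let mut m = HashMap::new();
        m.insert("ATG", 'M');
        m.insert("TGG", 'W');
        m.insert("TAA", '*');
        m.insert("TAG", '*');
        m.insert("TGA", '*');
        m
    })
}

/// Hypothetical lookup: returns `None` for codons missing from the excerpt.
fn lookup_codon(codon: &str) -> Option<char> {
    codon_table().get(codon).copied()
}
```

Whether this removes the bottleneck in `translate_cds` would still need to be confirmed with a benchmark.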

byte index out of bounds for GRCh37:10:5139789:A:AGTG

Describe the bug
When annotating GRCh37:10:5139789:A:AGTG with mehari, hgvs::mapper crashes.

To Reproduce
Steps to reproduce the behavior:

Annotate said variant.
See the error trace at bottom

Expected behavior
Annotation should work.

Screenshots
N/A

Additional context

thread 'main' panicked at 'byte index 1 is out of bounds of ``', /opt/conda/conda-bld/mehari_1695048327300/_build_env/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hgvs-0.11.0/src/mapper/altseq.rs:701:56
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/str/mod.rs:87:9
   4: hgvs::mapper::altseq::AltSeqToHgvsp::build_hgvsp
   5: hgvs::mapper::variant::Mapper::c_to_p::{{closure}}
   6: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
   7: core::iter::adapters::try_process
   8: hgvs::mapper::variant::Mapper::c_to_p
   9: hgvs::mapper::assembly::Mapper::c_to_p
  10: mehari::annotate::seqvars::csq::ConsequencePredictor::build_ann_field
  11: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
  12: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
  13: core::iter::adapters::try_process
  14: mehari::annotate::seqvars::csq::ConsequencePredictor::predict
  15: mehari::annotate::seqvars::run_with_writer
  16: mehari::annotate::seqvars::run
  17: tracing_core::dispatcher::with_default
  18: mehari::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This happens when calling c_to_p on the following variant:

NM_001253909.2:c.416_417insGTG

CdsVariant {
    accession: Accession {
        value: "NM_001253909.2",
    },
    gene_symbol: None,
    loc_edit: CdsLocEdit {
        loc: Certain(
            CdsInterval {
                start: CdsPos {
                    base: 416,
                    offset: None,
                    cds_from: Start,
                },
                end: CdsPos {
                    base: 417,
                    offset: None,
                    cds_from: Start,
                },
            },
        ),
        edit: Certain(
            Ins {
                alternative: "GTG",
            },
        ),
    },
}
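The panic message (`byte index 1 is out of bounds of ```) is what Rust prints when an empty string is sliced with a range such as `[1..]`. As a minimal sketch, and not the actual fix in `altseq.rs`, the non-panicking `str::get` can be used instead. The helper names below are hypothetical.

```rust
/// Hypothetical helper: returns everything after the first byte, or `None`
/// when the range is out of bounds or does not fall on a char boundary.
fn tail_after_first(seq: &str) -> Option<&str> {
    seq.get(1..)
}

/// Hypothetical variant that falls back to an empty string instead of panicking.
fn tail_or_empty(seq: &str) -> &str {
    seq.get(1..).unwrap_or("")
}
```

The caller can then decide how an empty alternative protein sequence should be reported instead of crashing.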

Fix intronic dup prediction issue from DNA11-dbSNP.tsv

The following need fixing:

#rs35803309	NC_000007.13:g.21723128_21723129insT	NM_001277115.2:c.5461-273dupT
#rs35803309	NC_000007.13:g.21723128_21723129insT	NM_003777.3:c.5482-273dupT
#rs146960178	NC_000007.13:g.21723129_21723130insC	NM_001277115.2:c.5461-269dupC
#rs146960178	NC_000007.13:g.21723129_21723130insC	NM_003777.3:c.5482-269dupC

thread 'mapper::variant::test::dnah11_db_snp_full' panicked at 'assertion failed: `(left == right)`: NM_001277115.2:c.5461-274_5461-273insT != NM_001277115.2:c.5461-273dupT (g>t; rs35803309; HGVSg=NC_000007.13:g.21723128_21723129insT)

Diff < left / right > :
<NM_001277115.2:c.5461-273dupT
>NM_001277115.2:c.5461-274_5461-273insT

', src/mapper/variant.rs:1723:13

I commented them out in the TSV file for now.
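The expected and actual descriptions differ in whether the insertion is reported as `ins` or as `dup`. HGVS describes an insertion as a duplication when the inserted sequence is a copy of the sequence immediately 5' of the insertion site. The sketch below illustrates that check only. It is not a diagnosis of the failing test, the function name `is_duplication` is hypothetical, and it ignores strand, 3' shifting, and intronic coordinate mapping.

```rust
/// Hypothetical check: `insert_at` is the 0-based index of the first reference
/// base to the right of the insertion point. Returns true when `inserted`
/// equals the reference bases directly to the left of that point.
fn is_duplication(reference: &str, insert_at: usize, inserted: &str) -> bool {
    if inserted.is_empty() || insert_at > reference.len() || inserted.len() > insert_at {
        return false;
    }
    reference.get(insert_at - inserted.len()..insert_at) == Some(inserted)
}
```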

Implement cdot data provider

Implement transcript data provider based on cdot.

The JSON files do not store sequences so we must have a backing seqrepo.

Port over variant normalization

Add support for selenoproteins

Is your feature request related to a problem? Please describe.
We currently do not handle selenoproteins well. We should.

Describe the solution you'd like
Add support for this when loading cdot JSON.

Describe alternatives you've considered
N/A

Additional context
N/A

Problem with annotating p.Met1? variants

We are affected by the same issue as hgvs Python is:

biocommons/hgvs#651

Use features to opt-in into sqlite3 and libpostgres dependencies

Is your feature request related to a problem? Please describe.
At the moment, we always have a runtime dependency on libsqlite3 and libpostgres because of the dependency on sqlite3 for seqrepo and the UTA provider.

Describe the solution you'd like
We should move both to separate features which makes these dependencies optional.

Describe alternatives you've considered
N/A

Additional context
N/A

Port over clinvar tests

This is a follow-up of #21.

We will probably have to wait until the following upstream issue has been fixed.

biocommons/hgvs#380

Complete validation

Validation has been postponed for now. We need to implement it later on.

Replace all usages of unwrap

Replace each unwrap() either with expect(), with a message explaining the invariant, or by returning an error (anyhow Error for now).
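A minimal sketch of the two options follows. To stay self-contained it uses a standard-library error type rather than anyhow, and the function names are hypothetical rather than taken from hgvs-rs.

```rust
use std::num::ParseIntError;

/// Option 1: return the error to the caller.
/// Before, `s.parse::<i32>().unwrap()` would panic on malformed input.
fn parse_position(s: &str) -> Result<i32, ParseIntError> {
    s.trim().parse::<i32>()
}

/// Option 2: keep the panic, but only for a true invariant, and state the
/// invariant in the `expect()` message.
fn last_exon_end(exon_ends: &[i32]) -> i32 {
    *exon_ends
        .last()
        .expect("invariant violated: a transcript must have at least one exon")
}
```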

Implement AssemblyMapper

The aim is not to be feature complete but to make this example run through:

https://hgvs.readthedocs.io/en/stable/examples/manuscript-example.html#project-transcript-variant-nm-182763-2-c-688-403c-t-to-grch37-primary-assembly-using-splign-alignments

tests

test_hgvs_variantmapper_near_discrepancies.py
test_hgvs_assemblymapper.py

Implement VariantMapper

Port over the HGVS variant description parser from Python hgvs package

We need the HGVS parsing functionality and data structures to represent HGVS variant descriptions.

For this, we should port over the HGVS parser module from hgvs. Parsing should be based on the nom package.

Address stop codon recoding / stop codon readthrough (SCR)

Is your feature request related to a problem? Please describe.
We currently do not properly handle SCR, e.g., in SELENON.

Describe the solution you'd like
Address this issue.

Describe alternatives you've considered
N/A

Additional context

Crash on GRCh37:1:89449508:A:ATTTTTTTTTTTT

Current main crashes:

2023-04-18T14:00:49.545332Z TRACE var = Var { chrom: "1", pos: 89449508, reference: "A", alternative: "ATTTTTTTTTTTT" }
thread 'main' panicked at 'byte index 18446744073709551615 is out of bounds of `MVEADRPGKLFIGGLNTETNEKALETVFGKYGRIVEVLLIKDRETNKSRGFAFVTFESPADAKDAARDMNGKSLDGKAIKVEQATKPSFERGRHGPPPPPRSRGPPRGFGAGRGGSGGTRGPPSRGGHMDDGGYSMNFNMSSSRGPLPVKRGPPPRSGGPSPKRSAPSGLVRSSSGMGGRAPLSRGRDSYGGPPRREPLPSRRDVYLSPRDDGYSTKDSYSSRDYPSSRDTRDYAPPPRDYTYRDYGHSSSRDDYP`[...]', /data/cephfs-1/home/users/holtgrem_c/.cargo/registry/src/github.com-1ecc6299db9ec823/hgvs-0.6.1/src/mapper/altseq.rs:987:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Port over the various extensive test files

assembly mapper

test_hgvs_variantmapper_near_discrepancies.py

variant mapper

Add chrMT genetic code

Is your feature request related to a problem? Please describe.
We currently don't have the vertebrate mitochondrial genetic code.

Describe the solution you'd like
Add the vertebrate mitochondrial genetic code translation table.

Describe alternatives you've considered
N/A

Additional context

https://en.wikipedia.org/wiki/Vertebrate_mitochondrial_code
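The vertebrate mitochondrial code differs from the standard code in four codons: AGA and AGG are stop codons, ATA encodes Met, and TGA encodes Trp. A minimal sketch of table-driven translation is shown below. The two table strings follow the NCBI layout with bases ordered T, C, A, G. The constant and function names are hypothetical and do not reflect how hgvs-rs stores its tables.

```rust
/// Base order used to index the 64-entry tables below.
const BASES: &[u8] = b"TCAG";

/// Standard genetic code (NCBI translation table 1).
const STANDARD: &[u8] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Vertebrate mitochondrial code (NCBI translation table 2).
const VERTEBRATE_MITO: &[u8] =
    b"FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG";

/// Maps a nucleotide to its position in `BASES` (case-insensitive).
fn base_index(b: u8) -> Option<usize> {
    BASES.iter().position(|&x| x == b.to_ascii_uppercase())
}

/// Hypothetical helper: translates one codon with the given table and
/// returns `None` for anything that is not three unambiguous DNA bases.
fn translate_with_table(codon: &str, table: &[u8]) -> Option<char> {
    let bytes = codon.as_bytes();
    if bytes.len() != 3 {
        return None;
    }
    let idx = base_index(bytes[0])? * 16 + base_index(bytes[1])? * 4 + base_index(bytes[2])?;
    table.get(idx).map(|&aa| aa as char)
}
```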

Tune translate_cds implementation

A lot of time in mehari is consumed in the translate_cds implementation.

We should properly tune and benchmark this function in hgvs-rs.

Finish porting over test_hgvs_grammar.py

Allow for configuring how Display works

Currently, we can only suppress the reference bases with the NoRef newtype.

We should rather allow for configuring the display with a wrapper type that carries configuration and allows for similar configuration as the Python module, including:

suppress reference nucleotides
switch between one and three letter amino acids
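A wrapper that carries the configuration could look roughly like the sketch below. All type and trait names (`DisplayConfig`, `DisplayWith`, `Configured`, `Deletion`, `ProtPos`) are hypothetical stand-ins and not the existing hgvs-rs types, and the amino acid mapping covers only a few residues.

```rust
use std::fmt;

/// Hypothetical display options; the defaults show reference bases and
/// three-letter amino acids.
#[derive(Clone, Copy, Default)]
struct DisplayConfig {
    suppress_ref: bool,
    one_letter_aa: bool,
}

/// Types that can format themselves under a given configuration.
trait DisplayWith {
    fn fmt_with(&self, cfg: &DisplayConfig, f: &mut fmt::Formatter<'_>) -> fmt::Result;
}

/// Wrapper pairing a value with its configuration so that `{}` works.
struct Configured<'a, T>(&'a T, DisplayConfig);

impl<T: DisplayWith> fmt::Display for Configured<'_, T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.0.fmt_with(&self.1, f)
    }
}

/// Stand-in for a nucleotide deletion such as `76delA`.
struct Deletion {
    pos: u32,
    reference: String,
}

impl DisplayWith for Deletion {
    fn fmt_with(&self, cfg: &DisplayConfig, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if cfg.suppress_ref {
            write!(f, "{}del", self.pos)
        } else {
            write!(f, "{}del{}", self.pos, self.reference)
        }
    }
}

/// Stand-in for a protein position such as `Met1`.
struct ProtPos {
    aa: char,
    pos: u32,
}

/// Illustrative excerpt only; unknown residues map to `Xaa`.
fn three_letter(aa: char) -> &'static str {
    match aa {
        'M' => "Met",
        'C' => "Cys",
        'L' => "Leu",
        '*' => "Ter",
        _ => "Xaa",
    }
}

impl DisplayWith for ProtPos {
    fn fmt_with(&self, cfg: &DisplayConfig, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if cfg.one_letter_aa {
            write!(f, "{}{}", self.aa, self.pos)
        } else {
            write!(f, "{}{}", three_letter(self.aa), self.pos)
        }
    }
}
```

Compared with one newtype per option (as with `NoRef`), a single configured wrapper lets options be combined without adding a type for every combination.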

Switch from lazy_static to once_cell

Implement Display for HgvsVariant and related types

We need this to convert the variants back to strings.

Add proper error handling with enum

We should have a good overview of the anyhow-based errors that we create.

We should create a proper enum based error handling for the library. This will allow downstream software to react to errors more gracefully.
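As a minimal sketch of what enum-based errors could look like using only the standard library: the variants, the function `lookup_cds_start`, and the accession `TX_DEMO.1` are made up for illustration and are not the error set used by hgvs-rs.

```rust
use std::fmt;

/// Hypothetical library error type with one variant per failure class.
#[derive(Debug, PartialEq)]
enum Error {
    InvalidAccession(String),
    TranscriptNotFound(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::InvalidAccession(ac) => write!(f, "invalid accession: {ac:?}"),
            Error::TranscriptNotFound(ac) => write!(f, "transcript not found: {ac}"),
        }
    }
}

impl std::error::Error for Error {}

/// Made-up lookup that knows a single demo transcript.
fn lookup_cds_start(accession: &str) -> Result<u32, Error> {
    match accession {
        "" => Err(Error::InvalidAccession(accession.to_string())),
        "TX_DEMO.1" => Ok(42),
        other => Err(Error::TranscriptNotFound(other.to_string())),
    }
}

/// Downstream code can branch on the variant instead of parsing a message.
fn is_recoverable(err: &Error) -> bool {
    matches!(err, Error::TranscriptNotFound(_))
}
```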

Port over access to the UTA data structure

Split provider interface

The HGVS conversion code and variant normalization code does not need the full UTA interface. We should split the interface into a basic and an extended/full interface part. This will allow for easier implementation.
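One way to express such a split is a small base trait plus an extended trait that requires it. The sketch below is an assumption about the shape only. The trait names, method names, and the in-memory provider are hypothetical and do not mirror the UTA interface.

```rust
use std::collections::HashMap;

/// Basic interface: enough for code that only needs sequence access.
trait SeqProvider {
    fn get_seq(&self, ac: &str, start: usize, end: usize) -> Option<String>;
}

/// Extended interface: everything in `SeqProvider` plus transcript queries.
trait FullProvider: SeqProvider {
    fn get_tx_for_gene(&self, gene: &str) -> Vec<String>;
}

/// Toy in-memory provider used for illustration.
struct InMemory {
    seqs: HashMap<String, String>,
    genes: HashMap<String, Vec<String>>,
}

impl SeqProvider for InMemory {
    fn get_seq(&self, ac: &str, start: usize, end: usize) -> Option<String> {
        // `str::get` returns `None` instead of panicking on a bad range.
        self.seqs.get(ac)?.get(start..end).map(str::to_string)
    }
}

impl FullProvider for InMemory {
    fn get_tx_for_gene(&self, gene: &str) -> Vec<String> {
        self.genes.get(gene).cloned().unwrap_or_default()
    }
}

/// Code that only needs sequences asks for the basic trait.
fn ref_bases<P: SeqProvider>(p: &P, ac: &str, start: usize, end: usize) -> Option<String> {
    p.get_seq(ac, start, end)
}
```

A provider that cannot answer transcript queries would then implement only `SeqProvider`.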

Switch to using biocommons-bioutils-rs crate for assembly info

Is your feature request related to a problem? Please describe.
The assembly info code from biocommons/bioutils is currently in hgvs-rs. This is problematic as hgvs-rs is a bit heavyweight and annonars depends on hgvs-rs although it only needs the assembly enums.

Describe the solution you'd like
Switch over to using the biocommons-bioutils-rs crate.

Describe alternatives you've considered
N/A

Additional context
N/A

Implement AlignmentMapper

Implement translation between HGVS and VCF

Support representing repeated sequences from varnomen

Is your feature request related to a problem? Please describe.
We can parse and represent the most common small variants pretty well. However, repeated sequence is currently out of scope.

Describe the solution you'd like
Add support for parsing and representing the varnomen repeated sequences.

Projecting such variants could be made possible by translating them into sequence representation. This could become another issue, though.

Describe alternatives you've considered
N/A

Additional context
The code for the parser/representation is in the parser module documented here on docs.rs.

Adding support for parsing and representation would be simple enough:

extend the representation data structure
extend the nom-based parser for parsing
add the necessary display code
add appropriate tests for all
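The steps above can be sketched for a single repeat unit written as `UNIT[COUNT]`, for example `CAG[23]`. This is an illustration only: the `RepeatUnit` type is hypothetical, the parser is hand-written with the standard library rather than nom to keep the example self-contained, and it ignores positions and mixed repeats.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical representation of one repeat unit and its copy number.
#[derive(Debug, PartialEq)]
struct RepeatUnit {
    unit: String,
    count: u32,
}

/// Display code: writes the unit followed by the count in brackets.
impl fmt::Display for RepeatUnit {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}[{}]", self.unit, self.count)
    }
}

/// Parsing code: accepts only an A/C/G/T unit followed by `[<integer>]`.
impl FromStr for RepeatUnit {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (unit, rest) = s
            .split_once('[')
            .ok_or_else(|| format!("missing '[' in {s:?}"))?;
        let count = rest
            .strip_suffix(']')
            .ok_or_else(|| format!("missing ']' in {s:?}"))?;
        if unit.is_empty() || !unit.bytes().all(|b| matches!(b, b'A' | b'C' | b'G' | b'T')) {
            return Err(format!("invalid repeat unit in {s:?}"));
        }
        let count = count.parse::<u32>().map_err(|e| e.to_string())?;
        Ok(RepeatUnit { unit: unit.to_string(), count })
    }
}
```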

Add dependabot automerge Github action

https://github.com/bihealth/seqrepo-rs/blob/main/.github/workflows/automerge.yml

Cleanup for first release

remove superfluous debug/dev log messages
make depend on released seqrepo-rs
make necessary changes to Cargo.toml

Move away from linked-hash-map

indexmap is better maintained

Potentially incorrect prediction results

The prediction for GRCh37:17:41197803:AC:A on transcript NM_007294.4 is NP_009230.2:p.Ter700GluextTer35, but VariantValidator.org predicts NP_009225.1:p.Cys1828LeufsTer6.
