nimble-aligner's Issues

Export alignments to BAM

Export alignments to BAM for debugging and integration testing. This is also necessary if we decide to split the workflow into separate align and report steps, rather than the current combined align-and-report step.

Make NUM_CORES effective

Currently, the CPU is being drastically underutilized. Ensure that CPU utilization of the other cores is reasonable.
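One way to spread per-read work across cores is sketched below using scoped threads from the standard library. This is only an illustration under stated assumptions: `align_read` is a hypothetical placeholder for the real alignment step, and `align_all` and its `num_cores` parameter are made-up names, not nimble's actual code.

```rust
use std::thread;

/// Hypothetical placeholder for the per-read work; here it only returns the read length.
fn align_read(read: &str) -> usize {
    read.len()
}

/// Split `reads` into roughly equal chunks, one per worker, and process them in parallel.
/// Results come back in the same order as the input reads.
fn align_all(reads: &[String], num_cores: usize) -> Vec<usize> {
    let workers = num_cores.max(1);
    // Ceiling division so every read lands in exactly one chunk; never zero,
    // because `chunks(0)` would panic.
    let chunk_size = ((reads.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = reads
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|r| align_read(r)).collect::<Vec<usize>>())
            })
            .collect();
        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Static chunking like this only keeps all cores busy if reads take similar time to process; a work-stealing pool would be an alternative if they do not.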

Release build failing

The release hook we were using is deprecated, so we need to take another pass at building a release artifact.

Improve alignment speed

  • Hash reference library, iterate reads instead of references
  • Multithreading
  • Look at forward/reverse read algorithms other than aligning them separately
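The first item above (hash the reference library once, then iterate over reads) could look roughly like the following. This is a simplified exact k-mer lookup meant only to show the shape of the idea, not the aligner's actual algorithm; `build_kmer_index` and `candidate_references` are hypothetical names, and sequences are assumed to be ASCII.

```rust
use std::collections::{HashMap, HashSet};

/// Index every k-mer of every reference, mapping k-mer -> set of reference ids.
fn build_kmer_index(references: &[&str], k: usize) -> HashMap<String, HashSet<usize>> {
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for (ref_id, seq) in references.iter().enumerate() {
        if seq.len() < k {
            continue;
        }
        for start in 0..=(seq.len() - k) {
            index
                .entry(seq[start..start + k].to_string())
                .or_default()
                .insert(ref_id);
        }
    }
    index
}

/// Look up each k-mer of a read in the index and return the sorted ids of
/// all references that share at least one k-mer with the read.
fn candidate_references(
    read: &str,
    index: &HashMap<String, HashSet<usize>>,
    k: usize,
) -> Vec<usize> {
    let mut hits = HashSet::new();
    if read.len() >= k {
        for start in 0..=(read.len() - k) {
            if let Some(ids) = index.get(&read[start..start + k]) {
                hits.extend(ids.iter().copied());
            }
        }
    }
    let mut sorted: Vec<usize> = hits.into_iter().collect();
    sorted.sort_unstable();
    sorted
}
```

The index is built once per library, so the cost per read depends on read length rather than on the number of references.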

Fix incorrect results

Currently, the results from the aligner on real datasets do not match the results we're expecting.

Set up continuous integration

  • Docker image with subset of data on DockerHub
  • Unit tests on all modules
  • Integration test suite
  • Github action to run tests

Progress meter?

This is low priority, but when we run nimble on big data, it logs "Loading read sequences", and then it's a black hole for a long time. It might be nice if nimble could periodically log a 'progressed XXX sequences' type message. Ideally this would be connected to an optional command line argument such as 'logInterval', perhaps defaulting to 100K. Perhaps "-1" turns off all logging?
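A minimal sketch of the interval logic, assuming a hypothetical `log_interval` setting where any value of zero or below (such as -1) disables progress messages. The function names are illustrative and not part of nimble.

```rust
/// Decide whether a progress line is due after `processed` reads.
/// `log_interval` is a hypothetical setting: values <= 0 (e.g. -1) disable logging.
fn should_log_progress(processed: u64, log_interval: i64) -> bool {
    log_interval > 0 && processed > 0 && processed % (log_interval as u64) == 0
}

/// Walk over reads, writing a progress message to stderr every `log_interval` reads.
/// Returns the number of progress messages written, which keeps the behaviour testable.
fn process_reads<I: Iterator<Item = String>>(reads: I, log_interval: i64) -> u64 {
    let mut messages = 0;
    for (i, _read) in reads.enumerate() {
        let processed = i as u64 + 1;
        // ... alignment of `_read` would happen here ...
        if should_log_progress(processed, log_interval) {
            eprintln!("Progressed {} sequences", processed);
            messages += 1;
        }
    }
    messages
}
```

With a default of 100,000, a run over 250,000 reads would print two progress lines.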

Add option in library definition to filter on MAPQ

The background is that 10x data are generally first aligned to a reference genome (like the human genome or macaque genome). This makes a BAM file that has alignments and unaligned reads. Each alignment generally has a mapping quality, indicating the confidence of mapping to the genome. Zero MAPQ generally means unmapped (though we should verify in STAR/10x BAMs). It might be useful in some cases to only have nimble consider reads that did not otherwise have confident matches to the organism's genome. In theory this might reduce noise, and it would avoid a potential concern about double-counting reads.

Here are relevant docs on the 10x-specific alignment-level flags. In addition, MAPQ should be in the BAM:
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam

And here is STAR's doc (STAR is the aligner 10x cellranger uses internally). See section 4.2.1 for how they use MAPQ, including multimapping reads:
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

My interpretation is that STAR encodes reads with a confident single match as MAPQ 255. Anything less than 255 is multi-mapped, as defined by the formula in section 4.2.1 of STAR's docs. Practically speaking, I think we should:

  • Implement a library-level setting for "omitAlignmentsWithMapQAbove=XX". This filter would only apply when the input is a BAM file, not FASTQs. In practice, we could set 255 as this value, which would mean any single-aligned alignment would be discarded, and nimble would only inspect multi-mapped or unmapped reads.

See the note in issue #68 about debugging output and including MAPQ.

As always, it would be very helpful for nimble to maintain some internal information about what it's doing and report that. In this case, simply counting the number of alignments discarded for MAPQ and reporting that figure to STDERR would be valuable.
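The filter and the discard count could be sketched as follows. The `Alignment` struct is a stand-in for a real BAM record, and the threshold is interpreted as "discard when MAPQ is at or above the value", which is an assumption: it is what makes a setting of 255 drop uniquely mapped reads, since 255 is the largest MAPQ a BAM record can hold.

```rust
/// Minimal stand-in for a BAM record; only the field needed for this filter.
struct Alignment {
    mapq: u8,
}

/// Drop alignments whose MAPQ is at or above `threshold` (hypothetical semantics for
/// "omitAlignmentsWithMapQAbove"). `None` means the library did not set the option.
/// Returns the kept alignments and the number discarded, and reports the count to stderr.
fn filter_by_mapq(alignments: Vec<Alignment>, threshold: Option<u8>) -> (Vec<Alignment>, usize) {
    let limit = match threshold {
        Some(limit) => limit,
        None => return (alignments, 0),
    };
    let before = alignments.len();
    let kept: Vec<Alignment> = alignments
        .into_iter()
        .filter(|a| a.mapq < limit)
        .collect();
    let discarded = before - kept.len();
    eprintln!("Discarded {} alignments with MAPQ >= {}", discarded, limit);
    (kept, discarded)
}
```

Unmapped reads carry MAPQ 0 in this sketch and therefore always pass the filter.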

Parse console arguments

Properly parse console arguments. Most configuration happens in the reference library, but we have a few flags we should parse:

  • number of CPU cores
  • subsetting the values in the dataset for development purposes (should try to exclude this from release builds via a compilation flag)
  • Until there's an augmented reference library, the fastq files with reads and the fastq file for the library

Export ambiguous reads instead of collapsing them

Currently, the tool doesn't properly handle ambiguous reads. We should report ambiguous allele-level reads and return those, instead of returning incorrect collapses. The Python tool can then filter on them with Numpy.

Improved insight into alignment-level data

There are many reasons nimble could produce alignments beyond what the user wants. This includes: 1) sub-optimal alignment settings (like too low mapping score), 2) a reference library with sets of references that will simply always result in ambiguous calls (like a reference with two nearly identical sequences). It would help the user if nimble-align could create an optional debugging output with a lot of raw information to help the user debug. The proposal is:

  • Add option to nimble and nimble python (which is passed to nimble-aligner), which is a filepath to an optional debug output. This would be a gzipped TSV.
  • Nimble would output one line per alignment. This output should have: 1) basic alignment metadata, including ref name. If start/stop are not possible, this is OK, 2) alignment stats, like score, # mismatches, etc, 3) the actual complete input read sequence(s), and 4) the sample name or cellbarcode. If the read is ambiguous, it would be useful if the reference name was the actual term used in the regular output (i.e. KIR3DL1/KIR3DL2 or SIVmac239/SIVmac251).

This would enable things like:

  • If a given reference is producing lots of low counts across cells, the user could extract all raw reads that mapped to that reference and 1) inspect the histogram of mapping scores, 2) take all raw reads and manually align them to the reference to see where they map (for example, maybe all the noise is mapping to one specific region), 3) look at which regions of the input read map to the reference vs. which do not. On the whole, the point is to provide a lot of raw information, which could be inspected to better understand what the aligner is actually doing.

Note: while I assumed this would be a rust output, if nimble python code has access to read-level alignment data, this could be equally appropriate in python. In fact, if any filtering happens in python, it might be more appropriate there.
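As a rough illustration of the one-line-per-alignment output, the sketch below writes a header and tab-separated records to any `Write` target. The column set and the `DebugRecord` fields are assumptions chosen from the list above, not an agreed format. Compression is left out; a gzip encoder that implements `Write` (for example the one in the flate2 crate) could be passed in as `out`.

```rust
use std::io::{self, Write};

/// Hypothetical per-alignment debug record; the fields are illustrative only.
struct DebugRecord<'a> {
    reference: &'a str, // e.g. "KIR3DL1/KIR3DL2" for an ambiguous call
    score: i32,
    mismatches: u32,
    read: &'a str,
    cell_barcode: &'a str,
}

const DEBUG_HEADER: &str = "reference\tscore\tmismatches\tread\tcell_barcode";

/// Format a single record as one TSV line (without the trailing newline).
fn to_tsv_line(r: &DebugRecord) -> String {
    format!(
        "{}\t{}\t{}\t{}\t{}",
        r.reference, r.score, r.mismatches, r.read, r.cell_barcode
    )
}

/// Write the header followed by one line per alignment.
fn write_debug<W: Write>(mut out: W, records: &[DebugRecord]) -> io::Result<()> {
    writeln!(out, "{}", DEBUG_HEADER)?;
    for r in records {
        writeln!(out, "{}", to_tsv_line(r))?;
    }
    Ok(())
}
```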

Test/implement bulk-seq alignment filtering and reporting

For alignment:

  • Filter by alignment score
  • Toggle perfect alignments
  • Discard alignments with >1 match during alignment and/or during reporting
  • Filter by # mismatches
  • Discard alignments with differing references for forward/reverse
  • Discard alignments with mismatches > 0

For filtration:

  • Group on hits and report % per-hit
  • Aggregate on lower-resolution typing
  • Filter report data by thresholding percentage value
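Two of the filtration items (percent per hit and thresholding on that percentage) can be sketched together. The function name and the `min_percent` parameter are hypothetical, and hits are represented simply as reference names.

```rust
use std::collections::BTreeMap;

/// Group hits by reference name, convert counts to a percentage of all hits,
/// and keep only the references at or above `min_percent`.
/// Results are sorted by reference name because a BTreeMap is used for grouping.
fn percent_per_hit(hits: &[&str], min_percent: f64) -> Vec<(String, f64)> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for &hit in hits {
        *counts.entry(hit).or_insert(0) += 1;
    }
    let total = hits.len() as f64;
    counts
        .into_iter()
        .map(|(name, n)| (name.to_string(), 100.0 * n as f64 / total))
        .filter(|(_, pct)| *pct >= min_percent)
        .collect()
}
```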

Error handling

There are many unwrap()s in the codebase. There should be proper error handling throughout the application.
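A sketch of what replacing an `unwrap()` might look like: return a `Result` with a dedicated error type and let the caller decide how to report it. `ConfigError` and `parse_num_cores` are hypothetical names used for illustration; they are not taken from the codebase.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type for configuration problems.
#[derive(Debug)]
enum ConfigError {
    MissingField(&'static str),
    InvalidInt {
        field: &'static str,
        source: ParseIntError,
    },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingField(name) => write!(f, "missing required field `{}`", name),
            ConfigError::InvalidInt { field, source } => {
                write!(f, "could not parse `{}` as an integer: {}", field, source)
            }
        }
    }
}

impl std::error::Error for ConfigError {}

/// Instead of `raw.unwrap().parse().unwrap()`, propagate both failure modes.
fn parse_num_cores(raw: Option<&str>) -> Result<usize, ConfigError> {
    let text = raw.ok_or(ConfigError::MissingField("num_cores"))?;
    text.trim()
        .parse::<usize>()
        .map_err(|source| ConfigError::InvalidInt {
            field: "num_cores",
            source,
        })
}
```

The top-level `main` can then print the `Display` message and exit with a non-zero status instead of panicking.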

could not parse max_hits_to_report as int64

I am guessing this error occurs because the input nimble config omits this param, so it can't parse null into an int?

thread 'main' panicked at 'Error -- could not parse max_hits_to_report as int64', src/reference_library.rs:53:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Can either python or rust be made more tolerant to missing params (maybe default values)?
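On the Rust side, tolerance for a missing param could look like the sketch below. The config is modelled as a plain string map rather than the real parsed library file, and the default value of 5 is an arbitrary placeholder, not nimble's actual default.

```rust
use std::collections::HashMap;

/// Hypothetical fallback used when the config omits `max_hits_to_report`.
const DEFAULT_MAX_HITS_TO_REPORT: i64 = 5;

/// Return the configured value, the default if the key is absent,
/// or an error (instead of a panic) if the value is present but not an integer.
fn max_hits_to_report(config: &HashMap<String, String>) -> Result<i64, String> {
    match config.get("max_hits_to_report") {
        None => Ok(DEFAULT_MAX_HITS_TO_REPORT),
        Some(raw) => raw.trim().parse::<i64>().map_err(|e| {
            format!("could not parse max_hits_to_report {:?} as int64: {}", raw, e)
        }),
    }
}
```

A present but malformed value still produces an error here, so typos in the config are not silently replaced by the default.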

Consider allowing score_threshold to be adaptive based on read length

Currently score_threshold is an integer. Input reads of different lengths might produce different distributions of scores (i.e. a 150bp read has the potential to score much higher than a 100bp read). One way to overcome this would be for the user to optionally supply score_threshold as a value between 0 and 1. If the value is above 1, treat it like we do now. If it is less than 1, make the per-read score threshold equal to score_threshold * read length. So if it is 0.85 and the read is 100bp, the effective threshold is 85; for a 150bp read it is 127.5.
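The rule can be written as a small function. This is a sketch with a hypothetical name; a value of exactly 1 is treated as an absolute threshold here, which is an assumption since the proposal does not cover that case.

```rust
/// Compute the per-read threshold from the configured `score_threshold`.
/// Values below 1.0 are treated as a fraction of the read length;
/// values of 1.0 or more are used as-is (assumption for the exact-1 case).
fn effective_score_threshold(score_threshold: f64, read_length: usize) -> f64 {
    if score_threshold < 1.0 {
        score_threshold * read_length as f64
    } else {
        score_threshold
    }
}
```

Since the result can be fractional (127.5 for a 150bp read), the comparison against integer alignment scores would need either a float comparison or an explicit rounding rule.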
