bimberlab / nimble-aligner

nimble-aligner is the backend for nimble, a tool that executes lightweight, flexible alignments to generate supplemental alignment data.

License: MIT License
Export alignments to BAM for debugging and integration-test purposes. This would also be necessary if we decide to split the workflow into separate align -> report steps rather than the current combined align-and-report pass.
Find and test alignment algorithms to identify one that's acceptable for our use case. Check for parallelization and gapped-alignment support.
In particular, compare:
We should support CRAM as input in addition to BAM.
Currently, the CPU is drastically underutilized. Ensure that utilization across the other cores is reasonable.
The release hook we were using is deprecated, so we need to take another pass at building a release artifact.
Currently, the results from the aligner on real datasets do not match the results we're expecting.
We should test these libraries and read in the basic bulk data in preparation for finding an alignment algorithm.
This is low priority, but when we run nimble on big data, it logs "Loading read sequences" and then goes dark for a long time. It would be nice if nimble could periodically log a 'processed XXX sequences' type message. Ideally this would be connected to an optional command-line argument for 'logInterval' or something, perhaps defaulting to 100K. Perhaps "-1" turns off all progress logging?
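A minimal sketch of what the interval-based progress logging proposed above could look like. The function names and the `log_interval` parameter are hypothetical stand-ins, not nimble's actual API:

```rust
// Sketch: decide whether to emit a progress message for the current read count.
// A log_interval of -1 (or 0) disables progress logging entirely, matching the
// "-1 turns off logging" idea above.
fn should_log(reads_processed: u64, log_interval: i64) -> bool {
    if log_interval <= 0 {
        return false;
    }
    reads_processed % log_interval as u64 == 0
}

// Hypothetical stand-in for nimble's read-processing loop, logging progress
// to STDERR every `log_interval` sequences.
fn process_reads(total: u64, log_interval: i64) -> u64 {
    let mut processed = 0u64;
    for _ in 0..total {
        processed += 1;
        if should_log(processed, log_interval) {
            eprintln!("processed {} sequences", processed);
        }
    }
    processed
}
```

With the suggested default of 100K, a run over millions of reads would emit a heartbeat line every 100,000 sequences instead of going silent.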
The background is that 10x data are generally first aligned to a reference genome (like the human or macaque genome). This produces a BAM file containing both alignments and unaligned reads. Each alignment generally has a mapping quality (MAPQ) indicating the confidence of the mapping to the genome. Zero MAPQ generally means unmapped (though we should verify in STAR/10x BAMs). It might be useful in some cases to have nimble consider only reads that did not otherwise have confident matches to the organism's genome. In theory this might reduce noise, and it would avoid a potential concern about double-counting reads.
Here are relevant docs on the 10x-specific alignment-level flags. In addition, MAPQ should be in the BAM:
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam
and here is STAR's doc (STAR is the aligner 10x cellranger uses internally). See section 4.2.1 for how they use MAPQ, including for multimapping reads:
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf
My interpretation of this is that reads with a confident single match are encoded by STAR as MAPQ 255; anything less than 255 is multi-mapped, defined by the formula in section 4.2.1 of STAR's docs. Practically speaking, I think we should:
See the note in issue #68 about debugging output and including MAPQ.
As always, it would be very helpful for nimble to maintain some internal information about what it's doing and report it. In this case, simply counting the number of alignments discarded for low MAPQ and reporting that figure to STDERR would be valuable.
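The MAPQ filter plus discard counter could be sketched as below. This is illustrative only: `min_mapq` is a hypothetical threshold parameter, and a real implementation would pull MAPQ values from BAM records (e.g. via rust-htslib) rather than a slice:

```rust
/// Sketch: keep alignments at or above a MAPQ threshold and count how many
/// were discarded, so the count can be reported to STDERR as proposed above.
/// STAR encodes confident unique matches as MAPQ 255.
fn filter_by_mapq(mapqs: &[u8], min_mapq: u8) -> (Vec<u8>, u64) {
    let mut discarded = 0u64;
    let kept: Vec<u8> = mapqs
        .iter()
        .copied()
        .filter(|&q| {
            if q >= min_mapq {
                true
            } else {
                discarded += 1;
                false
            }
        })
        .collect();
    // In nimble this would go to STDERR alongside the other run statistics.
    eprintln!("alignments discarded for MAPQ < {}: {}", min_mapq, discarded);
    (kept, discarded)
}
```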
Currently, the program only does paired-end alignment. It should also handle single-end alignment.
Properly parse console arguments. Most configuration happens in the reference library, but we have a few flags we should parse:
Currently, all of the alignments for a given read against the reference that pass the score threshold are counted. Instead, we only want to count the best alignments, not all of them.
Example here:
MHC Alignment.xlsx
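The intended best-only behavior could be sketched as follows; the function name, signature, and (name, score) representation are hypothetical, not nimble's actual data model:

```rust
/// Sketch: among alignments that clear the score threshold, keep only those
/// tied for the maximum score, rather than every passing alignment.
fn best_alignments(scored: &[(String, i32)], score_threshold: i32) -> Vec<String> {
    // Find the best score among threshold-passing alignments, if any.
    let best = scored
        .iter()
        .filter(|(_, s)| *s >= score_threshold)
        .map(|(_, s)| *s)
        .max();
    match best {
        Some(best_score) => scored
            .iter()
            .filter(|(_, s)| *s == best_score)
            .map(|(name, _)| name.clone())
            .collect(),
        None => Vec::new(),
    }
}
```

Under the current behavior, a read scoring 50 against reference A and 60 against references B and C (threshold 40) would count toward all three; with this change it would count only toward B and C.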
Currently, the tool doesn't properly handle ambiguous reads. We should report ambiguous allele-level reads and return those instead of returning incorrect collapses; the Python tool can then filter on them with NumPy.
There are many reasons nimble could produce alignments beyond what the user wants, including: 1) sub-optimal alignment settings (like too low a mapping score), and 2) a reference library with sets of references that will simply always result in ambiguous calls (like a reference with two nearly identical sequences). It would help the user if nimble-align could create an optional debugging output with a lot of raw information to aid debugging. The proposal is:
This would enable things like:
Note: while I assumed this would be a Rust output, if nimble's Python code has access to read-level alignment data, this could be equally appropriate in Python. In fact, if any filtering happens in Python, it might be more appropriate there.
Once the alignment algorithm returns results, test out heuristics to collapse the results into summarized form.
For alignment:
- Filter by alignment score
- Toggle perfect alignments
- Discard alignments with >1 match during alignment and/or during reporting
- Filter by # of mismatches
- Discard alignments whose forward/reverse mates hit differing references
- Discard alignments with mismatches > 0

For filtration:
- Group on hits and report % per hit
- Aggregate on lower-resolution typing
- Filter report data by thresholding the percentage value
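The "group on hits and report % per hit" filtration step above could be sketched like this (the helper name and flat list-of-hits input are hypothetical simplifications):

```rust
use std::collections::BTreeMap;

/// Sketch: count occurrences of each hit name and report each as a
/// percentage of the total, as in the "group on hits" heuristic above.
fn percent_per_hit(hits: &[&str]) -> BTreeMap<String, f64> {
    let total = hits.len() as f64;
    let mut counts: BTreeMap<String, u64> = BTreeMap::new();
    for h in hits {
        *counts.entry((*h).to_string()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .map(|(name, n)| (name, 100.0 * n as f64 / total))
        .collect()
}
```

A downstream percentage threshold (the last filtration bullet) would then just drop entries below some cutoff.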
There are many unwrap()s in the codebase. There should be proper error handling throughout the application.
I am guessing this error occurs because the input nimble config omits this param, so it can't parse null into an int?

thread 'main' panicked at 'Error -- could not parse max_hits_to_report as int64', src/reference_library.rs:53:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Can either the Python or the Rust side be made more tolerant of missing params (maybe with default values)?
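On the Rust side, one option is to fall back to a default when the value is absent instead of panicking. A minimal sketch, where the helper name, the string-based input, and the default of 1 are all hypothetical:

```rust
/// Sketch: parse max_hits_to_report tolerantly, substituting a default when
/// the config omits the param or supplies null, rather than panicking as
/// reference_library.rs currently does.
fn parse_max_hits_to_report(raw: Option<&str>) -> Result<i64, String> {
    const DEFAULT_MAX_HITS: i64 = 1; // hypothetical default

    match raw {
        // Param missing, explicitly null, or empty: use the default.
        None | Some("null") | Some("") => Ok(DEFAULT_MAX_HITS),
        // Present but malformed: return an error instead of panicking.
        Some(s) => s
            .parse::<i64>()
            .map_err(|_| format!("could not parse max_hits_to_report as int64: {:?}", s)),
    }
}
```

The same idea applies to the other optional reference-library params: each gets a documented default, and only genuinely malformed values produce an error.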
Currently, Cargo.toml points at the latest versions of debrujin and debrujin-mapping. We should pin a specific version number/commit hash.
Currently, score_threshold is an integer. Different-length input reads may produce different distributions of scores (i.e. a 150bp read has the potential to score much higher than a 100bp read). One way to overcome this would be for the user to optionally supply score_threshold as a value between 0 and 1. If the supplied value is greater than 1, treat it as an absolute threshold like we do now; if less than 1, make the score threshold for each read equal to value * read length. So if it is 0.85 and the read is 100bp, the effective threshold is 85; for a 150bp read it is 127.5.
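The proposed behavior is a one-line scaling rule; a sketch, with a hypothetical function name:

```rust
/// Sketch of the fractional score_threshold proposal: values >= 1 are treated
/// as absolute thresholds (current behavior); values below 1 are scaled by
/// read length to give a per-read effective threshold.
fn effective_threshold(score_threshold: f64, read_len: usize) -> f64 {
    if score_threshold < 1.0 {
        score_threshold * read_len as f64
    } else {
        score_threshold
    }
}
```

So a threshold of 0.85 yields an effective threshold of 85 for a 100bp read and 127.5 for a 150bp read, while a threshold of 50 behaves exactly as today regardless of read length.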