bimberlab / nimble-aligner

nimble-aligner is the backend for nimble, a tool that executes lightweight, flexible alignments to generate supplemental alignment data

License: MIT License

Rust 100.00%

bio alignment pseudo-alignment rust nimble rna-seq multiplatform fast ohsu-vgti rust-bio

nimble-aligner's Issues

Export alignments to BAM

Export alignments to BAM for debugging and integration testing. This is also necessary if we decide to split the workflow into separate align and report steps, rather than the current combined align-and-report step.

Integrate support for 10x datasets

rust-poc: Find and implement alignment algorithm

Find and test alignment algorithms to find one that's acceptable for our use case. Check for parallelization and gapped alignment.

In particular, compare:

Support for CRAM input

We should support CRAM as input in addition to BAM.

Parse .bam files and treat each UMI as a unit to be aligned against the reference library

Get number of records in reference genome/samples to be aligned from the files themselves rather than hardcoding them

Make NUM_CORES effective

Currently, the CPU is being drastically underutilized. Ensure that CPU utilization of the other cores is reasonable.
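
One way to make a core-count setting effective is to split the reads into one chunk per core and process the chunks on scoped threads. The sketch below uses only the standard library; `align_read` is a hypothetical stand-in for the real per-read work, and `align_all` is an illustrative name rather than anything in nimble-aligner.

```rust
use std::thread;

/// Hypothetical stand-in for aligning one read; here it just returns the read length.
fn align_read(read: &str) -> usize {
    read.len()
}

/// Split `reads` into roughly equal chunks and process each chunk on its own thread.
fn align_all(reads: &[String], num_cores: usize) -> Vec<usize> {
    let num_cores = num_cores.max(1);
    // Ceiling division so every read lands in exactly one chunk; never zero.
    let chunk_size = ((reads.len() + num_cores - 1) / num_cores).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = reads
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|r| align_read(r)).collect::<Vec<usize>>())
            })
            .collect();

        // Joining in spawn order keeps the results in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Static chunking like this only balances well when reads cost about the same to align; a work-stealing pool may suit uneven workloads better.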

Release build failing

The release hook we were using is deprecated, so we need to take another pass at building a release artifact

Improve alignment speed

Hash reference library, iterate reads instead of references
Multithreading
Look at forward/reverse read algorithms other than aligning them separately
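
The first item (hash the reference library, then iterate over reads) can be sketched as a k-mer index. This is a simplified illustration of the idea, not nimble's actual algorithm: sequences are assumed to be ASCII, and `build_index` and `compatible_references` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

/// Map every k-mer in the reference library to the set of reference indices containing it.
fn build_index(references: &[&str], k: usize) -> HashMap<String, HashSet<usize>> {
    let mut index: HashMap<String, HashSet<usize>> = HashMap::new();
    for (ref_id, seq) in references.iter().enumerate() {
        if k == 0 || seq.len() < k {
            continue;
        }
        for start in 0..=seq.len() - k {
            index
                .entry(seq[start..start + k].to_string())
                .or_default()
                .insert(ref_id);
        }
    }
    index
}

/// References that contain every k-mer of the read (empty if any k-mer is unknown).
fn compatible_references(
    index: &HashMap<String, HashSet<usize>>,
    read: &str,
    k: usize,
) -> HashSet<usize> {
    if k == 0 || read.len() < k {
        return HashSet::new();
    }
    let mut result: Option<HashSet<usize>> = None;
    for start in 0..=read.len() - k {
        let hits = match index.get(&read[start..start + k]) {
            Some(h) => h,
            None => return HashSet::new(),
        };
        // Intersect the running candidate set with this k-mer's references.
        result = Some(match result {
            None => hits.clone(),
            Some(acc) => acc.intersection(hits).copied().collect(),
        });
    }
    result.unwrap_or_default()
}
```

The index is built once per library, so the per-read cost depends on read length rather than on the number of references.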

Fix incorrect results

Currently, the results from the aligner on real datasets do not match the results we're expecting.

Set up continuous integration

Docker image with subset of data on DockerHub
Unit tests on all modules
Integration test suite
Github action to run tests

rust-poc: test I/O libraries for genetic file formats and read bulk data in preparation for alignment

For FASTQ and FASTA, we have https://github.com/rust-bio/rust-bio
For BAM, we have https://github.com/rust-bio/rust-htslib

We should test these libraries and read in the basic bulk data in preparation for finding an alignment algorithm.

Progress meter?

This is low priority, but when we run nimble on big data, it logs the "Loading read sequences", and then it's a black hole for a long time. It might be nice if nimble could periodically log a 'progressed XXX sequences' type message. Ideally this would be connected to an optional command line argument for 'logInterval' or something, where maybe that defaults to 100K? Perhaps "-1" turns off all logging?
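
A minimal sketch of such a logger, assuming the interval arrives as a signed integer and that any value of zero or below (including the suggested -1) disables logging. `ProgressLogger` and its message format are illustrative only.

```rust
/// Decides when to emit a progress message. `log_interval <= 0` disables logging.
struct ProgressLogger {
    interval: Option<u64>,
    processed: u64,
}

impl ProgressLogger {
    fn new(log_interval: i64) -> Self {
        let interval = if log_interval > 0 {
            Some(log_interval as u64)
        } else {
            None
        };
        ProgressLogger { interval, processed: 0 }
    }

    /// Record one processed sequence; returns true if a message was logged.
    fn tick(&mut self) -> bool {
        self.processed += 1;
        match self.interval {
            Some(n) if self.processed % n == 0 => {
                eprintln!("Processed {} sequences", self.processed);
                true
            }
            _ => false,
        }
    }
}
```

Calling `tick()` once per read inside the alignment loop would be enough to break up the silent stretch after "Loading read sequences".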

Add option in library definition to filter on MAPQ

The background is that 10x data are generally first aligned to a reference genome (like the human genome or macaque genome). This makes a BAM file that has alignments and unaligned reads. Each alignment generally has a mapping quality, indicating the confidence of mapping to the genome. Zero MAPQ generally means unmapped (though we should verify in STAR/10x BAMs). It might be useful in some cases to only have nimble consider reads that did not otherwise have confident matches to the organism's genome. In theory this might reduce noise, and it would avoid a potential concern about double-counting reads.

Here are relevant docs on the 10x-specific alignment-level flags. In addition, MAPQ should be in the BAM:
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam

And here is STAR's doc (STAR is the aligner 10x cellranger uses internally). See section 4.2.1 for how they use MAPQ, including for multimapping reads:
https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

My interpretation is that STAR encodes reads with a confident single match as MAPQ 255. Anything less than 255 is multi-mapped, with the value defined by the formula in section 4.2.1 of STAR's docs. Practically speaking, I think we should:

Implement a library-level setting, "omitAlignmentsWithMapQAbove=XX". This filter would only apply when the input is a BAM file, not FASTQs. In practice, we could set this value to 255, which would mean any single-match alignment would be discarded, and nimble would only inspect multi-mapped or unmapped reads.

See the note in issue #68 about debugging output and including MAPQ.

As always, it would be very helpful for nimble to maintain some internal information about what it's doing and report that. In this case, simply counting the number of alignments discarded for MAPQ and reporting that figure to STDERR would be valuable.
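
The filter and its counter could look roughly like the sketch below. Two assumptions are worth flagging: `Record` is a minimal stand-in rather than a real BAM record type, and the threshold is treated as inclusive (MAPQ at or above the value is discarded), since a setting of 255 could not discard anything if the comparison were strictly "above".

```rust
/// Minimal stand-in for a BAM record; only the fields needed here.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    name: String,
    mapq: u8,
}

/// Keep records whose MAPQ is below `threshold` and return them with the discard count.
/// `None` disables the filter (for example, when the input is FASTQ).
fn filter_by_mapq(records: Vec<Record>, threshold: Option<u8>) -> (Vec<Record>, usize) {
    let threshold = match threshold {
        Some(t) => t,
        None => return (records, 0),
    };
    let total = records.len();
    let kept: Vec<Record> = records
        .into_iter()
        .filter(|r| r.mapq < threshold)
        .collect();
    let discarded = total - kept.len();
    // Report the figure to STDERR, as the issue requests.
    eprintln!("Discarded {} alignments with MAPQ >= {}", discarded, threshold);
    (kept, discarded)
}
```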

Handle single-end sequences

Currently, the program only does paired-end alignment. It should also handle single-end alignment.

Parse console arguments

Properly parse console arguments. Most configuration happens in the reference library, but we have a few flags we should parse:

number of CPU cores
subsetting the values in the dataset for development purposes (should try to exclude this from release builds via a compilation flag)
Until there's an augmented reference library, the fastq files with reads and the fastq file for the library

All read alignments being counted instead of only the best alignments

Currently, all of the alignments for a given read against the reference genome will be counted if they pass the score threshold. Instead, we only want to count the best alignments, not all of them.

Example here:
MHC Alignment.xlsx
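
A minimal sketch of the intended behaviour, assuming each candidate alignment carries a reference name and an integer score. `Hit` and `best_hits` are hypothetical names, and ties at the best score are all kept.

```rust
/// One candidate alignment of a read against a reference (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    reference: String,
    score: i32,
}

/// Keep only the top-scoring hits that also pass `score_threshold`.
/// Ties at the best score are all retained.
fn best_hits(hits: &[Hit], score_threshold: i32) -> Vec<Hit> {
    let best = hits
        .iter()
        .map(|h| h.score)
        .filter(|&s| s >= score_threshold)
        .max();
    match best {
        Some(top) => hits.iter().filter(|h| h.score == top).cloned().collect(),
        None => Vec::new(),
    }
}
```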

Export ambiguous reads instead of collapsing them

Currently, the tool doesn't properly handle ambiguous reads. We should report ambiguous allele-level reads and return those, instead of returning incorrect collapses. The Python tool can then filter on them with Numpy.
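
One possible shape for this, sketched under the assumption that each read resolves to a list of equally good reference names: tied references are joined into a single label and counted under that label, so downstream code can filter on them. The "/" separator and both function names are illustrative.

```rust
use std::collections::BTreeMap;

/// Build the reported name for a set of tied references, e.g. "KIR3DL1/KIR3DL2".
fn ambiguity_label(references: &[&str]) -> String {
    let mut names: Vec<&str> = references.to_vec();
    // Sort and deduplicate so the same set always yields the same label.
    names.sort_unstable();
    names.dedup();
    names.join("/")
}

/// Count reads per label, keeping ambiguous sets separate instead of collapsing them.
fn count_by_label(per_read_refs: &[Vec<&str>]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for refs in per_read_refs {
        if refs.is_empty() {
            continue; // reads with no hits are not reported
        }
        *counts.entry(ambiguity_label(refs)).or_insert(0) += 1;
    }
    counts
}
```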

Improved insight into alignment-level data

There are many reasons nimble could produce alignments beyond what the user wants. This includes: 1) sub-optimal alignment settings (like too low mapping score), 2) a reference library with sets of references that will simply always result in ambiguous calls (like a reference with two nearly identical sequences). It would help the user if nimble-align could create an optional debugging output with a lot of raw information to help the user debug. The proposal is:

Add option to nimble and nimble python (which is passed to nimble-aligner), which is a filepath to an optional debug output. This would be a gzipped TSV.
Nimble would output one line per alignment. This output should have: 1) basic alignment metadata, including ref name. If start/stop are not possible, this is OK, 2) alignment stats, like score, # mismatches, etc, 3) the actual complete input read sequence(s), and 4) the sample name or cellbarcode. If the read is ambiguous, it would be useful if the reference name was the actual term used in the regular output (i.e. KIR3DL1/KIR3DL2 or SIVmac239/SIVmac251).

This would enable things like:

If a given reference is producing lots of low counts across cells, the user could extract all raw reads that mapped to that reference and 1) inspect the histogram of mapping scores, 2) take all raw reads and manually align them to the reference to see where they map (for example, maybe all the noise is mapping to one specific region), 3) look at which regions of the input read map to the reference vs. which do not. On the whole, the point is to provide a lot of raw information, which could be inspected to better understand what the aligner is actually doing.

Note: while I assumed this would be a rust output, if nimble python code has access to read-level alignment data, this could be equally appropriate in python. In fact, if any filtering happens in python, it might be more appropriate there.
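
As a rough sketch of the proposed output, the function below writes one tab-separated line per alignment to any writer. The column set and the `DebugRow` struct are assumptions drawn from the list above, not an existing nimble format. Gzip compression is left out to keep the example dependency-free; a crate such as flate2 could wrap the writer.

```rust
use std::io::{self, Write};

/// Hypothetical per-alignment debug row; field names are illustrative.
struct DebugRow<'a> {
    reference: &'a str,
    score: i32,
    mismatches: u32,
    read_sequence: &'a str,
    cell_barcode: &'a str,
}

/// Write a header plus one tab-separated line per alignment to any writer.
fn write_debug_tsv<W: Write>(mut out: W, rows: &[DebugRow<'_>]) -> io::Result<()> {
    writeln!(out, "reference\tscore\tmismatches\tread_sequence\tcell_barcode")?;
    for r in rows {
        writeln!(
            out,
            "{}\t{}\t{}\t{}\t{}",
            r.reference, r.score, r.mismatches, r.read_sequence, r.cell_barcode
        )?;
    }
    Ok(())
}
```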

rust-poc: Collapse alignment results

Once the alignment algorithm returns results, test out heuristics to collapse the results into summarized form.

Test/implement bulk-seq alignment filtering and reporting

For alignment:
-Filter by alignment score
-Toggle perfect alignments
-Discard alignments with >1 match during alignment and/or during reporting

-Filter by # mismatches
-Discard alignments with differing references for forward/reverse
-Discard alignments with mismatches > 0

For filtration:
-Group on hits and report % per-hit
-Aggregate on lower-resolution typing
-Filter report data by thresholding percentage value
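
The reporting steps (group on hits, report a percentage per hit, threshold on that percentage) might be sketched as follows. `percent_report` is a hypothetical helper; percentages are taken relative to all hits passed in.

```rust
use std::collections::BTreeMap;

/// Group hits by reference name, convert counts to percentages of all hits,
/// and drop entries below `min_percent`.
fn percent_report(hits: &[&str], min_percent: f64) -> BTreeMap<String, f64> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for h in hits {
        *counts.entry(*h).or_insert(0) += 1;
    }
    let total = hits.len() as f64;
    counts
        .into_iter()
        .map(|(name, n)| (name.to_string(), 100.0 * n as f64 / total))
        .filter(|(_, pct)| *pct >= min_percent)
        .collect()
}
```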

Refactor main.rs business logic into other files to make it unit-testable

Error handling

There are many unwrap()s in the codebase. There should be proper error handling throughout the application.

could not parse max_hits_to_report as int64

I am guessing this error occurs because the input nimble config omits this param, so it can't parse null into an int?

thread 'main' panicked at 'Error -- could not parse max_hits_to_report as int64', src/reference_library.rs:53:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Can either python or rust be made more tolerant to missing params (maybe default values)?
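
On the Rust side, one tolerant approach is to fall back to a default when the key is absent, while still reporting a malformed value as an error instead of panicking. The sketch below models the parsed config as a plain string map, which is an assumption about its shape; `int_or_default` is a hypothetical helper, and the caller would choose the default.

```rust
use std::collections::HashMap;

/// Read an integer setting from a parsed config, using `default` when the key is absent.
/// A present but malformed value is still an error, returned rather than panicking.
fn int_or_default(
    config: &HashMap<String, String>,
    key: &str,
    default: i64,
) -> Result<i64, String> {
    match config.get(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<i64>()
            .map_err(|e| format!("could not parse {} as int64: {}", key, e)),
    }
}
```

Returning a `Result` lets the caller decide whether to abort or continue, which also fits the broader goal of replacing `unwrap()` calls.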

Anchor debruijn and debruijn-mapping dependencies

Currently, Cargo.toml points at the latest versions of debruijn and debruijn-mapping. We should pin each to a specific version number or commit hash.

Consider allowing score_threshold to be adaptive based on read length

Currently, score_threshold is an integer. Input reads of different lengths might produce different distributions of scores (i.e. a 150bp read has the potential to score much higher than a 100bp read). One way to overcome this would be for the user to optionally supply score_threshold as a value between 0 and 1. If the value is above 1, treat it as we do now. If it is less than 1, make the per-read score threshold equal to score_threshold * read length. So if it is 0.85 and the read is 100bp, the effective threshold is 85; for a 150bp read, it is 127.5.
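
The rule can be sketched as a small function. The proposal does not say how a value of exactly 1 should behave; this sketch treats it as an absolute score, which is an assumption. `effective_threshold` is a hypothetical name.

```rust
/// Resolve the effective score threshold for one read.
/// Values below 1.0 are treated as a fraction of read length; others are absolute.
fn effective_threshold(score_threshold: f64, read_len: usize) -> f64 {
    if score_threshold < 1.0 {
        score_threshold * read_len as f64
    } else {
        score_threshold
    }
}
```

With a setting of 0.85, a 100bp read resolves to a threshold of about 85 and a 150bp read to about 127.5, while a setting such as 60 is returned unchanged for any read length.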
