bimberlab / nimble
nimble — execute lightweight, flexible alignments on arbitrary reference libraries
License: MIT License
This might technically be a nimble-align issue. nimble supports a 'group_on' param, which groups hits based on some arbitrary field in the library. I noticed that we have hits like ",Mamu-DRB1*03g" (note the leading comma). I assume this means one of the raw hits lacked a value for lineage, and it is therefore reporting null?
If a given record lacks a value for the group_on field, I suggest we coalesce and report the primary name, so we at least use some value.
nimble reports multi-hit results, using commas to separate results. We should enforce the following:
When reporting multi-ref hits, nimble collapses them into a CSV string, like:
NKG2A,NKG2C
The problem is that the order is random, so the same combination can be reported as two separate hits, like this:
NKG2A,NKG2C
NKG2C,NKG2A
I propose that nimble always sorts the names in a multi-ref hit before joining on the comma. This makes the name deterministic. If whatever code does this operation can easily do a natural sort, it should, since the names often contain numbers (e.g. KIR3DL2,KIR3DL10), and the output would then be more intuitive to a human reader.
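The proposed deterministic join, with a natural sort so numbered names order intuitively, could be sketched like this (illustrative code, not nimble-align's actual implementation):

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'KIR3DL2' sorts before 'KIR3DL10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

def collapse_multi_hit(names):
    """Deterministically join a multi-reference hit into one CSV string."""
    return ",".join(sorted(set(names), key=natural_key))
```

With this, `NKG2C,NKG2A` and `NKG2A,NKG2C` both collapse to the same string, and `KIR3DL2` sorts ahead of `KIR3DL10` despite plain string order saying otherwise.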
Broadly, the goal of this is to ensure we capture the type of each input read (i.e. first or second of the pair), and to do this whether from FASTQ or BAM. We also want to record the orientation of the alignment relative to the reference and then use this information for filtration. I would divide this into two projects:
Task 1: Capture pair information:
FASTQ: When reading FASTQ, the reads are already in the orientation created by the sequencer, so no immediate changes are needed. The first FASTQ is the forward read and the second is the reverse read. The sequence of a read in a FASTQ is always reported in the orientation in which it was generated by the sequencer.
BAM: When reading a BAM input, use the BAM's "first in pair" / "second in pair" flags to determine whether a given alignment came from a forward or reverse read, and carry that information forward for later filtering. Additionally, a BAM reports the read sequence post-alignment, and it always reports the sequence in the orientation of the reference genome. Therefore, when reading from a BAM file, use the BAM "read reverse strand" flag: if it is set, the sequence reported in the BAM is the reverse-complement of the original sequence (as it would have appeared in the FASTQ), so reverse-complement the read sequence first to restore it.
Said another way, the objective of this task is to always normalize the input read sequences to match their orientation as produced by the sequencer.
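The BAM-side normalization can be sketched using the raw SAM flag bits (in practice one would use pysam's `is_read1` / `is_read2` / `is_reverse` properties; this pure-Python sketch just shows the logic):

```python
# SAM flag bits, per the SAM specification
FIRST_IN_PAIR  = 0x40
SECOND_IN_PAIR = 0x80
READ_REVERSE   = 0x10

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def normalize_bam_read(flag, seq):
    """Return (mate, sequence) with the sequence restored to sequencer orientation.

    mate is 1 for first-in-pair, 2 for second-in-pair, None if unpaired.
    If the alignment is on the reverse strand, the BAM stores the
    reverse-complement of the original read, so we undo that here.
    """
    mate = 1 if flag & FIRST_IN_PAIR else (2 if flag & SECOND_IN_PAIR else None)
    if flag & READ_REVERSE:
        seq = reverse_complement(seq)
    return mate, seq
```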
Task 2: Capture orientation of the alignment
As I understand it, the pseudoaligner does not handle reverse complements itself; instead, we manually make a reverse complement of each reference and align each read against both the original sequence and the reverse-complemented one. I therefore suggest simply tracking which version of the reference each alignment hit, and carrying this forward for downstream filtering.
As far as representing orientation for a pair of hits, this is typically something like FF, FR, RF, or RR, where the first letter is the alignment orientation of the forward read and the second is that of the reverse read. We should also be able to capture the situation where the forward or reverse read has no hit. So maybe F, R, and U would be the characters for each read?
Said another way, the broad goal is to make sure we capture the orientation of the alignment for each read and carry this downstream.
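A tiny sketch of the proposed F/R/U pair encoding (the function name is illustrative, not nimble's API):

```python
def pair_orientation(fwd_orient, rev_orient):
    """Encode a read pair's alignment orientations as a two-letter code.

    Each read's orientation is 'F' (aligned to the original reference),
    'R' (aligned to the reverse-complemented reference), or 'U' (no hit).
    First letter = forward (R1) read, second = reverse (R2) read.
    """
    allowed = {"F", "R", "U"}
    if fwd_orient not in allowed or rev_orient not in allowed:
        raise ValueError("orientation must be one of F, R, U")
    return fwd_orient + rev_orient
```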
Task 3: Execute filtering.
One piece of this filter acts against each pair of reads. As far as I am aware, the forward and reverse read of a given pair should, 100% of the time (though I would suggest making this a config param), be in opposite orientations on a single reference. Let's say a single read pair has this set of alignments:
RefA: F/R = allowable
RefB: R/F = allowable
RefA: R/R = not allowable
RefA: U/R = allowable
RefC: F/U = allowable
RefA: R/F = allowable
RefR: R/R = not allowable
Note that the RefA F/R and RefB R/F are simultaneously allowable.
Both 10x and the older unstranded libraries should have this level of filtering applied, but again I would make this a configurable option (probably with a default to use it).
Filtering Step 2:
In certain cases, a type of input data is stranded, so that forward reads should always align in one orientation relative to the ref and the reverse to another. The user would need to specify this at runtime in the config.
In 10x Genomics 5-prime libraries (the kind we use), the forward read should always align in the forward direction and the reverse read in the reverse direction. The 10x 3-prime case (which we don't currently use, but which is a valid input type) enforces the opposite.
I think the implementation of this is the same as the allowed-orientation list suggested above. Unstranded libraries might therefore allow: FR, FU, RF, RU, UR, UF
and then 10x genomics 5' would just be: FR, FU, UR
and then 10x genomics 3' would be the opposite: RF, RU, UF
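The three allowed-orientation lists above amount to a simple set lookup; the library-type names and function here are illustrative, not a shipped nimble config:

```python
# Allowed pair-orientation codes per library type, as proposed in the text.
ALLOWED_ORIENTATIONS = {
    "unstranded": {"FR", "FU", "RF", "RU", "UR", "UF"},
    "10x-5p":     {"FR", "FU", "UR"},
    "10x-3p":     {"RF", "RU", "UF"},
}

def pair_passes(orientation_code, library_type="unstranded"):
    """Return True if a pair's orientation code is allowed for this library type."""
    return orientation_code in ALLOWED_ORIENTATIONS[library_type]
```

Note that this naturally rejects R/R and F/F pairs for every library type, matching the "opposite orientations" rule in filtering step 1.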
Notes:
It might be worth having an option to report counts for reverse hits. I'd maybe output a new feature called $REFNAME-AS (so SIVmac239-AS), where "AS" = anti-sense. Even if the library type (like 10x) required a forward-orientation read pair alignment, there would be value in having some way to optionally report these and to understand that they are being captured. The reason for a reverse hit might be technical (suggesting a nimble problem), but there could also be real biology behind it. Implementing this option might make sense as a library-level "ReportAntisenseAlignment" flag, defaulting to false. If true, rather than treat the reverse-complemented reference as the same as the forward one, I would suggest renaming it right away and treating the two as completely distinct hits downstream. If we did that, all other filter logic would probably be unchanged from what's described above. This does have implications for lineage-level collapsing, though, since the ref names in the output would no longer be identical to the reference metadata.
nimble can create large temp files in some instances. We should:
Implement collapse of the ambiguous results in accordance with the filtering doc
We should make a basic Dockerfile for nimble, which installs nimble and downloads the latest nimble-align. It should probably use ENTRYPOINT to point to nimble.
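A minimal sketch of what such a Dockerfile might look like; the base image, the package source, and the nimble-align download step are all assumptions, not the project's actual build:

```dockerfile
FROM python:3.10-slim

# Install nimble; the exact package source (PyPI vs GitHub) is an assumption.
RUN pip install --no-cache-dir nimble

# nimble-align would be fetched here, e.g. from its GitHub releases page;
# the exact URL and mechanism depend on how releases are published.

# Point the container at nimble itself (note: ENTRYPOINT, one word).
ENTRYPOINT ["python", "-m", "nimble"]
```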
This workflow should work virtually as-is to build and publish a nightly docker image:
https://github.com/BimberLab/DISCVRSeq/blob/master/.github/workflows/docker.yml
Having both of these will be useful for running on exacloud.
One thing to consider: since nimble can make large temp files, the location used for those temp files should be configurable, perhaps as an argument or maybe environment variable. For docker usage, we probably do not want those written inside the container.
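One way the temp-dir resolution could look (the `NIMBLE_TMPDIR` variable name and the precedence order are hypothetical):

```python
import os
import tempfile

def resolve_temp_dir(cli_arg=None):
    """Pick the temp-file location.

    Precedence: explicit CLI argument, then a NIMBLE_TMPDIR environment
    variable (hypothetical name), then the system default (which itself
    honors TMPDIR). For docker usage, mounting a host path and pointing
    either source at it keeps large temp files out of the container layer.
    """
    return cli_arg or os.environ.get("NIMBLE_TMPDIR") or tempfile.gettempdir()
```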
I noticed that in the nimble output, there are rows with either empty feature names or what seems like an empty value concatenated with others (i.e. a trailing comma: "NKG2A,NKG2D,"). This shouldn't happen. My best guess is that this is due to collapsing by lineage: perhaps some references lack a value for lineage? In that situation, we should coalesce from the lineage field back to the original name. I don't know if this is actually the cause; it's also possible that some reads had zero alignments and a no-alignment result is being treated as an empty string.
Anyway, see this example, which used the NCR library, coalescing on lineage:
https://prime-seq.ohsu.edu/experiment/Labs/Bimber/653/showFile.view?rowId=6903002&inline=true
My comment is primarily a concern for BAMs, but if it can be added to both code paths, great:
Would it be practical for the reader code of nimble to remember the name of the last read it encountered, and throw an exception if the next read has the same name? If simple to implement, this would provide cheap insurance against a category of issue that would be easy to have, and hard to identify.
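A sketch of that guard as an iterator wrapper, assuming records arrive as (name, sequence) tuples after any mate pairing (names are illustrative, not nimble's reader API):

```python
def assert_no_consecutive_duplicates(reads):
    """Yield (name, seq) records, raising if two consecutive records share a name.

    Cheap insurance: only the previous name is remembered, so the cost is
    O(1) memory and one comparison per record.
    """
    last_name = None
    for name, seq in reads:
        if name == last_name:
            raise ValueError(f"Duplicate consecutive read name: {name}")
        last_name = name
        yield name, seq
```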
The user provides the NT sequence as a string or FASTA. The example CSV has a column for NT sequence length. Can't we figure this out in the Python code and put it into the compiled library automatically?
This should include at minimum (and possibly others):
At the moment, passing --debug produces an empty output file. It should contain the serialized DebugResult structure so we can introspect our alignment results.
When running nimble align, it should log its version and the version of nimble-align (or at least have a --verbose flag to opt into this).
A given read could have multiple passing alignments against different references. We either report those as a CSV string, or collapse to lineage and report those hits (if still ambiguous) as a CSV string. We should implement some type of max-hits-to-report flag in the library def (or some other spot, TBD). If a given final reference combination has more references than this threshold, those reads are simply stored as not matching. Or we could make a new category like 'Multi-Ref'.
We should apply this filter after lineage-coalescing, if that option is being used.
Examples:
max-hits-to-report = 5
raw hit = alleles A, B, C, D, G, H, R
collapsing on lineage not selected
initial call: A,B,C,D,G,H,R
number of hits is 7, which is greater than 5, so this call is filtered

max-hits-to-report = 5
raw hit = Lineage1-AlleleA, Lineage1-AlleleB, Lineage1-AlleleC, Lineage2-AlleleD, Lineage2-AlleleG, Lineage2-AlleleF
collapsing on lineage is used
initial call: Lineage1,Lineage2
number of hits is 2, so the call is allowed even though 6 raw alleles were matched
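The two examples above can be sketched as a small post-collapse filter (function name, fallback label, and default value are illustrative, not nimble's actual API):

```python
def apply_max_hits(hit_names, max_hits_to_report=5, fallback="Multi-Ref"):
    """Collapse a final set of reference names, or filter it if too ambiguous.

    hit_names is the post-collapse set (lineages if lineage-coalescing was
    used, otherwise raw reference names), so the threshold applies after
    coalescing, as proposed above.
    """
    names = sorted(set(hit_names))
    if len(names) > max_hits_to_report:
        return fallback  # or None, to store the read as non-matching
    return ",".join(names)
```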
If the user can provide a CSV, and we expect certain columns to exist in that CSV, the code should validate those expectations and fail in a reasonable way if the input is invalid. One example is a missing column name.
It might be nice if the code were more tolerant of the input structure (case-insensitive headers), perhaps supporting aliases (like sequence_name -> name), but this is less critical.
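A sketch combining both ideas; the required columns and aliases here are illustrative stand-ins, not nimble's real CSV schema:

```python
# Illustrative schema: required canonical column names and accepted aliases.
REQUIRED = {"name", "sequence"}
ALIASES = {"sequence_name": "name", "nt_sequence": "sequence"}

def validate_header(fieldnames):
    """Normalize headers case-insensitively, map aliases, and fail clearly."""
    normalized = [ALIASES.get(h.strip().lower(), h.strip().lower())
                  for h in fieldnames]
    missing = REQUIRED - set(normalized)
    if missing:
        raise ValueError(
            f"Input CSV is missing required column(s): {sorted(missing)}")
    return normalized
```

This would run once on the header row (e.g. `csv.DictReader.fieldnames`) before any records are parsed, so a malformed library fails immediately with a readable message.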
Currently, test.bam fails as an input path for generating the sorted bam file, but ./test.bam does not.
Another option would be to rewrite the paired-end parser in the BAM file-processing module to use the first_in_pair and second_in_pair flags instead of parsing positionally, at a cost in efficiency.
Ensure that each read, as we iterate through a UMI, is actually the read we expect it to be (direction, in the correct pair, etc)
The BAM workflow gets run when you pass --input R1.fastq.gz --input R2.fastq.gz, instead of fastq::align.
raw alignment-level detail
We can deal with this after we get the base functionality working; however, working with single-cell-scale data will probably create large outputs. The output is basically a sparse matrix, and it is common to save these to disk in sparse matrix formats (I think something like this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html), where you have one mtx.gz file holding the melted key/value pairs of the matrix, plus one txt.gz with the rownames and one txt.gz with the colnames. If my scipy link isn't the right one, I'm sure there's some out-of-the-box Python tool for this.
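scipy.io.mmwrite (with scipy.sparse) is the standard route; as a stdlib-only illustration of the three-file layout described above (one melted mtx.gz plus rownames/colnames txt.gz), under the assumption that counts arrive as a {(row_idx, col_idx): value} dict:

```python
import gzip

def write_sparse_counts(prefix, counts, row_names, col_names):
    """Write {(row_idx, col_idx): value} as a melted .mtx.gz plus name files.

    Mirrors the MatrixMarket coordinate layout that scipy.io.mmwrite
    produces; indices are written 1-based per that convention.
    """
    with gzip.open(f"{prefix}.mtx.gz", "wt") as f:
        f.write("%%MatrixMarket matrix coordinate integer general\n")
        f.write(f"{len(row_names)} {len(col_names)} {len(counts)}\n")
        for (i, j), v in sorted(counts.items()):
            f.write(f"{i + 1} {j + 1} {v}\n")
    for names, suffix in ((row_names, "rownames"), (col_names, "colnames")):
        with gzip.open(f"{prefix}.{suffix}.txt.gz", "wt") as f:
            f.write("\n".join(names) + "\n")
```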
I noticed a couple spots still referencing 'immunogenotyper', which probably needs to be 'nimble'
https://github.com/BimberLab/nimble/wiki/Quickstart
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/nimble/__main__.py
and also a few references to the devsebb github:
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/setup.py
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/nimble/usage.py
The current workflow is:
There is no good reason step 2 should exist. I'd propose:
In the first step, the input patterns could be: