bimberlab / nimble
nimble — execute lightweight, flexible alignments on arbitrary reference libraries
License: MIT License
This might technically be a nimble-align issue. nimble supports a 'group_on' param, which groups hits based on some arbitrary field in the library. I noticed that we have hits like ",Mamu-DRB1*03g" (note the leading comma). I assume this means one of the raw hits lacked a value for lineage, and it is therefore reporting null?
If a given record lacks a value for the group_on field, I suggest we coalesce and report the primary name, so we at least use some value.
nimble reports multi-hit results, using commas to separate results. We should enforce the following:
When reporting multi-ref hits, nimble collapses them into a CSV string, like:
NKG2A,NKG2C
The problem is that the order is random, so the same combination can be reported as two separate hits, like this:
NKG2A,NKG2C
NKG2C,NKG2A
I propose that nimble always sorts the names in a multi-ref hit before joining on the comma. This makes the name deterministic. If whatever code does this operation can easily do a natural sort, it should, since the names often contain numbers (e.g. KIR3DL2,KIR3DL10), and the output would then be more intuitive to a human reader.
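The proposed deterministic join, with a natural sort so numbered names order intuitively, could be sketched like this (illustrative code, not nimble-align's actual implementation):

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'KIR3DL2' sorts before 'KIR3DL10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

def collapse_multi_hit(names):
    """Deterministically join a multi-reference hit into one CSV string."""
    return ",".join(sorted(set(names), key=natural_key))
```

With this, `NKG2C,NKG2A` and `NKG2A,NKG2C` both collapse to the same string, and `KIR3DL2` sorts ahead of `KIR3DL10` despite plain string order saying otherwise.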
Broadly, the goal of this is to ensure we capture the type of each input read (i.e. first or second of the pair), and to do this whether from FASTQ or BAM. We also want to record the orientation of the alignment relative to the reference and then use this information for filtration. I would divide this into two projects:
Task 1: Capture pair information:
FASTQ: When reading FASTQ, the reads are already in the orientation created by the sequencer, so no immediate changes are needed. The first FASTQ is the forward read and the second is the reverse read. The sequence of a read in a FASTQ is always reported in the orientation in which it was generated by the sequencer.
BAM: When reading a BAM input, use the BAM's "first in pair" / "second in pair" flags to determine whether a given alignment came from a forward or reverse read, and carry that information forward for later filtering. Additionally, a BAM reports the read sequence post-alignment, and it always reports the sequence in the orientation of the reference genome. Therefore, when reading from a BAM file, use the BAM "read reverse strand" flag: if it is set, the sequence reported in the BAM is the reverse-complement of the original sequence (as it would have appeared in the FASTQ), so reverse-complement the read sequence first to restore it.
Said another way, the objective of this task is to always normalize the input read sequences to match their orientation as produced by the sequencer.
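The BAM-side normalization can be sketched using the raw SAM flag bits (in practice one would use pysam's `is_read1` / `is_read2` / `is_reverse` properties; this pure-Python sketch just shows the logic):

```python
# SAM flag bits, per the SAM specification
FIRST_IN_PAIR  = 0x40
SECOND_IN_PAIR = 0x80
READ_REVERSE   = 0x10

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def normalize_bam_read(flag, seq):
    """Return (mate, sequence) with the sequence restored to sequencer orientation.

    mate is 1 for first-in-pair, 2 for second-in-pair, None if unpaired.
    If the alignment is on the reverse strand, the BAM stores the
    reverse-complement of the original read, so we undo that here.
    """
    mate = 1 if flag & FIRST_IN_PAIR else (2 if flag & SECOND_IN_PAIR else None)
    if flag & READ_REVERSE:
        seq = reverse_complement(seq)
    return mate, seq
```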
Task 2: Capture orientation of the alignment
As I understand it, the pseudoaligner does not handle reverse complements itself; instead, we manually make a reverse complement of each reference and align each read against both the original sequence and the reverse-complemented one. I therefore suggest simply tracking which version of the reference each alignment hit, and carrying this forward for downstream filtering.
As far as representing orientation for a pair of hits, this is typically something like FF, FR, RF, or RR, where the first letter is the alignment orientation of the forward read and the second is that of the reverse read. We should also be able to capture the situation where the forward or reverse read has no hit. So maybe F, R, and U would be the characters for each read?
Said another way, the broad goal is to make sure we capture the orientation of the alignment for each read and carry this downstream.
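A tiny sketch of the proposed F/R/U pair encoding (the function name is illustrative, not nimble's API):

```python
def pair_orientation(fwd_orient, rev_orient):
    """Encode a read pair's alignment orientations as a two-letter code.

    Each read's orientation is 'F' (aligned to the original reference),
    'R' (aligned to the reverse-complemented reference), or 'U' (no hit).
    First letter = forward (R1) read, second = reverse (R2) read.
    """
    allowed = {"F", "R", "U"}
    if fwd_orient not in allowed or rev_orient not in allowed:
        raise ValueError("orientation must be one of F, R, U")
    return fwd_orient + rev_orient
```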
Task 3: Execute filtering.
One piece of this filter acts against each pair of reads. As far as I am aware, the forward and reverse read of a given pair should, 100% of the time (though I would suggest making this a config param), be in opposite orientations on a single reference. Let's say a single read pair has this set of alignments:
RefA: F/R = allowable
RefB: R/F = allowable
RefA: R/R = not allowable
RefA: U/R = allowable
RefC: F/U = allowable
RefA: R/F = allowable
RefR: R/R = not allowable
Note that the RefA F/R and RefB R/F are simultaneously allowable.
Both 10x and the older unstranded libraries should have this level of filtering applied, but again I would make this a configurable option (probably with a default to use it).
Filtering Step 2:
In certain cases, a type of input data is stranded, so that forward reads should always align in one orientation relative to the ref and the reverse to another. The user would need to specify this at runtime in the config.
In 10x Genomics 5-prime libraries (the kind we use), the forward read should always align in the forward direction and the reverse read in the reverse direction. The 10x 3-prime case (which we don't currently use, but which is a valid input type) enforces the opposite.
I think the implementation of this is the same as the allowed-orientation list suggested above. Unstranded libraries might therefore allow: FR, FU, RF, RU, UR, UF
and then 10x genomics 5' would just be: FR, FU, UR
and then 10x genomics 3' would be the opposite: RF, RU, UF
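The three allowed-orientation lists above amount to a simple set lookup; the library-type names and function here are illustrative, not a shipped nimble config:

```python
# Allowed pair-orientation codes per library type, as proposed in the text.
ALLOWED_ORIENTATIONS = {
    "unstranded": {"FR", "FU", "RF", "RU", "UR", "UF"},
    "10x-5p":     {"FR", "FU", "UR"},
    "10x-3p":     {"RF", "RU", "UF"},
}

def pair_passes(orientation_code, library_type="unstranded"):
    """Return True if a pair's orientation code is allowed for this library type."""
    return orientation_code in ALLOWED_ORIENTATIONS[library_type]
```

Note that this naturally rejects R/R and F/F pairs for every library type, matching the "opposite orientations" rule in filtering step 1.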
Notes:
It might be worth having an option to report counts for reverse hits. I'd maybe output a new feature called $REFNAME-AS (so SIVmac239-AS), where "AS" = anti-sense. Even if the library type (like 10x) required a forward-orientation read pair alignment, there would be value in having some way to optionally report these and to understand that they are being captured. The reason for a reverse hit might be technical (suggesting a nimble problem), but there could also be real biology behind it. Implementing this option might make sense as a library-level "ReportAntisenseAlignment" flag, defaulting to false. If true, rather than treat the reverse-complemented reference as the same as the forward one, I would suggest renaming it right away and treating the two as completely distinct hits downstream. If we did that, all other filter logic would probably be unchanged from what's described above. This does have implications for lineage-level collapsing, though, since the ref names in the output would no longer be identical to the reference metadata.
nimble can create large temp files in some instances. We should:
Implement collapse of the ambiguous results in accordance with the filtering doc
We should make a basic Dockerfile for nimble, which installs nimble and downloads the latest nimble-align. It should probably use ENTRYPOINT to point to nimble.
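A minimal sketch of what such a Dockerfile might look like; the base image, the package source, and the nimble-align download step are all assumptions, not the project's actual build:

```dockerfile
FROM python:3.10-slim

# Install nimble; the exact package source (PyPI vs GitHub) is an assumption.
RUN pip install --no-cache-dir nimble

# nimble-align would be fetched here, e.g. from its GitHub releases page;
# the exact URL and mechanism depend on how releases are published.

# Point the container at nimble itself (note: ENTRYPOINT, one word).
ENTRYPOINT ["python", "-m", "nimble"]
```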
This workflow should work virtually as-is to build and publish a nightly docker image:
https://github.com/BimberLab/DISCVRSeq/blob/master/.github/workflows/docker.yml
Having both of these will be useful for running on exacloud.
One thing to consider: since nimble can make large temp files, the location used for those temp files should be configurable, perhaps as an argument or maybe environment variable. For docker usage, we probably do not want those written inside the container.
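One way the temp-dir resolution could look (the `NIMBLE_TMPDIR` variable name and the precedence order are hypothetical):

```python
import os
import tempfile

def resolve_temp_dir(cli_arg=None):
    """Pick the temp-file location.

    Precedence: explicit CLI argument, then a NIMBLE_TMPDIR environment
    variable (hypothetical name), then the system default (which itself
    honors TMPDIR). For docker usage, mounting a host path and pointing
    either source at it keeps large temp files out of the container layer.
    """
    return cli_arg or os.environ.get("NIMBLE_TMPDIR") or tempfile.gettempdir()
```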
I noticed that in the nimble output, there are rows with either empty feature names or what seems like an empty value concatenated with others (i.e. a trailing comma: "NKG2A,NKG2D,"). This shouldn't happen. My best guess is that this is due to collapsing by lineage: perhaps some references lack a value for lineage? In that situation, we should coalesce from the lineage field back to the original name. I don't know if this is actually the cause; it's also possible that some reads had zero alignments and a no-alignment result is being treated as an empty string.
Anyway, see this example, which used the NCR library, coalescing on lineage:
https://prime-seq.ohsu.edu/experiment/Labs/Bimber/653/showFile.view?rowId=6903002&inline=true
My comment is primarily a concern for BAMs, but if it can be added to both code paths, great:
Would it be practical for the reader code of nimble to remember the name of the last read it encountered, and throw an exception if the next read has the same name? If simple to implement, this would provide cheap insurance against a category of issue that would be easy to have, and hard to identify.
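A sketch of that guard as an iterator wrapper, assuming records arrive as (name, sequence) tuples after any mate pairing (names are illustrative, not nimble's reader API):

```python
def assert_no_consecutive_duplicates(reads):
    """Yield (name, seq) records, raising if two consecutive records share a name.

    Cheap insurance: only the previous name is remembered, so the cost is
    O(1) memory and one comparison per record.
    """
    last_name = None
    for name, seq in reads:
        if name == last_name:
            raise ValueError(f"Duplicate consecutive read name: {name}")
        last_name = name
        yield name, seq
```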
The user provides the NT sequence as a string or FASTA. The example CSV has a column for NT sequence length. Can't we figure this out in the Python code and put it into the compiled library automatically?
This should include at minimum (and possibly others):
At the moment, passing --debug produces an empty output file. It should contain the serialized DebugResult structure so we can introspect our alignment results.
When running nimble align, it should log its version and the version of nimble-align (or at least have a --verbose flag to opt into this).
A given read could have multiple passing alignments against different references. We either report those as a CSV string, or collapse to lineage and report those hits (if still ambiguous) as a CSV string. We should implement some type of max-hits-to-report flag in the library def (or some other spot, TBD). If a given final reference combination has more references than this threshold, those reads are simply stored as not matching. Or we could make a new category like 'Multi-Ref'.
We should apply this filter after lineage-coalescing, if that option is being used.
Examples:
max-hits-to-report = 5
raw hit = alleles A, B, C, D, G, H, R
collapsing on lineage not selected
initial call: A,B,C,D,G,H,R
number of hits is 7, which is greater than 5, so this call is filtered

max-hits-to-report = 5
raw hit = Lineage1-AlleleA, Lineage1-AlleleB, Lineage1-AlleleC, Lineage2-AlleleD, Lineage2-AlleleG, Lineage2-AlleleF
collapsing on lineage is used
initial call: Lineage1,Lineage2
number of hits is 2, so the call is allowed even though 6 raw alleles were matched
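The two examples above can be sketched as a small post-collapse filter (function name, fallback label, and default value are illustrative, not nimble's actual API):

```python
def apply_max_hits(hit_names, max_hits_to_report=5, fallback="Multi-Ref"):
    """Collapse a final set of reference names, or filter it if too ambiguous.

    hit_names is the post-collapse set (lineages if lineage-coalescing was
    used, otherwise raw reference names), so the threshold applies after
    coalescing, as proposed above.
    """
    names = sorted(set(hit_names))
    if len(names) > max_hits_to_report:
        return fallback  # or None, to store the read as non-matching
    return ",".join(names)
```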
If the user can provide a CSV, and we expect certain columns to exist in that CSV, the code should validate those expectations and fail in a reasonable way if the input is invalid. One example is a missing column name.
It might be nice if the code were more tolerant of the input structure (case-insensitive headers), perhaps supporting aliases (like sequence_name -> name), but this is less critical.
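A sketch combining both ideas; the required columns and aliases here are illustrative stand-ins, not nimble's real CSV schema:

```python
# Illustrative schema: required canonical column names and accepted aliases.
REQUIRED = {"name", "sequence"}
ALIASES = {"sequence_name": "name", "nt_sequence": "sequence"}

def validate_header(fieldnames):
    """Normalize headers case-insensitively, map aliases, and fail clearly."""
    normalized = [ALIASES.get(h.strip().lower(), h.strip().lower())
                  for h in fieldnames]
    missing = REQUIRED - set(normalized)
    if missing:
        raise ValueError(
            f"Input CSV is missing required column(s): {sorted(missing)}")
    return normalized
```

This would run once on the header row (e.g. `csv.DictReader.fieldnames`) before any records are parsed, so a malformed library fails immediately with a readable message.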
Currently, test.bam fails as an input path for generating the sorted bam file, but ./test.bam does not.
Another option would be to rewrite the paired-end parser in the BAM file-processing module to use the first_in_pair and second_in_pair flags instead of parsing positionally, at a cost in efficiency.
Ensure that each read, as we iterate through a UMI, is actually the read we expect it to be (direction, in the correct pair, etc)
The BAM workflow gets run when you pass --input R1.fastq.gz --input R2.fastq.gz, instead of fastq::align.
raw alignment-level detail
We can deal with this after we get the base functionality working; however, working with single-cell-scale data will probably create large outputs. The output is basically a sparse matrix, and it is common to save these to disk in sparse matrix formats (I think something like this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html), where you have one mtx.gz file holding the melted key/value pairs of the matrix, plus one txt.gz with the rownames and one txt.gz with the colnames. If my scipy link isn't the right one, I'm sure there's some out-of-the-box Python tool for this.
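scipy.io.mmwrite (with scipy.sparse) is the standard route; as a stdlib-only illustration of the three-file layout described above (one melted mtx.gz plus rownames/colnames txt.gz), under the assumption that counts arrive as a {(row_idx, col_idx): value} dict:

```python
import gzip

def write_sparse_counts(prefix, counts, row_names, col_names):
    """Write {(row_idx, col_idx): value} as a melted .mtx.gz plus name files.

    Mirrors the MatrixMarket coordinate layout that scipy.io.mmwrite
    produces; indices are written 1-based per that convention.
    """
    with gzip.open(f"{prefix}.mtx.gz", "wt") as f:
        f.write("%%MatrixMarket matrix coordinate integer general\n")
        f.write(f"{len(row_names)} {len(col_names)} {len(counts)}\n")
        for (i, j), v in sorted(counts.items()):
            f.write(f"{i + 1} {j + 1} {v}\n")
    for names, suffix in ((row_names, "rownames"), (col_names, "colnames")):
        with gzip.open(f"{prefix}.{suffix}.txt.gz", "wt") as f:
            f.write("\n".join(names) + "\n")
```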
I noticed a couple spots still referencing 'immunogenotyper', which probably needs to be 'nimble'
https://github.com/BimberLab/nimble/wiki/Quickstart
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/nimble/__main__.py
and also a few references to the devsebb github:
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/setup.py
https://github.com/BimberLab/nimble/blob/af8d08cd8c1e06854ac977817ef15fddf671ca08/nimble/usage.py
The current workflow is:
There is no good reason step 2 should exist. I'd propose:
In the first step, the input patterns could be: