
nimble's Issues

Create basic Dockerfile

We should make a basic Dockerfile for nimble, which installs nimble and downloads the latest nimble-align. It should probably use ENTRYPOINT to point to nimble.
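Something along these lines might be a reasonable starting point; the base image, the install command, the nimble-align download step, and the entrypoint invocation are all assumptions to confirm:

```dockerfile
# Minimal sketch only; install method, download step, and entrypoint are assumptions.
FROM python:3.11-slim

# Install nimble from the GitHub repository (exact package/install method TBD)
RUN pip install git+https://github.com/BimberLab/nimble.git

# Placeholder: download the latest nimble-align release here and put it where
# nimble expects to find it (or let nimble fetch it on first run).

# ENTRYPOINT makes the container behave like the nimble CLI
ENTRYPOINT ["nimble"]
```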

The workflow below should work virtually as-is to build and publish a nightly Docker image:

https://github.com/BimberLab/DISCVRSeq/blob/master/.github/workflows/docker.yml

Having both of these will be useful for running on exacloud.

One thing to consider: since nimble can make large temp files, the location used for those temp files should be configurable, perhaps as an argument or an environment variable. For Docker usage, we probably do not want those written inside the container.

group_on, when the target field is null

This might technically be a nimble-align issue. Nimble supports a 'group_on' param, which groups hits based on some arbitrary field in the library. I noticed that we have hits like ",Mamu-DRB1*03g". I assume this means one of the raw hits lacked a value for lineage, and it is therefore reporting null?

If a given record lacks a value for the group_on field, I suggest we coalesce and report the primary name, so we at least use some value.
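A minimal sketch of the proposed fallback in Python (the record layout and field names are assumptions, and the second record below is hypothetical):

```python
def group_key(record: dict, group_on: str) -> str:
    """Return the grouping value, coalescing to the primary name when the field is empty."""
    value = record.get(group_on)
    return value if value else record["name"]

records = [
    {"name": "Mamu-DRB1*03:01", "lineage": "Mamu-DRB1*03g"},
    {"name": "Mamu-DRB9*01:01", "lineage": ""},  # no lineage: falls back to its own name
]
print(sorted({group_key(r, "lineage") for r in records}))
# ['Mamu-DRB1*03g', 'Mamu-DRB9*01:01']
```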

Implement a reference-library level option for 'max-hits-to-report'

A given read could have multiple passing alignments against different references. We either report those as a CSV string, or collapse to lineage and report those hits (if still ambiguous) as a CSV string. We should implement some type of max-hits-to-report flag in the library def (or other spot TBD). If a given final reference combination has more references than this threshold, those reads are simply stored as not matching. Or we could make a new category for 'Multi-Ref' or something.

We should apply this filter after lineage-coalescing, if that option is being used.

Examples:

max-hits-to-report = 5
raw hit = alleles A, B, C, D, G, H, R
collapsing on lineage not selected
initial call: A,B,C,D,G,H,R
number of hits is 7, which is greater than 5, so this reference is filtered

max-hits-to-report = 5
raw hit = Lineage1-AlleleA, Lineage1-AlleleB, Lineage1-AlleleC, Lineage2-AlleleD, Lineage2-AlleleG, Lineage2-AlleleF
collapsing on lineage is used.
initial call: Lineage1,Lineage2
number of hits is 2, so this is allowed even though six raw alleles were matched
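A sketch of this logic in Python, assuming a simple lineage lookup and using None as the "filtered" sentinel (both are assumptions, not existing nimble behavior):

```python
def final_call(hits, max_hits_to_report, lineage_lookup=None):
    """Collapse raw hits (optionally by lineage), then drop over-ambiguous calls."""
    if lineage_lookup is not None:
        # Coalesce to lineage first, falling back to the allele name if no lineage is set
        hits = {lineage_lookup.get(h) or h for h in hits}
    hits = sorted(set(hits))
    if len(hits) > max_hits_to_report:
        return None  # or a sentinel category such as "Multi-Ref"
    return ",".join(hits)

# Example 1: no lineage collapsing; 7 alleles > 5, so the call is filtered
print(final_call(["A", "B", "C", "D", "G", "H", "R"], 5))  # None

# Example 2: collapsing on lineage yields 2 hits, which passes the threshold
lineages = {f"Lineage1-Allele{x}": "Lineage1" for x in "ABC"}
lineages.update({f"Lineage2-Allele{x}": "Lineage2" for x in "DGF"})
print(final_call(list(lineages), 5, lineages))  # Lineage1,Lineage2
```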

Do extra read checking

Ensure that each read, as we iterate through a UMI, is actually the read we expect it to be (direction, in the correct pair, etc)

Add strandedness filters to alignment pipeline

Broadly, the goal of this is to ensure we capture the type of each input read (i.e. first or second of the pair), and to do this whether from FASTQ or BAM. We also want to record the orientation of the alignment relative to the reference and then use this information for filtration. I would divide this into two projects:

Task 1: Capture pair information:

FASTQ: When reading FASTQ, the reads should be in the orientation as created by the sequencer. No immediate changes needed. The first FASTQ is the forward read and the second is the reverse read. The sequence of a read in the FASTQ is always reported in the orientation in which it was generated by the sequencer.

BAM: When reading a BAM input, use the BAM's "first in pair" or "second in pair" flag to determine whether a given BAM alignment came from a forward or reverse read, and carry that information forward for later filtering. Additionally, the BAM reports the read sequence post-alignment, and it always reports the read sequence in the orientation of the reference genome. Therefore, when reading from a BAM file, use the BAM "read reverse strand" flag. If it is set, the sequence reported in the BAM is the reverse-complement of the original sequence (as it would have been in the FASTQ), so reverse-complement the read sequence first to restore the original orientation.

Said another way, the objective of this task is to always normalize the input read sequences to match their orientation as produced by the sequencer.
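A sketch of this normalization using pysam; this is illustrative only, not nimble's actual reader, and assumes pysam is available:

```python
import pysam

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def normalized_reads(bam_path: str):
    """Yield (read_name, mate, sequence) with the sequence in sequencer orientation."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for rec in bam:
            seq = rec.query_sequence
            if seq is None:
                continue
            # BAM stores the sequence in reference orientation; if the aligner
            # reverse-complemented the read, undo that to recover the original.
            if rec.is_reverse:
                seq = seq.translate(COMPLEMENT)[::-1]
            mate = 1 if rec.is_read1 else 2  # "first in pair" / "second in pair" flags
            yield rec.query_name, mate, seq
```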

Task 2: Capture orientation of the alignment

As I understand it, the pseudoaligner doesn't understand reverse-complementing anyway, so you manually make a reverse-complement of each reference and align each read against both the original sequence and the reverse-complemented one. Therefore I suggest just tracking which version of the reference each alignment is against, and carrying this forward for downstream filtering.

As far as representing orientation for a pair of hits, typically this is something like: FF, FR, RF, RR. The first letter is the alignment orientation of the forward read and the second is the orientation of the reverse read. We should also be able to capture the situation where the forward or reverse read does not have a hit, so maybe F, R, and U would be the characters.

Said another way, the broad goal is to make sure we capture the orientation of the alignment for each read and carry this downstream.

Task 3: Execute filtering.

One piece of this filter acts on each pair of reads. As far as I am aware, 100% of the time (though I would suggest making this a config param), the forward and reverse read of a given pair should be in opposite orientations on a single reference. Let's say a single read pair has this set of alignments:

RefA: F/R = allowable
RefB: R/F = allowable
RefA: R/R = not allowable
RefA: U/R = allowable
RefC: F/U = allowable
RefA: R/F = allowable
RefR: R/R = not allowable

Note that the RefA F/R and RefB R/F are simultaneously allowable.

Both 10x and the older unstranded libraries should have this level of filtering applied, but again I would make this a configurable option (probably with a default to use it).

Filtering Step 2:

In certain cases, a type of input data is stranded, so that forward reads should always align in one orientation relative to the ref and the reverse to another. The user would need to specify this at runtime in the config.

In 10x genomics 5-prime libraries (the kind we use), the forward read should always align in the forward direction and the reverse read should always align in the reverse. The 10x 3-prime case (which we don't currently use, but is a valid input type) is the opposite of the above: the enforced orientations are reversed.

I think the implementation of this is the same as the suggestion above: a list of allowable orientation pairs. Therefore, unstranded libraries might allow: FR, FU, RF, RU, UR, UF

and then 10x genomics 5' would just be: FR, FU, UR
and then 10x genomics 3' would be the opposite: RF, RU, UF
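A sketch of that filter in Python; the allowed sets are taken from the lists above, and the library-type names and function signature are assumptions:

```python
# Allowed orientation pairs (forward-read orientation + reverse-read orientation)
ALLOWED = {
    "unstranded": {"FR", "FU", "RF", "RU", "UR", "UF"},
    "10x-5prime": {"FR", "FU", "UR"},
    "10x-3prime": {"RF", "RU", "UF"},
}

def keep_pair_hit(forward_orient: str, reverse_orient: str, library_type: str) -> bool:
    """forward_orient/reverse_orient are 'F', 'R', or 'U' (no hit) against one reference."""
    return forward_orient + reverse_orient in ALLOWED[library_type]

# RefA F/R passes for both library types; R/R is rejected everywhere
print(keep_pair_hit("F", "R", "unstranded"))   # True
print(keep_pair_hit("R", "R", "10x-5prime"))   # False
```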

Notes:

It might be worth having an option to report counts for reverse hits. I'd maybe output a new feature called $REFNAME-AS (so SIVmac239-AS); "AS" = anti-sense. Therefore, even if the library type (like 10x) required a forward-orientation read-pair alignment, there would be value in having some way to optionally report these and to understand they're being captured. The reason for knowing this might be technical (suggesting a nimble problem), but there could also be real biology behind it. Implementing this option might make sense as a library-level "ReportAntisenseAlignment" flag, defaulting to false. If true, rather than treat the reverse-complemented reference as the same as the forward, I would suggest renaming it right away and treating the two as completely distinct hits downstream. If we did that, all other filter logic would probably be unchanged from what's described above. This has implications for lineage-level collapsing though, since the ref names in the output would no longer be identical to the reference metadata.

Enforce lack of comma in reference names, and also group_on fields

nimble reports multi-hit results, using commas to separate results. We should enforce the following:

  • The primary name of any reference cannot contain a comma
  • When using a field for group_on, we should either enforce the absence of commas, or auto-replace commas with underscores
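A minimal validation sketch (the function and field names are hypothetical):

```python
def validate_reference(record: dict, group_on=None) -> dict:
    """Reject commas in the primary name; auto-replace commas in the group_on value."""
    if "," in record["name"]:
        raise ValueError(f"Reference name may not contain a comma: {record['name']!r}")
    if group_on and record.get(group_on):
        record[group_on] = record[group_on].replace(",", "_")
    return record
```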

Empty string as the feature name?

I noticed that in the nimble output, there are rows with either empty feature names, or what seems like an empty value concatenated with others (i.e. a trailing comma: "NKG2A,NKG2D,"). This shouldn't happen. My best guess is that perhaps this is due to collapsing by lineage? Perhaps some references lack a value for lineage? In that situation, we should coalesce from the lineage field back to the original name. I don't know if this is actually the cause, and I guess it's possible that instead some reads had zero alignments, and it's treating no alignments like an empty string?

Anyway, see this example, which used the NCR library, coalescing on lineage:

https://prime-seq.ohsu.edu/experiment/Labs/Bimber/653/showFile.view?rowId=6903002&inline=true

Allow nimble create to work using CSV input?

The current workflow is:

  • nimble generate
  • do manual editing of a JSON file
  • nimble compile
  • nimble align

There is no good reason step 2 should exist. I'd propose:

  • nimble generate <FASTA | CSV | FASTA+CSV>
  • nimble compile
  • nimble align

In the first step, the input patterns could be:

  • Just give a FASTA. This implies you have essentially no metadata beyond the name, which is a valid use case.
  • Alternatively, you give it a CSV alone. This CSV has a required 'name' column and a required 'sequence' column. In the simple case, the sequence is a literal string. Any other fields in the CSV become metadata.
  • A variation of the above is to give it a CSV where the sequence field is either a literal string or something formatted like a URI. If that matches genbank://XXXX, we download (lazily during the compile step) the actual sequence. Note: this URI format should allow a substring, like genbank://XXXX:23-2039, which tells nimble to subset the input.
  • In a final case, the user provides a CSV and FASTA. Maybe we allow the user to omit 'sequence', and in this case we require the FASTA to have a record where the header matches the 'name' of the CSV?
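A sketch of how the sequence field could be resolved, assuming 1-based inclusive coordinates; the URI pattern follows the examples above, and fetch_genbank() is a hypothetical helper:

```python
import re

URI_PATTERN = re.compile(r"^genbank://(?P<accession>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")

def resolve_sequence(value: str, fetch_genbank) -> str:
    """Return the literal sequence, or lazily fetch (and optionally subset) a GenBank record."""
    match = URI_PATTERN.match(value)
    if match is None:
        return value  # treat as a literal nucleotide string
    seq = fetch_genbank(match["accession"])
    if match["start"]:
        # 1-based inclusive coordinates, e.g. genbank://XXXX:23-2039
        seq = seq[int(match["start"]) - 1 : int(match["end"])]
    return seq
```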

Add validation to BAM/FASTQ input

My comment is primarily a concern for BAMs, but if it can be added to both code paths, great:

  • The input data fed to nimble should be unique pairs of sequence reads, where each readname is represented once per pair.
  • A BAM file is a format to store alignments, which are not strictly the same as reads. A given read can in theory be present more than once in a BAM if there are two alignments. I don't think cellranger does this, but the BAM would be technically valid if it did.
  • Nimble already sorts the BAMs prior to input based on UMI and read name. It then iterates the BAM.

Would it be practical for the reader code of nimble to remember the name of the last read it encountered, and throw an exception if the next read has the same name? If simple to implement, this would provide cheap insurance against a category of issue that would be easy to introduce and hard to identify.
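A sketch of that check, assuming the input stream is already sorted by read name and yields one entry per read pair (the function name is hypothetical):

```python
def check_unique_adjacent(read_names):
    """Raise if the same read name appears twice in a row in a name-sorted stream."""
    last = None
    for name in read_names:
        if name == last:
            raise ValueError(f"Read name seen more than once in input: {name!r}")
        last = name
        yield name
```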

Awareness of temp files

nimble can create large temp files in some instances. We should:

  1. make sure that's easily configurable (like a command-line argument)
  2. make sure it defaults to $TEMP_DIR
  3. possibly log the temp file location to stderr during 'nimble align'
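A sketch covering items 1-3; the argument and environment-variable names are assumptions:

```python
import argparse
import os
import sys
import tempfile

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tmpdir",
    default=os.environ.get("NIMBLE_TMPDIR", tempfile.gettempdir()),  # default to the system temp dir
)
args = parser.parse_args([])

# Log the resolved location to stderr so users know where large files will land
print(f"nimble temp files will be written to {args.tmpdir}", file=sys.stderr)
```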

Option to save results as mtx file instead of TSV

We can deal with this after we get the base functionality working; however, it is probable that working with single-cell scale data will create large outputs. The output is basically a sparse matrix, and it is common to save these to disk in sparse matrix formats (I think something like this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html), where you have one mtx.gz file with the melted key/value pairs of the matrix, one txt.gz with the row names, and one txt.gz with the column names. If my scipy link isn't correct, I'm sure there's some out-of-the-box python thing to do this.
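A sketch using scipy's MatrixMarket writer; the file names and the features-by-barcodes layout are assumptions:

```python
import gzip
import numpy as np
from scipy import io, sparse

counts = sparse.csr_matrix(np.array([[0, 3], [1, 0]]))  # features x cells, mostly zeros
features = ["NKG2A", "NKG2D"]
barcodes = ["AAACCTG", "AAACGGG"]

io.mmwrite("counts.mtx", counts)  # could be gzipped to counts.mtx.gz afterwards
with gzip.open("features.txt.gz", "wt") as fh:
    fh.write("\n".join(features) + "\n")
with gzip.open("barcodes.txt.gz", "wt") as fh:
    fh.write("\n".join(barcodes) + "\n")
```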

generate step should sanity check the input and fail more gracefully

If the user can provide a CSV, and if we expect certain columns to exist in that CSV, the code should do some validation of those expectations and fail in a reasonable way if the input is invalid. One example is a missing 'name' column.

It might be nice if the code were more tolerant of the input structure (case-insensitive headers), perhaps supporting aliases (like sequence_name -> name), but this is less critical.
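A sketch of that validation; the required columns and the alias map are assumptions:

```python
import csv

ALIASES = {"sequence_name": "name"}
REQUIRED = {"name", "sequence"}

def read_reference_csv(path: str) -> list[dict]:
    """Read the input CSV, normalizing headers and failing clearly if columns are missing."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        headers = [ALIASES.get(h.strip().lower(), h.strip().lower()) for h in (reader.fieldnames or [])]
        missing = REQUIRED - set(headers)
        if missing:
            raise SystemExit(f"Input CSV is missing required column(s): {', '.join(sorted(missing))}")
        return [dict(zip(headers, row.values())) for row in reader]
```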

Why does CSV need to list sequence length?

The user provides the NT sequence as a string or FASTA. The example CSV has a column for NT sequence length. Can't we figure this out in the python code and put it into the compiled library automatically?
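Presumably something as simple as this would do (the column name is an assumption):

```python
records = [{"name": "NKG2A", "sequence": "ATGGACAAC"}]
for record in records:
    # Derive the length from the sequence instead of requiring a column in the CSV
    record["nt_length"] = len(record["sequence"])
```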

Suggestions for nimble debug/QC report

  1. There is a plot showing the distribution of positions in the BAM. This is useful, but I would see if it is possible to convert the x-axis from genomicPosition (which we need to use in the raw data) and report it by chromosome. In R/ggplot I solved this in the past by manually specifying breaks and labels. For example, if you know that chromosome 1 position 1 = genomicPosition 1, you can label it as such. You can repeat this for each contig. You would need a sequence dictionary providing ordered contigs and lengths, but you probably already have this.

1b) This will be a little tricky when the actual region is a very specific zone within one contig. However, perhaps you could explicitly test for this by inspecting the min/max genomicPositions. If their contigs are the same, then maybe label the x-axis as "Position on XX"? In this case, reporting the actual contig position would be useful to people. Another thought might be to report the x-axis using actual contig position, but facet the plot horizontally by contig (ideally allowing the python plot code to distribute space unevenly)? I actually wonder if this idea would be easier to implement than what I suggested in #1. See the sketch after this list for one way to compute per-contig breaks.

  2. Related to the plot above, I would facet by R1/R2 if possible. Putting these above one another probably makes the most sense. I think you have them as two lines, but it's not easy to see both now.

  3. Just a comment: the "Concordance between the input BAM calls and the corresponding nimble call" is probably useful.

  4. In the "Distribution of nimble base pair scores across r1 and r2 mates" plot, do you think it would make sense to just drop unaligned reads (maybe report that as a number or percentage in the title text), to make the range within aligned reads clearer?

  5. In the "Distribution of nimble base pair scores across r1 and r2 mates" plot, since there is a hard cutoff being applied, would it make sense to add a dotted line at this threshold to denote it?

  6. Why is the 'Density of positions reported in the input BAM file for read-pairs that received a nimble alignment' plot faceted for NKG2D, but not the others? Are those two different contigs? If so, you might already be doing what I suggested in 1b. I would just: a) label them as such, b) allow the facets to have different widths, c) report positions relative to that contig, and d) still facet R1/R2 vertically.

  7. Comment: CCR7 seems to have a real clear issue with R1 alignment, for example. Same for CD27. Same for SELL.

  8. Comment: the "Concordance between the input BAM calls and the corresponding nimble call" plots are informative for the LILRs in particular.

  9. This is picky, but what's the sort order of the features? Maybe it should be alphabetic?
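A sketch of the break/label idea from #1, assuming the report code uses matplotlib and has a sequence dictionary of ordered contig lengths (the contig names, lengths, and positions below are illustrative only):

```python
import itertools
import matplotlib.pyplot as plt

contigs = [("chr1", 1000), ("chr2", 800)]            # ordered (name, length) pairs from the sequence dictionary
offsets = dict(zip([c for c, _ in contigs],
                   itertools.accumulate([0] + [l for _, l in contigs[:-1]])))

positions = [50, 400, 1100, 1500]                    # example absolute genomicPosition values
fig, ax = plt.subplots()
ax.hist(positions, bins=50)
ax.set_xticks([offsets[c] + 1 for c, _ in contigs])  # one break at the start of each contig
ax.set_xticklabels([f"{c}:1" for c, _ in contigs])
ax.set_xlabel("Position (labelled by contig)")
plt.savefig("position_distribution.png")
```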

There are some additional plots I think would be useful:

  1. Could you summarize the data in terms of F/F, F/U, F/R, R/R, R/U, etc.? Maybe as a tile plot, where the color is scaled based on the proportion of read pairs in each category?

  2. Can you summarize reads that are filtered and the reason for filtering? I realize nimble doesn't have complete access to all the data yet, but whatever information is available would be useful. For example, we see a lot of cases where features have strong R2 hits, but lots of zeros for R1. Understanding what happened to those R1s could be important.

  3. Since the set of features can be large, it might be a nice enhancement to have some kind of bullet list of feature names at the top, where each is a link that jumps to the start of that feature's section.

  4. It would probably be useful to print a basic header first, with something like "Nimble QC Report", that also prints the date run, tool version, perhaps the exact command executed, the name of the reference library, etc.

  5. Should sense/antisense be represented in any of these figures? For example, if a given feature is getting a lot of R1 or R2 hits in the opposite direction (which the F/R figure might also report), then we should have a way to know about that.

Re-enable --debug

At the moment, passing --debug produces an empty output file. It should contain the serialized DebugResult structure so we can introspect our alignment results.

Sort multi-ref allele hits so the string is deterministic

When reporting multi-ref hits, nimble collapses them into a CSV string, like:

NKG2A,NKG2C

The problem is that the order is random, so we can have two separate hits reported, like this:

NKG2A,NKG2C
NKG2C,NKG2A

I propose nimble always sorts the names in a multi-ref hit before joining on the comma. This makes the name deterministic. If whatever code does this operation can easily do a natural sort, we should, since the names often contain numbers (like KIR3DL2,KIR3DL10) and the output would be more intuitive for humans.
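A sketch of deterministic, natural-sorted joining in Python (this is a common natural-sort idiom, not existing nimble code):

```python
import re

def natural_key(name: str):
    """Split a name into text and integer chunks so numeric parts sort numerically."""
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", name)]

def join_hits(hits) -> str:
    """Produce a deterministic, naturally sorted multi-ref name."""
    return ",".join(sorted(set(hits), key=natural_key))

print(join_hits(["NKG2C", "NKG2A"]))       # NKG2A,NKG2C
print(join_hits(["KIR3DL10", "KIR3DL2"]))  # KIR3DL2,KIR3DL10
```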

immunogenotyper -> nimble
