nimble's Issues

group_on, when the target field is null

This might technically be a nimble-align issue. Nimble supports a 'group_on' param, which groups hits based on some arbitrary field in the library. I noticed that we have hits like ",Mamu-DRB1*03g". I assume this means one of the raw hits lacked a value for lineage, and it is therefore reporting null?

If a given record lacks a value for the group_on field, I suggest we coalesce and report the primary name, so we at least use some value.
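
The coalescing fallback could be as simple as the following sketch (the dict layout and the field names are illustrative, not nimble's actual data model):

```python
def group_value(record, group_on):
    """Return a record's group_on value, falling back to its primary name.

    `record` is assumed to be a dict of reference metadata with at least a
    'name' key; a missing or empty group_on field coalesces to the name,
    so we never emit a null/empty group label.
    """
    value = record.get(group_on)
    return value if value else record["name"]
```

With this, a reference lacking a lineage would be reported under its primary name instead of contributing an empty string to the joined hit.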

Enforce lack of comma in reference names, and also group_on fields

nimble reports multi-hit results, using commas to separate results. We should enforce the following:

  • The primary name of any reference cannot contain a comma
  • When using a field for group_on, we should either enforce the absence of commas or auto-replace them with underscores
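
A hypothetical guard covering both options (the function name and the replace/reject switch are illustrative):

```python
def sanitize_field(value, auto_replace=True):
    """Reject or repair commas in a reference name or group_on value.

    nimble joins multi-hits on commas, so an embedded comma would corrupt
    the output. Either swap commas for underscores or raise outright.
    """
    if "," not in value:
        return value
    if auto_replace:
        return value.replace(",", "_")
    raise ValueError(f"field may not contain a comma: {value!r}")
```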

Sort multi-ref allele hits so the string is deterministic

When reporting multi-ref hits, nimble collapses them into a CSV string, like:

NKG2A,NKG2C

The problem is that the order is random, so the same pair of references can be reported as two separate hits, like this:

NKG2A,NKG2C
NKG2C,NKG2A

I propose nimble always sorts the names in a multi-ref hit before joining on commas, which makes the name deterministic. If whatever code does this operation can easily do a natural sort, it should, since the names often contain numbers (like KIR3DL2,KIR3DL10) and the output would be more human-intuitive.
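
A sketch of the sort-then-join, including a natural sort so numeric suffixes order intuitively (illustrative code, not nimble's implementation):

```python
import re

def natural_key(name):
    """Split a name into text/number chunks so KIR3DL2 sorts before KIR3DL10."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]

def deterministic_hit(names):
    """Collapse a multi-ref hit into a stable, naturally sorted CSV string."""
    return ",".join(sorted(set(names), key=natural_key))
```

Both orderings of the same hit now collapse to one string, and `KIR3DL2` precedes `KIR3DL10` rather than sorting lexically.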

Add strandedness filters to alignment pipeline

Broadly, the goal of this is to ensure we capture the type of each input read (i.e. first or second of the pair), whether the input is FASTQ or BAM. We also want to record the orientation of each alignment relative to the reference and then use this information for filtration. I would divide this into the following tasks:

Task 1: Capture pair information:

FASTQ: When reading FASTQ, the reads are already in the orientation created by the sequencer, so no immediate changes are needed. The first FASTQ is the forward read and the second is the reverse read; the sequence in a FASTQ is always reported in the orientation in which the sequencer generated it.

BAM: When reading BAM input, use the BAM's "first in pair" / "second in pair" flags to determine whether the alignment came from a forward or reverse read, and carry that information forward for later filtering. Additionally, a BAM reports the read sequence post-alignment, always in the orientation of the reference genome. Therefore, when reading from a BAM file, check the BAM "read reverse strand" flag: if it is set, the sequence reported in the BAM is the reverse-complement of the original sequence (as it would have been in the FASTQ), so reverse-complement the read sequence first.

Said another way, the objective of this task is to always normalize the input read sequences to match their orientation as produced by the sequencer.
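
The Task 1 normalization can be sketched against the standard SAM FLAG bits (0x40 = first in pair, 0x80 = second in pair, 0x10 = read reverse strand); the helper below is illustrative, not nimble's reader code:

```python
def normalize_read(seq, flag):
    """Restore a BAM record's sequence to sequencer orientation.

    If the aligner stored the read reverse-complemented relative to the
    reference (FLAG bit 0x10), undo that so the sequence matches what the
    original FASTQ would have contained. Also report which mate this is.
    """
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    mate = 1 if flag & 0x40 else 2          # first or second of the pair
    if flag & 0x10:                          # stored on the reverse strand
        seq = seq.translate(comp)[::-1]      # reverse-complement back
    return seq, mate
```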

Task 2: Capture orientation of the alignment

As I understand it, the pseudoaligner doesn't handle reverse-complementing itself. Instead, we manually make a reverse-complement of each reference and align each read against both the original sequence and the reverse-complemented one. I therefore suggest simply tracking which version of the reference each alignment is against, and carrying this forward for downstream filtering.

As far as representing the orientation for a pair of hits, this is typically something like FF, FR, RF, or RR, where the first letter is the alignment orientation of the forward read and the second is that of the reverse read. We should also be able to capture the situation where the forward or reverse read has no hit, so maybe F, R, and U would be the characters for each mate?

Said another way, the broad goal is to make sure we capture the orientation of the alignment for each read and carry this downstream.

Task 3: Execute filtering.

One piece of this filter acts on each pair of reads. As far as I am aware, the forward and reverse reads of a given pair should always be in opposite orientations on a single reference (though I would suggest making this a config param). Let's say a single read pair has this set of alignments:

RefA: F/R = allowable
RefB: R/F = allowable
RefA: R/R = not allowable
RefA: U/R = allowable
RefC: F/U = allowable
RefA: R/F = allowable
RefR: R/R = not allowable

Note that the RefA F/R and RefB R/F are simultaneously allowable.

Both 10x and the older unstranded libraries should have this level of filtering applied, but again I would make this a configurable option (probably with a default to use it).
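
The per-pair check could look roughly like this sketch (the tuple layout and the default allowed set are assumptions based on the text above, with U meaning "no hit" for that mate):

```python
# Default reflects the unstranded case: mates with hits must be in
# opposite orientations, and a U on either side is tolerated.
UNSTRANDED = frozenset({"FR", "RF", "FU", "UF", "RU", "UR"})

def filter_pair_hits(hits, allowed=UNSTRANDED):
    """Keep (ref, fwd_orient, rev_orient) hits whose combined orientation
    string is in the allowed set. RefA F/R and RefB R/F survive together,
    while same-orientation pairs like R/R are dropped."""
    return [h for h in hits if h[1] + h[2] in allowed]
```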

Filtering Step 2:

In certain cases, a type of input data is stranded, so that forward reads should always align in one orientation relative to the ref and the reverse to another. The user would need to specify this at runtime in the config.

In 10x Genomics 5' libraries (the kind we use), the forward read should always align in the forward direction and the reverse read in the reverse direction. The 10x 3' case (which we don't currently use, but which is a valid input type) enforces the reverse orientation, the opposite of the above.

I think the implementation of this is the same as the suggestion above: a configurable list of allowable orientations. Unstranded libraries might allow: FR, FU, RF, RU, UR, UF

and then 10x Genomics 5' would just be: FR, FU, UR
and then 10x Genomics 3' would be the opposite: RF, RU, UF
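
These per-library allowed sets could live in a small lookup table; the library-type names here are illustrative, while the orientation sets come from the lists above:

```python
# Hypothetical per-library-type strandedness configuration.
ALLOWED_ORIENTATIONS = {
    "unstranded": {"FR", "FU", "RF", "RU", "UR", "UF"},
    "10x-5p":     {"FR", "FU", "UR"},
    "10x-3p":     {"RF", "RU", "UF"},
}

def orientation_allowed(library_type, fwd_orient, rev_orient):
    """Check whether a pair's F/R/U orientations pass for this library type."""
    return fwd_orient + rev_orient in ALLOWED_ORIENTATIONS[library_type]
```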

Notes:

It might be worth having an option to report counts for reverse hits. I'd maybe output a new feature called $REFNAME-AS (so SIVmac239-AS), where "AS" = anti-sense. Even if the library type (like 10x) requires a forward-orientation read-pair alignment, there would be value in some way to optionally report these hits and to understand they're being captured. The reason for wanting this might be technical (suggesting a nimble problem), but there could also be real biology behind it. Implementing this option might make sense as a library-level "ReportAntisenseAlignment" flag, defaulting to false. If true, rather than treating the reverse-complemented reference the same as the forward one, I would suggest renaming it right away and treating the two as completely distinct hits downstream. If we did that, all other filter logic would probably be unchanged from what's described above. This does have implications for lineage-level collapsing, though, since the ref names in the output would no longer match the reference metadata.

Awareness of temp files

nimble can create large temp files in some instances. We should:

  1. make sure that's easily configurable (like a command-line argument)
  2. make sure it defaults to $TEMP_DIR
  3. possibly log the temp file location to stderr during 'nimble align'

Create basic Dockerfile

We should make a basic Dockerfile for nimble, which installs nimble and downloads the latest nimble-align. It should probably use ENTRYPOINT to point to nimble.

This workflow should work virtually as-is to build and publish a nightly docker image:

https://github.com/BimberLab/DISCVRSeq/blob/master/.github/workflows/docker.yml

Having both of these will be useful for running on exacloud.

One thing to consider: since nimble can make large temp files, the location used for those temp files should be configurable, perhaps as an argument or maybe environment variable. For docker usage, we probably do not want those written inside the container.
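
A minimal sketch of what such a Dockerfile might look like; the base image, install steps, and temp path below are all placeholders, not the project's actual build procedure:

```dockerfile
# Sketch only: base image and install steps are placeholders.
FROM python:3.11-slim
# Install nimble here (substitute the real install command / repo URL),
# then fetch the latest nimble-align release binary.
# Keep large temp files out of the container by pointing at a mounted volume:
ENV TMPDIR=/data/tmp
# Note: the Docker directive is ENTRYPOINT (one word), not ENTRY_POINT.
ENTRYPOINT ["nimble"]
```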

Empty string as the feature name?

I noticed that in the nimble output there are rows with either empty feature names, or what seems like an empty value concatenated with others (i.e. a trailing comma: "NKG2A,NKG2D,"). This shouldn't happen. My best guess is that this is due to collapsing by lineage: perhaps some references lack a value for lineage? In that situation, we should coalesce from the lineage field back to the original name. I don't know if this is actually the cause, and I suppose it's possible that some reads instead had zero alignments, and no alignment is being treated like an empty string?

Anyway, see this example, which used the NCR library, coalescing on lineage:

https://prime-seq.ohsu.edu/experiment/Labs/Bimber/653/showFile.view?rowId=6903002&inline=true

Add validation to BAM/FASTQ input

My comment is primarily a concern for BAMs, but if it can be added to both code paths, great:

  • The input data fed to nimble should be unique pairs of sequence reads, where each readname is represented once per pair.
  • A BAM file is a format for storing alignments, which are not strictly the same as reads. A given read can in theory be present more than once in a BAM if it has two alignments. I don't think cellranger does this, but such a BAM would be technically valid.
  • Nimble already sorts the BAMs prior to input based on UMI and read name. It then iterates the BAM.

Would it be practical for nimble's reader code to remember the name of the last read it encountered and throw an exception if the next read has the same name? If simple to implement, this would provide cheap insurance against a category of issue that would be easy to hit and hard to identify.
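
The suggested last-read-name guard is a few lines; this sketch assumes the reader yields one record per read pair, in name-sorted order as described above (so duplicates would be adjacent):

```python
def check_unique(read_names):
    """Yield read names, raising on a duplicate.

    Because nimble sorts by UMI and read name before iterating, a repeated
    name would appear consecutively, so remembering only the previous name
    is enough to catch it cheaply.
    """
    last = None
    for name in read_names:
        if name == last:
            raise ValueError(f"duplicate read name in input: {name}")
        last = name
        yield name
```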

Why does CSV need to list sequence length?

The user provides the NT sequence as a string or FASTA. The example CSV has a column for NT sequence length. Can't we compute this in the python code and put it into the compiled library automatically?
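
Computing the length during generation would be trivial; the column names below are illustrative:

```python
def add_sequence_lengths(rows):
    """Fill the length column from the sequence itself, so the user never
    has to supply (or keep in sync) a hand-entered length value."""
    for row in rows:
        row["sequence_length"] = len(row["sequence"])
    return rows
```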

Re-enable --debug

At the moment, passing --debug produces an empty output file. It should contain the serialized DebugResult structure so we can introspect our alignment results.

Implement a reference-library level option for 'max-hits-to-report'

A given read could have multiple passing alignments against different references. We either report those as a CSV string, or collapse to lineage and report those hits (if still ambiguous) as a CSV string. We should implement some type of max-hits-to-report flag in the library def (or another spot TBD). If a given final reference combination has more references than this threshold, those reads are simply stored as not matching; or we could make a new category like 'Multi-Ref'.

We should apply this filter after lineage-coalescing, if that option is being used.

Examples:

max-hits-to-report = 5
raw hit = alleles A, B, C, D, G, H, R
collapsing on lineage not selected
initial call: A,B,C,D,G,H,R
number of hits is 7, which is greater than 5, so this hit is filtered

max-hits-to-report = 5
raw hit = Lineage1-AlleleA, Lineage1-AlleleB, Lineage1-AlleleC, Lineage2-AlleleD, Lineage2-AlleleG, Lineage2-AlleleF
collapsing on lineage is used
initial call: Lineage1,Lineage2
number of hits is 2, so the call is allowed even though 6 raw alleles were matched
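
The threshold check, applied after any lineage coalescing, might look like this sketch (the 'Multi-Ref' sentinel is one of the options floated above; returning None instead would store the read as not matching):

```python
def apply_max_hits(names, max_hits, sentinel="Multi-Ref"):
    """Collapse a final (post-lineage-coalescing) hit list, or replace it
    with a sentinel when the number of distinct hits exceeds the threshold."""
    unique = sorted(set(names))
    if len(unique) > max_hits:
        return sentinel
    return ",".join(unique)
```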

generate step should sanity check the input and fail more gracefully

If the user can provide a CSV, and we expect certain columns to exist in that CSV, the code should validate those expectations and fail in a reasonable way if the input is invalid. One example is a missing column name.

It might be nice if the code were more tolerant of the input structure (case-insensitive headers), perhaps supporting aliases (like sequence_name -> name), but this is less critical.
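
A validation pass covering both the required-column check and the case-insensitive/alias tolerance might look like this (the required set and alias map are illustrative):

```python
REQUIRED = {"name", "sequence"}
ALIASES = {"sequence_name": "name"}   # illustrative alias map

def validate_headers(fieldnames):
    """Normalize CSV headers case-insensitively, apply aliases, and fail
    with a readable message if a required column is missing."""
    normalized = [ALIASES.get(h.strip().lower(), h.strip().lower())
                  for h in fieldnames]
    missing = REQUIRED - set(normalized)
    if missing:
        raise ValueError("input CSV is missing required column(s): "
                         + ", ".join(sorted(missing)))
    return normalized
```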

Do extra read checking

Ensure that each read, as we iterate through a UMI, is actually the read we expect it to be (direction, in the correct pair, etc)

Option to save results as mtx file instead of TSV

We can deal with this after we get the base functionality working; however, working with single-cell-scale data will probably create large outputs. The output is basically a sparse matrix, and it is common to save these to disk in sparse-matrix formats (I think something like this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html), where you have one mtx.gz file with the melted key/value pairs of the matrix, one txt.gz with the row names, and one txt.gz with the column names. If my scipy link isn't correct, I'm sure there's some out-of-the-box python way to do this.

immunogenotyper -> nimble

Allow nimble create to work using CSV input?

The current workflow is:

  • nimble generate
  • do manual editing of a JSON file
  • nimble compile
  • nimble align

There is no good reason step 2 should exist. I'd propose:

  • nimble generate <FASTA | CSV | FASTA+CSV>
  • nimble compile
  • nimble align

In the first step, the input patterns could be:

  • Just give a FASTA. This implies you have essentially no metadata beyond the name, which is a valid use case.
  • Alternately, you give it a CSV alone. This CSV has a required 'name' column and a required 'sequence' column. In the simple case, the sequence is a literal string. Any other fields in the CSV become metadata.
  • A variation of the above is a CSV whose sequence field is either a literal string or something formatted like a URI. If it matches genbank://XXXX, we download the actual sequence (lazily, during the compile step). Note: this URI format should allow a subrange, like genbank://XXXX:23-2039, which tells nimble to subset the input.
  • In a final case, the user provides both a CSV and a FASTA. Maybe we allow the user to omit 'sequence', and in that case we require the FASTA to have a record whose header matches the 'name' in the CSV?
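
Parsing the proposed sequence field could distinguish literals from genbank:// URIs with an optional subrange; the helper (and the accession used in the example) are hypothetical:

```python
import re

# genbank:// with an optional :start-end subrange, per the proposal above.
_URI = re.compile(r"^genbank://([^:]+)(?::(\d+)-(\d+))?$")

def parse_sequence_field(value):
    """Classify a CSV 'sequence' value.

    Returns ('literal', value) for plain sequences, or
    ('genbank', accession, start, end) where start/end are None when no
    subrange was given."""
    m = _URI.match(value)
    if not m:
        return ("literal", value)
    acc, start, end = m.groups()
    return ("genbank", acc,
            int(start) if start else None,
            int(end) if end else None)
```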
