Comments (6)

mizraelson commented on July 18, 2024

Hi, does this protocol imply that you have a distinct pair of FASTQ files for every sorted cell?

from mixcr.

bshim181 commented on July 18, 2024

Yes, I believe that is the logic behind the protocol.

mizraelson commented on July 18, 2024

If you have multiple pairs of FASTQ files representing single-cell data, you can use the following preset:

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
      input_sample1_{{CELL:a}}_{{R}}.fastq.gz \
      result

Note that in the input file pattern input_sample1_{{CELL:a}}_{{R}}.fastq.gz:

  • {{CELL:a}} defines a place in the filename which will be used as a cell barcode.

For example, here is a list of input filenames that will be aggregated by the pattern above:

> ls
    input_sample1_A1_R1.fastq.gz
    input_sample1_A1_R2.fastq.gz
    input_sample1_A2_R1.fastq.gz
    input_sample1_A2_R2.fastq.gz
    input_sample1_A3_R1.fastq.gz
    input_sample1_A3_R2.fastq.gz
    input_sample1_A4_R1.fastq.gz
    input_sample1_A4_R2.fastq.gz
    ...

A1, A2, A3, and A4 will be used as cell barcodes.
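
To make the grouping concrete, here is a minimal Python sketch of how such a filename pattern could be expanded into cell barcodes. It only illustrates the idea and is not MiXCR's own implementation: the helper names `pattern_to_regex` and `group_by_cell` are hypothetical, and it assumes that `{{CELL:a}}` matches any non-empty text and that `{{R}}` matches `R1` or `R2`.

```python
import re
from collections import defaultdict


def pattern_to_regex(pattern):
    """Turn a filename pattern with {{CELL:a}} and {{R}} into a regex.

    Assumption (not taken from MiXCR's source): {{CELL:a}} matches any
    non-empty text and {{R}} matches R1 or R2.
    """
    escaped = re.escape(pattern)
    # re.escape works character by character, so the escaped placeholder
    # is guaranteed to appear verbatim inside the escaped pattern.
    escaped = escaped.replace(re.escape("{{CELL:a}}"), r"(?P<cell>.+)")
    escaped = escaped.replace(re.escape("{{R}}"), r"(?P<read>R[12])")
    return re.compile("^" + escaped + "$")


def group_by_cell(filenames, pattern):
    """Group matching filenames as {cell barcode: {read: filename}}."""
    regex = pattern_to_regex(pattern)
    groups = defaultdict(dict)
    for name in filenames:
        match = regex.match(name)
        if match:  # files that do not fit the pattern are ignored
            groups[match.group("cell")][match.group("read")] = name
    return dict(groups)


if __name__ == "__main__":
    files = [
        "input_sample1_A1_R1.fastq.gz",
        "input_sample1_A1_R2.fastq.gz",
        "input_sample1_A2_R1.fastq.gz",
        "input_sample1_A2_R2.fastq.gz",
    ]
    pattern = "input_sample1_{{CELL:a}}_{{R}}.fastq.gz"
    for cell, reads in sorted(group_by_cell(files, pattern).items()):
        print(cell, reads)
```

Running a check like this on a directory listing before starting the analysis is a cheap way to confirm that every cell has both an R1 and an R2 file.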

Again, if you can share some data obtained using the protocol from this publication, I can create and test a dedicated preset for this data. I tried to inquire about the raw data from the authors a few months ago, but I was unable to get a reply.

bshim181 commented on July 18, 2024

Hello, I ran the single-cell samples with the suggested parameters, and the C gene segments now seem to be found as well. The problem, however, is that in some single-cell samples either the TRA or the TRB chain is not identified from the sequenced reads. Primers for both the TRA and TRB chains were used to amplify the specific TRA and TRB chain sequences.

I am wondering if it is possible to build a custom library where I add the TRA and TRB chain primers to the reference sequences.

Furthermore, is it possible to force detection of both chains for each single cell sample? In terms of the results or downstream analysis, it would not make sense for a single cell result to have only TRA or TRB chain available.

Screenshot 2023-08-17 at 9 43 34 AM

This is also a concern for us because of how our current pipeline behaves. There, we first sort the reads into TRA-specific and TRB-specific sets using BLAST, and then run the old version of MiXCR on each set of FASTQ files with the corresponding chain specified. With that approach we were able to find representative TRA and TRB sequences that were not identified by simply running analyze on the complete pairs of FASTQ files.

mizraelson commented on July 18, 2024

Hi,

If you don't see certain chains in some of the wells, this issue requires a deeper investigation. It's quite common with single-cell data for things not to go according to plan. Sometimes chains do not amplify even though the primers were added, and sometimes cross-well contamination during sorting or PCR leads to multiple chains being identified in a single cell. Additionally, some cells do biologically express two different TRA chains.

The preset mentioned above utilizes filtering in a way that retains only those clones whose cumulative frequency is 95% or more for every cell and each chain (TRAD/TRB/TRG/IGH/IGL), as measured by the number of reads containing 'CDR3'. However, this shouldn't lead to the disappearance of all clones belonging to one type of chain.
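
As a rough illustration of this kind of filter, here is a minimal Python sketch of one plausible reading of the rule: within a single cell and chain, clones are ranked by read count and kept until the kept clones account for 95% of the reads. This is an assumption about the behavior, not MiXCR's actual code; the function name `keep_top_cumulative` and the toy read counts are hypothetical.

```python
def keep_top_cumulative(read_counts, share=0.95):
    """Keep the largest clones until they cover `share` of all reads.

    `read_counts` maps a clone id to its read count for ONE cell and ONE
    chain. This is a sketch of a cumulative-frequency filter, not the
    implementation used by MiXCR.
    """
    total = sum(read_counts.values())
    kept = []
    running = 0
    # Walk through clones from the most to the least abundant.
    for clone, n in sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break  # the clones kept so far already reach the target share
        kept.append(clone)
        running += n
    return kept


if __name__ == "__main__":
    # Toy numbers for a single cell and chain (hypothetical).
    print(keep_top_cumulative({"clone1": 900, "clone2": 80, "clone3": 20}))
```

Under this reading the most abundant clone of a chain is always retained whenever any clone exists, which is consistent with the point that the filter alone should not remove every clone of one chain type.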

Once we receive your data, I will investigate these issues manually and tailor the preset for optimal performance, addressing the protocol's specific issues.

To answer your questions:

  • The reference library includes gene sequences, not primers. There is no need to add primers to a reference library, as the sequences present in the pool will be aligned to the reference. The exception is when you work with genetically modified organisms such as humanized mice; in that case a custom library is required.
  • MiXCR finds all possible chains for every cell. I will attempt to tune the filtering to increase the yield, but it's not possible to force MiXCR to identify both chains for every cell (it will if the sequences are there). Also, as of now, there is no feature available to filter out cells that do not contain both chains from the output clonotype table. However, this is easy to do manually with Python or R. We'll soon drastically increase MiXCR's functionality, and such filtering will become possible.
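
For the manual filtering mentioned above, a minimal Python sketch might look like the following. It works on rows already read from an exported clonotype table. The column name `tagValueCELL` is the one that appears in the tables in this thread, while the `chain` column and the helper names are assumptions made for illustration; adjust them to the columns in your actual export.

```python
import csv
import io
from collections import defaultdict


def cells_with_both_chains(rows, cell_col="tagValueCELL", chain_col="chain",
                           required=("TRA", "TRB")):
    """Return the set of cells that have at least one clone for every required chain.

    `chain_col` is a hypothetical column name; use whichever column of your
    export carries the chain of each clone.
    """
    chains_by_cell = defaultdict(set)
    for row in rows:
        chains_by_cell[row[cell_col]].add(row[chain_col])
    return {cell for cell, chains in chains_by_cell.items()
            if set(required) <= chains}


def filter_paired(rows, cell_col="tagValueCELL", chain_col="chain"):
    """Keep only the rows belonging to cells with both TRA and TRB."""
    rows = list(rows)
    keep = cells_with_both_chains(rows, cell_col, chain_col)
    return [row for row in rows if row[cell_col] in keep]


if __name__ == "__main__":
    # Toy tab-separated table (hypothetical cells and clones).
    tsv = "tagValueCELL\tchain\tcloneId\nA1\tTRA\t0\nA1\tTRB\t1\nA2\tTRB\t2\n"
    rows = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    for row in filter_paired(rows):
        print(row["tagValueCELL"], row["chain"], row["cloneId"])
```

To use it on a real file, replace the in-memory string with `open(path, newline="")` and pass the resulting `csv.DictReader` to `filter_paired`.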

As soon as we receive your data, I will prepare the preset for you and add it to the development version. Also, please share the file names that raise your concern (where you were able to identify both chains with the separated FASTQ files approach but not with the new MiXCR preset), so I will pay close attention to them.

Sincerely,
Mark

mizraelson commented on July 18, 2024

Hi,
First of all, we have improved the consensus algorithm. To be more precise, it now better handles this protocol. Please use the latest development version, which you can download from here.

The command below produces reliable results:

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
      B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
      output

Below is a table displaying the number of aligned reads per chain for cells where only one of the chains (TRA or TRB) has been assembled.

tagValueCELL TRA TRB
003-10_S202 3 10037
003-11_S203 69534 22
003-14_S206 3 1721
003-15_S207 1 1
003-1_S193 40287 41523
003-38_S230 0 13
003-39_S231 2 1
003-3_S195 24 118270
003-42_S234 3 0
003-44_S236 1 6
003-8_S200 0 4
003-9_S201 0 2
004-18_S258 3 6316
004-19_S259 0 2
004-20_S260 15 48595
004-21_S261 42 123789
004-29_S269 2 7
004-30_S270 0 9
004-38_S278 7 0
004-3_S243 8 24297
004-47_S287 1 14
004-48_S288 0 1
004-4_S244 6 51056
004-5_S245 1 4
004-6_S246 14 20266
004-7_S247 360 54198
004-8_S248 2 1705
004-9_S249 11 28960
006-11_S11 12 12704
006-12_S12 8 26029
006-13_S13 9 2
006-15_S15 16686 6
006-30_S30 6 8
006-31_S31 0 2
006-34_S34 0 3
006-41_S41 3 2
006-42_S42 11 8525
006-44_S44 6 23
006-47_S47 0 16
006-48_S48 5 3
006-54_S54 0 12
006-55_S55 0 1
006-57_S57 4 0
006-5_S5 38 12844
006-60_S60 0 11
006-61_S61 6 10
006-68_S68 11 11156
006-72_S72 24053 7
006-7_S7 4 28040
006-80_S80 0 1
006-87_S87 12 10191
006-90_S90 2 14816
007-18_S114 2 2
007-31_S127 7 16688
007-48_S144 50089 3
007-51_S147 9 19080
007-57_S153 0 2
007-59_S155 0 1
007-65_S161 1 12685
007-67_S163 0 3400
007-71_S167 6 80697
007-76_S172 4 20287
007-7_S103 2291 2
007-9_S105 1 2

From this data, we can observe that cells missing one of the chains tend to have a low number of reads for these chains. Sometimes, if the reads are of good quality and consistent in their clone sequence, MiXCR manages to recover a clone. Also, if we see a significant number of reads for one chain from the same cell and only a few for the other, it suggests that the latter might not be trustworthy. Given the protocol pipeline, we would typically expect similar coverage for both chains in every cell.
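
To screen a table like the one above for imbalanced coverage, a small Python sketch along these lines could help. The rows are copied from the table in this comment; the cut-off of 100 reads is an arbitrary assumption chosen only for illustration, and the helper names are hypothetical.

```python
# A few rows copied from the table above (whitespace-separated).
TABLE = """\
tagValueCELL TRA TRB
003-10_S202 3 10037
003-11_S203 69534 22
003-15_S207 1 1
003-1_S193 40287 41523
004-7_S247 360 54198
"""


def parse_table(text):
    """Parse 'cell TRA TRB' lines into (cell, tra_reads, trb_reads) tuples."""
    lines = text.strip().splitlines()
    rows = []
    for line in lines[1:]:  # skip the header line
        cell, tra, trb = line.split()
        rows.append((cell, int(tra), int(trb)))
    return rows


def classify(tra, trb, min_reads=100):
    """Label a cell by which chains reach `min_reads` aligned reads.

    `min_reads` is an illustrative threshold, not a MiXCR default.
    """
    if tra >= min_reads and trb >= min_reads:
        return "both covered"
    if tra < min_reads and trb < min_reads:
        return "both low"
    return "TRA low" if tra < min_reads else "TRB low"


if __name__ == "__main__":
    for cell, tra, trb in parse_table(TABLE):
        print(cell, tra, trb, classify(tra, trb))
```

Cells labeled as covered on both chains but still missing a clone are the ones worth inspecting by hand, since low read counts cannot explain the missing chain there.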

An exception is the cell 003-1_S193. It has a significant number of reads for both TRA and TRB. However, a detailed examination of the TRA clone reveals the absence of a complete CDR3 sequence because the end of the V gene (CAVR) is trimmed, so no cysteine is present.
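
A quick way to spot clones like this one is to check whether the CDR3 amino acid sequence carries its conserved anchor residues: a cysteine at the start and a phenylalanine or tryptophan at the end. The Python sketch below does only that; the function name and the example sequences are hypothetical, and it is not a replacement for MiXCR's own CDR3 completeness logic.

```python
def has_conserved_cdr3_anchors(cdr3_aa):
    """Return True if the CDR3 starts with Cys (C) and ends with Phe (F) or Trp (W).

    A simplified check on the amino acid string only; it ignores frame
    shifts, stop codons and non-canonical anchors.
    """
    seq = cdr3_aa.strip().upper()
    return len(seq) >= 2 and seq[0] == "C" and seq[-1] in ("F", "W")


if __name__ == "__main__":
    # Hypothetical sequences: the second lacks the leading cysteine,
    # as would happen if the end of the V gene were trimmed.
    for seq in ("CAVRDSNYQLIW", "AVRDSNYQLIW"):
        print(seq, has_conserved_cdr3_anchors(seq))
```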

By adjusting certain parameters, we can potentially lower the quality thresholds to capture more clones (e.g., for 004-7_S247 where the number of reads for TRA is significant).

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    -Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.005 \
    -Massemble.cloneAssemblerParameters.minimalQuality=15 \
      B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
      output

With these parameters, the same table looks like this:

tagValueCELL TRA TRB
003-10_S202 3 10037
003-11_S203 69534 22
003-14_S206 3 1721
003-1_S193 40287 41523
003-38_S230 0 13
003-39_S231 2 1
003-3_S195 24 118270
003-42_S234 3 0
003-8_S200 0 4
003-9_S201 0 2
004-18_S258 3 6316
004-19_S259 0 2
004-20_S260 15 48595
004-21_S261 42 123789
004-28_S268 1 0
004-29_S269 2 7
004-30_S270 0 9
004-38_S278 7 0
004-3_S243 8 24297
004-48_S288 0 1
004-4_S244 6 51056
004-5_S245 1 4
004-6_S246 14 20266
004-8_S248 2 1705
004-9_S249 11 28960
006-11_S11 12 12704
006-12_S12 8 26029
006-13_S13 9 2
006-15_S15 16686 6
006-30_S30 6 8
006-31_S31 0 2
006-34_S34 0 3
006-41_S41 3 2
006-42_S42 11 8525
006-44_S44 6 23
006-47_S47 0 16
006-54_S54 0 12
006-55_S55 0 1
006-57_S57 4 0
006-5_S5 38 12844
006-60_S60 0 11
006-61_S61 6 10
006-68_S68 11 11156
006-72_S72 24053 7
006-7_S7 4 28040
006-80_S80 0 1
006-87_S87 12 10191
006-90_S90 2 14816
007-18_S114 2 2
007-31_S127 7 16688
007-48_S144 50089 3
007-51_S147 9 19080
007-53_S149 1 1
007-57_S153 0 2
007-59_S155 0 1
007-65_S161 1 12685
007-67_S163 0 3400
007-71_S167 6 80697
007-76_S172 4 20287
007-7_S103 2291 2
007-9_S105 1 2

I believe that these results are reasonable for these cells. While it's feasible to further tweak the parameters, it may introduce more noise into the data.

Please let me know if you have any other questions.

Sincerely,
Mark
