Hello, I am utilizing a scTCR protocol where it uses dual indexing t

Single Cell TCR Presets and Tag Pattern Set Up. about mixcr HOT 6 CLOSED

bshim181 commented on July 18, 2024

Single Cell TCR Presets and Tag Pattern Set Up.

from mixcr.

Comments (6)

mizraelson commented on July 18, 2024

Hi, does this protocol imply that you have a distinct pair of FASTQ files for every sorted cell?

bshim181 commented on July 18, 2024

Yes I believe that is the logic behind the protocol.

mizraelson commented on July 18, 2024

If you have multiple pairs of FASTQ files representing single-cell data, you can use the following preset:

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    input_sample1_{{CELL:a}}_{{R}}.fastq.gz \
    result

Note that in the input file pattern input_sample1_{{CELL:a}}_{{R}}.fastq.gz:

{{CELL:a}} defines a place in the filename which will be used as a cell barcode.

For example, here is a list of input filenames that will be aggregated by the pattern above:

> ls
    input_sample1_A1_R1.fastq.gz
    input_sample1_A1_R2.fastq.gz
    input_sample1_A2_R1.fastq.gz
    input_sample1_A2_R2.fastq.gz
    input_sample1_A3_R1.fastq.gz
    input_sample1_A3_R2.fastq.gz
    input_sample1_A4_R1.fastq.gz
    input_sample1_A4_R2.fastq.gz
    ...

A1, A2, A3, and A4 will be used as cell barcodes.
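
To illustrate how such a pattern groups files, here is a minimal Python sketch. It is not MiXCR's implementation: it only emulates the pattern with a regular expression, assuming that {{CELL:a}} captures any non-empty run of characters and that {{R}} stands for R1 or R2. The name group_by_cell is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: {{CELL:a}} captures any non-empty run of characters and
# {{R}} stands for the read index (R1 or R2). This mimics the pattern
# for illustration only; MiXCR does its own matching internally.
PATTERN = re.compile(r"^input_sample1_(?P<cell>.+)_(?P<read>R[12])\.fastq\.gz$")


def group_by_cell(filenames):
    """Map each cell barcode to its {read index: filename} pair."""
    cells = defaultdict(dict)
    for name in filenames:
        match = PATTERN.match(name)
        if match:  # files that do not fit the pattern are ignored
            cells[match["cell"]][match["read"]] = name
    return dict(cells)


files = [
    "input_sample1_A1_R1.fastq.gz",
    "input_sample1_A1_R2.fastq.gz",
    "input_sample1_A2_R1.fastq.gz",
    "input_sample1_A2_R2.fastq.gz",
]
grouped = group_by_cell(files)
print(sorted(grouped))
```

Each barcode ends up with one R1 file and one R2 file, which is the pairing the preset expects for every cell.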

Again, if you can share some data obtained using the protocol from this publication, I can create and test a dedicated preset for this data. I tried to inquire about the raw data from the authors a few months ago, but I was unable to get a reply.

bshim181 commented on July 18, 2024

Hello, I have tried the following parameters to run the single cell samples and I seem to find the C gene segments as well. The problem, however, is that there are some single cell samples where either the TRB or the TRA chains have not been identified by the sequenced reads. Primers for both TRA and TRB chains were used for the amplification of the specific TRA and TRB chain sequences.

I am wondering if it is possible to build a custom library where I add the TRA and TRB chain primers to the reference sequences.

Furthermore, is it possible to force detection of both chains for each single cell sample? In terms of the results or downstream analysis, it would not make sense for a single cell result to have only TRA or TRB chain available.

This is also a concern for us because of our current pipeline. There, we first sort the reads into TRA- and TRB-specific reads using BLAST, and then run the old version of MiXCR on the separated FASTQ files with the corresponding chain specified for each. With that approach we were able to find representative TRB/TRA sequences that were not identified by just running analyze on the full pairs of FASTQ files.

mizraelson commented on July 18, 2024

Hi,

If you don't see certain chains in some of the wells, this issue requires a deeper investigation. It's quite common with single-cell data for things not to go according to plan. Sometimes chains do not amplify even though the primers were added, and sometimes cross-well contamination during sorting or PCR leads to multiple chains being identified in a single cell. Additionally, some cells do biologically express two different TRA chains.

The preset mentioned above utilizes filtering in a way that retains only those clones whose cumulative frequency is 95% or more for every cell and each chain (TRAD/TRB/TRG/IGH/IGL), as measured by the number of reads containing 'CDR3'. However, this shouldn't lead to the disappearance of all clones belonging to one type of chain.
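
One way to read the filter described above is as a cumulative cutoff applied per cell and per chain: clones are ranked by read count and kept until the retained clones cover 95% of the reads. The Python sketch below illustrates that reading on made-up numbers. The function name and the exact cutoff rule are assumptions, not MiXCR's actual code.

```python
def keep_top_cumulative(read_counts, share=0.95):
    """Keep the largest clones until they cover `share` of all reads.

    read_counts: dict mapping a clone label to its read count, for one
    cell and one chain. Returns the labels of the retained clones.
    """
    total = sum(read_counts.values())
    kept, covered = [], 0
    # Rank clones from most to least abundant.
    for clone, count in sorted(read_counts.items(), key=lambda kv: -kv[1]):
        if total == 0 or covered >= share * total:
            break
        kept.append(clone)
        covered += count
    return kept


# Hypothetical counts for one cell and one chain.
example = {"cloneA": 960, "cloneB": 30, "cloneC": 10}
print(keep_top_cumulative(example))
```

Under this reading the dominant clone of a chain is always retained, which is why the filter alone should not remove every clone of one chain type.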

Once we receive your data, I will investigate these issues manually and tailor the preset for optimal performance, addressing each protocol's specific issues.

To answer your questions:

1. The reference library includes gene sequences, not primers. There is no need to add primers to a reference library, as the sequences present in the pool will be aligned to the reference. The exception is when you work with genetically modified organisms, such as humanized mice; in that case a custom library is required.
2. MiXCR finds all possible chains for every cell. I will attempt to tune the filtering to increase the yield, but it's not possible to force MiXCR to identify both chains for every cell (it will if the sequences are there). Also, as of now, there is no feature available to filter out cells that do not contain both chains from the output clonotype table. However, this is easy to do manually with Python or R. We'll soon drastically increase MiXCR's functionality, and such filtering will become possible.
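
The manual filtering mentioned above could look like the following Python sketch. It assumes the exported clonotype table has a cell column named tagValueCELL and a column giving the chain of each clone. The chain column name `chain`, the labels TRA and TRB, and the helper name are hypothetical, so adjust them to the columns in your export.

```python
import csv
import io
from collections import defaultdict


def cells_with_both_chains(rows, cell_col="tagValueCELL", chain_col="chain"):
    """Return rows belonging to cells that have both a TRA and a TRB clone."""
    rows = list(rows)
    chains_per_cell = defaultdict(set)
    for row in rows:
        chains_per_cell[row[cell_col]].add(row[chain_col])
    # A cell is "paired" when its set of chains contains both TRA and TRB.
    paired = {c for c, chains in chains_per_cell.items() if {"TRA", "TRB"} <= chains}
    return [row for row in rows if row[cell_col] in paired]


# Tiny made-up table standing in for an exported clonotype file.
demo = "tagValueCELL\tchain\nA1\tTRA\nA1\tTRB\nA2\tTRB\n"
rows = csv.DictReader(io.StringIO(demo), delimiter="\t")
paired_rows = cells_with_both_chains(rows)
print([r["tagValueCELL"] for r in paired_rows])
```

For a real export, replace the in-memory string with an opened file and pass the actual column names.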

As soon as we receive your data, I will prepare the preset for you and add it to the development version. Also, please share the file names that raise your concern (where you were able to identify both chains with the separated FASTQ files approach but not with the new MiXCR preset), so I will pay close attention to them.

Sincerely,
Mark

mizraelson commented on July 18, 2024

Hi,
First of all, we have improved the consensus algorithm. To be more precise, it now better handles this protocol. Please use the latest development version, which you can download from here.

The command below produces reliable results:

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
    output

Below is a table displaying the number of aligned reads per chain for cells where only one of the chains (TRA or TRB) has been assembled.

tagValueCELL	TRA	TRB
003-10_S202	3	10037
003-11_S203	69534	22
003-14_S206	3	1721
003-15_S207	1	1
003-1_S193	40287	41523
003-38_S230	0	13
003-39_S231	2	1
003-3_S195	24	118270
003-42_S234	3	0
003-44_S236	1	6
003-8_S200	0	4
003-9_S201	0	2
004-18_S258	3	6316
004-19_S259	0	2
004-20_S260	15	48595
004-21_S261	42	123789
004-29_S269	2	7
004-30_S270	0	9
004-38_S278	7	0
004-3_S243	8	24297
004-47_S287	1	14
004-48_S288	0	1
004-4_S244	6	51056
004-5_S245	1	4
004-6_S246	14	20266
004-7_S247	360	54198
004-8_S248	2	1705
004-9_S249	11	28960
006-11_S11	12	12704
006-12_S12	8	26029
006-13_S13	9	2
006-15_S15	16686	6
006-30_S30	6	8
006-31_S31	0	2
006-34_S34	0	3
006-41_S41	3	2
006-42_S42	11	8525
006-44_S44	6	23
006-47_S47	0	16
006-48_S48	5	3
006-54_S54	0	12
006-55_S55	0	1
006-57_S57	4	0
006-5_S5	38	12844
006-60_S60	0	11
006-61_S61	6	10
006-68_S68	11	11156
006-72_S72	24053	7
006-7_S7	4	28040
006-80_S80	0	1
006-87_S87	12	10191
006-90_S90	2	14816
007-18_S114	2	2
007-31_S127	7	16688
007-48_S144	50089	3
007-51_S147	9	19080
007-57_S153	0	2
007-59_S155	0	1
007-65_S161	1	12685
007-67_S163	0	3400
007-71_S167	6	80697
007-76_S172	4	20287
007-7_S103	2291	2
007-9_S105	1	2

From this data, we can observe that cells missing one of the chains tend to have a low number of reads for these chains. Sometimes, if the reads are of good quality and consistent in their clone sequence, MiXCR manages to recover a clone. Also, if we see a significant number of reads for one chain from the same cell and only a few for the other, it suggests that the latter might not be trustworthy. Given the protocol pipeline, we would typically expect similar coverage for both chains in every cell.
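
As a rough illustration of this reasoning, the Python sketch below reads a table in the format shown above and labels each cell by how its TRA and TRB read counts compare. The thresholds (at least 100 reads for the better-covered chain, fewer than 50 for the other) and the function name are arbitrary assumptions chosen for the example, not MiXCR settings.

```python
import csv
import io


def classify_cell(tra, trb, high=100, low=50):
    """Label a cell by comparing its TRA and TRB aligned-read counts."""
    if max(tra, trb) < high:
        return "low coverage for both chains"
    if min(tra, trb) < low:
        return "one chain dominant"
    return "both chains covered"


# Four rows copied from the table above.
table = (
    "tagValueCELL\tTRA\tTRB\n"
    "003-10_S202\t3\t10037\n"
    "003-15_S207\t1\t1\n"
    "003-1_S193\t40287\t41523\n"
    "004-7_S247\t360\t54198\n"
)
for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
    label = classify_cell(int(row["TRA"]), int(row["TRB"]))
    print(row["tagValueCELL"], label)
```

Cells labeled as having both chains covered but still missing a clone are the ones worth inspecting individually.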

An exception is the cell 003-1_S193. It has a significant number of reads for both TRA and TRB. However, a detailed examination of the TRA clone reveals the absence of a complete CDR3 sequence because the end of the V gene (CAVR) is trimmed, so no cysteine is present.

By adjusting certain parameters, we can potentially lower the quality thresholds to capture more clones (e.g., for 004-7_S247 where the number of reads for TRA is significant).

mixcr analyze generic-lt-single-cell-amplicon \
    --species hsa \
    --rna \
    --floating-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    -Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.005 \
    -Massemble.cloneAssemblerParameters.minimalQuality=15 \
    B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
    output

With these parameters, the same table (cells where only one of the chains has been assembled) looks like this:

tagValueCELL	TRA	TRB
003-10_S202	3	10037
003-11_S203	69534	22
003-14_S206	3	1721
003-1_S193	40287	41523
003-38_S230	0	13
003-39_S231	2	1
003-3_S195	24	118270
003-42_S234	3	0
003-8_S200	0	4
003-9_S201	0	2
004-18_S258	3	6316
004-19_S259	0	2
004-20_S260	15	48595
004-21_S261	42	123789
004-28_S268	1	0
004-29_S269	2	7
004-30_S270	0	9
004-38_S278	7	0
004-3_S243	8	24297
004-48_S288	0	1
004-4_S244	6	51056
004-5_S245	1	4
004-6_S246	14	20266
004-8_S248	2	1705
004-9_S249	11	28960
006-11_S11	12	12704
006-12_S12	8	26029
006-13_S13	9	2
006-15_S15	16686	6
006-30_S30	6	8
006-31_S31	0	2
006-34_S34	0	3
006-41_S41	3	2
006-42_S42	11	8525
006-44_S44	6	23
006-47_S47	0	16
006-54_S54	0	12
006-55_S55	0	1
006-57_S57	4	0
006-5_S5	38	12844
006-60_S60	0	11
006-61_S61	6	10
006-68_S68	11	11156
006-72_S72	24053	7
006-7_S7	4	28040
006-80_S80	0	1
006-87_S87	12	10191
006-90_S90	2	14816
007-18_S114	2	2
007-31_S127	7	16688
007-48_S144	50089	3
007-51_S147	9	19080
007-53_S149	1	1
007-57_S153	0	2
007-59_S155	0	1
007-65_S161	1	12685
007-67_S163	0	3400
007-71_S167	6	80697
007-76_S172	4	20287
007-7_S103	2291	2
007-9_S105	1	2

I believe that these results are reasonable for these cells. While it's feasible to further tweak the parameters, it may introduce more noise into the data.
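
To see which cells differ between the two runs, the cell identifiers in the two tables can be compared as sets. A minimal Python sketch, using only a few identifiers copied from the tables above (the variable names are illustrative):

```python
# Subsets of the cell IDs listed in the first and second tables above.
default_run = {"003-10_S202", "003-15_S207", "004-7_S247", "006-48_S48"}
relaxed_run = {"003-10_S202", "004-28_S268", "007-53_S149"}

# Cells listed only for the default run, only for the relaxed run, or in both.
only_default = sorted(default_run - relaxed_run)
only_relaxed = sorted(relaxed_run - default_run)
in_both = sorted(default_run & relaxed_run)

print("no longer listed:", only_default)
print("newly listed:", only_relaxed)
print("listed in both:", in_both)
```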

Please let me know if you have any other questions.

Sincerely,
Mark
