Comments (6)
Hi, does this protocol imply that you have a distinct pair of FASTQ files for every sorted cell?
from mixcr.
Yes I believe that is the logic behind the protocol.
If you have multiple pairs of FASTQ files representing single-cell data, you can use the following preset:
mixcr analyze generic-lt-single-cell-amplicon \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
input_sample1_{{CELL:a}}_{{R}}.fastq.gz \
result
Note that in the input file pattern input_sample1_{{CELL:a}}_{{R}}.fastq.gz, the {{CELL:a}} part defines the place in the filename that will be used as a cell barcode.
For example, here is a list of input filenames that will be aggregated by the pattern above:
> ls
input_sample1_A1_R1.fastq.gz
input_sample1_A1_R2.fastq.gz
input_sample1_A2_R1.fastq.gz
input_sample1_A2_R2.fastq.gz
input_sample1_A3_R1.fastq.gz
input_sample1_A3_R2.fastq.gz
input_sample1_A4_R1.fastq.gz
input_sample1_A4_R2.fastq.gz
...
A1, A2, A3, and A4 will be used as cell barcodes.
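To check which files a pattern like this would pick up before launching a run, the grouping can be approximated in Python. This is only a sketch: the regular expression below is a hypothetical stand-in for input_sample1_{{CELL:a}}_{{R}}.fastq.gz and is not MiXCR's own pattern matcher, and the helper name group_by_cell is made up for illustration.

```python
import re

# Hypothetical regex approximation of input_sample1_{{CELL:a}}_{{R}}.fastq.gz.
# This is NOT how MiXCR parses the pattern internally; it only mimics the idea.
PATTERN = re.compile(
    r"^input_sample1_(?P<cell>[A-Za-z0-9]+)_(?P<read>R[12])\.fastq\.gz$"
)


def group_by_cell(filenames):
    """Map each cell barcode to a {read: filename} dictionary."""
    cells = {}
    for name in filenames:
        match = PATTERN.match(name)
        if match is None:
            continue  # the file does not fit the pattern, so it is skipped
        cells.setdefault(match["cell"], {})[match["read"]] = name
    return cells


files = [
    "input_sample1_A1_R1.fastq.gz",
    "input_sample1_A1_R2.fastq.gz",
    "input_sample1_A2_R1.fastq.gz",
    "input_sample1_A2_R2.fastq.gz",
    "unrelated_file.txt",
]

grouped = group_by_cell(files)
for cell, reads in sorted(grouped.items()):
    # .get() avoids a KeyError when one mate of a pair is missing
    print(cell, reads.get("R1"), reads.get("R2"))
```

Cells with only one of R1/R2 show up with None for the missing mate, which makes incomplete pairs easy to spot.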
Again, if you can share some data obtained using the protocol from this publication, I can create and test a dedicated preset for this data. I tried to inquire about the raw data from the authors a few months ago, but I was unable to get a reply.
Hello, I have tried the following parameters to run the single cell samples and I seem to find the C gene segments as well. The problem, however, is that there are some single cell samples where either the TRB or the TRA chains have not been identified by the sequenced reads. Primers for both TRA and TRB chains were used for the amplification of the specific TRA and TRB chain sequences.
I am wondering if it is possible to build a custom library where I add the TRA and TRB chain primers to the reference sequences.
Furthermore, is it possible to force detection of both chains for each single cell sample? In terms of the results or downstream analysis, it would not make sense for a single cell result to have only TRA or TRB chain available.
[Screenshot 2023-08-17 at 9 43 34 AM]
This is also a concern for us because, in our current pipeline, we first sort the reads into TRA- and TRB-specific reads using BLAST and then run the old version of MiXCR on the separated FASTQ files, with the corresponding chain specified for each. With that approach we were able to find representative TRA and TRB sequences that were not identified by simply running analyze on the complete pairs of FASTQ files.
Hi,
If you don't see certain chains in some of the wells, this issue requires a deeper investigation. It's quite common with single-cell data for things not to go according to plan. Sometimes chains do not amplify even though the primers were added, and sometimes cross-well contamination during sorting or PCR leads to multiple chains being identified in a single cell. Additionally, some cells do biologically express two different TRA chains.
The preset mentioned above utilizes filtering in a way that retains only those clones whose cumulative frequency is 95% or more for every cell and each chain (TRAD/TRB/TRG/IGH/IGL), as measured by the number of reads containing 'CDR3'. However, this shouldn't lead to the disappearance of all clones belonging to one type of chain.
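As a rough illustration of this kind of filter, here is a minimal Python sketch of one possible reading of the sentence above: for a given cell and chain, keep the largest clones until together they cover 95% of the reads. The function name, the clone names, and the read counts are all hypothetical, and the preset's actual filter may work differently.

```python
def keep_top_cumulative(clone_reads, share=0.95):
    """Keep the largest clones until they cover `share` of all reads.

    `clone_reads` maps a clone name to its read count for ONE cell and
    ONE chain. This is an assumed interpretation, not MiXCR's code.
    """
    total = sum(clone_reads.values())
    if total == 0:
        return []
    kept, covered = [], 0
    ranked = sorted(clone_reads.items(), key=lambda kv: kv[1], reverse=True)
    for clone, reads in ranked:
        if covered >= share * total:
            break  # the clones kept so far already reach the target share
        kept.append(clone)
        covered += reads
    return kept


# Hypothetical read counts for the clones of one chain in one cell.
dominant = {"cloneA": 960, "cloneB": 30, "cloneC": 10}
split = {"cloneA": 600, "cloneB": 380, "cloneC": 20}

print(keep_top_cumulative(dominant))
print(keep_top_cumulative(split))
```

Under this reading at least one clone always survives for any chain that has reads, which is consistent with the remark that the filter alone should not remove every clone of a chain.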
Once we receive your data, I will investigate these issues manually and tailor the preset for optimal performance, addressing the issues specific to this protocol.
To answer your questions:
- The reference library includes gene sequences, not primers. There is no need to add primers to a reference library, as the sequences present in the pool will be aligned to the reference. The exception is when you work with genetically modified organisms such as humanized mice; in that case a custom library is required.
- MiXCR finds all possible chains for every cell. I will attempt to tune the filtering to increase the yield, but it's not possible to force MiXCR to identify both chains for every cell (it will if the sequences are there). Also, as of now, there is no feature available to filter out cells that do not contain both chains from the output clonotype table. However, this is easy to do manually with Python or R. We'll soon drastically increase MiXCR's functionality, and such filtering will become possible.
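The manual filtering mentioned in the last point could look roughly like the Python sketch below. It assumes a tab-separated clonotype table with one row per clone, a cell column named tagValueCELL (as in the tables in this thread), and a chain column named chain. The chain column name, the readCount column, and the example rows are hypothetical, so adjust them to whatever your export actually contains.

```python
import csv
import io
from collections import defaultdict

REQUIRED_CHAINS = {"TRA", "TRB"}


def cells_with_both_chains(rows, cell_col="tagValueCELL", chain_col="chain"):
    """Return the set of cells that have at least one clone per required chain."""
    chains_per_cell = defaultdict(set)
    for row in rows:
        chains_per_cell[row[cell_col]].add(row[chain_col])
    return {
        cell for cell, found in chains_per_cell.items()
        if REQUIRED_CHAINS <= found
    }


def filter_paired(rows, cell_col="tagValueCELL", chain_col="chain"):
    """Keep only the rows that belong to cells with both TRA and TRB."""
    rows = list(rows)
    keep = cells_with_both_chains(rows, cell_col, chain_col)
    return [row for row in rows if row[cell_col] in keep]


# Hypothetical table content; a real run would read the exported file instead.
TSV = (
    "tagValueCELL\tchain\treadCount\n"
    "A1\tTRA\t120\n"
    "A1\tTRB\t340\n"
    "A2\tTRB\t95\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
paired = filter_paired(rows)
for row in paired:
    print(row["tagValueCELL"], row["chain"], row["readCount"])
```

Only the standard library is used here; the same grouping is a short groupby if you prefer pandas or R.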
As soon as we receive your data, I will prepare the preset for you and add it to the development version. Also, please share the file names that raise your concern (where you were able to identify both chains with the separated FASTQ files approach but not with the new MiXCR preset), so I will pay close attention to them.
Sincerely,
Mark
Hi,
First of all, we have improved the consensus algorithm. To be more precise, it now better handles this protocol. Please use the latest development version, which you can download from here.
The command below produces reliable results:
mixcr analyze generic-lt-single-cell-amplicon \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
output
Below is a table displaying the number of aligned reads per chain for cells where only one of the chains (TRA or TRB) has been assembled.
tagValueCELL | TRA | TRB |
---|---|---|
003-10_S202 | 3 | 10037 |
003-11_S203 | 69534 | 22 |
003-14_S206 | 3 | 1721 |
003-15_S207 | 1 | 1 |
003-1_S193 | 40287 | 41523 |
003-38_S230 | 0 | 13 |
003-39_S231 | 2 | 1 |
003-3_S195 | 24 | 118270 |
003-42_S234 | 3 | 0 |
003-44_S236 | 1 | 6 |
003-8_S200 | 0 | 4 |
003-9_S201 | 0 | 2 |
004-18_S258 | 3 | 6316 |
004-19_S259 | 0 | 2 |
004-20_S260 | 15 | 48595 |
004-21_S261 | 42 | 123789 |
004-29_S269 | 2 | 7 |
004-30_S270 | 0 | 9 |
004-38_S278 | 7 | 0 |
004-3_S243 | 8 | 24297 |
004-47_S287 | 1 | 14 |
004-48_S288 | 0 | 1 |
004-4_S244 | 6 | 51056 |
004-5_S245 | 1 | 4 |
004-6_S246 | 14 | 20266 |
004-7_S247 | 360 | 54198 |
004-8_S248 | 2 | 1705 |
004-9_S249 | 11 | 28960 |
006-11_S11 | 12 | 12704 |
006-12_S12 | 8 | 26029 |
006-13_S13 | 9 | 2 |
006-15_S15 | 16686 | 6 |
006-30_S30 | 6 | 8 |
006-31_S31 | 0 | 2 |
006-34_S34 | 0 | 3 |
006-41_S41 | 3 | 2 |
006-42_S42 | 11 | 8525 |
006-44_S44 | 6 | 23 |
006-47_S47 | 0 | 16 |
006-48_S48 | 5 | 3 |
006-54_S54 | 0 | 12 |
006-55_S55 | 0 | 1 |
006-57_S57 | 4 | 0 |
006-5_S5 | 38 | 12844 |
006-60_S60 | 0 | 11 |
006-61_S61 | 6 | 10 |
006-68_S68 | 11 | 11156 |
006-72_S72 | 24053 | 7 |
006-7_S7 | 4 | 28040 |
006-80_S80 | 0 | 1 |
006-87_S87 | 12 | 10191 |
006-90_S90 | 2 | 14816 |
007-18_S114 | 2 | 2 |
007-31_S127 | 7 | 16688 |
007-48_S144 | 50089 | 3 |
007-51_S147 | 9 | 19080 |
007-57_S153 | 0 | 2 |
007-59_S155 | 0 | 1 |
007-65_S161 | 1 | 12685 |
007-67_S163 | 0 | 3400 |
007-71_S167 | 6 | 80697 |
007-76_S172 | 4 | 20287 |
007-7_S103 | 2291 | 2 |
007-9_S105 | 1 | 2 |
From this data, we can observe that cells missing one of the chains tend to have a low number of reads for these chains. Sometimes, if the reads are of good quality and consistent in their clone sequence, MiXCR manages to recover a clone. Also, if we see a significant number of reads for one chain from the same cell and only a few for the other, it suggests that the latter might not be trustworthy. Given the protocol pipeline, we would typically expect similar coverage for both chains in every cell.
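The same reasoning can be applied mechanically to the per-chain read counts. The sketch below uses a few rows copied from the table above and a cutoff of 100 reads, which is a hypothetical value chosen only for illustration and not a MiXCR default.

```python
# (cell, TRA reads, TRB reads), copied from the table above.
COUNTS = [
    ("003-10_S202", 3, 10037),
    ("003-11_S203", 69534, 22),
    ("003-15_S207", 1, 1),
    ("003-1_S193", 40287, 41523),
    ("004-7_S247", 360, 54198),
]

MIN_READS = 100  # hypothetical cutoff, not a MiXCR parameter


def classify(tra, trb, min_reads=MIN_READS):
    """Label a cell by which chains have at least `min_reads` aligned reads."""
    if tra >= min_reads and trb >= min_reads:
        return "both chains covered"
    if tra < min_reads and trb < min_reads:
        return "low coverage overall"
    return "TRA under-covered" if tra < trb else "TRB under-covered"


labels = {cell: classify(tra, trb) for cell, tra, trb in COUNTS}
for cell, label in labels.items():
    print(f"{cell}\t{label}")
```

Cells labelled as covered for both chains but still missing a clone are the ones worth inspecting by hand.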
An exception is the cell 003-1_S193. It has a significant number of reads for both TRA and TRB. However, a detailed examination of the TRA clone reveals the absence of a complete CDR3 sequence because the end of the V gene (CAVR) is trimmed, so no cysteine is present.
By adjusting certain parameters, we can potentially lower the quality thresholds to capture more clones (e.g., for 004-7_S247 where the number of reads for TRA is significant).
mixcr analyze generic-lt-single-cell-amplicon \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
-Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.005 \
-Massemble.cloneAssemblerParameters.minimalQuality=15 \
B{{CELL:a}}_L001_{{R}}_001.fastq.gz \
output
With these parameters, the corresponding table looks like this:
tagValueCELL | TRA | TRB |
---|---|---|
003-10_S202 | 3 | 10037 |
003-11_S203 | 69534 | 22 |
003-14_S206 | 3 | 1721 |
003-1_S193 | 40287 | 41523 |
003-38_S230 | 0 | 13 |
003-39_S231 | 2 | 1 |
003-3_S195 | 24 | 118270 |
003-42_S234 | 3 | 0 |
003-8_S200 | 0 | 4 |
003-9_S201 | 0 | 2 |
004-18_S258 | 3 | 6316 |
004-19_S259 | 0 | 2 |
004-20_S260 | 15 | 48595 |
004-21_S261 | 42 | 123789 |
004-28_S268 | 1 | 0 |
004-29_S269 | 2 | 7 |
004-30_S270 | 0 | 9 |
004-38_S278 | 7 | 0 |
004-3_S243 | 8 | 24297 |
004-48_S288 | 0 | 1 |
004-4_S244 | 6 | 51056 |
004-5_S245 | 1 | 4 |
004-6_S246 | 14 | 20266 |
004-8_S248 | 2 | 1705 |
004-9_S249 | 11 | 28960 |
006-11_S11 | 12 | 12704 |
006-12_S12 | 8 | 26029 |
006-13_S13 | 9 | 2 |
006-15_S15 | 16686 | 6 |
006-30_S30 | 6 | 8 |
006-31_S31 | 0 | 2 |
006-34_S34 | 0 | 3 |
006-41_S41 | 3 | 2 |
006-42_S42 | 11 | 8525 |
006-44_S44 | 6 | 23 |
006-47_S47 | 0 | 16 |
006-54_S54 | 0 | 12 |
006-55_S55 | 0 | 1 |
006-57_S57 | 4 | 0 |
006-5_S5 | 38 | 12844 |
006-60_S60 | 0 | 11 |
006-61_S61 | 6 | 10 |
006-68_S68 | 11 | 11156 |
006-72_S72 | 24053 | 7 |
006-7_S7 | 4 | 28040 |
006-80_S80 | 0 | 1 |
006-87_S87 | 12 | 10191 |
006-90_S90 | 2 | 14816 |
007-18_S114 | 2 | 2 |
007-31_S127 | 7 | 16688 |
007-48_S144 | 50089 | 3 |
007-51_S147 | 9 | 19080 |
007-53_S149 | 1 | 1 |
007-57_S153 | 0 | 2 |
007-59_S155 | 0 | 1 |
007-65_S161 | 1 | 12685 |
007-67_S163 | 0 | 3400 |
007-71_S167 | 6 | 80697 |
007-76_S172 | 4 | 20287 |
007-7_S103 | 2291 | 2 |
007-9_S105 | 1 | 2 |
I believe that these results are reasonable for these cells. While it's feasible to further tweak the parameters, it may introduce more noise into the data.
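To see which cells changed status between the two runs, the two lists of single-chain cells can be compared as sets. The Python sketch below uses only an excerpt of the cell IDs from the two tables above, not the full lists.

```python
# Excerpt of cell IDs from the first table (default parameters).
before = {
    "003-10_S202", "003-15_S207", "003-44_S236", "004-38_S278",
    "004-47_S287", "004-7_S247", "006-48_S48",
}
# Excerpt of cell IDs from the second table (relaxed parameters).
after = {
    "003-10_S202", "004-28_S268", "004-38_S278", "007-53_S149",
}

left_list = sorted(before - after)    # single-chain before, not listed after
joined_list = sorted(after - before)  # not listed before, single-chain after
unchanged = sorted(before & after)    # single-chain in both runs

print("left the list:  ", left_list)
print("joined the list:", joined_list)
print("unchanged:      ", unchanged)
```

Cells that left the list presumably now have both chains assembled, and cells that joined it presumably had no assembled chain before. Both readings are assumptions that should be confirmed against the clonotype tables.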
Please let me know if you have any other questions.
Sincerely,
Mark