tianshilu / pMTnet
Deep Learning the T Cell Receptor Binding Specificity of Neoantigen
License: GNU General Public License v2.0
Thank you for a great tool! I am still pretty new to this field.
I would like to learn more about the training process of pMTnet. I am not sure if I missed the training data in the repository. Could you please provide the training data used for pMTnet, with positive and negative labels (e.g. positive/TCR_output.csv, negative/TCR_output.csv, training_positive.csv)? Thank you so much for all your efforts!
Dear Tianshi,
I want to examine the connection between peptides and TCRs, but without the related HLA information. Do you have any suggestions?
Thank you so much for your attention and help.
Yingcheng
Hi,
I found that the relevant training code is provided in test/code/ternary_train_model_pMTnet.py. However, some parts are still missing. Could you help with the following questions?
tcr_file_train_pos='positive/TCR_output.csv'
tcr_file_train_neg='negative/TCR_output.csv'
hla_antigen_file_train='MHC_antigen_output.csv'
ternary_prediction.fit({'pos_in':tcr_train_pos,'neg_in':tcr_train_neg,'hla_antigen_in':hla_antigen_train}, {'output':Y_train},epochs=150,batch_size=256,shuffle=True)
This line of code seems to imply that the number of negative samples must equal the number of positive samples, not the 10:1 ratio stated in the article. Are the shapes of (pos_in, neg_in) fixed to be equal? I am new to this field and to Keras. Thank you for your efforts.
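If the paired pos_in/neg_in inputs do have to be equal-length, one way that could be reconciled with a 10:1 negative ratio is to tile each positive sample against several negatives before calling fit. A minimal sketch (the counts and the 80-dimensional toy encodings are illustrative, not pMTnet's actual data format):

```python
import numpy as np

# Hypothetical encoded TCRs: 100 positives, 1000 negatives (10:1 ratio).
tcr_train_pos = np.random.rand(100, 80)
tcr_train_neg = np.random.rand(1000, 80)

# A paired 'pos_in'/'neg_in' model needs arrays of the same length, so
# repeat each positive 10 times to pair it with 10 distinct negatives.
tcr_train_pos_paired = np.repeat(tcr_train_pos, 10, axis=0)

assert tcr_train_pos_paired.shape == tcr_train_neg.shape
```

Whether the authors actually used this pairing scheme is exactly the open question above; the sketch only shows that equal-length inputs and a 10:1 ratio are not mutually exclusive.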
Hi Dr Tianshi,
Thanks for this beautiful work. It is really useful.
I have a list of HLAs (e.g. HLA-A*02:01) and TCR CDR3 sequences (e.g. CAVLDSNYQLIW), but I don't know what the exact antigens are. Is there any possible way to compute an HLA-TCR match score?
Thanks again for your kind help.
Best,
Yingcheng
Thanks for this new tool and for the data you provide!
I am very interested to use the validation dataset you use to judge the performance of pMTnet, as I think you make a very valid point concerning the quality of data and its effect on model performance. However, I am currently using models that take the full TCR sequence into account. Do you maybe have the full TCR sequences or V and J gene usage information for the validation data?
Hello, when I was reading the paper "Attention-aware contrastive learning for predicting T cell receptor–antigen binding specificity", I found that the dataset involved in that paper came from yours. However, half of the 619 test cases described there were positive and half were negative.
So, may I ask: are all 619 test cases in your test set positive, or are they half positive and half negative? Thanks.
Hi, I found some strange characters in the datasets you provided for training and testing. For example, in rows 30 and 31 of testing_data.csv, the antigen sequence seems to contain a strange non-ASCII character.
When I loaded this file with pandas, the character turned out to be '\xa0'.
So, is this a mistake made when generating the files, or could '\xa0' have some special meaning? Thank you.
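For what it's worth, '\xa0' is a non-breaking space (U+00A0), which often sneaks into CSVs copied from web pages or spreadsheets. A minimal pandas sketch for stripping it before use (the column name and sequences are toy examples, not the actual file contents):

```python
import pandas as pd

# Toy frame with a non-breaking space contaminating one antigen sequence.
df = pd.DataFrame({"Antigen": ["NLVPMVATV\xa0", "GILGFVFTL"]})

# Remove non-breaking spaces and surrounding whitespace.
df["Antigen"] = df["Antigen"].str.replace("\xa0", "", regex=False).str.strip()

print(df["Antigen"].tolist())  # ['NLVPMVATV', 'GILGFVFTL']
```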
Greetings!
Great tool for predicting TCR-pMHC binding! However, is there any way to speed up the encoding step? Since I understand the aim of this tool is to predict how well a TCR repertoire binds to predicted pMHCs, the encoding is far slower than I'd expect. Given that you'd pair each TCR with the whole list of pMHCs to test for binding, this generates files of millions of lines. I'm currently running it on a file with 2M lines; after almost 3 days of running time, the encoding is not even close to done. Maybe the tool is not meant to take all possible combinations as input, but just some of them? In that case, how would you select them?
Best regards,
Jonatan
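One stopgap, independent of any change to the tool itself, is to split the pairing file into chunks and run the encoding on each chunk as a separate job in parallel. A sketch with made-up column names and a tiny chunk size (pMTnet's real input format may differ):

```python
import pandas as pd

# Toy pairing table standing in for a multi-million-line input file.
pairs = pd.DataFrame({"CDR3": ["CASSF"] * 10,
                      "Antigen": ["GILGFVFTL"] * 10,
                      "HLA": ["A*02:01"] * 10})
pairs.to_csv("all_pairs.csv", index=False)

# Split into chunks; each chunk_*.csv can then be passed to a separate
# pMTnet.py run, so the encoding parallelizes across processes/nodes.
chunk_size = 4
for i, chunk in enumerate(pd.read_csv("all_pairs.csv", chunksize=chunk_size)):
    chunk.to_csv(f"chunk_{i}.csv", index=False)
```

This only spreads the cost across workers; it does not reduce the total work, so filtering the candidate pairs beforehand (as the question suggests) would still be the bigger win.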
Hi, I have a problem when I run the code: python pMTnet.py -input test/input/test_input.csv -library library -output test/output -output_log test/output/output.log
tensorflow.python.framework.errors_impl.InvalidArgumentError: input and filter must have the same depth: 76 vs 30
Can you help me with this problem? Thanks.
Hi,
I know that "a lower rank is considered a good prediction", but how can I select the credible CDR3-antigen pairs from the output? Could you please provide thresholds or any filtering method?
Thanks
Hi,
I'm wondering how you calculated the AUC values, since the output of pMTnet is a relative rank.
Thanks
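One possible answer, since AUC is itself a rank-based metric: a relative rank can be used directly as a score, negated so that lower ranks (better predictions) score higher. A sketch with toy labels and ranks (not the paper's actual evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: 1 = binding pair, 0 = non-binding; lower rank = better.
labels = np.array([1, 1, 1, 0, 0, 0])
ranks = np.array([0.02, 0.10, 0.30, 0.55, 0.70, 0.90])

# Negate so that lower ranks become higher scores for roc_auc_score.
auc = roc_auc_score(labels, -ranks)
print(auc)  # 1.0 here, since every positive outranks every negative
```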
Hi, when I try to use pMTnet, a warning appears that looks like this:
2022-05-16 10:04:14.706952: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-16 10:04:16.631750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10791 MB memory: -> device: 0, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7
2022-05-16 10:04:16.635577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 10791 MB memory: -> device: 1, name: Tesla K80, pci bus id: 0000:87:00.0, compute capability: 3.7
2022-05-16 10:04:18.266973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8100
WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
And I found that the GPU did not help at all compared to the CPU version.
Can you help me find what's wrong? I want to speed up the prediction. I am using TensorFlow 2.7.0 and Python 3.8.5. Thanks!
Thanks for making the test code for the tool available.
I have a query regarding how the HLA pseudosequences are generated.
Here, there are hard-coded indexes for generating the pseudosequences; however, my understanding is that an alignment is needed before using these indexes, since the HLAs in the FASTAs you've used are of varying length. Without that alignment, the indexes from the original netMHCpan paper describing the method wouldn't necessarily be correct for your HLA sequences.
If you look at the HLA analysis in netMHCpan (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000796) the pseudosequences have an expected pattern which I don't think holds using your method, indicating you're not using the same pseudosequences they are (at least at test time).
MHCFlurry used what looks like a similar set of HLA FASTAs to yours, and after their alignment, their indexes start at 31, not 7.
Did you use a different method for training? If not it could be possible that the network is mainly performing an accurate match between peptide-TCR. The HLAs are still being encoded, but not in a way which preserves the likely contact points.
Apologies if I've missed part of the implementation which addresses this!
Is it possible to provide the code used to generate 'TCR_encoder_30.h5' and the other models loaded in the repo?
Hi,
In the for loop (line 83), there are 34 values in the pseudo_seq_pos array, but only 33 of them are used to construct the pseudosequence (line 95): for i in range(0,33).
Note that range(0,33) = 0, 1, 2, ..., 32, so the last position is dropped.
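The off-by-one is easy to confirm in isolation; the placeholder indexes below stand in for the repo's hard-coded pseudo_seq_pos values:

```python
# Placeholder: any 34 hard-coded positions, standing in for pseudo_seq_pos.
pseudo_seq_pos = list(range(7, 41))
assert len(pseudo_seq_pos) == 34

# range(0, 33) yields only 33 indexes (0..32), silently dropping the
# 34th pseudosequence position; range(0, 34) covers all of them.
assert len(list(range(0, 33))) == 33
assert len(list(range(0, 34))) == 34
```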