
deeptcr's Introduction

DeepTCR

Deep Learning Methods for Parsing T-Cell Receptor Sequencing (TCRSeq) Data

DeepTCR is a Python package providing a collection of unsupervised and supervised deep learning methods for parsing TCRSeq data. For examples of how the algorithms can be used, see the 'tutorials' subdirectory, which contains tutorial use cases across multiple datasets. For complete documentation of all available methods, see the project documentation.

While DeepTCR will run with TensorFlow-CPU versions, for optimal training times we suggest training these algorithms on GPUs (requiring CUDA, cuDNN, and tensorflow-GPU).

DeepTCR can now analyze paired alpha/beta chain inputs, and can also take in V/D/J gene usage and the HLA context in which the TCR sequences were seen (i.e., the HLA alleles for a repertoire from a given human sample). For detailed instructions on how to load this type of data, refer to the documentation for loading data into DeepTCR.

For questions or help, email: [email protected]

Publication

For a full description of the algorithm and methods behind DeepTCR, refer to the following manuscript:

Sidhom, J. W., Larman, H. B., Pardoll, D. M., & Baras, A. S. (2021). DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat Commun 12, 1605

Dependencies

See requirements.txt for all DeepTCR dependencies. Installing DeepTCR from the GitHub repository or PyPI installs all required dependencies. We recommend creating a virtualenv and installing DeepTCR within it to ensure proper versioning of dependencies.

In the DeepTCR 2.0 release (the fifth release), the package moved to Python 3.7 and TensorFlow 2.0. Because this required overhauling much of the code, there may be some bugs; we would greatly appreciate it if you posted any problems to the issues page, and we will do our best to fix them as quickly as possible. The latest DeepTCR 1.x version is available under the v1 branch if you still want to use it, or you can pip install the specific version you need.

Instructions on how to create a virtual environment can be found here: https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/

Installation

In order to install DeepTCR:

pip3 install DeepTCR

Or, to install the latest version from the GitHub repo, either download the package, unzip it, and run the setup script:

python3 setup.py install

Or use:

pip3 install git+https://github.com/sidhomj/DeepTCR.git

Release History

1.1

Initial release including two methods for unsupervised learning (VAE & GAN). Also included ability to handle paired alpha/beta data.

1.2

Second release included major refactoring of the code to streamline and share methods across classes. Included the ability for the algorithm to accept V/D/J gene usage. Added more analytical features and visualization methods. Removed GAN from the unsupervised learning techniques.

1.2.7

On-graph clustering method introduced for repertoire classifier to improve classification performance.

1.2.13

Ability for HLA information to be incorporated in the analysis of TCR-Seq.

1.2.24

Added ability to do regression for sequence-based model.

1.3

Third release including improved repertoire classification architecture. Details in method will follow in manuscript.

1.4

Fourth release includes major refactoring of code and adding more features including:

  • Multi-Model Inference. When training the supervised sequence or repertoire classifier, in Monte-Carlo or K-Fold Cross Validation, a separate model will be stored for each cross-validation. When using the inference engine, users can choose to do an ensemble inference of some or many of the trained models.
  • HLA Supertype Integration. Previous versions allowed users to provide HLA alleles for additional dimension of featurization for the TCR. In this version, when providing HLA (either via the Get_Data or Load_Data methods), one now has the option of assigning the HLA-A and B genes to known supertypes for a more biologically functional representation of HLA.
  • VAE now has an optional method by which to find a minimal number of latent features to model the underlying distribution by incorporating a sparsity regularization on the latent layer. When using this feature, the VAE will provide a more compact latent space even if the initial latent_dim is unnecessarily high to model the distribution of data.
  • Supervised models now have an additional option to use Multi-Sample Dropout to improve training and generalization.
  • Incorporation of LogoMaker so now when Representative Sequences are generated along with enriched motifs, seq logos are made and saved directly in the results folder under Motifs.
  • Improved Motif Identification algorithm behind the supervised method Representative_Sequences, which uses a multinomial linear model to identify which motifs are associated with the predicted probabilities from the neural network.
  • Supervised Repertoire Model is now able to do regression. When a per-instance label with a regression value is provided via the Load_Data method, the model automatically uses the average of all instance-level labels as the sample-level value to regress against.
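
The Multi-Model Inference item above amounts to combining the predictions of the per-fold models. A minimal sketch of that idea, assuming each stored model yields an array of class probabilities of shape (n_sequences, n_classes) and that the ensemble is a plain average. The function name `ensemble_predict` and the random stand-in predictions are hypothetical and not part of DeepTCR:

```python
import numpy as np

def ensemble_predict(per_model_probs, model_indices=None):
    """Average class probabilities over all models, or over a chosen subset."""
    stacked = np.stack(per_model_probs, axis=0)  # (n_models, n_sequences, n_classes)
    if model_indices is not None:
        stacked = stacked[list(model_indices)]
    return stacked.mean(axis=0)

rng = np.random.default_rng(0)
# Stand-in predictions from 5 cross-validation models, 4 sequences, 3 classes.
per_model_probs = [rng.dirichlet(np.ones(3), size=4) for _ in range(5)]

all_models = ensemble_predict(per_model_probs)                    # use every model
subset = ensemble_predict(per_model_probs, model_indices=[0, 2])  # use only two
```

Averaging probabilities keeps each row summing to 1, so the result can be read the same way as a single model's output.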

2.0

Fifth release:

  • Upgrading to use python 3.7 & Tensorflow 2.0
  • For large repertoires, we have incorporated the ability to randomly subsample the repertoire over the course of training. Two sub-sampling methods exist: 1) sampling completely at random from across the entire repertoire, or 2) sampling with probability proportional to the frequency of the TCR (at the amino acid level), meaning that a TCR with a 25% frequency will be sampled with that probability.
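
The two sub-sampling modes described above can be sketched with NumPy. This is an illustration only, assuming sampling with replacement; the function name `subsample_repertoire` and the toy counts are hypothetical and do not come from DeepTCR:

```python
import numpy as np

def subsample_repertoire(counts, n, by_frequency, rng):
    """Return indices of n sampled TCRs.

    by_frequency=False: every unique TCR is equally likely.
    by_frequency=True:  a TCR is drawn with probability equal to its frequency.
    """
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum() if by_frequency else None
    return rng.choice(len(counts), size=n, replace=True, p=p)

rng = np.random.default_rng(0)
counts = [250, 500, 250]  # toy repertoire: the first TCR has a 25% frequency

uniform_idx = subsample_repertoire(counts, 20000, by_frequency=False, rng=rng)
weighted_idx = subsample_repertoire(counts, 20000, by_frequency=True, rng=rng)

uniform_share = np.mean(uniform_idx == 0)    # close to 1/3
weighted_share = np.mean(weighted_idx == 0)  # close to 0.25
```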

2.1.0

  • Upgrading to Tensorflow 2.7
  • Improved handling of inference with previously unseen V/D/J gene usage.
  • Improved computational efficiency for loading data from large files (~2x improvement in speed, 50% or more decrease in peak memory consumption)

deeptcr's Issues

Clustering by phenograph error

Hi I was trying to run the clustering tutorial and I got an error after running through the commands below:

DTCRU = DeepTCR_U('Tutorial')
#Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)
DTCRU.Train_VAE(Load_Prev_Data=False, suppress_output=False)
features = DTCRU.features
DTCRU.Cluster(clustering_method='phenograph')

OUTPUT

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 3.018752098083496 seconds
Jaccard graph constructed in 1.022472858428955 seconds
Wrote graph to binary file in 0.31229281425476074 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.883528
After 2 runs, maximum modularity is Q = 0.88486
Louvain completed 22 runs in 1.1757018566131592 seconds
PhenoGraph complete in 5.551486968994141 seconds

TypeError Traceback (most recent call last)
in
1 # cluster using phenograph
----> 2 DTCRU.Cluster(clustering_method='phenograph')

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs)
1051 df['D_beta'] = d_beta[sel]
1052 df['J_beta'] = j_beta[sel]
-> 1053 df['HLA'] = list(map(list,hla_data_seq[sel].tolist()))
1054
1055 df_sum = df.groupby(by='Sample', sort=False).agg({'Frequency': 'sum'})

TypeError: 'float' object is not iterable

OUTPUT

I am running using macos, python 3.7
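
One possible reading of the traceback above: `list(map(list, ...))` iterates over every element, so it fails as soon as one element is a float such as NaN (for instance, if no HLA data was loaded). That cause is an assumption, not a confirmed diagnosis. The sketch below reproduces the same error message with a stand-in array and shows a defensive conversion; `hla_stub` and `to_list_safe` are hypothetical names:

```python
import numpy as np

# Stand-in for an HLA column: one real entry and one missing (NaN) entry.
hla_stub = np.empty(2, dtype=object)
hla_stub[0] = ["A*02:01", "B*07:02"]
hla_stub[1] = np.nan

try:
    list(map(list, hla_stub.tolist()))
    failed, message = False, ""
except TypeError as err:
    failed, message = True, str(err)

def to_list_safe(entry):
    """Treat a missing (float) entry as an empty list instead of iterating it."""
    return [] if isinstance(entry, float) else list(entry)

safe = [to_list_safe(entry) for entry in hla_stub.tolist()]
```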

Download with docker

Hi, Professor John-William Sidhom!
I encountered some problems when I tried to download DeepTCR.
I have tried using miniconda and virtualenv to create a virtual environment, but the system always reports errors.
I can install Python 3.9, but I cannot get the corresponding pip version.
This has troubled me for a long time.
I hope this excellent software will have a Docker image someday.
Best wishes to you!

How to export latent feature matrix resulted by VAE training

Hello!
I would like to use the latent feature matrix produced by VAE training for other analyses of my own instead of using the DeepTCR pipeline, but I don't know how to export it (as a .tsv file, for example). Could you give me some advice?

Thank you for your help!
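
A minimal sketch of one way to write such a matrix to a .tsv file with pandas. It assumes the latent features are available as a NumPy array, as in the `features = DTCRU.features` line quoted in the clustering issue above. The array, the sequence labels, the column names, and the output path here are stand-ins, since this page does not show how sequences are stored on the object:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for `DTCRU.features`: one row per sequence, one column per latent dimension.
features = np.random.default_rng(0).normal(size=(3, 4))
sequences = ["CASSIRDTETLYF", "CASGEGQTNSDYTF", "CASSGQNQDTQYF"]

columns = [f"latent_{i}" for i in range(features.shape[1])]
df = pd.DataFrame(features, index=pd.Index(sequences, name="sequence"), columns=columns)

# Tab-separated output; the path is only an example.
out_path = os.path.join(tempfile.gettempdir(), "latent_features.tsv")
df.to_csv(out_path, sep="\t")
```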

sequence_inference error

Hi:

I am trying to do sequence_inference based on a trained model, while the following error occurs:

I am not sure how I shall change my code to make it work. May I ask for your suggestions? Thanks!

tensorflow/core/common_runtime/colocation_graph.cc:1218] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
ResourceApplyAdam: CPU
ReadVariableOp: CPU
AssignVariableOp: CPU
VarIsInitializedOp: CPU
Const: CPU
VarHandleOp: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
dense_1/bias/Initializer/zeros (Const) /device:GPU:0
dense_1/bias (VarHandleOp) /device:GPU:0
dense_1/bias/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/BiasAdd/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam (VarHandleOp) /device:GPU:0
dense_1/bias/Adam/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Initializer/zeros (Const) /device:GPU:0
dense_1/bias/Adam_1 (VarHandleOp) /device:GPU:0
dense_1/bias/Adam_1/IsInitialized/VarIsInitializedOp (VarIsInitializedOp) /device:GPU:0
dense_1/bias/Adam_1/Assign (AssignVariableOp) /device:GPU:0
dense_1/bias/Adam_1/Read/ReadVariableOp (ReadVariableOp) /device:GPU:0
Adam/update_dense_1/bias/ResourceApplyAdam (ResourceApplyAdam) /device:GPU:0
save/AssignVariableOp_35 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_36 (AssignVariableOp) /device:GPU:0
save/AssignVariableOp_37 (AssignVariableOp) /device:GPU:0

Issue with DTCR_WF and DTCR_SS Get_Train_Valid_Test

Sorry if there's an obvious answer to this; I'm not very experienced with Python and am learning it for my Honours degree.

When running this part of the DTCR_WF script:

DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()

I'm getting an error that reads:

" line 4014, in
Get_Train_Valid_Test
raise Exception('Choose different train/valid/test parameters!')

Exception: Choose different train/valid/test parameters! "

I am also having what feels like a related issue with DTCR_SS when running the same part of its script. It does not throw an error and training starts, but nothing seems to happen: it continues until stopped, with the training loss, validation loss, testing loss, etc. all reading 0.

Again, sorry if there is an obvious answer. I'd appreciate any help. Thank you. I should note that the data I'm using works fine with the unsupervised script.

Reproducible clustering

Hi,
Is there a way to make the training and clustering reproducible? Setting graph_seed and split_seed in Train_VAE does not seem to do the trick.
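
A general sketch of seeding the usual sources of randomness in a Python/TensorFlow workflow. Whether this is sufficient for DeepTCR is not established in this thread: GPU kernels and the clustering backends may add their own nondeterminism. The helper name `seed_everything` is hypothetical:

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed Python's, NumPy's and (if installed) TensorFlow 2.x's global RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional dependency for this sketch
        tf.random.set_seed(seed)
    except ImportError:
        pass

seed_everything(0)
first = np.random.rand(3)
seed_everything(0)
second = np.random.rand(3)  # identical to `first` because the seed was reset
```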

Recommendations for handling large datasets

Hi, thank you for creating this great tool!

I was wondering if you could offer some guidance on handling large datasets in the unsupervised workflow? In particular this seems to be a problem with the clustering/KNN classification steps as it seems to be prohibitively memory-expensive.

I think that downsampling is interfering with the classification accuracy so I would like to use all the data if possible.

Thanks so much for your help!

Leeana

The dataset used in the regression model

Hello,

I checked the dataset used in the regression model. It seems that simply dropping duplicate TCRs does not reproduce it. Could you tell me where I can find the preprocessing details needed to obtain the dataset for the regression model?

Thanks!

Supervised learning train error: need at least one array to concatenate

I am running a test using my own data.
After loading the data successfully, I got an error when training:
#Load Data from directories
DTCR_WF.Get_Data(directory='data_test/',
Load_Prev_Data=False,
aggregate_by_aa=True,
aa_column_beta=1,v_beta_column=3,d_beta_column=4,j_beta_column=5,
count_column=6,n_jobs = 2, sep=",")
DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
DTCR_WF.Train()
error msg start ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.2)
----> 2 DTCR_WF.Train()
3
4 # DTCR_WF.Monte_Carlo_CrossVal(folds=5,test_size=0.3,stop_criterion=0.25,epochs_min=100,
5 # suppress_output = False)

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3148
3149 valid_loss, valid_accuracy, valid_predicted, valid_auc =
-> 3150 Run_Graph_WF(self.valid, sess, self, GO, batch_size, random=False, train=False)
3151
3152

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR-1.2.15-py3.7.egg/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, random, train, drop_out_rate)
390 loss = np.mean(loss)
391 accuracy = np.mean(accuracy)
--> 392 predicted_out = np.vstack(predicted_list)
393 try:
394 auc = roc_auc_score(set[-1], predicted_out)

~/anaconda3/envs/dl/lib/python3.7/site-packages/numpy/core/shape_base.py in vstack(tup)
281 """
282 _warn_for_nonsequence(tup)
--> 283 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
284
285

ValueError: need at least one array to concatenate
End of error msg ------------------------------------
The directory structure is as follows:

data_test/
├── A
│   ├── A_1.csv
│   └── A_2.csv
├── B
│   ├── B_1.csv
│   └── B_2.csv
├── C
│   ├── C_1.csv
│   └── C_2.csv
└── D
    ├── D_1.csv
    └── D_2.csv

In each csv file, there is beta chain information
AAACCTGCAGGCTGAA-1,CASSIRDTETLYF,498,TRBV16,TRBD1,TRBJ2-3,1
AAACGGGAGGGTGTGT-1,CASGEGQTNSDYTF,568,TRBV13-2,TRBD1,TRBJ1-2,5
AAACGGGGTCTTCAAG-1,CASSGQNQDTQYF,503,TRBV15,TRBD1,TRBJ2-5,1
AAACGGGTCTAACTGG-1,CASSLGWHSYEQYF,572,TRBV16,None,TRBJ2-7,3
AAAGATGAGAATTGTG-1,CASGPGQSNTEVFF,527,TRBV13-2,TRBD1,TRBJ1-1,7
AAAGCAATCTGGCGAC-1,CASSDGLGGLEQYF,481,TRBV13-1,TRBD2,TRBJ2-7,7
AAATGCCCAATCCAAC-1,CAWVDWAQNTLYF,544,TRBV31,TRBD2,TRBJ2-4,3
AAATGCCTCGGCTTGG-1,CSAQGAHTEVFF,566,TRBV1,TRBD1,TRBJ1-1,18
AACACGTGTATAATGG-1,CASSSPLAGQDTQYF,519,TRBV3,None,TRBJ2-5,1

Number of records for each input file :
808 data_test/A/A_1.csv
1920 data_test/A/A_2.csv
2163 data_test/B/B_1.csv
1879 data_test/B/B_2.csv
836 data_test/C/C_1.csv
1182 data_test/C/C_2.csv
1705 data_test/D/D_1.csv
2091 data_test/D/D_2.csv
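
The final error in the traceback is what `np.vstack` raises on an empty list, which suggests the validation loop collected no predictions. One hypothesis (not confirmed here) is that with only two files per class, `test_size=0.2` leaves the held-out set empty. The sketch below illustrates both points. The split rule in `held_out_per_class` is a hypothetical stand-in and is not DeepTCR's actual logic:

```python
import numpy as np

def held_out_per_class(samples_per_class, test_size):
    """Hypothetical split rule: hold out floor(n * test_size) samples per class."""
    return {label: int(n * test_size) for label, n in samples_per_class.items()}

samples_per_class = {"A": 2, "B": 2, "C": 2, "D": 2}
held_out = held_out_per_class(samples_per_class, test_size=0.2)  # 0 for every class

predicted_list = []  # what an evaluation loop holds after seeing no batches
try:
    np.vstack(predicted_list)
    message = ""
except ValueError as err:
    message = str(err)
```

Under this assumption, a larger `test_size` or more samples per class would give every class at least one held-out sample.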

Understanding Training Strategy of Supervised TCR repertoire classification on HIV dataset

Hi, Sorry to disturb:

I am trying to understand the training strategy of HIV dataset and replicate the results you get in your publication.

It seems that the dataset can be categorized as non-cognate groups (CEF, AY9, No Peptide conditions), or cognate groups (where there is an epitope). We have 3 * 3 samples that are non-cognate, while 25 * 3 samples as cognate groups. I saw from the paper that deeptcr can distinguish non-cognate samples from cognate samples, and the training used keep two out of three for training data.

My question is, when doing the training, did you

  1. fit the model using all (3+25) * 2 data at once, where 3 * 2 are non-cognate and 25*2 are cognate group? Then you test the model on the remaining 3+25 samples and see whether the model can correctly predict whether each sample is cognate or non-cognate.
  2. Or did you use (3+1) * 2 samples, where the 3 * 2 are non-cognate and the 1 * 2 come from one specific epitope, instead of using all 25 * 2 samples as the cognate group? Then you would test the model on the remaining 3+1 samples to see whether it can correctly predict which (one) sample is the cognate group.
    You would then repeat option 2 for each specific epitope (MSPRTLNAW, NTQGYFPDW, etc.).

Thanks and looking forward to your reply!

Can't load own Data using DTCR_SS.Get_Data

TRB.txt
I have TCRseq Data which was annotated by IGB and preprocessed for DeepTCR as indicated in the tutorial.
I have 9 Samples with many TCRs, here is an excerpt of the Data for one Sample:

cdr3_aa	v_call	d_call	j_call	Count
ASSARQDLQQY	TRBV2*01	TRBD1*01	TRBJ2-7*01	39890
ASKDRALLRAV	TRBV21-1*01	TRBD1*01	TRBJ2-7*01	32323
ASSFSATNTGELF	TRBV5-1*01	TRBD2*01	TRBJ2-2*01	26637
ASSPGEQNTGELF	TRBV7-8*01	TRBD2*01	TRBJ2-2*01	26258
ASSGAGTGGYNEQF	TRBV12-3*01	TRBD1*01	TRBJ2-1*01	16692
ASSFSGHTGELF	TRBV7-2*01	TRBD2*01	TRBJ2-2*01	13838
ASSVETGTEKY	TRBV7-9*01	TRBD1*01	TRBJ2-3*01	13831
PPVIWTATSST	TRBV24-1*01	TRBD1*01	TRBJ2-7*01	13819
ASSSGLAGAYEQY	TRBV7-2*02	TRBD2*01	TRBJ2-7*01	13216
ASSFGVSGANVLT	TRBV7-9*03	TRBD2*01	TRBJ2-6*01	11449
ASSGLAGGPGTGELF	TRBV9*01	TRBD2*02	TRBJ2-2*01	11292
ASSPLAGGVAQF	TRBV7-6*01	TRBD2*02	TRBJ2-1*01	11019
ASSSTGQGNSYEQY	TRBV28*01	TRBD1*01	TRBJ2-7*01	10466

If I run the tutorial for supervised sequence classification using the example data from the repository, loading data, clustering, etc. work perfectly, except for DTCR_SS.Train(), which throws:

AttributeError: 'DeepTCR_SS' object has no attribute 'test_pred'

DTCR_SS.Monte_Carlo_CrossVal, DTCR_SS.K_Fold_CrossVal, etc. work.

If I then replace the folders in Data/Murine_Antigens with my samples, DTCR_SS.Get_Data(), which usually takes just a moment to load the data, gets stuck (I stopped it after 40 min).

Even restricting to TCRs with >= 1000 reads, which results in tables of 50-80 rows, does not resolve the issue.

import sys
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_SS

# Instantiate training object
DTCR_SS = DeepTCR_SS('Tutorial')

#Load Data from directories
DTCR_SS.Get_Data(directory='../../Data/TRB',Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=4,v_beta_column=1,j_beta_column=3)

Output:

Loading Data ...

Is there anything that could cause this kind of Bug?

Attached you will find the data for one Sample for TCR-seqs > 1000 (as .txt file saved .tsv)

Thank you in Advance for your help!

Suggestion for UMAP_Plot() function

Hi,

This package has been useful, but I have a suggestion for UMAP_Plot(). When showing the legend for the UMAP plot, it would be nice to be able to place the legend to the right of the plot, or really anywhere that isn't on the plot itself, since with a large number of labels it tends to block a substantial portion of the graph.
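
For illustration, this is how a legend can be placed to the right of the axes in plain Matplotlib. The sketch does not call UMAP_Plot(), and whether that method exposes its axes is not stated here. The data, labels, and output path are stand-ins:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots(figsize=(7, 5))
for i in range(12):  # many labels, as in the situation described above
    xy = rng.normal(loc=i, size=(30, 2))
    ax.scatter(xy[:, 0], xy[:, 1], s=8, label=f"label_{i}")

# Anchor the legend's left-center point just outside the right edge of the axes.
legend = ax.legend(loc="center left", bbox_to_anchor=(1.02, 0.5), frameon=False)

# bbox_inches="tight" enlarges the saved area so the outside legend is not cut off.
out_path = os.path.join(tempfile.gettempdir(), "legend_outside.png")
fig.savefig(out_path, bbox_inches="tight")
```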

ValueError when running "8 - VAE Inference" in unsupervised tutorials

When running the inference, I get an error on the following line:

features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)

The error is the following

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1005: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.

WARNING:tensorflow:From /home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py:1006: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Restoring parameters from murine_antigens/models/model_0/model.ckpt
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-da530d2eb961> in <module>()
----> 1 features,_ = DTCRU.Sequence_Inference(beta_sequences=beta_sequences,v_beta=v_beta,j_beta=j_beta)

ValueError: too many values to unpack (expected 2)
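
The ValueError above means the call returned more than the two values the tutorial line unpacks. A sketch of unpacking that tolerates extra return values, assuming the features are still the first element (as the tutorial's `features,_ = ...` suggests). `sequence_inference_stub` is a hypothetical stand-in, not the DeepTCR method:

```python
import numpy as np

def sequence_inference_stub():
    """Stand-in that returns three values, like the call that failed above."""
    return np.zeros((3, 4)), "second output", "third output"

# Starred unpacking collects however many extra values there are.
features, *rest = sequence_inference_stub()

# Indexing the returned tuple works as well.
features_again = sequence_inference_stub()[0]
```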

Parallelize KNN

Could we add n_jobs to the KNN_Sequence_Classifier by adding n_jobs to KNeighborsClassifier from sklearn module? Currently the KNN_Sequence_Classifier is very slow.

Thanks!
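
For reference, scikit-learn's `KNeighborsClassifier` does accept an `n_jobs` argument, which parallelizes the neighbor search at prediction time (it does not affect `fit`). A small stand-alone sketch on synthetic features, which does not show how KNN_Sequence_Classifier is implemented:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two well-separated synthetic classes standing in for learned sequence features.
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# n_jobs=2 runs the neighbor search in two parallel jobs; -1 would use all cores.
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=2)
knn.fit(X, y)
pred = knn.predict(X)
accuracy = float((pred == y).mean())
```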

Availability of trained models

Is any of the trained VAE models available publicly, ideally together with an evaluation script?

We are working on a related topic and would like to perform an apples-to-apples comparison with your VAE approach.

Issue loading single cell data

Hi, thank you for the great tool!

I am experiencing some issues loading paired single cell data.
I have csvs for each of my samples that have a barcode column and cdr3, v, and j (d coverage was low so I removed that column) genes for each chain. There are no NA values or empty values that I can see so I'm not sure why it is throwing an empty array error.

Any help would be appreciated!

DTCR_WF.Get_Data(directory='pln/',Load_Prev_Data=False,aggregate_by_aa=True,
... aa_column_beta=7,v_beta_column=5,j_beta_column=6,
... aa_column_alpha=4, v_alpha_column=2,j_alpha_column=3,count_column=8)
Loading Data...
Traceback (most recent call last):
File "", line 3, in
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 336, in Get_Data
Y = OH.fit_transform(Y.reshape(-1,1))
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 488, in fit_transform
return super().fit_transform(X, y)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/base.py", line 847, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 461, in fit
self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 78, in _fit
X, force_all_finite=force_all_finite
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 44, in _check_X
X_temp = check_array(X, dtype=None, force_all_finite=force_all_finite)
File "/#/RIMA/miniconda3/envs/deeptcr3.7/lib/python3.7/site-packages/sklearn/utils/validation.py", line 800, in check_array
% (n_samples, array.shape, ensure_min_samples, context)
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required.

more detailed tutorials or instructions

Hello,
We are interested in running this tool on our TCR seq data. We have multiple cohorts of cancer patients, responders and non-responders plus a big cohort of COVID patients, and covid vaccinated patients. We would very much like to study these samples using your tool. However, as I was reviewing the documentations I could not follow through all the steps to run this tool.
Are you planning to add more instructions? Or would you be able to send us detailed documentations on how to run this tool?
Thank you,
Arnavaz Danesh -Bioinformatician at University Health Network, Toronto.

Optimization of the threshold parameter in hierarchical clustering

Hello @sidhomj,

I used the unsupervised part of DeepTCR to cluster TCR sequences, but when I let the method determine the optimal threshold parameter with the following command, I got this error:

DTCRU_test.Cluster(clustering_method="hierarchical", linkage_method="ward", criterion="distance", write_to_sheets=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/DeepTCR.py", line 1054, in Cluster
    IDX = hierarchical_optimization(distances, features, method=linkage_method, criterion=criterion)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/DeepTCR/functions/utils_u.py", line 52, in hierarchical_optimization
    sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 118, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 229, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/ubuntu/.conda/envs/DeepTCR_env/lib/python3.7/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
    % n_labels
ValueError: Number of labels is 2876. Valid values are 2 to n_samples - 1 (inclusive)

To correct this, I tried modifying the function hierarchical_optimization in the utils_u.py script in the DeepTCR/functions folder (l.44):

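import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn import metrics as skmetrics

def hierarchical_optimization(distances,features,method,criterion):
    # Build the linkage tree from the square pairwise-distance matrix.
    Z = linkage(squareform(distances), method=method)
    # Start at t=1 (was np.arange(0, 100, 1)): with the distance criterion, a cut
    # at t=0 leaves every sequence in its own cluster, which silhouette_score rejects.
    t_list = np.arange(1, 100, 1)
    sil = []
    for t in t_list:
        IDX = fcluster(Z, t, criterion=criterion)
        # A single cluster has no silhouette score; record 0.0 and move on.
        if len(np.unique(IDX[IDX >= 0])) == 1:
            sil.append(0.0)
            continue
        sel = IDX >= 0
        sil.append(skmetrics.silhouette_score(features[sel, :], IDX[sel]))

    # Re-cut the tree at the threshold with the highest silhouette score.
    IDX = fcluster(Z, t_list[np.argmax(sil)], criterion=criterion)
    return IDX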

and it works !
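
A slightly more defensive variant of the same idea, as a sketch: it skips any threshold whose cut yields fewer than 2 or more than n_samples - 1 clusters (the range silhouette_score accepts, per the error message above) instead of relying on the starting value of t. The function name `pick_clusters` and the toy data are hypothetical. It computes Euclidean distances itself rather than taking DeepTCR's distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def pick_clusters(features, method="ward", criterion="distance", t_list=None):
    """Cut the linkage tree at the threshold with the best silhouette score."""
    if t_list is None:
        t_list = np.arange(1, 100, 1)
    Z = linkage(pdist(features), method=method)
    n_samples = features.shape[0]
    best_score, best_idx = -np.inf, None
    for t in t_list:
        idx = fcluster(Z, t, criterion=criterion)
        n_labels = len(np.unique(idx))
        # silhouette_score is only defined for 2 .. n_samples - 1 clusters.
        if n_labels < 2 or n_labels > n_samples - 1:
            continue
        score = silhouette_score(features, idx)
        if score > best_score:
            best_score, best_idx = score, idx
    if best_idx is None:
        raise ValueError("no threshold in t_list gives a valid number of clusters")
    return best_idx

rng = np.random.default_rng(0)
# Two tight, well-separated blobs as toy features.
toy = np.vstack([rng.normal(0, 0.05, (20, 2)), rng.normal(10, 0.05, (20, 2))])
labels = pick_clusters(toy)
```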

KNN_Sequence_Classifier

Hi,

Is it possible to have a Load_Previous_Data for the KNN_Sequence_Classifier function? It takes too much time to run.

I am currently using version 1.2.21
Thanks!

Run DeepTCR using more than one GPU

Hi,

I have a few questions regarding using GPU.

  1. I was wondering if I can run DeepTCR using multiple GPUs. I noticed that I am allowed to select which GPU I want to put the graph and train on if I have a multi-GPU environment. But does that mean I can only specify one GPU?
  2. I saw "I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" in my log. Some people say this is just a warning instead of an error and can be simply ignored. Have you encountered the same issue before? Any insight will be greatly appreciated.

Also, I was wondering if you could provide some insights on training an imbalanced dataset (binary classification) for this algorithm. Would you suggest using a balanced training dataset or including as much data as possible?

Thanks for your time and help!
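
On the imbalanced-data question, one common heuristic is inverse-frequency class weighting, the same formula scikit-learn uses for `class_weight='balanced'`. The Train signature shown in a traceback elsewhere on this page lists `weight_by_class` and `class_weights` parameters, but the format they expect is not shown, so the sketch below only computes the weights and does not pass them to DeepTCR. The function name is hypothetical:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy binary problem with a 10:90 imbalance.
labels = np.array(["pos"] * 10 + ["neg"] * 90)
weights = inverse_frequency_weights(labels)
```

With these weights the rare class contributes as much total weight as the common one, which is an alternative to discarding data to balance the training set.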

Performance

Thanks for writing this package. This package is very useful.

Tutorial: Clustering TCR Sequences list index out of Range Error when using DeepTCR in WSL

I tried to run the jupyter notebook tutorial as is after installing DeepTCR Development Version into its own Environment like this : pip3 install git+https://github.com/sidhomj/DeepTCR.git.

Running the Second Cell of the Tutorial (Phenograph Clustering) of the loaded Data I get a List index out of Range Error:

Command:

DTCRU.Cluster(clustering_method='phenograph')

Output:

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 3.4997947216033936 seconds
Jaccard graph constructed in 1.022956371307373 seconds
Wrote graph to binary file in 0.5937418937683105 seconds
Running Louvain modularity optimization

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-2-1c6d55ce009b> in <module>
----> 1 DTCRU.Cluster(clustering_method='phenograph')

~/.local/lib/python3.8/site-packages/DeepTCR/DeepTCR.py in Cluster(self, set, clustering_method, t, criterion, linkage_method, write_to_sheets, sample, n_jobs, order_by_linkage)
   1044 
   1045             elif clustering_method == 'phenograph':
-> 1046                 IDX, _, _ = phenograph.cluster(features, k=30, n_jobs=n_jobs)
   1047 
   1048             elif clustering_method == 'kmeans':

~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/cluster.py in cluster(data, k, directed, prune, min_cluster_size, jaccard, primary_metric, n_jobs, q_tol, louvain_time_limit, nn_method)
    114     uid = uuid.uuid1().hex
    115     graph2binary(uid, graph)
--> 116     communities, Q = runlouvain(uid, tol=q_tol, time_limit=louvain_time_limit)
    117     print("PhenoGraph complete in {} seconds".format(time.time() - tic), flush=True)
    118     communities = sort_by_size(communities, min_cluster_size)

~/.local/lib/python3.8/site-packages/DeepTCR/phenograph/core.py in runlouvain(filename, max_runs, time_limit, tol)
    261 
    262         # continue only if we've reached a higher modularity than before
--> 263         if q[-1] - Q > tol:
    264 
    265             Q = q[-1]

IndexError: list index out of range

What am I doing wrong ?

KNN_Repertoire_Classifier Error

Input data structure:
6 labels, each with 4 files. I tried folds=4, folds=5, and folds=10; all return the same error.

command is

DTCRU.KNN_Repertoire_Classifier(folds=10,
Load_Prev_Data=False,
metrics=['AUC', 'F1', 'Recall', 'Precision'],
plot_metrics=True, plot_type='box',
by_class=False,
n_jobs=40)

error msg start -----------------------------

File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/DeepTCR/DeepTCR.py", line 2290, in KNN_Repertoire_Classifier
sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 3724, in catplot
p.establish_colors(color, palette, 1)
File "/home/ubuntu/anaconda3/envs/dl/lib/python3.6/site-packages/seaborn/categorical.py", line 315, in establish_colors
lum = min(light_vals) * .6
ValueError: min() arg is an empty sequence

error msg end -----------------------------

Question about dropout rate

Hi.
Is the dropout probability in Convolutional_Features() 0.0 when training the unsupervised model? If not, where is this probability defined?

And did I understand correctly that there is no pooling between layers?

Question about incomplete data

Hi again,
How does DeepTCR deal with columns that have some missing values? For example, if there are some TCRBs that are missing the D gene.

Error when changing max_length

Hi,
In Train_VAE(), a ValueError is raised if a different max_length has been passed to DeepTCR_U. The tensor shapes do not match at line 2292, where tf.equal() is called.

Screenshot 2020-12-5 at 2.48.19
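
To illustrate why two tensors built with different length settings cannot be compared element-wise, here is a small sketch. It assumes that sequences are padded to `max_length` before being turned into arrays, which the thread does not confirm. The `pad` helper and the fill character are hypothetical.

```python
def pad(seq, max_length, fill="-"):
    """Right-pad a CDR3 string to max_length; reject sequences that are too long."""
    if len(seq) > max_length:
        raise ValueError(f"{seq!r} is longer than max_length={max_length}")
    return seq + fill * (max_length - len(seq))


seqs = ["CASTHLDPPGEQYFG", "CASGGGEQFFG"]
padded_40 = [pad(s, 40) for s in seqs]
padded_50 = [pad(s, 50) for s in seqs]

# An element-wise comparison only makes sense when both sides share max_length.
same_shape = len(padded_40[0]) == len(padded_50[0])
print(same_shape)
```

If one part of a model is built with one length and another part with a different one, a comparison such as `tf.equal()` sees mismatched shapes in the same way.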

Interpreting Sequence_Inference output

Hello,

I would like to train a supervised model on TCRs with known antigen specificity, then use that model to classify new TCR sequences as potentially targeting certain antigens. I have followed the tutorials but am still unclear on the best way to do this. I believe the closest is the "8 - VAE Inference.ipynb" tutorial, but with a supervised model rather than the unsupervised one. However, I am unclear on how to interpret the output of Sequence_Inference. I am using the example Mouse Antigens data for the model and Rudqvist as the new dataset. The resulting "features" object is 23856x9, which I believe corresponds to the individual TCR sequences (23856) and 9 different antigens, with the entries being scores for how well each TCR sequence fits that antigen.

  1. Does a higher or lower score mean the TCR sequence fits better with the given antigen?

I tried to assess this myself by looking at the features of the supervised model; however, this object has 224 columns. I was expecting it to have 9, corresponding to the different antigens.

  2. What do the columns of the features object from the supervised model correspond to?

  3. Would you suggest this method of classification, or something more akin to the tutorial "3 - Supervised Sequence Regression.ipynb"?

Thank you for your help!
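
If each row of the 23856x9 output is a vector of per-class probabilities, a higher value would indicate a better fit, and the predicted antigen for a sequence is the column with the largest entry. That is an assumption the thread does not confirm. The sketch below shows that reading with made-up scores and hypothetical class names; `top_class` is not a DeepTCR function.

```python
def top_class(scores, class_names):
    """Return (name, score) of the highest-scoring class for one sequence."""
    if len(scores) != len(class_names):
        raise ValueError("one score per class is required")
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best], scores[best]


class_names = ["antigen_A", "antigen_B", "antigen_C"]  # hypothetical labels
row = [0.10, 0.75, 0.15]                               # made-up scores
print(top_class(row, class_names))
```

The order of `class_names` has to match the column order of the model output, which should be checked against the class order the model was trained with.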

supervised learning example 2 error

I am running through the second supervised learning example using the Rudqvist data, and I keep getting an error during training.

Input:

from DeepTCR.DeepTCR import DeepTCR_WF

# Instantiate training object
DTCR_WF = DeepTCR_WF('Tutorial')
# Load Data from directories
DTCR_WF.Get_Data(directory='github/DeepTCR/Data/Rudqvist',
                 Load_Prev_Data=False,
                 aggregate_by_aa=True, aa_column_beta=1, count_column=2,
                 v_beta_column=7, d_beta_column=14, j_beta_column=21)
DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
DTCR_WF.Train()

Error:
Training_Statistics:
Epoch: 1/10000 Training loss: 1.39491 Validation loss: 1.38709 Testing loss: 1.36048 Training Accuracy: 0.41667 Validation Accuracy: 0.0 Testing Accuracy: 0.5 Testing AUC: 0.66667
Training_Statistics:
Epoch: 2/10000 Training loss: 1.37405 Validation loss: 1.37631 Testing loss: 1.36329 Training Accuracy: 0.33333 Validation Accuracy: 0.0 Testing Accuracy: 0.5 Testing AUC: 0.66667
Training_Statistics:
Epoch: 3/10000 Training loss: 1.35652 Validation loss: 1.36742 Testing loss: 1.36681 Training Accuracy: 0.41667 Validation Accuracy: 0.25 Testing Accuracy: 0.5 Testing AUC: 0.58333
Training_Statistics:
Epoch: 4/10000 Training loss: 1.34040 Validation loss: 1.35875 Testing loss: 1.37043 Training Accuracy: 0.66667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 5/10000 Training loss: 1.32491 Validation loss: 1.34920 Testing loss: 1.37438 Training Accuracy: 0.75 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 6/10000 Training loss: 1.30922 Validation loss: 1.33924 Testing loss: 1.37817 Training Accuracy: 0.83333 Validation Accuracy: 0.25 Testing Accuracy: 0.5 Testing AUC: 0.58333
Training_Statistics:
Epoch: 7/10000 Training loss: 1.29348 Validation loss: 1.32873 Testing loss: 1.38209 Training Accuracy: 0.83333 Validation Accuracy: 0.25 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 8/10000 Training loss: 1.27746 Validation loss: 1.31783 Testing loss: 1.38608 Training Accuracy: 0.91667 Validation Accuracy: 0.25 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 9/10000 Training loss: 1.26097 Validation loss: 1.30654 Testing loss: 1.39095 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 10/10000 Training loss: 1.24401 Validation loss: 1.29454 Testing loss: 1.39617 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 11/10000 Training loss: 1.22642 Validation loss: 1.28242 Testing loss: 1.40190 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 12/10000 Training loss: 1.20822 Validation loss: 1.27018 Testing loss: 1.40825 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.5
Training_Statistics:
Epoch: 13/10000 Training loss: 1.18927 Validation loss: 1.25744 Testing loss: 1.41549 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 14/10000 Training loss: 1.16937 Validation loss: 1.24402 Testing loss: 1.42367 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 15/10000 Training loss: 1.14860 Validation loss: 1.22993 Testing loss: 1.43312 Training Accuracy: 0.83333 Validation Accuracy: 0.75 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 16/10000 Training loss: 1.12682 Validation loss: 1.21520 Testing loss: 1.44380 Training Accuracy: 0.75 Validation Accuracy: 0.75 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 17/10000 Training loss: 1.10405 Validation loss: 1.20000 Testing loss: 1.45633 Training Accuracy: 0.75 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 18/10000 Training loss: 1.08021 Validation loss: 1.18428 Testing loss: 1.47065 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 19/10000 Training loss: 1.05532 Validation loss: 1.16801 Testing loss: 1.48714 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 20/10000 Training loss: 1.02967 Validation loss: 1.15104 Testing loss: 1.50585 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 21/10000 Training loss: 1.00291 Validation loss: 1.13350 Testing loss: 1.52693 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 22/10000 Training loss: 0.97514 Validation loss: 1.11567 Testing loss: 1.55058 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 23/10000 Training loss: 0.94636 Validation loss: 1.09774 Testing loss: 1.57718 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 24/10000 Training loss: 0.91668 Validation loss: 1.07971 Testing loss: 1.60673 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 25/10000 Training loss: 0.88618 Validation loss: 1.06198 Testing loss: 1.63951 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 26/10000 Training loss: 0.85506 Validation loss: 1.04472 Testing loss: 1.67605 Training Accuracy: 0.83333 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333
Training_Statistics:
Epoch: 27/10000 Training loss: 0.82325 Validation loss: 1.02834 Testing loss: 1.71696 Training Accuracy: 0.91667 Validation Accuracy: 0.5 Testing Accuracy: 0.25 Testing AUC: 0.58333

AttributeError Traceback (most recent call last)
in
1 DTCR_WF.Get_Train_Valid_Test(test_size=0.25)
----> 2 DTCR_WF.Train()

~/anaconda3/envs/dl/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in Train(self, batch_size, epochs_min, stop_criterion, stop_criterion_window, kernel, on_graph_clustering, num_clusters, weight_by_class, class_weights, trainable_embedding, accuracy_min, num_fc_layers, units_fc, drop_out_rate, suppress_output, use_only_seq, use_only_gene, use_only_hla, size_of_net, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla)
3223 GO.saver.save(sess, os.path.join(self.Name, 'model', 'model.ckpt'))
3224
-> 3225 self.HLA_embed = GO.embedding_layer_hla.eval()
3226
3227 with open(os.path.join(self.Name, 'model', 'model_type.pkl'), 'wb') as f:

AttributeError: 'graph_object' object has no attribute 'embedding_layer_hla'

System: MacOS 10.13.6
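
The traceback shows an unconditional read of `GO.embedding_layer_hla` on an object that does not define it, which would be expected if that attribute is only created when HLA data is loaded (an assumption). The sketch below shows the general defensive pattern with a stand-in class. It is not the library's code or the author's fix, and `optional_attr` is a hypothetical helper.

```python
class GraphObject:
    """Stand-in for the library's graph object; it has no HLA embedding here."""
    embedding_layer_aa = "aa-embedding"


def optional_attr(obj, name):
    """Return obj.<name> if it exists, otherwise None instead of raising."""
    return getattr(obj, name, None)


GO = GraphObject()
hla_embed = optional_attr(GO, "embedding_layer_hla")
print(hla_embed)
```

With this pattern the caller can skip saving the HLA embedding when no HLA input was used, rather than stopping after training has already finished.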

Shared Motif for Clusters after Phenograph clustering

Hello Mr. Sidhom,
thank you for creating DeepTCR! I am using the supervised sequence classifier, including HLA supertypes, for samples from different patients and treatments. My question is: is it possible to extract motifs that are shared within each cluster?
If I understood correctly, DTCR_SS.Representative_Sequences() and DTCR_SS.Motif_Identification() explicitly extract the sample-specific motifs.
Thank you in advance for your help!

VAE AUC violin plot Y-axis value >1

I am running the unsupervised tutorial with:

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

# Load Data from directories
DTCRU.Get_Data(directory='github/DeepTCR/Data/Rudqvist', Load_Prev_Data=False, aggregate_by_aa=True,
               aa_column_beta=1, count_column=2, v_beta_column=7, d_beta_column=14, j_beta_column=21)

# Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, accuracy_min=0.9)

Output-----------

Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 50.87618613243103 seconds
Jaccard graph constructed in 12.015304803848267 seconds
Wrote graph to binary file in 3.909883975982666 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.991586
Louvain completed 21 runs in 9.033319234848022 seconds
PhenoGraph complete in 75.99179887771606 seconds
Clustering Done
AUC__1
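
A likely reason for an axis that runs past 1, assuming the figure is a standard kernel-density violin plot, is that the smoothed density extends beyond the largest observed value even when no AUC exceeds 1. The standard-library sketch below demonstrates this with made-up AUC values; the bandwidth is arbitrary.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function estimating density as an average of Gaussian bumps."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data
        )

    return density


auc_values = [0.97, 0.98, 0.99, 1.0]  # made-up AUCs, none above 1
density = gaussian_kde(auc_values, bandwidth=0.02)

# The smoothed curve still has mass above the largest observed value.
print(density(1.03) > 0)
```

In seaborn, `violinplot` accepts `cut=0` to stop the density at the extreme data points. Whether DeepTCR exposes that option is not stated in the thread.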

Clustering error

Getting an error when using our data, after loading with:

DTCRU.Load_Data(beta_sequences=beta,v_beta=v_beta,j_beta=j_beta,class_labels=class_labels,
                sample_labels=sample_labels, counts=counts)

Training appears to have gone OK:

DTCRU.Train_VAE(Load_Prev_Data=False,accuracy_min=0.85)

image

But clustering appears to fail:
image

The same error occurs with the phenograph method, so it isn't specific to the clustering approach. It also happens when randomly sampling:

image

Is it possible that the clustering methods produce some outliers, causing "sel" to not be an integer? Or perhaps there is some metadata I need to set?

Other functions appear to work okay:

image

I see #1, which has a similar error, but my data exists as a single CSV file that I'm loading via pandas and slicing the necessary columns out of. As such, loading via directory doesn't appear to be an option.

Any ideas?

'LabelEncoder' object has no attribute 'classes_' when loading own data

Hey,

I've managed to load my own data, but I am having trouble training the VAE (which works on the training dataset).

Code so far:

from DeepTCR.DeepTCR import DeepTCR_U
import pandas as pd
import numpy as np
DeepTCR_input = pd.read_csv('/Users/gordonbeattie/Documents/Projects/Maria_2/TCR/DeepTCR/DeepTCR_input.tsv', sep= '\t')

alpha = np.genfromtxt(DeepTCR_input.TRA, dtype='str')
beta = np.genfromtxt(DeepTCR_input.TRB, dtype='str')
sample = np.genfromtxt(DeepTCR_input.sample_ID, dtype='str')

DTCRU.Load_Data(alpha_sequences=alpha,beta_sequences=beta, sample_labels=sample)

DTCRU.Train_VAE(Load_Prev_Data=False)

Which throws the following error:
AttributeError: 'LabelEncoder' object has no attribute 'classes_'

Thanks in advance for any assistance!

Tensor shape mismatch while running "2 - Supervised Repertoire Classification" Tutorial

I am running the tutorial as is. When I am training the model on the data, there seems to be a tensor shape mismatch.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-685fbc9e79fc> in <module>()
----> 1 DTCR_WF.Train()

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in Train(self, kernel, num_concepts, trainable_embedding, embedding_dim_aa, embedding_dim_genes, embedding_dim_hla, num_fc_layers, units_fc, weight_by_class, class_weights, use_only_seq, use_only_gene, use_only_hla, size_of_net, graph_seed, qualitative_agg, quantitative_agg, num_agg_layers, units_agg, drop_out_rate, multisample_dropout, multisample_dropout_rate, multisample_dropout_num_masks, batch_size, batch_size_update, epochs_min, stop_criterion, stop_criterion_window, accuracy_min, train_loss_min, hinge_loss_t, convergence, learning_rate, suppress_output, loss_criteria, batch_seed)
   5026               accuracy_min,train_loss_min,hinge_loss_t,convergence,learning_rate, suppress_output,
   5027                     loss_criteria)
-> 5028         self._train(write=True,batch_seed=batch_seed,iteration=0)
   5029 
   5030     def Monte_Carlo_CrossVal(self,folds=5,test_size=0.25,LOO=None,combine_train_valid=False,random_perm=False,seeds=None,

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/DeepTCR.py in _train(self, write, batch_seed, iteration)
   4747                 train_loss, train_accuracy, train_predicted,train_auc = \
   4748                     Run_Graph_WF(self.train,sess,self,GO,batch_size,batch_size_update,random=True,train=True,
-> 4749                                  drop_out_rate=drop_out_rate,multisample_dropout_rate=multisample_dropout_rate)
   4750 
   4751                 train_accuracy_total.append(train_accuracy)

/home/ubuntu/.local/lib/python3.6/site-packages/DeepTCR/functions/utils_s.py in Run_Graph_WF(set, sess, self, GO, batch_size, batch_size_update, random, train, drop_out_rate, multisample_dropout_rate)
    719         elif train:
    720             loss_i, accuracy_i, _, predicted_i = sess.run([GO.loss, GO.accuracy, GO.opt, GO.predicted],
--> 721                                                           feed_dict=feed_dict)
    722         else:
    723             loss_i, accuracy_i, predicted_i = sess.run([GO.loss, GO.accuracy, GO.predicted],

/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    954     try:
    955       result = self._run(None, fetches, feed_dict, options_ptr,
--> 956                          run_metadata_ptr)
    957       if run_metadata:
    958         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/home/ubuntu/anaconda3/envs/deeptcr/lib/python3.6/site-packages/tensorflow_gpu-1.15.2-py3.6-linux-x86_64.egg/tensorflow_core/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1154                 'Cannot feed value of shape %r for Tensor %r, '
   1155                 'which has shape %r' %
-> 1156                 (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
   1157           if not self.graph.is_feedable(subfeed_t):
   1158             raise ValueError('Tensor %s may not be fed.' % subfeed_t)

ValueError: Cannot feed value of shape (16, 1) for Tensor 'Placeholder_2:0', which has shape '(?, 4)'

I am running it on CentOS with an NVIDIA GPU. All the other tutorials seem to be working well.
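
The message says the tensor expects rows with 4 columns while the batch being fed has 1. If `Placeholder_2` is the label input (an assumption), this is what happens when labels arrive as a single column instead of a one-hot row per class. The sketch below shows one-hot encoding with hypothetical class names; it is not DeepTCR's own encoding code.

```python
def one_hot(labels, classes):
    """Encode each label as a row with a 1 in its class column and 0 elsewhere."""
    index = {c: i for i, c in enumerate(classes)}
    return [
        [1 if index[lab] == j else 0 for j in range(len(classes))]
        for lab in labels
    ]


classes = ["A", "B", "C", "D"]  # hypothetical class names
labels = ["B", "D"]
encoded = one_hot(labels, classes)
print(encoded)
```

Each encoded row has as many columns as there are classes, so a batch of 16 samples would have shape (16, 4) rather than (16, 1).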

Information

Hi sidhomj,

Very nice tool. I have a question: can it be used with data from the SMART-Seq v4 PLUS Kit or the SMARTer Human TCR ab Profiling kit?

Simone

load previous-trained model

Hi Dr. John William Sidhom:

Sorry to disturb you, but I have a question about DeepTCR. This is an amazing package, and I would like to use it for sequence-level prediction. Say I train a model named 'model' and store the intermediate files at /user. The program will then generate a subfolder 'model' and store the model checkpoint information there.

My question is how to load this pre-trained model and make predictions on new data. I saw on the website that 'sequence_inference' can 'load previous trained model', but I did not see an example in the tutorials. It would be extremely helpful if you could briefly tell me how to load a model from a previously generated folder and use it for prediction instead of training the model again.

Thanks in advance for your patience and have a great day!

No Sample Labels on Dendrogramplots

Hello Prof. Sidhom,
thank you for creating DeepTCR, it is a very useful & cool tool.

The issue I experience when creating the dendrogram plots is that the sample circles are labeled (by sample/class) in barely readable colors, almost indistinguishable from the background.

Code

DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',sample_labels=True)
DTCRU.Repertoire_Dendrogram(n_jobs=40,distance_metric='correlation',log_scale=True,Load_Prev_Data=True,sample_labels=True)

Output

UMAP transformation...
PhenoGraph Clustering...
Finding 30 nearest neighbors using minkowski metric and 'auto' algorithm
Neighbors computed in 5.6398985385894775 seconds
Jaccard graph constructed in 1.4102559089660645 seconds
Wrote graph to binary file in 0.9278509616851807 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.970462
Louvain completed 21 runs in 2.248636484146118 seconds
PhenoGraph complete in 10.259825944900513 seconds
Clustering Done
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
  cmap_viridis.set_under(color='white', alpha=0)
/home/patrick/anaconda3/envs/DEEPTCR_ENV/lib/python3.8/site-packages/DeepTCR/functions/utils_u.py:161: MatplotlibDeprecationWarning: You are modifying the state of a globally registered colormap. In future versions, you will not be able to modify a registered colormap in-place. To remove this warning, you can make a copy of the colormap first. cmap = copy.copy(mpl.cm.get_cmap("viridis"))
  cmap_viridis.set_under(color='white', alpha=0)

SPTCR_Dendrogram_correlation
SPTCR_Dendrogram_correlation_log

Is it possible to change the font color of the sample labels?

Sequence not featurized

Dear all,

I am trying to analyze a dataset but, for unknown reasons, some of the sequences are not considered.

My code is as follows:

%%capture
import sys
import pandas as pd
sys.path.append('../../')
from DeepTCR.DeepTCR import DeepTCR_U

# Instantiate training object
DTCRU = DeepTCR_U('Tutorial')

a_target="MA0"
#Load Data from directories
DTCRU.Get_Data(directory='data_deep_tcr/'+a_target,Load_Prev_Data=False,aggregate_by_aa=True,
               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)

#Train VAE
DTCRU.Train_VAE(Load_Prev_Data=False, size_of_net="small")
DTCRU.Cluster(clustering_method='phenograph', sample=500)
DFs = DTCRU.Cluster_DFs

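# Tag each cluster DataFrame with its index, then combine them.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
r_df = pd.concat([tdf.assign(cluster_index=i) for i, tdf in enumerate(DFs)])

fn = "result_clustering_MA0_clean.txt"
r_df.to_csv(fn)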

In the directory "data_deep_tcr/MA0" I have two subfolders. Each of these subfolders contains one TSV file with the following format:

aminoAcid	counts	v_beta	j_beta
CASTHLDPPGEQYFG	571795	hTRBV28	hTRBJ02-7
CASSPLGASGEQFFG	317906	hTRBV28	hTRBJ02-1
CASGGGEQFFG	104692	hTRBV12-3	hTRBJ02-1
CANEGASENTEAFFG	86447	hTRBV06-1	hTRBJ01-1
CASSFFPFNEQFFG	74908	hTRBV12-3	hTRBJ02-1

For example, I pass this sequence (with the v_beta and j_beta and the counts)
CANEGASENTEAFFG 73703 hTRBV06-8 hTRBJ01-1
but it is not clustered.

Whereas, the same sequence with a different count, v_beta and j_beta is clustered:
CANEGASENTEAFFG 86447 hTRBV06-1 hTRBJ01-17

Any idea why this is happening?
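
One possible explanation, not confirmed in the thread: the data is loaded with `aggregate_by_aa=True`, and if that option collapses rows sharing the same amino-acid sequence, only one V/J combination per sequence would survive. The sketch below shows that kind of aggregation on the two rows from the question. Summing the counts and keeping the gene calls of the most abundant row are assumptions about the behavior, and `aggregate_by_sequence` is a hypothetical helper.

```python
from collections import defaultdict


def aggregate_by_sequence(rows):
    """Sum counts of rows sharing a sequence; keep the V/J of the largest row."""
    groups = defaultdict(list)
    for seq, count, v, j in rows:
        groups[seq].append((count, v, j))
    merged = {}
    for seq, members in groups.items():
        total = sum(c for c, _, _ in members)
        _, v, j = max(members, key=lambda m: m[0])
        merged[seq] = (total, v, j)
    return merged


rows = [
    ("CANEGASENTEAFFG", 86447, "hTRBV06-1", "hTRBJ01-1"),
    ("CANEGASENTEAFFG", 73703, "hTRBV06-8", "hTRBJ01-1"),
]
print(aggregate_by_sequence(rows))
```

If this is what happens, loading with `aggregate_by_aa=False` would be the setting to try in order to keep both rows.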

Inquiry about Fig2.c Motifs Visualization in DeepTCR

Hello Dr. Sidhom,

I recently came across your “DeepTCR” paper and I found the idea of combining supervised and unsupervised learning and applying them to modeling TCR repertoires intriguing. Also, the performance in your paper is very impressive! I have some questions regarding Figure 2.c (titled representative TCRs and learned TCR motifs) that I hope you can help with.

  1. The learned TCR motifs in DeepTCR have length 5 or 4. Could you provide some justification for that?
  2. From the tutorial code, I noticed that around 30 different motifs are learned for each representative TCR; however, the two motifs shown in Fig. 2c do not seem to be the top two learned motifs. What principles did you follow in selecting those motifs?

Thank you in advance for your help!
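
On the first question, the `Train` signatures quoted in the tracebacks earlier on this page include a `kernel` parameter. If motif length follows the width of a convolutional kernel (an assumption, not something the thread confirms), then a kernel of width k only ever sees contiguous windows of k residues. The sketch below lists those windows for one CDR3 from this page; `windows` is a hypothetical helper.

```python
def windows(seq, k):
    """Return every contiguous length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


cdr3 = "CASGGGEQFFG"
print(windows(cdr3, 5))
```

Under that assumption, a different kernel width would yield motifs of a different length.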
