
tdc's Introduction


Website | Nature Chemical Biology 2022 Paper | NeurIPS 2021 Paper | Long Paper | Slack | TDC Mailing List | TDC Documentation | Contribution Guidelines

Artificial intelligence is poised to reshape therapeutic science. Therapeutics Data Commons is a coordinated initiative to access and evaluate artificial intelligence capability across therapeutic modalities and stages of discovery, supporting the development of AI methods, with a strong emphasis on establishing which AI methods are most suitable for drug discovery applications and why.

Researchers across disciplines can use TDC for numerous applications. AI-solvable tasks, AI-ready datasets, and curated benchmarks in TDC serve as a meeting point between biochemical and AI scientists. TDC facilitates algorithmic and scientific advances and accelerates machine learning method development, validation, and transition into biomedical and clinical implementation.

TDC is an open-science initiative. We welcome contributions from the community.

Key TDC Presentations and Publications

[1] Velez-Arce, Huang, Li, Lin, et al., TDC-2: Multimodal Foundation for Therapeutic Science, bioRxiv, 2024 [Paper]

[2] Huang, Fu, Gao, et al., Artificial Intelligence Foundation for Therapeutic Science, Nature Chemical Biology, 2022 [Paper]

[3] Huang, Fu, Gao, et al., Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, NeurIPS 2021 [Paper] [Poster]

[4] Huang et al., Benchmarking Molecular Machine Learning in Therapeutics Data Commons, ELLIS ML4Molecules 2021 [Paper] [Slides]

[5] Huang et al., Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development, Baylearn 2021 [Slides] [Poster]

[6] Huang, Fu, Gao et al., Therapeutics Data Commons, NSF-Harvard Symposium on Drugs for Future Pandemics 2020 [#futuretx20] [Slides] [Video]

[7] TDC User Group Meetup, Jan 2022 [Agenda]

[8] Zitnik, Machine Learning to Translate the Cancer Genome and Epigenome Session, AACR Annual Meeting 2022, Apr 2022

[9] Zitnik, Few-Shot Learning for Network Biology, Keynote at KDD Workshop on Data Mining in Bioinformatics

[10] Zitnik, Actionable machine learning for drug discovery and development, Broad Institute, Models, Inference & Algorithms Seminar, 2021

[11] Zitnik, Graph Neural Networks for Biomedical Data, Machine Learning in Computational Biology, 2020

[12] Zitnik, Graph Neural Networks for Identifying COVID-19 Drug Repurposing Opportunities, MIT AI Cures, 2020

Unique Features of TDC

  • Diverse areas of therapeutics development: TDC covers a wide range of learning tasks, including target discovery, activity screening, efficacy, safety, and manufacturing across biomedical products, including small molecules, antibodies, and vaccines.
  • Ready-to-use datasets: TDC is minimally dependent on external packages. Any TDC dataset can be retrieved using only 3 lines of code.
  • Data functions: TDC provides extensive data functions, including data evaluators, meaningful data splits, data processors, and molecule generation oracles.
  • Leaderboards: TDC provides benchmarks for fair model comparison and systematic model development and evaluation.
  • Open-source initiative: TDC is an open-source initiative. If you want to get involved, let us know.

overview

See here for the latest updates in TDC!

Installation

Using pip

To install the core environment dependencies of TDC, use pip:

pip install PyTDC

Note: TDC is in beta release. Please update your local copy regularly:

pip install PyTDC --upgrade

The core data loaders are lightweight with minimum dependency on external packages:

numpy, pandas, tqdm, scikit-learn, fuzzywuzzy, seaborn

For utilities requiring extra dependencies, TDC prints installation instructions. To install full dependencies, please use the following conda-forge solution.

Using conda

Data functions for molecule oracles, scaffold split, etc., require certain packages like RDKit. To install those packages, use the following conda installation:

conda install -c conda-forge pytdc

Tutorials

We provide tutorials to get started with TDC:

Name Description
101 Introduce TDC Data Loaders
102 Introduce TDC Data Functions
103.1 Walk through TDC Small Molecule Datasets
103.2 Walk through TDC Biologics Datasets
104 Generate 21 ADME ML Predictors with 15 Lines of Code
105 Molecule Generation Oracles
106 Benchmark submission
DGL Demo presented at DGL GNN User Group Meeting
U1.1 Demo presented at first TDC User Group Meetup
U1.2 Demo presented at first TDC User Group Meetup
201 TDC-2 Resource and Multi-modal Single-Cell API
202 TDC-2 Resource and PrimeKG
203 TDC-2 Resource and External APIs
204 TDC-2 Model Hub

Design of TDC

TDC has a unique three-tiered hierarchical structure which, to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major areas (i.e., problems) where machine learning can facilitate scientific advances, namely, single-instance prediction, multi-instance prediction, and generation:

  • Single-instance prediction single_pred: Prediction of a property given an individual biomedical entity.
  • Multi-instance prediction multi_pred: Prediction of a property given multiple biomedical entities.
  • Generation generation: Generation of new desirable biomedical entities.

problems
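For example, each problem maps to its own loader module. A minimal sketch (the single- and multi-instance loaders appear elsewhere in this README; the commented-out generation import is an assumption and its module/class name may differ):

from tdc.single_pred import ADME    # single-instance prediction
from tdc.multi_pred import DTI      # multi-instance prediction
# from tdc.generation import MolGen # generation (assumed name)

adme_data = ADME(name = 'HIA_Hou')  # property of an individual compound
dti_data = DTI(name = 'DAVIS')      # property of a drug-target pair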

The second tier in the TDC structure is organized into learning tasks. Improvement in these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

Finally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits of the dataset into training, validation, and test sets to simulate the type of understanding and generalization (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy) needed for transition into production and clinical implementation.

TDC Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building off the modularized "Problem -- Learning Task -- Data Set" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset. This hierarchical API design allows us to easily incorporate new tasks and datasets.

For a concrete example, to obtain the HIA dataset from the ADME therapeutic learning task in the single-instance prediction problem:

from tdc.single_pred import ADME
data = ADME(name = 'HIA_Hou')
# split into train/val/test with scaffold split methods
split = data.get_split(method = 'scaffold')
# get the entire data in the various formats
data.get_data(format = 'df')

You can see all the datasets that belong to a task as follows:

from tdc.utils import retrieve_dataset_names
retrieve_dataset_names('ADME')

See all therapeutic tasks and datasets on the TDC website!

TDC Data Functions

Dataset Splits

To retrieve the training/validation/test dataset split, you could simply type

data = X(name = Y)
data.get_split(seed = 42)
# {'train': df_train, 'val': df_val, 'test': df_test}

You can specify the splitting method, random seed, and split fractions in the function by e.g. data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2]). Check out the data split page on the website for details.
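For instance, a minimal sketch that reuses the HIA_Hou ADME loader from above and requests a 70/10/20 scaffold split with a fixed seed:

from tdc.single_pred import ADME

data = ADME(name = 'HIA_Hou')
split = data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2])
# report the size of each subset without assuming specific key names
print({subset: len(df) for subset, df in split.items()})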

Strategies for Model Evaluation

We provide various evaluation metrics for the tasks in TDC, which are described on the model evaluation page of the website. For example, to use the ROC-AUC metric, you could simply type

from tdc import Evaluator
evaluator = Evaluator(name = 'ROC-AUC')
score = evaluator(y_true, y_pred)

Data Processing

TDC provides numerous data processing functions, including label transformation, data balancing, pairing data to PyG/DGL graphs, negative sampling, database querying, and so on. For function usage, see our data processing page on the TDC website.
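As one concrete example, the MolConvert utility converts SMILES strings into graph representations for downstream models. A minimal sketch (it assumes the RDKit-enabled conda installation described above):

from tdc.chem_utils import MolConvert

# convert SMILES into 2D graph objects
converter = MolConvert(src = 'SMILES', dst = 'Graph2D')
graphs = converter(['CC(=O)OC1=CC=CC=C1C(=O)O'])  # aspirin, used only as an illustrative input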

Molecule Generation Oracles

For molecule generation tasks, we provide 10+ oracles for both goal-oriented and distribution learning. For detailed usage of each oracle, please check out the oracle page on the website. For example, to retrieve the GSK3β oracle:

from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CC(C)(C)....',
  'C[C@@H]1....',
  'CCNC(=O)....',
  'C[C@@H]1....'])

# [0.03, 0.02, 0.0, 0.1]

TDC Leaderboards

Every dataset in TDC is a benchmark, and we provide training/validation and test sets for it, together with data splits and performance evaluation metrics. To participate in the leaderboard for a specific benchmark, follow these steps:

  • Use the TDC benchmark data loader to retrieve the benchmark.

  • Use training and/or validation set to train your model.

  • Use the TDC model evaluator to calculate the performance of your model on the test set.

  • Submit the test set performance to a TDC leaderboard.

As many datasets share a therapeutics theme, we organize benchmarks into meaningfully defined groups, which we refer to as benchmark groups. Datasets and tasks within a benchmark group are carefully curated and centered around a theme (for example, TDC contains a benchmark group to support ML predictions of the ADMET properties). While every benchmark group consists of multiple benchmarks, it is possible to separately submit results for each benchmark in the group. Here is the code framework to access the benchmarks:

from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang') 
    # all benchmark names in a benchmark group are stored in group.dataset_names
    predictions = {}
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark = name, split_type = 'default', seed = seed)
    
    # --------------------------------------------- #
    #  Train your model using train, valid, test    #
    #  Save test prediction in y_pred_test variable #
    # --------------------------------------------- #

    predictions[name] = y_pred_test
    predictions_list.append(predictions)

results = group.evaluate_many(predictions_list)
# {'caco2_wang': [6.328, 0.101]}
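To make the skeleton runnable end to end, any model can produce y_pred_test. The sketch below plugs in a trivial mean-value baseline purely for illustration; it is not a recommended model:

import numpy as np
from tdc import BenchmarkGroup

group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark = name, split_type = 'default', seed = seed)

    # stand-in "model": predict the training-set mean for every test molecule
    y_pred_test = np.full(len(test), train['Y'].mean())
    predictions_list.append({name: y_pred_test})

results = group.evaluate_many(predictions_list)
print(results)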

For more information, visit here.

Cite Us

If you find Therapeutics Data Commons useful, cite our latest preprint, our NeurIPS paper, and our Nature Chemical Biology paper:

@article {Velez-Arce2024tdc,
	author = {Velez-Arce, Alejandro and Huang, Kexin and Li, Michelle and Lin, Xiang and Gao, Wenhao and Fu, Tianfan and Kellis, Manolis and Pentelute, Bradley L. and Zitnik, Marinka},
	title = {TDC-2: Multimodal Foundation for Therapeutic Science},
	elocation-id = {2024.06.12.598655},
	year = {2024},
	doi = {10.1101/2024.06.12.598655},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/06/21/2024.06.12.598655},
	eprint = {https://www.biorxiv.org/content/early/2024/06/21/2024.06.12.598655.full.pdf},
	journal = {bioRxiv}
}
@article{Huang2021tdc,
  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, 
          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={Proceedings of Neural Information Processing Systems, NeurIPS Datasets and Benchmarks},
  year={2021}
}
@article{Huang2022artificial,
  title={Artificial intelligence foundation for therapeutic science},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, 
          Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={Nature Chemical Biology},
  year={2022}
}

TDC is built on top of other open-source projects. If you use these datasets/functions in your research, please cite the original work as well. You can find the original paper for each function/dataset on the website.

Contribute

TDC is a community-driven and open-science initiative. To get involved, join our Slack Workspace and check out the contribution guide!

Contact

Reach us at [email protected] or open a GitHub issue.

Data Server

TDC is hosted on Harvard Dataverse under the persistent identifier https://doi.org/10.7910/DVN/21LKWG. When Dataverse is under maintenance, TDC datasets cannot be retrieved. This happens rarely; please check the status on the Dataverse website.

License

The TDC codebase is released under the MIT license. For individual dataset usage, please refer to the dataset license listed on the website.

tdc's People

Contributors

abearab, amva13, annaweber209, ayushnoori, benb111, clinuxmdl, futianfan, hadim, haneul-park, iwwwish, jacksonburns, jannisborn, kexinhuang12345, lanceknight, lilleswing, malcolmgreaves, marinkaz, nithanaroy, synapticarbors, wenhao-gao, yhr91, yuanqidu, yzhao062, zoe-v


tdc's Issues

DTI dataset

Hello, good afternoon. Greetings from Colombia.
I have a couple of questions about the datasets:
  • In the DTI datasets, are only positive interactions included, or also negative ones, i.e., pairs for which no interaction occurred?
  • In BindingDB_IC50, does the "Y" column refer to the IC50 or to another metric? If it is the IC50, in which units is it given (mol? kmol?)?

3D Dataloader Class

Describe the problem
Currently, TDC focuses on 2D data loaders (SMILES). Many important tasks (e.g., QM, protein structure, binding) are 3D. It is crucial to implement a 3D data loader class.

Describe the solution you'd like
A method wrapper that takes raw 3D data formats such as xyz or sdf and outputs a machine-learning-ready format designated by the user. These formats could include DataFrame objects, DGL/PyG objects, and so on.

Additional context
This is more of a backend class, such that future ML task classes can inherit from it.

How to use a .pt file to get predictions

Describe your question.

I am working on ADMET prediction. The work of TDC is fantastic. In tdc.single_pred, after I have trained a model, how do I use the saved .pt file to predict on new data and obtain y_pred, for either classification or regression? That is, 0 or 1 for classification and a real value for regression. Thank you very much.

animal to human genotype-phenotype prediction

Describe the problem

The gap between animal testing and human clinical trials is unfathomable. This is one of the reasons a drug can look good in preclinical studies but show no efficacy in clinical trials. ML can potentially alleviate the issue.

The task is defined as: Given genotype-phenotype data of animals and only the genotype data of humans, train the model to fit the phenotype from the genotype and transfer this model to humans.

Reference: https://www.nature.com/articles/sdata20149

Describe the solution you'd like

from tdc.single_pred import GenoPhenoTrans
data = GenoPhenoTrans(name = 'sbv-improver',  path = './data')

Additional context
N/A

Docking Score Oracle

Thanks for providing this tool! I'm looking forward to using it in our own molecular generation projects. I'm trying to follow the code snippet provided here: Docking Scores

After following the 4 instructions on that page, I consistently run into this error:

>>> oracle2 = Oracle(name='docking_score', software='vina', pyscreener_path='/global/home/users/adchen/pyscreener', pdbids=['5WIU'], center=(-18.2,14.4,-16.1), size=(15.4, 13.9, 14.5), buffer=10, path='./my_test/', num_worker=1, ncpu=4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/site-packages/tdc/oracles.py", line 21, in __init__
    self.assign_evaluator()
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/site-packages/tdc/oracles.py", line 163, in assign_evaluator
    self.evaluator_func = docking_meta(**self.kwargs)
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/site-packages/tdc/chem_utils.py", line 1686, in __init__
    self.scorer = screener(**kwargs)
  File "/global/home/users/adchen/pyscreener/pyscreener/docking/vina.py", line 107, in __init__
    path=path, verbose=verbose, **kwargs)
  File "/global/home/users/adchen/pyscreener/pyscreener/docking/base.py", line 93, in __init__
    self.receptors = receptors
  File "/global/home/users/adchen/pyscreener/pyscreener/docking/base.py", line 333, in receptors
    receptors = [self.prepare_receptor(receptor) for receptor in receptors]
  File "/global/home/users/adchen/pyscreener/pyscreener/docking/base.py", line 333, in <listcomp>
    receptors = [self.prepare_receptor(receptor) for receptor in receptors]
  File "/global/home/users/adchen/pyscreener/pyscreener/docking/vina.py", line 128, in prepare_receptor
    sp.run(args, stderr=sp.PIPE, check=True)
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/global/home/users/adchen/miniconda3/envs/gcpn3/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'prepare_receptor': 'prepare_receptor'

Relevant system details:

- CentOS Linux 7 (Core)
- Python 3.7 

If you could shed some light on this error, that would be great!
Thanks!

Difference between Benchmark Group and Datasets

You mention in Get Started that each dataset is a benchmark. I'm curious whether the data is the same for both, and whether the train/valid/test splits obtained through the dataset loader are the same as those obtained through the benchmark group.

Thanks for your answer!

Same drug-target pair has different affinities in Davis

Describe the bug
The Davis dataset is assumed to contain a unique affinity value for a drug-target pair. However, in TDC, there are duplicated drug-target pairs with different affinity values.

To Reproduce

from tdc.multi_pred import DTI
data = DTI('DAVIS', path='./data/TDC')
df = data.get_data()
df = df.drop(columns=['Drug', 'Target'])
df = df[(df['Drug_ID'] == 25243800) & (df['Target_ID'] == 'RET(V804M)')]
print(df)

Expected behavior
The expected output is given below. Different Y values were labeled for drug 25243800 and target RET(V804M).

        Drug_ID   Target_ID      Y
18196  25243800  RET(V804M)    4.8
18197  25243800  RET(V804M)    4.0
18198  25243800  RET(V804M)  350.0
18199  25243800  RET(V804M)  340.0

Environment:

  • TDC version: 0.3.0
  • davis.tab version on dataverse: 2021-01-09 (UNF:6:x6TTv0Um70rEZT/eL8eCtA==)

Additional context
When compared to the raw data of the Davis et al. paper, it looks like the four affinity values shown above should be assigned to the targets RET, RET(M918T), RET(V804L), and RET(V804M), respectively. It seems all four target IDs were overwritten with RET(V804M).
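A quick way to enumerate every affected pair is a plain pandas check (a sketch, not part of TDC):

from tdc.multi_pred import DTI

data = DTI('DAVIS', path='./data/TDC')
df = data.get_data()
# keep=False marks every member of a duplicated (Drug_ID, Target_ID) group
dups = df[df.duplicated(subset=['Drug_ID', 'Target_ID'], keep=False)]
print(dups[['Drug_ID', 'Target_ID', 'Y']].sort_values(['Drug_ID', 'Target_ID']))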

Make the mappings in `tdc.metadata` consistent

This is a minor issue, but I nevertheless believe it could improve the usability of TDC. I noticed that the category-task-dataset hierarchy as defined in the tdc.metadata module is not consistent. Specifically, category_names and admet_benchmark use Tox to identify the Toxicity Prediction task, while dataset_names uses Toxicity. Is this meant to be that way? As the Tox/Toxicity discrepancy is the only one I found there is an easy workaround, but I would consider changing it so that the meta-data completely defines the category-task-dataset hierarchy.

Cold protein split method

Thank you for this amazing library/API.

I was just wondering what method is used for protein cold splitting. Is it some sort of sequence or structural similarity metric?

ModuleNotFoundError: No module named 'tdc.chem_utils.featurize'

Hello, greetings from Colombia.
I am trying to convert a SMILES string to Graph2D, but the module is not recognized in the library.
I installed through conda (conda install -c conda-forge pytdc) and I am using Python 3.8 in a Jupyter notebook. The error is as follows:
Code:

from tdc.chem_utils import MolConvert
converter = MolConvert(src = 'SMILES', dst = 'Graph2D')
converter(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)\C(=O)OC',
'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
Error:

ModuleNotFoundError: No module named 'tdc.chem_utils.featurize'

I would appreciate any information on this matter.

scikit-learn version fixed

Describe the problem
A clear and concise description of what the problem is.

Describe the solution you'd like
A clear and concise description of what you want to happen. Ideally, with pseudo-code.

Additional context
Add any other context or screenshots about the feature request here.

ddi.py has no "import numpy as np"

Describe the bug
ddi.py has no "import numpy as np" which results in an error in print_stats under data loader.

To Reproduce
data = DDI(name = 'DrugBank_DDI', print_stats = True)

Expected behavior
"print_stat" can show the statistics of the dataset.

Screenshots
File "C:\Users\hp\Anaconda3\envs\bio-env\lib\site-packages\tdc\multi_pred\ddi.py", line 44, in init
self.print_stats()
File "C:\Users\hp\Anaconda3\envs\bio-env\lib\site-packages\tdc\multi_pred\ddi.py", line 53, in print_stats
self.entity1.tolist() + self.entity2.tolist()))) + ' unique drugs.',
NameError: name 'np' is not defined

Environment:

  • OS: Windows 10
  • Python version: 3.7.3
  • TDC version: pytdc 0.1.9
  • Any other relevant information:

Additional context
ppi.py also has no "import numpy as np".
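Until the import is added, one workaround is to skip print_stats and compute the statistic manually. A sketch (the Drug1_ID/Drug2_ID column names are assumed and may differ):

import numpy as np
from tdc.multi_pred import DDI

# omit print_stats=True so the failing code path is never reached
data = DDI(name = 'DrugBank_DDI')
df = data.get_data()
# the same count that print_stats tries to report
n_drugs = len(np.unique(df['Drug1_ID'].tolist() + df['Drug2_ID'].tolist()))
print(str(n_drugs) + ' unique drugs.')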

Load JNK3 Oracle Error

Similar errors occurred for GSK3B and DRD2.

Traceback (most recent call last):
  File "/home/yqdu/moflow-master/data/process_property.py", line 24, in <module>
    jnk3_oracle = Oracle('JNK3')
  File "/home/yqdu/anaconda3/envs/rdkit/lib/python3.9/site-packages/tdc/oracles.py", line 21, in __init__
    self.assign_evaluator()
  File "/home/yqdu/anaconda3/envs/rdkit/lib/python3.9/site-packages/tdc/oracles.py", line 51, in assign_evaluator
    oracle_object = jnk3()
  File "/home/yqdu/anaconda3/envs/rdkit/lib/python3.9/site-packages/tdc/chem_utils.py", line 597, in __init__
    self.jnk3_model = pickle.load(f)
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
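A likely cause (an interpretation, not an official answer) is a scikit-learn version mismatch: the pickled oracle model references sklearn.ensemble.forest, which newer scikit-learn releases renamed to sklearn.ensemble._forest. One hedged workaround is to alias the old module path before loading the oracle; downgrading scikit-learn is the simpler alternative:

import sys
import sklearn.ensemble._forest as _forest

# map the pre-0.24 module path referenced inside the pickle to the renamed module
sys.modules['sklearn.ensemble.forest'] = _forest

from tdc import Oracle
oracle = Oracle('JNK3')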

DrugBank dataset shows error

Describe your question.

(screenshots attached in the original issue)

When I load the DrugBank dataset, it throws an error tokenizing the data at row 666. Could the original dataset have a problem?
Thank you.

KIBA

Hello, Greetings from Colombia
One Question:
The KIBA paper mentions a threshold of 1000 nM for positive cases, but in the TDC library the values only go up to about 11, 15, etc. Why is that? If we are working in nM, shouldn't the values be 11000 nM or 15000 nM? Or are the units in the library µM, or something else? I would expect the KIBA units to be the same as those used for Kd and Ki.

New Dataset: String - PPI

Describe the problem
The STRING database has a large number of known PPIs. We currently have HuRI, but we would love to add this to our PPI task as well. This would be a relatively simple task for a new user to get acquainted with TDC. We would need the protein ID and the protein sequence, which may require some mapping.

The dataset can be found at https://string-db.org/.

Describe the solution you'd like

from tdc.multi_pred import PPI

data = PPI(name = 'String', path = './data')

Additional context
N/A

drug side effect frequency prediction

Describe the problem
Currently, TDC has DDI data. However, it only contains relation types between drugs. The frequency of these interactions is also crucial. This is a new task, similar to DDI, under the multi_pred problem setup.

The dataset can be found at https://www.nature.com/articles/s41467-020-18305-y.

Describe the solution you'd like

from tdc.multi_pred import DDIFreq

data = DDIFreq(name = 'Galeano', path = './data')

Additional context
N/A

antisense oligo efficacy prediction

Describe the problem
Antisense oligonucleotide (ODN) therapy is a promising therapeutic class. ODNs bind to mRNA to prevent it from being translated into protein. The efficacy of ODNs against an mRNA can be predicted by machine learning. This task requires further mining of available ODN databases, since many public ones such as ODNBase and AOBase are no longer online.

Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-122

Describe the solution you'd like

from tdc.multi_pred import ODN
data = ODN(name = 'DATA_NAME',  path = './data')

Additional context
N/A

Support fcd-torch

The current FCD computation relies on the official FCD package available at https://github.com/bioinf-jku/FCD. See

def fcd_distance(generated_smiles_lst, training_smiles_lst): ...

The official implementation uses TensorFlow, which can be a dependency issue when the stack is PyTorch-based.

A PyTorch implementation already exists at https://github.com/rynkl/fcd-torch (also available on conda-forge), and we have already used it with success.

It would be nice if TDC could support both (using an arg to fcd_distance to select the appropriate "backend").
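A sketch of what the proposed API could look like (hypothetical signature; the fcd-torch argument order is assumed from its README and may differ):

def fcd_distance(generated_smiles_lst, training_smiles_lst, backend = 'tensorflow'):
    """Frechet ChemNet Distance with a selectable backend (sketch)."""
    if backend == 'pytorch':
        # assumes the fcd-torch package (https://github.com/rynkl/fcd-torch) is installed
        from fcd_torch import FCD
        return FCD()(training_smiles_lst, generated_smiles_lst)  # (ref, gen) order assumed
    # otherwise fall back to the current TensorFlow-based implementation
    raise NotImplementedError('plug the existing official-FCD code path in here')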

New dataset: hERG Central

In a 2011 paper, Du et al. describe a database of hERG data on 300k+ compounds. The website they built with the data has since been taken down, but Cambridge MedChem Consulting reconstructed the dataset using the assay data from the paper's supplementary material along with structures pulled from PubChem. The resulting dataset is available as a Zip file, and I've found it to be a useful dataset for experimentation.

I propose adding this data to TDC and would be happy to submit a pull request. Let me know!

Is Acute Toxicity LD50 a regression problem rather than a binary classification problem?

Thanks to the team for providing these benchmarks. When I use the Acute Toxicity LD50 dataset, I find that the sample labels are all real numbers (see the screenshot in the original issue).
However, the task description says that it is a binary classification problem. The way I load the data is as follows; is there any mistake in it?

from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')
split = data.get_split()
print(split['train'])

Extend PyG dataloader support

It would be great if closer integration with PyTorch Geometric could be provided for the various TDC datasets. While it is straightforward to convert, e.g., SMILES into a graph format for single instances via tdc.chem_utils.MolConvert (I am not sure, however, how to interpret the node features or how to include edge features and labels), it is not clear to me how this can be turned into a PyTorch Geometric compatible dataset/dataloader with the desired features. For example, the OGB repository provides support as follows:

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader in older PyG versions

dataset = PygGraphPropPredDataset(name='ogbg-molhiv')
split_idx = dataset.get_idx_split()
batch_size = 32
dl_tr = DataLoader(dataset[split_idx['train']], batch_size=batch_size, shuffle=True)
dl_val = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
dl_te = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)

Maybe I am missing something here and the above is already possible via the otherwise great TDC library?
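In the meantime, one possible recipe is to convert a TDC split to PyG graphs via MolConvert and feed them to a standard DataLoader. This is a sketch only, not an official TDC dataloader; it assumes MolConvert accepts 'PyG' as a destination format, which may differ across versions:

import torch
from torch_geometric.loader import DataLoader
from tdc.single_pred import ADME
from tdc.chem_utils import MolConvert

data = ADME(name = 'HIA_Hou')
split = data.get_split(method = 'scaffold')

# convert SMILES to PyG graph objects and attach the labels
converter = MolConvert(src = 'SMILES', dst = 'PyG')
graphs = converter(split['train']['Drug'].tolist())
for g, y in zip(graphs, split['train']['Y'].tolist()):
    g.y = torch.tensor([y], dtype = torch.float)

dl_tr = DataLoader(graphs, batch_size = 32, shuffle = True)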

Collect biomolecule helper function

Describe the problem

We are going to include more biologic therapeutic development tasks. The first step is to develop a macromolecule version of MolConvert to read, parse, and featurize biomolecules. Basic functionality such as removing waters and ions and adding/removing hydrogens is required.

Describe the solution you'd like

from tdc.chem_utils import PDBConvert
converter = PDBConvert(src = 'pdbid', dst = 'MSA')
# src could be a PDB ID or a local pdb file; dst could be typical
# features of a pdb, such as distance matrix, sequence, secondary
# structure, multi-sequence alignment, coordinates, 3D graph, etc.
converter(['4UNN', '7l11'])

Additional context

Probably a good starting point: https://www.pyrosetta.org, https://biopython.org

New dataset: Shields et al. Reaction Outcome

Describe the problem

This is a new dataset on reaction outcomes. The authors collected a large benchmark dataset for a palladium-catalysed direct arylation reaction. We would love to add it to our ReactionYield task. This would be a relatively simple task for a new user to get acquainted with TDC.

Describe the solution you'd like

from tdc.single_pred import Yields
data = Yields(name = 'Shields')

Additional context

Reference: https://www.nature.com/articles/s41586-021-03213-y

A typical single-pred type task.
Step-by-step guide: https://github.com/mims-harvard/TDC/blob/master/CONTRIBUTE.md

HTS Screening Dataset

Describe the problem
The current HTS dataset in TDC is single-assay only. One large-scale HTS resource is ChEMBL, which has 290,041 assays and 1,057,015 compounds, amounting to 6,882,639 data points.

Reference: https://pubs.rsc.org/en/content/articlelanding/2018/SC/C8SC00148K

Describe the solution you'd like

from tdc.single_pred import HTS
data = HTS(name = 'ChEMBL_HTS', label = 'ASSAY NAME', path = './data')

Additional context
N/A

Expand QM7/8/9 Dataset Formats

Describe the problem
The current QM7/8/9 datasets only have the Coulomb matrix as the X format. We should expand them to the 3D data formats users would like.

Describe the solution you'd like
Update the tdc.single_pred.qm class to inherit from the 3D dataloader class. Users can specify the format they want, and TDC will convert it in the backend.

Additional context
This is blocked by the 3D dataloader class ticket.

Unable to run `.get_label_meaning()` for the TWOSIDES dataset

Hi,

When I try to call the function .get_label_meaning() on the TWOSIDES dataset I get the following error:

python3.8/site-packages/tdc/base_dataset.py in get_label_meaning(self, output_format)
    163             dict/pd.DataFrame/np.array: when output_format is dict/df/array
    164         """
--> 165         return utils.get_label_map(self.name, self.path, self.target,
    166                                    file_format=self.file_format,
    167                                    output_format=output_format)

AttributeError: 'DDI' object has no attribute 'target'

I am trying to understand what Y represents for this dataset. I assume the integers are mappings to reported side effects, but it would be great to get that clarified.

Environment:

  • OS: MacOS 11.6
  • Python version: 3.8.12
  • TDC version: 0.3.1

Unused fold_seed in create_scaffold_split function

The method single_pred.DataLoader.get_split passes the seed argument to the single_pred.create_scaffold_split function; however, the latter doesn't use the corresponding fold_seed parameter anywhere. Is this the expected behavior?

Add pointers to data processing script & ChEMBL dataset update

The overall question:
Is it possible to describe the preprocessing and data origin for different datasets?

Explanation:
I am currently looking into using ChEMBL via TDC. However, it is essential to know which version of the dataset is provided here and how it was preprocessed (cleaned, etc.) for reproducibility purposes. This is especially important because ChEMBL has recurring releases. In the source code, the file is downloaded from "https://dataverse.harvard.edu/api/access/datafile/" without any explanation (TDC/tdc/utils/load.py).

A Solution:
Add a data origin/preprocessing section to the documentation.

New PROTACs dataset

Describe the problem
PROTACs are a class of therapeutics that flag target proteins for the cell's waste-disposal system, instead of traditionally binding to targets. They have been very promising due to their ability to tackle undruggable targets. However, problems still exist, such as designing the PROTACs, measuring degradability, and understanding the mechanism. A recent paper https://www.biorxiv.org/content/10.1101/2021.09.27.462040v1 has a nice dataset for predicting degradability given a protein PDB structure. The dataset is linked here: http://mapd.cistrome.org/

Describe the solution you'd like

from tdc.single_pred import PROTACs_Degrade
data = PROTACs_Degrade(name = 'MAPD',  path = './data')

Additional context
N/A

Adding gene symbols for GDSC

Describe the problem
Add a function for retrieving gene symbols for the GDSC gene expression features.

Describe the solution you'd like

from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC1')
data.get_gene_symbols()
## ['TSPAN6', 'TNMD', ....]

Similarly for GDSC2.

Additional context
NA

New TCR-Epitope Binding Affinity Prediction Task

Describe the problem
T-cells are an integral part of the adaptive immune system; their survival, proliferation, activation, and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificities is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR and epitope sequences.

Describe the solution you'd like

from tdc.multi_pred import TCR_Epitope_Binding
data = TCR_Epitope_Binding(name = 'Weber',  path = './data')

Additional context
Reference: https://academic.oup.com/bioinformatics/article/37/Supplement_1/i237/6319659
Data folder: https://ibm.ent.box.com/v/titan-dataset

Why have cross validation split?

First of all, amazing work. Having these datasets and official challenges is very helpful to the community.

Having an explicit cross-validation split seems odd to me. Shouldn't it be up to the model to figure out what it needs to generalize to the test set? A specifically chosen cross-validation set would be useful if it were drawn from the same distribution as the test set, but this does not seem to be the case.

(screenshot of the label distributions attached in the original issue)

As the screenshot shows, the validation distribution differs from the training set but is also very different from the test set. What is the point of cross-validating against it? One could argue that in a time-series split one would not know the future distribution of the test set, but that brings us back to why this benchmark sets a cross-validation split at all.

Are there rules for how to use the cross-validation set? Is your model only allowed to "peek" at these compounds so many times? Are ensemble models that do k-fold or bootstrap cross-validation discouraged from the competition?

PDB Binding - Protein Ligand

Describe the problem

  1. PDBBind is a dataset of 3D proteins and 3D ligands together with their binding affinity scores.
  2. Write protein and small-molecule 3D classes, formulate the task, and clean the datasets.
  3. Data is available at http://www.pdbbind.org.cn

Describe the solution you'd like

from tdc.multi_pred import ProteinLigandBinding
data = ProteinLigandBinding(name = 'pdbbind')

Additional context

Wrong SMILES in BindingDB

I found that there are 4 invalid SMILES in BindingDB. When I execute the following code, it raises an error.

from rdkit import Chem

mol = Chem.MolFromSmiles(smiles)
c_size = mol.GetNumAtoms()

I already fixed 3 of them, but I do not know how to fix the last one. Can you correct it and update the TDC data source file?

'COc1cc2ncnc(Oc3cccc(NC(=O)NC4=CC(=[N](N4)c4ccccc4)C(F)(F)F)c3)c2cc1OC' -> 'COc1cc2ncnc(Oc3cccc(NC(=O)NC4=CC(=[N+](N4)c4ccccc4)C(F)(F)F)c3)c2cc1OC', 
'COc1cc2ncnc(Sc3cccc(NC(=O)NC4=CC(=[N](N4)c4ccccc4)C(F)(F)F)c3)c2cc1OC' -> 'COc1cc2ncnc(Sc3cccc(NC(=O)NC4=CC(=[N+](N4)c4ccccc4)C(F)(F)F)c3)c2cc1OC', 
'COc1cc2ncnc(Sc3cccc(NC(=O)NC4=CC(=[N](C)N4)C(F)(F)F)c3)c2cc1OC' -> 'COc1cc2ncnc(Sc3cccc(NC(=O)NC4=CC(=[N+](C)N4)C(F)(F)F)c3)c2cc1OC', 
'Cc1ccc(F)c(NC(=O)Nc2cnn(c2)-c2cccc3nnc(N)c23)c1'

Thanks:)
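To locate such rows programmatically, a small RDKit/pandas sketch (BindingDB_IC50 is used here as an example; the affected subset may differ):

from rdkit import Chem
from tdc.multi_pred import DTI

data = DTI(name = 'BindingDB_IC50')
df = data.get_data()
# flag rows whose SMILES (the 'Drug' column) RDKit cannot parse
bad = df[df['Drug'].apply(lambda s: Chem.MolFromSmiles(s) is None)]
print(bad[['Drug_ID', 'Drug']].drop_duplicates())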

TDC module integration with DeepChem

Describe the problem
DeepChem has many useful drug discovery ML models. It would be great to implement TDC dataloader in DeepChem such that users in DeepChem can directly utilize TDC datasets.

Describe the solution you'd like
DeepChem has dataset classes for many MolNet datasets (e.g., https://github.com/deepchem/deepchem/blob/master/deepchem/molnet/load_function/bace_datasets.py). We can modify them to create a meta deepchem.tdc / deepchem.molnet.tdc dataloader that uses TDC in the backend to retrieve the csv file in MolNet format and feed it to the DeepChem data loader. We can first limit our scope to ADMET and some other single-instance prediction tasks, since they have a similar format to MolNet.

Additional context
Ideally, it would be great to have someone who is familiar with DeepChem to help on this.

QM oracle

Describe the problem

Computed optimized geometric, energetic, electronic, and thermodynamic properties help small-molecule pharmaceutical development and have been used in single-pred tasks as benchmarks (e.g., QM9). On the other hand, it would be desirable to have an easy-to-access oracle to calculate those values and use it to benchmark algorithms that require on-the-fly oracle calls, such as reinforcement learning.

Describe the solution you'd like

from tdc import Oracle
oracle = Oracle(name = 'energy_HOMO')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [xx, xx, xx] in eV

Additional context
Probably a good starting point: https://pyscf.org

re-train a pretrained model

Describe your question.
Dear TDC team,
Is it possible to re-train a pre-trained model on different datasets? For example, if I have a DTI model trained on the BindingDB dataset, can I re-train it with the PDB dataset to further improve it? Thanks a lot.

Should regression targets be log scaled?

I've found for a number of the ADME regression targets (volume of distribution, half life, etc.) that there is much clearer signal when I regress against the log of the target. The target distributions are fairly heavy-tailed in these cases, so without this transform, a handful of points can drive the loss.

Does it make sense to check whether regression targets should be transformed, and if so, to update the dataset generation accordingly?
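As an illustration of the transform (a sketch; VDss_Lombardo is assumed here to be one of the heavy-tailed ADME regression datasets, and the exact dataset name may differ):

import numpy as np
from tdc.single_pred import ADME

data = ADME(name = 'VDss_Lombardo')
split = data.get_split()

# regress against log-scaled targets so a handful of extreme values no longer dominates the loss
y_train = split['train']['Y'].values
y_train_log = np.log10(y_train + 1e-9)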

FreeSolv missing from ADME datasets

After upgrading to the newest TDC, I'm getting the following error:

ValueError: ('hydrationfreeenergy_freesolv', 'does not match to available values. Please double check.')

when trying to run:

from tdc.single_pred import ADME
data = ADME(name = 'hydrationfreeenergy_freesolv')

Is there any reason why FreeSolv is missing from the following file?

adme_dataset_names = ['lipophilicity_astrazeneca',

Support multiple instances in cold split

Describe the problem
The current cold split can only split by instances of one modality, but not on multiple modalities. For example, the DrugRes dataset cannot be split such that test samples contain drugs and cell lines that are both unseen to the model after training.

Describe the solution you'd like
A functionality like this:

split = data.get_split(method = 'cold_split', column_names = ['Drug', 'Cell Line'])

that works for any multi-instance dataset and can split on multiple columns.

Is this a feature you have conceived of, but intentionally refrained from implementing? Or would this be a valuable contribution to the package?

The MAE of ld50 dataset and the hyperparameters of the model

Describe your question.

Hello, I want to ask about a result in your paper. MAE is mean absolute error, so why is the LD50 dataset shown as "the bigger the better" in the paper? In addition, did you use any tricks during model training? If not, perhaps you could consider adding some, such as DyReLU or CosineAnnealingLR. Also, on some datasets I cannot reach the reported results; could that be related to the model hyperparameters?

(screenshot attached in the original issue)

New data: docking target

Describe the problem

We are constantly looking for disease targets. Currently, we are focusing on small-molecule drugs, so more PDB structures with specified docking pockets are welcome. We would love to add them as oracles for molecular optimization tasks. Specifically, we need a cleaned target pdb file and the corresponding pocket information: center and box size. An introduction to the target and its relation to disease is also required.

Describe the solution you'd like

from tdc import Oracle
oracle = Oracle(name = "pdbid_docking")
oracle(smiles) # score = -11.3

Additional context

The DUD-E dataset might be a good starting point: http://dude.docking.org

import error after pip installation

Hi everyone,

I installed PyTDC as follows:

pip3 install PyTDC

It currently installs PyTDC-0.4.1. import tdc works fine. However, the following import gives the error message below:

In [1]: from tdc.chem_utils import MolConvert                                   
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-157163aceaa9> in <module>
----> 1 from tdc.chem_utils import MolConvert

~/.local/lib/python3.8/site-packages/tdc/chem_utils/__init__.py in <module>
      1 from .evaluator import validity, uniqueness, novelty, diversity, kl_divergence, fcd_distance
----> 2 from .featurize.molconvert import MolConvert
      3 from .oracle.oracle import PyScreener_meta, Vina_3d, Score_3d, Vina_smiles, molecule_one_retro, ibm_rxn, \
      4                                         askcos, isomers_c7h8n2o2, isomers_c9h10n2o2pf2cl, \
      5                                         valsartan_smarts, scaffold_hop, deco_hop, \

ModuleNotFoundError: No module named 'tdc.chem_utils.featurize'

What should I do to fix this?

Thanks,
