MolKGNN is a deep learning model for predicting biological activity or molecular properties. Its key features are (1) SE(3)-invariance, (2) conformation-invariance, and (3) interpretability. MolKGNN uses a novel molecular convolution that leverages the similarity between molecular neighborhoods and kernels. It shows superior results on realistic drug discovery datasets.

Molecular-Kernel Graph Neural Network (MolKGNN)

By Yunchao "Lance" Liu, Yu Wang, Oanh Vu, Rocco Moretti, Bobby Bodenheimer, Jens Meiler, Tyler Derr

This repository is the official implementation of MolKGNN from the paper "Interpretable Chirality-Aware Graph Neural Network for Quantitative Structure Activity Relationship Modeling in Drug Discovery", accepted at AAAI 2023.

The supplementary material can be found here.

Please cite our paper if you find MolKGNN useful in your work:

@article{Liu_Wang_Vu_Moretti_Bodenheimer_Meiler_Derr_2023,
  title={Interpretable Chirality-Aware Graph Neural Network for Quantitative Structure Activity Relationship Modeling in Drug Discovery},
  volume={37},
  url={https://ojs.aaai.org/index.php/AAAI/article/view/26679},
  DOI={10.1609/aaai.v37i12.26679},
  number={12},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Liu, Yunchao (Lance) and Wang, Yu and Vu, Oanh and Moretti, Rocco and Bodenheimer, Bobby and Meiler, Jens and Derr, Tyler},
  year={2023},
  month={Jun.},
  pages={14356-14364}
}

MolKGNN is a deep learning model based on Graph Neural Networks (GNNs) for molecular representation learning. Its key features are:

  1. SE(3)-invariance
  2. Conformation-invariance
  3. Interpretability
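
To make the first property concrete: a quantity is SE(3)-invariant if it stays the same when the molecule is rotated or translated in 3D. The sketch below is not MolKGNN's convolution. It is a minimal standalone illustration, with hypothetical function names and made-up coordinates, that uses the signed volume spanned by a center atom and three neighbors. This value is unchanged by rotation and translation but flips sign under mirroring, which is the kind of behavior a chirality-aware model needs.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def signed_volume(center, n1, n2, n3):
    """Scalar triple product (n1-c) . ((n2-c) x (n3-c))."""
    a, b, c = sub(n1, center), sub(n2, center), sub(n3, center)
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2]


def rotate_z_then_translate(p, angle, shift):
    """Apply one SE(3) transform: rotation about the z-axis, then a shift."""
    x, y, z = p
    ca, sa = math.cos(angle), math.sin(angle)
    return (ca * x - sa * y + shift[0], sa * x + ca * y + shift[1], z + shift[2])


def mirror_x(p):
    """Reflect through the yz-plane (not an SE(3) transform)."""
    return (-p[0], p[1], p[2])


# Hypothetical coordinates: a center atom followed by three neighbors.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

original = signed_volume(*points)
moved = signed_volume(
    *[rotate_z_then_translate(p, 0.7, (1.5, -2.0, 0.3)) for p in points]
)
mirrored = signed_volume(*[mirror_x(p) for p in points])

print("original:", original)
print("after rotation + translation:", moved)
print("after mirroring:", mirrored)
```

The first two values agree up to floating-point error, while the mirrored value has the opposite sign. Pairwise distances alone would not tell the mirrored arrangement apart from the original.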

My blog explaining this paper.

Acquire the Datasets

This repository does NOT include the datasets used in the experiments. Please download the datasets from this link.

These are well-curated, realistic datasets with false positives removed, covering a diverse set of important drug targets. The datasets are also highly imbalanced (many more inactive molecules than active ones). For the original papers of the datasets, see references [1,2].

Introduction of the Datasets

High-throughput screening (HTS) is the use of automated equipment to rapidly screen thousands to millions of molecules for the biological activity of interest in the early drug discovery process [3]. However, this brute-force approach has low hit rates, typically around 0.05%-0.5% [4]. Meanwhile, PubChem [5] is a database supported by the National Institutes of Health (NIH) that contains biological activities for millions of drug-like molecules, often from HTS experiments. However, the raw primary screening data from PubChem have a high false positive rate [6]. A series of secondary experimental screens on putative actives is used to remove these false positives. While all relevant screens are linked, the datasets of molecules are often not curated to list all inactive molecules from the primary HTS and only confirmed actives after secondary screening. Thus, we identified nine high-quality HTS experiments in PubChem covering all important target protein classes for drug discovery. We carefully curated these datasets to have lists of inactive and confirmed active molecules.
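
To get a feel for what a 0.05%-0.5% hit rate means for class balance, here is a small back-of-the-envelope calculation. The library size of 100,000 molecules is a hypothetical number chosen for illustration, not a statistic of the datasets in this repository.

```python
def expected_hits(num_screened, hit_rate_percent):
    """Expected number of active molecules for a given hit rate in percent."""
    return num_screened * hit_rate_percent / 100.0


def inactives_per_active(hit_rate_percent):
    """How many inactive molecules there are for each active one."""
    rate = hit_rate_percent / 100.0
    return (1.0 - rate) / rate


# Hypothetical screening library size, for illustration only.
library_size = 100_000

low_hits = expected_hits(library_size, 0.05)
high_hits = expected_hits(library_size, 0.5)

print(f"Expected actives: about {low_hits:.0f} to {high_hits:.0f}")
print(f"Inactives per active at 0.05%: about {inactives_per_active(0.05):.0f}")
print(f"Inactives per active at 0.5%: about {inactives_per_active(0.5):.0f}")
```

Under these assumptions, a screen of 100,000 molecules yields only on the order of tens to hundreds of actives, so a model trained on such data sees hundreds or thousands of inactives per active.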

Statistics of the Datasets, specified by PubChem Assay ID (AID)

Process the Datasets

Uncompress the downloaded file and you will see several .sdf files. Create the folders according to the diagram below and place all .sdf files in the raw folder. You can then use dataset_multigenerator.py to process all of them in parallel into PyG's InMemoryDataset format, as shown below.

python dataset_multigenerator.py

The processed data will appear in the processed folder.

root_dir
|--dataset
|  |--qsar
|  |  |--clean_sdf
|  |  |  |--processed
|  |  |  |  |--kgnn-based-{dataset_AID}-3D.pt
|  |  |  |--raw
|  |  |  |  |--{dataset_AID}_actives_new.sdf
|  |  |  |  |--{dataset_AID}_inactives_new.sdf
|--kgnn
|  |--entry.py
|  |--*.py
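
If you prefer to set up the folders with a script, the layout above can be sketched as follows. This helper is not part of the repository. The function names are hypothetical, and the only things taken from the diagram are the folder names and the `{dataset_AID}_actives_new.sdf` / `{dataset_AID}_inactives_new.sdf` naming pattern. Creating the processed folder in advance is a convenience, not a documented requirement.

```python
import tempfile
from pathlib import Path


def make_layout(root_dir):
    """Create dataset/qsar/clean_sdf/{raw,processed} under root_dir."""
    base = Path(root_dir) / "dataset" / "qsar" / "clean_sdf"
    raw = base / "raw"
    processed = base / "processed"
    raw.mkdir(parents=True, exist_ok=True)
    processed.mkdir(parents=True, exist_ok=True)
    return raw, processed


def expected_raw_files(dataset_aid):
    """File names the diagram shows for one PubChem AID."""
    return [f"{dataset_aid}_actives_new.sdf", f"{dataset_aid}_inactives_new.sdf"]


def missing_raw_files(raw_dir, dataset_aid):
    """Return the expected .sdf files that are not yet in raw_dir."""
    return [
        name
        for name in expected_raw_files(dataset_aid)
        if not (Path(raw_dir) / name).exists()
    ]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real checkout.
    with tempfile.TemporaryDirectory() as root:
        raw, processed = make_layout(root)
        print("raw folder:", raw)
        print("still missing for AID 1798:", missing_raw_files(raw, "1798"))
```

Running `missing_raw_files` before `dataset_multigenerator.py` is a quick way to catch a misplaced or misnamed .sdf file.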

Run the Codes

Here is an example for running the code:

python entry.py \
  --dataset_name 1798 \
  --dataset_path ../dataset/ \
  --num_workers 16 \
  --enable_oversampling_with_replacement \
  --warmup_iterations 200 \
  --max_epochs 3 \
  --peak_lr 5e-2 \
  --end_lr 1e-9 \
  --batch_size 16 \
  --default_root_dir actual_training_checkpoints \
  --num_layers 3 \
  --num_kernel1_1hop 10 \
  --num_kernel2_1hop 20 \
  --num_kernel3_1hop 30 \
  --num_kernel4_1hop 50 \
  --num_kernel1_Nhop 10 \
  --num_kernel2_Nhop 20 \
  --num_kernel3_Nhop 30 \
  --num_kernel4_Nhop 50 \
  --node_feature_dim 28 \
  --edge_feature_dim 7 \
  --hidden_dim 32 \
  --seed 1 \
  --task_comment "this is a test"
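
The `--enable_oversampling_with_replacement` flag relates to the class imbalance described earlier. As a rough illustration of the general idea only, and not of how entry.py implements it, the sketch below resamples the minority class with replacement until both classes have the same number of training indices. The function name, the 0/1 label encoding, and the toy label list are assumptions made for this example.

```python
import random


def oversample_with_replacement(labels, seed=1):
    """Return shuffled sample indices with the minority class resampled
    (with replacement) until it matches the majority class in size."""
    rng = random.Random(seed)
    active = [i for i, y in enumerate(labels) if y == 1]
    inactive = [i for i, y in enumerate(labels) if y == 0]
    if len(active) <= len(inactive):
        minority, majority = active, inactive
    else:
        minority, majority = inactive, active
    if not minority:
        # Nothing to resample from; return the data unchanged.
        return list(range(len(labels)))
    extra = rng.choices(minority, k=len(majority) - len(minority))
    indices = majority + minority + extra
    rng.shuffle(indices)
    return indices


# Toy labels: 2 actives (1) and 8 inactives (0).
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced = oversample_with_replacement(toy_labels, seed=1)
num_active = sum(toy_labels[i] for i in balanced)
print("samples per epoch:", len(balanced))
print("active samples:", num_active)
```

Because actives are drawn with replacement, the same active molecule appears several times per epoch. This balances the classes the model sees without discarding any inactive molecules.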

Q&A

Feel free to drop questions in the Issues tab, or contact me at [email protected]

References

[1] Butkiewicz, Mariusz, et al. "Benchmarking ligand-based virtual High-Throughput Screening with the PubChem database." Molecules 18.1 (2013): 735-756.

[2] Butkiewicz, Mariusz, et al. "High-throughput screening assay datasets from the PubChem database." Chemical informatics (Wilmington, Del.) 3.1 (2017).

[3] Bajorath, Jürgen. "Integration of virtual and high-throughput screening." Nature Reviews Drug Discovery 1.11 (2002): 882-894.

[4] Mueller, Ralf, et al. "Identification of metabotropic glutamate receptor subtype 5 potentiators using virtual high-throughput screening." ACS chemical neuroscience 1.4 (2010): 288-305.

[5] Kim, Sunghwan, et al. "PubChem in 2021: new data content and improved web interfaces." Nucleic acids research 49.D1 (2021): D1388-D1395.

[6] Baell, Jonathan B., and Georgina A. Holloway. "New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays." Journal of medicinal chemistry 53.7 (2010): 2719-2740.


