SigmaCCS

This is the code repository for the paper "Highly accurate and large-scale collision cross section prediction with graph neural network for compound identification". We developed a method named Structure included graph merging with adduct method for CCS prediction (SigmaCCS), and a dataset of 282 million predicted CCS values for three ion adducts ([M+H]+, [M+Na]+ and [M-H]-) of 94 million compounds. The record for each molecule contains "Pubchem ID", "SMILES", "InChi", "Inchikey", "Molecular Weight", "Exact Mass", "Formula" and the predicted CCS values for the three adduct ion types.

Package required:

We recommend using conda and pip.

All required packages can be installed from requirements/conda/environment.yml and requirements/pip/requirements.txt.

Data pre-processing

SigmaCCS predicts CCS with a graph neural network, so SMILES strings must first be converted to graphs. The conversion is implemented in sigma/GraphData.py.

1. Generate 3D conformations of molecules.

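from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles(smiles)  # smiles: the input SMILES string
mol = Chem.AddHs(mol)  # add explicit hydrogens before embedding
ps = AllChem.ETKDGv3()
ps.randomSeed = -1  # -1: do not fix the random seed
ps.maxAttempts = 1
ps.numThreads = 0  # 0: use all available threads
ps.useRandomCoords = True  # start embedding from random coordinates
re = AllChem.EmbedMultipleConfs(mol, numConfs = 1, params = ps)  # generate one 3D conformer
re = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads = 0)  # refine it with the MMFF force field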
  • ETKDGv3: returns an EmbedParameters object for the ETKDG method, version 3 (macrocycles).
  • EmbedMultipleConfs: uses distance geometry to obtain multiple sets of coordinates for a molecule.
  • MMFFOptimizeMoleculeConfs: uses MMFF to optimize all of a molecule's conformations.

2. Save relevant parameters. For details, see sigma/parameter.py.

  • adduct set
  • atoms set
  • Minimum value in atomic coordinates
  • Maximum value in atomic coordinates
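The four parameters listed above can be gathered in a single pass over the dataset. The following is a minimal sketch in plain Python, not the code in sigma/parameter.py: the record layout, the function names `collect_parameters` and `normalize`, and the toy coordinates are all hypothetical, and using the stored minimum and maximum for min-max scaling is an assumption about how they might be applied.

```python
# Hypothetical toy records: (adduct, element symbols, 3D coordinates per atom).
# The coordinates are made-up illustration values, not real conformers.
records = [
    ("[M+H]+", ["C", "O", "H"], [(0.0, 0.1, -1.2), (1.4, 0.0, 0.3), (-0.5, 0.9, 0.0)]),
    ("[M-H]-", ["C", "N"], [(2.5, -0.7, 0.0), (0.2, 0.4, 1.1)]),
]


def collect_parameters(records):
    """Gather the adduct set, the atom set and the coordinate range."""
    adducts, atoms, values = set(), set(), []
    for adduct, elements, coords in records:
        adducts.add(adduct)
        atoms.update(elements)
        # Flatten every x, y and z value into one list.
        values.extend(v for xyz in coords for v in xyz)
    return {
        "adduct_set": sorted(adducts),
        "atom_set": sorted(atoms),
        "coord_min": min(values),
        "coord_max": max(values),
    }


def normalize(coords, lo, hi):
    """Min-max scale coordinates to [0, 1] (assumed use of the stored range)."""
    span = hi - lo
    return [tuple((v - lo) / span for v in xyz) for xyz in coords]


params = collect_parameters(records)
scaled = normalize(records[0][2], params["coord_min"], params["coord_max"])
print(params)
```

Storing these values once at training time means the same element ordering and coordinate range can be reused when new molecules are encoded for prediction.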

3. Generate the Graph dataset. Generate the three matrices used to construct the Graph:
(1) node feature matrix, (2) adjacency matrix, (3) edge feature matrix.

adj, features, edge_features = convertToGraph(smiles, Coordinate, All_Atoms)
DataSet = MyDataset(features, adj, edge_features, ccs)

Arguments and return values

  • All_Atoms : The set of all elements in the dataset
  • Coordinate : Array of coordinates of all molecules
  • features : Node feature matrix
  • adj : Adjacency matrix
  • edge_features : Edge feature matrix
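To make the three matrices concrete, here is a minimal sketch in plain Python that builds them for one toy molecule. It is not the repository's convertToGraph: the function `build_graph`, the feature layout (one-hot element followed by x, y, z), the choice of bond order as the only edge feature, and the toy coordinates are all assumptions made for illustration.

```python
def build_graph(elements, coords, bonds, all_atoms):
    """Return (features, adj, edge_features) for one molecule.

    elements:  element symbol of each atom
    coords:    (x, y, z) of each atom
    bonds:     (i, j, order) tuples, one per undirected bond
    all_atoms: ordered list of every element in the dataset
    """
    n = len(elements)

    # (1) Node feature matrix: one-hot element type followed by the coordinates.
    features = []
    for elem, xyz in zip(elements, coords):
        one_hot = [1.0 if elem == a else 0.0 for a in all_atoms]
        features.append(one_hot + list(xyz))

    # (2) Adjacency matrix: symmetric, 1 where two atoms share a bond.
    adj = [[0] * n for _ in range(n)]
    # (3) Edge feature matrix: one row per bond, here just the bond order.
    edge_features = []
    for i, j, order in bonds:
        adj[i][j] = adj[j][i] = 1
        edge_features.append([float(order)])

    return features, adj, edge_features


# Formaldehyde-like toy input with made-up coordinates.
all_atoms = ["C", "H", "O"]
elements = ["C", "O", "H", "H"]
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (-0.5, 0.9, 0.0), (-0.5, -0.9, 0.0)]
bonds = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]

features, adj, edge_features = build_graph(elements, coords, bonds, all_atoms)
```

In this sketch each node row has len(all_atoms) + 3 entries, so the element list must be the same at training and prediction time for the feature width to match.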

Model training

Train the model on your own training dataset with the Model_train function.

Model_train(ifile, ParameterPath, ofile, ofileDataPath, EPOCHS, BATCHS, Vis, All_Atoms=[], adduct_SET=[])

Arguments

  • ifile : Path to the file containing the SMILES and adduct data.
  • ofile : File path where the model is stored.
  • ParameterPath : Save path of related data parameters.
  • ofileDataPath : File path for storing model parameter data.

Predicting CCS

The CCS value of a molecule is predicted by feeding its Graph and Adduct into the trained SigmaCCS model with the Model_prediction function.

Model_prediction(ifile, ParameterPath, mfileh5, ofile, Isevaluate = 0)

Arguments

  • ifile : Path to the file containing the SMILES and adduct data
  • ParameterPath : File path for storing model parameter data
  • mfileh5 : File path where the model is stored
  • ofile : Path where the predicted CCS values are saved
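Because the adduct is fed to the model alongside the graph, it has to be turned into a numeric vector first. A minimal sketch of one common way to do this, one-hot encoding against the stored adduct set, is shown here. Whether SigmaCCS encodes adducts exactly this way is an assumption, and the names `ADDUCT_SET` and `one_hot_adduct` are hypothetical.

```python
# The three adduct types covered by the dataset, in a fixed order.
ADDUCT_SET = ["[M+H]+", "[M+Na]+", "[M-H]-"]


def one_hot_adduct(adduct, adduct_set=ADDUCT_SET):
    """Encode an adduct label as a one-hot vector over adduct_set."""
    if adduct not in adduct_set:
        # An adduct unseen during training cannot be encoded.
        raise ValueError(f"unknown adduct: {adduct!r}")
    return [1 if adduct == a else 0 for a in adduct_set]


encoded = one_hot_adduct("[M+Na]+")
print(encoded)
```

The order of the adduct set must be identical at training and prediction time, which is one reason to save it with the other parameters.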

Usage

Example code is included in test.ipynb.

Others

The following files are in the others folder:

Slurm script

Slurm scripts for generating CCS values for PubChem compounds on an HPC cluster. The following files are in the slurm folder:

  • mp.py
  • multiple_job.sh (Batch generation of slurm script files)
  • normal_job.sh (Submit the slurm script for the mp.py file)

Information of maintainers

Contributors

youjiazhang, yuxuanliao, zmzhang
