
chemical-structure-standardisation's Introduction

This repository contains a structure standardisation workflow which was used in several papers of the Pharmacoinformatics research group.


The folders contain workflows with different versions of the standardiser. The publication should always indicate which version was used.

The folder "Python script" contains the script used in the Python node within the workflow. It can also be used separately. In the future we hope to provide a command line tool to standardise structures.

Support:


Jennifer Hemmerich, jennifer.hemmerich[at]univie.ac.at

Important Note:


The workflow requires an Anaconda environment with Python 3.6 or above and RDKit (2019 or above) installed.

Further instructions on how to set this up can be found here: https://docs.knime.com/latest/python_installation_guide/index.html

The workflow can also be retrieved from the KNIME Hub at https://kni.me/w/auOFJsQKZXJmSc_9. However, please use the support email or GitHub for any issues related to the workflow.

Dependencies


Python >= 3.6.X

RDKit >= 2019.X.X

Usage:


  1. Download the workflow and load it into your KNIME installation.
  2. Double click on the Input Selection Metanode/Component and choose your SDF or CSV file. Run the node.
  3. Double click on the Molecule Format component and choose the appropriate molecule format (currently SMILES and SDF are supported).
  4. Double click on the Standardiser component and choose the appropriate settings. Information on the options can be found in the help menu. The standardiser creates multiple columns to inform you about the standardisation process and possible problems with molecules. Please inspect them carefully to ensure that no problems occurred.
  5. Configure the SDF writer to get a dataset with the standardised molecules. You can also use the output directly in your own workflows.
  6. We recommend checking for duplicates by merging on the InChIKeys which are generated by the workflow.
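
The duplicate check from step 6 can also be done outside KNIME. The following is a minimal sketch using only the Python standard library. It assumes the workflow output has been exported as rows with a column named "InChIKey"; that column name and the helper `find_duplicates` are illustrative assumptions, not part of the workflow.

```python
from collections import defaultdict


def find_duplicates(rows, key="InChIKey"):
    """Group rows by InChIKey and return only the groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Illustrative rows; the "Name" and "InChIKey" column names are assumptions.
rows = [
    {"Name": "ethanol (source A)", "InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"Name": "ethanol (source B)", "InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"Name": "methanol", "InChIKey": "OKKJLVBELUTLKV-UHFFFAOYSA-N"},
]

duplicates = find_duplicates(rows)
for inchikey, members in duplicates.items():
    print(inchikey, [m["Name"] for m in members])
```

Which of the duplicate rows to keep depends on the dataset, so the sketch only reports the groups and leaves that decision to you.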

For any issues or bugs please contact support or open an issue.

Standardisation Protocol:


In summary, the general procedure for standardising a molecule (with the documentation for the appropriate module linked) is:

First, a molecule is checked for fragments. Then the following steps are run for each fragment:

  1. Break bonds to Group I or II metals
  2. Apply standardisation rules
  3. Neutralise charges by adding/removing protons
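
The three per-fragment steps can be approximated with RDKit's `rdMolStandardize` module. This is a sketch under assumptions, not the script shipped in this repository: the function name `standardise_fragments` is hypothetical, and RDKit's `MetalDisconnector` is used as a stand-in for the first step even though its rules may cover more than Group I and II metals.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardise_fragments(smiles):
    """Return one standardised SMILES per fragment of the input (illustrative only)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")

    disconnector = rdMolStandardize.MetalDisconnector()
    uncharger = rdMolStandardize.Uncharger()

    results = []
    # Treat every disconnected fragment separately.
    for frag in Chem.GetMolFrags(mol, asMols=True):
        frag = disconnector.Disconnect(frag)     # break covalent bonds to metals
        frag = rdMolStandardize.Normalize(frag)  # apply RDKit's normalisation rules
        frag = uncharger.uncharge(frag)          # neutralise by adding/removing protons
        results.append(Chem.MolToSmiles(frag))
    return results


result = standardise_fragments("CC(=O)[O-].[Na+]")
print(result)
```

For sodium acetate the acetate fragment is protonated to acetic acid, while the sodium ion cannot be neutralised by adding or removing protons and keeps its charge.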

Depending on the selected options, the following actions are carried out:

Keep all Molecules? --> All molecules (standardised and non-standardised) are kept. If an entry contains more than one molecule, it is split into separate rows (these molecules can be identified via the Molecule_index column).

Remove Molecules with inorganic Atoms (Organic: H, C, N, O, F, P, S, Cl, Br, I) --> If inorganic atoms (e.g. B, Sn, ...) are found, the molecule is sent to the second output.

Remove Mixtures --> Mixtures are sent to the second output. If they are not removed, they are split into separate rows and a flag is added in the Mixture column.
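
The routing described by these options can be sketched in plain Python. Everything here is an assumption made for illustration: the function names, the crude regular expression (which is not a real SMILES parser; the workflow itself relies on RDKit), the 0-based Molecule_index, and the rule that any dot-separated SMILES counts as a mixture. Only the organic element list and the column names Molecule_index and Mixture come from the description above.

```python
import re

# Organic element set as listed in the option description.
ORGANIC = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

# Crude atom tokenizer (assumption): bracket atoms first, then two-letter
# organic-subset atoms, then single-letter aliphatic or aromatic atoms.
_ATOM = re.compile(r"\[\d*([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br)|([BCNOPSFIbcnops])")


def elements_in_smiles(smiles):
    """Return the set of element symbols written explicitly in a SMILES string."""
    found = set()
    for bracket, two_letter, single in _ATOM.findall(smiles):
        found.add((bracket or two_letter or single).capitalize())
    return found


def route(smiles, remove_inorganic=True, remove_mixtures=False):
    """Split one input entry into (kept_rows, second_output_rows)."""
    parts = smiles.split(".")
    is_mixture = len(parts) > 1
    if is_mixture and remove_mixtures:
        # The whole entry goes to the second output.
        return [], [{"Molecule": smiles, "Mixture": True}]

    kept, rejected = [], []
    for index, part in enumerate(parts):
        row = {"Molecule": part, "Molecule_index": index, "Mixture": is_mixture}
        if remove_inorganic and not elements_in_smiles(part) <= ORGANIC:
            rejected.append(row)  # contains an element outside the organic set
        else:
            kept.append(row)
    return kept, rejected


kept_tin, rejected_tin = route("CC[Sn](CC)(CC)CC")
kept_mix, rejected_mix = route("CCO.CS(=O)C")
```

A tetraethyltin entry ends up in the second output because of the Sn atom, while the ethanol/DMSO entry is split into two flagged rows unless `remove_mixtures` is set.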

Additionally, within the workflow the stereochemistry can be standardised or removed.

Background


Structural representation is not unified between formats and tasks. For example, the same functional group can be represented in many different ways (a nitro group may be drawn in a charge-separated or a pentavalent form).

Beyond this, even a simple molecule such as ethanol can have various correct SMILES representations (OCC, C(O)C, CCO). Although canonical SMILES exist, they are not standardised between tools, as each tool uses its own algorithm to canonicalise the SMILES string. To a computer program these structures look different, although we know they are not. This makes duplicate removal very tedious for big datasets. Hence we aim to always choose the same representation, so that duplicate structures can be detected automatically in large datasets.

Furthermore, a molecule entry can contain fragments, mixtures, salts or solvents, which would create artifacts during screening or descriptor generation. For example, the entry for Carbachol contains a chloride, and Sunitinib DMSO contains DMSO:

    Carbachol (with chloride):  C[N+](C)(C)CCOC(=O)N.[Cl-]
    Sunitinib DMSO:             CCN(CC)CCNC(=O)C1=C(NC(=C1C)C=C2C3=C(C=CC(=C3)F)NC2=O)C.CS(=O)C

If we kept these entries as they are, we could accidentally dock or predict only the chloride, missing the Carbachol which fits into our pocket, or calculate the descriptors for the chloride instead of the Carbachol we want to use. For the chloride our programs would probably warn us, as it is not an organic molecule, but for Sunitinib there is a higher chance of using the DMSO by mistake, as it is an organic compound.
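
Both points can be checked with a few lines of RDKit. This is only an illustrative sketch: it shows that the three ethanol SMILES collapse to one string when canonicalised by a single toolkit, and that the Carbachol entry really consists of two fragments. The variable names are chosen for this example.

```python
from rdkit import Chem

# Three valid SMILES for ethanol, canonicalised by the same toolkit.
ethanol_variants = ["OCC", "C(O)C", "CCO"]
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in ethanol_variants}
print(canonical)

# The Carbachol entry contains the drug cation and a chloride counter-ion.
carbachol = Chem.MolFromSmiles("C[N+](C)(C)CCOC(=O)N.[Cl-]")
fragments = Chem.GetMolFrags(carbachol, asMols=True)
fragment_sizes = sorted(frag.GetNumHeavyAtoms() for frag in fragments)
print(fragment_sizes)
```

The canonical string is only guaranteed to be consistent within one toolkit and version, which is why comparing InChIKeys is the safer way to detect duplicates across sources.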

For our standardisation we chose a protocol proposed by Francis Atkinson from the EMBL-EBI. The tool was developed for the e-Tox Project (https://wwwdev.ebi.ac.uk/chembl/extra/francis/standardiser/) and is available through the Python Package Index as a Python package. However, since summer 2019 RDKit (https://www.rdkit.org/ and https://www.rdkit.org/docs/source/rdkit.Chem.MolStandardize.rdMolStandardize.html) has included the functionality of another Python standardisation library, MolVS (https://github.com/mcs07/MolVS/blob/master/docs/source/index.rst). We therefore upgraded the node to rely on RDKit only, so the standardiser library is no longer needed.

Citation


This code is © J. Hemmerich, 2020, and it is made available under the GPL license enclosed with the software.

If you use this software for an academic publication then you are obliged to provide proper attribution. Please use the following citation:

@software{Hemmerich2020,
	title = {{KNIME} Structure Standardisation Workflow},
	url = {https://github.com/PharminfoVienna/Chemical-Structure-Standardisation},
	version = {0.2.0},
	publisher = {Department of Pharmaceutical Chemistry, University of Vienna},
	author = {Hemmerich, Jennifer},
	date = {2020-01-31},
	note = {https://kni.me/w/{auOFJsQKZXJmSc}\_9}
}


chemical-structure-standardisation's Issues

Py snippet won't run, error: "DataFrame" object has no attribute

I created a Python 3.6 environment solely for testing purposes (I have used 3.8 and higher as well). However, I am using KNIME 4.3, and the Python snippet cannot be analysed within KNIME because of the error given in the title.
I have had similar problems rather often myself: a Python node works fine in one setup, but gives errors when run elsewhere. Sometimes it seems to get fixed by "magic", sometimes not, and I haven't figured out what the cause might be.
Anyway, if you have a fix or suggestion (other than using KNIME 3.x), I am all ears.
