Giter VIP home page Giter VIP logo

yaccl's Introduction

yaccl

yet another ChemClassifier (Python based on wikibase-cli and rdkit)

This is a proof of the concept of a simple compound classifier that relies completely on knowledge from Wikidata. At this stage it uses InChI keys, SMILES and SMARTS strings, downloaded from Wikidata, to get hits on compounds or classes. About 650 biosynthetic processes from Gene Ontology are directly associated with classes and compounds, so they are potential hits. Matching is done by going through the list of classes---fast enough to find all hits in the dataset with >110k classes within a few seconds.

While yaccl is, in principle, a general classifier, development of patterns focuses on biomolecules.

Version / Progress

The current version is 2122.

  • terpenoids
  • steroids
  • flavonoids
  • alkaloids (25% unspecified but recognized)
  • polyketides
  • fatty acyls (in progress)

Prerequisites

  • Python 3
  • rdkit python package
  • wikibase-cli (optional, for dataset updates)

Cloning the repo will get you all datasets, so no need to download from Wikidata if you just want to play with yaccl.

Installation

Clone the repo.

Usage

Change into the repo src directory. There are four scripts. One is for classification, and three are for updating / statistics of datasets.

classify

Takes a SMILES/InChI and outputs a classification. Examples:

$ python3 classify.py -d ./ -m 'CCO'
reading biosyn data
reading class pattern data
reading superclass data
collecting hits...
purging redundant hits
('Q153', 'ethanol', 'LFQSCWFLJHTTHZ-UHFFFAOYSA-N') {('GO:0044576', 'pentose catabolic process to ethanol'), ('GO:0044577', 'xylose catabolic process to ethanol'), 'GO:0006115', 'ethanol biosynthetic process', ('GO:0019431', 'acetyl-CoA biosynthetic process from ethanol')}
('Q2832210', 'primary alcohol', '[OX2H1][CX4H2]') {'primary alcohol biosynthetic process', 'GO:0034309'}
$ python3 classify.py -d ./ -m 'InChI=1S/C15H22O10/c16-3-6-9(19)10(20)11(21)14(23-6)24-13-7-5(1-2-22-13)8(18)12-15(7,4-17)25-12/h1-2,5-14,16-21H,3-4H2/t5-,6-,7-,8+,9-,10+,11-,12+,13+,14+,15-/m1/s1'
reading biosyn data
reading class pattern data
reading superclass data
collecting hits...
purging redundant hits
('Q105151716', '2-[[5-hydroxy-2-(hydroxymethyl)-3,9-dioxatricyclo[4.4.0.02,4]dec-7-en-10-yl]oxy]-6-(hydroxymethyl)oxane-3,4,5-triol', 'OCC1OC(OC2OC=CC3C(O)C4OC4(CO)C23)C(O)C(O)C1O') {'GO:0016138', 'glycoside biosynthetic process'}
('Q408028', 'epoxide', 'C1OC1') {'GO:1901503', 'ether biosynthetic process'}
('Q2832210', 'primary alcohol', '[OX2H1][CX4H2]') {'primary alcohol biosynthetic process', 'GO:0034309'}
('Q2832211', 'secondary alcohol', '[CX4H1][OX2H1]') {'GO:1902653', 'secondary alcohol biosynthetic process'}
('Q74173050', 'D-glucoside', '[C@@H]1(OC([C@H](O)[C@H]([C@@H]1O)O)O*)CO') {'GO:0016138', 'glycoside biosynthetic process'}
('Q74568405', 'alpha-D-mannoside', 'OC[C@@H]1[C@H]([C@@H]([C@H](O)[C@H](O1)O*)O)O') {'GO:0016138', 'glycoside biosynthetic process'}
('Q75078202', 'pyranoside', 'O1C(C(C(C(C1O*)O)O)O)CO') {'GO:0016138', 'glycoside biosynthetic process'}
('Q75082510', 'beta-D-galactoside', 'OC[C@H]1O[C@@H](O[*])[C@H](O)[C@@H](O)[C@H]1O') {'GO:0016138', 'glycoside biosynthetic process'}
('Q105151716', '2-[[5-hydroxy-2-(hydroxymethyl)-3,9-dioxatricyclo[4.4.0.02,4]dec-7-en-10-yl]oxy]-6-(hydroxymethyl)oxane-3,4,5-triol', 'OCC1OC(OC2OC=CC3C(O)C4OC4(CO)C23)C(O)C(O)C1O') {'GO:0016138', 'glycoside biosynthetic process'}

The -j option provides JSON format. This does only work if a natural product is recognized. Then, it gives the full path to root in the ontology:

$ python3 classify.py -d ./ -m 'InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1' -j 
[{"classification_names": ["flavonoid", "2-phenylchromane flavonoid", "flavone", "6C-substituted flavone", "6C-glycosylated flavone", "spinosin"], "biological_process": ["flavone biosynthetic process", "GO:0051553"]}]

Given a file of InChI strings the -t option runs a test checking that at least one of the hits is a subclass of the test class item.

$ python3 classify.py -t test/sesterterpenoids.txt -d ./ 
reading biosyn data
reading class pattern data
reading superclass data
classifying testfile test/sesterterpenoids.txt
test class: Q107363222
1 OK
2 OK
3 OK
4 OK
5 OK
6 OK
7 OK
8 OK
9 FAIL: InChI=1S/C25H44/c1-9-21(6)13-11-17-25(8,16-10-12-19(2)3)24-18-22(7)14-15-23(24)20(4)5/h9,18-19,21,23-24H,1,4,10-17H2,2-3,5-8H3
...

datasets

The difference between the biosyn and the class-pattern datasets is that the former contains classes AND compounds that have associated GO processes in Wikidata (possibly lacking any pattern), while the latter contains all classes (no compounds) that have associated patterns (SMILES, SMARTS or InChI).

The design of datasets will be subject to change in future versions.

Using the -q option with the dataset scripts depends on a working wikibase-cli installation to attempt an update of the respective database, directly from Wikidata. This is not necessary if you frequently git pull this repo, as any Wikidata changes will be reflected by repo updates.

biosyn-class-stats

Output stats of biosyn-classes dataset.

$ python3 biosyn-class-stats.py 
#items: 1167
#InChI keys: 635
#InChI1 keys: 242
#smarts: 42
#smiles: 694
# w/o pattern: 436

class-pattern

Output stats of class-pattern dataset.

$ python3 class-pattern.py 
#items: 113788
#InChI keys: 112432
#InChI1 keys: 81120
#smarts: 374
#smiles: 1024
#wildcard smiles: 979

Yes, Wikidata has >110k compound classes, most of which (>81k) are compounds with unspecified stereochemistry, i.e. they are groups of stereoisomers.

Contribute

Most errors or inconsistencies are due to data, so please contribute to Wikidata. More involved and pressing is to add patterns to the items of the biosyn dataset, or even add missing classes to expand Wikidata's class ontology.

If you want to tackle a specific superclass we recommend usage of wdtaxonomy to get an overview of the Wikidata situation.

Of course, any comment or PR is highly appreciated.

TODO

  • do we use compound SMILES from biosyn items (e.g. Q27104306)?
  • check if there are biosyn classes not in the NP sub-ontology
  • optimize ontology in WD, remove or deprecate redundant P31/P279 stmts
  • write toolserver client
  • use RHEA to list enzymatic reactions
  • handle iridoid + glycoside --> iridoid glycoside (probably just write a bot for giv cases)

yaccl's People

Contributors

adafede avatar rwst avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

adafede

yaccl's Issues

no biosyn compound hits in JSON mode

$ python3 classify.py -d ./ -m 'NCCc1c[nH]c2ccc(O)cc12' -j
[{"classification_names": ["indoles", "tryptamines"], "biological_process": ["amine biosynthetic process", "GO:0009309"]}]
$ python3 classify.py -d ./ -m 'NCCc1c[nH]c2ccc(O)cc12' 
Namespace(data='./', json=False, molecule='NCCc1c[nH]c2ccc(O)cc12', nptest=False, rawhits=False, testfile=None)
...
('Q167934', 'serotonin', 'QZAYGJVTTNCVMB-UHFFFAOYSA-N') {'serotonin biosynthetic process from tryptophan', ('GO:0042427', 'serotonin biosynthetic process'), 'GO:0006587'}

Mapping between biological process and hit gets lost

Hi, me again.

Since I classified a huge batch for downstream use I came across some potential issues:

When running for example

python3 classify.py -d ./ -m 'InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1' -j
{"molecule": "InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1", "ikey": "VGGSULWDCMWZPO-ODEMIOGVSA-N", "hits": [{"classification_names": ["flavonoid", "2-phenylchromane flavonoid", "flavone", "6C-substituted flavone", "6C-glycosylated flavone", "spinosin"], "biological_process": ["flavone biosynthetic process", "GO:0051553"]}]}

The ["flavone biosynthetic process", "GO:0051553"]is actually not linked to all hits, but only one. The relationship is kept if it is a 1 to 1, but it is almost never the case (as illustrated here).

Keep QIDs in json output

When using -j the corresponding QID does not appear in the output. This is somehow problematic as labels change, while QIDs don't.

Example:

python3 classify.py -d ./ -m 'InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1' -j
{"molecule": "InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1", "ikey": "VGGSULWDCMWZPO-ODEMIOGVSA-N", "hits": [{"classification_names": ["flavonoid", "2-phenylchromane flavonoid", "flavone", "6C-substituted flavone", "6C-glycosylated flavone", "spinosin"], "biological_process": ["flavone biosynthetic process", "GO:0051553"]}]}

vs

python3 classify.py -d ./ -m 'InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1'   
Namespace(data='./', molecule='InChI=1S/C28H32O15/c1-39-14-7-15-18(12(32)6-13(40-15)10-2-4-11(31)5-3-10)22(35)19(14)26-27(24(37)21(34)16(8-29)41-26)43-28-25(38)23(36)20(33)17(9-30)42-28/h2-7,16-17,20-21,23-31,33-38H,8-9H2,1H3/t16-,17-,20-,21-,23+,24+,25-,26+,27-,28+/m1/s1', rawhits=False, testfile=None, nptest=False, json=False, natural=False, list_nproots=False)
reading biosyn class data
reading biosyn compound data
reading class pattern data
reading superclass data
collecting hits...
purging redundant hits
('Q105285804', 'spinosin', 'VGGSULWDCMWZPO') {'GO:0051553', 'flavone biosynthetic process'}
('Q421916', 'diol', None) {'diol biosynthetic process', 'GO:0034312'}
('Q2832210', 'primary alcohol', '*C(O)([H])[H]') {'primary alcohol biosynthetic process', 'GO:0034309'}
('Q2832211', 'secondary alcohol', None) {'secondary alcohol biosynthetic process', 'GO:1902653'}
('Q9148474', 'aromatic alcohol', None) {'alcohol biosynthetic process', 'GO:0046165'}
('Q67206812', 'methyl ether', None) {'ether biosynthetic process', 'GO:1901503'}
('Q74173050', 'D-glucoside', '[C@@H]1(OC([C@H](O)[C@H]([C@@H]1O)O)O*)CO') {'GO:0016138', 'glycoside biosynthetic process'}
('Q74568405', 'alpha-D-mannoside', 'OC[C@@H]1[C@H]([C@@H]([C@H](O)[C@H](O1)O*)O)O') {'GO:0016138', 'glycoside biosynthetic process'}
('Q74651555', 'phenolic donor', 'C1(=CC=C(C=C1)*)O') {'phenol-containing compound biosynthetic process', 'GO:0046189'}
('Q75078202', 'pyranoside', 'O1C(C(C(C(C1O*)O)O)O)CO') {'GO:0016138', 'glycoside biosynthetic process'}
('Q75082510', 'beta-D-galactoside', 'OC[C@H]1O[C@@H](O[*])[C@H](O)[C@@H](O)[C@H]1O') {'GO:0016138', 'glycoside biosynthetic process'}

Hope this makes sense

-j option does not affect output of classify -n test

$ python3 classify.py -n -d ./ -j
classifying 100 random small biomolecules
1 FAIL: InChI=1S/C17H25NO4/c1-11(2)16(19)21-14-7-6-13(8-9-18-5)10-15(14)22-17(20)12(3)4/h6-7,10-12,18H,8-9H2,1-5H3
...
17 OK ('Q109476791', 'biogenic 2,5-diketopiperazine', None)
...

This should be JSON.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.