
Contentmining of Open phytochemical literature for medicinal activities

cevopen's Introduction

petermr repositories

Many of these repos are widely used in collaborative projects and include:

code
data
projects

This special repo is to coordinate navigation and discussion

discussion lists

The "Discussions" for this repo https://github.com/petermr/petermr/discussions include discussions for the other repos, which are indicated by their name. They may replace our (private) Slack for all public-facing material (private project management will remain on Slack).

active repos

https://github.com/petermr/pygetpapers. automatic downloading of articles and preprints in bulk. Pioneered by Rik Smith-Unna and ported to Python by @ayush garg. CLI. PyPi: https://pypi.org/project/pygetpapers/
https://github.com/petermr/pyami. Port of Java https://github.com/petermr/ami3 to Python (@petermr). CLI. Includes a prototype GUI in tkinter. PyPi: https://pypi.org/project/py4ami/ (Note there is already an unrelated pyAMI in PyPi so in that namespace we are py4ami, but on Github it's pyami)
https://github.com/petermr/pyamiimage. Analysis of scientific diagrams (@petermr, @anuvc). No CLI or PyPI yet.
https://github.com/petermr/docanalysis. Text-based analysis of scientific articles (@shweatahegde). CLI under dev. No PyPi yet.
https://github.com/petermr/cevopen. Projects, dictionaries and outreach for analysing articles in plant sciences.
https://github.com/petermr/openvirus. Projects, dictionaries and outreach for analysing articles on viral epidemics.
https://github.com/petermr/crops. NIPGR-intern projects on crops. Minicorpora and dictionaries for terpene synthases
https://github.com/petermr/opendiagram. Adaptation of pyamiimage to extract data from diagrams, especially materials science/batteries
https://github.com/petermr/dictionary. Software for distributed dictionaries and many dictionaries

active Python projects:

For context: We have 4 packages (if that's the right word). They are largely standalone but can have useful library routines. They all share a common data structure on disk (simply named directories). This means that state is less important and often held on the filesystem. It also means that data can be further manipulated by Unix tools and other utilities. This is very fluid as we are constantly adding new data substructures. (I developed much of this in Java - https://github.com/petermr/ami3/blob/master/README.md). The top directory is a CProject and its document children are called CTrees as they are usefully split into many subdirectory trees.

Each package has a maintainer. These are all volunteers. Their Python is all self-taught. There are also interns, a mixture of compsci/engineers/plant_sci, who have a 3-month stay. They test the tools, develop resources, explore text-mining, NLP, image analysis, machine-learning, etc. They are encouraged to use the packages and link them into Python scripts or Notebooks, but don't have time for serious development. (They might add readers or exporters).

pygetpapers , Ayush Garg. https://github.com/petermr/pygetpapers . Searches and downloads articles from repositories. Standalone, but the results may be used by docanalysis or possibly imageanalysis. Can be called from other tools.
docanalysis. Shweata Hegde. https://github.com/petermr/docanalysis . Ingests CProjects and carries out text-analysis of documents, including sectioning, NLP/text-mining, vocabulary generation. Uses NLTK and other Python tools for many operations, and spaCy, scispaCy for annotation of entities. Outputs summary data, correlations, word-dictionaries. Links entities to Wikidata.
pyamiimage, Anuv Chakroborty + PMR. https://github.com/petermr/pyamiimage . Ingests Figures/images, applies many image processing techniques (erode-dilate, colour quantization, skeletons, etc.), extracts words (Tesseract) , extracts lines and symbols (uses sknw/NetworkX) and recreates semantic diagrams (not finished)
py4ami . PMR. https://github.com/petermr/pyami . Translation of ami3(J) to Python. Processes CProjects to extract and combine primitives into semantic objects. Some functionality overlaps with docanalysis and imageanalysis. Includes libraries (e.g. for Wikimedia) and includes prototype GUI in tkinter, and a complex structure of word-dictionaries covering science and related disciplines. (Note the project is called pyami locally but there is already a PyAMI project so there it is called py4ami)

All packages aim to have a common commandline approach, use config files, generate and process CProjects (e.g. iterating over CTrees and applying filters, transformers, map/reduce, etc.). All 4 packages have been uploaded to PyPI
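As a minimal sketch of the shared on-disk layout described above, the following treats a CProject as a plain directory whose child directories are CTrees, then filters them. The helper names (`iter_ctrees`, `has_fulltext`), the `fulltext.xml` filename and the placeholder directory names are assumptions for illustration, not the API of any of the four packages.

```python
import tempfile
from pathlib import Path


def iter_ctrees(cproject: Path):
    """Yield each child directory of a CProject (assumed: one CTree per directory)."""
    for child in sorted(cproject.iterdir()):
        if child.is_dir():
            yield child


def has_fulltext(ctree: Path) -> bool:
    """Filter: keep CTrees containing fulltext.xml (the filename is an assumption)."""
    return (ctree / "fulltext.xml").is_file()


def build_demo_cproject(root: Path) -> Path:
    """Create a tiny throwaway CProject so the sketch runs anywhere."""
    cproject = root / "demo_project"
    # Placeholder CTree names, not real articles.
    for name, with_xml in [("PMC0000001", True), ("PMC0000002", False)]:
        ctree = cproject / name
        ctree.mkdir(parents=True)
        if with_xml:
            (ctree / "fulltext.xml").write_text("<article/>", encoding="utf-8")
    return cproject


with tempfile.TemporaryDirectory() as tmp:
    project = build_demo_cproject(Path(tmp))
    # All state lives on the filesystem: iterate, filter, collect.
    ctree_names = [ctree.name for ctree in iter_ctrees(project) if has_fulltext(ctree)]

print(ctree_names)
```

Because the state is just directories and files, the same selection could equally be done with Unix tools such as `find` or `ls`.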

basicTest

Checks that the Python environment works (independently of the applications) https://github.com/petermr/basicTest/blob/main/README.md

presentations

Some presentations about the software, many from collaborators/interns

pygetpapers

https://youtu.be/pUjiNzLVHLY (@ayushGarg) 5 min

notebook

https://github.com/petermr/docanalysis/blob/main/resources/docanalysis_demo.ipynb

docanalysis

docanalysis slides (MADICES): https://github.com/petermr/CEVOpen/blob/master/outreach/docanalysis_demo_madices.pdf

wikidata

WikidataCon Presentation slides and recording: https://github.com/petermr/crops/tree/main/outreach/WikidataCon2021

cevopen's People

Contributors

Stargazers

Watchers

Forkers

ambarishk larsgw emanuelfaria oloni lubianat ambrineh ayush4921 katharhy radhu903 denizaancestry drleeja2021 kanishkaparashar03 prachi0508 anam04anjum sshivk genostack priti-chahal asbaharoon java-cds-club

cevopen's Issues

Anthelmintic

Create and test sections

AMI now creates sections on about 20-30 criteria. Some are small (titles of sections) others are larger (ABSTRACT).
I have run

ami-section -p oil198 --sections ALL

and it has found many (useful). However we need to be able to identify these
using xpath. We'll need to make sure all sections are identified.

We probably need a spreadsheet with the eyeballed sections and the required xpath to get them.
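A rough sketch of how such a section-to-xpath table might be exercised, using only Python's standard library. The sample XML is a simplified stand-in for an article, and the labels and xpath strings in `SECTION_XPATHS` are hypothetical; the real expressions would come from the eyeballed spreadsheet.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an article; real files are larger and messier.
SAMPLE = """<article>
  <front><abstract><p>Summary text.</p></abstract></front>
  <body>
    <sec><title>Introduction</title><p>Some text.</p></sec>
    <sec><title>Materials and methods</title>
      <sec><title>GC-MS analysis</title><p>Some text.</p></sec>
    </sec>
  </body>
</article>"""

# Hypothetical label -> xpath table (ElementTree supports a limited XPath subset).
SECTION_XPATHS = {
    "abstract": ".//abstract",
    "section_titles": ".//sec/title",
}


def find_sections(xml_text, xpaths):
    """Return, for each label, the text of every element its xpath selects."""
    root = ET.fromstring(xml_text)
    found = {}
    for label, xpath in xpaths.items():
        found[label] = ["".join(el.itertext()).strip() for el in root.findall(xpath)]
    return found


sections = find_sections(SAMPLE, SECTION_XPATHS)
print(sections)
```

A label whose list comes back empty is a section that the xpath failed to identify, which is the case the spreadsheet is meant to catch.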

Transfer `compound` dictionary from EssoilDB 1.0

discuss pull requests

Need to clarify how @mannyrules makes Pull Requests

Antimicrobial Table Templates

Ocimum sanctum - Literature survey - An example of Text and Data mining.

Sir, please go through the sheet for the literature survey on Ocimum sanctum. It covers 150 articles of the ocimum200 corpus.

It is in a raw state and I will revise it two or three times.

It will record reported activities, diseases, target organisms (as species), plant species used for synergistic action, chemical compounds (along with their activity), study domain (genomics, agronomy/farming, nano-science), miscellaneous uses, and social and cultural aspects of Ocimum sanctum.

Ocimum sanctum raw sheet.

We can extract correlations between reported chemical compounds and diseases, activities, target organisms (pathogens/insects/microbes) and related plant species (for similar/related and synergistic action).

I am modifying it right now.

I have shared the raw spreadsheet (on Google Sheets).

It also presents useful statistics on Ocimum sanctum research and usefulness.

Add WikidataID refs to articles

Many open access articles have IDs in Wikidata (i.e the bibliography itself has an ID). For example:
The article with title
"Thymus vulgaris essential oil: chemical composition and antimicrobial activity."
has Wikidata ID:
"Q35340706"
If we can find WikidataIDs for all articles that would be very useful.

The easiest thing is to search for PMCID=Q35340706

Ambarish, please try to retrieve these IDs for oil186 and make a simple table (CSV) with PMCID and Wikidata ID.
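One possible way to approach this is a SPARQL query against Wikidata that matches items by PMCID. The sketch below only builds the query text and the CSV output; it does not call any service. It assumes that Wikidata property P932 holds the PMCID and stores it without the "PMC" prefix, which is worth verifying before use. The function names and the placeholder CSV row are hypothetical; the two PMCIDs in the query are taken from the oil186 activity table.

```python
import csv
import io


def strip_pmc_prefix(pmcid):
    """Return the numeric part of a PMCID such as 'PMC5080681'."""
    return pmcid[3:] if pmcid.upper().startswith("PMC") else pmcid


def build_pmcid_query(pmcids):
    """Build a SPARQL query that looks up items by PMCID (assumed property: P932)."""
    values = " ".join(f'"{strip_pmc_prefix(p)}"' for p in pmcids)
    return (
        "SELECT ?pmcid ?item WHERE {\n"
        f"  VALUES ?pmcid {{ {values} }}\n"
        "  ?item wdt:P932 ?pmcid .\n"
        "}"
    )


def rows_to_csv(rows):
    """Write (PMCID, Wikidata ID) pairs as a simple two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PMCID", "WikidataID"])
    writer.writerows(rows)
    return buffer.getvalue()


query = build_pmcid_query(["PMC4391421", "PMC5080681"])
print(query)
# Placeholder row only: real pairs would come from running the query.
print(rows_to_csv([("PMC0000000", "Q0")]))
```

Batching the PMCIDs into one `VALUES` clause keeps the lookup to a single query per corpus rather than one request per article.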

Copy final plant tables from EssoilDB

copy EssOilDB/tables/plant/gbif_result.csv from EssoilDB 1.0 and tidy, especially synonyms.

Prepare submission for chemcuration

https://chemcuration.github.io/chemcuration2019/

ChemCuration (#chemcur2019) is a one-day, online-only conference around data curation and curated data in the domain of chemistry. Inspired by BioCuration and by other online conferences, this year will mark the first edition of a Twitter-based online poster conference. During the day, you will tweet your poster with the meeting hashtag and respond to questions about the poster in the 24 hours after your tweet. The poster must be available in an online repository (e.g. Zenodo or Figshare) under the CCZero, CC-BY or CC-BY-SA license.

The meeting scope: anything around data curation and curated data of open science data in chemistry. This includes but is not limited to:

a new release of curated open data
FAIR metadata around open data
open source tools for data curation

Date and Venue

ChemCuration 2019 will take place on Tuesday December 3 this year. Posters will be posted and discussed online. The venue is Twitter [1].

📕 Documentation: Dictionary.xml and DictionaryDescription.md of: eoAnalysisInstrument (inactive)

Created simple dictionary by hand (can be incremented later)

<dictionary title="instrument">
<desc>Hacked from a few papers PMR 20190904</desc>
<entry term="HP6890" name="HP6890"/>
<entry term="QP-5000" name="QP-5000"/>
<entry term="QP" name="QP"/>
<entry term="QP2010" name="QP2010"/>
<entry term="QP2010S" name="QP2010S"/>
<entry term="Shimadzu" name="Shimadzu"/>
<entry term="Clevenger" name="Clevenger"/>
</dictionary>

NOTE: term is used for searching (maybe with stemming).

NOTE: these are probably not in Wikidata. Also Clevenger is not an instrument and should be removed.

name is descriptive.
title attribute on dictionary must match filename
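The two rules above (search on `term`, and `title` must match the filename) can be checked mechanically. This is a minimal sketch with the standard library; the function names and the filename `instrument.xml` are assumptions for illustration, not part of ami.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DICTIONARY_XML = """<dictionary title="instrument">
<desc>Hacked from a few papers PMR 20190904</desc>
<entry term="HP6890" name="HP6890"/>
<entry term="QP-5000" name="QP-5000"/>
<entry term="QP" name="QP"/>
<entry term="QP2010" name="QP2010"/>
<entry term="QP2010S" name="QP2010S"/>
<entry term="Shimadzu" name="Shimadzu"/>
<entry term="Clevenger" name="Clevenger"/>
</dictionary>"""


def load_dictionary(xml_text):
    """Return the dictionary title and the list of search terms."""
    root = ET.fromstring(xml_text)
    terms = [entry.get("term") for entry in root.findall("entry")]
    return root.get("title"), terms


def title_matches_filename(title, filename):
    """Check the rule that the title attribute equals the file's base name."""
    return Path(filename).stem == title


title, terms = load_dictionary(DICTIONARY_XML)
# "instrument.xml" is an assumed filename used only for this check.
print(title, terms, title_matches_filename(title, "instrument.xml"))
```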

🎱 ACTIVITIES TABLE - Pharmacological Category Names/IDs

Discussion of: Bioactive(?) Activities of Plants, Essential Oils, and their constituent compounds

Naming files and consistency

Mixtures of compounds

Something that has just occurred to me is that we need to address the handling of mixtures of compounds. Essential oils are classic examples of such. Alex Clark has done some work on this already, and he claims that there is no file format that handles these already. He's developed a tool which he says addresses this.
https://cheminf20.org/2018/08/27/mixtures-cheminformatics/
I suspect that, contrary to Clark's intuition, CML is more than up to the job of handling this. I'd like to kick off a discussion about the representation of mixtures in CML and solicit some initial suggestions about how to tackle these.

annotate articles with Hypothes.is

Hypothes.is allows both manual and machine annotation of articles.
ContentMine used this 3 years ago with help from the H.is team.

# Current tasks.
Identify terms in the oil186/ corpus corresponding to the following tags:

location

This is the source of the material. We have a country dictionary which can be used.

preparation

how the plant material was harvested and processed. Includes "dried" "macerated", etc.

extraction

method of extraction (@mannyrules is creating a small dictionary).

instrument

equipment (GC MS, etc.)

activity

activity actually tested in article

organisms

target organisms (maybe dictionary)

synonyms for compounds

We have ca 2100 compounds in EssoilDB 1.0. The critical information is:

EID (EssoilID) (Cddddd)
name
PubchemCID (dddddd)
Wikidata (Qddddd)

The problem is that a compound occurs under many synonyms, e.g.

anisole
methoxybenzene
phenyl methyl ether

and many more. Only anisole will be retrieved by the current dictionary.

C3125 	anisole

We need a simple synonym table, one row per synonym

C3125 	anisole
C3125 	methoxybenzene
C3125 	phenyl methyl ether

All names therefore are resolved through a unique EID

Synonyms can be retrieved from Pubchem or Wikidata.
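A small sketch of how the one-row-per-synonym table could be used: every name becomes a key that resolves to the same EID. The function name and the lower-casing of names are assumptions for illustration; the three rows are the ones shown above.

```python
SYNONYM_TABLE = (
    "C3125\tanisole\n"
    "C3125\tmethoxybenzene\n"
    "C3125\tphenyl methyl ether\n"
)


def build_synonym_lookup(table_text):
    """Map each (lower-cased) synonym to its EID, one table row per synonym."""
    lookup = {}
    for line in table_text.splitlines():
        if not line.strip():
            continue
        eid, name = line.split("\t", 1)
        lookup[name.strip().lower()] = eid.strip()
    return lookup


lookup = build_synonym_lookup(SYNONYM_TABLE)
print(lookup["methoxybenzene"], lookup["phenyl methyl ether"])
```

With this shape, adding synonyms from Pubchem or Wikidata only means appending rows; nothing else in the lookup changes.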

size of open / closed corpus

To get some idea of size:
all:

getpapers -q "essential oil" -n -a
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 143714 results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.1 reported by api

open:

getpapers -q "essential oil" -n
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 66694 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.1 reported by api

so about 46% of papers (66694 of 143714) are open (and presumably with XML)

create targetOrganism dictionary

Many articles have tests against specific species or genera. Examples

bacteria
viruses
fungi
worms (esp. parasitic)
parasites (e.g. malaria)
vectors (e.g. Aedes mosquito)

Ambarish has made a start by extracting about 50 target organisms, from a small number of articles

targetOrganism

we should move this into its own dictionary

write up summary of adding 500~ish compounds to Wikidata

See #5

Schema & Scraping helpers

### Stuff to help us identify useful Text and Terms

Manual analysis of activities reported in oil186 articles

manually read ?50? articles and record what activities are reported. Goal is to create a schema into which these can be extracted:

Current thoughts:

activities mentioned in introduction.
activity/s tested in Materials and Methods
table of reported activities

📕 Documentation: Dictionary.xml and DictionaryDescription.md of: eoActivity

The Big WHY — Dictionary: Activities | Extended list/table for normalization

We are building the ACTIVITIES DICTIONARY so that:

Type of User: Verriclear Natural Skin Essentials™
can: confidently choose essential oil ingredients that perform HIGHLY SPECIFIC desired phytomedicinal activities optimally, and possess desired chemical properties (like absorption rate, pleasing or neutral fragrance)
without: introducing undesirable activities and chemical properties (e.g. skin irritants, carcinogenic, toxic, etc.)

Goals:
Describe the Challenge, the solution we will bring, and the Desired End State by which all will know we have achieved excellence.

A. Deliver a diverse and useful set of activities that will serve as keywords when searching the literature, as well as tags to be associated with plants, essential oils, and their constituents

Desired Results:
A clear and concise description / outline of the final "state or vision" of the project — the evidence we will see when our goals are achieved.

A. Identify and cross-reference as many specific Activity Classes, Activity Action Types, and Activity Targets as possible from the relevant fields in the provided RAW data table
B. Normalize their names and synonyms/aliases
C. Add Wikidata or other relevant IDs
D. Capture Activity descriptions for each

Guiding principles:
What principles will guide our decisions as we do our part to fulfill the mission?

A. Review the notes in the column headings as well as the comments related to specific records. If you have questions, ask.

Responsibilities and Roles:
Who will have what completed when?

A. @mannyrules / Verriclear will provide the RAW expanded list of activities to be cross-referenced and normalized
B. @petermr will analyze the RAW data and deliberate with Emanuel and other experts on how best to organize the data
C. @ambarishK will perform the cross-referencing and normalization. (Once final version approved, please check with @gita to update EssOilDb entries that had multiple Activities assigned to single entries, as noted in the RAW table).

Tips, Tools, Shortcuts and Resources:
Anything done or used to make the desired outcome more likely to occur.

A. @ambarishK: Please search for synonyms in this column too. Hopefully, you can widen the scope by changing the suffixes of the words (eg. Carcinogen, Carcinogenic, Carcinogenicity). Also, try with and without hyphens (eg. Anti-Viral and antiviral)
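The two tips above (vary the suffix, ignore hyphens) could be automated roughly as follows. This is only a sketch under simple assumptions: spelling variants are grouped by dropping hyphens, spaces and case, and suffix variants are found by a shared stem supplied by hand. The function names are hypothetical.

```python
import re


def normalise(term):
    """Lower-case and drop hyphens/whitespace so 'Anti-Viral' and 'antiviral' agree."""
    return re.sub(r"[-\s]+", "", term.lower())


def group_spelling_variants(terms):
    """Group terms that differ only by case, hyphens or spaces."""
    groups = {}
    for term in terms:
        groups.setdefault(normalise(term), []).append(term)
    return groups


def suffix_variants(stem, terms):
    """Return the terms made of the given stem plus any word suffix."""
    pattern = re.compile(re.escape(stem) + r"\w*", re.IGNORECASE)
    return [term for term in terms if pattern.fullmatch(term)]


terms = ["Anti-Viral", "antiviral", "Carcinogen", "Carcinogenic", "Carcinogenicity"]
groups = group_spelling_variants(terms)
carcinogen_forms = suffix_variants("carcinogen", terms)
print(groups["antiviral"], carcinogen_forms)
```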

📖✍️ DAILY RECORD

A daily record of activities by each contributor

Create processing dictionary

This dictionary will be small but contain the methods used to process the plant and extract the oils:

Thyme was harvested during the >flowering season<

5080681	The aerial plant parts (leaves, stems and flowers) were collected during its> flowering time<
5132230	>dried< and >crushed< leaves
5203915	Hairy roots (HR) and the roots of >soil-grown< plants (SGR)
5237462	The leaves were treated (>washed< and >dried<)
5248495	from the seeds sown in the greenhouse, with subsequent transplantation of the seedlings to the same field, in the Kotayk Region of Armenia, where they have been growing side by side, at an elevation of 1600 m above the sea level. Plant materials were collected during >blossoming period<
5282690	A. campestris L. was collected at >flowering stage< in September 2012
5307246	>Ripe< fruits of L. kerstingii
5307902	The >fresh< leaves of P. amboinicus were >extracted by steam<
5324201	Whole plants
5330108	Leaves are >washed< thoroughly, >dried< in shade, and >powdered<
5344628	>dried< floral buds
5364420	>dried< C. rotundus rhizomes
5393100	Extraction of the fruits was performed using >boiling water<
5397855	
5411863	All samples were collected at full >flowering stage< for species identification and fruit maturing stage for essential oil analyses
5412227	Flowering, aerial parts of >wild< Dracocephalum kotschyi Boiss
5423258	the >fresh< aerial parts
5427463	Leaves of S. officinalis L.
5448358	aerial parts (stems, leaves, and flowering tops) and the roots
5454990	from leaves, the branches
5485486	The leaves
5486035	( v/ w >fresh< material)
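The `>term<` markers in the lines above can be harvested into a candidate term list. A minimal sketch, assuming each line is an article number, a tab, and the annotated snippet; the function name is hypothetical and the three sample lines are copied from the list above.

```python
import re
from collections import Counter

# Matches text hand-marked as >term< in the snippets.
MARKED = re.compile(r">([^<>]+)<")

LINES = [
    "5132230\t>dried< and >crushed< leaves",
    "5330108\tLeaves are >washed< thoroughly, >dried< in shade, and >powdered<",
    "5344628\t>dried< floral buds",
]


def extract_marked_terms(line):
    """Return the marked terms in one annotated line, stripped of stray spaces."""
    return [term.strip() for term in MARKED.findall(line)]


counts = Counter(term for line in LINES for term in extract_marked_terms(line))
print(counts.most_common())
```

Counting the terms gives a quick view of which processing words recur often enough to belong in the dictionary.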

create instrument dictionary

create a list of instruments used in analysing (but NOT extracting) Essential Oils.
This can be used as ground truth for Tiago's extraction sub-project.

Should find this in:
"materials and methods"

<span class="bold">Gas chromatography-mass spectrometry</span>. Samples were analyzed by gas chromatography using a HP6890 instrument coupled with a HP 5973 mass spectrometer. The gas chromatograph is equipped with a split-splitless injector and a Factor FourTM VF-35ms 5% fenil-methylpolysiloxane, 30 m, 0.25 mm, 0.25 μm film thickness capillary column. Gas chromatography conditions include a temperature range of 50 to 250°C at 40°C/min, with a solvent delay of 5 min. The injector was maintained at a temperature of 250°C. The inert gas was helium at a flow of 1.0 mL/min, and the injected volume in the splitless mode was 1 μL. The MS conditions were the following: ionization energy, 70 eV; quadrupole temperature, 100°C; scanning velocity, 1.6 scan/s; weight range, 40-500 amu.

create a new column for GC-MS
currently just extract "HP6890" (GC) and "HP 5973" (MS)

<sec id="Sec7" class="sec">
 <div class="title" xmlns="http://www.w3.org/1999/xhtml">GC-MS analysis</div>
 <p xmlns="http://www.w3.org/1999/xhtml">GC-MS chromatograms were recorded using Shimadzu QP-5000 GC-MS. The GC was equipped with Rtx-5 ms column (30 m long, 0.25 μm thickness and 0.250 mm inner diameter). Helium was used as a carrier gas at a flow rate of 1 ml/min. Injector temperature was 220 °C. Oven temperature was programmed from 50 °C (1 min hold) at 5 °C/min to 130 °C, then at 10 °C/min to 250 °C and kept isothermally for 15 min. Transfer line temperature was 290 °C. For GC-MS detection, an electron ionization system, with detector volts of 1.7 KV was used. A scan rate of 0.5 s, and scan speed 1000 amu/s was applied, covering a mass range from 38–450 M/Z.</p>
</sec>

extract "Shimadzu QP-5000 GC-MS"
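A sketch of dictionary-driven matching against the two methods passages quoted above. The term list is copied from the hand-made instrument dictionary in the earlier issue (with Clevenger left out, following the note there); the function names are hypothetical. Longer terms are tried first so that `QP-5000` is not reported as plain `QP`.

```python
import re

# Terms from the hand-made instrument dictionary; Clevenger omitted on purpose.
TERMS = ["HP6890", "QP-5000", "QP", "QP2010", "QP2010S", "Shimadzu"]


def compile_terms(terms):
    """Build one regex that prefers the longest matching dictionary term."""
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")


def find_instruments(text, pattern):
    """Return dictionary terms in the order they occur in the text."""
    return pattern.findall(text)


PATTERN = compile_terms(TERMS)
hp_text = ("Samples were analyzed by gas chromatography using a HP6890 instrument "
           "coupled with a HP 5973 mass spectrometer.")
shimadzu_text = "GC-MS chromatograms were recorded using Shimadzu QP-5000 GC-MS."

print(find_instruments(hp_text, PATTERN))
print(find_instruments(shimadzu_text, PATTERN))
```

Note that "HP 5973" is not found because it is not yet a dictionary term; matches like this are only as complete as the dictionary.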

Mapping Tables of Essential Oil Activities mentioned in our test batch of articles

The activity references have been added manually into:
https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/activity20191028.tsv
For any article there may be 0,1,2,3... activities (not normally more). For each activity there should be:

mention of measurement
mention of result
result in table

The activity table should list all triples for each paper. If the mentions and the tables are inconsistent note what has been omitted or duplicated.

The first few rows are:

PMCID 	activity 	activity_method 	activity_result 	table_no 	table_title 	notes
PMC4391421 	anti-microbial 	Materials and methods >> Determination of antimicrobial activity 	Results and Discussion 	Table 2 	Effects of thyme oil against bacteria expressed by the mean ... 	[notes]
PMC5080681 	antibacterial 	Methods >> Antimicrobial tests: 	Results>> Antibacterial and antifungal activities 	Table 4 	Antimicrobial activities of T. bovei essential oil 	[notes]
PMC5080681 	antifungal 	Methods >> Antimicrobial tests: 	Results>> Antibacterial and antifungal activities 	[no] 	[table_title] 	[notes]
PMC5080681 	Anthelmintic 	Methods >> Anthelmintic activity 	Results>> Anthelmintic activity 	Table 2 	Anthelmintic activity of T. bovei essential oil 	[notes]
PMC5080681 	Antioxidant activity 	Methods >> DPPH radical-scavenging activity 	Results >> Antioxidant activity : 	Table 3 	Percentage inhibition of DPPH activity by T. bovei extract a ... 	[notes]
PMC5080681 	antimicrobial activity 	Methods >> Antimicrobial tests: 	Results >> Antibacterial and antifungal activities 	Table 4 	Antimicrobial activities of T. bovei essential oil 	[notes]

The title of the Table should match roughly with the measurement method and description of results.

This is messy because Tables may report more than one activity (as here)
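To find the tables that report more than one activity, the rows can be grouped by article and table number. The sketch below uses a reduced copy of the rows above (PMCID, activity and table_no columns only) and treats bracketed placeholders such as `[no]` as "no table"; the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Reduced copy of the first rows of the activity table (three columns only).
TSV = (
    "PMCID\tactivity\ttable_no\n"
    "PMC4391421\tanti-microbial\tTable 2\n"
    "PMC5080681\tantibacterial\tTable 4\n"
    "PMC5080681\tantifungal\t[no]\n"
    "PMC5080681\tAnthelmintic\tTable 2\n"
    "PMC5080681\tAntioxidant activity\tTable 3\n"
    "PMC5080681\tantimicrobial activity\tTable 4\n"
)


def activities_per_table(tsv_text):
    """Group activities by (PMCID, table_no), skipping bracketed placeholders."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        table = row["table_no"].strip()
        if table.startswith("["):
            continue
        grouped[(row["PMCID"].strip(), table)].append(row["activity"].strip())
    return dict(grouped)


def shared_tables(grouped):
    """Keep only the tables that report more than one activity."""
    return {key: acts for key, acts in grouped.items() if len(acts) > 1}


shared = shared_tables(activities_per_table(TSV))
print(shared)
```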

Installing Chem4Word with the library of active compounds

Introduction

Chem4Word is a free plug-in for Microsoft Word. You can write semantic chemistry documents with it. Follow this posting to install Chem4Word and view the library of active compounds within it.

N.B: It runs on the desktop version for Microsoft Windows only. Linux and Mac users will not be able to run it (apologies - if you really want to try it out, get in touch with me and I can arrange a demo platform).

Installing Chem4Word

Download the installer first and run it.
Close Word; it should not be running for the next step.

Installing the library

Chem4Word comes with a built-in library of compounds you will need to replace.

Instructions for downloading extra compound libraries can be found here https://www.chem4word.co.uk/extra-compound-libraries/

Running Chem4Word

Start Word up as normal. When Word is running you should see a Chemistry tab available
Click on this tab, then click the Open button in the Library button group. The library will take a few seconds to load.
You should now see a list of structures on the left-hand-side of the main window:
Try inserting a structure by clicking on the 'pages' icon on the bottom right.
Try searching for structures by name

Composition table list

Manually create a list of all "chemical composition" tables of essential oils. There should normally be one per paper (often but not always Table 1). The title will normally be sufficient to decide whether it's the correct table. (However in a few cases there might be more than one candidate table).
The archetype (template) is given in https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/composition20191028.tsv

PMCID 	table_no 	table_title 	notes
PMC4391421 	Table 1 	Chemical composition of thyme EO 	[notes]
PMC5080681 	Table 1 	Chemical composition, concentrations (%) and calculated rete ... 	[notes]

This table can be expanded to the full 186 entries.

For NO table enter NONE in cols 2,3,
If multiple tables or unclear enter MULTIPLE in cols 2,3 and list the possible tables and title in col 4
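The NONE / MULTIPLE conventions above lend themselves to a simple row check once the list has been filled in. This is a hypothetical validator written for illustration, not an existing tool; it treats an empty cell or the `[notes]` placeholder as "no notes given", and the `PMC0000000` rows in the usage lines are placeholders.

```python
def check_composition_row(row):
    """Return a list of problems for one (PMCID, table_no, table_title, notes) row."""
    pmcid, table_no, table_title, notes = row
    problems = []
    if not pmcid.startswith("PMC"):
        problems.append("PMCID should start with 'PMC'")
    for marker in ("NONE", "MULTIPLE"):
        # The convention asks for the same marker in both columns 2 and 3.
        if (table_no == marker) != (table_title == marker):
            problems.append(f"{marker} must appear in both table_no and table_title")
    if table_no == "MULTIPLE" and notes in ("", "[notes]"):
        problems.append("MULTIPLE rows should list the candidate tables in notes")
    return problems


good = ("PMC4391421", "Table 1", "Chemical composition of thyme EO", "[notes]")
# Placeholder rows showing the two conventions being broken.
half_none = ("PMC0000000", "NONE", "Some title", "")
bare_multiple = ("PMC0000000", "MULTIPLE", "MULTIPLE", "")

for row in (good, half_none, bare_multiple):
    print(row[0], check_composition_row(row))
```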

compound synonyms and stereochemistry

The compound names in table columns are frequently ambiguous. The first table is https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/thyme.tsv

Compound	Compound_dictionary_lookpup	E2.0_compound_identifiers	notes	wikidata_identifier
alpha-Thujene	(-)-alpha-thujene ; (+)-alpha-thujene	C764 ; C786	stereo-isomers of the compounds are there.	Q27121815 ; Q27121804
alpha-Pinene	alpha-Pinene	C2849	Also, stereo-isomers of the compounds are there.	Q27104380
beta-Pinene	beta-Pinene	C349	Also, stereo-isomers of the compounds are there.	
beta-Myrcene	beta-Myrcene	C345		Q424577
alpha-Phellandrene	alpha-Phellandrene	C2848		Q19606345
Carene<δ-2->	2-carene	C1720	Lookup is of '2-carene'	
D-Limonene	(+)-limonene	C792		Q27888324
beta-Phellandrene	beta-Phellandrene	C3426		Q19606727
para-Cymene	cymene	C4118	Other cymene are present as 'm-cymenene', 'dehydro-p-cymene', 'o-cymene',	Q284072
gamma-Terpinene	beta-terpinene	C355	Present as beta-terpinene	Q23057921
Terpineol	1-terpineol	C1482		Q27276701
Terpinen-4-ol	(+)-terpinen-4-ol	C795		Q27280168
Thymol			not present.	
Caryophyllene	(z)-caryophyllene ; 9-epi-(E)-caryophyllene ; alpha-caryophyllene	C1255 ; C2705 ; C2915	Stereo-isomers are present	NA ; Q27137093 ; Q1995108
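Several cells in this table hold more than one value separated by " ; ". A minimal sketch of expanding such a row into one record per candidate compound, assuming the three multi-valued columns are aligned position by position and that "NA" or an empty cell means no Wikidata ID. The function names are hypothetical.

```python
def split_cell(cell):
    """Split a ' ; '-separated cell into its trimmed parts."""
    return [part.strip() for part in cell.split(";")]


def expand_row(lookup, eids, wikidata):
    """Pair up the i-th lookup name, EID and Wikidata ID of a multi-valued row."""
    names, ids, qids = split_cell(lookup), split_cell(eids), split_cell(wikidata)
    if not (len(names) == len(ids) == len(qids)):
        raise ValueError("multi-valued cells have different lengths")
    return [
        (name, eid, None if qid in ("", "NA") else qid)
        for name, eid, qid in zip(names, ids, qids)
    ]


# The Caryophyllene row from the table above.
caryophyllene = expand_row(
    "(z)-caryophyllene ; 9-epi-(E)-caryophyllene ; alpha-caryophyllene",
    "C1255 ; C2705 ; C2915",
    "NA ; Q27137093 ; Q1995108",
)
for record in caryophyllene:
    print(record)
```

Raising an error on mismatched lengths makes rows where the stereo-isomer candidates and their identifiers have drifted out of step visible early.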

Antioxidant

Table 3
Percentage inhibition of DPPH activity by T. bovei extract and Trolox

Concentration μg/ml	% inhibition by T. bovei essential oil ± SD	% of inhibition by Trolox ± SD
1	38.65 ± 1.08	40.6 ± 0.91
2	47.55 ± 1.34	48.7 ± 1.32
3	52.09 ± 1.05	56.09 ± 0.83
5	64.19 ± 1.32	60.12 ± 1.98
7	64.19 ± 1.83	80.12 ± 1.06
10	69.38 ± 1.33	87.95 ± 1.66
20	76.29 ± 2.12	88.71 ± 1.47
30	81.23 ± 1.43	91.55 ± 2.71
40	92.1 ± 1.65	91.56 ± 1.93
50	95.95 ± 1.40	99.45 ± 2.79
80	97.82 ± 1.87	99.55 ± 1.87
100	98.12 ± 1.58	99.55 ± 2.64

Create activity dictionary from EssoilDB 1.0

Ambarish has created a list of activity terms from E1.0. These are (I think) in
https://github.com/petermr/CEVOpen/blob/master/activitiesNew20190924.xml
There is a concentration of useful terms, but some are false positives and some are messy. Also the Wikidata links are frequently to scientific articles and not to useful primary definitions.

Action:
PMR and possibly @mannyrules - Manually

remove false positives
remove false Wikidata links

The dictionary will be committed as dictionary/activity.xml

Project Strategy

CEVOpen

The CEVOpen project is based on:

an open corpus of plant medicinal chemistry articles
Open ContentMine dictionaries for many facets of this science
a group of (largely unpaid volunteer) collaborators
Open technology and resources (Github, open software, etc.)

all participants work on best endeavour - there are no formal contracts.

aims

There are a number of aims which have different weighting for different participants. Some may only work on their preferred interest; others will work on several aspects and create linking infrastructure.

show that Open Access can provide a critical mass of scientific knowledge.

create an intelligent knowledgebase

use semantic chemical technology (CML)

develop semantic ContentMine dictionaries

formalise activity data for plant oils in searchable form

show the value of automatically annotating semantic scientific articles

explore new science (patterns and relationships in data)

participants and interests

** Please edit **

All participants are interested in collaborating.

ContentMine

CM dictionaries for many topics
automatic analysis of articles
(PMR) geophytochemical research

NIPGR

(GY, AK, S-M, ?MK)

cleaning and development of EssoilDB data
restructuring
automatic population of Essoil from literature
prediction of oil and chemical activities

Verriclear Natural Skin Essentials Ltd. (aka. Verriclear)

EF - Emanuel Faria

creation of knowledgebase of oils, sources, components, activities, targets and Minimum Effective Concentrations required to be fit for selected purpose
understanding of constituent/activity relations to predict new activity profiles
use of knowledgebase for likely new valuable oils for new products

Tiago

Development of automatic extraction of entities from articles.

machine learning
linguistic patterns (e.g. Hearst, Tensorflow, Keras)

Chem4Word

develop new "Library" for 2000 essential oils components
develop chemical search tools
develop plugins for researching information about related compounds
multivariate techniques for composition-activity

Blue Obelisk

(EgonW)

map CEV onto CDK and other chemical informatics tools, including substructure
explore pathways in the terpinome
open article tools (e.g Scholia)

WHO have I missed?

Explore KNIME functionality on articles and CProjects

@deadlyvices has been exploring this and reporting in email.

ACTION: copy any relevant past emails here...

Using Hypothes.is to annotate/tag test research articles in the Oil186 repository

We are using the open-source Hypothes.is annotation tool to MANUALLY annotate/tag activities found in the test batch of research articles currently in the Oil186 repository so that...

[Type of] User: @petermr
can: train and test AMI and Content Mine to AUTOMATICALLY and ACCURATELY annotate/tag the terms found in other research articles, by referring to the individual dictionaries containing our keyword search terms.
without: the need for further human interaction.

Goals:
The Challenge, the solution we will bring, and the Desired End State by which all will know we have achieved excellence.

The goal is to manually tag all activities found within the test articles (with reference to the activities in our existing dictionary), which will then be used to train tools for future automated processing.

Steps to achieve the Goal(s):
The Challenge, the solution we will bring, and the Desired End State by which all will know we have achieved excellence.

Install an experimental Hypothesis client with support for controlled tagging
Create an account at Hypothes.is
Create a group within which all of our test article annotations will be saved, separately from the public or other groups. This is to make sure exporting doesn’t bring in extraneous data. (I called ours DAVE)
Export the finished annotations/tags using the Hypothes.is annotation export tool which allows for viewing/exporting annotations.
Post here so @petermr can take next steps with them, and the world can see our results.

Desired Results:
A clear and concise description / outline of the final "state or vision" of the project — the evidence we will see when our goals are achieved.

Excluding any activities found in articles under the heading “References”, I will tag the following in each section of each article in the OIL186 repository:

activity
activity_method
activity_result
table_no
table_title

I will also add any new activities to our Activities Dictionary.

Toxicity Tables

NLP extraction of terms

The very formulaic language

<result pre="chromatography-mass spectrometry. Samples were analyzed by gas chromatography using a" exact="HP6890" post="instrument coupled with a HP 5973 mass spectrometer.

is effectively a Hearst pattern

 using a" exact="Foo" post="instrument ...

The Stanford NLP group has a tool (SPIED) that uses seeds (e.g. instrument names) to detect the context language and use it to identify new instruments.

@tiago this looks like a good thing to try.
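For comparison, the fixed pattern itself is easy to try with a regular expression. The sketch below is not SPIED and does no seed-based learning; it only applies the single hand-written context "using a ... instrument" to show what such a pattern captures. The pattern and function name are assumptions for illustration.

```python
import re

# One hand-written context pattern: "using a/an <Name> instrument".
USING_INSTRUMENT = re.compile(r"\busing an? (?P<name>\S+) instrument\b")


def instruments_from_pattern(text):
    """Return every token that fills the <Name> slot of the pattern."""
    return [match.group("name") for match in USING_INSTRUMENT.finditer(text)]


sentence = ("Samples were analyzed by gas chromatography using a HP6890 instrument "
            "coupled with a HP 5973 mass spectrometer.")
print(instruments_from_pattern(sentence))
print(instruments_from_pattern("measured using a Foo instrument in the lab"))
```

A single pattern like this misses phrasings such as "recorded using Shimadzu QP-5000 GC-MS", which is the gap a seed-driven tool is meant to close by learning further contexts.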

❓ QUESTIONS for Gita

We've got questions, Gita's got answers!

Create TYPES for activity tables

Activity tables have variable numbers of columns and may have parameters (e.g. concentrations). Ideally we should transform to denormalised (e.g tidytables) but that comes later.

Here we try to abstract the different types of tables.

Start by collecting into different activity types, one issue per type.
We denote the types here in a single comment and then create an issue.

plant parts dictionary

@ambarishK
Can you
(A) transfer the plant parts dictionary from E1.0 to CEV? add PlantPart ID (PPID). Call the dictionary plantpart
(A1) link the parts to Wikidata
(B) do a manual analysis of plant parts in oil186 . Where the part(s) are in the dictionary give the PPID ; where not, make a list and highlight any new ones.

Building ami

We are going to need to build ami regularly.
@ambarishK can you run maven?

💡Wishlist of Features/Capabilities to Consider

A place to propose/debate ideas for future features (or at least get them out of our head). ;)

Antimicrobial MIC

Put 583 remaining compounds in Wikidata

By creating a Bacting script that takes PubChem CIDs and adds the corresponding compounds to Wikidata.

Project diagrams

Create a component diagram of the project, using Graphviz/dotty if possible.

Manual Analysis of section content and titles

THIS ACTIVITY IS TO FIND THE SECTIONS IN WHICH KEY INFORMATION IS REPORTED.
**IT DOES NOT INCLUDE THE INTRODUCTION / BACKGROUND**

Each of these will be a separate column. There will usually be NO, ONE or possibly 2-4 entries.

ACTIVITY TESTED

We need the actual activities TESTED.
There will normally be ONE, maybe TWO, occasionally MORE and sometimes NONE.
Also please record the TITLE of the section where it is recorded, e.g. "analysis of antimicrobial activity".

PROCESS

Need the titles

LOCATION

Need the titles

PLANT

Need the title of sections which mention the plant/s under study

PLANTPARTS

Need the title of sections which mention the plant part/s under study

ACTIVITY

Need the title of sections which mention the activity under study

The compounds will be in the table so not required and we can omit instrument at this stage

🔥🔥NEW ISSUE TEMPLATE: Please use and edit as necessary🔥🔥

Putting on my project manager's hat ⛑, I just set up a Kanban-style project card structure here. Please bookmark it and set it as your main page.

Also, please use the template below for each new Issue you open. Aim to complete it such that any reader can be clear about the issue's purpose and importance, and perhaps find ways we can assist you in it. Thanks!

The Big WHY

We are building AMI so that:
Type of User:
can:
without:

Goals:
Describe the Challenge, the solution we will bring, and the Desired End State by which all will know we have achieved excellence.

Desired Results:
A clear and concise description / outline of the final "state or vision" of the project — the evidence we will see when our goals are achieved.

Guiding principles:
What principles will guide our decisions as we do our part to fulfill the mission?

Massive Action Steps:
What massive actions will generate the Desired Results?

Responsibilities and Roles:
Who will have what completed when?

Interim Deliverable #1:

Milestone:
Milestone Date:
Single Individual Responsible for ensuring this Milestone is reached on time:

Interim Deliverable #2

Milestone:
Milestone Date:
Single Individual Responsible for ensuring this Milestone is reached on time:

Tips, Tools, Shortcuts and Resources:
Anything done or used to make the desired outcome more likely to occur.

Rules and Responsibilities for Achieving Excellence
Always:

Never:

restructure directories

Analyze `oil1000` for presence of components

oil1000 is a 1000 article offset of the Open Access subset retrieved with the query:

getpapers -q "((essential oil) AND (chemical composition))" -x -k 1000 -o oil1000

Chemical compound lookup in PubChem.

https://pubchem.ncbi.nlm.nih.gov/ - Used for name lookup and for checking whether a compound name is available in the repository.

https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi - PubChem identifier exchange services.

The PubChem identifier exchange services and the PUG REST API perform equally well.

PUG REST documentation

Example for the PubChem identifiers exchange services -
PubChem services

For batch retrieval, browse for the .csv file containing the list of compound names.

I found it easier than the PUG REST API, as it does not require replacing white-space or parentheses with percent-encodings such as %20, %28 or %29.

Both services perform equally well: I passed the unresolved compound names to both of them (after percent-encoding white-space and parentheses for the PUG REST API) and the results were similar.

For example:

C4995 iso-borneol
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/iso-borneol/cids/xml
PUG REST API result: <Message>No CID found</Message>

C5044 isobonyl acetate
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/isobonyl%20acetate/cids/xml
PUG REST API result: <Message>No CID found</Message>

C828 (4Z)-decenal
PubChem identifier exchange services: Result set is empty
PUG REST API: url - https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/%284Z%29-decenal/cids/xml
PUG REST API result: <Message>No CID found</Message>
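The percent-encoding needed for PUG REST does not have to be done by hand. A minimal sketch using Python's standard `urllib.parse.quote` follows; the URL template is copied from the examples above, while the names `PUG_REST_NAME_TO_CID` and `cid_lookup_url` are hypothetical. The sketch only builds the URLs and does not contact PubChem.

```python
from urllib.parse import quote

# URL shape taken from the PUG REST examples above.
PUG_REST_NAME_TO_CID = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{name}/cids/xml"
)


def cid_lookup_url(name: str) -> str:
    """Build a name-to-CID lookup URL, percent-encoding spaces and parentheses."""
    return PUG_REST_NAME_TO_CID.format(name=quote(name, safe=""))


names = ["iso-borneol", "isobonyl acetate", "(4Z)-decenal"]
urls = [cid_lookup_url(n) for n in names]
```

Passing `safe=""` makes `quote` encode every reserved character, so a name containing `/` would not be mistaken for a path separator.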

create manuscript for Beilstein J Org Chem

The BJOC has asked for submissions for articles on Open Drug Discovery.
I have contacted Mat Todd, the issue organiser, and indicated that I/we would be submitting something. This was before the eLife Sprint and the group would be very happy to be merged into that.

To coordinate this we are authoring on Github using markdown and will then assemble a manuscript.

This actual manuscript will be a layer over the various work that has already been done, mainly on Github and there should be no need to retype any of it. The Github and code is the work and an integral part of the "article". The rest which we are assembling here is a narrative that guides the reader. The work itself is dynamic and evolving. Although "submission", "review", "acceptance" and "publication" are precise events in traditional publishing they are merely snapshots along the line of the actual science.

📚DICTIONARIES to consider creating/adding

Here we discuss new dictionaries that may be useful to D.A.V.E.
