sustainable-processes / pura
Clean chemical data quickly
License: MIT License
From @rvanputt:
Resolution of complex ligands and metal complexes was more difficult. There is extensive diversity in naming here, so this should indeed be more difficult. I tried systematic names, CAS#, and trade names such as SL-J001-1, but unfortunately this didn’t yet work for most structures. I suspect including CAS and ChemSpider should improve this, as they have entries for many of these chemicals. (Could you help with that?)
I think the key here is (1) enabling more services and (2) relaxing the agreement algorithm in some cases to handle when only one service can find a compound (see #7).
Taking SL-J001-1 as an example, only PubChem could resolve this name out of the currently available services (PubChem, CIR, CAS, ChemSpider, OPSIN). Even when I looked up SL-J001-1 by its CAS number on Common Chemistry, which should be the definitive source for CAS numbers, nothing was returned.
In terms of more services, here are some ideas:
@rvanputt, could you sample some of your difficult organometallics and see if they are available on Sigma or Solvias? If so, I'll look into writing a service for one of those.
Unit tests of the following services are needed:
There should be pytest mocks of each service, so we can check that the requests are formed correctly. Also integrate with GitHub Actions so the tests run automatically.
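A mocked service test could look roughly like the sketch below. It does not use pura's actual service classes, whose interfaces are not shown here: `build_name_url` and `resolve_name` are hypothetical stand-ins, the HTTP call is injected as a plain callable so it can be replaced by a `Mock`, and the PUG REST URL layout and the `CanonicalSMILES` property name are assumptions about the PubChem endpoint. The payload is canned test data, not a recorded response.

```python
from typing import Callable, List
from unittest.mock import Mock
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_name_url(name: str) -> str:
    """Build the name-to-SMILES lookup URL (hypothetical helper)."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/property/CanonicalSMILES/JSON"


def resolve_name(name: str, http_get_json: Callable[[str], dict]) -> List[str]:
    """Resolve a name to SMILES strings using an injected JSON getter."""
    payload = http_get_json(build_name_url(name))
    rows = payload["PropertyTable"]["Properties"]
    return [row["CanonicalSMILES"] for row in rows]


def test_resolve_acetone():
    # Canned payload standing in for the network response.
    fake_get = Mock(
        return_value={
            "PropertyTable": {"Properties": [{"CID": 180, "CanonicalSMILES": "CC(=O)C"}]}
        }
    )
    assert resolve_name("Acetone", fake_get) == ["CC(=O)C"]
    # The point of the mock: check the request was formed correctly.
    fake_get.assert_called_once_with(
        PUG_REST + "/compound/name/Acetone/property/CanonicalSMILES/JSON"
    )
```

Because no network access is needed, a test of this shape can run unchanged under pytest in CI.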
From Robbert:
On what to do when there is not sufficient agreement. It would indeed be great if we could have all unique representations (ideally already ranked most to least frequent, but this I can implement locally). My thinking was that in case there is no trustworthy, automatic resolution, at least we could have a good suggestion for manual cleaning. Pruning (incorrect) suggestions is easier than coming up with entries yourself.
load_into_database should update canonical entries if update_only=True
We need a way for users to load in reaction data. Everything will be stored in inputs and product for actual use in transforms.
On reaction identifier formats: I think it's best to only support SMILES and RXN block, as this is what RDKit can support. Those are also supported by ChemDraw. I looked at RInChI, but the software that supports it has terrible documentation, and I'm not sure anyone uses it.
So given that, we'll create some tools for standard databases and then functions for generic supported formats:
On reaction SMILES (from @ad1arsh): There are 2 possible formats that I'm aware of:
So if we have a SMILES in the second format, we can allow the user to pass a list of reagents, catalysts and solvents; otherwise, everything in the e and f position gets classified as an agent. This means we need to add an extra class for Agent with options for null, catalyst, reagent, and solvent.
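One way the Agent class might look is sketched below. It assumes the plain `reactants>agents>products` reaction SMILES layout and treats the whole middle field as agents. The names `AgentType`, `Agent`, and `parse_reaction_smiles` are hypothetical, and agents are matched against the user-supplied lists by exact string comparison, without canonicalization.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Tuple


class AgentType(Enum):
    NULL = "null"  # role unknown
    CATALYST = "catalyst"
    REAGENT = "reagent"
    SOLVENT = "solvent"


@dataclass(frozen=True)
class Agent:
    smiles: str
    agent_type: AgentType = AgentType.NULL


def parse_reaction_smiles(
    rxn_smiles: str,
    catalysts: Iterable[str] = (),
    reagents: Iterable[str] = (),
    solvents: Iterable[str] = (),
) -> Tuple[List[str], List[Agent], List[str]]:
    """Split a reaction SMILES into reactants, typed agents, and products."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    reactants, agent_smiles, products = (
        [s for s in part.split(".") if s] for part in parts
    )
    # Roles come only from the user-supplied lists; anything else stays NULL.
    roles = {s: AgentType.SOLVENT for s in solvents}
    roles.update({s: AgentType.REAGENT for s in reagents})
    roles.update({s: AgentType.CATALYST for s in catalysts})
    agents = [Agent(s, roles.get(s, AgentType.NULL)) for s in agent_smiles]
    return reactants, agents, products
```

When no role lists are passed, every agent keeps `AgentType.NULL`, which corresponds to the "everything gets classified as an agent" fallback.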
Make readthedocs website to help new users.
resolve_identifiers, CompoundIdentifierType, which services should be used for different use cases, and handling when there is not sufficient agreement.
File "/.../pura/services/pubchem.py", line 165, in get_properties
properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
TypeError: sequence item 0: expected str instance, NoneType found
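The error arises because `str.join` rejects `None`, and `PROPERTY_MAP.get(p, p)` returns `None` whenever `p` itself is `None`. One possible guard is sketched below. The contents of `PROPERTY_MAP` here are illustrative stand-ins, `join_properties` is a hypothetical helper, and whether skipping `None` is the right behavior depends on why `None` reached this call in the first place.

```python
from typing import Dict, Iterable, List, Optional

# Illustrative stand-in; the real mapping is not shown in the report.
PROPERTY_MAP: Dict[str, str] = {"smiles": "CanonicalSMILES", "inchi": "InChI"}


def join_properties(properties: Iterable[Optional[str]]) -> str:
    """Map property names and join them, skipping None entries."""
    names: List[str] = []
    for p in properties:
        if p is None:
            # A None entry is what triggers the TypeError in ",".join(...).
            continue
        names.append(PROPERTY_MAP.get(p, p))
    if not names:
        raise ValueError("no property names were given")
    return ",".join(names)
```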
Given the molecule: (e)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']
These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.
Perhaps a 'drop stereochemical information' arg would be a solution?
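A 'drop stereochemical information' option could be sketched as below. This is a purely textual approach that removes the SMILES stereo marks (`/`, `\`, `@`) before comparing, and the function names are hypothetical. It does not canonicalize, so it only helps when the two strings are otherwise written identically; a robust version would parse and re-serialize the structures with a cheminformatics toolkit such as RDKit.

```python
def strip_stereo(smiles: str) -> str:
    """Remove double-bond direction and chirality marks from a SMILES string."""
    # '/' and '\' encode cis/trans geometry around double bonds.
    without_bond_marks = smiles.replace("/", "").replace("\\", "")
    # '@' and '@@' encode tetrahedral chirality inside bracket atoms.
    return without_bond_marks.replace("@", "")


def agree_ignoring_stereo(smiles_a: str, smiles_b: str) -> bool:
    """Compare two SMILES strings after dropping stereo marks."""
    return strip_stereo(smiles_a) == strip_stereo(smiles_b)
```

Applied to the example above, the PubChem and CIR results for (e)-2-butenenitrile compare as equal once the `/` marks are dropped.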
Update .github/workflows/ci.yml to use Gr1N/setup-poetry@v8 instead of v7
With pybacting you have access to OPSIN which can convert many IUPAC names to structures:
from pybacting import cdk
from pybacting import opsin
mol = opsin.parseIUPACName("butane")
smiles = cdk.calculateSMILES(mol)
CIR allows you to specify the input identifier type: https://cirpy.readthedocs.io/en/latest/guide/resolvers.html
There should be an option to allow the input identifier type to be prioritized at resolve time.
I’ll check if we have access to ChemSpider. I’ve tried to use CAS, but unfortunately get an error (below). I am logged in in a browser and cookies are unchanged. Probably I’m doing something wrong.
Batch: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/robbert/Documents/Python/pura/pura_test_v1.py", line 7, in <module>
smiles = resolve_identifiers(
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 434, in resolve_identifiers
return resolver.resolve(
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 222, in resolve
return loop.run_until_complete(
File "/home/robbert/anaconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in _resolve
resolved_identifiers.extend([await f for f in batch_bar])
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in <listcomp>
resolved_identifiers.extend([await f for f in batch_bar])
File "/home/robbert/anaconda3/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
return f.result() # May raise f.exception().
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 326, in _resolve_one_compound
raise e
File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 296, in _resolve_one_compound
resolved_identifiers = await service.resolve_compound(
File "/home/robbert/Documents/Python/pura/pura/services/cas.py", line 38, in resolve_compound
raise ValueError(
ValueError: CompoundIdent
Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1
instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].
I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, InChI, etc. I'll go ahead and make this change.
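Until such a change lands, the plain strings can be pulled out of the returned objects with a one-line helper. The sketch below uses stand-in classes that mirror only the fields visible in the repr quoted above (`identifier_type`, `value`, `details`); they are not pura's actual definitions, and `plain_values` is a hypothetical name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CompoundIdentifierType(Enum):
    SMILES = 2  # value taken from the repr quoted in the request


@dataclass
class CompoundIdentifier:
    identifier_type: CompoundIdentifierType
    value: str
    details: Optional[str] = None


def plain_values(identifiers: List[CompoundIdentifier]) -> List[str]:
    """Return only the identifier strings, dropping the wrapper objects."""
    return [identifier.value for identifier in identifiers]
```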
- Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?
See #9 (comment)
Can searches be done for partial names?
e.g., DuPhos - https://pubchem.ncbi.nlm.nih.gov/#query=DuPhos
It seems the Publish workflow is not pulling the full name of the current version on PyPI (0.2 vs 0.2.1), so the publish pipeline tries to republish a version that already exists on PyPI.
Pd(OAc)2 misbehaves:
.../pura/pura_test_v2.py
Batch: 0%| | 0/1 [00:00<?, ?it/s]
/.../pura/compound.py:101: UserWarning: Warning: SMILES of a mixture, rather than a pure compound, was found.
warnings.warn(
[12:46:21] SMILES Parse Error: syntax error while parsing: [Pd](|OC(C)=O)|OC(C)=O
[12:46:21] SMILES Parse Error: Failed parsing SMILES '[Pd](|OC(C)=O)|OC(C)=O' for input: '[Pd](|OC(C)=O)|OC(C)=O'
Batch: 0%| | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File ".../pura/pura_test_v2.py", line 7, in <module>
resolved = resolve_identifiers(
File ".../pura/resolvers.py", line 491, in resolve_identifiers
return resolver.resolve(
File ".../pura/resolvers.py", line 237, in resolve
return loop.run_until_complete(
File "...//asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File ".../pura/resolvers.py", line 295, in _resolve
resolved_identifiers.extend([await f for f in batch_bar])
File ".../pura/resolvers.py", line 295, in <listcomp>
resolved_identifiers.extend([await f for f in batch_bar])
File ".../asyncio/tasks.py", line 611, in _wait_for_one
return f.result() # May raise f.exception().
File ".../pura/resolvers.py", line 343, in _resolve_one_compound
standardize_identifier(identifier)
File ".../pura/compound.py", line 137, in standardize_identifier
for a in mol.GetAtoms():
AttributeError: 'NoneType' object has no attribute 'GetAtoms'
Write unit tests in tests/test_services for the PubChem class in pura/services/pubchem.py. Here are some name and SMILES pairs that you can use as examples:
"Josiphos SL-J001-1", "CC(C1CCCC1P(C2=CC=CC=C2)C3=CC=CC=C3)P(C4CCCCC4)C5CCCCC5.C1CCCC1.[Fe]"
"Acetone", "CC(=O)C"