sustainable-processes / pura

Clean chemical data quickly

License: MIT License

Python 100.00%

pura's People

Contributors

dswigh simonmb ad1arsh

pura's Issues

How to resolve organometallics

From @rvanputt:

Resolution of complex ligands and metal complexes was more difficult. There is extensive diversity in naming here, so this should indeed be more difficult. I tried systematic names, CAS#, and trade names such as SL-J001-1, but unfortunately this didn’t yet work for most structures. I suspect including CAS and ChemSpider should improve this, as they have entries for many of these chemicals. (Could you help with that?)

I think the key here is (1) enabling more services and (2) relaxing the agreement algorithm in some cases to handle when only one service can find a compound (see #7).

Taking SL-J001-1 as an example, only PubChem could resolve this name out of the currently available services (PubChem, CIR, CAS, ChemSpider, OPSIN). Even when I looked up SL-J001-1 by its CAS number on Common Chemistry, which should be the definitive source for CAS numbers, nothing returned.

In terms of more services, here are some ideas:

Sigma-Aldrich: They have a large number of commercially available compounds and, from inspection, it looks like they now have a well-formed (GraphQL?) API. However, there might be rate-limiting issues, so we'd need to only use it as a last resort.
Solvias: They have a wide range of commercially available ligands listed. However, the website doesn't have an API, so we'd need to do some type of RPA which would be slow.

@rvanputt, could you sample some of your difficult organometallics and see if they are available on Sigma or Solvias? If so, I'll look into writing a service for one of those.

Write unit tests of services

Unit tests of the following services are needed:

There should be pytest mocks of each service, so we can check that the requests are formed correctly. Also integrate with GitHub Actions so the tests run automatically.

Allow disabling agreement algorithm in resolution

From Robbert:

On what to do when there is not sufficient agreement. It would indeed be great if we could have all unique representations (ideally already ranked most to least frequent, but this I can implement locally). My thinking was that in case there is no trustworthy, automatic resolution, at least we could have a good suggestion for manual cleaning. Pruning (incorrect) suggestions is easier than coming up with entries yourself.
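A minimal sketch of the "return all unique representations, ranked" idea, assuming each service's results have already been standardized to comparable strings. The function name `rank_candidates` and the dictionary layout are hypothetical and not part of pura's current API.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_candidates(results_by_service: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return each unique identifier with the number of services that found it.

    The list is ordered from most to least frequent; ties keep first-seen order.
    """
    counts: Counter = Counter()
    for identifiers in results_by_service.values():
        # Count an identifier at most once per service.
        counts.update(list(dict.fromkeys(identifiers)))
    return counts.most_common()


# Hypothetical results for one input name.
results = {
    "PubChem": ["C/C=C/C#N"],
    "OPSIN": ["C/C=C/C#N"],
    "CIR": ["CC=CC#N"],
}
print(rank_candidates(results))
# [('C/C=C/C#N', 2), ('CC=CC#N', 1)]
```

With output like this, manual cleaning becomes a matter of pruning suggestions rather than looking compounds up from scratch.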

DB canonical identifier constraints

load_into_database should update canonical entries if update_only=True

Create functions to load reactions

We need a way for users to load in reaction data. Everything will be stored in inputs and product for actual use in transforms.

On reaction identifier formats: I think it's best to only support SMILES and RXN block as this is what RDKit can support. Those are also supported by ChemDraw. I looked at RInChI, but the software that supports it has terrible documentation, and I'm not sure anyone uses it.

So given that, we'll create some tools for standard databases and then functions for generic supported formats:

On reaction SMILES (from @ad1arsh): There are 2 possible formats that I'm aware of:

a.b>>c where a and b are reactants and c is the product
a.b>e.f>c where a and b are reactants, e and f are agents, c is the product.

So if we have a SMILES in the second format, we can allow the user to pass a list of reagents, catalysts, and solvents; otherwise, everything in the e and f positions gets classified as an agent. This means we need to add an extra class for Agent with options for null, catalyst, reagent, and solvent.
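A rough sketch of how the two reaction SMILES formats could be split, using only the standard library. The names `AgentRole`, `Agent`, `ParsedReaction`, and `parse_reaction_smiles` are hypothetical, and the `roles` mapping is one assumed way for the user to pass in known catalysts, reagents, and solvents. It does not handle extended reaction SMILES (the trailing `|f:...|` block) and does not validate the molecules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class AgentRole(Enum):
    CATALYST = "catalyst"
    REAGENT = "reagent"
    SOLVENT = "solvent"


@dataclass
class Agent:
    smiles: str
    role: Optional[AgentRole] = None  # None means an unclassified agent


@dataclass
class ParsedReaction:
    reactants: List[str]
    agents: List[Agent]
    products: List[str]


def _split_molecules(part: str) -> List[str]:
    # "." separates molecules; an empty part yields an empty list.
    return [s for s in part.split(".") if s]


def parse_reaction_smiles(
    rxn_smiles: str, roles: Optional[Dict[str, AgentRole]] = None
) -> ParsedReaction:
    """Split 'a.b>>c' or 'a.b>e.f>c' into reactants, agents, and products."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"Expected exactly two '>' characters in {rxn_smiles!r}")
    roles = roles or {}
    return ParsedReaction(
        reactants=_split_molecules(parts[0]),
        # Agents without a user-supplied role stay unclassified (role=None).
        agents=[Agent(s, roles.get(s)) for s in _split_molecules(parts[1])],
        products=_split_molecules(parts[2]),
    )


rxn = parse_reaction_smiles(
    "CCO.CC(=O)O>OS(=O)(=O)O>CCOC(C)=O.O",
    roles={"OS(=O)(=O)O": AgentRole.CATALYST},
)
print(rxn.agents)
```

Matching roles by exact string is a simplification; in practice the agent SMILES would probably need to be canonicalized before the lookup.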

Make docs website

Make readthedocs website to help new users.

Tutorial on using resolvers: resolve_identifiers, CompoundIdentifierType, which services should be used for different use cases, and handling when there is not sufficient agreement.
Tutorial on writing your own agreement function
API reference including the types of identifiers that can be used with each service.

Pubchem fails when one output identifier is unusable

File "/.../pura/services/pubchem.py", line 165, in get_properties
properties = ",".join([PROPERTY_MAP.get(p, p) for p in properties])
TypeError: sequence item 0: expected str instance, NoneType found
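One possible guard, sketched under assumptions: the traceback suggests a `None` entry reaches the `",".join(...)` call, so filtering unusable entries before joining would avoid the `TypeError`. The `PROPERTY_MAP` contents below are placeholders, and `build_property_string` is a hypothetical helper, not the actual code in `pura/services/pubchem.py`.

```python
from typing import Iterable, Optional

# Placeholder mapping; the real PROPERTY_MAP in pura may differ.
PROPERTY_MAP = {"smiles": "CanonicalSMILES", "inchi": "InChI"}


def build_property_string(properties: Iterable[Optional[str]]) -> str:
    """Map requested properties to service names, skipping unusable entries."""
    mapped = [PROPERTY_MAP.get(p, p) for p in properties]
    usable = [p for p in mapped if p is not None]
    if not usable:
        # Fail with a clear message instead of a TypeError inside join().
        raise ValueError("No usable output properties were requested.")
    return ",".join(usable)


print(build_property_string([None, "smiles", "inchi"]))
```

Whether unusable identifiers should be skipped silently or reported to the caller is a design decision this sketch leaves open.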

Disagreement due to stereochemical SMILES

Given the molecule: (E)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']

These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.

Perhaps a 'drop stereochemical information' arg would be a solution?
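As a crude illustration of what such an argument could do, the sketch below strips stereo markers at the string level: `/` and `\` (double-bond geometry) and `@` (tetrahedral chirality). The function name is hypothetical. This is not a canonicalization step, so two SMILES for the same molecule can still differ afterwards; a toolkit-based approach, for example RDKit's `Chem.RemoveStereochemistry` followed by regenerating the SMILES, would likely be more robust.

```python
def strip_stereo_markers(smiles: str) -> str:
    """Remove '/', '\\' bond-direction marks and '@' chirality tags from a SMILES."""
    return smiles.translate(str.maketrans("", "", "/\\@"))


pubchem = "C/C=C/C#N"
cir = "CC=CC#N"
print(strip_stereo_markers(pubchem) == strip_stereo_markers(cir))
```

For the example in this issue, both strings reduce to `CC=CC#N`, so the two services would agree once stereochemistry is ignored.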

Sweep: Update to Gr1N/setup-poetry@v8

Update .github/workflows/ci.yml to use Gr1N/setup-poetry@v8 instead of v7

idea: add IUPAC name support via pybacting and OPSIN

With pybacting you have access to OPSIN which can convert many IUPAC names to structures:

from pybacting import cdk
from pybacting import opsin

mol = opsin.parseIUPACName("butane")
smiles = cdk.calculateSMILES(mol)

Enable specifying input identifier type in CIR

CIR allows you to specify the input identifier type: https://cirpy.readthedocs.io/en/latest/guide/resolvers.html

There should be an option to allow the input identifier type to be prioritized at resolve time.
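A sketch of how the prioritization could be expressed, assuming the resolver names from the CIRpy guide linked above (the list below is an assumed subset, and its default order is illustrative). `prioritized_resolvers` is a hypothetical helper; its result is meant to be passed to CIRpy's `resolvers` argument, e.g. `cirpy.resolve(name, "smiles", resolvers=order)`.

```python
from typing import List, Optional, Sequence

# Assumed subset of CIR resolver names; check the CIRpy guide for the full list.
DEFAULT_RESOLVERS = [
    "smiles",
    "stdinchikey",
    "stdinchi",
    "cas_number",
    "name_by_opsin",
    "name_by_cir",
]


def prioritized_resolvers(
    preferred: Optional[Sequence[str]] = None, exclusive: bool = False
) -> List[str]:
    """Put the preferred resolvers first; optionally drop all the others."""
    if not preferred:
        return list(DEFAULT_RESOLVERS)
    unknown = [r for r in preferred if r not in DEFAULT_RESOLVERS]
    if unknown:
        raise ValueError(f"Unknown resolver(s): {unknown}")
    if exclusive:
        return list(preferred)
    # Preferred resolvers first, then the remaining defaults in their usual order.
    return list(preferred) + [r for r in DEFAULT_RESOLVERS if r not in preferred]


order = prioritized_resolvers(["cas_number"])
print(order)
```

The `exclusive` flag covers the case where the caller knows the input type and wants to avoid CIR guessing a different interpretation.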

CAS service is breaking

I’ll check if we have access to ChemSpider. I’ve tried to use CAS, but unfortunately get an error (below). I am logged in in a browser and cookies are unchanged. Probably I’m doing something wrong.

Batch:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/robbert/Documents/Python/pura/pura_test_v1.py", line 7, in <module>
    smiles = resolve_identifiers(
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 434, in resolve_identifiers
    return resolver.resolve(
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 222, in resolve
    return loop.run_until_complete(
  File "/home/robbert/anaconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in _resolve
    resolved_identifiers.extend([await f for f in batch_bar])
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in <listcomp>
    resolved_identifiers.extend([await f for f in batch_bar])
  File "/home/robbert/anaconda3/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 326, in _resolve_one_compound
    raise e
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 296, in _resolve_one_compound
    resolved_identifiers = await service.resolve_compound(
  File "/home/robbert/Documents/Python/pura/pura/services/cas.py", line 38, in resolve_compound
    raise ValueError(
ValueError: CompoundIdent

resolve_identifiers should return the identifier values

Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].

I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, InChI, etc. I'll go ahead and make this change.
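Until that change lands, the values can be pulled out locally. The sketch below uses a simplified stand-in for pura's `CompoundIdentifier` (only the `value` attribute matters, as seen in the printed output above); `identifier_values` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompoundIdentifier:
    """Simplified stand-in for pura's class, for illustration only."""

    identifier_type: str
    value: str
    details: Optional[str] = None


def identifier_values(identifiers: List[CompoundIdentifier]) -> List[str]:
    """Return just the identifier strings, dropping the wrapper objects."""
    return [identifier.value for identifier in identifiers]


resolved = [
    CompoundIdentifier("SMILES", "COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1")
]
print(identifier_values(resolved))
```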

Slow query resolves

Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?

See #9 (comment)

.../pura/pura_test_v2.py
Batch:   0%|          | 0/1 [00:00<?, ?it/s]
.../pura/compound.py:101: UserWarning: Warning: SMILES of a mixture, rather than a pure compound, was found.
  warnings.warn(
[12:46:21] SMILES Parse Error: syntax error while parsing: [Pd](|OC(C)=O)|OC(C)=O
[12:46:21] SMILES Parse Error: Failed parsing SMILES '[Pd](|OC(C)=O)|OC(C)=O' for input: '[Pd](|OC(C)=O)|OC(C)=O'
Batch:   0%|          | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File ".../pura/pura_test_v2.py", line 7, in <module>
    resolved = resolve_identifiers(    
  File ".../pura/resolvers.py", line 491, in resolve_identifiers
    return resolver.resolve(
  File ".../pura/resolvers.py", line 237, in resolve
    return loop.run_until_complete(
  File "...//asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File ".../pura/resolvers.py", line 295, in _resolve
    resolved_identifiers.extend([await f for f in batch_bar])
  File ".../pura/resolvers.py", line 295, in <listcomp>
    resolved_identifiers.extend([await f for f in batch_bar])
  File ".../asyncio/tasks.py", line 611, in _wait_for_one
    return f.result()  # May raise f.exception().
  File ".../pura/resolvers.py", line 343, in _resolve_one_compound
    standardize_identifier(identifier)
  File ".../pura/compound.py", line 137, in standardize_identifier
    for a in mol.GetAtoms():
AttributeError: 'NoneType' object has no attribute 'GetAtoms'

Don't allow backup identifiers to be used on the same service

Sweep: Unit test PubChem service

Write unit tests in tests/test_services for the PubChem class in pura/services/pubchem.py. Here are some name and SMILES pairs that you can use as examples:

"Josiphos SL-J001-1", "CC(C1CCCC1P(C2=CC=CC=C2)C3=CC=CC=C3)P(C4CCCCC4)C5CCCCC5.C1CCCC1.[Fe]"
"Acetone", "CC(=O)C"
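A sketch of the mocking pattern such a test could follow, using the Acetone pair above. `FakeNameService`, its `_get` method, and the example.org URL are hypothetical stand-ins; the real PubChem class in pura/services/pubchem.py has its own request layer, which would be the thing to patch. The pattern assumes the service method is async, as the tracebacks in other issues suggest.

```python
import asyncio
from unittest.mock import AsyncMock


class FakeNameService:
    """Hypothetical stand-in for a pura service; not the real PubChem class."""

    base_url = "https://example.org/compound/name"

    async def _get(self, url: str) -> dict:
        raise RuntimeError("network access is not expected in unit tests")

    async def resolve_compound(self, name: str) -> list:
        data = await self._get(f"{self.base_url}/{name}/smiles")
        return data["smiles"]


def test_request_is_formed_correctly():
    service = FakeNameService()
    # Replace the network call with a mock that returns a canned response.
    service._get = AsyncMock(return_value={"smiles": ["CC(=O)C"]})

    result = asyncio.run(service.resolve_compound("Acetone"))

    assert result == ["CC(=O)C"]
    # Check that exactly one request was made, with the expected URL.
    service._get.assert_awaited_once_with(
        "https://example.org/compound/name/Acetone/smiles"
    )
```

Because nothing touches the network, a test in this style can run under pytest in CI without rate-limit or availability problems.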