
pura's People

Contributors

marcosfelt, gz82, ad1arsh, dswigh

pura's Issues

How to resolve organometallics

From @rvanputt:

Resolution of complex ligands and metal complexes was more difficult. There is extensive diversity in naming here, so this should indeed be more difficult. I tried systematic names, CAS#, and trade names such as SL-J001-1, but unfortunately this didn’t yet work for most structures. I suspect including CAS and ChemSpider should improve this, as they have entries for many of these chemicals. (Could you help with that?)

I think the key here is (1) enabling more services and (2) relaxing the agreement algorithm in some cases to handle when only one service can find a compound (see #7).

Taking SL-J001-1 as an example, only PubChem could resolve this name out of the currently available services (PubChem, CIR, CAS, ChemSpider, OPSIN). Even when I looked up SL-J001-1 by its CAS number on Common Chemistry, which should be the definitive source for CAS numbers, nothing was returned.

In terms of more services, here are some ideas:

  • Sigma-Aldrich: They have a large number of commercially available compounds and, from inspection, it looks like they now have a well-formed (GraphQL?) API. However, there might be rate-limiting issues, so we'd need to use it only as a last resort.
  • Solvias: They have a wide range of commercially available ligands listed. However, the website doesn't have an API, so we'd need to do some type of RPA, which would be slow.

@rvanputt, could you sample some of your difficult organometallics and see if they are available on Sigma-Aldrich or Solvias? If so, I'll look into writing a service for one of those.

Write unit tests of services

Unit tests of the following services are needed:

  • CAS
  • ChemSpider
  • CIR
  • OPSIN
  • PubChem
  • STOUT

There should be pytest mocks of each service, so we can check that the requests are formed correctly. Also integrate with GitHub Actions so the tests run automatically.
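
As a rough sketch of what such a mocked test could look like: the class below is a hypothetical stand-in, not pura's actual service API, and the URL template is made up for illustration. The idea is to inject the network call so that a mock can record the request without any network access.

```python
import asyncio
from unittest.mock import AsyncMock
from urllib.parse import quote


class FakeNameService:
    """Hypothetical stand-in for a pura service; not pura's real class."""

    # Illustrative URL template only; the real services build their own URLs.
    URL = "https://example.org/resolve/{name}/smiles"

    def __init__(self, fetch):
        # `fetch` is an async callable that takes a URL and returns the body text.
        self._fetch = fetch

    async def resolve_compound(self, name):
        url = self.URL.format(name=quote(name, safe=""))
        text = await self._fetch(url)
        return [line for line in text.splitlines() if line]


def test_request_is_formed_correctly():
    # The mock replaces the network call and records how it was awaited.
    fetch = AsyncMock(return_value="CC(=O)C\n")
    service = FakeNameService(fetch)

    result = asyncio.run(service.resolve_compound("Acetone"))

    fetch.assert_awaited_once_with("https://example.org/resolve/Acetone/smiles")
    assert result == ["CC(=O)C"]
```

pytest would collect `test_request_is_formed_correctly` automatically; the same pattern could be repeated per service with that service's real URL and response format.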

Allow disabling agreement algorithm in resolution

From Robbert:

On what to do when there is not sufficient agreement. It would indeed be great if we could have all unique representations (ideally already ranked most to least frequent, but this I can implement locally). My thinking was that in case there is no trustworthy, automatic resolution, at least we could have a good suggestion for manual cleaning. Pruning (incorrect) suggestions is easier than coming up with entries yourself.
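
A minimal sketch of the ranking Robbert describes, assuming the per-service results are available as a plain dict of lists (that input shape and the function name `rank_candidates` are assumptions, not pura's API). The service names and the third entry are made up for illustration.

```python
from collections import Counter


def rank_candidates(results_by_service):
    """Return (value, votes) pairs ordered from most to least frequently reported."""
    counts = Counter()
    for values in results_by_service.values():
        # Count each value once per service so that duplicates within one
        # service do not inflate its vote.
        counts.update(set(values))
    # Sort by descending vote count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative input only; keys and values are placeholders.
results = {
    "service_a": ["C/C=C/C#N"],
    "service_b": ["CC=CC#N"],
    "service_c": ["C/C=C/C#N"],
}
ranked = rank_candidates(results)
```

Returning every unique candidate with its vote count leaves the pruning decision to the user, which fits the manual-cleaning workflow described above.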

Create functions to load reactions

We need a way for users to load in reaction data. Everything will be stored in inputs and product for actual use in transforms.

On reaction identifier formats: I think it's best to only support SMILES and RXN block, as this is what RDKit can support. Those are also supported by ChemDraw. I looked at RInChI, but the software to support it has terrible documentation, and I'm not sure anyone uses it.

So given that, we'll create some tools for standard databases and then functions for generic supported formats:

  • Load from reaction SMILES (see below) #40
  • Load from RXN block
  • Load from USPTO
  • Load from Reaxys API
  • Load from Pistachio API

On reaction SMILES (from @ad1arsh): There are 2 possible formats that I'm aware of: 

  • a.b>>c where a and b are reactants and c is the product
  • a.b>e.f>c where a and b are reactants, e and f are agents, c is the product.

So, if we have a reaction SMILES in the second format, we can allow the user to pass a list of reagents, catalysts and solvents; otherwise, everything in the e and f positions gets classified as an agent. This means we need to add an extra class for Agent with options for null, catalyst, reagent, and solvent.
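
To make the two formats concrete, here is a small sketch of the parsing and the proposed Agent class. All names (`AgentRole`, `Agent`, `ParsedReaction`, `parse_reaction_smiles`) are hypothetical, and splitting on "." treats each dot-separated fragment as its own species, which is a simplification for salts and other multi-component compounds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class AgentRole(Enum):
    CATALYST = "catalyst"
    REAGENT = "reagent"
    SOLVENT = "solvent"


@dataclass
class Agent:
    smiles: str
    role: Optional[AgentRole] = None  # None means "unclassified agent"


@dataclass
class ParsedReaction:
    reactants: List[str]
    agents: List[Agent]
    products: List[str]


def _split_species(part: str) -> List[str]:
    # An empty middle section ("a.b>>c") yields an empty list.
    return [s for s in part.split(".") if s]


def parse_reaction_smiles(
    rxn: str, roles: Optional[Dict[str, AgentRole]] = None
) -> ParsedReaction:
    """Split 'a.b>e.f>c' (or 'a.b>>c') into reactants, agents and products."""
    roles = roles or {}
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"Expected two '>' separators in reaction SMILES: {rxn!r}")
    reactants, agents, products = (_split_species(p) for p in parts)
    return ParsedReaction(
        reactants=reactants,
        # Agents without a user-supplied role stay unclassified (role=None).
        agents=[Agent(s, roles.get(s)) for s in agents],
        products=products,
    )
```

For example, `parse_reaction_smiles("CCO.CC(=O)O>OS(=O)(=O)O>CC(=O)OCC", {"OS(=O)(=O)O": AgentRole.CATALYST})` would tag the sulfuric acid as a catalyst, while omitting the dict leaves it as a plain agent.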

Make docs website

Make readthedocs website to help new users.

  • Tutorial on using resolvers: resolve_identifiers, CompoundIdentifierType, which services should be used for different use cases, and handling when there is not sufficient agreement.
  • Tutorial on writing your own agreement function
  • API reference including the types of identifiers that can be used with each service.

Disagreement due to stereochemical SMILES

Given the molecule: (e)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']

These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.

Perhaps a 'drop stereochemical information' arg would be a solution?
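
A crude, text-level sketch of what such an option could do, assuming only the common stereo marks need handling: "/" and "\" for double-bond geometry and "@" for tetrahedral chirality. The function names are hypothetical. This does not canonicalize the SMILES and would not handle extended chirality classes; a toolkit-based route (for instance RDKit's `Chem.RemoveStereochemistry` followed by canonical SMILES output) would likely be more robust.

```python
import re

# '/' and '\' mark double-bond geometry; '@' marks tetrahedral chirality.
_STEREO_MARKS = re.compile(r"[/\\@]")


def strip_stereo_marks(smiles: str) -> str:
    """Remove the common stereo marks from a SMILES string (text-level only)."""
    return _STEREO_MARKS.sub("", smiles)


def agree_ignoring_stereo(a: str, b: str) -> bool:
    """Compare two SMILES strings after dropping stereo marks from both."""
    return strip_stereo_marks(a) == strip_stereo_marks(b)


# The two results quoted above compare equal once stereo marks are dropped.
print(agree_ignoring_stereo("C/C=C/C#N", "CC=CC#N"))  # True
```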

CAS service is breaking

I’ll check if we have access to ChemSpider. I’ve tried to use CAS, but unfortunately get an error (below). I am logged in in a browser and cookies are unchanged. Probably I’m doing something wrong.

Batch:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/robbert/Documents/Python/pura/pura_test_v1.py", line 7, in <module>
    smiles = resolve_identifiers(
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 434, in resolve_identifiers
    return resolver.resolve(
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 222, in resolve
    return loop.run_until_complete(
  File "/home/robbert/anaconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in _resolve
    resolved_identifiers.extend([await f for f in batch_bar])
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 274, in <listcomp>
    resolved_identifiers.extend([await f for f in batch_bar])
  File "/home/robbert/anaconda3/lib/python3.9/asyncio/tasks.py", line 611, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 326, in _resolve_one_compound
    raise e
  File "/home/robbert/Documents/Python/pura/pura/resolvers.py", line 296, in _resolve_one_compound
    resolved_identifiers = await service.resolve_compound(
  File "/home/robbert/Documents/Python/pura/pura/services/cas.py", line 38, in resolve_compound
    raise ValueError(
ValueError: CompoundIdent

resolve_identifiers should return the identifier values

Would it be possible to generate output without the additional text? Of course this is easy to remove, but again it adds an additional step. I mean this: COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1 instead of this: [CompoundIdentifier(identifier_type=<CompoundIdentifierType.SMILES: 2>, value='COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1', details=None)].

I am actually for this change. resolve_identifiers is supposed to be a convenience function, and the most common use case would be just putting in a list of names and wanting back SMILES, InChI, etc. I'll go ahead and make this change.
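
Until that change lands, the values can be pulled out in one line. The classes below are stand-ins that only mirror the repr quoted above (field names and the `SMILES: 2` enum member are taken from it); they are not imported from pura, and `identifier_values` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CompoundIdentifierType(Enum):
    SMILES = 2  # matches <CompoundIdentifierType.SMILES: 2> in the quoted output


@dataclass
class CompoundIdentifier:
    # Stand-in mirroring the fields shown in the quoted repr.
    identifier_type: CompoundIdentifierType
    value: str
    details: Optional[str] = None


def identifier_values(identifiers: List[CompoundIdentifier]) -> List[str]:
    """Return only the raw identifier strings."""
    return [identifier.value for identifier in identifiers]


resolved = [
    CompoundIdentifier(
        CompoundIdentifierType.SMILES,
        "COc1ccccc1P(CCP(c1ccccc1)c1ccccc1OC)c1ccccc1",
    )
]
print(identifier_values(resolved))
```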

Slow query resolves

  1. Resolving the remaining queries took about an hour (!). It did 1-38 in 19 seconds, 1-39 in 41:46 min, and 1-40 in 01:07 h. Is this reproducible on your end?

See #9 (comment)

Fix GitHub action failing on non-updates

It seems the Publish workflow is not pulling the full name of the current version on PyPI (0.2 vs 0.2.1), so the publish pipeline tries to republish a version that already exists on PyPI.
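
The failure mode can be illustrated with a small version comparison. This is a sketch under the assumption that versions are purely numeric and dot-separated; the function names are hypothetical and this is not the workflow's actual logic.

```python
def parse_version(version: str) -> tuple:
    """Turn '0.2.1' into (0, 2, 1); trailing zeros are dropped so 0.2 == 0.2.0."""
    parts = [int(part) for part in version.strip().split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def should_publish(local_version: str, published_version: str) -> bool:
    """Publish only when the local version is strictly newer."""
    return parse_version(local_version) > parse_version(published_version)


# With the full published version, nothing is republished.
print(should_publish("0.2.1", "0.2.1"))  # False
# With a truncated "0.2", the same local version looks new,
# which matches the republish attempt described above.
print(should_publish("0.2.1", "0.2"))  # True
```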

Standardize identifiers fails on bad SMILES

Pd(OAc)2 misbehaves:

.../pura/pura_test_v2.py
Batch:   0%|          | 0/1 [00:00<?, ?it/s]
/.../pura/compound.py:101: UserWarning: Warning: SMILES of a mixture, rather than a pure compound, was found.
  warnings.warn(
[12:46:21] SMILES Parse Error: syntax error while parsing: [Pd](|OC(C)=O)|OC(C)=O
[12:46:21] SMILES Parse Error: Failed parsing SMILES '[Pd](|OC(C)=O)|OC(C)=O' for input: '[Pd](|OC(C)=O)|OC(C)=O'
Batch:   0%|          | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
  File ".../pura/pura_test_v2.py", line 7, in <module>
    resolved = resolve_identifiers(    
  File ".../pura/resolvers.py", line 491, in resolve_identifiers
    return resolver.resolve(
  File ".../pura/resolvers.py", line 237, in resolve
    return loop.run_until_complete(
  File "...//asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File ".../pura/resolvers.py", line 295, in _resolve
    resolved_identifiers.extend([await f for f in batch_bar])
  File ".../pura/resolvers.py", line 295, in <listcomp>
    resolved_identifiers.extend([await f for f in batch_bar])
  File ".../asyncio/tasks.py", line 611, in _wait_for_one
    return f.result()  # May raise f.exception().
  File ".../pura/resolvers.py", line 343, in _resolve_one_compound
    standardize_identifier(identifier)
  File ".../pura/compound.py", line 137, in standardize_identifier
    for a in mol.GetAtoms():
AttributeError: 'NoneType' object has no attribute 'GetAtoms'
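
The AttributeError suggests the parser returned None for the unparsable SMILES (RDKit's `Chem.MolFromSmiles` returns None on a parse failure) and the result was used without a check. A minimal sketch of a guard follows; `parse_or_warn` and `fake_parse` are hypothetical, and the parser is injected so the example runs without RDKit.

```python
import warnings


def parse_or_warn(smiles, parse):
    """Parse a SMILES string; warn and return None if the parser fails."""
    mol = parse(smiles)
    if mol is None:
        # Skip standardization instead of failing later on mol.GetAtoms().
        warnings.warn(f"Could not parse SMILES {smiles!r}; skipping standardization.")
        return None
    return mol


def fake_parse(smiles):
    # Stand-in for a real parser: rejects the '|' characters seen in the log above.
    return None if "|" in smiles else object()


bad = parse_or_warn("[Pd](|OC(C)=O)|OC(C)=O", fake_parse)
good = parse_or_warn("CC(=O)C", fake_parse)
```

With a guard like this, a bad SMILES from one service could be dropped from the candidate list while the other services' results are still standardized.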

Sweep: Unit test PubChem service

Write unit tests in tests/test_services for the PubChem class in pura/services/pubchem.py. Here are some name and SMILES pairs that you can use as examples:

"Josiphos SL-J001-1", "CC(C1CCCC1P(C2=CC=CC=C2)C3=CC=CC=C3)P(C4CCCCC4)C5CCCCC5.C1CCCC1.[Fe]"
"Acetone", "CC(=O)C"
