
megan's People

Contributors

kudkudak, mikolajsacha

megan's Issues

Segmentation fault (core dumped)

I tried to reproduce the results on USPTO-50K, but I ran into a segmentation fault, like this:

python bin/eval.py models/uspto_50k --beam-size 50 --show-every 100
2021-05-28 16:54:43,187 - src - INFO - Setting random seed to 132435
2021-05-28 16:54:45,910 - __main__ - INFO - Creating model...
2021-05-28 16:54:45,910 - __main__ - INFO - Loading data...
/home/shuanchen/anaconda3/envs/megan/lib/python3.6/site-packages/numpy/core/fromnumeric.py:61: FutureWarning: Series.nonzero() is deprecated and will be removed in a future version.Use Series.to_numpy().nonzero() instead
  return bound(*args, **kwds)
2021-05-28 16:54:47,224 - __main__ - INFO - Evaluating on 5030 samples from test
models/uspto_50k beam search on test:   0%|          | 0/5030 [00:00<?, ?it/s]Segmentation fault (core dumped)

How can I solve this problem?
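
Not a fix, but a generic way to narrow down where a "Segmentation fault (core dumped)" occurs in a Python process is the standard-library faulthandler module, which prints the Python stack of each thread when the interpreter receives SIGSEGV. This is a general debugging aid, not something specific to MEGAN:

# Option 1: no code changes, enable it from the command line:
#   python -X faulthandler bin/eval.py models/uspto_50k --beam-size 50 --show-every 100
#
# Option 2: enable it programmatically near the top of the entry script:
import faulthandler
faulthandler.enable()  # dumps Python tracebacks on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL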

CUDA out of memory

Training works fine on USPTO-50K, but I get a CUDA out-of-memory error when training on USPTO-full.
[attached screenshot of the CUDA out-of-memory error]
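
The training script may already expose a batch-size option; failing that, a generic PyTorch pattern for fitting a larger effective batch into limited GPU memory is gradient accumulation. The sketch below uses toy stand-ins for the model and data and is not MEGAN's actual training loop:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model and dataset (illustration only).
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=4)  # small micro-batches that fit in GPU memory

accum_steps = 4  # effective batch size = accum_steps * micro-batch size
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()                            # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()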

Question about the evaluation metrics

Hi, your work is impressive. I have a little question about your evaluation in https://github.com/molecule-one/megan/blob/master/bin/eval.py.

Based on my understanding, normally, for each product, a model predicts m reactants. Then the top-k is calculated as the fraction of samples whose top-k predicted reactants contain at least one ground truth reactant.

In your evaluation, it seems that for each product A you predict several complete reactant sets, say A1, A2, A3, ranked by a set score. Top-k is then calculated as the fraction of samples whose top k predicted sets include one that matches the ground-truth reactant set exactly. I am wondering whether this is stricter than the usual top-k evaluation, since all reactants have to be predicted correctly. In other words, is it fair to compare this "top-k" with others' "top-k"?

Please correct me if I got anything wrong. Thanks.
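
For concreteness, here is a minimal sketch (not the code in bin/eval.py) of the stricter criterion described above, where a ranked prediction only counts if its canonicalized reactant set matches the ground-truth set exactly. The helper names are made up for illustration, and valid SMILES are assumed:

from rdkit import Chem

def canon_set(reactants_smiles):
    # Split a dot-separated reactant SMILES and canonicalize each molecule.
    return frozenset(Chem.MolToSmiles(Chem.MolFromSmiles(s))
                     for s in reactants_smiles.split('.'))

def hit_at_k(ranked_predictions, ground_truth, k):
    # True if any of the top-k predicted reactant sets equals the ground-truth set.
    target = canon_set(ground_truth)
    return any(canon_set(p) == target for p in ranked_predictions[:k])

def top_k_accuracy(samples, k):
    # samples: iterable of (ranked_predictions, ground_truth_reactants) pairs.
    samples = list(samples)
    return sum(hit_at_k(preds, gt, k) for preds, gt in samples) / len(samples)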

License

Thank you for your effort and for sharing the code. Would you add a license to the repo? I'd like to use the code but can't since it has no license.

Lower results for forward-synthesis on USPTO-MIT

I have been reproducing the results following the instructions in the README, but I am unable to reach the top-k accuracies reported in the paper. Please note that I am training on USPTO-MIT (mixed version). My results are attached below:
[attached screenshot of the obtained top-k results]
I did not set any extra training parameters; I just followed the procedure described in the repository. Should I pass additional parameters to match your implementation? Any comment would be appreciated.

Nice work - are you aware of the USPTO data leak? (+ training issue on HPC)

Hi authors, this is very nice work. I enjoyed reading the paper, as it is well explained. In particular, your top-20 and top-50 accuracies are very high, beating works like GLN, which is impressive. Also, thanks for being one of the few repos that provide proper documentation and an env.yaml for easy installation.

However (and as much as I hate to break it to you), I'm not sure whether you are aware of the USPTO data leak. Please refer to https://github.com/uta-smile/RetroXpert, a work concurrent with yours, for details on this issue.

In short, the first mapped atom (id: 1) in the product SMILES (and in the reactant SMILES) is usually the reaction centre. So if you directly use the original atom mapping in the USPTO dataset to set the order of generation/edits for molecular graphs, there is a subtle but important data leak that makes it easier for the model to generate the correct reactants: the model implicitly learns that the first atom in most products (and, consequently, reactants) is the reaction centre and therefore needs some sort of modification, instead of having to learn which atom actually is the reaction centre. The authors of RetroXpert have been alerted to this leak and implemented a corrected canonicalization pipeline to ensure that the atom-map ordering is truly canonical.

I have re-trained RetroXpert on that truly canonical USPTO-50K, and their top-1 accuracy is significantly lower at ~45% (and so are the remaining top-k accuracies). This makes their work no longer SOTA.

I took a brief look at MEGAN's data scripts but didn't see any canonicalization step. I am interested in whether you have already corrected for this issue, or whether you have re-run/are re-running the experiments. Thanks.
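
For reference, one generic way to remove this kind of ordering leak (a sketch using RDKit, not RetroXpert's or MEGAN's actual pipeline) is to reassign atom-map numbers according to the product's canonical atom ranks and propagate the new numbers to the reactants, so that map number 1 no longer preferentially lands on the reaction centre:

from rdkit import Chem

def remap_reaction(prod_smiles, react_smiles):
    # Re-number atom maps so they follow the product's canonical atom ranks,
    # not the (possibly reaction-centre-biased) original ordering.
    # Assumes every mapped reactant atom also appears mapped in the product.
    prod = Chem.MolFromSmiles(prod_smiles)

    # Compute canonical ranks on an unmapped copy, so the result does not
    # depend on the original map numbers.
    bare = Chem.Mol(prod)
    for atom in bare.GetAtoms():
        atom.SetAtomMapNum(0)
    ranks = Chem.CanonicalRankAtoms(bare, breakTies=True)

    old_to_new = {}
    for atom, rank in zip(prod.GetAtoms(), ranks):
        old = atom.GetAtomMapNum()
        new = int(rank) + 1  # 1-based map numbers
        if old:
            old_to_new[old] = new
        atom.SetAtomMapNum(new)

    react = Chem.MolFromSmiles(react_smiles)
    for atom in react.GetAtoms():
        old = atom.GetAtomMapNum()
        if old:
            atom.SetAtomMapNum(old_to_new.get(old, 0))  # drop maps absent from the product
    return Chem.MolToSmiles(prod), Chem.MolToSmiles(react)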
