
deepnog's People

Contributors

alepfu, colligant, dependabot[bot], lokiluciferase, phyden, saper0, varir


deepnog's Issues

Update readme

The readme needs an update.
Changes:

  • Add a badge for GitHub Actions builds
  • List the available eggNOG 5 and COG 2020 models
  • Mention the upcoming paper in Bioinformatics
  • Remove requirements' versions from the readme (not tested recently)

Progress bar

Progress bars currently report minibatches per second. This is not very helpful for typical users, both because of the terminology and because it does not say how large each minibatch is.

Sequences/sec would be more informative.
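With tqdm, for instance, this amounts to constructing the bar with `unit="seq"` and calling `update(batch_size)` once per minibatch instead of `update(1)`. The rate computation itself is simple; the following stdlib-only sketch (the function name is made up) shows the conversion from minibatches to sequences:

```python
import time

def format_rate(n_sequences, elapsed_s):
    """Return a human-readable throughput string in sequences/sec."""
    rate = n_sequences / elapsed_s if elapsed_s > 0 else 0.0
    return f"{rate:.1f} seq/s"

# Simulate iterating over minibatches of known (possibly varying) size
# and report throughput in sequences rather than minibatches.
start = time.perf_counter()
n_seen = 0
for batch_size in [16, 16, 8]:   # hypothetical minibatch sizes
    n_seen += batch_size          # count sequences, not batches
elapsed = time.perf_counter() - start
print(n_seen, format_rate(n_seen, elapsed))
```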

CLI set confidence threshold

There should be an option to set the confidence threshold from the CLI, so that users can easily choose on their own.
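A minimal argparse sketch of such an option; the flag name `--confidence-threshold`, its default, and the help text are assumptions, not the actual deepnog CLI:

```python
import argparse

def build_parser():
    # Hypothetical subset of a deepnog-style CLI.
    parser = argparse.ArgumentParser(prog="deepnog")
    parser.add_argument(
        "--confidence-threshold", type=float, default=None, metavar="T",
        help="discard predictions with confidence below T (0 < T <= 1); "
             "by default, the model's stored threshold is used",
    )
    return parser

# A user overriding the threshold from the command line:
args = build_parser().parse_args(["--confidence-threshold", "0.8"])
print(args.confidence_threshold)
```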

CI with Python 3.9

Python 3.9 is available now, and all our dependencies should have been updated.
Let's add Actions for 3.9 on Linux and macOS.
We could also set up Actions for 3.9 on Windows, which would allow us to phase out AppVeyor as well.
(However, it might be better to keep it, considering the quotas of CI providers. Needs to be checked.)
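A minimal workflow fragment for such a matrix; the file path, step names, and test command below are assumptions, not the project's actual configuration:

```yaml
# Hypothetical fragment of .github/workflows/ci.yml
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
        python-version: ["3.7", "3.8", "3.9"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - run: pip install -e . && pytest
```

Adding `windows-latest` to the `os` list would cover the AppVeyor use case.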

DeepNOG implementation doesn't use dropout

Dropout with p=0.3 is declared in the __init__ method of class DeepNOG, yet it is never called in model.forward(). I'm not sure where to add it based on the information in the paper.
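Absent guidance from the paper, a common choice is to apply dropout between the pooled representation and the final linear layer. The toy module below is a hypothetical sketch of that placement only; the layer names and sizes are made up and do not reflect DeepNOG's actual architecture:

```python
import torch
import torch.nn as nn

class TinyNOG(nn.Module):
    """Minimal stand-in for DeepNOG; the dropout placement
    (after pooling, before the classifier) is an assumption."""
    def __init__(self, n_features=64, n_classes=10, p=0.3):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.dropout = nn.Dropout(p=p)
        self.classifier = nn.Linear(n_features, n_classes)

    def forward(self, x):
        # x: (batch, channels, length)
        x = self.pool(x).squeeze(-1)
        x = self.dropout(x)   # active in train() mode, identity in eval()
        return self.classifier(x)

model = TinyNOG().eval()      # eval() disables dropout for inference
out = model(torch.randn(2, 64, 100))
print(tuple(out.shape))
```

Note that nn.Dropout is a no-op in eval() mode, so adding the call would only change training behavior, not the published inference results.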

Galaxy integration

The eggNOG integration in Galaxy is quite popular. I think it would be nice to add deepnog to Galaxy as well.

Model versioning

Currently, deepnog ships one model per eggNOG level and network architecture.
If we ever decide to retrain certain models, users would need to come up with their own strategies to tell models apart or to pin a specific model (e.g., for reproducibility), such as manually moving files around and renaming them accordingly.
Retraining, however, could sometimes make sense: for example, we might want to use different data splits, or increase the share of training sequences relative to test sequences to squeeze a little more performance out of the model.

We should at least introduce some versioning, model identifiers, etc., that are stored with the model. Could be a simple string inside the model_dict. This could even be "backported" to existing models.

Ideally, automatic model download should also be version-aware. Currently, a user that already has downloaded a model will not receive any updated model.
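A minimal, backward-compatible sketch of reading such an identifier; the key name "version", the version-string format, and the helper name are all assumptions for illustration:

```python
def model_version(model_dict):
    """Return the version stored in a model dict, treating legacy
    files (which lack the key) as 'unversioned'."""
    return model_dict.get("version", "unversioned")

# A retrained model ships with an identifier; an old download does not.
new_style = {"model_state_dict": {}, "version": "eggNOG5-2-v1.1"}
old_style = {"model_state_dict": {}}
print(model_version(new_style), model_version(old_style))
```

The downloader could then compare the local version against the one advertised on the server and re-download on mismatch.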

Error handling

The client should not throw errors, but log helpful messages for the user.
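One common pattern is to catch expected exceptions at the client boundary and log a message instead of letting a traceback reach the user. The wrapper below is a hypothetical sketch; the function name and file name are made up:

```python
import logging

logger = logging.getLogger("deepnog")

def run_client(sequence_file):
    """Hypothetical client entry point: translate an expected error
    into a logged, user-readable message instead of a traceback."""
    try:
        with open(sequence_file) as f:
            return f.read()
    except FileNotFoundError:
        logger.error("Input file not found: %s", sequence_file)
        return None

result = run_client("does_not_exist.faa")
print(result)
```

In a CLI, the `None` branch would typically translate into a nonzero exit code via `sys.exit(1)`.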

Switching to other CI providers?

We are hitting the new limits on Travis CI introduced in Nov 2020.
The macOS builds in particular are hurting us (50 credits/min compared to 10 credits/min for Linux).
We should be eligible for OSS extra credits, which I'll try to obtain.

However, in the long run, we should consider switching to other CI providers, e.g. GitHub Actions or Azure Pipelines,
depending on what they offer for public open-source projects.

Remove training with iterable datasets?

As discussed in #43, training with iterable datasets is rarely used. In particular, training without shuffling might never be useful.
Because of that, the corresponding functionality is not covered by many tests, and it was broken for some time until this was pointed out in PR #43.

While the bug is fixed for the time being, and additional tests have been put in place, we might want to remove these functions altogether. This would reduce maintenance cost and improve quality.

Travis Ubuntu builds fail

The builds fail with RuntimeError: code is too big from PyTorch.
This is not reproducible on local machines running Fedora or CentOS, and the macOS builds work, too.
For now, the Travis Ubuntu builds are allowed to fail.

Erroneous packaging in v1.2.0

There was an error in packaging deepnog 1.2.0 for PyPI, hitting the Linux/macOS wheel.
Old modules were not removed and interfered with the new package structure.
A new version 1.2.1 is available now on PyPI, which is essentially identical, but packaged correctly.

Please update to 1.2.1.

Number of threads for CPU training

We should document how to set the number of threads for training on CPUs (in case anyone would like to do that).
Basically, it's export OMP_NUM_THREADS=8 for intra-op parallelism.

Alternatively, this may be set programmatically with torch.set_num_threads() and torch.set_num_interop_threads().
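A short sketch of the programmatic route. The thread count of 8 is an arbitrary example; note that set_num_interop_threads() must be called before any inter-op parallel work has started, so it can raise in a long-running process:

```python
import torch

# Intra-op parallelism: threads used within a single op.
# Equivalent to exporting OMP_NUM_THREADS before starting Python.
torch.set_num_threads(8)

# Inter-op parallelism: how many ops may run concurrently.
# Raises RuntimeError if parallel work has already started.
try:
    torch.set_num_interop_threads(8)
except RuntimeError:
    pass  # too late to change in this process

print(torch.get_num_threads())
```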
