phi-grib / flame



Modeling framework for eTRANSAFE project

License: GNU General Public License v3.0

Python 99.78% Shell 0.22%

machine-learning python qsar risk-assessment

flame's People

josecarlosgomezt etransafe ismaelresp e7dal bbyun28 emvgaron jcheminform trellixvulnteam rnaimehaom harel-coffee

flame's Issues

Temp directories for Windows need to be created manually

The default upload directory for the prediction web service is /var/tmp. On Windows, this directory must be created by hand.
We need to detect the platform on which the server is running and set up appropriate temp directories.
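One possible approach, as a minimal sketch: let the standard library choose a platform-appropriate temp location instead of hardcoding /var/tmp. The function name `get_upload_dir` and the subdirectory name `flame_uploads` are hypothetical and not part of Flame.

```python
import os
import tempfile


def get_upload_dir(subdir="flame_uploads"):
    """Return a writable upload directory, creating it if needed.

    `subdir` is a hypothetical name used only for illustration.
    """
    # tempfile.gettempdir() returns a temp location suitable for the
    # current platform (Linux, macOS or Windows)
    upload_dir = os.path.join(tempfile.gettempdir(), subdir)
    # exist_ok=True makes the call safe when the directory already exists
    os.makedirs(upload_dir, exist_ok=True)
    return upload_dir
```

This removes the manual step on Windows, at the cost of the upload directory no longer being a fixed, well-known path.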

Config windows path is not resolved correctly

When config.yml contains a Windows path, it is not resolved correctly and the function utils.model_repository_path() returns an invalid path.

In [19]: p = pathlib.Path('C:/Users')
In [20]: p.resolve()
Out[20]: PosixPath('/home/biel/git-repos/phi/Flame/C:/Users')
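The output above is a PosixPath, so the session ran on a POSIX system, where 'C:/Users' has no leading slash and is therefore treated as a relative path. A minimal sketch of how such a string could be detected before resolving it (the helper name `is_windows_style` is hypothetical):

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_windows_style(path_str):
    """Return True if the string carries a Windows drive such as 'C:'."""
    # Pure paths only parse the string; they never touch the file system,
    # so this check behaves the same on every platform
    return PureWindowsPath(path_str).drive != ""


# Under POSIX rules 'C:/Users' is relative, which explains the odd result
print(PurePosixPath('C:/Users').is_absolute())
# Under Windows rules the same string is absolute
print(PureWindowsPath('C:/Users').is_absolute())
```

Whether Flame should reject such a path on POSIX or only warn about it is a design decision this sketch does not settle.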

Documentation for use with Jupyter notebooks

We must create a folder with Jupyter notebooks illustrating how Flame can be used to generate predictions and how the JSON output can be easily converted to pandas and visualized in different ways.

Different error behaviour when calling flame with -c predict

Calling flame with -c predict does not raise an ImportError, even when such an error occurs.

Two possible sources of errors.

1 - In context.py (build_cmd function):

        ifile = model['infile']
        if not os.path.isfile(ifile):
            return False, 'wrong training series file'

        epd = utils.model_path(model['endpoint'], 0)
        lfile = os.path.join(epd, os.path.basename(ifile))
        shutil.copy(ifile, lfile)  # <---

When the input file is already in the dev folder, an exception is raised.

2 - In idata.py (workflow_objects function):

            if first_mol:  # first molecule
                md_results = results[0]
                va_results = results[1]
                num_var = len(md_results)  # <---
                first_mol = False
            else:
                if len(results[0]) != num_var:
                    print('ERROR: (@workflow_objects) incorrect number of MD for molecule #', str(
                        i+1), 'in file ' + input_file)
                    continue

The indicated statement assumes that the first molecule always has the correct number of descriptors.
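For the first point, a minimal sketch of a copy that tolerates the input file already being in the destination folder. `shutil.copy` raises `shutil.SameFileError` when source and destination are the same file; the helper name `copy_if_needed` is hypothetical.

```python
import os
import shutil


def copy_if_needed(src, dst_dir):
    """Copy src into dst_dir unless it is already that very file."""
    dst = os.path.join(dst_dir, os.path.basename(src))
    # os.path.samefile needs both paths to exist, so test existence first
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return dst  # nothing to do, the file is already in place
    shutil.copy(src, dst)
    return dst
```

An alternative is to call `shutil.copy` directly and catch `shutil.SameFileError`, which keeps the happy path shorter.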

TSV format External prediction error

The JSON format works perfectly. When the TSV format is set in the yaml file:
1- The output in the terminal is not complete.
2- The dumped output.tsv contains:
- headers: obj_nam | SMILES | c0 | c1 | ymatrix
- What do c0 and c1 mean? What about ymatrix?
3- Where are sens, spec and MCC?

Use Pathlib to handle the paths

Since we are using Python 3.6, we could take advantage of pathlib (available since 3.4). It is the standard library module for working with paths (either POSIX or Windows) and has a lot of useful methods. Since we are dealing with multiple SD files (when working with cpu>1), it will be helpful!

Compatibility of flame

Should we test flame with older Python versions and downgraded versions of the packages, and fix the issues? If so, where should we draw the compatibility line?

argument -f inconsistent

I saw that the argument parser uses -f as the short option for the input file but --infile as the long one. I think they should start with the same letter, e.g. --filein.

path error whilst building

(flame) [kpinto@ulises 6-model]$ flame -c build -e MyModel -f tr-DEG.sdf
CRITICAL ERROR: unable to load parameter file.Running with fallback defaults
Traceback (most recent call last):
  File "/home/kpinto/miniconda3/envs/flame/bin/flame", line 11, in <module>
    load_entry_point('flame', 'console_scripts', 'flame')()
  File "/phi/users/kpinto/flame/flame/flame_scr.py", line 142, in main
    success, results = context.build_cmd(model)
  File "/phi/users/kpinto/flame/flame/context.py", line 142, in build_cmd
    shutil.copy(ifile, lfile)
  File "/home/kpinto/miniconda3/envs/flame/lib/python3.6/shutil.py", line 241, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/kpinto/miniconda3/envs/flame/lib/python3.6/shutil.py", line 121, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/phi/users/kpinto/flame/flame_models/MyModel/dev/tr-DEG.sdf'
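The traceback shows the copy failing because the destination directory does not exist. A minimal sketch of a guard that reports this explicitly, following the `(success, result)` return convention seen in build_cmd. The function name `copy_training_file` and the second error message are hypothetical.

```python
import os
import shutil


def copy_training_file(ifile, dest_dir):
    """Copy the training series into dest_dir, reporting problems clearly."""
    if not os.path.isfile(ifile):
        return False, 'wrong training series file'
    # Check the destination before copying, so the user gets a readable
    # message instead of a FileNotFoundError raised inside shutil
    if not os.path.isdir(dest_dir):
        return False, 'model directory not found: ' + dest_dir
    lfile = os.path.join(dest_dir, os.path.basename(ifile))
    shutil.copy(ifile, lfile)
    return True, lfile
```

Creating the directory with `os.makedirs(dest_dir, exist_ok=True)` would also avoid the exception, but it could hide the fact that the model was never created.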

Error "Segmentation fault: 11" in MacOS

I am doing the tutorial. When I try to build the model I get Segmentation fault: 11

I have installed the environment you provide on a macOS 10.13.3 machine.

Also the file I use as training set is caco2.sdf.

Building Qualitative models

(flame) [kpinto@ulises 0-rdkit-properties]$ flame -c build -e INF-ql-RF -f ../../../1-test/pr-InF-3D-moka.sdf

recycling data >>> /phi/users/kpinto/flame/flame_models/INF-ql-RF/dev/data.pkl
running sumbsmapling
tune_parameters
metric: f1
best parameters: {'class_weight': None, 'max_features': 'sqrt', 'n_estimators': 25, 'oob_score': True, 'random_state': 46}
found in: 2.9187703132629395 seconds
Traceback (most recent call last):
  File "/home/kpinto/miniconda3/envs/flame/bin/flame", line 11, in <module>
    load_entry_point('flame', 'console_scripts', 'flame')()
  File "/phi/users/kpinto/flame/flame/flame_scr.py", line 142, in main
    success, results = context.build_cmd(model)
  File "/phi/users/kpinto/flame/flame/context.py", line 145, in build_cmd
    success, results = build.run(lfile)
  File "/phi/users/kpinto/flame/flame/build.py", line 83, in run
    results = learn.run()
  File "/phi/users/kpinto/flame/flame/learn.py", line 123, in run
    self.run_internal()
  File "/phi/users/kpinto/flame/flame/learn.py", line 96, in run_internal
    success, results = model.validate()
  File "/phi/users/kpinto/flame/flame/stats/base_model.py", line 391, in validate
    success, results = self.CF_qualitative_validation()
  File "/phi/users/kpinto/flame/flame/stats/base_model.py", line 248, in CF_qualitative_validation
    self.sensitivity = (self.TP / (self.TP + self.FN))
ZeroDivisionError: division by zero
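The division fails when TP + FN is zero, that is, when the validation data contains no positive samples. A minimal sketch of a guarded computation follows. Returning 0.0 in that case is an assumption; NaN or an explicit error might suit Flame better. The function names are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if the denominator is zero.

    The 0.0 fallback is an assumption made for this sketch.
    """
    if denominator == 0:
        return 0.0
    return numerator / denominator


def sensitivity(TP, FN):
    # fraction of actual positives that were predicted positive
    return safe_ratio(TP, TP + FN)


def specificity(TN, FP):
    # fraction of actual negatives that were predicted negative
    return safe_ratio(TN, TN + FP)
```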

integration of missing components in furnace

We need to include scikit-learn in the environment. We also need to see how we can include standardizer and, if that is not possible, write a brief "how-to" explaining how to set up the environment.

Error when serializing data in odata.py in predict for qualitative endpoints.

Both JSON and TSV data serialization fail.

JSON serialization fails when dumping the variable values from results: values is given as np.int64, which the JSON encoder cannot serialize.

TSV serialization, in turn, fails at line 139:

    if isinstance(val, float):
        line += "%.4f" % val
    else:
        line += val

As there is no check for the np.int64 type, the value is not converted to a string.
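A minimal sketch of type handling that does not depend on the exact numeric class. NumPy registers its integer scalar types with `numbers.Integral`, and NumPy scalars expose an `.item()` method that returns the equivalent built-in value. The helper names `format_value` and `to_builtin` are hypothetical.

```python
import json
import numbers


def format_value(val):
    """Format one value for a TSV cell, whatever its numeric type."""
    if isinstance(val, numbers.Integral):   # int and NumPy integer scalars
        return str(int(val))
    if isinstance(val, numbers.Real):       # float and NumPy float scalars
        return "%.4f" % val
    return str(val)


def to_builtin(obj):
    """Fallback for json.dumps: unwrap scalar objects through .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable")
```

The fallback is passed as `json.dumps(results, default=to_builtin)`; `json` calls it only for objects it cannot serialize on its own.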

Proper handling of exceptions

The way exceptions are currently handled is prone to cause problems and confusion. For example, in the function:

def nummols (ifile):
    try:
        suppl = Chem.SDMolSupplier(ifile)
    except:
        return False, 'unable to open molfile'
    return True, len(suppl)

If the try/except catches an error, it swallows it and always reports 'unable to open molfile', even if the error was not caused by opening the file (a bad rdkit import, for example).

In contrast, with:

def nummols (ifile):
    try:
        suppl = Chem.SDMolSupplier(ifile)
    except:
        raise
  
    return len(suppl)

If the try block fails because the Chem module is not available, the error is now correctly traced to:

NameError: name 'Chem' is not defined

External licenses document MUST be carefully maintained

The whole team must commit to keeping the external licenses document up to date.

configuration status changes too early

Saying "no" to the first dialog of the config command changes the configuration status. It should not change, because the model repository path is not updated when the configuration update is aborted.

Source code conflicts

Please remember to update your code frequently, to avoid pushing obsolete code.

Also, avoid rewriting code produced by other members of the team unless there is a good reason to do so. Even in this case, please inform the author before pushing.

Moreover, before pushing code, make doubly sure it works by running simple tests. This does not replace more sophisticated quality controls, but at least it will not block the development of other components.

Add Logger.

Printing is fine when working with the CLI, but a logger would be better for debugging and for inspecting the workflow of model management and use.

make 'number of molecules informed and processed does not match' not an Error

It is currently shown as an error in the web service, but I think it should only be a warning. When calling the model molecule by molecule ("objects"), this error disappears.

Handling of exceptions

Don't use:

try:
    1/0
except:
    print('something did not work')

will print something did not work

Always catch the exception (even with generic exception class Exception):

try:
    1/0
except Exception as e:
    print(f'something did not work. Cause: {e}')

will print something did not work. Cause: division by zero

Let's build a better world together

Python style (PEP-8)

Since this will become a big project, I think we should follow the PEP-8 style guide.

Here you can find a summary of its most important features.

Try not to use return statements in the middle of the code; limit them to the end of the method.

Use argparse to parse arguments

It is more flexible and is the standard for Python 3.
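A minimal argparse sketch for the options that appear in the commands quoted in these issues (`-c`, `-e`, `-f`). The short flags and `--infile` come from the issues; the long names `--command` and `--endpoint` and all help texts are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for a flame-like command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="flame")
    # --command and --endpoint are hypothetical long names
    parser.add_argument("-c", "--command", required=True,
                        help="action to run, e.g. build or predict")
    parser.add_argument("-e", "--endpoint", required=True,
                        help="name of the model endpoint")
    parser.add_argument("-f", "--infile",
                        help="input file, e.g. an SD file")
    return parser


args = build_parser().parse_args(
    ["-c", "build", "-e", "MyModel", "-f", "tr-DEG.sdf"])
print(args.command, args.endpoint, args.infile)
```

argparse generates the `-h/--help` output and the error messages for missing required options without extra code.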

`action_export` exports models in current dir

This could be problematic, since flame will be working in numerous environments. Wouldn't it be better to put the exported .tar file in the model folder itself?

Error building a model called 'test'

This only happens if the model name is 'test' in lowercase. With other names, or with 'TEST' in uppercase, it works.

Steps to reproduce:
from flame.build import Build
d = Build("test")
d.run("/home/marc/Documents/flame_dev_api/sdf/caco2.sdf")

Output:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-dbf08200c5a5> in <module>()
----> 1 d.run("/home/marc/Documents/flame_dev_api/sdf/caco2.sdf")

~/Documents/flame/flame/build.py in run(self, input_source)
     70             modpath = utils.module_path(self.model, 0)
     71 
---> 72             idata_child = importlib.import_module(modpath+".idata_child")
     73             learn_child = importlib.import_module(modpath+".learn_child")
     74             odata_child = importlib.import_module(modpath+".odata_child")

~/anaconda3/envs/flame_django/lib/python3.6/importlib/__init__.py in import_module(name, package)
    124                 break
    125             level += 1
--> 126     return _bootstrap._gcd_import(name[level:], package, level)
    127 
    128 

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _gcd_import(name, package, level)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _gcd_import(name, package, level)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _find_and_load(name, import_)

~/anaconda3/envs/flame_django/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'test.dev'

Loading and dumping of YAML file with ordered dict

Use yamlloader to load and write the YAML with ordered dict so it maintains the order.

unhandled errors in padel MD computation

When a molecule fails to compute in a PaDEL web service, the change in the matrix size is not handled properly.

output_format parameter in predict_cmd is fixed at context.py

When running predict through the command line, this parameter cannot be modified in the parameter file, as it is fixed in context.py. This should be mentioned somewhere, perhaps as a comment in the parameters file.

Subsampling

I don't know if this is useful, but:

I would really appreciate having the dataset in SD format after subsampling is done.
I would like to be able to choose the random seed used to generate the samples, so that runs are reproducible afterwards and I can try different seeds.

Molecule standardization wrong behavior in workflow_series

When standardizing a molecule series, if one molecule fails in the standardization process, the whole series is rejected.

                if 'standardize' in method:
                    try:
                        parent = standardise.run(Chem.MolToMolBlock(m))
                    except standardise.StandardiseException as e:
                        if e.name == "no_non_salt":
                            parent = Chem.MolToMolBlock(m)
                        else:  # --> the function then returns False for the whole series
                            return False, e.name
                    except:
                        return False, "Unknown standardiser error"

Flame then returns the error message "False {"error": "number of molecules informed and processed does not match"}", as no molecule could be processed.
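One way to keep a single failure from rejecting the whole series, as a minimal sketch: collect the failures and carry on with the remaining molecules. The `standardize` argument stands for any callable that raises on failure; it and the function name `standardize_series` are hypothetical and do not reflect the standardiser API.

```python
def standardize_series(mols, standardize):
    """Standardize each molecule; record failures instead of aborting.

    Returns (parents, failed), where failed holds (index, message) pairs.
    """
    parents = []
    failed = []
    for i, mol in enumerate(mols):
        try:
            parents.append(standardize(mol))
        except Exception as e:
            # keep the position and the cause so the caller can report them
            failed.append((i, str(e)))
    return parents, failed
```

The caller can then decide whether a non-empty `failed` list is a warning or an error, and report exactly which molecules were skipped.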

add functionality to customize the path to model directory

manage needs some fixes and improvements in order to make the user and developer experience smooth. It would be nice to have a way to set the root directory of the model repository and to copy the config.yaml file in case it needs to be read again.

It would be nice if we can discuss more about how manage should deal with the repository of models and how to propagate this information to the other classes.

very slow calculation in model building or prediction in Windows 10

When building or applying models, the program takes much longer (10x or more) to finish on Windows than on Linux. The CPU was not in use. The problem reproduces on different Windows installations, but not in VMs.

Delete /old directory?

Do we need to have /old here?

standardiser

We need to install standardise from https://github.com/flatkinson/standardiser and probably include it in the furnace environment

If the SDFile_activity param value is not an SDF property, it crashes without capturing the error

If even a single letter differs between the name given in the parameters and the actual target field in the SDF file, the error reported is the one caused by an empty result, since no activity value is found.

This error must be handled explicitly and reported as what it is: SDFile_activity param not found in sdf

standardizer not working with 1 CPU, on ws mode only

When working as ws with the number of CPUs set to 1, standardizer fails. The error is captured and a "standardizer unknown error" is issued.
Changing to 2 CPUs or removing normalization solves the problem.

predict.py import_module for model child classes is not working

Serialization of molecule descriptors in model building

Flame is only returning a TSV with molecule descriptors in the prediction module.

Type check activity from SDF

Depending on whether the model is qualitative or quantitative, flame should raise an error or a warning when it reads an SDF with the wrong type of value in < activity >.

External molecular descriptors

1- In the yaml file:

Where should I put the path of the external TSV md file?
Where should I put the activity column?
2- I would add the option of concatenating descriptors, that is, calculating the internal ones and concatenating the external descriptors to them.
3- It would be very good to have an external server where molecular descriptors could be calculated, and to send requests through flame to calculate them.

`load()` and `safe()` from idata.py try to open data.pkl in a brute-force way, with a try/except.

Now it is:

try:
    with open(os.path.join(self.dest_path, 'data.pkl'), 'wb') as fo:
        pickle.dump(md5_parameters, fo)
        pickle.dump(md5_input, fo)
        pickle.dump(self.results, fo)

except Exception as e:
    print(e)
    pass

It would be much clearer if it first checked whether data.pkl exists and then acted accordingly.
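A minimal sketch of the check-first approach for the loading side, reading back the three objects in the order in which the snippet above dumps them. The function name `load_results` and the error message are hypothetical.

```python
import os
import pickle


def load_results(dest_path):
    """Load data.pkl from dest_path if it exists.

    Returns (True, (md5_parameters, md5_input, results)) on success,
    or (False, message) when the file is missing.
    """
    pkl_path = os.path.join(dest_path, 'data.pkl')
    if not os.path.isfile(pkl_path):
        return False, 'data.pkl not found'
    with open(pkl_path, 'rb') as fi:
        # objects come back in the same order they were dumped
        md5_parameters = pickle.load(fi)
        md5_input = pickle.load(fi)
        results = pickle.load(fi)
    return True, (md5_parameters, md5_input, results)
```

A corrupt or truncated file would still raise inside `pickle.load`; this sketch leaves that exception visible rather than swallowing it.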

use Logging instead of print

We should use the logging library to dump event messages, warnings and errors. It will improve debugging and the inspection of results. The logger has different levels (DEBUG, INFO, WARNING, ERROR) and includes information about the module that produces the message. For example:

2018-07-28 12:41:12,075 - flame.build - INFO - Creating list...
2018-07-28 12:41:12,075 - flame.build - DEBUG - length of list: 10
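A minimal sketch of a logger configuration that produces lines in the format shown above, using only the standard `logging` module. The helper name `get_logger` is hypothetical.

```python
import logging


def get_logger(name, level=logging.DEBUG):
    """Return a logger that writes 'time - name - level - message' lines."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # add the handler only once, so repeated calls do not duplicate output
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
        logger.addHandler(handler)
    return logger


LOG = get_logger('flame.build')
LOG.info('Creating list...')
LOG.debug('length of list: %d', 10)
```

Using one named logger per module (`flame.build`, `flame.predict`, ...) is what puts the module name in each line.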

Separate the code section that writes the chunk files into another function

It is currently in countmol(). I think it would be better to have it in a separate method, such as:

def chunk_to_file(*args):
    index = []
    chunksize = nmol // self.control.numCPUs
    for a in range(nmol):
        index.append(a // chunksize)

    moli = 0      # molecule counter in next loop
    chunki = 0    # chunk counter in next loop

    filename, file_extension = os.path.splitext(ifile)
    chunkname = filename + '_%d' % chunki + file_extension
    try:
        [. . .]

if self.control.numCPUs > 1:
    chunk_to_file()
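A note on the index computation, as a self-contained sketch: with floor division for the chunk size, the last molecules can receive a chunk index equal to or larger than the number of CPUs (10 molecules on 4 CPUs give indices 0 to 4), and fewer molecules than CPUs give a chunk size of zero. Whether the elided code already deals with this is not shown. Ceiling division avoids both cases; the function name `chunk_indices` is hypothetical.

```python
def chunk_indices(nmol, num_cpus):
    """Assign each of nmol molecules to a chunk index below num_cpus."""
    # ceiling division: the chunk size is rounded up, so the highest
    # index produced is always smaller than num_cpus
    chunksize = -(-nmol // num_cpus)
    return [i // chunksize for i in range(nmol)]


print(chunk_indices(10, 4))
```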

version parameter of predict

The version number must be passed to predict as an int, both from flame and from predict-ws. Avoid reconverting strings to ints at the constructor or other places

Add `install_requires` to setup.py to install dependencies

Read the requirements.txt or the environment.yml and make a list of dependencies to pass to install_requires in setup
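A minimal sketch of reading a simple requirements.txt into a list suitable for `install_requires`. It assumes one requirement per line and does not handle URLs containing '#'. The function name `read_requirements` is hypothetical.

```python
def read_requirements(path):
    """Return the requirement lines of a simple requirements file."""
    requirements = []
    with open(path) as fi:
        for line in fi:
            # drop comments and surrounding whitespace
            line = line.split('#', 1)[0].strip()
            # skip blanks and pip options such as '-r other.txt' or '-e .'
            if line and not line.startswith('-'):
                requirements.append(line)
    return requirements

# Intended use inside setup.py (not executed here):
#   setup(..., install_requires=read_requirements('requirements.txt'))
```

An environment.yml would need a YAML parser and may list conda-only packages, so requirements.txt is the simpler source for this purpose.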

OUTPUT formats

Molecular descriptors:
- A file with the same name is created for both build and predict. I would recommend using different names.

Build:

- It always gives JSON output; a TSV format may also be needed.
- Recalculated values do not appear.
- Subsampling: it creates a new dataset that is not saved anywhere. I would really appreciate it if it could be saved in the model folder.
- JSON output: it does not give the recalculated values when running without conformal. When running with conformal, it does not give the upper and lower limit values, as shown in the predict part.
- It would be good to have plots (scatter plots, ...).

Predict:

- JSON output: I would give the same output as in the Build section, with Q2, SDEP, ... in the first lines.
- TSV output: I would add quality parameters such as sens, spec, MCC, coverage, accuracy, Q2, SDEP, ...
- Plots (scatter plots, ...).

Would it be possible to obtain a table like the following?

| MD | spec_calc | sens_calc | MCC_calc | spec_CV | sens_CV | MCC_CV | Coverage_CV | Accuracy_CV | spec_extv | sens_extv | MCC_extv | Coverage_extv |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RDKit_properties | | | | 0.77 | 0.79 | 0.55 | 0.47 | 0.78 | 0.74 | 0.58 | 0.32 | 0.51 |
| RDKit_md | | | | 0.78 | 0.73 | 0.50 | 0.36 | 0.75 | 0.79 | 0.80 | 0.58 | 0.48 |