ProFET

ProFET: Protein Feature Engineering Toolkit for Machine Learning

To run ProFET you will need Python 3 and the following packages installed:

Pandas
Numpy
Scikit-learn
Biopython

I STRONGLY recommend installing them with the Anaconda Python distribution if you don't already have them; it's the easiest way.

All of these packages are part of the default Anaconda distribution, except for Biopython. You can add Biopython by running the following from the command line:

conda install biopython
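A quick way to confirm that the four packages are visible to your interpreter is sketched below. It is not part of ProFET; the function name `check_dependencies` and the `REQUIRED` table are made up for this illustration. It only checks that each module can be located, not that its version is compatible.

```python
import importlib.util

# Import names differ from package names for two of the dependencies:
# Scikit-learn is imported as "sklearn" and Biopython as "Bio".
REQUIRED = {
    "Pandas": "pandas",
    "Numpy": "numpy",
    "Scikit-learn": "sklearn",
    "Biopython": "Bio",
}


def check_dependencies(required=REQUIRED):
    """Return {package name: True/False} depending on whether the module is found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{:<14}{}".format(name, "OK" if found else "MISSING"))
```

Run it with the same Python interpreter (or conda environment) that you will use for 'pipeline.py'.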

The code for running the package is in the 'CODE/feat_extract' subdirectory. You can run the feature extraction pipeline with 'pipeline.py' (detailed usage instructions are in its accompanying readme).

You can also extract features yourself using the underlying methods. Currently, you can alter the features extracted by 'pipeline.py' by directly modifying the parameters of the function "Get_Protein_Feat" in '/feat_extract/FeatureGen.py'.

The datasets used in the paper are found in the 'fasta' directory. Be careful with file/fasta formats and names.

Please note that the code is currently 'beta' - that means that it's rough, and there will be bugs. Feel free to add or improve!

If you use our tool or code, please cite us: Ofer, Dan, and Michal Linial. "ProFET: Feature engineering captures high-level protein functions." Bioinformatics (2015): btv345.

Dan.

trouble getting started

Thank you for the development of ProFET!

I wanted to try it out but I ran into some trouble. It would be great if you could point me towards where I am going wrong.

I am using Python 3.4 and have installed all the dependencies mentioned in the README.md. I have the following folder structure, where feat_extract is my working directory:

feat_extract/
|_pipeline.py
|_other ProFET files...
|_test_seq/...
|_train/
| |_A/
| | |_train_sequences_A.fasta
| |_B/
|   |_train_sequences_B.fasta
|_test/
  |_A/
  | |_test_sequences_A.fasta
  |_B/
    |_test_sequences_B.fasta

The fasta files were created with the following set of commands:

    cd ./test_seq/Extracellular/
    tail -n 1000 location-secreted_keyword-AKW-0964_reviewed_taxon-Tetrapoda_fragment-no_id-0.9.fasta > ../../train/A/train_sequences_A.fasta
    tail -n 1000 NOT-secreted_NOT-extracellular_reviewed_taxon-Tetrapoda_fragment-no_id-0.5.fasta > ../../train/B/train_sequences_B.fasta
    head -n 1000 location-secreted_keyword-AKW-0964_reviewed_taxon-Tetrapoda_fragment-no_id-0.9.fasta > ../../test/A/test_sequences_A.fasta
    head -n 1000 NOT-secreted_NOT-extracellular_reviewed_taxon-Tetrapoda_fragment-no_id-0.5.fasta > ../../test/B/test_sequences_B.fasta
    cd ../../
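One thing to note about the commands above: `head` and `tail` count lines, not FASTA records, so a 1000-line cut can start or end in the middle of a multi-line sequence, and the `head` and `tail` slices may overlap if the file is short. Whether this is related to the error below is not established here. A minimal sketch of a record-based split, using only the standard library, might look like this; the function names are made up for the illustration, and it puts the first `n_test` records in the test file and the remainder in the training file, which is a different split from the one produced by the shell commands.

```python
def read_fasta_records(path):
    """Yield (header, sequence) tuples from a FASTA file, one per record."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif header is not None and line.strip():
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def write_fasta_records(records, path, width=60):
    """Write (header, sequence) tuples to path, wrapping sequences at width."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(header + "\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")


def split_fasta(source, train_path, test_path, n_test):
    """Put the first n_test records in test_path and the rest in train_path.

    Returns (number of training records, number of test records).
    """
    records = list(read_fasta_records(source))
    write_fasta_records(records[:n_test], test_path)
    write_fasta_records(records[n_test:], train_path)
    return len(records[n_test:]), len(records[:n_test])
```

For example, `split_fasta("input.fasta", "train/A/train_sequences_A.fasta", "test/A/test_sequences_A.fasta", 200)` would keep every record whole and guarantee that no sequence appears in both sets (the input name and the count of 200 are placeholders).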

When running the command:

python pipeline.py --trainingSetDir ./train --testingSetDir ./test --trainFeatures True --testFeatures True --classType dir

I get the following error message:

<cProfile.Profile object at 0x107745db0>
Starting to extract features from training set
dirr change to: ./train
Multiclass fasta_files list found: []
Features generated
Removing any all zero features
df.shape:  (0, 0)
df_cleaned shape:  (0, 0)
Done
Extracted training data features
Training predictive model
Traceback (most recent call last):
  File "pipeline.py", line 171, in <module>
    res = profiler.runcall(pipeline)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "pipeline.py", line 90, in pipeline
    model, lb_encoder = trainClassifier(filename=trainingDir+'/trainingSetFeatures.csv',normFlag= False,classifierType= classifierType,kbest= 0,alpha= False,optimalFlag= False) #Win
  File "/Users/charles/Downloads/feat_extract/Model_trainer.py", line 114, in trainClassifier
    features, labels, lb_encoder,featureNames = load_data(filename, 'file')
  File "/Users/charles/Downloads/feat_extract/Model_trainer.py", line 36, in load_data
    df = pd.read_csv(dataFrame, index_col=[0,1]) # is index column 0 in multiindex as well?
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 250, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 566, in __init__
    self._make_engine(self.engine)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 705, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 1072, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 350, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)
  File "pandas/parser.pyx", line 594, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5912)
OSError: File b'./train/trainingSetFeatures.csv' does not exist

It complains that './train/trainingSetFeatures.csv' does not exist. I can see that a file with this name is created in the train folder, but it is a table with only column names (no rows).

Thank you for your help.
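The log line "Multiclass fasta_files list found: []" together with "df.shape:  (0, 0)" suggests that no input files were picked up before training started. How 'pipeline.py' searches for files is not shown in this report, so the sketch below does not reproduce its logic; it is only a standalone diagnostic that lists what is actually on disk and counts the rows in the features file. The helper names and the list of extensions are assumptions made for this example.

```python
import csv
import os

# Assumed set of FASTA-like extensions; the pipeline's own filter may differ.
FASTA_EXTENSIONS = (".fasta", ".fa", ".faa")


def list_fasta_files(root):
    """Return sorted paths of FASTA-like files found anywhere under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(FASTA_EXTENSIONS):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def count_csv_rows(path):
    """Return the number of data rows (excluding the header line) in a CSV file."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    return max(len(rows) - 1, 0)


if __name__ == "__main__":
    for folder in ("./train", "./test"):
        print(folder, list_fasta_files(folder))
    features = "./train/trainingSetFeatures.csv"
    if os.path.exists(features):
        print(features, "data rows:", count_csv_rows(features))
```

If the script lists the four FASTA files but the features file still has zero data rows, the files exist and the problem lies in how they are discovered or parsed rather than in the folder layout.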
