
profet's Introduction

ProFET

ProFET: Protein Feature Engineering Toolkit for Machine Learning

To run ProFET you will need Python 3 and the following packages installed:

  • Pandas
  • Numpy
  • Scikit-learn
  • Biopython

I strongly recommend installing them with the Anaconda Python distribution if you don't have them already; it's the easiest route.

All of these packages are part of the default Anaconda distribution, except for Biopython. You can add Biopython by running this from the command line:

  • conda install biopython
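Before running anything, it can help to confirm that the four packages are visible to the interpreter you plan to use. The following is a minimal sketch, not part of ProFET; the helper name `check_dependencies` and the `REQUIRED` mapping are made up for illustration. It relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Import names differ from the package names listed above:
# Scikit-learn imports as "sklearn" and Biopython as "Bio".
REQUIRED = {
    "Pandas": "pandas",
    "Numpy": "numpy",
    "Scikit-learn": "sklearn",
    "Biopython": "Bio",
}


def check_dependencies(required=REQUIRED):
    """Return {package name: True/False} for whether each module can be located."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{}: {}".format(name, "ok" if found else "MISSING"))
```

Any package reported as MISSING can then be installed with conda as described above.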

Code for running the package is in the subdirectory 'CODE/feat_extract'. You can run the feature extraction pipeline using 'pipeline.py' (detailed usage instructions can be found in its accompanying readme).

You can also extract features yourself using the underlying methods. Currently, you can alter the features extracted by 'pipeline.py' by directly modifying the parameters of the function "Get_Protein_Feat" in '/feat_extract/FeatureGen.py'.

The datasets used in the paper are found in the 'fasta' directory. Be careful with file/fasta formats and names.
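As an illustration of the kind of format problem to watch for, here is a small, self-contained sanity check for FASTA text. It is a sketch and not part of ProFET: the function name `check_fasta` is hypothetical, and it only tests two things (sequence data before the first '>' header, and headers with no sequence). Files cut with line-based tools can show either symptom.

```python
def check_fasta(lines):
    """Scan FASTA text line by line and return (record_count, problems).

    `lines` is any iterable of strings, for example an open file handle.
    """
    problems = []
    records = 0
    header = None            # header of the record currently being read
    length = 0               # residues seen so far for that record
    orphan_reported = False  # report stray leading sequence lines only once

    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None and length == 0:
                problems.append("no sequence after header: " + header)
            records += 1
            header, length = line, 0
        elif header is None:
            if not orphan_reported:
                problems.append(
                    "line {}: sequence data before the first '>' header".format(number)
                )
                orphan_reported = True
        else:
            length += len(line)

    # The last record has no following header to trigger the check above.
    if header is not None and length == 0:
        problems.append("no sequence after header: " + header)
    return records, problems
```

To check a file, pass an open handle, e.g. `with open(path) as handle: print(check_fasta(handle))`, where `path` is your own FASTA file.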

Please note that the code is currently 'beta' - that means that it's rough, and there will be bugs. Feel free to add or improve!

If you use our tool or code, please cite us: Ofer, Dan, and Michal Linial. "ProFET: Feature engineering captures high-level protein functions." Bioinformatics (2015): btv345.

Dan.

profet's People

Contributors

ddofer, nadavrap


profet's Issues

trouble getting started

Thank you for the development of ProFET!

I wanted to try it out but I ran into some trouble. It would be great if you could point me towards where I am going wrong.

I am using Python 3.4 and I have installed all the dependencies mentioned in the README.md. I have the following folder structure, where feat_extract is my working directory:

feat_extract/
|_pipeline.py
|_other ProFET files...
|_test_seq/...
|_train/
| |_A/
| | |_train_sequences_A.fasta
| |_B/
|   |_train_sequences_B.fasta
|_test/
  |_A/
  | |_test_sequences_A.fasta
  |_B/
    |_test_sequences_B.fasta

The fasta files were created with the following set of commands:

    cd ./test_seq/Extracellular/
    tail -n 1000 location-secreted_keyword-AKW-0964_reviewed_taxon-Tetrapoda_fragment-no_id-0.9.fasta > ../../train/A/train_sequences_A.fasta
    tail -n 1000 NOT-secreted_NOT-extracellular_reviewed_taxon-Tetrapoda_fragment-no_id-0.5.fasta > ../../train/B/train_sequences_B.fasta
    head -n 1000 location-secreted_keyword-AKW-0964_reviewed_taxon-Tetrapoda_fragment-no_id-0.9.fasta > ../../test/A/test_sequences_A.fasta
    head -n 1000 NOT-secreted_NOT-extracellular_reviewed_taxon-Tetrapoda_fragment-no_id-0.5.fasta > ../../test/B/test_sequences_B.fasta
    cd ../../

When running the command:

python pipeline.py --trainingSetDir ./train --testingSetDir ./test --trainFeatures True --testFeatures True --classType dir

I get the following error message:

<cProfile.Profile object at 0x107745db0>
Starting to extract features from training set
dirr change to: ./train
Multiclass fasta_files list found: []
Features generated
Removing any all zero features
df.shape:  (0, 0)
df_cleaned shape:  (0, 0)
Done
Extracted training data features
Training predictive model
Traceback (most recent call last):
  File "pipeline.py", line 171, in <module>
    res = profiler.runcall(pipeline)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/cProfile.py", line 109, in runcall
    return func(*args, **kw)
  File "pipeline.py", line 90, in pipeline
    model, lb_encoder = trainClassifier(filename=trainingDir+'/trainingSetFeatures.csv',normFlag= False,classifierType= classifierType,kbest= 0,alpha= False,optimalFlag= False) #Win
  File "/Users/charles/Downloads/feat_extract/Model_trainer.py", line 114, in trainClassifier
    features, labels, lb_encoder,featureNames = load_data(filename, 'file')
  File "/Users/charles/Downloads/feat_extract/Model_trainer.py", line 36, in load_data
    df = pd.read_csv(dataFrame, index_col=[0,1]) # is index column 0 in multiindex as well?
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 250, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 566, in __init__
    self._make_engine(self.engine)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 705, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/charles/anaconda/envs/py34/lib/python3.4/site-packages/pandas/io/parsers.py", line 1072, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 350, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)
  File "pandas/parser.pyx", line 594, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5912)
OSError: File b'./train/trainingSetFeatures.csv' does not exist

It complains that './train/trainingSetFeatures.csv' does not exist. I see that a file with this name is being created in the train folder; however, it is a table with only column names (no rows).

Thank you for your help.
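The log line `Multiclass fasta_files list found: []` suggests that no FASTA files were picked up from `./train`. How ProFET searches that directory is not shown in this report, so the following is only a diagnostic sketch: it lists which FASTA files sit directly in a directory and which sit one level below it, so the layout can be compared with what each `--classType` mode expects. The function name `fasta_files_by_depth` and the extension list are assumptions made for illustration.

```python
import os

FASTA_EXTENSIONS = (".fasta", ".fa")  # assumed; adjust to your own file names


def fasta_files_by_depth(root):
    """Return (FASTA files directly in root, FASTA files one level below root)."""
    top_level, nested = [], []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isfile(path) and entry.lower().endswith(FASTA_EXTENSIONS):
            top_level.append(path)
        elif os.path.isdir(path):
            # Look exactly one level down, e.g. train/A/*.fasta
            for name in sorted(os.listdir(path)):
                if name.lower().endswith(FASTA_EXTENSIONS):
                    nested.append(os.path.join(path, name))
    return top_level, nested
```

For the folder structure shown above, `fasta_files_by_depth("./train")` would return an empty first list and the two class files in the second.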

fix scale_trimMean when outlierLength = 1

In source file ProtFeat.py, line 797, when the sequence is short enough for outlierLength to equal 1, scale_trimMean is set to an empty array rather than the whole array. When the mean of scale_trimMean is then taken at line 818, NumPy emits the warning "RuntimeWarning: Mean of empty slice".
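The code at line 797 is not reproduced here, so the exact expression is unknown. One common way to end up with an empty array where the whole array was expected is a slice whose negative end index evaluates to zero. That is an assumption about the cause, not a confirmed diagnosis. The sketch below uses plain lists (NumPy arrays slice the same way); the names `trimmed` and `n_trim` are hypothetical.

```python
def trimmed(values, n_trim):
    """Drop n_trim items from each end of a sequence.

    The end index is written as len(values) - n_trim rather than -n_trim,
    because values[0:-0] means values[0:0], which is empty.
    """
    return values[n_trim:len(values) - n_trim]


scale = [0.5, 1.0, 1.5, 2.0]

naive = scale[0:-0]        # empty: -0 is just 0, so nothing is selected
safe = trimmed(scale, 0)   # the whole list, as intended when nothing is trimmed
```

With this form, a trim count of zero returns every value, so taking the mean afterwards no longer operates on an empty slice.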

Running pipeline.py is not working -> training dir doesn't exist

Hey @ddofer ,

We tried to run your program, but it always failed at the same point. If we execute

python pipeline.py --trainingSetDir r'C:\Users...\ProFET-master\training' --testingSetDir r'C:\Users...\ProFET-master\testing' --trainFeatures True --testFeatures True --classType file

(There are no spaces in the path ;) )

It always gives us the same error:
--> training dir doesn't exist

But our path is definitely correct.
Can you help us with this problem?
If this is a bug, I hope this helps you recognize it so you can fix it ;)
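One thing worth ruling out: `r'...'` is Python string-literal syntax, not shell syntax. The Windows command prompt (cmd.exe) does not treat single quotes as quoting characters, so the program may receive the `r'` prefix and the closing quote as part of the path. Whether that is what happened here is an assumption, since the report does not show the shell used or how pipeline.py checks the directory. The sketch below shows the effect with a temporary directory; `clean_dir_argument` is a hypothetical helper, not a ProFET function.

```python
import os
import tempfile


def clean_dir_argument(arg):
    """Strip an r/R prefix and matching quotes that reached the program literally."""
    text = arg.strip()
    if len(text) >= 3 and text[0] in "rR" and text[1] in "'\"" and text[-1] == text[1]:
        text = text[2:-1]
    elif len(text) >= 2 and text[0] in "'\"" and text[-1] == text[0]:
        text = text[1:-1]
    return text


with tempfile.TemporaryDirectory() as real_dir:
    # What the program sees if the shell passes r'...' through unchanged.
    as_typed = "r'" + real_dir + "'"
    literal_exists = os.path.isdir(as_typed)
    cleaned_exists = os.path.isdir(clean_dir_argument(as_typed))
```

Here `literal_exists` is False and `cleaned_exists` is True. The simpler fix on the command line is to pass the plain path, wrapped in double quotes only if it contains spaces.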
