smaegol / plasflow

Software for prediction of plasmid sequences in metagenomic assemblies

License: GNU General Public License v3.0

Python 37.14% Perl 15.53% Shell 21.27% R 26.06%

classification prediction contigs plasmid fasta plasmid-sequences tensorflow metagenome-assembly metagenome metagenomes

plasflow's Introduction

NOT MAINTAINED

Use at your own risk. I am very grateful that it is being widely used, but since I have completely changed my research area, I cannot devote time to maintaining this project. Other, newer packages have been developed and can be used instead.

PlasFlow 1.1

PlasFlow is a set of scripts for predicting plasmid sequences in metagenomic contigs. It relies on neural network models trained on full genome and plasmid sequences and can differentiate between plasmids and chromosomes with accuracy reaching 96%. It outperforms other available solutions for plasmid recovery from metagenomes and incorporates thresholding, which allows uncertain predictions to be excluded. PlasFlow has been published in Nucleic Acids Research (https://doi.org/10.1093/nar/gkx1321).

News
Requirements
Installation
Getting started
Output
Test dataset
Detailed information
Citation
TBD
Support

News

2018-05-25 Version 1.1 released

A new version (1.1) has been released, which is better suited for large datasets. It can be downloaded from conda and PyPI, but the simplest way to upgrade is to replace the PlasFlow.py file in your previous installation with the current one. If you still encounter problems with the new version, try smaller values for the --batch_size option.

Requirements:

Python 3.5
Python packages:
- Scikit-learn 0.18.1
- Numpy
- Pandas
- TensorFlow 0.10.0
- rpy2 >= 2.8
- scipy
- biopython
- dateutil >= 2.5
R 3.25
R packages:
- Biostrings

For the perl scripts, especially filter_sequences_by_length.pl:

Perl 5 and modules:
- Bioperl (installation instructions)
- Getopt

Installation

Conda-based - recommended

Conda is the recommended installation option, as it properly resolves all dependencies (including R and Biostrings) and allows installation without interfering with other installed packages. Conda can be used as either Anaconda or Miniconda (which is easier to install and maintain).

After installing conda, add the bioconda channel, which is required for installing the Biostrings package:

conda config --add channels bioconda

Sometimes it is also necessary to add the conda-forge channel:

conda config --add channels conda-forge

To avoid dependency conflicts, it is encouraged to create a separate conda environment for PlasFlow using the command:

conda create --name plasflow python=3.5

Python 3.5 is required because of TensorFlow requirements.

To activate the created environment, type:

source activate plasflow

Mac users should install TensorFlow at this step (as the osx-64 package is not present in the default channels). If you encounter problems with a missing TensorFlow dependency on other platforms, also try installing TensorFlow from this source:

conda install -c jjhelmus tensorflow=0.10.0rc0

PlasFlow can be easily installed as an Anaconda package from my Anaconda channel using:

conda install plasflow -c smaegol

With this command, all required dependencies are installed into the created conda environment. When the installation is finished, PlasFlow can be invoked as described in the Getting started section.

When you have finished working with PlasFlow, you can deactivate the current conda environment with the command:

source deactivate

Pip installer

Pip-based installation is also possible. However, some requirements have to be met:

Python 3.5 is required (due to TensorFlow requirements)
TensorFlow has to be installed manually:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

then install PlasFlow with

pip install plasflow

However, models used for prediction have to be downloaded separately (for example using git clone https://github.com/smaegol/PlasFlow).

Manual installation

Alternatively, the PlasFlow repository can be cloned using

git clone https://github.com/smaegol/PlasFlow

but in that case all dependencies have to be installed manually. TensorFlow can be installed as specified above:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

Python dependencies can be installed using pip:

pip install numpy pandas scipy rpy2 scikit-learn biopython

To install R Biostrings, go to https://bioconductor.org/packages/release/bioc/html/Biostrings.html and follow the instructions there.

Perl modules for additional scripts

The Perl scripts included with PlasFlow (like filter_sequences_by_length.pl) require a few Perl modules. They can be easily installed using conda:

conda install -c bioconda perl-bioperl perl-getopt-long

or cpan:

cpan -i Bio::Perl Getopt::Long

or any package manager included in your system (apt, brew).

Getting started

PlasFlow is designed to take a metagenomic assembly and identify contigs that may come from plasmids. It outputs several files, of which the most important is a tabular file containing all predictions (specified with the --output option).

Prior to invoking PlasFlow, it is highly recommended to filter sequences by length, leaving only those longer than 1000 bp. PlasFlow, like other k-mer-based methods, does not perform well on short sequences, as it is hard to get proper k-mer coverage from them. Hence, results for short sequences are unreliable. As metagenomic assemblies usually contain a large number of short contigs, this additional filtering step can improve results and speed up PlasFlow. It can also prevent excessive RAM usage.
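
To make the k-mer coverage argument concrete, here is a minimal Python sketch. It is not PlasFlow's own k-mer code; the function names are hypothetical, and k = 7 is used only as an illustrative k-mer size. It counts overlapping k-mers and reports what fraction of all possible k-mers a sequence can cover.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows that contain non-ACGT letters."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_coverage(sequence, k):
    """Fraction of the 4**k possible k-mers observed at least once."""
    return len(kmer_counts(sequence, k)) / 4 ** k


if __name__ == "__main__":
    # A 1000 bp contig has only 1000 - 7 + 1 = 994 overlapping 7-mer windows,
    # far fewer than the 4**7 = 16384 possible 7-mers.
    contig_length, k = 1000, 7
    print(contig_length - k + 1, "windows vs", 4 ** k, "possible k-mers")
```

Even a contig at the recommended 1000 bp cutoff can populate only a small part of the 7-mer frequency vector, which is why shorter contigs give noisy profiles.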

To filter sequences using provided Perl script type:

filter_sequences_by_length.pl -input input_dataset.fasta -output filtered_output.fasta -thresh sequence_length_threshold

where the sequence length threshold has to be provided in base pairs. The filtered FASTA file can then be used directly for PlasFlow prediction.
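
If Perl or BioPerl is not available, the same filtering idea can be sketched in plain Python. This is an illustrative stand-in, not a port of filter_sequences_by_length.pl: the function names are hypothetical, and whether the Perl script treats the threshold as inclusive is not stated here, so the sketch assumes sequences of at least `min_length` bp are kept.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep records whose sequence is at least min_length bp (assumed inclusive)."""
    return [(header, seq) for header, seq in records if len(seq) >= min_length]


def write_fasta(records, handle):
    """Write (header, sequence) pairs in FASTA format, one sequence per line."""
    for header, seq in records:
        handle.write(">" + header + "\n" + seq + "\n")


if __name__ == "__main__":
    import sys
    toy = [">short", "ACGT", ">long", "ACGTACGT", "ACGT"]
    write_fasta(filter_by_length(read_fasta(toy), 10), sys.stdout)
```

For a real assembly, pass an open file handle to `read_fasta` instead of the toy list.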

Options available in PlasFlow include:

--input - specifies input fasta file with assembly contigs to classify [required]
--output - a name of the tsv file with the tabular output of classification [required]
--threshold - manually specified threshold for probability filtering (default = 0.7)
--labels - manually specified custom location of labels file (used for translation from numeric output to actual class names)
--models - custom location of models used for prediction (have to be specified if PlasFlow was installed using pip)
--batch_size - how many sequences can be used in the single batch of kmers frequency calculation
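
When PlasFlow is called from a pipeline, the options above can be assembled programmatically. The following is a small sketch, assuming PlasFlow.py is on the PATH; `build_plasflow_command` is a hypothetical helper, and the batch size shown is an arbitrary illustrative value, not a recommendation.

```python
import subprocess


def build_plasflow_command(input_fasta, output_tsv, threshold=None,
                           batch_size=None, models=None, labels=None):
    """Assemble a PlasFlow.py argument list from the documented options."""
    command = ["PlasFlow.py", "--input", input_fasta, "--output", output_tsv]
    optional = (("--threshold", threshold), ("--batch_size", batch_size),
                ("--models", models), ("--labels", labels))
    for flag, value in optional:
        if value is not None:  # omit unset options so PlasFlow uses its defaults
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    cmd = build_plasflow_command("filtered_output.fasta", "predictions.tsv",
                                 threshold=0.7, batch_size=10000)
    print(" ".join(cmd))
    # Uncomment to actually run PlasFlow (requires a working installation):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.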

Output

The most important output of PlasFlow is a tabular file containing all predictions (specified with the --output option), consisting of several columns including:

contig_id	contig_name	contig_length	id	label	...

where:

contig_id is an internal id of the sequence used for the classification
contig_name is a name of contig used in the classification
contig_length shows the length of a classified sequence
id is an internal id of a produced label (classification)
label is the actual classification
... represents additional columns showing probabilities of assignment to each possible class

Sequences can be classified to 26 classes including: chromosome.Acidobacteria, chromosome.Actinobacteria, chromosome.Bacteroidetes, chromosome.Chlamydiae, chromosome.Chlorobi, chromosome.Chloroflexi, chromosome.Cyanobacteria, chromosome.DeinococcusThermus, chromosome.Firmicutes, chromosome.Fusobacteria, chromosome.Nitrospirae, chromosome.other, chromosome.Planctomycetes, chromosome.Proteobacteria, chromosome.Spirochaetes, chromosome.Tenericutes, chromosome.Thermotogae, chromosome.Verrucomicrobia, plasmid.Actinobacteria, plasmid.Bacteroidetes, plasmid.Chlamydiae, plasmid.Cyanobacteria, plasmid.DeinococcusThermus, plasmid.Firmicutes, plasmid.Fusobacteria, plasmid.other, plasmid.Proteobacteria, plasmid.Spirochaetes.

If the probability of assignment to a given class is lower than the threshold (default = 0.7), the sequence is treated as unclassified.

Additionally, PlasFlow produces FASTA files containing the input sequences binned into plasmids, chromosomes and unclassified.
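
The thresholding rule can be sketched as follows. This illustrates the rule as described, not PlasFlow's actual code: it assumes the class with the highest probability is the candidate label, and the function name and the plain "unclassified" return value are hypothetical.

```python
def assign_label(class_probabilities, threshold=0.7):
    """Return the most probable class, or 'unclassified' if it is below threshold."""
    best_class = max(class_probabilities, key=class_probabilities.get)
    if class_probabilities[best_class] < threshold:
        return "unclassified"
    return best_class


if __name__ == "__main__":
    confident = {"plasmid.Proteobacteria": 0.85, "chromosome.Proteobacteria": 0.15}
    ambiguous = {"plasmid.Firmicutes": 0.55, "chromosome.Firmicutes": 0.45}
    print(assign_label(confident))
    print(assign_label(ambiguous))
```

Raising the threshold trades recall for precision: fewer contigs receive a label, but the labels that remain are backed by higher probabilities.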
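
A quick way to summarize the tabular output is to group contigs by the prefix of their label. The sketch below assumes the TSV has a header row with the column names listed above and that any label not starting with "plasmid." or "chromosome." belongs to the unclassified bin; the function names and the toy rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def bin_of(label):
    """Map a class label to its bin using the prefix before the first dot."""
    prefix = label.split(".", 1)[0]
    return prefix if prefix in ("plasmid", "chromosome") else "unclassified"


def summarize(tsv_handle):
    """Count contigs and total length per bin from a PlasFlow-style TSV."""
    contigs, lengths = Counter(), Counter()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        bin_name = bin_of(row["label"])
        contigs[bin_name] += 1
        lengths[bin_name] += int(row["contig_length"])
    return contigs, lengths


if __name__ == "__main__":
    # Toy table for illustration only; real files come from the --output option.
    toy = ("contig_id\tcontig_name\tcontig_length\tid\tlabel\n"
           "0\tc1\t1500\t1\tplasmid.Proteobacteria\n"
           "1\tc2\t3000\t2\tchromosome.Firmicutes\n")
    print(summarize(io.StringIO(toy)))
```

For a real run, open the predictions file with `open(path, newline="")` and pass the handle to `summarize`.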

Test dataset

Test dataset is located in the test folder (file Citrobacter_freundii_strain_CAV1321_scaffolds.fasta). It is the SPAdes 3.9.1 assembly of Citrobacter freundii strain CAV1321 genome (NCBI assembly ID: GCA_001022155.1), which contains 1 chromosome and 9 plasmids. In the same folder the results of classification can be found in the form of tsv file (Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv) and fasta files containing identified bins (Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_chromosomes.fasta, Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_plasmids.fasta and Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_unclassified.fasta).

To invoke PlasFlow on the test dataset, copy the test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta file to your current working directory and type:

PlasFlow.py --input Citrobacter_freundii_strain_CAV1321_scaffolds.fasta --output test.plasflow_predictions.tsv --threshold 0.7

The predictions will be located in the test.plasflow_predictions.tsv file and can be compared to results available in the test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv.
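
The comparison can be done by eye, or with a short script such as the sketch below. It assumes both files have a header row containing the contig_name and label columns; `load_labels` and `compare_labels` are hypothetical helper names.

```python
import csv


def load_labels(path):
    """Read a PlasFlow-style TSV and return {contig_name: label}."""
    with open(path, newline="") as handle:
        return {row["contig_name"]: row["label"]
                for row in csv.DictReader(handle, delimiter="\t")}


def compare_labels(new, reference):
    """Return contigs present in both mappings whose labels differ."""
    return {name: (new[name], reference[name])
            for name in sorted(set(new) & set(reference))
            if new[name] != reference[name]}


if __name__ == "__main__":
    import os
    new_path = "test.plasflow_predictions.tsv"
    ref_path = "test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv"
    if os.path.exists(new_path) and os.path.exists(ref_path):
        differences = compare_labels(load_labels(new_path), load_labels(ref_path))
        print(len(differences), "contigs with differing labels")
```

An empty result means every shared contig received the same label in both runs.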

Detailed information

Detailed information concerning the algorithm and the assumptions on which PlasFlow is based can be found in the publication "PlasFlow - Predicting Plasmid Sequences in Metagenomic Data Using Genome Signatures" (Nucleic Acids Research, https://doi.org/10.1093/nar/gkx1321). The flowchart illustrating the major steps of training and prediction is shown below.

All models tested and described in the manuscript can be found in the separate repository: https://github.com/smaegol/PlasFlow_models

Scripts used for the preparation of the training dataset and for neural network training are available in the scripts subfolder as well as in the separate repository: https://github.com/smaegol/PlasFlow_processing

Citation

Please cite the following paper when using PlasFlow for your own research.

Krawczyk PS, Lipinski L, Dziembowski A. Nucleic Acids Res. 2018 Apr 6;46(6):e35. doi: 10.1093/nar/gkx1321.

TBD

In the next releases we plan to retrain the models using the most recent TensorFlow release. During the development of PlasFlow there were a lot of changes in the TensorFlow library, and the newest version is not compatible with the models trained for TensorFlow 0.10.0. However, retraining requires significant computational effort and recoding. As we want to include Archaea sequences (which are currently missing) in the models, we plan to train new models with the latest TensorFlow version and release a new version of PlasFlow in the second half of 2018.

Support

Any issues connected with the PlasFlow should be addressed to Pawel Krawczyk (p.krawczyk (at) ibb.waw.pl).

plasflow's Issues

plasflow and tensorflow=0.10.0rc0 packages not found on current channels via miniconda3 installation

Hey Smaegol,

Thanks for updating the installation for PlasFlow. I am excited to use this novel tool to search for plasmids but ran into the following trying to install the new updates:

Any thoughts on how to resolve the issue would be much appreciated.

Thanks,
Sfinks

cannot locate bio/seqio.pm in @INC

Hi, I'm having trouble running the program with the following error. I think it's missing a perl module?
I installed the program using conda install plasflow -c smaegol.
Any advice?
Thanks for the assistance!

Can't locate Bio/SeqIO.pm in @INC (@INC contains: /home/jjjjia/perl5/lib/perl5/5.16.3/x86_64-linux-thread-multi /home/jjjjia/perl5/lib/perl5/5.16.3 /home/jjjjia/perl5/lib/perl5/x86_64-linux-thread-multi /home/jjjjia/perl5/lib/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /home/jjjjia/.conda/envs/plasflow/bin/filter_sequences_by_length.pl line 24.
BEGIN failed--compilation aborted at /home/jjjjia/.conda/envs/plasflow/bin/filter_sequences_by_length.pl line 24.

Plasflow running problem

While running PlasFlow.py, I ran into the following problem:

(plasflow) wzq:plasflow_output/ $ PlasFlow.py --input filtered_seameta_9.fasta --output seameta_19.plasflow.tsv [11:21:37]
Traceback (most recent call last):
  File "/home/wzq/.conda/envs/plasflow/bin/PlasFlow.py", line 50, in <module>
    from rpy2.robjects.packages import importr
  File "/home/wzq/.conda/envs/plasflow/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 16, in <module>
    import rpy2.rinterface as rinterface
  File "/home/wzq/.conda/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py", line 92, in <module>
    from rpy2.rinterface._rinterface import (baseenv,
ImportError: libicuuc.so.54: cannot open shared object file: No such file or directory

How to fix it?

Should MAG completeness be considered when running Plasflow?

Hi there, I've been applying Plasflow to MAGs from waste water treatment samples, and it seems to be working really well! My question is more theoretical than technical - given that many of our assembled genomes have low completeness (even when they are highly abundant) is it still appropriate to include these in the pipeline? The paper excludes short sequences but doesn't mention assembly quality, and I am not sure if it should make a difference or not. If I set a threshold at 50% completeness, this excludes most of the MAGs, including some of the most abundant ones. In your study, you test Plasflow on microbial mats, for which I assume most of the assemblies would be rather incomplete. So, would it be sensible to set completeness thresholds if one wants high standards for identifying true associations, or is it irrelevant? Thanks!

Setting up with conda, filter_sequences_by_length.pl isn't reachable

I have set up the environment with conda, but when I try to use the filter_sequences_by_length.pl script, I notice it doesn't exist in the environment. Do the scripts need to be installed manually?

Thanks in advance

Synthetic plasmid detection

Hello PlasFlow,

I notice you train the model on RefSeq references. In my use-case I would like to identify and reconstruct plasmids from NGS data, but all of my plasmids are synthetic constructs and probably don't resemble any of the references in RefSeq.
My question is, where should I look to re-train the model on a reference containing my plasmid set and then be able to identify it from millions of NGS reads? Or, will the accuracy of the tool correctly identify synthetic constructs as is?

Thank you,
Alexander

Cache?

When I followed the steps, the result was:
[41372:0715/180616.696:ERROR:cache_util_win.cc(20)] Unable to move the cache: access denied。 (0x5)
[41372:0715/180616.696:ERROR:disk_cache.cc(205)] Unable to create cache

How can I deal with it?

Running error with "__init__.py"

Hi,

I installed PlasFlow through anaconda with the command line:
conda create -n plasflow python=3.5 plasflow

Then I ran PlasFlow.py and got the following errors:

Calculating kmer frequencies using kmer 5
Transforming kmer frequencies
Traceback (most recent call last):
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/bin/PlasFlow.py", line 261, in <module>
    vote_proba = vote_class.predict_proba(input_data)
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/bin/PlasFlow.py", line 215, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/bin/PlasFlow.py", line 215, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/bin/PlasFlow.py", line 171, in predict_proba_tf
    import tensorflow as tf
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/lib/python3.5/site-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/lib/python3.5/site-packages/tensorflow/core/framework/graph_pb2.py", line 6, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/srv/scratch/z3336178/anaconda/envs/plasflow/lib/python3.5/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
ImportError: /srv/scratch/z3336178/anaconda/envs/plasflow/lib/python3.5/site-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNK6google8protobuf10TextFormat17FieldValuePrinter9PrintBoolEb

Is it a compatibility problem?

Cheers

Alan

ImportError: libicuuc.so.58: cannot open shared object file: No such file or directory

I installed PlasFlow with miniconda3, following the “Conda-based - recommended” instructions.
Commands:
conda config --add channels bioconda
conda config --add channels conda-forge
conda create --name plasflow python=3.5
source activate plasflow
conda install -c jjhelmus tensorflow=0.10.0rc0
conda install plasflow -c smaegol

After all of the dependencies were installed, I ran:
~/biosoft/miniconda3/envs/plasflow/bin/PlasFlow.py --input plasflow.test.fa --output W3

Then the following error occurred:

Traceback (most recent call last):
  File "/usr/lishasha/biosoft/miniconda3/envs/plasflow/bin/PlasFlow.py", line 50, in <module>
    from rpy2.robjects.packages import importr
  File "/usr/lishasha/biosoft/miniconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 16, in <module>
    import rpy2.rinterface as rinterface
  File "/usr/lishasha/biosoft/miniconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py", line 92, in <module>
    from rpy2.rinterface._rinterface import (baseenv,
ImportError: libicuuc.so.58: cannot open shared object file: No such file or directory

Conda install - ImportError: libicuuc.so.58: cannot open shared object file: No such file or directory

Hello,

I attempted to install PlasFlow using conda as recommended in your README, but it looks like conda is having some issues with a dependency - when I try to run PlasFlow.py, I get the following ImportError:

PlasFlow.py --input /home/lowa/test.fasta --output classified.txt
Traceback (most recent call last):
  File "/home/lowa/miniconda3/envs/plasflow/bin/PlasFlow.py", line 48, in <module>
    from rpy2.robjects.packages import importr
  File "/home/lowa/miniconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 15, in <module>
    import rpy2.rinterface as rinterface
  File "/home/lowa/miniconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py", line 99, in <module>
    from rpy2.rinterface._rinterface import *
ImportError: libicuuc.so.58: cannot open shared object file: No such file or directory

The rpy2 module is definitely installed, as can be seen with conda list, so I'm not sure what's going wrong. Any guidance on how to resolve this would be much appreciated. Thanks!

Output of conda list:

# packages in environment at /home/lowa/miniconda3/envs/plasflow:
#
bioconductor-biocgenerics 0.22.0                 r3.3.2_0    bioconda
bioconductor-iranges      2.8.2                  r3.3.2_0    bioconda
bioconductor-s4vectors    0.12.2                 r3.3.2_0    bioconda
bioconductor-xvector      0.14.1                 r3.3.2_0    bioconda
bioconductor-zlibbioc     1.20.0                 r3.3.2_1    bioconda
biopython                 1.70                     py35_2    conda-forge
bzip2                     1.0.6                         1    conda-forge
ca-certificates           2017.11.5                     0    conda-forge
cairo                     1.14.8                        0  
certifi                   2017.11.5                py35_0    conda-forge
curl                      7.55.1                        0    conda-forge
fontconfig                2.12.1                        3  
freetype                  2.5.5                         2  
glib                      2.50.2                        1  
gsl                       2.1                           2    conda-forge
harfbuzz                  0.9.39                        2  
icu                       54.1                          0  
intel-openmp              2018.0.0             hc7b2577_8  
jpeg                      9b                            2    conda-forge
krb5                      1.14.2                        0    conda-forge
libffi                    3.2.1                         3    conda-forge
libgcc                    7.2.0                h69d50b8_2  
libgcc-ng                 7.2.0                h7cc24e2_2  
libgfortran               3.0.0                         1  
libiconv                  1.14                          4    conda-forge
libpng                    1.6.34                        0    conda-forge
libssh2                   1.8.0                         2    conda-forge
libstdcxx-ng              7.2.0                h7a57d05_2  
libtiff                   4.0.9                         0    conda-forge
libxml2                   2.9.4                         0  
mkl                       2017.0.4             h4c4d0af_0  
mock                      2.0.0                    py35_0    conda-forge
ncurses                   5.9                          10    conda-forge
numpy                     1.11.3                   py35_0  
openssl                   1.0.2n                        0    conda-forge
pandas                    0.22.0                   py35_0    conda-forge
pango                     1.40.3                        1  
pbr                       3.1.1                    py35_0    conda-forge
pcre                      8.39                          0    conda-forge
pip                       9.0.1                    py35_1    conda-forge
pixman                    0.34.0                        1    conda-forge
plasflow                  1.0.7                    py35_0    smaegol
protobuf                  3.5.1                    py35_3    conda-forge
python                    3.5.4                         2    conda-forge
python-dateutil           2.3                      py35_0    bioconda
pytz                      2017.3                     py_2    conda-forge
r-base                    3.3.2                         1  
readline                  6.2                           0    conda-forge
rpy2                      2.7.8              py35r3.3.2_1    bioconda
scikit-learn              0.18.1              np111py35_1  
scipy                     0.19.0              np111py35_0  
setuptools                38.4.0                   py35_0    conda-forge
six                       1.11.0                   py35_1    conda-forge
sqlite                    3.13.0                        1    conda-forge
tensorflow                0.10.0rc0           np111py35_0  
tk                        8.5.19                        2    conda-forge
wheel                     0.30.0                   py35_2    conda-forge
xz                        5.2.3                         0    conda-forge
zlib                      1.2.11                        0    conda-forge

conda install hung at solving environment...

Hi,
I used the Conda-based installation, and all the steps completed successfully except the last step, "conda install plasflow -c smaegol", which output:

Collecting package metadata: done
Solving environment: /

for 12 hours. So what is the problem?

Thanks

Segmentation fault: 11

Cryptic segmentation fault error

Hi and thanks for the work done.

I was running PlasFlow on a pretty big metagenomic assembly. I've only retained contigs above 1 kb but I'm still left with half a million contigs. PlasFlow was running smoothly until the prediction step for 7-mers (see console output below).

Besides that specific dataset, PlasFlow runs great on smaller datasets, so it doesn't seem to be an installation issue. I'm running it in a dedicated conda environment.

Is this due to a memory issue, and is there any way to resume the PlasFlow run from intermediary files? That is, not recounting all k-mers etc., just doing the classification step, maybe tweaking the batch_size parameter? I'm running that on Debian 3.16.51-3+deb8u1 (2018-01-08) x86_64 and the server is pretty decent RAM-wise, with 1 TB installed for 64 cores...

Any ideas to process that dataset are welcome :)

Line run:
PlasFlow.py --input ../anvio.megahit_kstep8.1Kbcontigs.fasta --output PlasFlow_out.megahit_kstep8.1kb

...
processing chunk: 20
processing chunk: 21
processing chunk: 22
Transforming kmer frequencies
Finished transforming, saving transformed values
Predicting labels using kmer 6  frequencies
Calculating kmer frequencies using kmer 7
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
processing chunk: 2
processing chunk: 3
processing chunk: 4
processing chunk: 5
processing chunk: 6
processing chunk: 7
processing chunk: 8
processing chunk: 9
processing chunk: 10
processing chunk: 11
processing chunk: 12
processing chunk: 13
processing chunk: 14
processing chunk: 15
processing chunk: 16
processing chunk: 17
processing chunk: 18
processing chunk: 19
processing chunk: 20
processing chunk: 21
processing chunk: 22
Transforming kmer frequencies
Finished transforming, saving transformed values
Predicting labels using kmer 7  frequencies
Segmentation fault

tensorflow issue?

In a fresh installation of ubuntu 18.04.2 LTS, I installed miniconda and followed the plasflow conda installation.
conda -V

conda 4.6.12

Environment (conda list):

_r-mutex 1.0.0 anacondar_1
bioconductor-biocgenerics 0.26.0 r341_0 bioconda
bioconductor-biostrings 2.48.0 r341h470a237_0 bioconda
bioconductor-iranges 2.14.12 r341h470a237_0 bioconda
bioconductor-s4vectors 0.18.3 r341h470a237_0 bioconda
bioconductor-xvector 0.20.0 r341h470a237_0 bioconda
bioconductor-zlibbioc 1.26.0 r341h470a237_0 bioconda
biopython 1.72 py35_0 conda-forge
blas 1.0 mkl
bzip2 1.0.6 h14c3975_1002 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
cairo 1.14.12 h80bd089_1005 conda-forge
certifi 2018.8.24 py35_1001 conda-forge
curl 7.61.0 h93b3f91_2 conda-forge
fontconfig 2.13.1 he4413a7_1000 conda-forge
freetype 2.10.0 he983fc9_0 conda-forge
gettext 0.19.8.1 hc5be6a0_1002 conda-forge
glib 2.56.2 had28632_1001 conda-forge
graphite2 1.3.13 hf484d3e_1000 conda-forge
gsl 2.2.1 h0c605f7_3
harfbuzz 1.9.0 he243708_1001 conda-forge
icu 58.2 hf484d3e_1000 conda-forge
intel-openmp 2019.3 199
jpeg 9c h14c3975_1001 conda-forge
krb5 1.14.6 0 conda-forge
libffi 3.2.1 he1b5a44_1006 conda-forge
libgcc 7.2.0 h69d50b8_2 conda-forge
libgcc-ng 8.2.0 hdf63c60_1
libgfortran 3.0.0 1 conda-forge
libgfortran-ng 7.3.0 hdf63c60_0
libiconv 1.15 h516909a_1005 conda-forge
libpng 1.6.36 h84994c4_1000 conda-forge
libprotobuf 3.6.0 hdbcaa40_1000 conda-forge
libssh2 1.8.0 h1ad7b7a_1003 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h648cc4a_1001 conda-forge
libuuid 2.32.1 h14c3975_1000 conda-forge
libxcb 1.13 h14c3975_1002 conda-forge
libxml2 2.9.8 h143f9aa_1005 conda-forge
mkl 2017.0.4 h4c4d0af_0
mock 2.0.0 py35_0 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
numpy 1.11.3 py35h3dfced4_4
openssl 1.0.2r h14c3975_0 conda-forge
pandas 0.23.4 py35hf8a1672_0 conda-forge
pango 1.40.14 hf0c64fd_1003 conda-forge
pbr 5.1.3 py_0 conda-forge
pcre 8.41 hf484d3e_1003 conda-forge
pip 18.0 py35_1001 conda-forge
pixman 0.34.0 h14c3975_1003 conda-forge
plasflow 1.1.0 py35_0 smaegol
protobuf 3.6.0 py35hfc679d8_0 conda-forge
pthread-stubs 0.4 h14c3975_1001 conda-forge
python 3.5.5 h5001a0f_2 conda-forge
python-dateutil 2.8.0 py_0 conda-forge
pytz 2019.1 py_0 conda-forge
r-base 3.4.1 h4fe35fd_8 conda-forge
readline 7.0 hf8c457e_1001 conda-forge
rpy2 2.8.5 py35r3.4.1_0 conda-forge
scikit-learn 0.18.1 np111py35_1
scipy 0.19.0 np111py35_0
setuptools 40.4.3 py35_0 conda-forge
six 1.11.0 py35_1 conda-forge
sqlite 3.26.0 h67949de_1001 conda-forge
tensorflow 0.10.0rc0 np111py35_0
tk 8.6.9 h84994c4_1001 conda-forge
wheel 0.32.0 py35_1000 conda-forge
xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
xorg-libice 1.0.9 h516909a_1004 conda-forge
xorg-libsm 1.2.3 h84519dc_1000 conda-forge
xorg-libx11 1.6.7 h14c3975_1000 conda-forge
xorg-libxau 1.0.9 h14c3975_0 conda-forge
xorg-libxdmcp 1.1.3 h516909a_0 conda-forge
xorg-libxext 1.3.4 h516909a_0 conda-forge
xorg-libxrender 0.9.10 h516909a_1002 conda-forge
xorg-libxt 1.1.5 h14c3975_1002 conda-forge
xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
xorg-xproto 7.0.31 h14c3975_1007 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
zlib 1.2.11 h14c3975_1004 conda-forge

The perl filter step went well, but when I run
PlasFlow.py --input scaffolds_filtered.fasta --output scaffolds.tsv
I got the following error

Importing sequences
Imported 29 sequences
Calculating kmer frequencies using kmer 5
Transforming kmer frequencies
Finished transforming, saving transformed values
Traceback (most recent call last):
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/core/framework/graph_pb2.py", line 6, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
ImportError: /home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNK6google8protobuf10TextFormat17FieldValuePrinter9PrintBoolEb

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dgkurt/miniconda3/envs/plasflow3/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/home/dgkurt/miniconda3/envs/plasflow3/bin/PlasFlow.py", line 300, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/home/dgkurt/miniconda3/envs/plasflow3/bin/PlasFlow.py", line 300, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/home/dgkurt/miniconda3/envs/plasflow3/bin/PlasFlow.py", line 255, in predict_proba_tf
    import tensorflow as tf
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 58, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/python/__init__.py", line 52, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/tensorflow/core/framework/graph_pb2.py", line 6, in <module>
    from google.protobuf import descriptor as _descriptor
  File "/home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/google/protobuf/descriptor.py", line 47, in <module>
    from google.protobuf.pyext import _message
ImportError: /home/dgkurt/miniconda3/envs/plasflow3/lib/python3.5/site-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNK6google8protobuf10TextFormat17FieldValuePrinter9PrintBoolEb

Error importing tensorflow. Unless you are using bazel,
you should not try to import tensorflow from its source directory;
please exit the tensorflow source tree, and relaunch your python interpreter
from there.

Error in filter_sequences_by_length.pl

When I run the command ‘perl filter_sequences_by_length.pl -input /data/Timmy/metagenome_test/ERR1051325.fa -output /data/Timmy/plasflow/filtered_ERR1051325.fa -thresh sequence_length_threshold’

Error:
“(plasflow) [wangbh@tc6000 Plasflow]$ perl filter_sequences_by_length.pl -input /data/Timmy/metagenome_test/ERR1051325.fa -output /data/Timmy/plasflow/filtered_ERR1051325.fa -thresh sequence_length_threshold
Can't locate Bio/SeqIO.pm in @INC (you may need to install the Bio::SeqIO module) (@INC contains: /usr/lib/perl5/site_perl /data2/wangbh/perl5/lib/perl5/x86_64-linux-thread-multi /data2/wangbh/perl5/lib/perl5 /public/home/bma/miniconda2/envs/plasflow/lib/site_perl/5.32.0/x86_64-linux-thread-multi /public/home/bma/miniconda2/envs/plasflow/lib/site_perl/5.32.0 /public/home/bma/miniconda2/envs/plasflow/lib/5.32.0/x86_64-linux-thread-multi /public/home/bma/miniconda2/envs/plasflow/lib/5.32.0 .) at filter_sequences_by_length.pl line 24.
BEGIN failed--compilation aborted at filter_sequences_by_length.pl line 24.”
It seems that Bio::SeqIO is not installed.
When I run the command ‘conda install -c bioconda perl -bioperl perl-getopt-long’, I get:
“usage: conda [-h] [-V] command ...
conda: error: unrecognized arguments: -bioperl perl-getopt-long”

When I run the command ‘cpan -i Bio::Perl Getopt::longer’, I get:
“(error): Skipping Getopt::longer because I couldn't find a matching namespace.”

RRuntime Warning issue

I created a new env as recommended and installed plasflow using

conda install plasflow -c smaegol
but when I run:
PlasFlow.py --input 1.fasta --output test.tsv --threshold 0.7
It shows:

/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: Failed with error:
warnings.warn(x, RRuntimeWarning)
/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning:
warnings.warn(x, RRuntimeWarning)
/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: ‘cannot read workspace version 3 written by R 4.1.1; need R 3.5.0 or newer’
warnings.warn(x, RRuntimeWarning)
/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning:

warnings.warn(x, RRuntimeWarning)
/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: Error in readRDS(pfile) :
cannot read workspace version 3 written by R 4.1.1; need R 3.5.0 or newer

warnings.warn(x, RRuntimeWarning)
Traceback (most recent call last):
File "/home/s/anaconda3/envs/plasflow/bin/PlasFlow.py", line 74, in <module>
biostrings = importr('Biostrings')
File "/home/s/anaconda3/envs/plasflow/lib/python3.5/site-packages/rpy2/robjects/packages.py", line 453, in importr
env = _get_namespace(rname)
rpy2.rinterface.RRuntimeError: Error in readRDS(pfile) :
cannot read workspace version 3 written by R 4.1.1; need R 3.5.0 or newer

here is the conda list:

packages in environment at /home/s/anaconda3/envs/plasflow:

Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
_r-mutex 1.0.1 anacondar_1 conda-forge
bioconductor-biocgenerics 0.22.0 r3.3.2_0 bioconda
bioconductor-biostrings 2.42.1 r3.3.2_0 bioconda
bioconductor-iranges 2.8.2 r3.3.2_0 bioconda
bioconductor-s4vectors 0.12.2 r3.3.2_0 bioconda
bioconductor-xvector 0.14.1 r3.3.2_0 bioconda
bioconductor-zlibbioc 1.20.0 r3.3.2_1 bioconda
biopython 1.72 py35_0 conda-forge
blas 2.11 openblas conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.12.7 ha878542_0 conda-forge
cairo 1.14.8 0 conda-forge
certifi 2018.8.24 py35_1001 conda-forge
curl 7.87.0 h6312ad2_0 conda-forge
fontconfig 2.12.1 3 conda-forge
freetype 2.5.5 2 conda-forge
glib 2.50.2 1 conda-forge
gsl 2.7 he838d99_0 conda-forge
harfbuzz 0.9.39 2 conda-forge
icu 54.1 0 conda-forge
jbig 2.1 h7f98852_2003 conda-forge
jpeg 8d h166bdaf_1 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
krb5 1.20.1 hf9c8cef_0 conda-forge
libblas 3.8.0 11_openblas conda-forge
libcblas 3.8.0 11_openblas conda-forge
libcurl 7.87.0 h6312ad2_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc 7.2.0 h69d50b8_2 conda-forge
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgfortran 3.0.0 1 conda-forge
libgfortran-ng 7.5.0 h14aa051_20 conda-forge
libgfortran4 7.5.0 h14aa051_20 conda-forge
libgomp 12.2.0 h65d4601_19 conda-forge
libiconv 1.14 4 conda-forge
liblapack 3.8.0 11_openblas conda-forge
liblapacke 3.8.0 11_openblas conda-forge
libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
libopenblas 0.3.6 h5a2b251_2 conda-forge
libpng 1.6.39 h753d276_0 conda-forge
libprotobuf 3.6.0 hdbcaa40_1000 conda-forge
libsqlite 3.40.0 h753d276_0 conda-forge
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libtiff 4.0.6 2 conda-forge
libxml2 2.9.9 hea5a465_1 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
mock 2.0.0 py35_0 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
numpy 1.11.3 py35h99e49ec_10 conda-forge
numpy-base 1.11.3 py35h2f8d375_10 conda-forge
openblas 0.2.19 2 conda-forge
openssl 1.1.1t h0b41bf4_0 conda-forge
pandas 0.23.4 py35hf8a1672_0 conda-forge
pango 1.40.3 1 conda-forge
pbr 5.5.1 pyh9f0ad1d_0 conda-forge
pcre 8.39 0 conda-forge
pip 9.0.1 py35_1 conda-forge
pixman 0.34.0 h14c3975_1003 conda-forge
plasflow 1.1.0 py35_0 smaegol
protobuf 3.6.0 py35hfc679d8_0 conda-forge
python 3.5.6 h12debd9_1 conda-forge
python-dateutil 2.8.1 py_0 conda-forge
pytz 2022.2 pyhd8ed1ab_0 conda-forge
r-base 3.3.2 0 defaults
readline 8.2 h8228510_1 conda-forge
rpy2 2.8.5 py35r3.3.2_0 conda-forge
scikit-learn 0.18.1 np111py35_nomkl_1 conda-forge
scipy 1.1.0 py35he2b7bc3_1 conda-forge
setuptools 36.4.0 py35_1 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.40.0 h4ff8645_0 conda-forge
tensorflow 0.10.0rc0 np111py35_0 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
wheel 0.37.1 pyhd3eb1b0_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
zlib 1.2.13 h166bdaf_4 conda-forge
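
The package list shows r-base 3.3.2 inside the environment, while the error mentions files written by R 4.1.1. One possible explanation (an assumption, not confirmed in this report) is that R inside the environment is picking up a package library that belongs to another R installation. The sketch below lists R-related environment variables that point outside the active conda environment; the function name `r_paths_outside_env` is hypothetical and not part of PlasFlow.

```python
import os

# Environment variables that R consults for its home and library search path.
R_VARIABLES = ("R_HOME", "R_LIBS", "R_LIBS_USER", "R_LIBS_SITE")


def r_paths_outside_env(environ, env_prefix):
    """Return (variable, path) pairs that point outside env_prefix."""
    prefix = os.path.abspath(env_prefix)
    findings = []
    for variable in R_VARIABLES:
        for entry in environ.get(variable, "").split(os.pathsep):
            if not entry:
                continue
            path = os.path.abspath(os.path.expanduser(entry))
            if path != prefix and not path.startswith(prefix + os.sep):
                findings.append((variable, entry))
    return findings


if __name__ == "__main__":
    # CONDA_PREFIX is set when a conda environment is activated.
    active_env = os.environ.get("CONDA_PREFIX")
    if active_env is None:
        print("No active conda environment detected")
    else:
        for variable, entry in r_paths_outside_env(os.environ, active_env):
            print("{} points outside the environment: {}".format(variable, entry))
```

If the script reports nothing, the stray library may instead come from R's default user library location, which this sketch does not inspect.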

tensorflow=0.10.0rc0 Install Failed

Hi smaegol,

I used conda with python=3.5 to install PlasFlow. Everything installed fine except TensorFlow:

1. tensorflow=0.10.0rc0 is not in the package list.
2. I then ran "pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl", but it failed with "tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl is not a supported wheel on this platform."

How can I fix it?

Viruses?

Hello,
I did some test runs with the software, and I noticed that it could not discriminate between prokaryotic and viral sequences. As a result, it often classifies viral sequences as plasmids. I guess that it would be nice to have the chance to train it to exclude viruses.

ImportError: cannot import name 'MafIO'

Hi,

I'm running PlasFlow and I get the tsv file, but at some point it gets the following error:

Traceback (most recent call last):
  File "/home/people/alpal/.conda/envs/plasflow/bin/PlasFlow.py", line 375, in <module>
    "_chromosomes.fasta", "fasta")
  File "/home/people/alpal/.conda/envs/plasflow/lib/python3.5/site-packages/Bio/SeqIO/__init__.py", line 461, in write
    from Bio import AlignIO
  File "/home/people/alpal/.conda/envs/plasflow/lib/python3.5/site-packages/Bio/AlignIO/__init__.py", line 154, in <module>
    from . import MafIO
ImportError: cannot import name 'MafIO'

I have checked python and:

>>> import Bio 

>>> print(Bio.__version__)
1.69

>>> from Bio import AlignIO
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/people/alpal/.conda/envs/plasflow/lib/python3.5/site-packages/Bio/AlignIO/__init__.py", line 154, in <module>
    from . import MafIO
ImportError: cannot import name 'MafIO'

>>> from Bio.AlignIO import MafIO
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/people/alpal/.conda/envs/plasflow/lib/python3.5/site-packages/Bio/AlignIO/__init__.py", line 154, in <module>
    from . import MafIO
ImportError: cannot import name 'MafIO'

I installed PlasFlow using conda.

Any idea about what is going on?

Thanks in advance.

For Windows?

Can this run on Windows?

Installation issues on macOS and Windows

Hi, I was wondering if there is an update available to resolve the channels issue. I experienced the following during installation on macOS:

I also experienced this during installation on Windows:

I also tried installing PlasFlow manually, with no success.

Many thanks for helping to resolve this issue,

-Sfinks

Installation issue on MacOS

Does the toolbox work on Mac?
While I was trying to install PlasFlow on a Mac I got the following error:
(plasflow) N82090:v1 lany$ conda install plasflow -c smaegol
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

plasflow

Current channels:

Many thanks,
Yuxuan

conda installed tensorflow error

Hello, I installed the tool using conda. However, when I ran the tool, I got this error:
ImportError: /miniconda3/envs/plasflow/lib/python3.5/site-packages/google/protobuf/pyext/_message.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNK6google8protobuf10TextFormat17FieldValuePrinter9PrintBoolEb

Error importing tensorflow. Unless you are using bazel,
you should not try to import tensorflow from its source directory;
please exit the tensorflow source tree, and relaunch your python interpreter
from there.

Could you please help fix this?
Many thanks!
Best,

TypeError, Generators Being Added

Maybe an issue with numpy version. I have v1.11.3

$ PlasFlow.py --input scaff1k.fa --output scaff.plasflow.csv --threshold 0.7
...
Traceback (most recent call last):
  File "/home/dcdanko/miniconda3/envs/plasflow/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/home/dcdanko/miniconda3/envs/plasflow/bin/PlasFlow.py", line 302, in predict_proba
    avg = np.average(self.probas_, axis=0, weights=self.weights)
  File "/home/dcdanko/.local/lib/python3.5/site-packages/numpy/lib/function_base.py", line 1110, in average
    avg = a.mean(axis)
  File "/home/dcdanko/.local/lib/python3.5/site-packages/numpy/core/_methods.py", line 70, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'generator' and 'generator'
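
In the traceback above, PlasFlow.py runs from the conda environment while numpy is loaded from `~/.local/lib/python3.5/site-packages`, so the two may not be the versions that were installed together. Whether that mix causes the TypeError is an assumption. The sketch below reports where a module is imported from relative to the active interpreter prefix; the helper names are hypothetical and not part of PlasFlow.

```python
import importlib
import os
import sys


def is_inside_prefix(path, prefix):
    """Return True when path lies under the directory prefix."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def report_module_origin(name, prefix=sys.prefix):
    """Describe where module `name` is imported from relative to prefix."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return name + ": not importable"
    location = getattr(module, "__file__", None)
    if location is None:
        return name + ": built-in or namespace module"
    where = "inside" if is_inside_prefix(location, prefix) else "OUTSIDE"
    return "{}: {} ({} {})".format(name, location, where, prefix)


if __name__ == "__main__":
    for module_name in ("numpy", "Bio", "sklearn"):
        print(report_module_origin(module_name))
```

If a module is reported as outside the environment, running Python with the standard `PYTHONNOUSERSITE=1` environment variable set makes the interpreter skip the per-user site-packages directory, which is one way to test the assumption.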

A question of tensorflow

May I ask a question? I installed the GPU version of TensorFlow on Windows 10 successfully, and it ran for several days. Recently it suddenly stopped working, with the error "Failed to load the native TensorFlow runtime". CUDA and cuDNN are certainly not the problem, because I managed to run a convolutional neural network. Do you have a solution? I have searched through a lot of information, but nothing works.

filter_sequences_by_length.pl

Hello Pawel,

After running the script, only contigs longer than 500 bp (not > 1000 bp) were kept. Is it possible to customize the parameters when I run 'filter_sequences_by_length.pl', for example to filter out sequences shorter than 1500 bp?
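
The command quoted in an earlier report on this page passes a `-thresh sequence_length_threshold` argument to the Perl script, which appears to be the intended way to set the cutoff. As an alternative that avoids the BioPerl dependency, a length filter can be sketched in plain Python. This is a minimal illustration, not the PlasFlow script: `read_fasta` and `filter_by_length` are hypothetical names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(lines, min_length):
    """Keep records whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


if __name__ == "__main__":
    demo = [">short", "ACGT", ">long", "ACGTACGT", "ACGT"]
    for header, sequence in filter_by_length(demo, min_length=10):
        print(">" + header)
        print(sequence)
```

Passing an open file handle instead of the `demo` list, with `min_length=1500`, would keep only contigs of 1500 bp or more.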

Running error

$ PlasFlow.py --input filtered_seameta_trim_9.fasta --output seameta_trim_9.plasflow_predictions.tsv --threshold 0.7
/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: Failed with error:
warnings.warn(x, RRuntimeWarning)
/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning:
warnings.warn(x, RRuntimeWarning)
/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: ‘package ‘BiocGenerics’ 0.24.0 is loaded, but >= 0.25.3 is required by ‘IRanges’’
warnings.warn(x, RRuntimeWarning)
/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning:

warnings.warn(x, RRuntimeWarning)
Importing sequences
Traceback (most recent call last):
File "/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 337, in __getattribute__
return self.__getitem__(attr)
File "/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 342, in __getitem__
res = _globalenv.get(item)
LookupError: 'readDNAStringSet' not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/disk3-8T/wzq/miniconda3/bin/PlasFlow.py", line 93, in <module>
input_data = r.readDNAStringSet(inputfile)
File "/home/disk3-8T/wzq/miniconda3/lib/python3.5/site-packages/rpy2/robjects/__init__.py", line 339, in __getattribute__
raise AttributeError(orig_ae)
AttributeError: 'R' object has no attribute 'readDNAStringSet'

New release: fix discrepancy between pypi, release, and git

I'm currently trying to fix the bioconda recipe for plasflow: bioconda/bioconda-recipes#27766

I noticed that the content of the download available at pypi and the release on github seems to differ. In particular the model dir seems missing in the pypi release.

I guess it's because in setup.py

package_data={'models': ['models/*']},

should be

package_data={'plasflow': ['models/*']},

It would be great if you could create a new release fixing this.

Tensor name "dnn/hiddenlayer_0/biases" not found in checkpoint files

Hello,

I had a few issues installing PlasFlow on my macOS Sierra but it now appears to have installed correctly with all the packages required. The error I am now having is:

NotFoundError (see above for traceback): Tensor name "dnn/hiddenlayer_0/biases" not found in checkpoint files
./Documents/PlasFlow/models/kmer5_split_20_20_neurons_relu/model.ckpt-50000-?????-of-00001
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I do not understand TensorFlow well enough to work out whether this issue is caused by a fault in the file I have input or whether PlasFlow simply isn't compatible with OS X.

Any help would be appreciated!

batch_size advice

Regarding the batch_size option, I found the opposite of the instructions to be true: the larger the batch size, the more likely it is to work for a large dataset.

Plasflow running problem

Possible incompatibility with underlying sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2

A colleague is using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering using filter_sequences_by_length.pl, she has a total of 2,964,210 contigs. We are using plasflow-1.1, python-3.5 and sklearn-0.18.1 on CentOS 6.9. PlasFlow was installed via Anaconda.

Running :

PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7

Yields:
Stdout:

Importing sequences
Imported  2964210  sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies

Stderr :

/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
    self.calculate_freq(data)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
    test_tfidf = transformer.fit_transform(kmer_count)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform    return self.fit(X, **fit_params).transform(X)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
    X = normalize(X, norm=self.norm, copy=False)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
    inplace_csr_row_normalize_l2(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

This issue leads me to think this is due to passing the underlying C function, sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2, too large a matrix. Following a path of links led me to this commit, which makes me think that this may be fixed in a more recent version of scikit-learn. The input data, all.contigs.1000.fasta, is 12 GB in size.

Question:

Is my assessment of this issue correct?
Is there a workaround for this issue?
Is the input data too big?

Thanks.
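
A rough size check is consistent with the assessment above. The sketch assumes the k-mer matrix can hold up to 4^k entries per sequence (an upper bound; the real matrix may be sparser, and PlasFlow's actual feature count is not confirmed here) and that the failing routine needs index values that fit in a 32-bit signed integer, as the "expected 'int' but got 'long'" message suggests. The function names are hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed index can hold


def max_nonzeros(n_sequences, k):
    """Upper bound on stored entries: every sequence contains every k-mer."""
    return n_sequences * 4**k


def exceeds_int32(n_sequences, k):
    """True when the upper bound no longer fits in a 32-bit signed index."""
    return max_nonzeros(n_sequences, k) > INT32_MAX


def max_sequences_per_run(k):
    """Largest sequence count whose upper bound still fits in int32."""
    return INT32_MAX // 4**k


if __name__ == "__main__":
    n_contigs = 2964210  # contig count reported above
    for k in (5, 6):
        print(k, max_nonzeros(n_contigs, k), exceeds_int32(n_contigs, k),
              max_sequences_per_run(k))
```

For k = 5 the bound is 2,964,210 × 1,024 = 3,035,351,040, which is above 2,147,483,647. Under these assumptions, splitting the input into files of about two million contigs or fewer for k = 5 (fewer still for larger k) would stay below the limit; this workaround is untested here.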

WARNING: Logging before flag parsing goes to stderr.

Hello @smaegol,
When I run PlasFlow.py --input test.fasta --output test_plasflow_predictions.tsv --threshold 0.7, I get the following output:
Finished transforming, saving transformed values
WARNING: Logging before flag parsing goes to stderr.
W1207 13:00:41.275060 140190885631808 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0.

Can you help me with this problem?

I also have another question. PlasFlow helps identify which contigs probably belong to a plasmid. I have a plasmid .fna file containing 35 contigs. Can PlasFlow tell me which of these 35 contigs belong to the same plasmid, or does it only tell me whether each contig probably belongs to a plasmid or a chromosome?

Thanks a lot

PlasFlow.py --

Hi, can I ask you a question?
When I run the command as you wrote it:
PlasFlow.py --input Citrobacter_freundii_strain_CAV1321_scaffolds.fasta --output test.plasflow_predictions.tsv --threshold 0.7
conda tells me that it is not a valid command. How can I solve this?

TensorFlow doesn't find GLIBC_2.14

Hi,
I am trying to run your tool by following your instructions, using Conda, and every time I get the same error:
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/sunam168/.conda/envs/plas_pkgs_included/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow.so)

The installation of TensorFlow seems correct, so I really don't know where it comes from...
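
The message says the TensorFlow binary requires glibc 2.14 or newer, while `/lib64/libc.so.6` on this machine is older; some older enterprise distributions, such as CentOS 6, ship glibc 2.12. A minimal sketch for checking the system's glibc version from Python follows. `platform.libc_ver()` is part of the standard library; `parse_version` and `glibc_at_least` are hypothetical helper names.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_at_least(found, required="2.14"):
    """Return True when version `found` is at least `required`."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    library, version = platform.libc_ver()
    if library == "glibc" and version:
        status = "OK" if glibc_at_least(version) else "too old for this TensorFlow build"
        print("glibc {}: {}".format(version, status))
    else:
        print("Could not determine the glibc version on this platform")
```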

filtering error

(plasflow) [administrator@localhost ~]$ ./miniconda3/envs/plasflow/bin/filter_sequences_by_length.pl -input Citrobacter_freundii_strain_CAV1321_scaffolds.fasta -output firstplas.fasta
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at ./miniconda3/envs/plasflow/bin/filter_sequences_by_length.pl line 24.
BEGIN failed--compilation aborted at ./miniconda3/envs/plasflow/bin/filter_sequences_by_length.pl line 24.

How can I deal with this? Thanks.

Archaea sequences?

Great tool. I read that you are planning to include Archaea sequences. Do you have an idea of when the tool will be updated?

PlasFlow running error

Outputting fasta files with classified sequences
Traceback (most recent call last):
File "/home/wzq/miniconda3/envs/plasflow/bin/PlasFlow.py", line 444, in <module>
sequences_dict = SeqIO.index(args.inputfile, "fasta")
File "/home/wzq/.local/lib/python3.5/site-packages/Bio/SeqIO/__init__.py", line 885, in index
key_function, repr, "SeqRecord")
File "/home/wzq/.local/lib/python3.5/site-packages/Bio/File.py", line 306, in __init__
raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key '1034476d-0f96-4862-8f0e-aecb00f7cfcb'

Could you fix it?
Thanks.
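
Biopython's `SeqIO.index` uses the record ID (the header text after `>` up to the first whitespace) as the dictionary key, so the error above means that at least two FASTA headers in the input share the ID shown. A minimal sketch for finding and renaming repeated IDs before running PlasFlow follows; the function names are hypothetical, and the `_2`, `_3` suffix scheme is an arbitrary choice that assumes the suffixed names do not already occur in the file.

```python
from collections import Counter


def split_header(line):
    """Split a '>' header line into (record_id, description)."""
    parts = line[1:].strip().split(None, 1)
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return record_id, description


def duplicate_ids(lines):
    """Return {record_id: count} for IDs that occur more than once."""
    counts = Counter(split_header(l)[0] for l in lines if l.startswith(">"))
    return {rid: n for rid, n in counts.items() if n > 1}


def make_ids_unique(lines):
    """Yield lines, appending _2, _3, ... to repeated record IDs."""
    seen = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            record_id, description = split_header(line)
            seen[record_id] += 1
            if seen[record_id] > 1:
                record_id = "{}_{}".format(record_id, seen[record_id])
                line = ">" + record_id + (" " + description if description else "")
        yield line


if __name__ == "__main__":
    demo = [">x first copy", "ACGT", ">x", "GG", ">y", "TT"]
    print(duplicate_ids(demo))
    print("\n".join(make_ids_unique(demo)))
```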

Segmentation fault (core dumped)

Hello smaegol,

When I run:
PlasFlow.py --input /data/Timmy/Plasflow/filtered_ERR1051325.fa --output test.plasflow_predictions.tsv --threshold 0.7
Importing sequences
Imported [1] 1100468
sequences
Calculating kmer frequencies using kmer 5
Transforming kmer frequencies
Predicting labels using kmer 5 frequencies
Calculating kmer frequencies using kmer 6
Segmentation fault (core dumped)(error)

Is the memory of my server too small?
Generally, how much memory is required to run this script?
Is there any solution for my current server?

issue with sklearn kit

sequences labelled as 'plasmid' despite low reported probability

I am finding several sequences get labelled as 'plasmid.unclassified' or 'plasmid.Spirochaetes', despite the probability value being below the threshold. I am using the default 0.7. It never seems to happen for a 'chromosome' assignment. Maybe related to the fact I'm not seeing a column labeled 'plasmid.Spirochaetes' in my output.
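
To narrow this down, the output table can be audited row by row. The sketch below assumes the predictions file is tab-separated, has a column named `label`, and has one probability column per class named exactly like the class; these column names are assumptions about the output layout, and `audit_predictions` is a hypothetical helper, not part of PlasFlow. It reports labels that have no matching probability column and labels whose own probability is below the threshold.

```python
import csv
import io


def audit_predictions(tsv_text, threshold=0.7, label_column="label"):
    """Compare each row's label with the probability column of the same name.

    Returns (missing, below):
      missing: (row_number, label) where no column is named after the label
      below:   (row_number, label, probability) where probability < threshold
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing, below = [], []
    for row_number, row in enumerate(reader, start=1):
        label = row[label_column]
        if label not in row:
            missing.append((row_number, label))
            continue
        probability = float(row[label])
        if probability < threshold:
            below.append((row_number, label, probability))
    return missing, below


if __name__ == "__main__":
    demo = (
        "contig_name\tlabel\tchromosome.A\tplasmid.B\n"
        "c1\tplasmid.B\t0.2\t0.6\n"
        "c2\tchromosome.A\t0.9\t0.1\n"
        "c3\tplasmid.C\t0.5\t0.4\n"
    )
    missing_columns, below_threshold = audit_predictions(demo)
    print("labels without a probability column:", missing_columns)
    print("labels below threshold:", below_threshold)
```

Rows that land in the first list would match the observation that no 'plasmid.Spirochaetes' column exists in the output.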