marbl / binnacle
Binnacle: Using Scaffolds to Improve the Contiguity and Quality of Metagenomic Bins
Hey!
Thanks for the great software! I am following the wiki to combine the scaffold coverages from two samples. When I run `Estimate_Abundances.py` and `Collate.py`, I get two files, `Scaffolds.fasta` and `Feature-Matrix-concoct.txt`.
I wanted to ask:
a) Does the `Scaffolds.fasta` file correspond to the "final" scaffolds file of the two samples? Can I use it to run binning software (e.g., CONCOCT)? If yes, do I just concatenate the reads files from the two samples, since CONCOCT requires a single paired-end reads file along with the scaffolds file?
b) What is `Collate.py` doing? I get the `Feature-Matrix-concoct.txt` file, but what does it mean?
c) I ran `Estimate_Abundances.py` using sample1 or sample2 as the starting file for `Coords_After_Delinking.txt`. The two runs generate `Scaffolds.fasta` files with very different numbers of scaffolds. Do I just use the one with the larger number of scaffolds for (a)?
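For reference, the scaffold counts of the two runs can be compared by counting FASTA header lines; a minimal Python sketch (the per-sample paths are placeholders, not Binnacle output names):

```python
# Count scaffold records in a FASTA file by counting its ">" header lines.
# Real usage would point this at each sample's Scaffolds.fasta.
def count_scaffolds(fasta_path):
    with open(fasta_path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Self-contained demo with a throwaway two-record FASTA:
with open("demo.fasta", "w") as fh:
    fh.write(">Binnacle_Scaffold_1\nACGT\n>Binnacle_Scaffold_2\nTTGG\n")
print(count_scaffolds("demo.fasta"))  # 2
```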
Looking forward to your reply!
Hi,
I was trying to run the binnacle output through CONCOCT:
concoct -t 10 --composition_file data/processed/megahit/binnacle/Scaffolds.fasta --coverage_file data/processed/megahit/binnacle/Feature-Matrix-concoct.txt -b test
But I ran into the following issue:
Up and running. Check /data/san/data0/users/chris/prophage_mag_binning_comparison/test_log.txt for progress
/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/sklearn/utils/validation.py:1673: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
warnings.warn(
Traceback (most recent call last):
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/bin/concoct", line 90, in <module>
results = main(args)
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/bin/concoct", line 37, in main
transform_filter, pca = perform_pca(
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/concoct/transform.py", line 5, in perform_pca
pca_object = PCA(n_components=nc, random_state=seed).fit(d)
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 382, in fit
self._fit(X)
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/sklearn/decomposition/_pca.py", line 430, in _fit
X = self._validate_data(
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/sklearn/base.py", line 557, in _validate_data
X = check_array(X, **check_params)
File "/data/san/data0/users/chris/Programs/miniconda3/envs/concoct/lib/python3.8/site-packages/sklearn/utils/validation.py", line 797, in check_array
raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 138)) while a minimum of 1 is required.
After some digging, it looks like CONCOCT chokes on purely numeric FASTA headers (I'm not sure whether the problem is in the FASTA itself or in the Feature-Matrix file, but either way it doesn't matter since the two are linked).
It ran OK after I appended '_contig' to the FASTA headers and to their associated column in the Feature-Matrix using a couple of bash one-liners, then re-ran CONCOCT on their outputs, as below:
sed 's/>.*/&_contig/' data/processed/megahit/binnacle/Scaffolds.fasta > data/processed/megahit/binnacle/Scaffolds_edit.fasta
awk 'BEGIN{FS=OFS="\t"}{$1=$1"_contig"}1' data/processed/megahit/binnacle/Feature-Matrix-concoct.txt > data/processed/megahit/binnacle/Feature-Matrix-concoct_edit.txt
concoct -t 10 --composition_file data/processed/megahit/binnacle/Scaffolds_edit.fasta --coverage_file data/processed/megahit/binnacle/Feature-Matrix-concoct_edit.txt -b test
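The same renaming can be sketched in Python (equivalent in effect to the sed/awk one-liners above; the function name and file paths are illustrative, not part of Binnacle):

```python
# Append a suffix to every FASTA header and to the first (name) column of a
# tab-separated coverage matrix, writing edited copies of both files.
def add_suffix(fasta_in, fasta_out, matrix_in, matrix_out, suffix="_contig"):
    with open(fasta_in) as fin, open(fasta_out, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                line = line.rstrip("\n") + suffix + "\n"
            fout.write(line)
    with open(matrix_in) as fin, open(matrix_out, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            fields[0] += suffix
            fout.write("\t".join(fields) + "\n")

# Throwaway demo with a numeric header, as in the issue:
with open("s.fasta", "w") as f:
    f.write(">1\nACGT\n")
with open("m.txt", "w") as f:
    f.write("1\t53.8\t0.0\n")
add_suffix("s.fasta", "s_edit.fasta", "m.txt", "m_edit.txt")
print(open("s_edit.fasta").read().splitlines()[0])  # >1_contig
```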
Just wanted to post this as a heads-up that this might need to be fixed in a future release (I believe numeric-only names are binnacle's default scaffold naming scheme?), and in case someone else runs into the same issue and is looking for a fix.
Thanks for developing this awesome add-on!
Chris
Hi everyone, and thank you for your attention.
I've run through the complete Binnacle pipeline flawlessly and ran Collate.py to get a Feature-Matrix for CONCOCT and another for MetaBAT.
However, when it came to feeding Binnacle's output to the binning algorithms, I started struggling. Let's start with CONCOCT; I'm using version 1.0.0:
concoct --version
concoct 1.0.0
concoct -t 30 --composition_file Scaffolds.fasta --coverage_file Feature-Matrix-concoct.txt -b test_concoct
/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/concoct/input.py:82: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
cov = p.read_table(cov_file, header=0, index_col=0)
Traceback (most recent call last):
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/bin/concoct", line 88, in <module>
results = main(args)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/bin/concoct", line 40, in main
args.seed
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/concoct/transform.py", line 5, in perform_pca
pca_object = PCA(n_components=nc, random_state=seed).fit(d)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 340, in fit
self._fit(X)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 381, in _fit
copy=self.copy)
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/mnt/mini1/work/marco/miniconda3/envs/metawrap-new/lib/python2.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I checked my Feature-Matrix for NaN, NA, and Inf values, but everything seems fine. Here is the feature-matrix structure (I have 12 samples in my test dataset; output truncated for clarity):
head -4 Feature-Matrix-concoct.txt
Binnacle_Scaffold_1 53.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_2 125.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_3 63.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Binnacle_Scaffold_4 40.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
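A matrix like this can be double-checked for non-finite or non-numeric fields with a short pure-Python scan (a minimal sketch; the file name is illustrative):

```python
import math

# Scan a tab-separated coverage matrix (scaffold name in column 1, numbers
# after) and report any field that is NaN, infinite, or not a valid float.
def find_bad_values(path):
    bad = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            for colno, field in enumerate(line.rstrip("\n").split("\t")[1:], 2):
                try:
                    value = float(field)
                except ValueError:
                    bad.append((lineno, colno, field))
                    continue
                if math.isnan(value) or math.isinf(value):
                    bad.append((lineno, colno, field))
    return bad

# Throwaway demo: one clean row, one row with a NaN field.
with open("cov_demo.txt", "w") as f:
    f.write("Binnacle_Scaffold_1\t53.8\t0.0\nBinnacle_Scaffold_2\tnan\t0.0\n")
print(find_bad_values("cov_demo.txt"))  # [(2, 2, 'nan')]
```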
I then tested the MetaBAT-formatted feature matrix with MetaBAT v2.15:
metabat2 -t 30 -i Scaffolds.fasta -a Feature-Matrix-metabat.txt -o test_metabat
MetaBAT 2 (2.15 (Bioconda)) using minContig 2500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, maxEdges 200 and minClsSize 200000. with random seed=1641895722
terminate called after throwing an instance of 'boost::wrapexceptboost::bad_lexical_cast'
what(): bad lexical cast: source type value could not be interpreted as target
Aborted
I'd also like to point out that both binners work with no problems in classic binning pipelines such as metaWRAP, so I don't think my installation is the issue here.
Is there any way I can overcome this issue? Am I missing something? If more data are needed, I'd be more than willing to add them to this post.
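One thing worth ruling out in situations like this is a mismatch between the names in the FASTA and those in the coverage matrix, which can produce similar failures in both binners. A hedged sketch of such a check (paths and names are illustrative):

```python
# Compare the scaffold names in a FASTA file against the first column of a
# tab-separated coverage matrix, returning names present in only one side.
def name_mismatch(fasta_path, matrix_path):
    with open(fasta_path) as fh:
        fasta_names = {line[1:].split()[0] for line in fh if line.startswith(">")}
    with open(matrix_path) as fh:
        matrix_names = {line.split("\t", 1)[0] for line in fh if line.strip()}
    return fasta_names - matrix_names, matrix_names - fasta_names

# Throwaway demo with one deliberately mismatched name on each side:
with open("scaf.fasta", "w") as f:
    f.write(">Binnacle_Scaffold_1\nACGT\n>Binnacle_Scaffold_2\nTTGG\n")
with open("cov.txt", "w") as f:
    f.write("Binnacle_Scaffold_1\t53.8\nBinnacle_Scaffold_3\t40.2\n")
print(name_mismatch("scaf.fasta", "cov.txt"))
```

An empty pair of sets means the two inputs agree on names and the problem lies elsewhere.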
Thanks!
Marco