marbl / metAMOS
A metagenomic and isolate assembly and analysis pipeline built with AMOS
Home Page: http://marbl.github.io/metAMOS
License: Other
Need to add code to call meta-IDBA and parse its output for Assemble step.
Need to add Annotate to task list, check for called programs in PATH/metAMOS PATH, if not there, exit.
metAMOS requires Python 2.6; running it with another version of Python causes library errors. There should therefore be a script that configures runPipeline.py and createProject.py to check for Python 2.6 and use it, and reports an installation error otherwise.
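A minimal sketch of such a startup guard (the helper names are hypothetical, not actual metAMOS code):

```python
import sys

REQUIRED = (2, 6)  # metAMOS was written against Python 2.6

def check_python_version(required=REQUIRED):
    """Return True iff the running interpreter matches the required
    major.minor version."""
    return sys.version_info[:2] == tuple(required)

def ensure_python_version(required=REQUIRED):
    """Exit with an installation error when the interpreter is wrong;
    runPipeline.py / createProject.py would call this at startup."""
    if not check_python_version(required):
        sys.stderr.write("Installation error: metAMOS requires Python "
                         "%d.%d, found %d.%d\n"
                         % (tuple(required) + tuple(sys.version_info[:2])))
        sys.exit(1)
```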
Hi Todd,
I made some changes to the FragGeneScan portion of the findorfs.py that have yielded efficiency improvements by orders of magnitude.
I am not quite sure how to submit the code to the github so I will post here and leave it to your discretion to incorporate it (old code commented out for comparison).
I am still trying to get the pipeline to complete end-to-end on an actual data set; I will keep you posted.
Cheers!
for seq in seqs:
    hdr, gene = seq.split("\n", 1)
    #hdr = hdr.split("\n")[0]
    hdr = hdr.rstrip("\n")
    #gene_ids.append(hdr)
    # split the header in two
    orfkey = '_'.join(hdr.split('_')[:6])
    orfval = '_'.join(hdr.split('_')[7:])
    orfhdrs[orfkey] = orfval

# old code, commented out for comparison:
#for key in gene_ids:
#    genecnt = 1
#    gkey = ""
#    if not is_scaff:
#        for ckey in cvg_dict.keys():
#            if ckey in key:
#                gkey = ckey
#    if gkey != "":
#        cvgg.write("%s\t%s\n" % (key, cvg_dict[gkey]))
#    else:
#        cvgg.write("%s\t%s\n" % (key, 1.0))

for key in orfhdrs.keys():
    if key in cvg_dict:
        cvgg.write("%s\t%s\n" % (key + orfhdrs[key], cvg_dict[key]))
    else:
        cvgg.write("%s\t%s\n" % (key + orfhdrs[key], str(1.0)))
cvgg.close()
Check the run_process function; we may need to check a flag before calling communicate(). Look here for reference:
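A sketch of the kind of guard meant here, using a generic wrapper rather than metAMOS's actual run_process (names hypothetical): communicate() is only called when pipes were actually requested.

```python
import subprocess

def run_and_capture(cmd, capture=True):
    """Run a shell command; only call communicate() when we actually
    opened pipes, otherwise just wait for the process to finish."""
    stream = subprocess.PIPE if capture else None
    p = subprocess.Popen(cmd, shell=True, stdout=stream, stderr=stream)
    if capture:
        out, err = p.communicate()
    else:
        out, err = None, None
        p.wait()
    return p.returncode, out, err
```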
In some cases, the PhyloSift classification parser outputs invalid classifications of the form:
137672", "136916 91061
These cause downstream errors in Propagate.
When newbler runs, it can split a read within assembly into multiple pieces. Currently metAMOS will only take the first occurrence of the read.
This happens when only MetaPhyler is used for Annotate. ORF classifications should be propagated back to their scaffolds.
Library insert sizes must appear before the -1, -2, -s, -sm parameters; otherwise createProject will fail. It would be convenient to make these order independent, especially for backwards compatibility with previous test scripts/user scripts.
Propagate supports MetaPhyler and PhyloSift, but not BLAST, PHmmer, or PhymmBL. Add support for propagating their classifications.
Relative abundances all remain the same after inputting coverage using --coverage
Ideally we need to make both MacOSX & Linux 64 bit prebuilt binaries available for metAMOS.
The call is currently in Assemble, so it is never made if a contig file is provided (the Assemble step is skipped).
FindScaffoldORFs fails because it can't find scaffolding output even though it is in skip steps.
Add the following at line 1426 of runPipeline.py:
run_process("touch %s/Scaffold/out/%s.linearize.scaffolds.final"%(rundir, PREFIX), "Scaffold")
Always generate a fasta and a fastq file (no matter what the user inputs) so we can support assemblers on users' data (for example, Meta-IDBA on fastq files).
An updated version with a patch is available and should be included instead.
Need to check for this and change name if conflicts with internal name.
Need to pass -f flag to bowtie to map fasta representation of sff files.
Now that contig coverage is computed universally for all assemblers, this function should rely on our calculation of coverage instead of assembler-specific one.
Some tools can only accept interleaved files, some can only accept non-interleaved files. To support both tools, each library (regardless of how it was input) should have a corresponding interleaved and non-interleaved file generated by the pipeline.
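A rough sketch of the conversion helpers the pipeline would need, assuming well-formed 4-line FASTQ records in matching mate order (helper names are hypothetical):

```python
from itertools import islice

def interleave_fastq(r1_lines, r2_lines):
    """Merge two mate files (given as lists of lines) into one
    interleaved list of 4-line FASTQ records."""
    out = []
    it1, it2 = iter(r1_lines), iter(r2_lines)
    while True:
        rec1 = list(islice(it1, 4))
        rec2 = list(islice(it2, 4))
        if not rec1 or not rec2:
            break
        out.extend(rec1)
        out.extend(rec2)
    return out

def deinterleave_fastq(lines):
    """Split an interleaved list of lines back into two mate lists."""
    r1, r2 = [], []
    for i in range(0, len(lines), 8):
        r1.extend(lines[i:i + 4])
        r2.extend(lines[i + 4:i + 8])
    return r1, r2
```

With both helpers available, each input library can carry an interleaved and a non-interleaved representation regardless of how it was supplied.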
Suggest rename to initPipeline or initProject.
Amphora fails with:
"perl: symbol lookup error: /home/ondovb/metAMOS/Amphora-2/lib/auto/Math/Random/Random.so: undefined symbol: Perl_Tstack_sp_ptr"
on Linux 2.6.41.4-1.fc15.x86_64 x86_64 with Perl v5.12.4.
Fixed by running amphora_install.pl.
Mated FASTQ reads in non-interleaved format MUST be aligned, or all reads could potentially be discarded. We need to index read ids and check whether pairs are present in the file but slightly out of order; if so, fix the order. If a mate can't be found, discard the read or place it in the unpaired/unmated/frag file.
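A sketch of the pairing check described above (hypothetical helper; real metAMOS read ids may carry /1 and /2 suffixes that would need to be normalized first):

```python
def repair_pairs(records1, records2):
    """Given {read_id: record} dicts for the two mate files, return
    (paired_ids, orphan_ids): paired ids can be re-synced into matching
    order, orphans go to the unpaired/unmated/frag file."""
    paired = [rid for rid in records1 if rid in records2]
    orphans = [rid for rid in records1 if rid not in records2]
    orphans += [rid for rid in records2 if rid not in records1]
    return paired, orphans
```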
Error: CA not found in . It is needed to process SFF files. Please check your path and try again.
Currently, coverage is only tracked for genes found by a genecaller. The pipeline should also generate a file of coverages for each contig in the assembly.
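The per-contig file amounts to average depth = total aligned read bases divided by contig length; a minimal sketch (names and input shapes are illustrative, not the pipeline's actual data structures):

```python
def contig_coverage(contig_lengths, read_alignments):
    """Compute average depth per contig.
    contig_lengths: {contig_id: length in bp}
    read_alignments: iterable of (contig_id, aligned_length) pairs."""
    bases = dict.fromkeys(contig_lengths, 0)
    for cid, alen in read_alignments:
        if cid in bases:  # ignore alignments to unknown contigs
            bases[cid] += alen
    return dict((cid, bases[cid] / float(contig_lengths[cid]))
                for cid in contig_lengths)
```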
Need to unlock the bank after an asmQC failure.
bowtie is segfaulting during the scaffold step on a CentOS 5 system with the following kernel:
uname -a
Linux jumbo-0-1.merlot.genomecenter.ucdavis.edu 2.6.18-194.11.4.el5 #1 SMP Tue Sep 21 05:04:09 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Running the exact same bowtie command using bowtie-0.12.7 downloaded from sourceforge completes without error.
I'm not sure whether it's relevant, but there appears to be a difference in which libs were linked. bowtie in metamos has:
-bash-3.2$ ldd /home/koadman/software/metAMOS/Utilities/cpp/Linux-x86_64/bowtie
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000031d8c00000)
        libm.so.6 => /lib64/libm.so.6 (0x00000031d9000000)
        libc.so.6 => /lib64/libc.so.6 (0x00000031d8400000)
        /lib64/ld-linux-x86-64.so.2 (0x00000031d8000000)
bowtie from sf.net has:
-bash-3.2$ ldd ~/software/bowtie-0.12.7/bowtie
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000031d8c00000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000031db000000)
        libm.so.6 => /lib64/libm.so.6 (0x00000031d9000000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000031dac00000)
        libc.so.6 => /lib64/libc.so.6 (0x00000031d8400000)
        /lib64/ld-linux-x86-64.so.2 (0x00000031d8000000)
-bash-3.2$ file ~/software/bowtie-0.12.7/bowtie
/home/koadman/software/bowtie-0.12.7/bowtie: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
Stubs exist but more code is needed to parse output and integrate into metAMOS Annotate step
When running Newbler with paired-end data, the pairs do not make it into the Scaffolding step.
There is currently no way to make make local charts for isolated systems, since the krona folder doesn't include the web resources. Is it feasible to drop in the complete KronaTools directory, rather than the flat folder? I'm working on integrating the Amphora and Phmmer scripts, and the next release of KronaTools will have the ability to include the Perl module from anywhere, so that should make it easier to add more scripts in the future.
Test dataset of FCP (test_fcp) has no MetaPhyler classifications when using fraggenescan. However, it does have classifications when using metagenemark. Make sure the genes are being properly passed.
Currently, in 0.33, you cannot mix paired and unpaired data (such as by using the command -1 pairs_1.fq,unpaired.fq -2 pairs_2.fq). Support this functionality, as well as multiple file types (fasta and fastq) in one input.
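One way the mixed -1/-2 lists could be paired up (a sketch, not createProject's actual parser; extra files on either side are treated as unpaired):

```python
def parse_lib_args(arg1, arg2):
    """Pair comma-separated -1/-2 entries positionally; files present
    in only one of the two lists are treated as unpaired."""
    files1 = arg1.split(",") if arg1 else []
    files2 = arg2.split(",") if arg2 else []
    paired = list(zip(files1, files2))
    unpaired = files1[len(files2):] + files2[len(files1):]
    return paired, unpaired
```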
Files are converted to fasta & qual with CA but are then passed to Newbler and fail when filtering is enabled. The SFF file should be filtered by Newbler instead. CA also fails on the test example after converting to fasta & qual.
This is a metagenomic gene finder and support should be added soon. Some code stubs may be in place but input/output parsers and integration into the FindORFs step is required.
Warning: Illegal date format for -z/--timecond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
This is on ubuntu 10.04 LTS with curl version:
koadman@edhar:~$ curl --version
curl 7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
Protocols: tftp ftp telnet dict ldap ldaps http file https ftps
Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
Hi folks,
I was happy to get the pipeline to run end to end on a sub-sample of my data (10M paired reads).
I then attempted to run it on the entire data set of 76M paired reads.
It unfortunately crashed at the findORFS step.
The error log is at bottom of this message.
Here are my additional questions:
Here is the command used:
${metAMOS}/runPipeline -c amphora2 -d METAMOS_BS27FULL -g fraggenescan -k 43 -p 22 -a velvet 1> METAMOS_BS27FULL.run.out 2> METAMOS_BS27FULL.run.err &
Here is the STDERR log:
Job = [[SGI_BS27.1.fastq, SGI_BS27.2.fastq] -> preprocess.success] completed
Completed Task = preprocess.Preprocess
Job = [[lib1.seq] -> [proba.asm.contig]] completed
Completed Task = assemble.Assemble
Job = [proba.asm.contig -> proba.bout] completed
Completed Task = mapreads.MapReads
Traceback (most recent call last):
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/runPipeline", line 367, in
pipeline_run([preprocess.Preprocess,assemble.Assemble,findorfs.FindORFS, findreps.FindRepeats, annotate.Annotate, abundance.Abundance, scaffold.Scaffold, findscforfs.FindScaffoldORFS, propagate.Propagate, classify.Classify, postprocess.Postprocess], verbose = 1)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 2680, in pipeline_run
raise errt
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def findorfs.FindORFS(...):'
Original exception:
Exception #1
exceptions.ValueError(need more than 1 value to unpack):
for findorfs.FindORFS.Job = [proba.asm.contig -> proba.faa]
Traceback (most recent call last):
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 524, in run_pooled_job_without_exceptions
return t_job_result(task_name, JOB_COMPLETED, job_name, return_value, None)
File "/bio_bin/python26/lib/python2.6/contextlib.py", line 34, in __exit__
self.gen.throw(type, value, traceback)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 232, in do_nothing_semaphore
yield
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/src/findorfs.py", line 243, in FindORFS
parse_fraggenescanout("%s/FindORFS/out/%s.orfs"%(_settings.rundir,_settings.PREFIX))
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/src/findorfs.py", line 191, in parse_fraggenescanout
hdr,gene = seq.split("\n",1)
ValueError: need more than 1 value to unpack
Thanks
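For reference, the unguarded seq.split("\n", 1) in parse_fraggenescanout raises exactly this ValueError when a chunk contains no newline (e.g. an empty or header-only record). A defensive variant (hypothetical helper, not the committed fix) could look like:

```python
def split_record(seq):
    """Split one FASTA-style chunk into (header, body). Returns an
    empty body instead of raising ValueError when the chunk has no
    newline after the header."""
    parts = seq.split("\n", 1)
    if len(parts) == 2:
        return parts[0].rstrip("\n"), parts[1]
    return parts[0].rstrip("\n"), ""
```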
The scaffold file should point to the contigs in the case that no scaffolding was possible due to no mates.
Script to create directories calls python code that only accepts a Metaphyler output file. This needs to be changed to accept amphora2/PHMMER/BLAST output also.
Line 124 in annotate.py fails because, even when an annotation program is not available (e.g. PhyloSift), the user is still allowed to select it and attempt to run it.
The current Krona importer for Amphora 2 uses only species-level classifications to build the visualization. Update the importer to record the lowest accurately classified level (by contig) from Amphora instead.
Need to clobber the index if a new assembly/re-assembly is requested.
Hello, I am trying to run metAMOS on our UC Davis servers, and I keep getting a persistent error as follows (the error is repeatable regardless of which Illumina dataset I try to process - seems like a broken pipe somewhere?):
-bash-3.2$ initPipeline -1 Aphelenchus_1510-KO-4_L4_1.fastq -2 Aphelenchus_1510-KO-4_L4_2.fastq -d Aphelenhcus_4Mar -i 100:600 -q
Project dir /share/jumbo-0-1-scratch-2/hbik/Aphelenhcus_4Mar successfully created!
Use runPipeline.py to start Pipeline
-bash-3.2$ runPipeline -k 45 -d Aphelenhcus_4Mar/
Starting metAMOS pipeline
Warning: Newbler is not found, some functionality will not be available
Warning: FCP is not found, some functionality will not be available
Warning: PHmmer is not found, some functionality will not be available
Tasks which will be run:
Task = preprocess.Preprocess
Task = assemble.Assemble
Task = findorfs.FindORFS
Task = findreps.FindRepeats
Task = annotate.Annotate
Task = abundance.Abundance
Task = scaffold.Scaffold
Task = findscforfs.FindScaffoldORFS
Task = propagate.Propagate
Task = classify.Classify
Task = postprocess.Postprocess
Job = [[Aphelenchus_1510-KO-4_L4_1.fastq, Aphelenchus_1510-KO-4_L4_2.fastq] -> preprocess.success] completed
Completed Task = preprocess.Preprocess
Running SOAPdenovo on input reads...
Traceback (most recent call last):
File "/home/koadman/software/metAMOS/runPipeline", line 358, in
pipeline_run([preprocess.Preprocess,assemble.Assemble,findorfs.FindORFS, findreps.FindRepeats, annotate.Annotate, abundance.Abundance, scaffold.Scaffold, findscforfs.FindScaffoldORFS, propagate.Propagate, classify.Classify, postprocess.Postprocess], verbose = 1)
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 2680, in pipeline_run
raise errt
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def assemble.Assemble(...):'
Original exception:
Exception #1
exceptions.ValueError(invalid literal for int() with base 10: 'ggaggdfadae]gggggcggfdfefbgggaffdcdfdffdaggggggg_ggdgggfggggffffdggf_ggggggggggg'):
for assemble.Assemble.Job = [[lib1.seq] -> [proba.asm.contig]]
Traceback (most recent call last):
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "/home/koadman/software/metAMOS/src/assemble.py", line 464, in Assemble
map2contig()
File "/home/koadman/software/metAMOS/src/assemble.py", line 110, in map2contig
epos = int(spos)+len(read_seq)
ValueError: invalid literal for int() with base 10: 'ggaggdfadae]gggggcggfdfefbgggaffdcdfdffdaggggggg_ggdgggfggggffffdggf_ggggggggggg'
-bash-3.2$
The default for Illumina instruments is now gzipped fastq files. Support these files in the pipeline. Rather than extracting the file in initPipeline, it would be better if the file could remain gzipped as long as possible to save space.
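A sketch of a transparent opener the pipeline could use so callers never need to know whether the input was compressed (helper name is hypothetical):

```python
import gzip

def open_reads(path):
    """Open a reads file for text reading, detecting gzip by its
    magic bytes so inputs can stay compressed on disk."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")
```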
Rather than having AMOS/bin or cpp/ be the place for platform-dependent executables, they should instead reside in a platform-specific directory (i.e. Linux-amd64, Darwin-i386, etc).
Inefficient parsing of GeneMark output to create gene files & gene coverage file results in a bottleneck in the pipeline, requiring an hour (or more) to parse out 50-100K ORFs from the output file. Need to look closer into this to see if we can simply generate the files via the command line (don't think so) or more efficient parsing of output.
Not properly running stat generation upon completion of Pipeline.
There are no libX.seq/libX.seq.mates files created when running the pipeline on unmated fastq files. The following code should be added to line 1163 (in release 0.2):
elif lib.format == "fastq" and not lib.mated:
run_process("ln -s %s/Preprocess/in/%s %s/Preprocess/out/lib%d.seq"%(rundir, lib.fq, rundir, lib.id), "Preprocess")
run_process("touch %s/Preprocess/out/lib%d.seq.mates"%(rundir, lib.id), "Preprocess")
and line 1230 should be:
soapd = soapd.replace("LIB%dQ1REPLACE"%(lib.id),"%s/Preprocess/out/%s"%(rundir,lib.f1.fname))