marbl / metAMOS
A metagenomic and isolate assembly and analysis pipeline built with AMOS
Home Page: http://marbl.github.io/metAMOS
License: Other
Need to add code to call meta-IDBA and parse its output for Assemble step.
Need to add Annotate to task list, check for called programs in PATH/metAMOS PATH, if not there, exit.
metAMOS requires Python 2.6; running it with another version of Python causes library errors. There should therefore be a script that configures runPipeline.py and createProject.py to check for Python 2.6 and use it, and reports an installation error otherwise.
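A minimal sketch of such a startup guard (the helper names are hypothetical, not actual metAMOS code):

```python
import sys

REQUIRED = (2, 6)  # metAMOS was written against Python 2.6

def check_python_version(required=REQUIRED):
    """Return True iff the running interpreter matches the required
    major.minor version."""
    return sys.version_info[:2] == tuple(required)

def ensure_python_version(required=REQUIRED):
    """Exit with an installation error when the interpreter is wrong;
    runPipeline.py / createProject.py would call this at startup."""
    if not check_python_version(required):
        sys.stderr.write("Installation error: metAMOS requires Python "
                         "%d.%d, found %d.%d\n"
                         % (tuple(required) + tuple(sys.version_info[:2])))
        sys.exit(1)
```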
Hi Todd,
I made some changes to the FragGeneScan portion of the findorfs.py that have yielded efficiency improvements by orders of magnitude.
I am not quite sure how to submit the code to the github so I will post here and leave it to your discretion to incorporate it (old code commented out for comparison).
I am still trying to get the pipeline to complete end-to-end on an actual data set; I will keep you posted.
Cheers!
for seq in seqs:
    hdr, gene = seq.split("\n", 1)
    #hdr = hdr.split("\n")[0]
    hdr = hdr.rstrip("\n")
    #gene_ids.append(hdr)
    # split the header in two
    orfkey = '_'.join(hdr.split('_')[:6])
    orfval = '_'.join(hdr.split('_')[7:])
    orfhdrs[orfkey] = orfval

# old code, commented out for comparison:
#for key in gene_ids:
#    genecnt = 1
#    gkey = ""
#    if not is_scaff:
#        for ckey in cvg_dict.keys():
#            if ckey in key:
#                gkey = ckey
#    if gkey != "":
#        cvgg.write("%s\t%s\n" % (key, cvg_dict[gkey]))
#    else:
#        cvgg.write("%s\t%s\n" % (key, 1.0))

for key in orfhdrs.keys():
    if key in cvg_dict:
        cvgg.write("%s\t%s\n" % (key + orfhdrs[key], cvg_dict[key]))
    else:
        cvgg.write("%s\t%s\n" % (key + orfhdrs[key], str(1.0)))
cvgg.close()
Check the run_process function; we may need to check a flag before calling communicate(). Look here for reference:
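A sketch of the kind of guard meant here, using a generic wrapper rather than metAMOS's actual run_process (names hypothetical): communicate() is only called when pipes were actually requested.

```python
import subprocess

def run_and_capture(cmd, capture=True):
    """Run a shell command; only call communicate() when we actually
    opened pipes, otherwise just wait for the process to finish."""
    stream = subprocess.PIPE if capture else None
    p = subprocess.Popen(cmd, shell=True, stdout=stream, stderr=stream)
    if capture:
        out, err = p.communicate()
    else:
        out, err = None, None
        p.wait()
    return p.returncode, out, err
```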
In some cases, the PhyloSift classification parser outputs invalid classifications of the form:
137672", "136916 91061
These cause downstream errors in Propagate.
When newbler runs, it can split a read within assembly into multiple pieces. Currently metAMOS will only take the first occurrence of the read.
This happens when only MetaPhyler is used for Annotate. ORF classifications should be propagated back to their scaffolds.
Library insert sizes must appear before the -1, -2, -s, -sm parameters; otherwise createProject will fail. It would be convenient to make these order independent, especially for backwards compatibility with previous test scripts/user scripts.
Propagate supports MetaPhyler and PhyloSift, but not BLAST, PHmmer, or PhymmBL. Add support for propagating their classifications.
Relative abundances all remain the same after inputting coverage using --coverage
Ideally we need to make both MacOSX & Linux 64 bit prebuilt binaries available for metAMOS.
The call is currently in Assemble, so it is never made if a contig file is provided (the Assemble step is skipped).
FindScaffoldORFs fails because it can't find scaffolding output even though it is in skip steps.
Add the following at line 1426 of runPipeline.py:
run_process("touch %s/Scaffold/out/%s.linearize.scaffolds.final"%(rundir, PREFIX), "Scaffold")
Always generate a fasta and a fastq file (no matter what the user inputs) so we can support assemblers on users' data (for example, Meta-IDBA on fastq files).
An updated version with a patch is available and should be included instead.
Need to check for this and change name if conflicts with internal name.
Need to pass -f flag to bowtie to map fasta representation of sff files.
Now that contig coverage is computed universally for all assemblers, this function should rely on our calculation of coverage instead of assembler-specific one.
Some tools can only accept interleaved files, some can only accept non-interleaved files. To support both tools, each library (regardless of how it was input) should have a corresponding interleaved and non-interleaved file generated by the pipeline.
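A rough sketch of the conversion helpers the pipeline would need, assuming well-formed 4-line FASTQ records in matching mate order (helper names are hypothetical):

```python
from itertools import islice

def interleave_fastq(r1_lines, r2_lines):
    """Merge two mate files (given as lists of lines) into one
    interleaved list of 4-line FASTQ records."""
    out = []
    it1, it2 = iter(r1_lines), iter(r2_lines)
    while True:
        rec1 = list(islice(it1, 4))
        rec2 = list(islice(it2, 4))
        if not rec1 or not rec2:
            break
        out.extend(rec1)
        out.extend(rec2)
    return out

def deinterleave_fastq(lines):
    """Split an interleaved list of lines back into two mate lists."""
    r1, r2 = [], []
    for i in range(0, len(lines), 8):
        r1.extend(lines[i:i + 4])
        r2.extend(lines[i + 4:i + 8])
    return r1, r2
```

With both helpers available, each input library can carry an interleaved and a non-interleaved representation regardless of how it was supplied.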
Suggest rename to initPipeline or initProject.
Amphora fails with:
"perl: symbol lookup error: /home/ondovb/metAMOS/Amphora-2/lib/auto/Math/Random/Random.so: undefined symbol: Perl_Tstack_sp_ptr"
on Linux 2.6.41.4-1.fc15.x86_64 x86_64 with Perl v5.12.4.
Fixed by running amphora_install.pl.
Mated FASTQ reads in non-interleaved format MUST be aligned, or all reads could potentially be discarded. We need to index read ids and check whether pairs are present in the file but slightly out of order; if so, fix the order. If a mate can't be found, discard the read or place it in the unpaired/unmated/frag file.
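A sketch of the pairing check described above (hypothetical helper; real metAMOS read ids may carry /1 and /2 suffixes that would need to be normalized first):

```python
def repair_pairs(records1, records2):
    """Given {read_id: record} dicts for the two mate files, return
    (paired_ids, orphan_ids): paired ids can be re-synced into matching
    order, orphans go to the unpaired/unmated/frag file."""
    paired = [rid for rid in records1 if rid in records2]
    orphans = [rid for rid in records1 if rid not in records2]
    orphans += [rid for rid in records2 if rid not in records1]
    return paired, orphans
```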
Error: CA not found in . It is needed to process SFF files. Please check your path and try again.
Currently, coverage is only tracked for genes found by a genecaller. The pipeline should also generate a file of coverages for each contig in the assembly.
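The per-contig file amounts to average depth = total aligned read bases divided by contig length; a minimal sketch (names and input shapes are illustrative, not the pipeline's actual data structures):

```python
def contig_coverage(contig_lengths, read_alignments):
    """Compute average depth per contig.
    contig_lengths: {contig_id: length in bp}
    read_alignments: iterable of (contig_id, aligned_length) pairs."""
    bases = dict.fromkeys(contig_lengths, 0)
    for cid, alen in read_alignments:
        if cid in bases:  # ignore alignments to unknown contigs
            bases[cid] += alen
    return dict((cid, bases[cid] / float(contig_lengths[cid]))
                for cid in contig_lengths)
```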
Need to unlock the bank after an asmQC failure.
bowtie is segfaulting during the scaffold step on a CentOS 5 system with the following kernel:
uname -a
Linux jumbo-0-1.merlot.genomecenter.ucdavis.edu 2.6.18-194.11.4.el5 #1 SMP Tue Sep 21 05:04:09 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Running the exact same bowtie command using bowtie-0.12.7 downloaded from sourceforge completes without error.
I'm not sure whether it's relevant, but there appears to be a difference in which libs were linked. bowtie in metamos has:
-bash-3.2$ ldd /home/koadman/software/metAMOS/Utilities/cpp/Linux-x86_64/bowtie
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000031d8c00000)
        libm.so.6 => /lib64/libm.so.6 (0x00000031d9000000)
        libc.so.6 => /lib64/libc.so.6 (0x00000031d8400000)
        /lib64/ld-linux-x86-64.so.2 (0x00000031d8000000)
bowtie from sf.net has:
-bash-3.2$ ldd ~/software/bowtie-0.12.7/bowtie
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000031d8c00000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000031db000000)
        libm.so.6 => /lib64/libm.so.6 (0x00000031d9000000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000031dac00000)
        libc.so.6 => /lib64/libc.so.6 (0x00000031d8400000)
        /lib64/ld-linux-x86-64.so.2 (0x00000031d8000000)
-bash-3.2$ file ~/software/bowtie-0.12.7/bowtie
/home/koadman/software/bowtie-0.12.7/bowtie: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
Stubs exist but more code is needed to parse output and integrate into metAMOS Annotate step
When running Newbler with paired-end data, the pairs do not make it into the Scaffolding step.
There is currently no way to make make local charts for isolated systems, since the krona folder doesn't include the web resources. Is it feasible to drop in the complete KronaTools directory, rather than the flat folder? I'm working on integrating the Amphora and Phmmer scripts, and the next release of KronaTools will have the ability to include the Perl module from anywhere, so that should make it easier to add more scripts in the future.
Test dataset of FCP (test_fcp) has no MetaPhyler classifications when using fraggenescan. However, it does have classifications when using metagenemark. Make sure the genes are being properly passed.
Currently, in 0.33, you cannot mix paired and unpaired data (such as by using the command -1 pairs_1.fq,unpaired.fq -2 pairs_2.fq). Support this functionality, as well as multiple file types (fasta and fastq) in one input.
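One way the mixed -1/-2 lists could be paired up (a sketch, not createProject's actual parser; extra files on either side are treated as unpaired):

```python
def parse_lib_args(arg1, arg2):
    """Pair comma-separated -1/-2 entries positionally; files present
    in only one of the two lists are treated as unpaired."""
    files1 = arg1.split(",") if arg1 else []
    files2 = arg2.split(",") if arg2 else []
    paired = list(zip(files1, files2))
    unpaired = files1[len(files2):] + files2[len(files1):]
    return paired, unpaired
```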
Files are converted to fasta & qual with CA but are then passed to Newbler and fail when filtering is enabled. The SFF file should be filtered by Newbler instead. CA also fails on the test example after converting to fasta & qual.
This is a metagenomic gene finder and support should be added soon. Some code stubs may be in place but input/output parsers and integration into the FindORFs step is required.
Warning: Illegal date format for -z/--timecond (and not a file name).
Warning: Disabling time condition. See curl_getdate(3) for valid date syntax.
This is on ubuntu 10.04 LTS with curl version:
koadman@edhar:~$ curl --version
curl 7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
Protocols: tftp ftp telnet dict ldap ldaps http file https ftps
Features: GSS-Negotiate IDN IPv6 Largefile NTLM SSL libz
Hi folks,
I was happy to get the pipeline to run end to end on a sub-sample of my data (10M paired reads).
I then attempted to run it on the entire data set of 76M paired reads.
It unfortunately crashed at the findORFS step.
The error log is at bottom of this message.
Here are my additional questions:
Here is the command used:
${metAMOS}/runPipeline -c amphora2 -d METAMOS_BS27FULL -g fraggenescan -k 43 -p 22 -a velvet 1> METAMOS_BS27FULL.run.out 2> METAMOS_BS27FULL.run.err &
Here is the STDERR log:
Job = [[SGI_BS27.1.fastq, SGI_BS27.2.fastq] -> preprocess.success] completed
Completed Task = preprocess.Preprocess
Job = [[lib1.seq] -> [proba.asm.contig]] completed
Completed Task = assemble.Assemble
Job = [proba.asm.contig -> proba.bout] completed
Completed Task = mapreads.MapReads
Traceback (most recent call last):
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/runPipeline", line 367, in
pipeline_run([preprocess.Preprocess,assemble.Assemble,findorfs.FindORFS, findreps.FindRepeats, annotate.Annotate, abundance.Abundance, scaffold.Scaffold, findscforfs.FindScaffoldORFS, propagate.Propagate, classify.Classify, postprocess.Postprocess], verbose = 1)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 2680, in pipeline_run
raise errt
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def findorfs.FindORFS(...):'
Original exception:
Exception #1
exceptions.ValueError(need more than 1 value to unpack):
for findorfs.FindORFS.Job = [proba.asm.contig -> proba.faa]
Traceback (most recent call last):
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 524, in run_pooled_job_without_exceptions
return t_job_result(task_name, JOB_COMPLETED, job_name, return_value, None)
File "/bio_bin/python26/lib/python2.6/contextlib.py", line 34, in __exit__
self.gen.throw(type, value, traceback)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 232, in do_nothing_semaphore
yield
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/Utilities/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/src/findorfs.py", line 243, in FindORFS
parse_fraggenescanout("%s/FindORFS/out/%s.orfs"%(_settings.rundir,_settings.PREFIX))
File "/bioinformatics/asm/bio_bin/metAMOS/metAMOS-6b17a08-0.35/src/findorfs.py", line 191, in parse_fraggenescanout
hdr,gene = seq.split("\n",1)
ValueError: need more than 1 value to unpack
Thanks
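For reference, the unguarded seq.split("\n", 1) in parse_fraggenescanout raises exactly this ValueError when a chunk contains no newline (e.g. an empty or header-only record). A defensive variant (hypothetical helper, not the committed fix) could look like:

```python
def split_record(seq):
    """Split one FASTA-style chunk into (header, body). Returns an
    empty body instead of raising ValueError when the chunk has no
    newline after the header."""
    parts = seq.split("\n", 1)
    if len(parts) == 2:
        return parts[0].rstrip("\n"), parts[1]
    return parts[0].rstrip("\n"), ""
```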
The scaffold file should point to the contigs in the case that no scaffolding was possible due to no mates.
Script to create directories calls python code that only accepts a Metaphyler output file. This needs to be changed to accept amphora2/PHMMER/BLAST output also.
Line 124 in annotate.py fails because, even when an annotation program is not available (e.g. PhyloSift), the user is still allowed to select it and attempt to run it.
The current Krona importer for Amphora 2 uses only species-level classifications to build the visualization. Update the importer to record the lowest accurately classified level (by contig) from Amphora instead.
Need to clobber the index if a new assembly/re-assembly is requested.
Hello, I am trying to run metAMOS on our UC Davis servers, and I keep getting a persistent error as follows (the error is repeatable regardless of which Illumina dataset I try to process - seems like a broken pipe somewhere?):
-bash-3.2$ initPipeline -1 Aphelenchus_1510-KO-4_L4_1.fastq -2 Aphelenchus_1510-KO-4_L4_2.fastq -d Aphelenhcus_4Mar -i 100:600 -q
Project dir /share/jumbo-0-1-scratch-2/hbik/Aphelenhcus_4Mar successfully created!
Use runPipeline.py to start Pipeline
-bash-3.2$ runPipeline -k 45 -d Aphelenhcus_4Mar/
Starting metAMOS pipeline
Warning: Newbler is not found, some functionality will not be available
Warning: FCP is not found, some functionality will not be available
Warning: PHmmer is not found, some functionality will not be available
Tasks which will be run:
Task = preprocess.Preprocess
Task = assemble.Assemble
Task = findorfs.FindORFS
Task = findreps.FindRepeats
Task = annotate.Annotate
Task = abundance.Abundance
Task = scaffold.Scaffold
Task = findscforfs.FindScaffoldORFS
Task = propagate.Propagate
Task = classify.Classify
Task = postprocess.Postprocess
Job = [[Aphelenchus_1510-KO-4_L4_1.fastq, Aphelenchus_1510-KO-4_L4_2.fastq] -> preprocess.success] completed
Completed Task = preprocess.Preprocess
Running SOAPdenovo on input reads...
Traceback (most recent call last):
File "/home/koadman/software/metAMOS/runPipeline", line 358, in
pipeline_run([preprocess.Preprocess,assemble.Assemble,findorfs.FindORFS, findreps.FindRepeats, annotate.Annotate, abundance.Abundance, scaffold.Scaffold, findscforfs.FindScaffoldORFS, propagate.Propagate, classify.Classify, postprocess.Postprocess], verbose = 1)
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 2680, in pipeline_run
raise errt
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def assemble.Assemble(...):'
Original exception:
Exception #1
exceptions.ValueError(invalid literal for int() with base 10: 'ggaggdfadae]gggggcggfdfefbgggaffdcdfdffdaggggggg_ggdgggfggggffffdggf_ggggggggggg'):
for assemble.Assemble.Job = [[lib1.seq] -> [proba.asm.contig]]
Traceback (most recent call last):
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "/home/koadman/software/metAMOS/Utilities/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "/home/koadman/software/metAMOS/src/assemble.py", line 464, in Assemble
map2contig()
File "/home/koadman/software/metAMOS/src/assemble.py", line 110, in map2contig
epos = int(spos)+len(read_seq)
ValueError: invalid literal for int() with base 10: 'ggaggdfadae]gggggcggfdfefbgggaffdcdfdffdaggggggg_ggdgggfggggffffdggf_ggggggggggg'
-bash-3.2$
The default for Illumina instruments is now gzipped fastq files. Support these files in the pipeline. Rather than extracting the file in initPipeline, it would be better if the file could remain gzipped as long as possible to save space.
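A sketch of a transparent opener the pipeline could use so callers never need to know whether the input was compressed (helper name is hypothetical):

```python
import gzip

def open_reads(path):
    """Open a reads file for text reading, detecting gzip by its
    magic bytes so inputs can stay compressed on disk."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")
```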
Rather than having AMOS/bin or cpp/ be the place for platform-dependent executables, they should instead reside in a platform-specific directory (i.e. Linux-amd64, Darwin-i386, etc).
Inefficient parsing of GeneMark output to create gene files & gene coverage file results in a bottleneck in the pipeline, requiring an hour (or more) to parse out 50-100K ORFs from the output file. Need to look closer into this to see if we can simply generate the files via the command line (don't think so) or more efficient parsing of output.
Not properly running stat generation upon completion of Pipeline.
There are no libX.seq/libX.seq.mates files created when running the pipeline on unmated fastq files. The following code should be added to line 1163 (in release 0.2):
elif lib.format == "fastq" and not lib.mated:
run_process("ln -s %s/Preprocess/in/%s %s/Preprocess/out/lib%d.seq"%(rundir, lib.fq, rundir, lib.id), "Preprocess")
run_process("touch %s/Preprocess/out/lib%d.seq.mates"%(rundir, lib.id), "Preprocess")
and line 1230 should be:
soapd = soapd.replace("LIB%dQ1REPLACE"%(lib.id),"%s/Preprocess/out/%s"%(rundir,lib.f1.fname))