geronimp / graftm

GraftM - Rapid community profiles from metagenomes

Home Page: http://geronimp.github.io/graftM/

License: GNU General Public License v3.0

Python 99.82% Shell 0.18%

graftm's Issues

graftm create should create a HMM when an alignment is provided

s4383937@mrca002:~/mingle/mtd$ ~uqbwoodc/git/graftM/bin/graftM create --taxonomy homologs.tax2tree.rerooted.decorated.tree-consensus-strings --alignment homologs.trimmed.aligned.faa --tree the.tree --tree_log homologs.tree.log --output mtd.gpkg

                            CREATE

                   Joel Boyd, Ben Woodcroft

                                                    /                
              >a                                   /
              -------------                       /            
              >b                        |        |
              --------          >>>     |  GPKG  |
              >c                        |________|
              ----------     

05/29/2015 02:51:09 PM INFO: Building gpkg for homologs
05/29/2015 02:51:09 PM INFO: Building seqinfo and taxonomy file
05/29/2015 02:51:09 PM INFO: Creating reference package
warning: rank species not represented in the lineage of any sequence in reference package mtd.gpkg.
05/29/2015 02:51:10 PM INFO: Compiling gpkg
Traceback (most recent call last):
  File "/srv/whitlam/home/users/uqbwoodc/git/graftM/bin/graftM", line 174, in <module>
    graftm.run.Run(args).main()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 403, in main
    self.args.taxonomy, self.args.tree, self.args.tree_log, self.args.output)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/create.py", line 181, in main
    self.compile(base, refpkg, hmm, contents, prefix)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/create.py", line 108, in compile
    shutil.copyfile(hmm, os.path.join(gpkg, os.path.basename(hmm)))
  File "/srv/sw/python/2.7.4/lib/python2.7/posixpath.py", line 121, in basename
    i = p.rfind('/') + 1
AttributeError: 'NoneType' object has no attribute 'rfind'
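
The traceback shows `compile` receiving `hmm=None` when only `--alignment` is given. One possible way to cover that case is to build the HMM from the alignment with HMMER's `hmmbuild` before compiling. This is a minimal sketch, not graftM's code: `ensure_hmm` and its arguments are hypothetical names, it assumes `hmmbuild` can read the alignment format supplied, and the runner is injectable so the logic can be exercised without HMMER installed.

```python
import os
import subprocess


def ensure_hmm(hmm, alignment, output_dir, run=subprocess.check_call):
    """Return a path to an HMM, building one from the alignment if needed.

    Hypothetical helper: `hmm` may be None when the user only supplied
    an alignment, which is the case that crashed in the traceback above.
    """
    if hmm is not None:
        return hmm
    if alignment is None:
        raise ValueError("Either an HMM or an alignment must be provided")
    base = os.path.splitext(os.path.basename(alignment))[0]
    built = os.path.join(output_dir, base + ".hmm")
    # hmmbuild usage: hmmbuild <hmmfile_out> <msafile>
    run(["hmmbuild", built, alignment])
    return built
```

With this in place, `compile` would always receive a real path, so `os.path.basename(hmm)` could no longer fail on `None`.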

graftm shouldn't require an evalue to be given, it should allow usage of the threshold set in the HMM file

ie. the TC (trusted cutoff) defined in the HMM file.
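
A rough sketch of how the threshold choice could work, assuming a HMMER3-format HMM whose header may carry a `TC` line and assuming `hmmsearch --cut_tc` is used to apply it. The function names and the fallback policy are hypothetical, not graftM's behaviour.

```python
def trusted_cutoff(hmm_text):
    """Return the per-sequence trusted cutoff from HMM header text, or None.

    Assumes a HMMER3-style header line such as 'TC    25.00 25.00;'.
    """
    for line in hmm_text.splitlines():
        if line.startswith("HMM "):
            # The header is over once the model body starts.
            break
        if line.startswith("TC "):
            return float(line.split()[1].rstrip(";"))
    return None


def threshold_args(hmm_text, evalue=None):
    """Pick hmmsearch threshold arguments (hypothetical policy)."""
    if evalue is not None:
        # An explicit e-value from the user always wins.
        return ["-E", str(evalue)]
    if trusted_cutoff(hmm_text) is not None:
        return ["--cut_tc"]
    # Fall back to the e-value seen in the logs elsewhere in this document.
    return ["-E", "1e-5"]
```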

sometimes sample names clash

ben@u:~/git/graftM$ bin/graftM graft --graftm_package test/data/mcrA.gpkg/ --forward test/data/mcrA.gpkg/mcrA_1.1.fna test/data/mcrA.gpkg/mcrA_1.2.fna --output t --force; cat t/combined_count_table.txt 

                                GRAFT

                       Joel Boyd, Ben Woodcroft

                                                         __/__
                                                  ______|
          _- - _                         ________|      |_____/
           - -            -             |        |____/_
           - _     >>>>  -   >>>>   ____|          
          - _-  -         -             |      ______
             - _                        |_____|
           -                                  |______

06/05/2015 10:04:01 AM INFO: Working on mcrA_1
06/05/2015 10:04:01 AM INFO: Searching mcrA_1.1.fna
06/05/2015 10:04:01 AM INFO: 1 read(s) found
06/05/2015 10:04:01 AM INFO: Aligning reads to reference package database
06/05/2015 10:04:01 AM INFO: Working on mcrA_1
06/05/2015 10:04:01 AM INFO: Searching mcrA_1.2.fna
06/05/2015 10:04:01 AM INFO: 1 read(s) found
06/05/2015 10:04:01 AM INFO: Aligning reads to reference package database
06/05/2015 10:04:02 AM INFO: Placing reads into phylogenetic tree
06/05/2015 10:04:02 AM INFO: Placements finished
06/05/2015 10:04:02 AM INFO: Reading classifications
06/05/2015 10:04:02 AM INFO: Reads classified.
06/05/2015 10:04:02 AM INFO: Building summary table
06/05/2015 10:04:04 AM INFO: Building summary krona plot
06/05/2015 10:04:04 AM INFO: Cleaning up
06/05/2015 10:04:04 AM INFO: Done, thanks for using graftM!

#ID mcrA_1  ConsensusLineage
0   1   Root; mcrA; Euryarchaeota_mcrA; Methanomicrobia; Methanosarcinales; Methanosarcinaceae; Methanosarcina

There are two input samples, not one, so there should be two count columns.

@geronimp I'm working on #40 (biom) atm, so don't be tempted to fix just yet pls.
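
For illustration only, here is one way clashing names could be detected and disambiguated. It assumes the clash arises because both file names reduce to the same stem (`mcrA_1`) when everything after the first dot is dropped; that is a guess from the log above, and `sample_names` is a hypothetical helper.

```python
import os
from collections import Counter


def sample_names(paths):
    """Derive unique sample names from read file paths.

    Clashing stems fall back to the file name minus its final extension;
    anything still ambiguous is reported rather than silently merged.
    """
    stems = [os.path.basename(p).split(".")[0] for p in paths]
    counts = Counter(stems)
    names = []
    for path, stem in zip(paths, stems):
        if counts[stem] > 1:
            names.append(os.path.splitext(os.path.basename(path))[0])
        else:
            names.append(stem)
    if len(set(names)) != len(names):
        raise ValueError("Sample names are not unique: %s" % ", ".join(names))
    return names
```

Under this scheme the two files in the command above would become `mcrA_1.1` and `mcrA_1.2`, giving two count columns.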

graftM create seems to ignore the --min_aligned_percent flag

Hello Joel and Ben,

I'm having problems getting graftM create to make a tree when I provide an HMM and do not want sequences removed from the tree. This is the command used:
graftM create --sequences homologues_cut.faa --hmm /srv/projects/abisko/Caitlin/graftM_packages_cait/0_graftm_packages_current/pmoA_deduplicated_051015.gpkg/graftm45y8od.hmm --rerooted_annotated_tree ../homologs.tax2tree.tree --output large_tree_pmoA --min_aligned_percent 1 (this happens when I use 0% too)

And this is some of the log output:
10/08/2015 09:25:18 AM INFO: Building gpkg for large_tree_pmoA
10/08/2015 09:25:21 AM INFO: Filtered 502 short sequences from the alignment
10/08/2015 09:25:21 AM INFO: 1120 sequences remaining
10/08/2015 09:25:21 AM INFO: Checking for incorrect or fragmented reads
10/08/2015 09:25:22 AM INFO: Building seqinfo and taxonomy file from input annotated tree
10/08/2015 09:25:22 AM ERROR: Unable to find sequence 'contig_285233_1_1_1_Root_d_Bacteria_Methylococcales_pmoA' in the taxonomy definition
10/08/2015 09:25:22 AM ERROR: Unable to find sequence 'contig_3335073_186_6_12_Root_d_Bacteria_Alphaproteobacteria_pmoA_Methylocystaceae_pmoA1_f_Methylocystaceae_g_Methylosinus' in the taxonomy definition
10/08/2015 09:25:22 AM ERROR: Unable to find sequence 'contig_615983_66_3_6_Root_d_Bacteria_pxmA' in the taxonomy definition

It may be something I am doing during the processing - I am doing my best to look into it!
Thanks :)
Caitlin

need to make installation easier

Just had a discussion with @minillinim about install, he had some useful things to say.

Need to make it work with python setup.py install and pip install graftm. To do this we might use https://github.com/minillinim/OutstandingKnife
Be good to make all dependencies pip dependencies too e.g. orfm and fxtract. This implies writing new wrapping code
To test the install, there is a bare ubuntu virtual machine on the servers we can clone. When we install it successfully on there we'll have some confidence in our install procedure, because there are only the bare bones of a computer setup on it.
Need to write an install (and usage) manual, and readme

graftm create shouldn't create diamond db files for nucleotide packages

should handle proteins as input

A possibly sizeable feature request - it would be useful to be able to pipe proteins in. Some outputs would need to be removed e.g. the output file of nucleotide sequences.

euk contamination hmm shouldn't be hardcoded

Hmmer.py:            nhmmer_cmd = "nhmmer --cpu %s %s --tblout %s /srv/db/graftm/0.1/HMM/Euk.hmm " % (threads, eval, out_table)
Hmmer.py:            cmd = "nhmmer --cpu %s %s --tblout %s /srv/db/graftm/0.1/HMM/Euk.hmm %s 2>&1 > /dev/null " % (threads, eval, out_table, reads)

Maybe we need some kind of 'GRAFTM_DB' environment variable or something, where all the gpkgs live and then the user can specify them by name rather than path, and euk 18S can just be one of those. Just to make things easier. It should also be possible to specify the euk HMM directly on the cmdline though.
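
A small sketch of the lookup this suggests, assuming packages are directories inside the folder named by `GRAFTM_DB` and may be given with or without a `.gpkg` suffix. `resolve_gpkg` is a hypothetical name and the suffix rule is an assumption.

```python
import os


def resolve_gpkg(name_or_path, environ=os.environ):
    """Resolve a package given by path or by name (hypothetical helper).

    An existing path wins; otherwise the name is looked up in the
    directory named by the GRAFTM_DB environment variable.
    """
    if os.path.exists(name_or_path):
        return os.path.abspath(name_or_path)
    db_dir = environ.get("GRAFTM_DB")
    if db_dir:
        for candidate in (name_or_path, name_or_path + ".gpkg"):
            path = os.path.join(db_dir, candidate)
            if os.path.exists(path):
                return path
    raise IOError("Could not find graftM package '%s'" % name_or_path)
```

The euk 18S HMM could then be just another named package, while an explicit path on the command line would still take priority.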

cluster files are not cleaned up

the leftover

2501025505_clustered.fa

files in the output directory are probably not necessary, yeh?

missing file from repo?

got this error, does it need to be git added?

$ nosetests -x
Traceback (most recent call last):
  File "/home/ben/git/graftM/test/../bin/graftM", line 166, in <module>
    Run(args).main()
  File "/home/ben/git/graftM/bin/../graftm/run.py", line 22, in __init__
    self.setattributes(self.args)
  File "/home/ben/git/graftM/bin/../graftm/run.py", line 31, in setattributes
    self.hk.set_attributes(self.args)
  File "/home/ben/git/graftM/bin/../graftm/housekeeping.py", line 211, in set_attributes
    raise Exception("%s does not exist. Are you sure you provided the correct path?" % args.graftm_package)
Exception: /home/ben/git/graftM/test/data/61_otus.gpkg does not exist. Are you sure you provided the correct path?
E
======================================================================
ERROR: test_concatenated_OTU_table (test_graftm.Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ben/git/graftM/test/test_graftm.py", line 738, in test_concatenated_OTU_table
    subprocess.check_output(cmd, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '/home/ben/git/graftM/test/../bin/graftM graft --forward /home/ben/git/graftM/test/sample_runs/sample_16S_1.1.fa /home/ben/git/graftM/test/sample_runs/sample_16S_2.1.fa --graftm_package /home/ben/git/graftM/test/data/61_otus.gpkg --output_directory /tmp/tmpj9C7Fe --force' returned non-zero exit status 1

----------------------------------------------------------------------
Ran 1 test in 0.092s

FAILED (errors=1)

error with error message

uqbwoodc@mrca003:20150220:/srv/projects/abisko/shotgun_abundance/53_doe_talk_functional_gene_pca/bclA/graftm$ graftM --forward /srv/projects/abisko/data/flat20150213/20120800_S1M.1.fq.gz --search_only --threads 40 --output_stats 20120800_S1M --hmm_file ../AAN32623.1.mingle/homologs.faa.aligned.hmm  --type P
Traceback (most recent call last):
  File "/srv/sw/graftm/0.4.0/bin/graftM", line 47, in <module>
    Run(args).main()
  File "/srv/sw/graftm/0.4.0/graftm/Run.py", line 33, in __init__
    self.HK.set_attributes(self.args)
  File "/srv/sw/graftm/0.4.0/graftm/HouseKeeping.py", line 116, in set_attributes
    Messenger().message('ERROR: %s is empty or misformatted.' % args.graftm_package)
AttributeError: 'Namespace' object has no attribute 'graftm_package'

graftM create builds an HMM from sequences that have not been deduplicated

Hello Ben,

GraftM create builds the HMM using the sequences provided, and seems to be resulting in an HMM biased to duplicates.

Thanks,
Caitlin
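
As a sketch of what deduplicating before building the HMM might look like, assuming sequences are available as (name, sequence) pairs and that "duplicate" means an identical sequence string. The function name is hypothetical.

```python
def deduplicate(records):
    """Collapse identical sequences, keeping the first name seen.

    `records` is an iterable of (name, sequence) pairs. Returns the
    unique records and a dict mapping each kept name to the names of
    its duplicates, so they can be restored later if needed.
    """
    kept = {}        # sequence -> name of the first record carrying it
    duplicates = {}  # kept name -> list of duplicate names
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key in kept:
            duplicates[kept[key]].append(name)
        else:
            kept[key] = name
            duplicates[name] = []
            unique.append((name, seq))
    return unique, duplicates
```

Only `unique` would be passed to the HMM-building step, so heavily duplicated sequences no longer outweigh the rest.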

use faster nhmmer

As of hmmer3.1b2 nhmmer can be much faster if you follow the directions at http://cryptogenomicon.org/2015/04/15/hmmer-3-1-beta-test-2-released/

To do

Update the code to use this by default, including in graftm create
Remake the nucleotide packages with makehmmerdb
This needs testing in terms of the numbers of reads detected, as well.

graftm create should rejig the log file so it just works

Currently need to run something like this

NOTE: homologs.tree.log is actually an output file

FastTreeMP -nome -mllen -intree homologs.tax2tree.rerooted2.no_header.tree -log homologs.tree.log <homologs.trimmed.aligned.faa >/dev/null
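
One way `graftm create` could do this step itself is to wrap the command above. The sketch below uses only the flags quoted in that command; the function names are hypothetical and the runner is injectable so it can be tried without FastTree installed.

```python
import os
import subprocess


def fasttree_log_command(tree, log, executable="FastTreeMP"):
    """Build the command quoted above; `log` is written, not read."""
    return [executable, "-nome", "-mllen", "-intree", tree, "-log", log]


def make_tree_log(tree, alignment, log, run=subprocess.check_call):
    """Regenerate a FastTree log for an existing tree (hypothetical wrapper)."""
    # The alignment is fed on stdin and the tree on stdout is discarded,
    # mirroring the shell redirections in the original command.
    with open(alignment) as aln, open(os.devnull, "w") as null:
        run(fasttree_log_command(tree, log), stdin=aln, stdout=null)
    return log
```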

defaults are not shown with graftM -h

This just makes things a little less confusing for the user.
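
Since the command line is built with argparse, one low-effort option might be `argparse.ArgumentDefaultsHelpFormatter`, which appends the default to every option that has help text. The options and default values below are illustrative assumptions, not graftM's actual settings.

```python
import argparse

# ArgumentDefaultsHelpFormatter adds "(default: ...)" to each help string.
parser = argparse.ArgumentParser(
    prog="graftM graft",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument("--threads", type=int, default=5,
                    help="number of threads to use")
parser.add_argument("--evalue", default="1e-5",
                    help="e-value cutoff for the search")

help_text = parser.format_help()
```

Note that the formatter only annotates arguments that define `help`, so options without help text would still show no default.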

graftm create should handle taxonomy strings that have missing parts in the middle

It should try to fill in the middle parts where possible, or croak if this is not consistent with other known taxonomic info.

This is a potential cause of the 'null placements' issue

when --extract_alignment_from_nhmmer is specified, the number of reads is not reported

Is graftM create --graftm_package working?

Any ideas?

uqbwoodc@mrca003:20151112:/srv/projects/abisko/shotgun_abundance/93_singlem_better_pkgs$ ~/git/graftM/bin/graftM create --graftm_package /srv/db/graftm/4/4.07.ribosomal_protein_L2_rplB.gpkg --search_hmm_files ../82_methanogen_singlem_issue/kingdom_specific_search_hmms/hmm_creation/DNGNGWU00010_mingle_output_good_seqs.hmm_search_archaea.hmm ../82_methanogen_singlem_issue/kingdom_specific_search_hmms/hmm_creation/DNGNGWU00010_mingle_output_good_seqs.hmm_search_bacteria.hmm --output test.gpkg --verbosity 5

                             GraftM 0.8.1
11/12/2015 03:01:30 PM DEBUG: Ran command: /srv/whitlam/home/users/uqbwoodc/git/graftM/bin/graftM create --graftm_package /srv/db/graftm/4/4.07.ribosomal_protein_L2_rplB.gpkg --search_hmm_files ../82_methanogen_singlem_issue/kingdom_specific_search_hmms/hmm_creation/DNGNGWU00010_mingle_output_good_seqs.hmm_search_archaea.hmm ../82_methanogen_singlem_issue/kingdom_specific_search_hmms/hmm_creation/DNGNGWU00010_mingle_output_good_seqs.hmm_search_bacteria.hmm --output test.gpkg --verbosity 5

                            CREATE

                   Joel Boyd, Ben Woodcroft

                                                    /
              >a                                   /
              -------------                       /
              >b                        |        |
              --------          >>>     |  GPKG  |
              >c                        |________|
              ----------

uqbwoodc@mrca003:20151112:/srv/projects/abisko/shotgun_abundance/93_singlem_better_pkgs$ echo $?
0
uqbwoodc@mrca003:20151112:/srv/projects/abisko/shotgun_abundance/93_singlem_better_pkgs$ ls
uqbwoodc@mrca003:20151112:/srv/projects/abisko/shotgun_abundance/93_singlem_better_pkgs$

hmmalign/seqmagick step should be abstracted into a separate class

Currently this is duplicated in create and graft.

graft should also be able to handle gapped inputs, e.g. an alignment given as --forward. This is fixed in create (soon), but not in graft.

--output_stats should be renamed, and it should work

Easy enough to change the argparse setup, but it seems to not actually use the directory as output

key error

Any ideas?

uqbwoodc@keating:20150806:/tmp$ PATH=~/git/graftM/bin:$PATH graftM graft --threads 15 --forward /srv/projects/share/for_Joel/bin55.fa  --search_hmm_files /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_L11_rplK_gpkg/DNGNGWU00024.hmm /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S17_gpkg/DNGNGWU00036.hmm --search_and_align_only --aln_hmm_file /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm --verbosity 5 --input_sequence_type nucleotide

                             GraftM 0.7.0

                                GRAFT

                       Joel Boyd, Ben Woodcroft

                                                         __/__
                                                  ______|
          _- - _                         ________|      |_____/
           - -            -             |        |____/_
           - _     >>>>  -   >>>>   ____|
          - _-  -         -             |      ______
             - _                        |_____|
           -                                  |______

08/06/2015 03:27:27 PM DEBUG: Creating working directory: GraftM_output
08/06/2015 03:27:27 PM DEBUG: HMM type: P Trusted Cutoff: None
08/06/2015 03:27:27 PM DEBUG: Working with 1 file(s)
08/06/2015 03:27:27 PM INFO: Working on bin55
08/06/2015 03:27:27 PM DEBUG: Running protein pipeline
08/06/2015 03:27:27 PM DEBUG: Detected file format FORMAT_FASTA
08/06/2015 03:27:27 PM DEBUG: raw read unpacking command chunk: cat /srv/projects/share/for_Joel/bin55.fa
08/06/2015 03:27:27 PM DEBUG: Detected sequence type as nucleotide
08/06/2015 03:27:27 PM DEBUG: Using 3 HMMs to search
08/06/2015 03:27:27 PM DEBUG: OrfM command chunk: orfm  -m 96 /srv/projects/share/for_Joel/bin55.fa
08/06/2015 03:27:27 PM DEBUG: Running command: orfm  -m 96 /srv/projects/share/for_Joel/bin55.fa | tee >(hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00001_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -) | tee >(hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00024_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_L11_rplK_gpkg/DNGNGWU00024.hmm -) | hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00036_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S17_gpkg/DNGNGWU00036.hmm -
08/06/2015 03:27:27 PM DEBUG: Running commandeer cmd: orfm  -m 96 /srv/projects/share/for_Joel/bin55.fa | tee >(hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00001_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -) | tee >(hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00024_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_L11_rplK_gpkg/DNGNGWU00024.hmm -) | hmmsearch -E 1e-5 --cpu 5 -o /dev/null --noali --domtblout GraftM_output/bin55/DNGNGWU00036_bin55.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S17_gpkg/DNGNGWU00036.hmm -
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: fxtract -H -X -f /tmp/graftm_readnamesEuAr3x  /srv/projects/share/for_Joel/bin55.fa > /tmp/_raw_extracted_reads.faxZiXui
08/06/2015 03:27:28 PM DEBUG: OrfM command chunk: orfm  -m 96 
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: orfm  -m 96  GraftM_output/bin55/bin55_hits.fa > /tmp/orf_hmmsearchGbkFoo
08/06/2015 03:27:28 PM DEBUG: Running command: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchqxbmux /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchqxbmux /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -
08/06/2015 03:27:28 PM DEBUG: Running command: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchdlJsCs /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_L11_rplK_gpkg/DNGNGWU00024.hmm -
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchdlJsCs /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_L11_rplK_gpkg/DNGNGWU00024.hmm -
08/06/2015 03:27:28 PM DEBUG: Running command: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchzknwr5 /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S17_gpkg/DNGNGWU00036.hmm -
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: cat /tmp/orf_hmmsearchGbkFoo | hmmsearch  --cpu 1 -o /dev/null --noali --domtblout /tmp/orf_hmmsearchzknwr5 /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S17_gpkg/DNGNGWU00036.hmm -
08/06/2015 03:27:28 PM DEBUG: Running commandeer cmd: fxtract -H -X -f /tmp/orf_titles8DbenX /tmp/orf_hmmsearchGbkFoo > GraftM_output/bin55/bin55_orf.fa
08/06/2015 03:27:28 PM INFO: Aligning reads to reference package database
Traceback (most recent call last):
  File "/srv/whitlam/home/users/uqbwoodc/git/graftM/bin/graftM", line 231, in <module>
    graftm.run.Run(args).main()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 394, in main
    self.graft()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 319, in graft
    self._get_sequence_directions(result.search_result)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 917, in align
    directions)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 64, in hmmalign
    if directions[read_id] == True:
KeyError: 'bin55_contig_12071_22985_5_390'

then

uqbwoodc@keating:20150806:/tmp$ grep contig_12071 GraftM_output/bin55/*
GraftM_output/bin55/bin55_hits.fa:>bin55_contig_12071
GraftM_output/bin55/bin55_orf.fa:>bin55_contig_12071_22438_4_386
GraftM_output/bin55/bin55_orf.fa:>bin55_contig_12071_22985_5_390
GraftM_output/bin55/DNGNGWU00001_bin55.hmmout.csv:bin55_contig_12071_22438_4_386 -            310 DNGNGWU00001         -            250   4.9e-96  318.4   0.1   1   1   8.6e-99   6.8e-96  317.9   0.1    21   239    25   243    20   270 0.96 -

So not a hit, but still translated.

graftm create should use more tempfiles

There's a danger in this part of the code (perhaps among others) that graftm might silently overwrite or delete files at these paths, if the user had already created them before using graftm. Probably safer to use tempfiles instead.

    def alignSequences(self, hmm, sequences, base): 
        stockholm_alignment = base +".aln.sto" # Set an output path for the alignment
        fasta_alignment = base+".insertions.aln.fa" # Set an output path for the alignment
        corrected_fasta_alignment = base+".aln.fa" # Set an output path for the alignment
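
A minimal sketch of the tempfile approach, assuming the three intermediate files are only needed while the alignment step runs. `alignment_tempfiles` is a hypothetical helper; it places the files in a fresh scratch directory so nothing derived from `base` can collide with a user's existing files.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def alignment_tempfiles():
    """Yield three scratch paths for the alignment steps, then remove them."""
    directory = tempfile.mkdtemp(prefix="graftm_create_")
    paths = [os.path.join(directory, name)
             for name in ("aln.sto", "insertions.aln.fa", "aln.fa")]
    try:
        yield paths
    finally:
        # Only the scratch directory is removed; user files are untouched.
        shutil.rmtree(directory, ignore_errors=True)
```

Inside `alignSequences` this could be used as `with alignment_tempfiles() as (stockholm_alignment, fasta_alignment, corrected_fasta_alignment):`, with the final result copied out before the block ends.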

fasttree shouldn't be required for graftm create when --tree is specified

This was a little hard to change in the code as it is currently structured in #27

I think it might be good to make a DependencyChecker class that keeps a list of particular binaries e.g. FastTree hashed to the URL where they can be installed from. Then use it like this:

dependencies = ['taxit', 'seqmagick']
if not args.tree:
    dependencies.append('FastTree')
DependencyChecker().check_dependencies(dependencies)

That last line prints URLs and exits if something is missing - the URLs are specified as a constant in DependencyChecker.

The purist in me dislikes instantiating a new DependencyChecker in that last line, but there's no other solution I can think of that is any better. The important thing is that there is too much repetition of pre-req URLs in housekeeping.py, and the logic is spread out between there and run.py, making it a bit annoying to modify ie. don't check for fasttree if it isn't needed.
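
The class described above might look roughly like the sketch below. It is only an outline under assumptions: it uses `shutil.which`, which needs Python 3.3 or later (the tracebacks in these issues come from Python 2.7), the method names beyond `check_dependencies` are hypothetical, and the URL table should be checked against what housekeeping.py currently lists.

```python
import shutil
import sys


class DependencyChecker(object):
    # Binary name -> where it can be installed from (to be verified).
    URLS = {
        "FastTree": "http://www.microbesonline.org/fasttree/",
        "taxit": "https://github.com/fhcrc/taxtastic",
        "seqmagick": "https://github.com/fhcrc/seqmagick",
    }

    def missing(self, dependencies):
        """Return the binaries that cannot be found on the PATH."""
        return [d for d in dependencies if shutil.which(d) is None]

    def check_dependencies(self, dependencies):
        """Print install URLs for missing binaries and exit if any are missing."""
        missing = self.missing(dependencies)
        for name in missing:
            sys.stderr.write("Missing dependency %s, install from %s\n"
                             % (name, self.URLS.get(name, "(URL not recorded)")))
        if missing:
            sys.exit(1)
```

Keeping the URLs in one constant means run.py only has to decide which binaries a given subcommand needs.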

no such thing as graftM extract

Got this error when running graftm extract. Does graftm extract even exist?

$:~/projects/mg/graftm_results$ graftM -h

                                       GraftM  0.5.0 asdf

             A suite of tools for the rapid analysis of large sequence datasets.

                                Joel Boyd, Ben Woodcroft

=====================================================================================================
COMMUNITY PROFILING

    graft       -       Search for and phylogenetically classify reads associated with a single
                        marker gene, and construct a community profile
                        e.g. usage:
                            $ graftM graft --forward <READS> --graftm_package <GRAFTM_PACKAGE>

=====================================================================================================
UTILITIES


    extract     -       Tools to manage the output of GraftM such as:
                        e.g. usage:
                            $ graftM extract --profile <GRAFTM_PROFILE> --lineage <LINEAGE>

    create      -       Create a graftM package of from aligned sequences or a hmm, and a
                        taxonomy file.

                        e.g. usage
                            > With aligned sequences:
                            $ graftm create --sequences <SEQUENCES> --alignment <ALIGNED_SEQUENCES>
                              --taxonomy <GREENGENES_FORMAT_TAXONOMY>

                            > With a HMM:
                            $ graftm create --sequences <SEQUENCES> --hmm <HMM>
                              --taxonomy <GREENGENES_FORMAT_TAXONOMY>

=====================================================================================================

$:~/projects/mg/graftm_results$ graftM extract
usage: graftM [--version] {graft,filter,assemble,manage,create,pathfinder} ...
graftM: error: argument subparser_name: invalid choice: 'extract' (choose from 'graft', 'filter', 'assemble', 'manage', 'create', 'pathfinder')

Deal with crazy characters in the tree names

Currently graftM create will fail when tip names contain odd characters such as set(",:_;()[]") since FastTree needs unquoted tip names (currently in wwood/create_fixes, presumably will be merged soon).

To fix this we might need to match the alignment and tree names before creating the log file, and change them into something saner (while still unique). The workaround is just to rename the sequences in the alignment manually.
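
A possible shape for the renaming step, assuming input names are unique and that a hyphen is acceptable to the downstream tools (that second point is an assumption and would need checking). The character set is the one quoted above, plus whitespace; `sanitise_names` is a hypothetical helper.

```python
ODD_CHARACTERS = set(",:_;()[]")


def sanitise_names(names):
    """Map each original name to a saner name that is still unique."""
    mapping = {}
    used = set()
    for name in names:
        base = "".join("-" if (c in ODD_CHARACTERS or c.isspace()) else c
                       for c in name)
        candidate, n = base, 1
        # Add a numeric suffix until the cleaned name is unused.
        while candidate in used:
            n += 1
            candidate = "%s-%d" % (base, n)
        used.add(candidate)
        mapping[name] = candidate
    return mapping
```

The same mapping would be applied to both the alignment and the tree, and inverted afterwards if the original names need to appear in the package.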

missing sequences bug

$ PATH=~/git/graftM/bin:$PATH graftM graft --threads 15 --verbosity 2 --forward /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa --graftm_package /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg --output_directory /tmp/outer --input_sequence_type nucleotide --force --verbosity 5

                             GraftM 0.7.0

                                GRAFT

                       Joel Boyd, Ben Woodcroft

                                                         __/__
                                                  ______|
          _- - _                         ________|      |_____/
           - -            -             |        |____/_
           - _     >>>>  -   >>>>   ____|
          - _-  -         -             |      ______
             - _                        |_____|
           -                                  |______

08/07/2015 11:05:41 AM DEBUG: Creating working directory: /tmp/outer
08/07/2015 11:05:41 AM DEBUG: HMM type: P Trusted Cutoff: None
08/07/2015 11:05:41 AM DEBUG: Working with 1 file(s)
08/07/2015 11:05:41 AM INFO: Working on AB_C2_C3_C4_comb_assembly_bin4_hits
08/07/2015 11:05:41 AM DEBUG: Running protein pipeline
08/07/2015 11:05:41 AM DEBUG: Detected file format FORMAT_FASTA
08/07/2015 11:05:41 AM DEBUG: raw read unpacking command chunk: cat /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa
08/07/2015 11:05:41 AM DEBUG: Detected sequence type as nucleotide
08/07/2015 11:05:41 AM DEBUG: Using 1 HMMs to search
08/07/2015 11:05:41 AM DEBUG: OrfM command chunk: orfm  -m 96 /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa
08/07/2015 11:05:41 AM DEBUG: Running command: orfm  -m 96 /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa | hmmsearch -E 1e-5 --cpu 15 -o /dev/null --noali --domtblout /tmp/outer/AB_C2_C3_C4_comb_assembly_bin4_hits/AB_C2_C3_C4_comb_assembly_bin4_hits.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -
08/07/2015 11:05:41 AM DEBUG: Running commandeer cmd: orfm  -m 96 /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa | hmmsearch -E 1e-5 --cpu 15 -o /dev/null --noali --domtblout /tmp/outer/AB_C2_C3_C4_comb_assembly_bin4_hits/AB_C2_C3_C4_comb_assembly_bin4_hits.hmmout.csv /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm -
08/07/2015 11:05:41 AM DEBUG: Running commandeer cmd: fxtract -H -X -f /tmp/graftm_readnamesL2GYCj  /srv/projects/share/for_Joel/AB_C2_C3_C4_comb_assembly_bin4_hits.fa > /tmp/_raw_extracted_reads.faMfB1Cg
08/07/2015 11:05:41 AM DEBUG: OrfM command chunk: orfm  -m 96 
08/07/2015 11:05:41 AM INFO: Aligning reads to reference package database
08/07/2015 11:05:41 AM DEBUG: Running commandeer cmd: hmmalign --trim /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm /tmp/outer/AB_C2_C3_C4_comb_assembly_bin4_hits/AB_C2_C3_C4_comb_assembly_bin4_hits_orf.fa | seqmagick convert --input-format stockholm - /tmp/for_conv_fileEhAYYE.fa
08/07/2015 11:05:41 AM INFO: Placing reads into phylogenetic tree
08/07/2015 11:05:41 AM DEBUG: Running commandeer cmd: pplacer -j 15 --verbosity 0 --out-dir /tmp/outer -c /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.refpkg /tmp/outer/combined_alignment.aln.fa
Traceback (most recent call last):
  File "/srv/whitlam/home/users/uqbwoodc/git/graftM/bin/graftM", line 231, in <module>
    graftm.run.Run(args).main()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 396, in main
    self.graft()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 361, in graft
    gpkg.taxonomy_info_path()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/pplacer.py", line 112, in place
    jplace = self.pplacer(files.jplace_output_path(), args.output_directory, files.comb_aln_fa(), args.threads)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/pplacer.py", line 28, in pplacer
    extern.run(cmd)
  File "/srv/sw/extern/0.0.4/lib/python2.7/site-packages/extern/__init__.py", line 30, in run
    stdout)
extern.ExternCalledProcessError: Command pplacer -j 15 --verbosity 0 --out-dir /tmp/outer -c /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.refpkg /tmp/outer/combined_alignment.aln.fa returned non-zero exit status 1.
STDERR was: STDOUT was: Sequence length cut to 0 by pre-masking; can't proceed with no information.

Possibly because of the unusual hmmsearch output?

$ cat /tmp/outer/AB_C2_C3_C4_comb_assembly_bin4_hits/AB_C2_C3_C4_comb_assembly_bin4_hits.hmmout.csv 
#                                                                                   --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name               accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#       ------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
bin4_contig_71_93849_3_1167 -            326 DNGNGWU00001         -            250  2.6e-103  343.5   5.8   1   2  2.7e-107  2.6e-103  343.5   5.8     1   248    15   262    15   266 0.96 -
bin4_contig_71_93849_3_1167 -            326 DNGNGWU00001         -            250  2.6e-103  343.5   5.8   2   2      0.27   2.7e+03   -3.8   4.0   236   248   293   305   272   323 0.47 -
#
# Program:         hmmsearch
# Version:         3.1b2 (February 2015)
# Pipeline mode:   SEARCH
# Query file:      /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm
# Target file:     -
# Option settings: hmmsearch -o /dev/null --domtblout /tmp/outer/AB_C2_C3_C4_comb_assembly_bin4_hits/AB_C2_C3_C4_comb_assembly_bin4_hits.hmmout.csv --noali -E 1e-5 --cpu 15 /srv/sw/singlem/0.0.0.dev/singlem/../db/ribosomal_protein_S2_rpsB_gpkg/DNGNGWU00001.hmm - 
# Current dir:     /srv/projects/aom/aom-acetate/steve_analyses/binning/metabat_specific/current_bins
# Date:            Fri Aug  7 11:05:41 2015
# [ok]
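
If the two domain rows for the same ORF do turn out to be the cause (that is only the hypothesis raised above), one option is to keep a single best domain per target before aligning. The sketch below parses `--domtblout` rows by whitespace, taking the i-Evalue from column 13 and the alignment coordinates from columns 18 and 19 as laid out in the header shown; `best_domains` is a hypothetical helper.

```python
def best_domains(domtblout_lines):
    """Keep the lowest i-Evalue domain row per target sequence.

    Returns {target name: (i-Evalue, ali from, ali to)}.
    """
    best = {}
    for line in domtblout_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target = fields[0]
        i_evalue = float(fields[12])
        ali_from, ali_to = int(fields[17]), int(fields[18])
        if target not in best or i_evalue < best[target][0]:
            best[target] = (i_evalue, ali_from, ali_to)
    return best
```

For the output above this would keep only the first row (alignment span 15 to 262) and drop the weak second domain with i-Evalue 2.7e+03.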

basic stats file not tallying as expected

e.g.

Basic run statistics (count):
Files:  73.20100900_E1D.fasta.metabat-bins-.14  73.20100900_E1D.fasta.metabat-bins-.19  73.20100900_E1D.fasta.metabat-bins-.29  73.20100900_E1D.fasta.metabat-bins-.45  73.20100900_E1D.fasta.metabat-bins-.52  73.20
reads detected: 1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
reads placed in tree:   1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1

if create is given short sequences and an annotated tree, then a bad error message is given

The error is correct but a little unhelpful. It would be better to try to remake the tree and see whether the topology has changed; if it has not, create can just continue on merrily. The workaround is to remake the tree again and reannotate without including the offending sequence(s).

graftM create --alignment ../homologs.trimmed.aligned.faa --rerooted_annotated_tree ../reordered_rerooted_pmoA.tree --sequences ../homologs.faa
...
08/31/2015 11:50:37 AM INFO: Building gpkg for homologs.gpkg
08/31/2015 11:50:38 AM INFO: Checking for incorrect or fragmented reads
08/31/2015 11:50:38 AM WARNING: One or more alignments do not span > 50.00 % of HMM
08/31/2015 11:50:38 AM WARNING: Insufficient alignment of 119352925_ABL64049_Clonothrix_Vigliotta, not including this sequence
08/31/2015 11:50:38 AM INFO: After removing 1 insufficiently aligned sequences, left with 178 sequences
08/31/2015 11:50:38 AM INFO: Building seqinfo and taxonomy file from input annotated tree
08/31/2015 11:50:38 AM INFO: Deduplicating sequences
08/31/2015 11:50:38 AM INFO: Removed 53 sequences as duplicates, leaving 125 non-identical sequences

Traceback (most recent call last):

  File "/srv/sw/graftm/0.7.0/bin/graftM", line 232, in <module>
    Run(args).main()
  File "/srv/sw/graftm/0.7.0/bin/../graftm/run.py", line 529, in main
    force = self.args.force
  File "/srv/sw/graftm/0.7.0/bin/../graftm/create.py", line 408, in main
    [g[0].name for g in deduplicated_arrays], tree)
  File "/srv/sw/graftm/0.7.0/bin/../graftm/tree_cleaner.py", line 36, in match_alignment_and_tree_sequence_ids
    raise Exception("Sequence '%s' was found in the tree but not the alignment" % name)

Exception: Sequence '119352925_ABL64049_Clonothrix_Vigliotta' was found in the tree but not the alignment
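
A more helpful message could list every mismatched ID at once and say why a tip might be missing. Below is a minimal sketch, assuming the alignment names and tree tip names are already available as plain lists of strings; `find_id_mismatches` and `mismatch_message` are hypothetical helpers, not functions in graftM's tree_cleaner.py.

```python
def find_id_mismatches(alignment_names, tree_tip_names):
    """Return (tips missing from the alignment, alignment names missing from the tree)."""
    aln = set(alignment_names)
    tips = set(tree_tip_names)
    return sorted(tips - aln), sorted(aln - tips)


def mismatch_message(alignment_names, tree_tip_names):
    """Build one message covering all mismatches, or None if the IDs agree."""
    only_in_tree, only_in_alignment = find_id_mismatches(alignment_names, tree_tip_names)
    lines = []
    if only_in_tree:
        lines.append(
            "%d sequence(s) are in the tree but not the alignment "
            "(they may have been removed as insufficiently aligned or as duplicates): %s"
            % (len(only_in_tree), ", ".join(only_in_tree)))
    if only_in_alignment:
        lines.append(
            "%d sequence(s) are in the alignment but not the tree: %s"
            % (len(only_in_alignment), ", ".join(only_in_alignment)))
    return "\n".join(lines) if lines else None
```

Reporting the whole set in one go means the user can remake and reannotate the tree once, rather than hitting the same exception again for the next missing sequence.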

basic stats file output a bit off topic for searching protein gpkgs

ben@u:~/git/graftM$ bin/graftM graft --graftm_package test/data/mcrA.gpkg/ --forward test/data/mcrA.gpkg/mcrA_1.1.fna test/data/mcrA.gpkg/mcrA_2.1.fna --output t_new --force
...
ben@u:~/git/graftM/t_new$ cat basic_stats.txt 
Basic run statistics (count):

                             Files: mcrA_1  mcrA_2
Total number of 16S reads detected: 1   1
Total number of 18S reads detected: N/AN/A  N/AN/A
    'Contaminant' eukaryotic reads: N/A N/A
              reads placed in tree: 2


Runtime (seconds):
                             Files: mcrA_1  mcrA_2
                       Search step: 0   0
                    Alignment step: 0   0
              Eukaryote check step: N/A N/A

               Tree insertion step: 0
                 Summarising steps: 0
                     Total runtime: 1

Maybe 1 format for the 16S/18S pipeline, and another, smaller, one for everything else?

Error when using diamond search method

Hello,

Traceback error when using --search_method diamond. The command run was: graftM graft --forward /srv/projects/abisko/aterrible_bins/10_assembly73_individuals20150928/bins/*.fa --graftm_package /srv/projects/abisko/Caitlin/graftM_packages_cait/0_graftm_packages_current/20121111_nifH.gpkg/ --output_directory 20151112_nifH_diamond --search_method diamond --threads 20 --force &> nifH.log

Traceback (most recent call last):
  File "/srv/sw/graftm/dev_rochelle/bin/graftM", line 287, in <module>
    Run(args).main()
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/run.py", line 510, in main
    self.graft()
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/run.py", line 322, in graft
    diamond_db
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/hmmer.py", line 803, in aa_db_search
    hit_reads_orfs_fasta)
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/hmmer.py", line 931, in search_and_extract_orfs_matching_protein_database
    SequenceSearchResult.QUERY_TO_FIELD])
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/hmmer.py", line 592, in _extract_orfs
    entry=sequence_frame_info_dict[record.id]
KeyError: 'contig_8516_split_1'

Thanks,
Caitlin

graftm create overwrites files sometimes

e.g. if the alignment is in the current directory.

graftm create doesn't classify down to species level

The input taxonomy does have entries down to species level, e.g.

$ head -n1 ../autotax.tsv 
UncA2677        K__dArchaea.4; P__dArchaea.11; C__dArchaea.9; O__dArchaea.8; F__dArchaea.7; G__dArchaea.6; S__dArchaea.6

                            CREATE

                   Joel Boyd, Ben Woodcroft

                                                    /                
              >a                                   /
              -------------                       /            
              >b                        |        |
              --------          >>>     |  GPKG  |
              >c                        |________|
              ----------     
[15:54:42]: Building gpkg for silva_20150312_hits
        30-04-2015  [15:54:42]: Using sequences previously aligned to HMM: ../GraftMSilva97tree2tax.gpkg/GraftMSilva97tree2tax/silva_20150312_hits.aln.TnotU.fa
        30-04-2015  [15:54:42]: Building seqinfo and taxonomy file
        30-04-2015  [15:54:47]: Creating reference package
warning: rank species not represented in the lineage of any sequence in reference package silva_20150312_hits.
        30-04-2015  [15:55:47]: Compiling gpkg

output seqinfo:

$ head silva_20150312_hits.gpkg/silva_20150312_hits.refpkg/silva_20150312_hits_seqinfo.csv
seqname,tax_id
UncA2677,G__dArchaea.6
UniArch7,G__pAAG.3
Unc01qz1,G__pAAG.2
Unc01gc0,G__pAAG.1
UncA9114,G__pAAG.1
Unc01sgh,G__pAAG.1
Unc01gm2,G__pAAG.1
UncA8682,G__pAAG.1
Unc01hk8,G__dArchaea.9

When the input file is in the current working directory, and -o is specified, the input file gets moved

I guess it is due to this part of the code (the mv)

# HMM aligning
def hmmalign(hmm, sequencefile):
    base_name = replace_name(sequencefile, '')  # Create a base name for output files
    subprocess.check_call("mkdir -p " + cl.o[0], shell=True)
    subprocess.check_call("mv " + base_name + ".* " + cl.o[0], shell=True)
    subprocess.check_call('hmmalign --trim -o ' + base_name + '.sto ' + hmm + ' ' + cl.o[0] + '/' + sequencefile, shell=True)
    subprocess.check_call('seqmagick convert ' + base_name + '.sto ' + base_name + '_conv_.fa', shell=True)
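
One way to avoid moving the input at all is to compute every output path inside the output directory and leave the input where it is. This is only a sketch: `output_path_for` and `hmmalign_command` are hypothetical names, the "strip the last extension" naming rule is an assumption rather than what `replace_name` actually does, and the hmmalign flags are the ones already used in the snippet above.

```python
import os


def output_path_for(sequence_file, output_directory, suffix):
    # Drop the directory and the final extension of the input file, then
    # place the new file name inside the output directory.
    stem = os.path.splitext(os.path.basename(sequence_file))[0]
    return os.path.join(output_directory, stem + suffix)


def hmmalign_command(hmm, sequence_file, output_directory):
    # Only the -o target points into the output directory; the input
    # sequence file is read from its original location and never moved.
    sto_path = output_path_for(sequence_file, output_directory, ".sto")
    return ["hmmalign", "--trim", "-o", sto_path, hmm, sequence_file]
```

Passing the command as a list to `subprocess.check_call` (without `shell=True`) also sidesteps problems with spaces or glob characters in file names.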

doesn't deal with interleaved paired sequence files

Get an error like this:

STDERR was: Traceback (most recent call last):
  File "/srv/whitlam/home/users/uqbwoodc/git/graftM/bin/graftM", line 233, in <module>
    Run(args).main()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 483, in main
    self.graft()
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/run.py", line 314, in graft
    diamond_db
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/timeit.py", line 10, in timed
    result = method(*args, **kw)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 745, in aa_db_search
    hit_reads_orfs_fasta)
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 849, in search_and_extract_orfs_matching_protein_database
    hits
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 455, in _extract_from_raw_reads
    complement_info = self._extract_multiple_hits(hits, tmp.name, output_path)  # split them into multiple reads
  File "/srv/home/uqbwoodc/git/graftM/bin/../graftm/hmmer.py", line 393, in _extract_multiple_hits
    reads = SeqIO.to_dict(SeqIO.parse(reads_path, "fasta"))  # open up reads as dictionary
  File "/srv/sw/biopython/1.63/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SeqIO/__init__.py", line 716, in to_dict
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'D8QSB6V1:137:H9FH8ADXX:1:1101:8497:3847'
STDOUT was:
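
The duplicate key arises because both mates of an interleaved pair carry the same read name. A possible approach, sketched here under the assumption that read names can be collected as a list before building the dictionary, is to detect repeats and give each occurrence a numbered suffix. `has_duplicate_names` and `uniquify_read_names` are hypothetical helpers, not part of graftM or Biopython.

```python
def has_duplicate_names(names):
    """True if any read name occurs more than once (e.g. interleaved pairs)."""
    names = list(names)
    return len(set(names)) != len(names)


def uniquify_read_names(names):
    """Append /1, /2, ... to each name according to how often it has been seen."""
    seen = {}
    renamed = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        renamed.append("%s/%d" % (name, seen[name]))
    return renamed
```

The renaming would only be applied when `has_duplicate_names` returns True, so ordinary single-ended files keep their original read names.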

Code cleanup: a class that handles the naming

as opposed to constantly calling replace_name everywhere. e.g. a class that is created after arguments have been parsed, and then methods of this class can return specific filenames e.g.

import os

class GraftMFiles:
    def __init__(self, forward_reads_file, output_directory):
        self.basename = ... #basename of path similar to how replace_name works
        self.output_directory = output_directory  # all output files go under this directory

    def forward_read_hmmsearch_output_path(self):
        return os.path.join(self.output_directory, "%s.hmmout.csv" % self.basename)

Doing it this way is more "don't repeat yourself" and so helps eliminate typo bugs (in general it is bad practice to have strings repeated in the code). Doing it this way will help with #4 as well, if all the files are always in the output directory.

Error with graftM graft --for/rev/merge when reads in forward but not reverse

Hello,

Seems that if only a few reads are picked up, and if they are found only in the forward (or only in the reverse) then it crashes. If reads are found in both the forward and reverse it's fine, if no reads are found in either it's also fine.

Command: ls /srv/projects/abisko/data/flat20150213/ | grep .1.fq.gz | sed 's/.1.fq.gz//g'| parallel -j5 graftM graft --forward /srv/projects/abisko/data/flat20150213/{}.1.fq.gz --reverse /srv/projects/abisko/data/flat20150213/{}.2.fq.gz --merge_reads --graftm_package /srv/projects/abisko/Caitlin/graftM_packages_cait/0_graftm_packages_current/20121111_archaeal_amoA.gpkg --threads 4 --force --output_directory {} &> archaea.log ; finish

Traceback:
11/12/2015 05:34:34 PM INFO: Working on 20110600_E1M.1
11/12/2015 05:34:34 PM INFO: Working on forward reads
11/12/2015 05:36:25 PM INFO: 2 read(s) detected
11/12/2015 05:36:25 PM INFO: Aligning reads to reference package database
11/12/2015 05:36:25 PM INFO: Filtered 0 short sequences from the alignment
11/12/2015 05:36:25 PM INFO: 2 sequences remaining
11/12/2015 05:36:25 PM INFO: Working on reverse reads
11/12/2015 05:38:23 PM INFO: No reads found in 20110600_E1M.1
Traceback (most recent call last):
  File "/srv/sw/graftm/dev_rochelle/bin/graftM", line 287, in <module>
    Run(args).main()
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/run.py", line 510, in main
    self.graft()
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/run.py", line 408, in graft
    seqs_list=C.cluster(seqs_list, REVERSE_PIPE)
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/clusterer.py", line 83, in cluster
    reads=self.seqio.read_fasta_file(input_fasta) # Read in FASTA records
  File "/srv/sw/graftm/dev_rochelle/bin/../graftm/sequence_io.py", line 44, in read_fasta_file
    for name, seq, _ in self._readfq(open(path_to_fasta_file)):
IOError: [Errno 2] No such file or directory: '20110600_E1M/20110600_E1M.1/20110600_E1M.1_hits.aln.fa'

Thanks!
Caitlin

Doesn't deal well with large numbers of input files

When many input files are provided, they are processed one by one, and the bottleneck is orfm. If multiple orfm instances were run simultaneously instead, we could squeeze out a bit more performance. Maybe not the easiest thing to implement.

Or maybe orfm itself could just be made faster, though I think that requires multi-threading it, which is not trivial.
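
A rough sketch of the first option, running the per-file step for several inputs at once. `process_one` is a hypothetical callable standing in for whatever runs orfm on a single file (for example by launching a subprocess); nothing here reflects how graftM currently invokes orfm. Threads suffice for this pattern when the real work happens in external processes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_files_concurrently(paths, process_one, max_workers=4):
    """Apply process_one to every path, running up to max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(process_one, paths))
    return dict(zip(paths, results))
```

For example, `process_files_concurrently(files, run_orfm_on, max_workers=args.threads)` would return a mapping from each input file to whatever the hypothetical `run_orfm_on` returns for it.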

missing file from repo?

In newest master

Traceback (most recent call last):
  File "/home/ben/git/graftM/bin/graftM", line 26, in <module>
    import graftm.run
  File "/home/ben/git/graftM/bin/../graftm/run.py", line 9, in <module>
    from graftm.hmmer import Hmmer
  File "/home/ben/git/graftM/bin/../graftm/hmmer.py", line 15, in <module>
    from graftm.readHmmTable import HMMreader
ImportError: No module named readHmmTable

graftm graft requires mafft, when it is unneeded

doesn't fail gracefully when an old version of hmmer is used

from @fauziharoon running 3.0:

$ graftM graft --forward ../prinseq_trimmed_metagenome_reads/sample01.1.prinseq.fasta --threads 20 --graftm_package ~/DB/graftm_packages/16S_82_hmmaligned_gpkg

                                GRAFT

                       Joel Boyd, Ben Woodcroft

                                                         __/__
                                                  ______|
          _- - _                         ________|      |_____/
           - -            -             |        |____/_
           - _     >>>>  -   >>>>   ____|
          - _-  -         -             |      ______
             - _                        |_____|
           -                                  |______


[14:49:20]: Working on sample01
        06-05-2015  [14:49:20]: Searching sample01.1.prinseq.fasta
        06-05-2015  [14:49:45]: 5389 read(s) found
        06-05-2015  [14:49:53]: Aligning reads to reference package database

ERROR: File /home/haroonmf/DB/graftm_packages/16S_82_hmmaligned_gpkg/82.hmm does not appear to be in a recognized HMM format.
Usage: hmmalign [-options] <hmmfile> <seqfile>

To see more help on available options, do hmmalign -h


ERROR: File /home/haroonmf/DB/graftm_packages/16S_82_hmmaligned_gpkg/82.hmm does not appear to be in a recognized HMM format.
Usage: hmmalign [-options] <hmmfile> <seqfile>

To see more help on available options, do hmmalign -h


[14:49:54]: Placing reads into phylogenetic tree
Sequence length cut to 0 by pre-masking; can't proceed with no information.
Traceback (most recent call last):
  File "/usr/local/bin/graftM", line 152, in <module>
    Run(args).main()
  File "/usr/local/lib/python2.7/dist-packages/graftm/run.py", line 396, in main
    self.graft()
  File "/usr/local/lib/python2.7/dist-packages/graftm/run.py", line 338, in graft
    self.placement(summary_table)
  File "/usr/local/lib/python2.7/dist-packages/graftm/run.py", line 127, in placement
    self.args)
  File "/usr/local/lib/python2.7/dist-packages/graftm/pplacer.py", line 89, in place
    jplace = self.pplacer(files.jplace_output_path(), args.output_directory, files.comb_aln_fa(), args.threads, files.command_log_path())
  File "/usr/local/lib/python2.7/dist-packages/graftm/pplacer.py", line 28, in pplacer
    subprocess.check_call(cmd, shell=True) # Run it
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'pplacer -j 20 --verbosity 0 --out-dir GraftM_output -c /home/haroonmf/DB/graftm_packages/16S_82_hmmaligned_gpkg/82.refpkg GraftM_output/combined_alignment.aln.fa' returned non-zero exit status 1
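
Failing early with a clear message would need a version check before any HMMER program is run. The sketch below parses a version out of banner text; it assumes the banner printed by `hmmsearch -h` contains a line of the form `# HMMER 3.1b2 (February 2015); ...`, and the minimum of 3.1 is also an assumption, based only on 3.0 failing here and 3.1b2 appearing in other logs. `parse_hmmer_version` and `hmmer_is_new_enough` are hypothetical helpers.

```python
import re


def parse_hmmer_version(banner_text):
    """Return (major, minor) parsed from HMMER banner text, or None if absent."""
    match = re.search(r"HMMER\s+(\d+)\.(\d+)", banner_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def hmmer_is_new_enough(banner_text, minimum=(3, 1)):
    """True only if a version was found and it is at least the minimum."""
    version = parse_hmmer_version(banner_text)
    return version is not None and version >= minimum
```

If the check fails, graft could stop with a message naming the installed and required versions instead of continuing on to pplacer with an empty alignment.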

graft no longer reports the number of reads it detects in default verbosity

when -r is specified, the read count output includes forward and reverse reads

The user would expect this number to be the number of pairs, not the number of individual reads, at least going by how the message reads.

[10:46:37]: Found 57127 read(s) in cck10_for.

I think the best solution would be to report the number of forward reads found separately from the number of reverse reads found.

should output biom format

using the -r flag seems to bias the composition estimate towards bacteria

I'll send you a separate email with the details @geronimp. This seems to be true for @dparks1134 's data, at least.

evalues for search appear to be ignored

In Hmmer, the hmmsearch and nhmmer methods never appear to read the eval argument, so presumably the default e-value is used even when one is specified on the graftM command line.
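
For illustration, this is roughly how the e-value could be threaded through to the hmmsearch call. The flags are the ones visible in the hmmsearch option settings logged elsewhere on this page (`-o`, `--domtblout`, `--noali`, `-E`, `--cpu`); `build_hmmsearch_command` is a hypothetical helper, not the existing Hmmer method.

```python
def build_hmmsearch_command(hmm_path, domtblout_path, evalue, threads):
    """Build an hmmsearch argument list that reads sequences from stdin ('-')."""
    return [
        "hmmsearch",
        "-o", "/dev/null",
        "--domtblout", domtblout_path,
        "--noali",
        "-E", str(evalue),      # the user-supplied e-value is passed explicitly
        "--cpu", str(threads),
        hmm_path,
        "-",
    ]
```

A unit test asserting that the value after `-E` equals the value given on the command line would catch this kind of regression.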

create deduplication should prefer sequences that are in the tree, if one is provided

Otherwise, if the only taxonomy provided comes from sequences in the tree and deduplication keeps a different, arbitrarily chosen sequence, no taxonomy will be known for the retained sequence.

This is especially relevant when upgrading graftm packages, although currently the original information (e.g. the sequences) is needed anyway, since it cannot be derived from a graftm package (in v2, at least).
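
A minimal sketch of that preference, assuming sequences arrive as (name, sequence) pairs with unique names and that the tree's tip names are available as a collection of strings. `deduplicate_preferring_tree` is a hypothetical function and does not mirror the structure of graftM's current deduplication code.

```python
from collections import OrderedDict


def deduplicate_preferring_tree(sequences, tree_names):
    """Group identical sequences; put a name found in the tree first in each group."""
    groups = OrderedDict()
    for name, seq in sequences:
        groups.setdefault(seq, []).append(name)

    tree_names = set(tree_names)
    result = []
    for names in groups.values():
        in_tree = [n for n in names if n in tree_names]
        # Fall back to the first name seen when no member of the group is in the tree.
        representative = in_tree[0] if in_tree else names[0]
        result.append([representative] + [n for n in names if n != representative])
    return result
```

The first element of each returned group is the representative to keep, so taxonomy read from the annotated tree still applies after deduplication.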

More details on making refpkg using graftM create

I am still not sure on how to make a refpkg. Is there an example of the taxonomy file?
--taxonomy TAX File containing two tab separated columns, the first
with the ID of the sequences, the second with the
taxonomy string (GreenGenes taxonomy file format).
What kind of tree? Is it only from fasttree output? Can I use other tree making programs to make the tree?

Thanks for looking into this.

pre-compiled fxtract binaries don't deal with bgzip files

This is really an fxtract problem, but we need to deal with it so I thought I'd put it here. Here's what happens with a sample straight off the nextseq

ben@ben:/tmp/gah$ fxtract -H -z -X ~/l/t/reads /srv/data/lims_controlled/150502_NS500333_0025_AH2HWNBGXX/Paperpalooza/NS25-04/Acinetobacter-sp_S4_L001_R1_001.fastq.gz
bad character found: �
The file /home/ben/l/t/reads does not look like either fasta or fastq
Failed to open stream

fxtract compiled from source does not have this problem, just the pre-compiled linux binaries.

--graftm_package shouldn't be required when --hmm_file and --search_only are specified

uqbwoodc@mrca003:20150220:/srv/projects/abisko/shotgun_abundance/53_doe_talk_functional_gene_pca/bclA/graftm$ graftM --forward /srv/projects/abisko/data/flat20150213/20120800_S1M.1.fq.gz --search_only --threads 40 --output_stats 20120800_S1M --hmm_file ../AAN32623.1.mingle/homologs.faa.aligned.hmm  --type P
Traceback (most recent call last):
  File "/srv/sw/graftm/0.4.0/bin/graftM", line 47, in <module>
    Run(args).main()
  File "/srv/sw/graftm/0.4.0/graftm/Run.py", line 33, in __init__
    self.HK.set_attributes(self.args)
  File "/srv/sw/graftm/0.4.0/graftm/HouseKeeping.py", line 116, in set_attributes
    Messenger().message('ERROR: %s is empty or misformatted.' % args.graftm_package)
AttributeError: 'Namespace' object has no attribute 'graftm_package'
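
With argparse, one way to express this is to leave `--graftm_package` optional and validate the combination after parsing. The sketch below uses only the flag names that appear in the commands on this page; `parse_graft_args` is a hypothetical function, and the real graftM parser has many more options.

```python
import argparse


def parse_graft_args(argv):
    parser = argparse.ArgumentParser(prog="graftM")
    parser.add_argument("--forward")
    parser.add_argument("--graftm_package", default=None)
    parser.add_argument("--hmm_file", default=None)
    parser.add_argument("--search_only", action="store_true")
    args = parser.parse_args(argv)

    # The package is only needed when it cannot be replaced by an HMM-only search.
    if args.graftm_package is None and not (args.hmm_file and args.search_only):
        parser.error("--graftm_package is required unless both "
                     "--hmm_file and --search_only are given")
    return args
```

Giving `--graftm_package` an explicit default of None also means later code can test the attribute safely instead of raising an AttributeError.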

graftm create fails when a sequence has no taxonomy

Currently, the sequence gets given a taxonomy which is not in the refpkg taxonomy file.