ropensci / phylotar

An automated pipeline for retrieving orthologous DNA sequences from GenBank in R

Home Page: https://docs.ropensci.org/phylotaR

License: Other

R 98.49% Just 0.23% C 1.28%

sequence-alignment blastn phylogenetics r genbank peer-reviewed r-package rstats

phylotar's Introduction

Automated Retrieval of Orthologous DNA Sequences from GenBank

R implementation of the PhyLoTa sequence cluster pipeline. Tested and demonstrated on Unix and Windows. Find out more by visiting the phylotaR website.

Install

install.packages("remotes")
remotes::install_github('ropensci/phylotaR')

Full functionality depends on a local copy of BLAST+ (>= 2.0.0). For details on downloading and compiling BLAST+ on your machine please visit the NCBI website.
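
Before running setup, it can help to confirm which BLAST+ version is installed. The sketch below is not part of phylotaR: the helper names parse_blast_version and blast_version_ok are hypothetical, and it assumes the first line printed by `blastn -version` looks like "blastn: 2.9.0+".

```r
# Extract "x.y.z" from a line such as "blastn: 2.9.0+".
# Returns NA if no version string is present.
parse_blast_version <- function(version_line) {
  m <- regmatches(version_line,
                  regexpr("[0-9]+\\.[0-9]+\\.[0-9]+", version_line))
  if (length(m) == 0) return(NA_character_)
  m
}

# Hypothetical helper: TRUE if the reported version is at least `minimum`.
blast_version_ok <- function(version_line, minimum = "2.0.0") {
  v <- parse_blast_version(version_line)
  if (is.na(v)) return(FALSE)
  numeric_version(v) >= numeric_version(minimum)
}

# Only query the real binary if one is found on the PATH.
blastn_path <- unname(Sys.which("blastn"))
if (nzchar(blastn_path)) {
  first_line <- system2(blastn_path, "-version", stdout = TRUE)[1]
  message("Found ", first_line, "; new enough: ", blast_version_ok(first_line))
}
```

If no version can be parsed, the helper returns FALSE rather than failing, which makes a missing or unusual installation easier to diagnose.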

Pipeline

phylotaR runs the PhyLoTa pipeline in four automated stages: identify and retrieve taxonomic information on all descendant nodes of the taxonomic group of interest (taxise), download sequence data for every identified node (download), identify orthologous clusters using BLAST (cluster), and identify sister clusters for sets of clusters identified in the previous stage (cluster^2). After these stages are complete, phylotaR provides tools for exploring, identifying and exporting suitable clusters for subsequent analysis.

For more information on the pipeline and how it works see the publication, phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R.

Running

At a minimum all a user need do is provide the taxonomic ID of their chosen taxonomic group of interest. For example, if you were interested in primates, you can visit the NCBI taxonomy home page and search primates to look up their ID. After identifying the ID, the phylotaR pipeline can be run with the following script.

library(phylotaR)
wd <- '[FILEPATH TO WORKING DIRECTORY]'
ncbi_dr <- '[FILEPATH TO COMPILED BLAST+ TOOLS]'
txid <- 9443  # primates ID
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr)
run(wd = wd)

The pipeline can be stopped and restarted at any point without loss of data. For more details on this script, how to change parameters, check the log and details of the pipeline, please check out the package vignette.

library(phylotaR)
vignette("phylotaR")

Timings

How long does it take for a phylotaR pipeline to complete? The table below lists the runtimes in minutes for different demonstration taxonomic groups.

Taxon	N. taxa	N. sequences	N. clusters	Taxise (mins.)	Download (mins.)	Cluster (mins.)	Cluster2 (mins.)	Total (mins.)
Anisoptera	1175	11432	796	1.6	23	48	0.017	72
Acipenseridae	51	2407	333	0.1	6.9	6.4	0.017	13
Tinamiformes	25	251	98	0.067	2.4	0.18	0.017	2.7
Aotus	13	1499	193	0.067	3.2	0.6	0	3.9
Bromeliaceae	1171	9833	724	1.2	28	37	0.033	66
Cycadidae	353	8331	540	0.32	19	18	0.033	37
Eutardigrada	261	960	211	0.3	11	1.8	0.05	14
Kazachstania	40	623	101	0.1	20	3	0.05	23
Platyrrhini	212	12731	3112	0.35	51	6.9	1.2	60

To run these same demonstrations see demos/demo_run.R.
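
To see where the time goes, the per-stage columns can be summed and compared with the reported totals. The sketch below copies three rows from the table above by hand; the column names are chosen here for illustration. Because the table values are rounded, the sums only agree with the totals to within about a minute.

```r
# Three rows copied from the timings table (minutes).
timings <- data.frame(
  taxon    = c("Anisoptera", "Aotus", "Platyrrhini"),
  taxise   = c(1.6, 0.067, 0.35),
  download = c(23, 3.2, 51),
  cluster  = c(48, 0.6, 6.9),
  cluster2 = c(0.017, 0, 1.2),
  total    = c(72, 3.9, 60)
)

stages <- c("taxise", "download", "cluster", "cluster2")

# Sum of the four stage columns, and the fraction spent downloading.
timings$stage_sum <- rowSums(timings[, stages])
timings$download_share <- round(timings$download / timings$stage_sum, 2)

print(timings[, c("taxon", "stage_sum", "total", "download_share")])
```

For these rows, the download stage accounts for most of the runtime for Aotus and Platyrrhini, whereas clustering takes longest for Anisoptera.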

License

MIT

Authors

Maintainer: Shixiang Wang

This package was previously developed and maintained by:

Dom Bennett, Hannes Hettling, Rutger Vos, Alexander Zizka and Alexandre Antonelli

Reference

Bennett, D., Hettling, H., Silvestro, D., Zizka, A., Bacon, C., Faurby, S., … Antonelli, A. (2018). phylotaR: An Automated Pipeline for Retrieving Orthologous DNA Sequences from GenBank in R. Life, 8(2), 20. DOI:10.3390/life8020020

Sanderson, M. J., Boss, D., Chen, D., Cranston, K. A., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology, 57(3), 335–346. DOI:10.1080/10635150802158688

phylotar's People

dombennett cnyuanh agamisch haithamsghaier danielyao12 gavieira bomeara huiqingyeooo

phylotar's Issues

Documentation issues

use devtools::spell_check
add return fields to all functions
create examples for each function, investigate devtools::use_data
what does MAD, type and seed mean?

Investigate incorporating more packages

tm for filtering common words in feature/gene names
taxize
taxizeDB
biostrings

pkgdown

Change gi codes to accessions

Apparently, GI numbers are being phased out of the NCBI databases. Unfortunately, the whole codebase relies on GIs, as does PhyLoTa. However, this might change in the future. We should consider moving to accession numbers.

Unable to find correct versions of NCBI BLAST+ tools

I am getting this error (last line):

library(phylotaR)
wd <- /apps/Acanthaceae
Error: unexpected '/' in "wd <- /"
wd <- "/apps/Acanthaceae"
ncbi_dr <- "/usr/lib/ncbi-blast+"
txid <- 4185
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)

phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Failed to run: [/usr/lib/ncbi-blast+/makeblastdb]. Reason:
[Error : Failed to execute '/usr/lib/ncbi-blast+/makeblastdb' (No such file or directory)
]Failed to run: [/usr/lib/ncbi-blast+/blastn]. Reason:
[Error : Failed to execute '/usr/lib/ncbi-blast+/blastn' (No such file or directory)
]Error:Unable to find correct versions of NCBI BLAST+ tools
Error in blast_setup(d = ncbi_dr, v = v, wd = wd) :
Unable to find correct versions of NCBI BLAST+ tools

I think BLAST+ is OK on my computer. Perhaps I have to install some specific database?
If I type:
sudo apt install ncbi-blast+
I get:
Reading package lists... Done
Building dependency tree
Reading state information... Done
ncbi-blast+ is already the newest version (2.6.0-1).
0 upgraded, 0 newly installed, 0 to remove and 83 not upgraded.
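
A possible way to find the right value for ncbi_dr is to ask R where the executables live, since a package manager often places them in a general bin directory rather than a package-named folder. This is a sketch only; find_blast_dir is a hypothetical helper, not a phylotaR function.

```r
# Hypothetical helper: return the single directory that holds all the
# requested tools, or NA if any tool is missing or they are split
# across different directories.
find_blast_dir <- function(tools = c("blastn", "makeblastdb")) {
  paths <- Sys.which(tools)
  if (any(!nzchar(paths))) return(NA_character_)
  dirs <- unique(dirname(unname(paths)))
  if (length(dirs) != 1) return(NA_character_)
  dirs
}

ncbi_dr <- find_blast_dir()
if (is.na(ncbi_dr)) {
  message("BLAST+ tools not found on the PATH")
} else {
  message("Use ncbi_dr = '", ncbi_dr, "'")
}
```

The returned directory can then be passed as ncbi_dr to setup.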

Local vs remote NCBI access

At the moment there is a hybrid form of NCBI database access. nodes.R relies on a local copy of the NCBI taxonomy nodes table, but the retrieval of sequences for clustering is done with remote access using the package rentrez.

nodes.R also has remote access to NCBI taxonomy implemented, but this needs to be debugged.

No periods or tildes in file paths

Hi Dom,

when I try to run a dummy analysis I run into the issue that my $HOME directory is /Users/rutger.vos, which the code wants to parse on the period. It tries to instantiate a log file called /Users/rutger.log, which is not allowed.

In trying to circumvent this I tested whether I can use a tilde instead (i.e. for my $HOME, use ~) but it doesn't like that either.

Best,

Rutger
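
The behaviour described above can be reproduced in a few lines of base R. This is an illustration of the failure mode and one way around it, not the package's actual code; the log file name log.txt and the folder phylotar_run are assumptions.

```r
home_like <- "/Users/rutger.vos"

# Naive approach: cutting the path at the first period mangles the
# directory name and yields the log path reported in the issue.
naive_log <- paste0(sub("\\..*$", "", home_like), ".log")

# Safer: treat the working directory as opaque and append a file name.
safe_log <- file.path(home_like, "log.txt")

# A leading tilde can be expanded explicitly before the path is used.
expanded <- path.expand("~/phylotar_run")

print(c(naive = naive_log, safe = safe_log, expanded = expanded))
```

Building paths with file.path and expanding the tilde up front avoids any need to parse the directory name itself.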

Decide on what the package output should be

At the moment, the code produces tables as in the phylota database, so the output can be read into supersmart.

We should carefully think of what a possible user might want from this package; I suppose some tsv files that can be read into sqlite is not so attractive.

Maybe it should produce a fasta file per cluster?
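
As a rough sketch of the FASTA-per-cluster idea, assuming clusters are available as a named list of named character vectors (cluster ID, then sequence ID and sequence). The function name write_cluster_fasta and the example data are hypothetical.

```r
# Write one FASTA file per cluster and return the file paths,
# named by cluster ID.
write_cluster_fasta <- function(clusters, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- character(0)
  for (cid in names(clusters)) {
    seqs <- clusters[[cid]]
    # Interleave ">id" header lines with their sequences.
    lines <- as.vector(rbind(paste0(">", names(seqs)), unname(seqs)))
    path <- file.path(out_dir, paste0("cluster_", cid, ".fasta"))
    writeLines(lines, path)
    files[cid] <- path
  }
  files
}

# Made-up example data.
clusters <- list(
  "0" = c(seqA = "ACGTACGT", seqB = "ACGTTCGT"),
  "1" = c(seqC = "GGGTTTAA")
)
fasta_files <- write_cluster_fasta(clusters, file.path(tempdir(), "clusters"))
```

Each resulting file can be read directly by alignment tools, which may suit users better than TSV tables.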

Use phylotaR for retrieving orthologous protein sequences

Hi,

is it possible to use phylotaR also for retrieving orthologous protein sequences (using all-vs.-all blastp searches) from Genbank/NCBI?

Many thanks in advance for your help!

Jan

Create advanced vignette

Create unit tests

It would be good if there were standardized tests so that users can verify correct functioning and use them as code samples.

Write documentation

Only one file (ncbi-remote.R) is properly documented so far. We need documentation for all functions that will be exposed to users such that documentation .Rd files can be generated.

Error: blastn failed to run. Check BLAST log files

Hi. I'm trying to run the following commands

wd <- "C:/Project_Crass/11_R/PhylotaR"
ncbi_dr <- "C:/Program Files/NCBI/blast-2.7.1+/bin"
txid <- 3784
setUp(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = T)
run(wd = "C:/Project_Crass/11_R/PhylotaR")

but I get an error when the cluster stage starts

Error in runStgs(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg):
Unexpected Error in error(ps = ps, "blastn failed to run. Check BLAST log files."):
Error: blastn failed to run. Check BLAST log files.

Could you guide me on how to solve this error?
I have already tried the same procedure with different taxa and it always stops at the cluster stage.
I attach the log file, thank you.
log.txt

Several species

I am currently working with phylotaR for a group of species. As far as I can see from your paper, the examples and the vignette, phylotaR only works for one txid.
Is there a way to provide several txids, to work with species from several groups?
I have tried to supply the IDs of my species as a character vector, but it doesn't seem to work because the program does not advance beyond taxise.
Best regards,

Error in run.R II

The clusters.ci_gi.seqs.create function throws the following error:

Counting species for taxon 4613
Number of sequences for taxon 4613 : 16929
Too many seqs to blast for taxid 4613 ... retrieving children
Error in .children(current.taxon, nodes) : unused argument (nodes)

Reproducible example

#try out to get a molecular alignment using Phylotar

#libraries
#source the phylotaR package
source("blast.R")
source("ci_gi.R")
source("cl.R")
source("clusters.R")
source("db.R")
source("ncbi-remote.R")
source("nodes.R")
source("query-local.R")

#run an analysis following the run.sh script, this is the run.r copied
library(foreach)
library(doMC)

set.seed(111)

options(error=recover)

## Adjustable parameters:
## Maximum number of sequences per species
MODEL.THRESHOLD <<- 3000
## Maximum number of sequences to blast in a single run; if taxon has more subtree sequences
## than that, its children will get clustered
MAX.BLAST.SEQS <<- 10000
## Maximum characters in one sequence
MAX.SEQUENCE.LENGTH <<- 25000
## directory for sequence cache; will be created if does not exist
SEQS.CACHE.DIR <<- "./sequences/"
## Download file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz,
## unzip and specify directory where file 'nodes.dmp' is located
taxdir <- 'taxdmp'
## Number of processing units
CORES <<- 4

registerDoMC(CORES)

## Do analysis for Bromeliaceae family
taxid <- 4613
nodes.create(taxid, taxdir=taxdir, file.name='dbfiles-bromeliaceae-nodes.tsv')

clusters.ci_gi.seqs.create(4613, 'dbfiles-bromeliaceae-nodes.tsv', 
                           files=list(clusters='dbfiles-bromeliaceae-clusters.tsv',
                                      ci_gi='dbfiles-bromeliaceae-ci_gi.tsv',
                                      seqs='dbfiles-bromeliaceae-seqs.tsv'))

sessionInfo

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] foreach_1.4.3         CHNOSZ_1.1.0          XML_3.98-1.9          rentrez_1.1.0         igraph_1.1.2         
 [6] RSQLite_2.0           data.table_1.10.4-3   bindrcpp_0.2          speciesgeocodeR_2.0-8 sp_1.2-5             
[11] forcats_0.2.0         stringr_1.2.0         dplyr_0.7.4           purrr_0.2.4           readr_1.1.1          
[16] tidyr_0.7.2           tibble_1.3.4          ggplot2_2.2.1         tidyverse_1.2.0      

loaded via a namespace (and not attached):
 [1] viridis_0.4.0     httr_1.3.1        bit64_0.9-7       jsonlite_1.5      viridisLite_0.2.0 modelr_0.1.1     
 [7] assertthat_0.2.0  blob_1.1.0        cellranger_1.1.0  yaml_2.1.14       lattice_0.20-35   glue_1.2.0       
[13] digest_0.6.12     rvest_0.3.2       colorspace_1.3-2  picante_1.6-2     Matrix_1.2-11     plyr_1.8.4       
[19] psych_1.7.8       pkgconfig_2.0.1   broom_0.4.2       raster_2.5-8      haven_1.1.0       scales_0.5.0     
[25] mgcv_1.8-20       lazyeval_0.2.1    cli_1.0.0         mnormt_1.5-5      magrittr_1.5      crayon_1.3.4     
[31] readxl_1.0.0      memoise_1.1.0     nlme_3.1-131      MASS_7.3-47       xml2_1.1.1        foreign_0.8-69   
[37] vegan_2.4-4       tools_3.4.2       hms_0.3           geosphere_1.5-7   munsell_0.4.3     cluster_2.0.6    
[43] compiler_3.4.2    rlang_0.1.4       grid_3.4.2        iterators_1.0.8   rstudioapi_0.7    codetools_0.2-15 
[49] gtable_0.2.0      curl_3.0          DBI_0.7           reshape2_1.4.2    R6_2.2.2          gridExtra_2.3    
[55] lubridate_1.7.1   knitr_1.17        rgdal_1.2-15      rgeos_0.3-26      bit_1.1-12        bindr_0.1        
[61] permute_0.9-4     ape_5.0           stringi_1.1.5     parallel_3.4.2    Rcpp_0.12.13

Vignette issues

How long will the pipeline take?

Contributing needed

Search for a single cluster

Identify sequences that are orthologous to a specified sequence e.g. a RefSeq.

Drop parallel

It's better to use BLAST's own threads argument, not foreach.
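
A minimal sketch of what this could look like, using blastn's own -num_threads flag. The helper name blastn_args and the file names are placeholders, and the paths are assumed to contain no spaces.

```r
# Build the argument vector for a blastn call that uses BLAST's own
# threading. Paths are assumed to contain no spaces.
blastn_args <- function(query, db, out, threads = 1L, evalue = 1e-10) {
  c("-query", query,
    "-db", db,
    "-out", out,
    "-outfmt", "6",
    "-evalue", format(evalue, scientific = TRUE),
    "-num_threads", as.character(threads))
}

args <- blastn_args("query.fa", "db", "hits.tsv", threads = 4L)
print(args)

# Uncomment once BLAST+ is installed and the input files exist:
# system2("blastn", args)
```

Leaving the threading to BLAST removes the dependency on a parallel backend, which also sidesteps backends that are unavailable on Windows.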

makeblastdb failed to run

Hi Dom,

I could not run the pipeline due to an error with makeblastdb.

This is my code:

wd <- "C:/Users/jorge/Documents/Mente/BIOVERA humboldt"
ncbi_dr <- "C:/Program Files/NCBI/blast-2.7.1+/bin"
txid <- 241806 

setUp(wd=wd, txid=txid, ncbi_dr=ncbi_dr)
run(wd=wd)

Everything works until the cluster stage. Then the following message appears:
Error in error(ps = ps, paste0("makeblastdb failed to run. Check BLAST log files.")) :
Error: makeblastdb failed to run. Check BLAST log files.

This is the logfile:
USAGE
makeblastdb.exe [-h] [-help] [-in input_file] [-input_type type]
-dbtype molecule_type [-title database_title] [-parse_seqids]
[-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
[-mask_desc mask_algo_descriptions] [-gi_mask]
[-gi_mask_name gi_based_mask_names] [-out database_name]
[-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
[-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
Application to create BLAST databases, version 2.7.1+

Use '-help' to print detailed descriptions of command line arguments

Error: Too many positional arguments (1), the offending value: humboldt/blast/taxon-1203511-typ-subtree-db.fa
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: humboldt/blast/taxon-1203511-typ-subtree-db.fa

I attach my log file:
log.txt
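
The offending value in the log begins at the space in the working directory name ("BIOVERA humboldt"), which suggests the path was split into two arguments. The sketch below reproduces that splitting and shows quoting with base R's shQuote as one possible remedy. It is an illustration only and does not show how phylotaR builds its commands.

```r
# Path built from the working directory and file name in the report.
db_fasta <- paste0("C:/Users/jorge/Documents/Mente/BIOVERA humboldt",
                   "/blast/taxon-1203511-typ-subtree-db.fa")

# Pasting everything into one unquoted command string means the shell
# splits the path at the space, producing an extra positional argument.
unquoted <- paste("-in", db_fasta, "-dbtype", "nucl")
tokens <- strsplit(unquoted, " ")[[1]]
print(tokens)

# Quoting the path keeps it as a single argument.
quoted_args <- c("-in", shQuote(db_fasta), "-dbtype", "nucl")
print(quoted_args)

# Uncomment once BLAST+ is installed and the file exists:
# system2("makeblastdb", quoted_args)
```

Until paths are quoted, a working directory without spaces is the simplest workaround.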

Better configuration for parallel mode

At the moment, the number of cores etc is hard-coded in the run.R script.
This should be configurable on some other level.
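
One possible arrangement, sketched below, is to look up the number of cores from an R option, then an environment variable, then a default. The option name phylotar.cores and the variable PHYLOTAR_CORES are invented for this example.

```r
# Resolve the number of cores: R option first, then environment
# variable, then the supplied default. Capped at the detected core count.
get_cores <- function(default = 1L) {
  value <- getOption("phylotar.cores", default = NA)
  if (is.na(value)) value <- Sys.getenv("PHYLOTAR_CORES", unset = NA)
  if (is.na(value)) value <- default
  value <- suppressWarnings(as.integer(value))
  if (is.na(value) || value < 1L) {
    stop("number of cores must be a positive integer")
  }
  available <- parallel::detectCores()
  if (!is.na(available)) value <- min(value, available)
  value
}

cores <- get_cores()
message("Using ", cores, " core(s)")
```

A user could then set options(phylotar.cores = 4) in their script or profile without editing run.R.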

Fasta error

Fix TODO's

In quite some places in the code, I marked some dirty bits or something that would need attention with 'TODO'. These should be thoroughly investigated.

An Error in PhylotaR

Hi Dom,

I am now using phylotaR to process data with BLAST+. During the run, phylotaR shows an error that I may need your help to solve. Below are the script I used and the error it shows.
The Script I Used:

library(phylotaR)
wd<-'e:/Institute_Of_Bontany/WorldTree/TracheophytaTest'
ncbi_dr<-'d:/blast/blast-2.7.1+/bin/'
txid<-58023
setup(wd=wd,txid=txid,ncbi_dr=ncbi_dr,v=TRUE)

phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Found: [d:/blast/blast-2.7.1+/bin//makeblastdb]
Found: [d:/blast/blast-2.7.1+/bin//blastn]
Setting up pipeline with the following parameters:
. blstn [d:/blast/blast-2.7.1+/bin//blastn]
. btchsz [100]
. date [2018-10-22]
. mdlthrs [3000]
. mkblstdb [d:/blast/blast-2.7.1+/bin//makeblastdb]
. mncvrg [51]
. mnsql [250]
. mxevl [1e-10]
. mxnds [1e+05]
. mxrtry [100]
. mxsql [2000]
. mxsqs [50000]
. ncps [1]
. txid [58023]
. v [TRUE]
. wd [e:/Institute_Of_Bontany/WorldTree/TracheophytaTest]

run(wd=wd)

The Error it shows:
Generating taxonomic dictionary ...
Unexpected Error in FUN(X[[i]], ...) : Unable to get "PRNT" slot from object without slot basic category ("NULL")

Occurred [2018-10-23 09:06:12]
Contact package maintainer for help.
Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
Unexpected Error in FUN(X[[i]], ...) : Unable to get "PRNT" slot from object without slot basic category ("NULL")

Best Wishes,
Qiang

Allow users to modify the NCBI search term in their parameters

Run issues

log should include BLAST version
sessionInfo needs to be recorded
'Don't panic' statement?
More information on what R is actually doing during a run
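
The first two points could be covered by writing a short header to the log at the start of a run. The sketch below is one way to do it in base R; write_run_header is a hypothetical helper and the BLAST version string is passed in rather than detected.

```r
# Append a header with the start time, BLAST version and R session
# details to a log file. Returns the header lines invisibly.
write_run_header <- function(log_file, blast_version = NA_character_) {
  header <- c(
    paste0("Run started: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S")),
    paste0("BLAST version: ", blast_version),
    "Session info:",
    utils::capture.output(print(utils::sessionInfo()))
  )
  write(header, file = log_file, append = TRUE)
  invisible(header)
}

log_file <- file.path(tempdir(), "phylotar_header_log.txt")
write_run_header(log_file, blast_version = "blastn: 2.9.0+")
```

With this information in the log, a bug report needs only the log file rather than separate sessionInfo output.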

Use goodpractice::gp()

Error in run.R

The clusters.ci_gi.seqs.create function in the run.R script throws the following error:

Counting species for taxon 15123
Number of sequences for taxon 15123 : 541
Will process taxon 15123
evaluation # 1:
$i
[1] 1

Processing taxid 15123 # 1 / 1
Attempting to retrieve sequences for taxid 15123
541 seqs for taxon 15123 , less than maximum of 3000 sequences. Retreiving sequences for whole subtree
Going to retrieve 541 sequences for taxid 15123
Retreiving seqs 1 to 500 for taxid 15123
Done retreiving 500 ( 1 to 500 ) seqs for taxid 15123
Retreiving seqs 501 to 541 for taxid 15123
Done retreiving 41 ( 501 to 541 ) seqs for taxid 15123
Finished retreiving 541 sequences for taxid 15123
Writing 541 sequences for taxon 15123 to file ./sequences//15123-max-3000.fa
Making sequence dataframe
Done making sequence dataframe
Processing taxid 15123 of rank genus attempting to make subtree clusters
result of evaluating expression:
<simpleError in .children(taxon, nodes): unused argument (nodes)>
got results for task 1
accumulate got an error result
numValues: 1, numResults: 1, stopped: FALSE
returning status FALSE
numValues: 1, numResults: 1, stopped: TRUE
not calling combine function due to errors
Error in { : task 1 failed - "unused argument (nodes)"

Reproducible example:


#source the phylotaR package
source("blast.R")
source("ci_gi.R")
source("cl.R")
source("clusters.R")
source("db.R")
source("ncbi-remote.R")
source("nodes.R")
source("query-local.R")

#run an analysis following the run.sh script, this is the run.r copied
library(foreach)
library(doMC)

set.seed(111)

options(error=recover)

## Adjustable parameters:
## Maximum number of sequences per species
MODEL.THRESHOLD <<- 3000
## Maximum number of sequences to blast in a single run; if taxon has more subtree sequences
## than that, its children will get clustered
MAX.BLAST.SEQS <<- 10000
## Maximum characters in one sequence
MAX.SEQUENCE.LENGTH <<- 25000
## directory for sequence cache; will be created if does not exist
SEQS.CACHE.DIR <<- "./sequences/"
## Download file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz,
## unzip and specify directory where file 'nodes.dmp' is located
taxdir <- 'C:/Users/alexander.zizka/Dropbox (Antonelli Lab)/Arbeit/Gothenburg/projects/31_bromeliaceae_distribution/analyses/phylotaR/phylotaR-master/R/taxdmp'
## Number of processing units
CORES <<- 4

registerDoMC(CORES)

## Do analysis for Bromeliaceae family
taxid <- 4613
nodes.create(taxid, taxdir=taxdir, file.name='dbfiles-bromeliaceae-nodes.tsv')

clusters.ci_gi.seqs.create(15123, 'dbfiles-bromeliaceae-nodes.tsv', 
                           files=list(clusters='dbfiles-bromeliaceae-clusters.tsv',
                                      ci_gi='dbfiles-bromeliaceae-ci_gi.tsv',
                                      seqs='dbfiles-bromeliaceae-seqs.tsv'))

SessionInfo

The sessionInfo() output for this report is identical to the one listed under "Error in run.R II" above (R 3.4.2, x86_64-w64-mingw32, Windows 7 x64).

Infrastructure and support for blastn alternatives

There are alternatives to NCBI's blastn that are often much faster, but:

some are less sensitive
some only work with specific sequence types
some are only beneficial in certain computing environments (e.g. HPC)

There is a large number: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software
In particular: usearch, blat, megablast, plast, diamond (nucleotide capabilities pending: bbuchfink/diamond#117)

Instead of developing new code for each individual search tool, provide general input/output functions that users can adapt to their own choice of search tool. The benefits of this solution:

no need to worry about ensuring each dependency works across versions and OSs.
no need to keep up-to-date with latest developments in search-tool development
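
On the input side, one candidate for a shared format is the 12-column tabular output that blastn writes with -outfmt 6, which several other tools can also produce. The sketch below reads such a table in base R. The function name read_blast6 and the example hits are made up, and the 1e-10 cut-off simply mirrors the mxevl default shown in the setup logs.

```r
# Default column order of BLAST tabular output (-outfmt 6).
blast6_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch",
                 "gapopen", "qstart", "qend", "sstart", "send",
                 "evalue", "bitscore")

# Read a tab-separated hit table, from a file path or via `text =`.
read_blast6 <- function(...) {
  read.table(..., sep = "\t", header = FALSE, col.names = blast6_cols,
             stringsAsFactors = FALSE, quote = "", comment.char = "")
}

# Made-up example: one strong hit and one weak hit.
example_lines <- c(
  "seqA\tseqB\t98.5\t400\t6\t0\t1\t400\t1\t400\t1e-150\t700",
  "seqA\tseqC\t80.0\t120\t24\t0\t1\t120\t5\t124\t1e-05\t90"
)
hits <- read_blast6(text = example_lines)

# Keep only hits at or below the e-value threshold.
strong <- hits[hits$evalue <= 1e-10, ]
print(strong)
```

Any search tool that can emit this table could then feed the clustering step without tool-specific parsing code.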

Increase test coverage

>70%

PhylotaR setup issue

Hi,
I'm trying to run phylotaR in R 3.4.4 with the following code:

devtools::install_github('ropensci/phylotaR')
library(phylotaR)
wd<-getwd()
ncbi_dr<-"C:/Program Files/ncbi-blast-2.7.1+/bin"
txid<-9504
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)

then I get:

phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Error in if (tst) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In blast_setup(d = ncbi_dr, v = v, wd = wd) :
NAs introduced by coercion

Do you have any suggestion? I could not find a solution for this problem

Best,
Matteo

failing to run drop_by_rank()

Hi,
I am trying to reduce the sequence data by using drop_by_rank()
but getting the following error:

Error in phylota@sqs@sqs[[i]] :
attempt to select less than one element in get1index

I am running the pipeline according to the tutorial:

library(phylotaR)
wd <- '/media/...
ncbi_dr <- '/home/...
txid <- 4747
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE)
run(wd = wd)

all_clusters <- read_phylota(wd)
print(all_clusters)
cids <- all_clusters@cids
n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
keep <- cids[taxa_all > 10]
selected <- drop_clstrs(phylota = all_clusters, cid = keep)
reduced <- drop_by_rank(selected, rnk = 'species', n=1)

My experience with R is unfortunately quite limited and I appreciate the help.
Thanks!
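
For reference, the filtering step in the script above selects cluster IDs by their taxon counts. The script stores the counts in n_taxa, but the filter refers to taxa_all, which is never defined in the code shown; whether that is related to the reported error is not clear from the report. A base-R illustration of the intended filter, with made-up counts:

```r
# Made-up taxon counts per cluster ID, standing in for the output of
# get_ntaxa() in the script above.
n_taxa <- c("0" = 25, "1" = 4, "2" = 12)
cids <- names(n_taxa)

# Keep clusters with more than 10 taxa, using the same vector that
# holds the counts.
keep <- cids[n_taxa > 10]
print(keep)
```

It may also be worth checking that keep is not empty before passing it on to the drop functions.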

makeblastdb failing to run

Hello,

I'm having trouble running the pipeline due to an error in BLAST.

This is my code and, on the last line, the error.

> library(phylotaR)
> wd <- 'C:/bioinfo/NCBI/Serpentes'
> ncbi_dr <- 'C:/bioinfo/NCBI/blast-2.10.0+/bin'
> txid <- 8570
> setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)
-------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.2.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [C:/bioinfo/NCBI/blast-2.10.0+/bin/makeblastdb]
Found: [C:/bioinfo/NCBI/blast-2.10.0+/bin/blastn]
. . Running makeblastdb
Error: makeblastdb failed to run. Check BLAST log files.

Amount of data accessible with this pkg?

👋 as part of preparing an rOpenSci annual report, we're trying to estimate amount of data the various pkgs in our suite provide access to.

Do you have a sense for how much data (e.g., in GB) one can access through this pkg? And/or whatever metric is most relevant for this data (sequences maybe?)?

Cat and print consistency for classes

Remove old code

There is quite some old code in the repository: clusters.R will be replaced with the contents of cl.R. There is also a db.R which connects to a sqlite database, I suppose we won't need that for a proper package. Also a run.R should not be in the package.

makeblastdb error

Reviewer could not run pipeline due to an error with makeblastdb.

> # RUN PIPELINE
> txid <- 9504
> setUp(wd="/Users/naupaka/Desktop/phylota_review/aotus", txid=txid, ncbi_dr=ncbi_dr, v=TRUE)
-----------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v0.1]
-----------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/usr/local/bin/makeblastdb]
Found: [/usr/local/bin/blastn]
Setting up pipeline with the following parameters:
. blstn      [/usr/local/bin/blastn]
. btchsz     [300]
. date       [2018-03-26]
. mdlthrs    [3000]
. mkblstdb   [/usr/local/bin/makeblastdb]
. mncvrg     [51]
. mnsql      [250]
. mxevl      [1e-10]
. mxnds      [1e+05]
. mxrtry     [100]
. mxsql      [2000]
. mxsqs      [50000]
. ncps       [1]
. txid       [9504]
. v          [TRUE]
. wd         [/Users/naupaka/Desktop/phylota_review/aotus]
-----------------------------------------------
> run(wd=wd)
... Taxise
... Download
... Cluster
Error in runStgs(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
  Unexpected Error in error(ps = ps, paste0("makeblastdb failed to run. Check BLAST log files.")) :
  Error: makeblastdb failed to run. Check BLAST log files.


Occurred [2018-03-26 06:37:35]
Contact package maintainer for help.

Possibility to supply NCBI API key when running pipeline to access NCBI data faster?

Hi there,

thanks for your efforts developing phylotaR!

Unfortunately, I'm encountering issues when trying to run phylotaR for larger taxonomic groups (e. g. txid=33630). It happens during the download stage when fetching data from NCBI:

taxise_run(wd = wd)
download_run(wd = wd)

after downloading some part of the data I get

Retrying in [1s] for [fetch]
Retrying in [3s] for [fetch]
Retrying in [6s] for [fetch]
Retrying in [10s] for [fetch]
Retrying in [60s] for [fetch]
Retrying in [300s] for [fetch]
Retrying in [300s] for [fetch]
Retrying in [300s] for [fetch]
...

I therefore wondered if it's possible to set an NCBI API key to access the NCBI data faster, as described in this blog post: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

Many thanks in advance for your help!

Jan
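
The retry messages above follow an escalating wait schedule (1, 3, 6, 10, 60, then 300 seconds). The sketch below shows that pattern in base R with a stand-in fetch function. It is not phylotaR's implementation, and with_retry and flaky_fetch are hypothetical names. On the API key itself: rentrez reads a key from the ENTREZ_KEY environment variable, but whether a given phylotaR version passes it through is an assumption to verify.

```r
# Assumption: the installed rentrez picks up the key from this variable.
# Sys.setenv(ENTREZ_KEY = "<your NCBI API key>")

# Call `fun` until it succeeds, waiting longer after each failure.
# Once the schedule is exhausted, the last wait is reused.
with_retry <- function(fun, waits = c(1, 3, 6, 10, 60, 300),
                       max_tries = 10, sleep = Sys.sleep) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i == max_tries) break
    wait <- waits[min(i, length(waits))]
    message("Retrying in [", wait, "s]")
    sleep(wait)
  }
  stop("all attempts failed: ", conditionMessage(result))
}

# Stand-in for a network fetch that fails twice before succeeding.
attempts <- 0
flaky_fetch <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  "record"
}

# The sleep function is replaced so the example does not actually wait.
value <- with_retry(flaky_fetch, sleep = function(seconds) invisible(NULL))
```

Repeated 300-second waits, as in the log above, mean the schedule has been exhausted and the requests are still failing.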

Description issues

Version should be formatted as major.minor.patch (e.g. 0.1.0)
The Date field should not be used
For the URL and BugReports entries, it would be better if these pointed to the repository that holds the source code (see Repos issues section below).
The License slot says GPL-2 but your actual LICENSE is MIT; this needs to be resolved.

Resolve BLAST dependency

The code relies on blastn for the clustering. I'm not sure if this is acceptable for a CRAN (or bioconductor or similar) submission.

This should be checked and if it is impossible to depend on an external tool, blastn should be included in the package.

As far as I know, all blast R packages do remote blasting, which is however not an option for this package.

Style issues

Irregular line wrapping
Extra newlines at the beginning of code
Longer class and function names
snakecase for functions

Dependency 'treeman' has been archived:

https://cran.r-project.org/web/packages/phylotaR/index.html

PhylotaR Cluster error: blastn failed to run

Hi! I've been trying to use phylotaR with the code:

wd <- 'C:/Users/maari/Desktop/Trial'
ncbi_dr <- 'C:/Program Files/NCBI/blast-2.9.0+/bin'
txid <- 231623
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = T)
run(wd = wd)

Everything seems to work fine until the Cluster stage, where I get the following error:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
Unexpected Error : blastn failed to run. Check BLAST log files.

BLAST query/options error: '"6' is not a valid output format
Please refer to the BLAST+ user manual.

Could you help solve this error?
I've already tried running the code with different taxa, R and Blast versions, but always seem to get an error in the cluster stage.
I've also tried installing the most recent version of phylotaR using the command:
install.packages('phylotaR')
I'm working on a windows 10 machine.

Error in gzfile

Hi,
I am trying to use phylotaR. However, after a long period (around 12 hrs) of downloading sequences, the software gives this error:
Error in stages_run(wd = wd, frm = frm, to = nstages, stgs_msg = stgs_msg, :
Unexpected Error in gzfile(file, "rb") : invalid 'description' argument.
I attach the log and session files:
session_info.txt
log.txt

Test demo pipelines on Windows

The demo pipelines have never been run on Windows.

Internal handling of sequences

Right now, sets of sequences are represented as lists with GIs as keys. For a package that should be built to interact with other packages (e.g. ape), we might want to consider using existing classes for sequence (and also taxonomy node) representations.

Readme issues

no need for 'in dev' tag
specify min BLAST
more details on pipeline
update names for pipeline

Verify compliance with @ropensci

I.e. make sure that the package fits with the community practices of @ropensci and work with them for broader distribution and advertising.

problem with the restez/phylota integration

Issue(s) resolved via email ....

On Tue, Jun 9, 2020 at 8:20 PM Alexandre Pedro wrote:

Dear Dominic

I'm a Post-Doc at the Federal University of Rio de Janeiro working with birds
and vertebrate macroevolution and biogeography in general. I really appreciate
your efforts to create and maintain the new PhylotaR, and I'm absolutely sure
this will be one of the most useful phylogenetic tools to work with GenBank data
from now on.

I'm having a problem with the restez/phylota integration, not sure why. I really
tried to solve everything without having to send you this, but decided to do so
after the program told me to contact the package maintainer:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in if (ps[["multiple_ids"]]) { : argument is of length zero

Occurred [2020-06-09 19:57:45]
Contact package maintainer for help.


I'm trying to do a PhylotaR search on the Araceae family (txid 4454), which is
taking too long to run through the PhylotaR alone, so I tried the restez/phylotaR
integration. It took quite a while to download the plant database from GB, but I
apparently managed to  create the restez database (I can access with the
connect/disconnect restez commands). However, I'm stuck on what seems to
be a makeblastdb error... I get the following error message when trying to run setup:

setup(wd = wd, txid = txid, ncbi_dr=ncbi_dr, v=T)
 -------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.2.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/usr/local/ncbi/blast/bin/makeblastdb]
Found: [/usr/local/ncbi/blast/bin/blastn]
. . Running makeblastdb
Error: makeblastdb failed to run. Check BLAST log files.

and the following log file in the recently created blast folder:

BLAST options error: Please provide a database name using -out

I'm using:
blast 2.9
Rstudio version 1.2.1335 with R version 3.6
Mac OS Mojave 10.14.6

Thank you so much for your time and patience.

Please let me know if you need further information.

Best regards

Alexandre

On Tue, Jun 9, 2020 at 8:23 PM Alexandre Pedro <[email protected]> wrote:

Sorry, forgot to say the phylota version is 1.2.0 and restez is 1.0.2

On Wed, 10 June 2020 at 09:03, Alexandre Pedro <[email protected]> wrote:

Dear Dominic,

I've solved the issue with makeblastdb (I had two conflicting versions in my R $path). But nevertheless, I'm still unable to run the PhylotaR with my restez database. I get the following message:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in .local(conn, statement, ...) : 
Unable to execute statement 'SELECT accession,raw_record,raw_sequence FROM nucleotide WHERE accession IN ('MN099113','MN099112','...'.
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

Occurred [2020-06-10 03:52:57]
Contact package maintainer for help.

I know the Araceae family is a large group, but I can't run even the Bromeliaceae
example from the PhylotaR webpage. I have over 100Gb free in my HD; 16G of RAM
and 4 i7 CPUs.

I'm sending you the log file as well just to make sure I'm not missing anything in the setup settings.

Thank you so very much for your time and patience.

Best

Alex
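
The makeblastdb conflict mentioned above (two BLAST+ installations visible to R) can be checked from within R before calling setup(). A minimal base-R sketch, assuming a Unix-like system where the executable is named exactly makeblastdb; find_on_path is a hypothetical helper and not part of phylotaR:

```r
# Hypothetical helper (not part of phylotaR): list every copy of a tool
# found in the directories of a PATH-style string.
find_on_path <- function(tool, path = Sys.getenv("PATH")) {
  dirs <- unique(strsplit(path, .Platform$path.sep, fixed = TRUE)[[1]])
  candidates <- file.path(dirs, tool)
  candidates[file.exists(candidates)]
}

# The copy R would pick up first, followed by all copies visible on the PATH.
Sys.which("makeblastdb")
find_on_path("makeblastdb")
```

If more than one path comes back, passing the intended directory explicitly through ncbi_dr avoids relying on the search order.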

On Wed, Jun 10, 2020 at 4:28 AM Dominic John Bennett <[email protected]> wrote:

Hi Alex,

I'm sorry you're experiencing problems.

That looks to me like an SQL-type error. It's probably due to MN099113 ...
having larger records than can be allocated in memory. It's not impossible
that the selected records' combined compressed size is greater than 16 GB or
whatever your OS limits the request to be.

I see two possible workarounds... 

First, you could try re-running phylotar with a much lower batch size thereby
reducing the number of records to pull out of the database each time.

Second, you could try excluding the largest records from the database by
re-running db_create with an upper sequence length.

Those are my two quick thoughts.

Good luck!

Dom
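
The two workarounds above could look roughly like the following in R. This is a sketch under assumptions: btchsz is the batch-size parameter used elsewhere in this thread, and max_length is assumed to be the db_create() argument for an upper sequence length (check ?restez::db_create for your installed version). The wrapper function names are hypothetical.

```r
# Sketch only: these wrappers are hypothetical, and calling them requires
# phylotaR, restez and BLAST+ to be installed.

# Workaround 1: re-run the pipeline with a smaller batch size, so fewer
# records are pulled from the database per query.
rerun_small_batches <- function(wd, txid, ncbi_dr, btchsz = 50) {
  phylotaR::setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr,
                  overwrite = TRUE, btchsz = btchsz)
  phylotaR::run(wd = wd)
}

# Workaround 2: rebuild the restez database while skipping long records.
# Assumption: db_create() accepts max_length (see ?restez::db_create).
rebuild_restez_db <- function(restez_path, max_length) {
  restez::restez_path_set(restez_path)
  restez::db_create(db_type = "nucleotide", max_length = max_length)
}
```

An existing restez database may need to be removed before it can be rebuilt with a different length limit.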

On Wed, Jun 10, 2020 at 5:45 AM Alexandre Pedro <[email protected]> wrote:

Dear Dom,

Please, no need to apologize, I'm really grateful that you made such a useful
program. Thanks so much for the fast reply!

It ran a little longer with batch = 50, but it returned the same error after a
while... I started a new database with max length = 4000 but it'll take a while
to finish. I'll keep you posted!

Again, thanks so much for the prompt support!

Best

Alex

On Thu, 11 Jun 2020 at 06:49, Alexandre Pedro <[email protected]> wrote:

Hi Dom,

unfortunately reducing the database did not solve the problem...I made
one with a max length of 50.000, and was able to reduce from 100Gb to
just 20Gb. Although I feel that this will improve my analyses downstream,
I'm still stuck with the same message,
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

The thing is, I'm not sure if the problem is with my computer settings...
That message shows up right at the beginning of the download phase,
and complains about short sequences such as environmental samples
with just 600bp. But the real strange thing is that my mac does not reach
any saturation point: I keep an eye on the Activity Monitor app, and neither
my RAM nor CPU gets saturated with the analysis. It feels like my R is
not using the full resources available on my computer, even though I
changed the default setup to 8 cpus... RStudio does not go above 200M
in memory or 1% of CPU usage... So it doesn't look like a resource overload.

I tried several different things, as you can see in the script I have attached.
Could you please tell me if I am missing something?

Thank you for your time and patience.

P.S. I downloaded an Araceae subclade (txid <- 284555 # Aroideae) without restez
with no problem, but when I try the exact same thing from restez I get that error
message. I'd really like to learn how to use the restez resource, so please let me
know if there are tests we can make to get to the bottom of this issue.

On Thu, Jun 11, 2020 at 6:16 AM Dominic John Bennett <[email protected]> wrote:

Hmmm... I'm not sure it is your computer. We shouldn't be seeing the CPUs being used
until the BLAST stage. As for the RAM usage that indicates that we're not extracting
from the database more than you can handle. This makes sense given you've reduced
the database size and the batch size.

It's interesting that you're breaking on the same database query involving these records:
https://www.ncbi.nlm.nih.gov/nuccore/MN099113.1/ and https://www.ncbi.nlm.nih.gov/nuccore/MN099112

Are you excluding environmental sequences to try and avoid these? It looks like your
search query isn't working. I think in your query where you have "NOT environmental"
you need to specify a field. E.g. to specify no mention of environmental in the record
title: "NOT environmental [Ti]". Are you aware that you can play around with what
parameters work here: https://www.ncbi.nlm.nih.gov/nuccore/advanced

Best,
Dom

P.S. Not sure why parameters_reset() isn't working!

On Fri, 12 Jun 2020 at 01:52, Alexandre Pedro <[email protected]> wrote:

Hi Dom,

I'm sorry to bother you with this, but I'd really like to understand what I could be
doing wrong.

Thanks for the suggestion, I excluded the environmental samples the correct
way. However, I'm still having issues. To rule out any restez/PhylotaR
miscommunication I might be doing, I ran just the PhylotaR on the Araceae
clade (without the restez), and after many hours it was downloaded successfully.
Nevertheless, the analysis halted again at the CLUSTER^2 stage with the error message:

Error in stages_run(wd = wd, frm = frm, to = nstages, stgs_msg = stgs_msg,  : 
    Unexpected Error in Ops.factor(blast_res[["query.id"]], blast_res[["subject.id"]]) : 
level sets of factors are different

Occurred [2020-06-11 14:03:21]
Contact package maintainer for help.

I tried a different large taxonomic group, Serpentes, to see if the problem is with
Araceae or even plants, but I get the same error. Only very small groups such as
the Aotus example work. Other lab colleagues using different computers (all Macs)
are having the same issue: we can't complete the run phase, with or without restez.
The issue seems to go away with the older version of PhylotaR (1.0), but it is much
slower. We've been following the github updates and a lot is being implemented in
version 1.2 (such as restez integration), so we would like to use the most up-to-date
version. Please let me know how we can try to understand this, and you can count
on me to run tests to sort this out.

Thanks so much

Alex

On Fri, Jun 12, 2020 at 4:36 AM Dominic John Bennett <[email protected]> wrote:

OK. Well at least we fixed the restez problem for now. It's probably an unexpected format
of the environmental sequences that caused the issue.

With the cluster^2 step... would it be possible for you to share your results folder?
Perhaps you compress it and upload it to a shared folder/link on google drive or
similar? Then I'd be able to investigate more properly what is going wrong with
the later version of phylotaR.

You're 95% there in terms of the pipeline run... thanks for your persistence!

Dom

On Fri, Jun 12, 2020 at 4:50 AM Alexandre Pedro <[email protected]> wrote:

Hi Dom,

Gosh, thank you for your 100% patience!! After a long day I was able to download a
very large clade within Serpentes using just the PhylotaR, so it might be just as you
said, some weirdo sequences in a particular plant taxon must be causing problems
with the overall phylotaR run.

OK, I just started a fresh run for the Araceae using the same parameters just to make
sure and clean up previous frustrating tryouts, and will send you the results as soon
as I can. I've officially turned into a night owl during this pandemic, so I'm heading to
bed now at 5 am...If it goes exactly as it did earlier today, by the time I wake up it
will have finished and I'll send it right away.

Thanks so very much!!

Alex

On Fri, Jun 12, 2020 at 6:12 AM Alexandre Pedro <[email protected]> wrote:

Hi,

It finished pretty quickly, and at the exact same stage with the same error message.
You can download it from this link: [LINK REMOVED]

Please let me know if you have any issues with the file

Thanks!

On Fri, 12 Jun 2020 at 11:14, Alexandre Pedro <[email protected]> wrote:

Just to make everything easier, here's the setup line I used for the taxon 4454 (Araceae)

setup(wd=wd, txid=txid, ncbi_dr=ncbi_dr, v = TRUE, overwrite = T, btchsz = 100, ncps = 8,
db_only = F, srch_trm = "NOT predicted[TI] NOT \"whole genome shotgun\"[TI] NOT unverified[TI] NOT \"synthetic construct\"[Organism] NOT refseq[filter] NOT TSA[Keyword] NOT environmental[TI]")
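
The search term in the setup call above is a chain of NOT clauses, each carrying an NCBI field tag. A small base-R sketch of assembling the same string from a vector, which makes a clause without a field tag easier to spot; the variable names are illustrative only:

```r
# Exclusion clauses copied from the setup call above, one per element.
exclusions <- c(
  'predicted[TI]',
  '"whole genome shotgun"[TI]',
  'unverified[TI]',
  '"synthetic construct"[Organism]',
  'refseq[filter]',
  'TSA[Keyword]',
  'environmental[TI]'
)

# Flag any clause that does not end in a [field] tag.
missing_field <- exclusions[!grepl("\\[.+\\]$", exclusions)]

# Prefix each clause with NOT and join into one search term.
srch_trm <- paste("NOT", exclusions, collapse = " ")
```

The resulting srch_trm can then be passed to setup() through its srch_trm argument, as in the call above.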

On Sat, Jun 13, 2020 at 7:55 AM Dominic John Bennett <[email protected]> wrote:

Hi Alex,

It seemed to be quite a minor error in the end to do with factors -- not quite sure why it wasn't an issue in the older phylotar version.

You should be able to run the complete pipeline with a newly updated package, installed with:

remotes::install_github("ropensci/phylotaR")
# after reinstalling, to be safe, I would restart the R session to make sure
# you're using the newly installed package
phylotaR::clusters2_run(wd = "[YOUR WORKING DIRECTORY]")

Thanks for your efforts, it made fixing a lot easier.

Dom
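
For readers who meet the same message: base R refuses to compare two factors with == when their level sets differ, which is what the Ops.factor error earlier in this thread reports. A minimal sketch with made-up sequence IDs; converting to character is one common way around it and is not necessarily the change that was made in phylotaR:

```r
# Made-up BLAST-like results; the two ID columns are factors whose
# level sets differ (s2 appears only in query.id, s4 only in subject.id).
blast_res <- data.frame(
  query.id = factor(c("s1", "s2", "s3")),
  subject.id = factor(c("s1", "s3", "s4"))
)

# Comparing the factors directly raises an error; capture its message.
cmp <- tryCatch(
  blast_res[["query.id"]] == blast_res[["subject.id"]],
  error = function(e) conditionMessage(e)
)

# Comparing the underlying strings works regardless of factor levels.
self_hits <- as.character(blast_res[["query.id"]]) ==
  as.character(blast_res[["subject.id"]])
```

Here cmp ends up holding the error message rather than a logical vector, while self_hits is an ordinary logical vector.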

On Sat, 13 Jun 2020 at 21:59, Alexandre Pedro <[email protected]> wrote:

Hi Dom, 

It works perfectly. I got curious why this error did not show up in
smaller clades or even the large Serpentes clade I tried...But if it's
solved, it's solved!

I'm really happy we (you) could figure it out! I used the first phylota
browser during my MSc, but as it started to get outdated over the years
I was really sad because that idea was a major breakthrough for
phylogeneticists. So you can imagine how excited I got when your
version came out. I'm really grateful (as certainly is the entire
phylogenetics community).

Please count on me if you need anything.

Thank you so much for your patience and time. My students and lab
colleagues are also very grateful.

Best

Alex

On Sat, Jun 13, 2020 at 5:32 PM Dominic John Bennett <[email protected]> wrote:

Thanks Alex! I'm glad we could work it out.

It worked for the smaller clades because the smaller clades wouldn't have
needed the cluster^2 step -- clustering of clusters is only needed for the
clades where there are so many sequences/species the number of
possible combinations is too great for a single cluster step.

I was just wondering, would it be possible for me to share a redacted version
of our conversation on phylotaR's GitHub
page (https://github.com/ropensci/phylotaR/issues)? I think it's good for
others to see how potential problems can be fixed.

Best,
Dom

Oh right, that makes absolute sense.

Of course you can share it! Again, thank you for the superb support for your program(s).

Best

Alex

you probably don't want to link to the http://phylota.net/ website anymore

see http://phylota.net/

@DomBennett
