phylotar's People

Contributors

dombennett, hettling, jeroen, maelle, shixiangwang

phylotar's Issues

Write documentation

Only one file (ncbi-remote.R) is properly documented so far. We need documentation for all functions that will be exposed to users such that documentation .Rd files can be generated.
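
As a rough illustration of what this issue asks for, a roxygen2-style comment block could look like the sketch below. The function `count_sequences` and its argument are hypothetical and not part of the package; only the `#'` comment layout and the `@param`, `@return` and `@export` tags follow standard roxygen2 conventions, from which .Rd files can be generated.

```r
#' Count sequences per taxon (hypothetical example function)
#'
#' @param txids Character vector of taxon IDs, one entry per sequence.
#' @return Named integer vector giving the number of sequences per taxon ID.
#' @export
count_sequences <- function(txids) {
  counts <- table(txids)
  # Convert the table to a plain named integer vector
  setNames(as.integer(counts), names(counts))
}

res <- count_sequences(c("9504", "9504", "4613"))
print(res)
```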

An Error in PhylotaR

Hi Dom,

I am using phylotaR to process data with BLAST+. During the phylotaR run it throws an error that I need your help to solve. Below are the script I used and the error it shows.
The Script I Used:

library(phylotaR)
wd<-'e:/Institute_Of_Bontany/WorldTree/TracheophytaTest'
ncbi_dr<-'d:/blast/blast-2.7.1+/bin/'
txid<-58023
setup(wd=wd,txid=txid,ncbi_dr=ncbi_dr,v=TRUE)


phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Found: [d:/blast/blast-2.7.1+/bin//makeblastdb]
Found: [d:/blast/blast-2.7.1+/bin//blastn]
Setting up pipeline with the following parameters:
. blstn [d:/blast/blast-2.7.1+/bin//blastn]
. btchsz [100]
. date [2018-10-22]
. mdlthrs [3000]
. mkblstdb [d:/blast/blast-2.7.1+/bin//makeblastdb]
. mncvrg [51]
. mnsql [250]
. mxevl [1e-10]
. mxnds [1e+05]
. mxrtry [100]
. mxsql [2000]
. mxsqs [50000]
. ncps [1]
. txid [58023]
. v [TRUE]
. wd [e:/Institute_Of_Bontany/WorldTree/TracheophytaTest]

run(wd=wd)

The Error it shows:
Generating taxonomic dictionary ...
Unexpected Error in FUN(X[[i]], ...) : Unable to get "PRNT" slot from object without slot basic category ("NULL")

Occurred [2018-10-23 09:06:12]
Contact package maintainer for help.
Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
Unexpected Error in FUN(X[[i]], ...) : Unable to get "PRNT" slot from object without slot basic category ("NULL")

Best Wishes,
Qiang

Decide on what the package output should be

At the moment, the code produces tables as in the phylota database, so the output can be read into supersmart.

We should think carefully about what a possible user might want from this package; I suppose some tsv files that can be read into sqlite are not so attractive.

Maybe it should produce a fasta file per cluster?
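
A minimal sketch of the "one fasta file per cluster" idea, assuming clusters are held as a named list of named character vectors. That input structure, the function name `write_cluster_fasta` and the example sequences are all hypothetical; the package's real cluster representation may differ.

```r
# Hypothetical input: one entry per cluster, each a named character vector
# of sequences (names are sequence IDs).
clusters <- list(
  cluster_0 = c(seq1 = "ACGTACGT", seq2 = "ACGTTCGT"),
  cluster_1 = c(seq3 = "GGCATGCA")
)

# Write one FASTA file per cluster and return the file paths.
write_cluster_fasta <- function(clusters, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(names(clusters), function(nm) {
    seqs <- clusters[[nm]]
    # Interleave ">id" header lines with the sequence lines
    lines <- as.vector(rbind(paste0(">", names(seqs)), unname(seqs)))
    path <- file.path(out_dir, paste0(nm, ".fasta"))
    writeLines(lines, path)
    path
  }, character(1))
}

paths <- write_cluster_fasta(clusters, file.path(tempdir(), "clusters"))
print(paths)
```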

Unable to find correct versions of NCBI BLAST+ tools

I am getting this error (last line):

library(phylotaR)
wd <- /apps/Acanthaceae
Error: unexpected '/' in "wd <- /"
wd <- "/apps/Acanthaceae"
ncbi_dr <- "/usr/lib/ncbi-blast+"
txid <- 4185
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)


phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Failed to run: [/usr/lib/ncbi-blast+/makeblastdb]. Reason:
[Error : Failed to execute '/usr/lib/ncbi-blast+/makeblastdb' (No such file or directory)
]Failed to run: [/usr/lib/ncbi-blast+/blastn]. Reason:
[Error : Failed to execute '/usr/lib/ncbi-blast+/blastn' (No such file or directory)
]Error:Unable to find correct versions of NCBI BLAST+ tools
Error in blast_setup(d = ncbi_dr, v = v, wd = wd) :
Unable to find correct versions of NCBI BLAST+ tools

I think BLAST+ is installed correctly on my computer. Perhaps I have to install some specific database?
If I type:
sudo apt install ncbi-blast+
I get:
Reading package lists... Done
Building dependency tree
Reading state information... Done
ncbi-blast+ is already the newest version (2.6.0-1).
0 upgraded, 0 newly installed, 0 to remove and 83 not upgraded.
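
One possible explanation, not confirmed in this report, is that the executables are installed in a different directory from the one passed as `ncbi_dr`. The sketch below uses base R's `Sys.which()` to look up where `makeblastdb` and `blastn` sit on the PATH; the helper name `find_tool_dir` is made up for illustration.

```r
# Return the directory holding an executable found on the PATH,
# or NA if it cannot be found.
find_tool_dir <- function(tool) {
  path <- unname(Sys.which(tool))
  if (is.na(path) || !nzchar(path)) {
    return(NA_character_)
  }
  dirname(path)
}

blast_dirs <- vapply(c("makeblastdb", "blastn"), find_tool_dir, character(1))
print(blast_dirs)
# If both entries point to the same directory, that directory is a
# candidate value for ncbi_dr in setup().
```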

Drop parallel

It's better to use BLAST's own threads argument, not foreach.
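
A sketch of what this could look like, assuming `blastn` is called through `system2()`. The helper `blastn_args` and the file names are hypothetical placeholders, not the package's actual code; `-num_threads` is BLAST+'s own option for multithreading.

```r
# Build the argument vector for a blastn call that uses BLAST's own
# multithreading. File names here are hypothetical placeholders.
blastn_args <- function(query, db, out, threads = 1L) {
  c("-query", query,
    "-db", db,
    "-out", out,
    "-outfmt", "6",
    "-num_threads", as.character(threads))
}

args <- blastn_args("query.fa", "blastdb", "results.tsv", threads = 4L)
print(args)

# Only attempt the call if blastn is on the PATH and the query file exists.
if (nzchar(Sys.which("blastn")) && file.exists("query.fa")) {
  system2("blastn", args = args)
}
```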

Error in gzfile

Hi,
I am trying to use phylotaR. However, after a long time (around 12 hrs) of downloading sequences, the software throws this error:
Error in stages_run(wd = wd, frm = frm, to = nstages, stgs_msg = stgs_msg, :
Unexpected Error in gzfile(file, "rb") : invalid 'description' argument.
I attach the log and session file
session_info.txt
log.txt

Style issues

  • Irregular line wrapping
  • Extra newlines at the beginning of code
  • Longer class and function names
  • snake_case for functions

Internal handling of sequences

Right now, sets of sequences are represented as lists with gis as keys. For a package that should be built to interact with other packages (e.g. ape), we might want to consider using existing classes for sequence (and also taxonomy node) representations.

Fix TODO's

In quite a few places in the code, I marked dirty bits or things that need attention with 'TODO'. These should be thoroughly investigated.
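
To get an overview of the open markers, something like the following base R sketch could be used. The function name `find_todos` and the default directory `"R"` are assumptions for illustration.

```r
# List every line containing "TODO" in the .R files under a directory.
find_todos <- function(dir = "R") {
  files <- list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE,
                      recursive = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep("TODO", lines, fixed = TRUE)
    data.frame(file = rep(f, length(idx)), line = idx,
               text = trimws(lines[idx]), stringsAsFactors = FALSE)
  })
  # Returns NULL when no .R files are found
  do.call(rbind, hits)
}

# Example usage from the package root:
# todos <- find_todos("R")
# print(todos)
```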

Amount of data accessible with this pkg?

👋 as part of preparing an rOpenSci annual report, we're trying to estimate amount of data the various pkgs in our suite provide access to.

Do you have a sense for how much data (e.g., in GB) one can access through this pkg? And/or whatever metric is most relevant for this data (sequences maybe?)?

makeblastdb error

Reviewer could not run pipeline due to an error with makeblastdb.

> # RUN PIPELINE
> txid <- 9504
> setUp(wd="/Users/naupaka/Desktop/phylota_review/aotus", txid=txid, ncbi_dr=ncbi_dr, v=TRUE)
-----------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v0.1]
-----------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/usr/local/bin/makeblastdb]
Found: [/usr/local/bin/blastn]
Setting up pipeline with the following parameters:
. blstn      [/usr/local/bin/blastn]
. btchsz     [300]
. date       [2018-03-26]
. mdlthrs    [3000]
. mkblstdb   [/usr/local/bin/makeblastdb]
. mncvrg     [51]
. mnsql      [250]
. mxevl      [1e-10]
. mxnds      [1e+05]
. mxrtry     [100]
. mxsql      [2000]
. mxsqs      [50000]
. ncps       [1]
. txid       [9504]
. v          [TRUE]
. wd         [/Users/naupaka/Desktop/phylota_review/aotus]
-----------------------------------------------
> run(wd=wd)
... Taxise
... Download
... Cluster
Error in runStgs(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
  Unexpected Error in error(ps = ps, paste0("makeblastdb failed to run. Check BLAST log files.")) :
  Error: makeblastdb failed to run. Check BLAST log files.


Occurred [2018-03-26 06:37:35]
Contact package maintainer for help.

failing to run drop_by_rank()

Hi,
I am trying to reduce the sequence data by using drop_by_rank()
but getting the following error:

Error in phylota@sqs@sqs[[i]] :
attempt to select less than one element in get1index

I am running the pipeline according to the tutorial:

library(phylotaR)
wd <- '/media/...
ncbi_dr <- '/home/...
txid <- 4747
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE)
run(wd = wd)

all_clusters <- read_phylota(wd)
print(all_clusters)
cids <- all_clusters@cids
n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
keep <- cids[n_taxa > 10]
selected <- drop_clstrs(phylota = all_clusters, cid = keep)
reduced <- drop_by_rank(selected, rnk = 'species', n=1)

My experience with R is unfortunately quite limited and I appreciate the help.
Thanks!

Error in run.R II

The clusters.ci_gi.seqs.create function throws the following error:

Counting species for taxon 4613
Number of sequences for taxon 4613 : 16929
Too many seqs to blast for taxid 4613 ... retrieving children
Error in .children(current.taxon, nodes) : unused argument (nodes)

Reproducible example

#try out to get a molecular alignment using Phylotar

#libraries
#source the phylotaR package
source("blast.R")
source("ci_gi.R")
source("cl.R")
source("clusters.R")
source("db.R")
source("ncbi-remote.R")
source("nodes.R")
source("query-local.R")

#run an analysis following the run.sh script, this is the run.r copied
library(foreach)
library(doMC)

set.seed(111)

options(error=recover)

## Adjustable parameters:
## Maximum number of sequences per species
MODEL.THRESHOLD <<- 3000
## Maximum number of sequences to blast in a single run; if taxon has more subtree sequences
## than that, its children will get clustered
MAX.BLAST.SEQS <<- 10000
## Maximum characters in one sequence
MAX.SEQUENCE.LENGTH <<- 25000
## directory for sequence cache; will be created if does not exist
SEQS.CACHE.DIR <<- "./sequences/"
## Download file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz,
## unzip and specify directory where file 'nodes.dmp' is located
taxdir <- 'taxdmp'
## Number of processing units
CORES <<- 4

registerDoMC(CORES)

## Do analysis for Bromeliaceae family
taxid <- 4613
nodes.create(taxid, taxdir=taxdir, file.name='dbfiles-bromeliaceae-nodes.tsv')

clusters.ci_gi.seqs.create(4613, 'dbfiles-bromeliaceae-nodes.tsv', 
                           files=list(clusters='dbfiles-bromeliaceae-clusters.tsv',
                                      ci_gi='dbfiles-bromeliaceae-ci_gi.tsv',
                                      seqs='dbfiles-bromeliaceae-seqs.tsv'))

sessionInfo

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] foreach_1.4.3         CHNOSZ_1.1.0          XML_3.98-1.9          rentrez_1.1.0         igraph_1.1.2         
 [6] RSQLite_2.0           data.table_1.10.4-3   bindrcpp_0.2          speciesgeocodeR_2.0-8 sp_1.2-5             
[11] forcats_0.2.0         stringr_1.2.0         dplyr_0.7.4           purrr_0.2.4           readr_1.1.1          
[16] tidyr_0.7.2           tibble_1.3.4          ggplot2_2.2.1         tidyverse_1.2.0      

loaded via a namespace (and not attached):
 [1] viridis_0.4.0     httr_1.3.1        bit64_0.9-7       jsonlite_1.5      viridisLite_0.2.0 modelr_0.1.1     
 [7] assertthat_0.2.0  blob_1.1.0        cellranger_1.1.0  yaml_2.1.14       lattice_0.20-35   glue_1.2.0       
[13] digest_0.6.12     rvest_0.3.2       colorspace_1.3-2  picante_1.6-2     Matrix_1.2-11     plyr_1.8.4       
[19] psych_1.7.8       pkgconfig_2.0.1   broom_0.4.2       raster_2.5-8      haven_1.1.0       scales_0.5.0     
[25] mgcv_1.8-20       lazyeval_0.2.1    cli_1.0.0         mnormt_1.5-5      magrittr_1.5      crayon_1.3.4     
[31] readxl_1.0.0      memoise_1.1.0     nlme_3.1-131      MASS_7.3-47       xml2_1.1.1        foreign_0.8-69   
[37] vegan_2.4-4       tools_3.4.2       hms_0.3           geosphere_1.5-7   munsell_0.4.3     cluster_2.0.6    
[43] compiler_3.4.2    rlang_0.1.4       grid_3.4.2        iterators_1.0.8   rstudioapi_0.7    codetools_0.2-15 
[49] gtable_0.2.0      curl_3.0          DBI_0.7           reshape2_1.4.2    R6_2.2.2          gridExtra_2.3    
[55] lubridate_1.7.1   knitr_1.17        rgdal_1.2-15      rgeos_0.3-26      bit_1.1-12        bindr_0.1        
[61] permute_0.9-4     ape_5.0           stringi_1.1.5     parallel_3.4.2    Rcpp_0.12.13     

Readme issues

  • no need for 'in dev' tag
  • specify min BLAST
  • more details on pipeline
  • update names for pipeline

Error: blastn failed to run. Check BLAST log files

Hi. I'm trying to run the following commands

wd <- "C:/Project_Crass/11_R/PhylotaR"
ncbi_dr <- "C:/Program Files/NCBI/blast-2.7.1+/bin"
txid <- 3784
setUp(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = T)
run(wd = "C:/Project_Crass/11_R/PhylotaR")

but I get an error when the cluster stage starts

Error in runStgs(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
Unexpected Error in error(ps = ps, "blastn failed to run. Check BLAST log files.") :
Error: blastn failed to run. Check BLAST log files.

Could you guide me on how to solve this error?
I have already performed the same procedure with different taxa and it always stops at the CLUSTER stage.
I attach the log file, thank you
log.txt

problem with the restez/phylota integration

Issue(s) resolved via email ....


On Tue, Jun 9, 2020 at 8:20 PM Alexandre Pedro <[email protected]> wrote:

Dear Dominic

I'm a Post-Doc at the Federal University of Rio de Janeiro working with birds
and vertebrate macroevolution and biogeography in general. I really appreciate
your efforts to create and maintain the new PhylotaR, and I'm absolutely sure
this will be one of the most useful phylogenetic tools to work with GenBank data
from now on.

I'm having a problem with the restez/phylota integration, not sure why. I really
tried to solve everything without having to send you this, but decided to do so
after the program told me to contact the package maintainer:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in if (ps[["multiple_ids"]]) { : argument is of length zero

Occurred [2020-06-09 19:57:45]
Contact package maintainer for help.


I'm trying to do a PhylotaR search on the Araceae family (txid 4454), which is
taking too long to run through the PhylotaR alone, so I tried the restez/phylotaR
integration. It took quite a while to download the plant database from GB, but I
apparently managed to create the restez database (I can access with the
connect/disconnect restez commands). However, I'm stuck on what seems to
be a makeblastdb error...I get the following error message when trying to set up:

setup(wd = wd, txid = txid, ncbi_dr=ncbi_dr, v=T)
 -------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.2.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/usr/local/ncbi/blast/bin/makeblastdb]
Found: [/usr/local/ncbi/blast/bin/blastn]
. . Running makeblastdb
Error: makeblastdb failed to run. Check BLAST log files.

and the following log file in the recently created blast folder:

BLAST options error: Please provide a database name using -out

I'm using:
blast 2.9
Rstudio version 1.2.1335 with R version 3.6
Mac OS Mojave 10.14.6

Thank you so much for your time and patience.

Please let me know if you need further information.

Best regards

Alexandre 
On Tue, Jun 9, 2020 at 8:23 PM Alexandre Pedro <[email protected]> wrote:

Sorry, forgot to say the phylota version is 1.2.0 and restez is 1.0.2
On Wed, 10 June 2020 at 09:03, Alexandre Pedro <[email protected]> wrote:

Dear Dominic,

I've solved the issue with makeblastdb (I had two conflicting versions in my R $path). But nevertheless, I'm still unable to run the PhylotaR with my restez database. I get the following message:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in .local(conn, statement, ...) : 
Unable to execute statement 'SELECT accession,raw_record,raw_sequence FROM nucleotide WHERE accession IN ('MN099113','MN099112','...'.
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

Occurred [2020-06-10 03:52:57]
Contact package maintainer for help.

I know the Araceae family is a large group, but I can't run even the Bromeliaceae
example from the PhylotaR webpage. I have over 100Gb free in my HD; 16G of RAM
and 4 i7 CPUs.

I'm sending you the log file as well just to make sure I'm not missing anything in the setup settings.

Thank you so very much for your time and patience.

Best

Alex
On Wed, Jun 10, 2020 at 4:28 AM Dominic John Bennett <[email protected]> wrote:

Hi Alex,

I'm sorry you're experiencing problems.

That looks to me like an SQL type error. It's probably due to MN099113 ...
having larger records than can be allocated in memory. It's not impossible
that the selected records' combined compressed size is greater than 16 GB or
whatever your OS limits the request to.

I see two possible workarounds... 

First, you could try re-running phylotar with a much lower batch size thereby
reducing the number of records to pull out of the database each time.

Second, you could try excluding the largest records from the database by
re-running db_create with an upper sequence length.

Those are my two quick thoughts.

Good luck!

Dom
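
The first workaround suggested above, a lower batch size, can be illustrated with a small base R sketch. The accession IDs are made up and `split_into_batches` is a hypothetical helper, not phylotaR's internal code; it only shows how a smaller batch size reduces the number of records requested from the database in any one query.

```r
# Split a vector of IDs into consecutive batches of at most batch_size.
split_into_batches <- function(ids, batch_size) {
  split(ids, ceiling(seq_along(ids) / batch_size))
}

ids <- sprintf("ACC%04d", 1:230)  # made-up accession IDs
batches <- split_into_batches(ids, batch_size = 50)

# Number of records that would be requested per query
print(lengths(batches))
```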
On Wed, Jun 10, 2020 at 5:45 AM Alexandre Pedro <[email protected]> wrote:

Dear Dom,

Please, no need to apologize, I'm really grateful that you made such a useful
program. Thanks so much for the fast reply!

It ran a little longer with batch = 50, but it returned the same error after a
while... I started a new database with max length = 4000 but it'll take a while
to finish. I'll keep you posted!

Again, thanks so much for the prompt support!

Best

Alex
On Thu, 11 Jun 2020 at 06:49, Alexandre Pedro <[email protected]> wrote:

Hi Dom,

unfortunately reducing the database did not solve the problem...I made
one with a max length of 50.000, and was able to reduce from 100Gb to
just 20Gb. Although I feel that this will improve my analyses downstream,
I'm still stuck with the same message,
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

The thing is, I'm not sure if the problem is with my computer settings...
That message shows up right at the beginning of the download phase,
and complains about short sequences such as environmental samples
with just 600bp. But the real strange thing is that my mac does not reach
any saturation point: I keep an eye on the Activity Monitor app, and neither
my RAM nor CPU gets saturated with the analysis. It feels like my R is
not using the full resource available on my computer, even though I
change the default setup to 8 cpus...RStudio does not go above 200M
in memory or 1% of CPU usage... So it doesn't look like a resource overload.

I tried several different things, as you can see in the script I have attached.
Could you please tell me if I am missing something?

Thank you for your time and patience.

P.S. I downloaded an Araceae subclade (txid<-284555#Aroideae) without restez
with no problem, but when I try the exact same thing from restez I get that error
message. I'd really like to learn how to use the restez resource, so please let me
know if there are tests we can make to get to the bottom of this issue.
On Thu, Jun 11, 2020 at 6:16 AM Dominic John Bennett <[email protected]> wrote:

Hmmm... I'm not sure it is your computer. We shouldn't be seeing the CPUs being used
until the BLAST stage. As for the RAM usage that indicates that we're not extracting
from the database more than you can handle. This makes sense given you've reduced
the database size and the batch size.

It's interesting that you're breaking on the same database query involving these records:
https://www.ncbi.nlm.nih.gov/nuccore/MN099113.1/ and https://www.ncbi.nlm.nih.gov/nuccore/MN099112

Are you excluding environmental sequences to try and avoid these? It looks like your
search query isn't working. I think in your query where you have "NOT environmental"
you need to specify a field. E.g. to specify no mention of environmental in the record
title: "NOT environmental [Ti]". Are you aware that you can play around with what
parameters work here: https://www.ncbi.nlm.nih.gov/nuccore/advanced

Best,
Dom

P.S. Not sure why parameters_reset() isn't working!
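
The advice above about giving each exclusion a field tag can be sketched in base R. The individual clauses are taken from the `srch_trm` value quoted later in this thread; building the string with `paste()` is just one convenient way to keep the clauses readable, not something the package requires.

```r
# Each exclusion carries an explicit field tag such as [TI] or [Organism].
exclusions <- c(
  "predicted[TI]",
  "\"whole genome shotgun\"[TI]",
  "unverified[TI]",
  "\"synthetic construct\"[Organism]",
  "refseq[filter]",
  "TSA[Keyword]",
  "environmental[TI]"
)

# Prefix every clause with NOT and join into a single search term.
srch_trm <- paste(paste("NOT", exclusions), collapse = " ")
cat(srch_trm, "\n")
```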
On Fri, 12 Jun 2020 at 01:52, Alexandre Pedro <[email protected]> wrote:

Hi Dom,

I'm sorry to bother you with this, but I'd really like to understand what I could be
doing wrong.

Thanks for the suggestion, I excluded the environmental samples the correct
way. However, I'm still having issues. To rule out any restez/PhylotaR
miscommunication I might be doing, I ran just the PhylotaR on the Araceae
clade (without the restez), and after many hours it was downloaded successfully.
Nevertheless, the analysis halted again at the CLUSTER^2 stage with the error message:

Error in stages_run(wd = wd, frm = frm, to = nstages, stgs_msg = stgs_msg,  : 
    Unexpected Error in Ops.factor(blast_res[["query.id"]], blast_res[["subject.id"]]) : 
level sets of factors are different

Occurred [2020-06-11 14:03:21]
Contact package maintainer for help.

I tried a different large taxonomic group, Serpentes, to see if the problem is with
Araceae or even plants, but I get the same error. Only very small groups such as
the Aotus example work. Other lab colleagues using different computers (all Macs)
are having the same issue: we can't complete the run phase, with or without restez.
The issue seems to go away with the older version of PhylotaR (1.0), but it is much
slower. We've been following the github updates and a lot is being implemented in
version 1.2 (such as restez integration), so we would like to use the most up-to-date
version. Please let me know how we can try to understand this, and you can count
on me to run tests to sort this out.

Thanks so much

Alex
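
The `Ops.factor` error reported above is standard R behaviour when two factors with different level sets are compared. The sketch below reproduces it with made-up IDs and shows one common way around it, comparing as character. Whether this matches the eventual fix in the package is not stated in this thread.

```r
# Made-up IDs standing in for the query.id and subject.id columns.
query_id   <- factor(c("seq1", "seq2", "seq3"))
subject_id <- factor(c("seq1", "seq4", "seq3"))

# Comparing factors whose level sets differ raises an error,
# which is captured here as a message string.
res <- tryCatch(query_id == subject_id,
                error = function(e) conditionMessage(e))
print(res)

# Comparing the underlying strings works regardless of levels.
is_self_hit <- as.character(query_id) == as.character(subject_id)
print(is_self_hit)
```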
On Fri, Jun 12, 2020 at 4:36 AM Dominic John Bennett <[email protected]> wrote:

OK. Well at least we fixed the restez problem for now. It's probably an unexpected format
of the environmental sequences that caused the issue.

With the cluster^2 step... would it be possible for you to share your results folder?
Perhaps you compress it and upload it to a shared folder/link on google drive or
similar? Then I'd be able to investigate more properly what is going wrong with
the later version of phylotaR.

You're 95% there in terms of the pipeline run.... thanks for your persistence!

Dom
On Fri, Jun 12, 2020 at 4:50 AM Alexandre Pedro <[email protected]> wrote:

Hi Dom,

Gosh, thank you for your 100% patience!! After a long day I was able to download a
very large clade within Serpentes using just the PhylotaR, so it might be just as you
said, some weirdo sequences in a particular plant taxon must be causing problems
with the overall phylotaR run.

OK, I just started a fresh run for the Araceae using the same parameters just to make
sure and clean up previous frustrating tryouts, and will send you the results as soon
as I can. I've officially turned into a night owl during this pandemic, so I'm heading to
bed now at 5 am...If it goes exactly as it did earlier today, by the time I wake up it
will have finished and I'll send it right away.

Thanks so very much!!

Alex
On Fri, Jun 12, 2020 at 6:12 AM Alexandre Pedro <[email protected]> wrote:

Hi,

It finished pretty quickly, and at the exact same stage with the same error message.
You can download it from this link: [LINK REMOVED]

Please let me know if you have any issues with the file

Thanks!
On Fri, 12 Jun 2020 at 11:14, Alexandre Pedro <[email protected]> wrote:

Just to make everything easier, here's the setup line I used for the taxon 4454 (Araceae)

setup(wd=wd, txid=txid, ncbi_dr=ncbi_dr, v = TRUE, overwrite = T, btchsz = 100, ncps = 8,
db_only = F, srch_trm = "NOT predicted[TI] NOT \"whole genome shotgun\"[TI] NOT unverified[TI] NOT \"synthetic construct\"[Organism] NOT refseq[filter] NOT TSA[Keyword] NOT environmental[TI]")
On Sat, Jun 13, 2020 at 7:55 AM Dominic John Bennett <[email protected]> wrote:

Hi Alex,

It seemed to be quite a minor error in the end to do with factors -- not quite sure why it wasn't an issue in the older phylotar version.

You should be able to run the complete pipeline with a newly updated package from 

remotes::install_github("ropensci/phylotaR")
# after reinstalling, to be safe, I would restart the R session to make sure
# you're using the newly installed package
phylotaR::clusters2_run(wd = "[YOUR WORKING DIRECTORY]")

Thanks for your efforts, it made fixing a lot easier.

Dom
On Sat, 13 Jun 2020 at 21:59, Alexandre Pedro <[email protected]> wrote:

Hi Dom, 

It works perfectly. I got curious why this error did not show up in
smaller clades or even the large Serpentes clade I tried...But if it's
solved, it's solved!

I'm really happy we (you) could figure it out! I used the first phylota
browser during my MSc, but as it started to get outdated over the years
I was really sad because that idea was a major breakthrough for
phylogeneticists. So you can imagine how excited I got when your
version came out. I'm really grateful (as certainly is the entire
phylogenetics community).

Please count on me if you need anything.

Thank you so much for your patience and time. My students and lab
colleagues are also very grateful.

Best

Alex
On Sat, Jun 13, 2020 at 5:32 PM Dominic John Bennett <[email protected]> wrote:

Thanks Alex! I'm glad we could work it out.

It worked for the smaller clades because the smaller clades wouldn't have
needed the cluster^2 step -- clustering of clusters is only needed for the
clades where there are so many sequences/species the number of
possible combinations is too great for a single cluster step.

I was just wondering, would it be possible for me to share a redacted version
of our conversation on phylotaR's GitHub
page (https://github.com/ropensci/phylotaR/issues)? I think it's good for
others to see how potential problems can be fixed.

Best,
Dom
Oh right, that makes absolute sense.

Of course you can share it! Again, thank you for the superb support for your program(s).

Best

Alex
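
Dom's explanation of why only large clades reach the cluster^2 step can be illustrated with a quick calculation. It assumes, as a simplification, that the work of a single clustering step grows with the number of sequence pairs; this is an illustration of scale only, not phylotaR's actual criterion for when clusters of clusters are built.

```r
# Number of distinct sequence pairs among n sequences.
n_pairs <- function(n) n * (n - 1) / 2

n_seqs <- c(100, 1000, 10000, 50000)
print(data.frame(sequences = n_seqs, pairs = n_pairs(n_seqs)))
```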

Error in run.R

The clusters.ci_gi.seqs.create function in the run.R script drops the following error:

Counting species for taxon 15123
Number of sequences for taxon 15123 : 541
Will process taxon 15123
evaluation # 1:
$i
[1] 1

Processing taxid 15123 # 1 / 1
Attempting to retrieve sequences for taxid 15123
541 seqs for taxon 15123 , less than maximum of 3000 sequences. Retreiving sequences for whole subtree
Going to retrieve 541 sequences for taxid 15123
Retreiving seqs 1 to 500 for taxid 15123
Done retreiving 500 ( 1 to 500 ) seqs for taxid 15123
Retreiving seqs 501 to 541 for taxid 15123
Done retreiving 41 ( 501 to 541 ) seqs for taxid 15123
Finished retreiving 541 sequences for taxid 15123
Writing 541 sequences for taxon 15123 to file ./sequences//15123-max-3000.fa
Making sequence dataframe
Done making sequence dataframe
Processing taxid 15123 of rank genus attempting to make subtree clusters
result of evaluating expression:
<simpleError in .children(taxon, nodes): unused argument (nodes)>
got results for task 1
accumulate got an error result
numValues: 1, numResults: 1, stopped: FALSE
returning status FALSE
numValues: 1, numResults: 1, stopped: TRUE
not calling combine function due to errors
Error in { : task 1 failed - "unused argument (nodes)"

Reproducible example:


#source the phylotaR package
source("blast.R")
source("ci_gi.R")
source("cl.R")
source("clusters.R")
source("db.R")
source("ncbi-remote.R")
source("nodes.R")
source("query-local.R")

#run an analysis following the run.sh script, this is the run.r copied
library(foreach)
library(doMC)

set.seed(111)

options(error=recover)

## Adjustable parameters:
## Maximum number of sequences per species
MODEL.THRESHOLD <<- 3000
## Maximum number of sequences to blast in a single run; if taxon has more subtree sequences
## than that, its children will get clustered
MAX.BLAST.SEQS <<- 10000
## Maximum characters in one sequence
MAX.SEQUENCE.LENGTH <<- 25000
## directory for sequence cache; will be created if does not exist
SEQS.CACHE.DIR <<- "./sequences/"
## Download file ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz,
## unzip and specify directory where file 'nodes.dmp' is located
taxdir <- 'C:/Users/alexander.zizka/Dropbox (Antonelli Lab)/Arbeit/Gothenburg/projects/31_bromeliaceae_distribution/analyses/phylotaR/phylotaR-master/R/taxdmp'
## Number of processing units
CORES <<- 4

registerDoMC(CORES)

## Do analysis for Bromeliaceae family
taxid <- 4613
nodes.create(taxid, taxdir=taxdir, file.name='dbfiles-bromeliaceae-nodes.tsv')

clusters.ci_gi.seqs.create(15123, 'dbfiles-bromeliaceae-nodes.tsv', 
                           files=list(clusters='dbfiles-bromeliaceae-clusters.tsv',
                                      ci_gi='dbfiles-bromeliaceae-ci_gi.tsv',
                                      seqs='dbfiles-bromeliaceae-seqs.tsv'))

SessionInfo

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] foreach_1.4.3         CHNOSZ_1.1.0          XML_3.98-1.9          rentrez_1.1.0         igraph_1.1.2         
 [6] RSQLite_2.0           data.table_1.10.4-3   bindrcpp_0.2          speciesgeocodeR_2.0-8 sp_1.2-5             
[11] forcats_0.2.0         stringr_1.2.0         dplyr_0.7.4           purrr_0.2.4           readr_1.1.1          
[16] tidyr_0.7.2           tibble_1.3.4          ggplot2_2.2.1         tidyverse_1.2.0      

loaded via a namespace (and not attached):
 [1] viridis_0.4.0     httr_1.3.1        bit64_0.9-7       jsonlite_1.5      viridisLite_0.2.0 modelr_0.1.1     
 [7] assertthat_0.2.0  blob_1.1.0        cellranger_1.1.0  yaml_2.1.14       lattice_0.20-35   glue_1.2.0       
[13] digest_0.6.12     rvest_0.3.2       colorspace_1.3-2  picante_1.6-2     Matrix_1.2-11     plyr_1.8.4       
[19] psych_1.7.8       pkgconfig_2.0.1   broom_0.4.2       raster_2.5-8      haven_1.1.0       scales_0.5.0     
[25] mgcv_1.8-20       lazyeval_0.2.1    cli_1.0.0         mnormt_1.5-5      magrittr_1.5      crayon_1.3.4     
[31] readxl_1.0.0      memoise_1.1.0     nlme_3.1-131      MASS_7.3-47       xml2_1.1.1        foreign_0.8-69   
[37] vegan_2.4-4       tools_3.4.2       hms_0.3           geosphere_1.5-7   munsell_0.4.3     cluster_2.0.6    
[43] compiler_3.4.2    rlang_0.1.4       grid_3.4.2        iterators_1.0.8   rstudioapi_0.7    codetools_0.2-15 
[49] gtable_0.2.0      curl_3.0          DBI_0.7           reshape2_1.4.2    R6_2.2.2          gridExtra_2.3    
[55] lubridate_1.7.1   knitr_1.17        rgdal_1.2-15      rgeos_0.3-26      bit_1.1-12        bindr_0.1        
[61] permute_0.9-4     ape_5.0           stringi_1.1.5     parallel_3.4.2    Rcpp_0.12.13     

Change gi codes to accessions

Apparently, gi codes are on their way out of the NCBI databases. Unfortunately the whole code relies on gis, and so does phylota. However, this might change in the future. We should consider moving to accession numbers.

Possibility to supply NCBI API key when running pipeline to access NCBI data faster?

Hi there,

thanks for your efforts developing phylotaR!

Unfortunately, I'm encountering issues when trying to run phylotaR for larger taxonomic groups (e.g. txid=33630). It happens during the download stage when fetching data from NCBI:

taxise_run(wd = wd)
download_run(wd = wd)

after downloading some part of the data I get

Retrying in [1s] for [fetch]
Retrying in [3s] for [fetch]
Retrying in [6s] for [fetch]
Retrying in [10s] for [fetch]
Retrying in [60s] for [fetch]
Retrying in [300s] for [fetch]
Retrying in [300s] for [fetch]
Retrying in [300s] for [fetch]
...

I therefore wondered if it's possible to set an NCBI API key to access the NCBI data faster as described in this blog post https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/?

Many thanks in advance for your help!

Jan
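
For reference, the rentrez package (mentioned elsewhere on this page as the package used for remote NCBI access) reads an API key from the `ENTREZ_KEY` environment variable, which `rentrez::set_entrez_key()` sets for the session. The sketch below shows that mechanism only; whether phylotaR's download stage picks the key up is an assumption that this issue does not confirm, and the key shown is a placeholder.

```r
# Placeholder only: replace with a key generated from your own NCBI account.
api_key <- "YOUR_NCBI_API_KEY"

if (requireNamespace("rentrez", quietly = TRUE)) {
  # rentrez stores the key in the ENTREZ_KEY environment variable
  rentrez::set_entrez_key(api_key)
} else {
  # Same effect without rentrez installed
  Sys.setenv(ENTREZ_KEY = api_key)
}

# Confirm that a key is now set for this R session
print(nzchar(Sys.getenv("ENTREZ_KEY")))
```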

PhylotaR setup issue

Hi,
I'm trying to run phylotaR in R 3.4.4 with the following code:

devtools::install_github('ropensci/phylotaR')
library(phylotaR)
wd<-getwd()
ncbi_dr<-"C:/Program Files/ncbi-blast-2.7.1+/bin"
txid<-9504
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)

then I get:


phylotaR: Implementation of PhyLoTa in R [v1.0.0]

Checking for valid NCBI BLAST+ Tools ...
Error in if (tst) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In blast_setup(d = ncbi_dr, v = v, wd = wd) :
NAs introduced by coercion

Do you have any suggestion? I could not find a solution for this problem

Best,
Matteo
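
The "NAs introduced by coercion" warning suggests that some text was converted to a number and failed, possibly while reading the BLAST version, although the report does not identify the cause. As a sketch of a more defensive approach, the hypothetical helper below extracts a version from output of the form printed by `blastn -version` and compares it with `numeric_version()` instead of `as.numeric()`.

```r
# Extract the first dotted version number from a line of text.
# Returns NULL when no version-like pattern is present.
parse_blast_version <- function(version_output) {
  first_line <- version_output[1]
  m <- regmatches(first_line, regexpr("[0-9]+(\\.[0-9]+)+", first_line))
  if (length(m) == 0) {
    return(NULL)
  }
  numeric_version(m)
}

v <- parse_blast_version("blastn: 2.7.1+")
print(v)
print(v >= "2.0")
```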

Description issues

  • Version formatted as major.minor.patch (e.g. 0.1.0)

  • The Date field should not be used

  • For the URL and BugReports entries, it would be better if these
    pointed to the repository that holds the source code (see Repos issues
    section below).

  • The License slot says GPL-2 but your actual LICENSE is MIT, this needs
    to be resolved.

Remove old code

There is quite a lot of old code in the repository: clusters.R will be replaced with the contents of cl.R. There is also a db.R which connects to an SQLite database; I suppose we won't need that for a proper package. Also, run.R should not be in the package.

Resolve BLAST dependency

The code relies on blastn for the clustering. I'm not sure if this is acceptable for a CRAN (or Bioconductor or similar) submission.

This should be checked and if it is impossible to depend on an external tool, blastn should be included in the package.

As far as I know, all blast R packages do remote blasting, which is however not an option for this package.

Local vs remote NCBI access

At the moment there is a hybrid version of NCBI database access. nodes.R relies on a local copy of the NCBI taxonomy nodes table, but the retrieval of sequences for clustering is done with remote access using the package rentrez.

nodes.R also has remote access to NCBI taxonomy implemented, but this needs to be debugged.

No periods or tildes in file paths

Hi Dom,

when I try to run a dummy analysis I run into the issue that my $HOME directory is /Users/rutger.vos, which the code wants to parse on the period. It tries to instantiate a log file called /Users/rutger.log, which is not allowed.

In trying to circumvent this I tested whether I can use a tilde instead (i.e. for my $HOME, use ~) but it doesn't like that either.

Best,

Rutger
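
The behaviour described above is what happens when a path is cut at its first period. A small sketch contrasting that with building the log path from the directory as a whole; the function names and the log_name default are hypothetical and not taken from the package.

```r
# Naive approach: cut at the first period anywhere in the path.
# This truncates directories such as "/Users/rutger.vos".
naive_log_path <- function(wd) {
  paste0(sub("\\..*$", "", wd), ".log")
}

# Safer approach: expand a leading tilde, then place the log file inside the
# working directory without parsing the directory name at all.
safe_log_path <- function(wd, log_name = "log.txt") {
  file.path(path.expand(wd), log_name)
}

naive <- naive_log_path("/Users/rutger.vos")  # "/Users/rutger.log"
safe <- safe_log_path("/Users/rutger.vos")    # "/Users/rutger.vos/log.txt"
```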

makeblastdb failing to run

Hello,

I'm having trouble running the pipeline due to an error in BLAST.

This is my code and, on the last line, the error.

> library(phylotaR)
> wd <- 'C:/bioinfo/NCBI/Serpentes'
> ncbi_dr <- 'C:/bioinfo/NCBI/blast-2.10.0+/bin'
> txid <- 8570
> setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = TRUE)
-------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.2.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [C:/bioinfo/NCBI/blast-2.10.0+/bin/makeblastdb]
Found: [C:/bioinfo/NCBI/blast-2.10.0+/bin/blastn]
. . Running makeblastdb
Error: makeblastdb failed to run. Check BLAST log files.

makeblastdb failed to run

Hi Dom,

I could not run the pipeline due to an error with makeblastdb.

This is my code:

wd <- "C:/Users/jorge/Documents/Mente/BIOVERA humboldt"
ncbi_dr <- "C:/Program Files/NCBI/blast-2.7.1+/bin"
txid <- 241806 

setUp(wd=wd, txid=txid, ncbi_dr=ncbi_dr)
run(wd=wd)

Everything works until the cluster stage. Then the following message appears:
Error in error(ps = ps, paste0("makeblastdb failed to run. Check BLAST log files.")) :
Error: makeblastdb failed to run. Check BLAST log files.

This is the logfile:
USAGE
makeblastdb.exe [-h] [-help] [-in input_file] [-input_type type]
-dbtype molecule_type [-title database_title] [-parse_seqids]
[-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
[-mask_desc mask_algo_descriptions] [-gi_mask]
[-gi_mask_name gi_based_mask_names] [-out database_name]
[-max_file_sz number_of_bytes] [-logfile File_Name] [-taxid TaxID]
[-taxid_map TaxIDMapFile] [-version]

DESCRIPTION
Application to create BLAST databases, version 2.7.1+

Use '-help' to print detailed descriptions of command line arguments

Error: Too many positional arguments (1), the offending value: humboldt/blast/taxon-1203511-typ-subtree-db.fa
Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: humboldt/blast/taxon-1203511-typ-subtree-db.fa

I attach my log file:
log.txt
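
The offending value in the log starts right after the space in "BIOVERA humboldt", which is consistent with the working directory being split into two arguments. A hedged sketch of quoting paths before they reach the command line; the helper and file names are hypothetical, and whether quoting alone is sufficient for makeblastdb on Windows has not been verified here.

```r
# Build the argument vector for makeblastdb, quoting every file path so a
# working directory with spaces is passed as one argument.
makeblastdb_args <- function(fasta, db_name) {
  c("-in", shQuote(fasta), "-dbtype", "nucl", "-out", shQuote(db_name))
}

wd <- "C:/Users/jorge/Documents/Mente/BIOVERA humboldt"
fasta <- file.path(wd, "blast", "example-db.fa")  # hypothetical file name
args <- makeblastdb_args(fasta, file.path(wd, "blast", "example-db"))

# Not run here: system2 pastes `args` into one command line, so unquoted
# paths with spaces would be split into separate arguments.
# system2("makeblastdb", args)
```

Using a working directory without spaces sidesteps the problem entirely.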

PhylotaR Cluster error: blastn failed to run

Hi! I've been trying to use phylotaR with the code:

wd <- 'C:/Users/maari/Desktop/Trial'
ncbi_dr <- 'C:/Program Files/NCBI/blast-2.9.0+/bin'
txid <- 231623
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v = T)
run(wd = wd)

Everything seems to work fine until the Cluster stage, where I get the following error:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) :
Unexpected Error : blastn failed to run. Check BLAST log files.

BLAST query/options error: '"6' is not a valid output format
Please refer to the BLAST+ user manual.

Could you help solve this error?
I've already tried running the code with different taxa and different R and BLAST versions, but I always seem to get an error in the cluster stage.
I've also tried installing the most recent version of phylotaR using the command:
install.packages('phylotaR')
I'm working on a Windows 10 machine.

Infrastructure and support for blastn alternatives

There are alternatives to NCBI's blastn that are often much faster, but:

  • some are less sensitive
  • some only work with specific sequence types
  • some are only beneficial in certain computing environments (e.g. HPC)

There is a large number: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software
In particular: usearch, blat, megablast, plast, diamond (nucleotide capabilities pending: bbuchfink/diamond#117)

Instead of developing new code for each alternative search tool, provide general input/output functions that users could adapt for their own choice of search tool. The benefits of this solution:

  • no need to worry about ensuring each dependency works across versions and OSs.
  • no need to keep up to date with the latest developments in search tools
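
As a rough illustration of what such a general input function could look like, the sketch below reads hits in BLAST's default tabular layout (-outfmt 6), which several alternative tools can also emit. The function names are hypothetical. The E-value cut-off mirrors the mxevl value printed by setup, but its use here is an assumption.

```r
# Column names of BLAST's default tabular output (-outfmt 6).
blast6_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch",
                 "gapopen", "qstart", "qend", "sstart", "send",
                 "evalue", "bitscore")

# Parse tab-separated hit lines, e.g. the result of readLines() on a
# results file written by any tool that produces this layout.
read_hits <- function(lines) {
  read.table(text = lines, sep = "\t", col.names = blast6_cols,
             stringsAsFactors = FALSE)
}

# Keep hits at or below a maximum E-value.
filter_hits <- function(hits, max_evalue = 1e-10) {
  hits[hits$evalue <= max_evalue, , drop = FALSE]
}

example_lines <- c(
  "seq1\tseq2\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t540",
  "seq1\tseq3\t80.0\t60\t12\t0\t1\t60\t5\t64\t0.001\t45"
)
hits <- filter_hits(read_hits(example_lines))
```

With a fixed table layout as the contract, the search tool itself can be swapped out without touching the clustering code.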

Create unit tests

It would be good if there were standardized tests so that users can verify correct functioning and use them as code samples.

Several species

I am currently working with phylotaR for a group of species. From what I have seen in your paper, the examples and the vignette, phylotaR only works for one txid.
Is there a way to provide several txids to work with species from several groups?
I have tried to supply the IDs of my species as a character vector, but it doesn't seem to work because the program does not advance past taxise.
Best regards,
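
One possible workaround, sketched under the assumption that each txid can simply be run as a separate pipeline in its own working directory. The helper names are hypothetical, the setup() and run() calls follow the usage shown in other issues on this page, and the results of the separate runs would still need to be combined by hand.

```r
# Create one working directory per taxonomic ID and return the paths.
make_run_dirs <- function(base_wd, txids) {
  dirs <- file.path(base_wd, paste0("txid_", txids))
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  stats::setNames(dirs, txids)
}

# Run the pipeline once per ID (requires phylotaR, BLAST+ and network access).
run_each_txid <- function(base_wd, txids, ncbi_dr) {
  dirs <- make_run_dirs(base_wd, txids)
  for (i in seq_along(txids)) {
    phylotaR::setup(wd = dirs[[i]], txid = txids[[i]], ncbi_dr = ncbi_dr, v = TRUE)
    phylotaR::run(wd = dirs[[i]])
  }
  invisible(dirs)
}

# Only the directory step is executed here; the IDs are examples taken from
# other issues on this page.
run_dirs <- make_run_dirs(tempfile("phylotar_"), c(9504, 8570))
```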

Run issues

  • log should include BLAST version
  • sessionInfo needs to be recorded
  • 'Don't panic' statement?
  • More information on what R is actually doing during a run
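
The sessionInfo point could be covered by something like the following sketch, which appends a timestamp and the session details to a plain-text log. The function name and log location are hypothetical. Recording the BLAST version would need an additional call to the BLAST+ executables and is not shown.

```r
# Append a timestamp and the R session details to a plain-text log file.
log_session_info <- function(log_file) {
  lines <- c(
    paste0("Logged: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S")),
    utils::capture.output(utils::sessionInfo())
  )
  con <- file(log_file, open = "a")  # append so earlier entries are kept
  on.exit(close(con))
  writeLines(lines, con)
  invisible(lines)
}

log_file <- tempfile(fileext = ".log")
logged <- log_session_info(log_file)
```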

Documentation issues

  • use devtools::spell_check
  • add return fields to all functions
  • create examples for each function, investigate devtools::use_data
  • what do MAD, type and seed mean?
