bioconductor / uniprot.ws

R Interface to UniProt Web Services

Home Page: https://bioconductor.org/packages/UniProt.ws

R 100.00%

core-package bioconductor-package

uniprot.ws's People

flying-sheep kayla-morrell jmacdon tttpob ii-bioinfo-thomas denghb001 zhilongjia

uniprot.ws's Issues

Does specifying organism_id in mapUniProt constrain return values to that organism?

The following simple example for one gene follows the docs I think:

mapUniProt(
    from = "Gene_Name",
    to = "UniProtKB-Swiss-Prot",
    debug = TRUE,
    columns = c("accession",
                "xref_alphafolddb",
                "xref_pdb",
                "organism_id",
                "xref_chembl",
                "protein_name",
                "xref_geneid"),
    query = list(organism_id = 9606, ids = c("TSPAN6")))

But I get several TSPAN6 genes back from other organisms. Is there an error in my example, or has something changed on UniProt's end in how the organism ID is specified?
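
Until the server-side behaviour is clarified, one possible workaround is to filter the returned table on the client. This is only a minimal sketch: it assumes the result is a data frame with an organism ID column, and the column name `Organism_ID`, the helper `keep_organism` and the toy rows are all hypothetical stand-ins, not actual mapUniProt output.

```r
# Toy stand-in for a mapUniProt() result; column names and rows are assumptions.
res <- data.frame(
  From = c("TSPAN6", "TSPAN6", "TSPAN6"),
  Entry = c("ACC_A", "ACC_B", "ACC_C"),   # placeholder accessions, not real ones
  Organism_ID = c(9606L, 10090L, 9913L),
  stringsAsFactors = FALSE
)

# Keep only the rows whose organism ID matches the requested taxon.
keep_organism <- function(df, taxon, column = "Organism_ID") {
  df[!is.na(df[[column]]) & df[[column]] == taxon, , drop = FALSE]
}

human_only <- keep_organism(res, 9606L)
```

Requesting "organism_id" in `columns`, as the example above already does, is what makes this kind of post-filtering possible.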

select() obscure warning

a <- UniProt.ws::select(up, keys, columns, keytype)
Getting mapping data for P52875 ... and KEGG_ID
Getting extra data for P52875,Q9DCT2,Q9CR16,Q3U0V1,P62737,P23198
Getting extra data for O88685,Q5DTX6,C0HKE5,P60710,Q8VDI7,Q5SF07
Getting extra data for P27773,P60867,Q80WJ1,Q9JIL5,Q3UTZ3,Q99PI4
Getting extra data for Q9CQG2,Q8K2J7,Q9CZJ2,Q8CFE3,Q6P2K6,Q0VAY3
Getting extra data for Q9QZZ4,Q9CYA6,E9Q6B2,Q9CQ92,Q9D4I2,Q5S003
Getting extra data for Q9ER60,P97496,P23506,Q9D0L8,Q8R3R8,Q7TPD0
Getting extra data for Q9QUK4,P0C0S6,Q99LN9,Q9D5A0,P04247,Q923G2
'select()' returned many:many mapping between keys and columns
Warning message:
In matrix(vec2, ncol = 2, byrow = TRUE) :
  data length [377] is not a sub-multiple or multiple of the number of rows [189]

What does this warning message mean? Would it be possible to write a clearer exception?
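
For context, this is base R's warning when matrix() receives a vector whose length does not divide evenly into the requested number of columns. Here 377 values were folded into 2 columns, which suggests (an inference from the message, not something confirmed by the maintainers) that one key/value pair in the response was incomplete. A small sketch reproducing the same kind of warning outside the package:

```r
# An odd-length vector cannot be folded into complete two-column rows.
vec <- as.character(1:5)

msg <- tryCatch(
  {
    matrix(vec, ncol = 2, byrow = TRUE)
    NA_character_                      # reached only if no warning is raised
  },
  warning = function(w) conditionMessage(w)
)

# The same arithmetic as in the report: 377 values do not pair up evenly,
# so R rounds up to 189 rows and recycles a value to fill the last cell.
leftover <- 377 %% 2
rows_needed <- ceiling(377 / 2)
```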

queryUniProt never returns more than 25 results

Thanks for contributing this helpful annotation package.

I was trying to map thousands of UniRef90 accessions and found I can send 100 at a time via queryUniProt, but then I saw that we were missing many annotations, and finally figured out that queryUniProt never returns more than 25 results, although I didn't see this documented. I tried the following code on Windows with R 4.3 and on Linux with R 4.2. I use here a vector of 150 query IDs, but I tried with multiple such vectors from my data.

library(UniProt.ws)
query.v <- c('R6C8N8', 'A0A3A9HRI6', 'R6BVD9', 'R6BMZ2', 'A0A358M5A6', 'R6BRP4', 'R6C9B3', 'R6BLD8', 'A5ZAW9', 'R6CBY1', 'R6BMB3', 'A0A1Y3XLA9', 'R6BBV5', 'A0A0E2SWM2', 'A0A316PRR4', 'A0A1C6AL72', 'R6BSU9', 'R6CC32', 'R6BEV4', 'R6BST6', 'A0A358M7H1', 'A0A374P5I8', 'A0A088F5T2', 'A5ZAX1', 'A0A358M8X5', 'A0A396EI11', 'R6BJS0', 'C6LD63', 'R6BMD0', 'R6BFY9', 'A0A088F5S8', 'R6BQG6', 'A0A3D4MUP0', 'A0A358M6E8', 'R6BDQ2', 'A0A0E2RTL1', 'A0A0E2Q1M3', 'R6C8M4', 'A5Z504', 'R6BD02', 'R6C6P2', 'A5ZAX0', 'A0A373KUA3', 'E4VXQ7', 'R6BPD2', 'R6CAK0', 'R6BLS1', 'R6BCW4', 'R6BIA0', 'R6BD68', 'R6CA17', 'R6BCI1', 'A0A1C6HY36', 'A0A088F9N8', 'R6BS32', 'A0A358M8P8', 'R6BCJ7', 'R6BL83', 'R6BLK9', 'R6BE07', 'A0A0E2SS73', 'A0A0E2T0J7', 'R6C899', 'A0A174XYI8', 'R6BPE5', 'R6BKA0', 'R6BAX6', 'R6BD43', 'UPI000E4BFED0', 'R6C6S3', 'R6BSM4', 'R6BP28', 'R6BNU1', 'A0A374X5R5', 'R6BQI3', 'A0A373KQQ2', 'A0A316T132', 'R6BKT7', 'A0A174GIS1', 'R6CBF6', 'R6BKF0', 'R6BH03', 'R6BS29', 'A0A316T1V8', 'R6CD19', 'A0A380LAC8', 'R6BDS8', 'D4JBQ7', 'R6C9Z2', 'K1GDK7', 'R6BLZ3', 'R6BJ26', 'D1JQF5', 'R6C7G4', 'R6BLL3', 'R6BLE0', 'R6BFA2', 'R6BIF3', 'R6BLD5', 'R6CBT3', 'R6BM30', 'R6BD16', 'R6BRP9', 'R6BLM0', 'A0A3E2TAM6', 'R6BJ41', 'I9VME5', 'R6BM25', 'R6BT30', 'J1H0Z1', 'R6BPQ1', 'R6BG04', 'R6BF42', 'A0A174GZ57', 'R6BSR7', 'R6C7R4', 'A0A1C6BTV1', 'R6BDN0', 'A0A316Q377', 'A0A373KZS3', 'A0A174RJT6', 'R6BCA5', 'A0A078S5I3', 'R6BEA3', 'A0A1H6VIB0', 'A0A174D2Y3', 'R6BVQ7', 'R6BE99', 'B0P2S6', 'R6C8N5', 'A0A1D3TUX2', 'UPI000C837F06', 'R6BSS0', 'R6BDH1', 'R6BRB4', 'R6BVK4', 'R6BF87', 'R6BER1', 'N2B8C0', 'A0A143X6X3', 'R6C7V5', 'R6BSL5', 'R6BER0', 'R6BCS1', 'R6BVU5', 'R6UPC4', 'R6C7N1', 'A0A3D4G7Z6')
qup <- queryUniProt(query = query.v, fields = c("accession", "id",  "gene_names", "protein_name"))
stopifnot(nrow(qup) > 25) # always produces an error

and here's my Windows session info

sessionInfo()

R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] UniProt.ws_2.40.0 RSQLite_2.3.1 BiocGenerics_0.46.0

loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 gtable_0.3.3 xfun_0.39 ggplot2_3.4.2 Biobase_2.60.0 rjsoncons_1.0.0 vctrs_0.6.2
[8] tools_4.3.0 bitops_1.0-7 generics_0.1.3 stats4_4.3.0 curl_5.0.0 tibble_3.2.1 fansi_1.0.4
[15] AnnotationDbi_1.62.1 blob_1.2.4 pkgconfig_2.0.3 BiocBaseUtils_1.2.0 dbplyr_2.3.2 S4Vectors_0.38.1 lifecycle_1.0.3
[22] GenomeInfoDbData_1.2.10 compiler_4.3.0 Biostrings_2.68.0 progress_1.2.2 munsell_0.5.0 GenomeInfoDb_1.36.0 htmltools_0.5.5
[29] RCurl_1.98-1.12 yaml_2.3.7 pillar_1.9.0 crayon_1.5.2 cachem_1.0.8 tidyselect_1.2.0 digest_0.6.31
[36] dplyr_1.1.2 pander_0.6.5 fastmap_1.1.1 grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[43] utf8_1.2.3 prettyunits_1.1.1 filelock_1.0.2 scales_1.2.1 bit64_4.0.5 rmarkdown_2.21 XVector_0.40.0
[50] httr_1.4.6 bit_4.0.5 png_0.1-8 hms_1.1.3 memoise_2.0.1 evaluate_0.21 knitr_1.42
[57] IRanges_2.34.0 BiocFileCache_2.8.0 rlang_1.1.1 Rcpp_1.0.10 glue_1.6.2 DBI_1.1.3 rstudioapi_0.14
[64] jsonlite_1.8.4 R6_2.5.1 zlibbioc_1.46.0
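
A possible workaround, assuming the 25-row cap observed above applies per request, is to send the IDs in batches of at most 25 and bind the results. The sketch below is not the package's own method: `batch_ids`, `fetch_in_batches` and `fake_fetch` are hypothetical helpers, and the real call would be passed in as `fetch`. Whether queryUniProt treats a batch the same way as the 150-ID vector above has not been verified here.

```r
# Split a vector of IDs into consecutive batches of at most `size` elements.
batch_ids <- function(ids, size = 25L) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Run a fetch function on each batch and row-bind the results.
# `fetch` is a placeholder for the real call, e.g.
#   function(b) queryUniProt(query = b, fields = c("accession", "id"))
fetch_in_batches <- function(ids, fetch, size = 25L) {
  do.call(rbind, lapply(batch_ids(ids, size), fetch))
}

# Offline stand-in for the web service so the sketch runs without a network.
fake_fetch <- function(b) data.frame(accession = b, stringsAsFactors = FALSE)

ids <- sprintf("ID%03d", 1:150)
out <- fetch_in_batches(ids, fake_fetch)
```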

UniProt.ws::select() does not work

Hello again,

Please see the code below. When I run it, the following error occurs

#Load Packages
library(UniProt.ws)
library(dplyr)
library(annotables)
library(tibble)

options(url.method="libcurl")
#-----------------------------------

#load uniprot object taxId 9606 is for human
hu <- UniProt.ws(taxId = 9606)

#extract all the ENTREZ_gene names from hu

egs = keys(hu, "ENTREZ_GENE")

res <- UniProt.ws::select(hu, keys = "28976", columns = c("ENTRY-NAME"),keytype = "ENTREZ_GENE")

Getting mapping data for 28976 ... and ACC
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
Error in colnames<-(*tmp*, value = *vtmp*) :
attempt to set 'colnames' on an object with less than two dimensions
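
The final error is a base R message: it appears whenever column names are assigned to something that is not a matrix or data frame. A plausible reading (an assumption, not confirmed from the package source) is that the failed request left an empty vector where a two-column table was expected. The sketch reproduces the message and shows a hypothetical guard, `safe_set_colnames`, that is not part of UniProt.ws:

```r
# What the parser might see when the request returns nothing: a bare vector.
empty_result <- character(0)

err <- tryCatch(
  {
    colnames(empty_result) <- c("From", "To")
    NA_character_                      # reached only if no error is raised
  },
  error = function(e) conditionMessage(e)
)

# Hypothetical guard: only name columns when the object is two-dimensional.
safe_set_colnames <- function(x, nms) {
  if (length(dim(x)) < 2L) return(NULL)   # signal "no data" instead of failing
  colnames(x) <- nms
  x
}
```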

Query UniProt using Entry Name

I am trying to query UniProtKB using the Entry Name and retrieve the corresponding ID (i.e. the entry as defined on the UniProt website). I know Entry Name is not "stable", but unfortunately it is all I have been given for the data. Is this possible with the current implementation?

database <- UniProt.ws(taxId = 9606)
select(x = database,  
            keys = "ALBU_HUMAN", 
            columns = c("ENTRY-NAME", "ENTRY"), 
            keytype = "UNIPROTKB")

Thanks!

Missing IDs

Hi, great package generally, but I see that, for example, the tomato ENSEMBL_GENOME ID Solyc01g090660.2, which can be found on the UniProt website with UniProt ID D0UDC1, is lost when using the package. Any idea why that might be? Is the website more up to date than the package? In fact, nearly 1000 IDs are lost when using the package, at least some of which can be found on the website.

Thanks,
Fred.

no server connection

A simple script

library(UniProt.ws)
up <- UniProt.ws(taxId=9606)

results in the error

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : 
InternetOpenUrl failed: ´A connection with the server could not be established´

I can install any R package, so the internet connection on my machine is fine. Any suggestions on how to debug this further?

getForm error (unsupported protocol) when querying web service

This error is related to the one shown in http://bioconductor.org/checkResults/release/bioc-LATEST/UniProt.ws/tokay2-buildsrc.html

url <- "https://www.uniprot.org/mapping/"
params <- c(from = "P_ENTREZGENEID", to = "ACC", format = "tab", query = "1 2")
RCurl::getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE))
#> Error in function (type, msg, asError = TRUE) : error:14077102:SSL routines:SSL23_GET_SERVER_HELLO:unsupported protocol

Created on 2021-08-27 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2021-08-27                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package          * version  date       lib source        
#>  AnnotationDbi      1.54.1   2021-06-08 [1] Bioconductor  
#>  assertthat         0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
#>  Biobase            2.52.0   2021-05-19 [1] Bioconductor  
#>  BiocFileCache      2.0.0    2021-05-19 [1] Bioconductor  
#>  BiocGenerics     * 0.38.0   2021-05-19 [1] Bioconductor  
#>  Biostrings         2.60.2   2021-08-05 [1] Bioconductor  
#>  bit                4.0.4    2020-08-04 [1] CRAN (R 4.1.0)
#>  bit64              4.0.5    2020-08-30 [1] CRAN (R 4.1.0)
#>  bitops             1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
#>  blob               1.2.2    2021-07-23 [1] CRAN (R 4.1.0)
#>  cachem             1.0.6    2021-08-19 [1] CRAN (R 4.1.1)
#>  cli                3.0.1    2021-07-17 [1] CRAN (R 4.1.0)
#>  crayon             1.4.1    2021-02-08 [1] CRAN (R 4.1.0)
#>  curl               4.3.2    2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI                1.1.1    2021-01-15 [1] CRAN (R 4.1.0)
#>  dbplyr             2.1.1    2021-04-06 [1] CRAN (R 4.1.0)
#>  digest             0.6.27   2020-10-24 [1] CRAN (R 4.1.0)
#>  dplyr              1.0.7    2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis           0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate           0.14     2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi              0.5.0    2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap            1.1.0    2021-01-25 [1] CRAN (R 4.1.0)
#>  filelock           1.0.2    2018-10-05 [1] CRAN (R 4.1.0)
#>  fs                 1.5.0    2020-07-31 [1] CRAN (R 4.1.0)
#>  generics           0.1.0    2020-10-31 [1] CRAN (R 4.1.0)
#>  GenomeInfoDb       1.28.1   2021-07-01 [1] Bioconductor  
#>  GenomeInfoDbData   1.2.6    2021-08-27 [1] Bioconductor  
#>  glue               1.4.2    2020-08-27 [1] CRAN (R 4.1.0)
#>  highr              0.9      2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools          0.5.2    2021-08-25 [1] CRAN (R 4.1.1)
#>  httr               1.4.2    2020-07-20 [1] CRAN (R 4.1.0)
#>  IRanges            2.26.0   2021-05-19 [1] Bioconductor  
#>  KEGGREST           1.32.0   2021-05-19 [1] Bioconductor  
#>  knitr              1.33     2021-04-24 [1] CRAN (R 4.1.0)
#>  lifecycle          1.0.0    2021-02-15 [1] CRAN (R 4.1.0)
#>  magrittr           2.0.1    2020-11-17 [1] CRAN (R 4.1.0)
#>  memoise            2.0.0    2021-01-26 [1] CRAN (R 4.1.0)
#>  pillar             1.6.2    2021-07-29 [1] CRAN (R 4.1.0)
#>  pkgconfig          2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
#>  png                0.1-7    2013-12-03 [1] CRAN (R 4.1.0)
#>  purrr              0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
#>  R6                 2.5.1    2021-08-19 [1] CRAN (R 4.1.1)
#>  rappdirs           0.3.3    2021-01-31 [1] CRAN (R 4.1.0)
#>  Rcpp               1.0.7    2021-07-07 [1] CRAN (R 4.1.0)
#>  RCurl            * 1.98-1.4 2021-08-17 [1] CRAN (R 4.1.1)
#>  reprex             2.0.1    2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang              0.4.11   2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown          2.10     2021-08-06 [1] CRAN (R 4.1.0)
#>  RSQLite          * 2.2.8    2021-08-21 [1] CRAN (R 4.1.1)
#>  rstudioapi         0.13     2020-11-12 [1] CRAN (R 4.1.0)
#>  S4Vectors          0.30.0   2021-05-19 [1] Bioconductor  
#>  sessioninfo        1.1.1    2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi            1.6.2    2021-05-17 [1] CRAN (R 4.1.0)
#>  stringr            1.4.0    2019-02-10 [1] CRAN (R 4.1.0)
#>  tibble             3.1.4    2021-08-25 [1] CRAN (R 4.1.1)
#>  tidyselect         1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
#>  UniProt.ws       * 2.33.0   2021-08-27 [1] Bioconductor  
#>  utf8               1.2.2    2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs              0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
#>  withr              2.4.2    2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun               0.24     2021-06-15 [1] CRAN (R 4.1.0)
#>  XVector            0.32.0   2021-05-19 [1] Bioconductor  
#>  yaml               2.2.1    2020-02-01 [1] CRAN (R 4.1.0)
#>  zlibbioc           1.38.0   2021-05-19 [1] Bioconductor  
#> 
#> [1] C:/Users/XXXXX/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.1/library

In comparison, this works:

params <- c(from = "P_ENTREZGENEID", to = "ACC", format = "tab", query = "1 2")
httr::GET("https://www.uniprot.org/mapping/", query = as.list(params))
#> Response [https://www.uniprot.org/mapping/M20210827F248CABF64506F29A91F8037F07B67D1148B2ER.tab]
#>   Date: 2021-08-27 16:04
#>   Status: 200
#>   Content-Type: text/plain
#>   Size: 35 B
#> From To
#> 1    P04217
#> 1    V9HWD8
#> 2    P01023

Created on 2021-08-27 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.1.1 (2021-08-10)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2021-08-27                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
#>  curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.10    2021-08-06 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.6.2   2021-05-17 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.24    2021-06-15 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] C:/Users/mramo/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.1/library

Infinite loop for large query?

Hi,
I have recently switched to using this very useful package for retrieving GO annotations for my UniProt IDs, it is so far the most complete method of those I have tested, so thanks a lot for making this.

A few days ago I started encountering a strange issue that made me think the code may have recently changed and introduced an infinite loop that only affects large queries. However, it may also be that I am not using it properly, as I was never sure I was doing it right.

My code goes:

up <- UniProt.ws(taxId = tx)  # tx = taxon ID
res <- UniProt.ws::select(up, keys = proteins, columns = c("GO","GO-ID"), keytype = "UNIPROTKB") # proteins = non-redundant character vector of ALL UniProt accessions in the species' proteome (removed the "-2", etc... isoform suffix).

Getting extra data for D6RG00,A0A1B0GTD4,F5H7Q5,E0YMJ1,J3KSB8,J3KQ18
'select()' returned 1:1 mapping between keys and columns
...

This works nicely with small numbers of IDs, and used to take about 1-2h for all proteins in a proteome. Not ideal, but decent. However a few days ago the same script started never stopping... and when I looked at the protein IDs printed for every sequence of 6 IDs retrieved, I noticed that the script was endlessly looping through the same IDs, even though my proteins character vector is not redundant. I have changed my code to have a loop which now queries only 6 IDs at a time, but this looks like a bug to me.

On a side note: if this is not the correct way to do this, I will be happy to correct my code.
To speed up the process and reduce the amount of data I have to collect, I am considering logging all downloaded mappings between accessions and GO annotations in a local database. That way I would only have to query the service for new IDs in future. However, I am unsure how often UniProt updates GO annotations. Is there a benefit to checking once in a while that, say, the GO annotations for accession "D6RG00" are unchanged since last time?
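
The local-cache idea could look roughly like the following. It is only a sketch under stated assumptions: the cache is a plain in-memory environment rather than a database, `cached_lookup` and `fake_fetch` are hypothetical names, and the "GO-for-..." values are placeholders, not real annotations. The real lookup (for example a wrapper around select()) would be passed in as `fetch`.

```r
# In-memory cache keyed by accession; a file or SQLite table could replace it.
cache <- new.env(parent = emptyenv())

# `fetch` is a placeholder for the real lookup; it must return a named list
# with one element per requested accession.
cached_lookup <- function(ids, fetch) {
  known <- vapply(ids, exists, logical(1), envir = cache, inherits = FALSE)
  missing_ids <- ids[!known]
  if (length(missing_ids) > 0L) {
    fresh <- fetch(missing_ids)
    for (id in names(fresh)) assign(id, fresh[[id]], envir = cache)
  }
  mget(ids, envir = cache)
}

# Offline stand-in that records how many IDs were actually requested.
n_requested <- 0L
fake_fetch <- function(ids) {
  n_requested <<- n_requested + length(ids)
  stats::setNames(as.list(paste0("GO-for-", ids)), ids)
}

first  <- cached_lookup(c("D6RG00", "F5H7Q5"), fake_fetch)
second <- cached_lookup(c("D6RG00", "J3KSB8"), fake_fetch)  # only J3KSB8 is new
```

A cache like this never notices upstream changes on its own, so some expiry or periodic refresh would still be needed if annotations do change between releases.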

Creating new UniProt.ws object fails to query uniprot.org, HTTP status '400 Bad Request'

> library(UniProt.ws)
> UniProt.ws()
Error in file(con, "r") : 
  cannot open the connection to 'https://www.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id'
In addition: Warning message:
In file(con, "r") :
  cannot open URL 'https://rest.uniprot.org/uniprotkb/query=organism:9606&format=tab&columns=id': HTTP status was '400 Bad Request'
>
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UniProt.ws_2.36.0   BiocGenerics_0.42.0 RSQLite_2.2.14     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           pillar_1.7.0           compiler_4.2.1         BiocManager_1.30.18    dbplyr_2.2.1          
 [6] GenomeInfoDb_1.32.2    XVector_0.36.0         bitops_1.0-7           tools_4.2.1            zlibbioc_1.42.0       
[11] bit_4.0.4              tibble_3.1.7           lifecycle_1.0.1        memoise_2.0.1          BiocFileCache_2.4.0   
[16] pkgconfig_2.0.3        png_0.1-7              rlang_1.0.3            DBI_1.1.3              cli_3.3.0             
[21] filelock_1.0.2         curl_4.3.2             fastmap_1.1.0          GenomeInfoDbData_1.2.8 httr_1.4.3            
[26] dplyr_1.0.9            rappdirs_0.3.3         Biostrings_2.64.0      generics_0.1.2         S4Vectors_0.34.0      
[31] vctrs_0.4.1            IRanges_2.30.0         tidyselect_1.1.2       stats4_4.2.1           bit64_4.0.5           
[36] glue_1.6.2             Biobase_2.56.0         R6_2.5.1               fansi_1.0.3            AnnotationDbi_1.58.0  
[41] purrr_0.3.4            magrittr_2.0.3         blob_1.2.3             ellipsis_0.3.2         KEGGREST_1.36.2       
[46] assertthat_0.2.1       utf8_1.2.2             RCurl_1.98-1.7         cachem_1.0.6           crayon_1.5.1

Announcement on the new UniProt website:

I started encountering this error today and I believe this may be due to a change in the UniProt API coinciding with the launch of the new UniProt website. I otherwise cannot find any announcement from UniProt about the new API. The new API syntax breaks the UniProt.ws package and appears to have slightly different behavior in its return values.

The package is trying to connect to https://www.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id, which is no longer valid with the new API and returns an HTTP 400 status code.

UniProt's new API (https://rest.uniprot.org/uniprotkb/) breaks the package. The closest functioning API call to the above that I can determine is: https://rest.uniprot.org/uniprotkb/search?query=organism_id:9606&format=tsv&fields=accession

The old website (https://legacy.uniprot.org/) will return the old result, but I suspect the legacy API will be retired at some point. The website says the legacy website will be available until the 2022_03 release, at which point the legacy API would also be shut down. https://legacy.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id

I believe all the API calls in the package will need to be updated to support the new UniProt API following this documentation: https://www.uniprot.org/help/return_fields
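
To make the difference concrete, here is a small sketch that assembles the working URL quoted above from its parts. `build_search_url` is a hypothetical helper, not a function in the package, and joining several fields with commas is an assumption based on the return-fields documentation linked above.

```r
# Build a search URL for the new REST endpoint quoted in this report.
build_search_url <- function(taxon, fields = "accession", format = "tsv") {
  sprintf(
    "https://rest.uniprot.org/uniprotkb/search?query=organism_id:%s&format=%s&fields=%s",
    taxon, format, paste(fields, collapse = ",")
  )
}

url <- build_search_url(9606)
```

Compared with the legacy call, the query term changes from organism: to organism_id:, format=tab becomes format=tsv, and columns=id becomes fields=accession.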

Wrapping package functions into larger functions

Hi,
I am encountering a strange error. I had wrapped the package's UniProt.ws and UniProt.ws::select functions into a larger function - and had noticed that it had become slower, but had assumed this was due to the underlying code having changed, since I had updated to a newer version (cf. an issue I reported a few months ago, now solved). I am observing that running the function's script as a script is much faster (maybe 5 times or so!) than calling my wrapper function. Moreover, sometimes the function fails when its script itself succeeds. Any reason why this may be happening, i.e. why running this script's functions inside larger functions is much slower than running them as is?
Kind regards,

Armel

not all UniProt IDs are searchable

While the UniProt.org web site will map terms like APOB_HUMAN to EntrezID, HGNC etc, the UniProt.ws tool doesn't support this field.

library(UniProt.ws)
up <- UniProt.ws(taxId=9606)
types <- keytypes(up); types # 92 are returned

I tried mapping using all the keytypes (starting with kt = UNIPROTKB) and no data is returned.
Am I missing something?

Here is the query:

keys <- c('APOB_HUMAN','THRB_HUMAN')
columns <- c("HGNC",'ENTREZ_GENE')
kt <- 'UNIPROTKB'
res <- select(up, keys, columns, kt)

Here is the output (I actually ran this from a command-line script):
Error in .select(x, keys, columns, keytype) :
No data is available for the keys provided.
Calls: select -> select -> .select
Execution halted

A suggestion about keytypes in Uniprot.ws

If we have a gene name, e.g. BRCA1, we cannot retrieve any information using UniProt.ws because the keytypes don't include GENENAME, gene symbol (like hgnc_symbol) or any other such ID format. However, the UniProt database does provide Gene name in its ID retrieval/mapping tool. In addition, I wish the columns and keytypes were classified into categories, as the database shows them.

UniProt.ws lookup fails on some valid UniProt IDs

Consider the following query:

up.ws <- UniProt.ws::UniProt.ws(taxId=9606)
res <- UniProt.ws::select(x=up.ws, keys=c("HGNC:417", "HGNC:30235", "HGNC:19732", "HGNC:51504","HGNC:51514","HGNC:51513"), columns="ENSEMBL", keytype="HGNC")

Using version 2.22.0 of the UniProt.ws (on R version 3.5.2), this returns:
HGNC ENSEMBL
1 HGNC:417 ENSG00000136872
2 HGNC:30235 NA
3 HGNC:19732 NA
4 NA NA
5 NA NA
6 NA NA

Why is the result NA for HGNC:30235 and HGNC:19732? These HGNC IDs are linked to Ensembl genes by HGNC. And why are HGNC IDs 51504, 51514, and 51513 returned as NA's? These are valid HGNC IDs.
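
The two kinds of failure in the table above can be separated programmatically, which may help when reporting them. The sketch below rebuilds the printed result by hand (so it runs without a network) and uses only base R; the variable names are illustrative.

```r
keys <- c("HGNC:417", "HGNC:30235", "HGNC:19732",
          "HGNC:51504", "HGNC:51514", "HGNC:51513")

# The result table as printed in the report above.
res <- data.frame(
  HGNC = c("HGNC:417", "HGNC:30235", "HGNC:19732", NA, NA, NA),
  ENSEMBL = c("ENSG00000136872", NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keys that never appear in the key column of the result.
not_returned <- setdiff(keys, res$HGNC)

# Keys that were returned but have no value in the requested column.
unmapped <- res$HGNC[!is.na(res$HGNC) & is.na(res$ENSEMBL)]
```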

Mapping error

I followed the code you provided:

Two errors occur: cannot connect to the database?
#########

up
"UniProt.ws" object:
An interface object for UniProt web services
Current Taxonomy ID:
9606
Current Species name:
Homo sapiens
To change Species see: help('availableUniprotSpecies')
egs=keys(up,"ENTREZ_GENE")
Getting mapping data for Q00266 ... and P_ENTREZGENEID
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try

#########

keys<-c("1","2")
columns<-c("PDB","HGNC","SEQUENCE")
kt<-"ENTREZ_GENE"
res<-select(up,keys,columns,kt)
Getting mapping data for 1 ... and ACC
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
Error in colnames<-(*tmp*, value = *vtmp*) :

##########

additional columns

Columns such as Function[CC] are programmatically accessible from UniProt, but currently not available. See this post. Would it be possible to add further columns? Thanks.

extracting uniprot annotations using Entrez Gene IDs

I am attempting to extract selected UniProt fields using Entrez Gene IDs in R. The output appears strange. In the first few rows, only the ID column is populated, and in the next few rows, all columns except the ID column are populated.

Here is what I am trying:

ss <- select(up,
       keys = c("5243","5244"),
       columns = c("GENES","ENTRY-NAME", "REVIEWED"),
       keytype = "ENTREZ_GENE")
Getting mapping data for 5243 ... and ACC
Getting extra data for A4D1D2,P21439
'select()' returned 1:many mapping between keys and columns

The dataframe looks like this:
ss.xlsx
I am using this package in R version 3.4.3 in Rstudio 1.0.153.
I uninstalled and reinstalled the package. Still the same error. What am I doing wrong? The same code worked back in June.

Select not working

library(UniProt.ws)
res <- select(up,
keys = c("22627","22629"),
columns = c("PDB","UNIGENE","SEQUENCE"),
keytype = "ENTREZ_GENE")

Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "UniProt.ws"

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] readxl_1.3.1 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.0.1 purrr_0.3.2
[6] readr_1.3.1 tidyr_0.8.3 tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1
[11] UniProt.ws_2.22.0 BiocGenerics_0.28.0 RCurl_1.95-4.12 bitops_1.0-6 RSQLite_2.1.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lubridate_1.7.4 lattice_0.20-38 assertthat_0.2.0 digest_0.6.18
[6] utf8_1.1.4 BiocFileCache_1.6.0 R6_2.4.0 cellranger_1.1.0 plyr_1.8.4
[11] backports_1.1.3 stats4_3.5.3 BiocInstaller_1.32.1 httr_1.4.0 pillar_1.3.1
[16] rlang_0.3.1 lazyeval_0.2.2 curl_3.3 rstudioapi_0.9.0 blob_1.1.1
[21] S4Vectors_0.20.1 bit_1.1-14 munsell_0.5.0 broom_0.5.1 compiler_3.5.3
[26] modelr_0.1.4 pkgconfig_2.0.2 tidyselect_0.2.5 IRanges_2.16.0 fansi_0.4.0
[31] crayon_1.3.4 dbplyr_1.3.0 withr_2.1.2 rappdirs_0.3.1 grid_3.5.3
[36] nlme_3.1-137 jsonlite_1.6 gtable_0.2.0 DBI_1.0.0 magrittr_1.5
[41] scales_1.0.0 cli_1.0.1 stringi_1.4.3 xml2_1.2.0 generics_0.0.2
[46] tools_3.5.3 bit64_0.9-7 Biobase_2.42.0 glue_1.3.1 hms_0.4.2
[51] yaml_2.2.0 AnnotationDbi_1.44.0 colorspace_1.4-0 BiocManager_1.30.4 rvest_0.3.2
[56] memoise_1.1.0 haven_2.1.0
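
The `select_` in the error message looks like a dplyr generic, and the session info shows dplyr attached, so a likely cause (an inference, not confirmed in this report) is that a bare select() call resolves to dplyr's function rather than the one meant for UniProt.ws objects. The sketch below uses a hypothetical helper, `attached_providers`, to list which attached packages provide a given name; it is demonstrated on a base function so that it runs without either package installed.

```r
# List the search-path entries that contain an object called `name`, in
# search-path order; the first entry is the one a bare call resolves to.
attached_providers <- function(name) {
  on_path <- vapply(
    search(),
    function(env) exists(name, where = env, inherits = FALSE),
    logical(1)
  )
  names(on_path)[on_path]
}

providers <- attached_providers("median")
```

If attached_providers("select") lists "package:dplyr" first, calling UniProt.ws::select(...) with the namespace prefix, as other reports on this page do, avoids the clash.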

“Comments”

The information I’m interested in doesn’t seem to be retrieved. When I fetch the “COMMENTS” column, I get something useless like:

> AnnotationDbi::select(up, entrez_ids, 'COMMENTS', 'ENTREZ_GENE')$COMMENTS
'Developmental stage (1); Function (1); Induction (1); Sequence similarities (1); Subcellular location (1); Tissue specificity (1)'

“Subcellular location” seems to be there, but not everything.
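
Judging from the string above, the column appears to hold a summary of comment types with counts rather than the comment text itself (a reading of the output, not documented behaviour). If only the types are needed, the summary can be split into a named vector; `parse_comment_summary` is a hypothetical helper written for this sketch.

```r
comments <- paste(
  "Developmental stage (1); Function (1); Induction (1);",
  "Sequence similarities (1); Subcellular location (1); Tissue specificity (1)"
)

# Split "Type (n); Type (n); ..." into a named integer vector of counts.
parse_comment_summary <- function(x) {
  parts <- strsplit(x, ";\\s*")[[1]]
  counts <- as.integer(sub("^.*\\(([0-9]+)\\)\\s*$", "\\1", parts))
  names(counts) <- trimws(sub("\\s*\\([0-9]+\\)\\s*$", "", parts))
  counts
}

summary_counts <- parse_comment_summary(comments)
```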

Issue with UniProt.ws::select()

Following is my code

#load uniprot object taxId 9606 is for human
hu <- UniProt.ws(taxId = 9606)

#extract all the ENTREZ_gene names from hu

egs = keys(hu, "ENTREZ_GENE")

#use select to extract data by the keys (keylist) of interest

res <- UniProt.ws::select(hu, keys = keylist, columns = c("ENTRY-NAME","SEQUENCE"),keytype = "ENTREZ_GENE")

I get the following error

Session info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] sqldf_0.4-11 gsubfn_0.7 proto_1.0.0 annotables_0.1.91
[5] tibble_3.1.3 dplyr_1.0.7 UniProt.ws_2.32.0 BiocGenerics_0.38.0
[9] RCurl_1.98-1.3 RSQLite_2.2.7

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 prettyunits_1.1.1 png_0.1-7
[4] ps_1.6.0 Biostrings_2.60.2 assertthat_0.2.1
[7] rprojroot_2.0.2 utf8_1.2.2 BiocFileCache_2.0.0
[10] chron_2.3-56 R6_2.5.0 GenomeInfoDb_1.28.1
[13] stats4_4.1.0 httr_1.4.2 pillar_1.6.2
[16] zlibbioc_1.38.0 rlang_0.4.11 curl_4.3.2
[19] rstudioapi_0.13 callr_3.7.0 blob_1.2.2
[22] S4Vectors_0.30.0 desc_1.3.0 devtools_2.4.2
[25] stringr_1.4.0 bit_4.0.4 compiler_4.1.0
[28] pkgconfig_2.0.3 pkgbuild_1.2.0 tcltk_4.1.0
[31] tidyselect_1.1.1 KEGGREST_1.32.0 GenomeInfoDbData_1.2.6
[34] IRanges_2.26.0 fansi_0.5.0 crayon_1.4.1
[37] dbplyr_2.1.1 withr_2.4.2 bitops_1.0-7
[40] rappdirs_0.3.3 lifecycle_1.0.0 DBI_1.1.1
[43] magrittr_2.0.1 cli_3.0.1 stringi_1.7.3
[46] cachem_1.0.5 XVector_0.32.0 fs_1.5.0
[49] remotes_2.4.0 testthat_3.0.4 ellipsis_0.3.2
[52] filelock_1.0.2 generics_0.1.0 vctrs_0.3.8
[55] tools_4.1.0 bit64_4.0.5 Biobase_2.52.0
[58] glue_1.4.2 purrr_0.3.4 processx_3.5.2
[61] pkgload_1.2.1 fastmap_1.1.0 AnnotationDbi_1.54.1
[64] sessioninfo_1.1.1 memoise_2.0.0 usethis_2.0.1

irretrievable FUNCTION and ENTREZ_GENE

Hi,

I'm trying to retrieve FUNCTION and/or ENTREZ_GENE from UniProt on R 3.5.1, using UNIPROTKB as the keytype.
Using this simple code:
select(up, keys = xkeys, columns = c("FUNCTION"), keytype = "UNIPROTKB")
However, I'm getting in excess of 300 NAs out of the ~6000 keys used.
The keys themselves are valid on UniProt if I do a manual search, and FUNCTION and ENTREZ_GENE are available for the keys that are returning NAs!
Help?