bioconductor / uniprot.ws
R Interface to UniProt Web Services
Home Page: https://bioconductor.org/packages/UniProt.ws
The following simple example for one gene follows the docs I think:
mapUniProt(
    from = "Gene_Name",
    to = "UniProtKB-Swiss-Prot",
    debug = TRUE,
    columns = c("accession",
                "xref_alphafolddb",
                "xref_pdb",
                "organism_id",
                "xref_chembl",
                "protein_name",
                "xref_geneid"),
    query = list(organism_id = 9606, ids = c("TSPAN6")))
But I get several TSPAN6 genes back from other organisms. Is there an error in my example, or has something changed on UniProt's end in how the organism ID is specified?
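Until the organism filter behaves as expected, one possible workaround is to filter the returned table on the client side, since `organism_id` is already among the requested columns. The sketch below is only an illustration: `keep_organism` is a hypothetical helper, the column name `Organism (ID)` is an assumption (check `names()` on the real result), and the mock rows are invented so the code runs offline.

```r
# Hypothetical helper: keep only rows whose organism column equals `taxid`.
# The default column name is an assumption about the returned table.
keep_organism <- function(res, taxid, organism_col = "Organism (ID)") {
    stopifnot(organism_col %in% names(res))
    keep <- !is.na(res[[organism_col]]) & res[[organism_col]] == taxid
    res[keep, , drop = FALSE]
}

# Mock result standing in for the mapUniProt() output (illustrative values).
mock <- data.frame(
    From = c("TSPAN6", "TSPAN6", "TSPAN6"),
    Entry = c("ENTRY_A", "ENTRY_B", "ENTRY_C"),
    `Organism (ID)` = c(9606, 10090, 9913),
    check.names = FALSE
)

human_only <- keep_organism(mock, 9606)
```

This does not explain why the server returns other organisms; it only narrows the result after the fact.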
a <- UniProt.ws::select(up, keys, columns, keytype)
Getting mapping data for P52875 ... and KEGG_ID
Getting extra data for P52875,Q9DCT2,Q9CR16,Q3U0V1,P62737,P23198
Getting extra data for O88685,Q5DTX6,C0HKE5,P60710,Q8VDI7,Q5SF07
Getting extra data for P27773,P60867,Q80WJ1,Q9JIL5,Q3UTZ3,Q99PI4
Getting extra data for Q9CQG2,Q8K2J7,Q9CZJ2,Q8CFE3,Q6P2K6,Q0VAY3
Getting extra data for Q9QZZ4,Q9CYA6,E9Q6B2,Q9CQ92,Q9D4I2,Q5S003
Getting extra data for Q9ER60,P97496,P23506,Q9D0L8,Q8R3R8,Q7TPD0
Getting extra data for Q9QUK4,P0C0S6,Q99LN9,Q9D5A0,P04247,Q923G2
'select()' returned many:many mapping between keys and columns
Warning message:
In matrix(vec2, ncol = 2, byrow = TRUE) :
data length [377] is not a sub-multiple or multiple of the number of rows [189]
What does this warning message mean? Would it be possible to write a clearer exception?
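For context, base R raises this warning whenever a vector with an odd number of elements is reshaped into two columns. A plausible reading, and only a guess about the package internals, is that a flat key/value vector had one element missing (377 instead of 378). The sketch below reproduces the warning on a small vector and shows a stricter check; `as_pairs` is a hypothetical helper, not part of UniProt.ws.

```r
# Five elements cannot be arranged as key/value pairs: one value is missing.
vec <- c("key1", "value1", "key2", "value2", "key3")

# matrix() recycles the data and warns; capture the warning text.
msg <- tryCatch(
    matrix(vec, ncol = 2, byrow = TRUE),
    warning = function(w) conditionMessage(w)
)

# Hypothetical stricter alternative: stop early with a clearer message.
as_pairs <- function(vec) {
    if (length(vec) %% 2 != 0)
        stop("expected key/value pairs but got ", length(vec), " elements")
    matrix(vec, ncol = 2, byrow = TRUE)
}
```

Because `matrix()` recycles values to fill the last row, a result produced under this warning can contain a misaligned pair and is worth checking.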
Thanks for contributing this helpful annotation package.
I was trying to map thousands of UniRef90 accessions and found that I can send 100 at a time via queryUniProt. However, many annotations were missing, and I eventually worked out that queryUniProt never returns more than 25 results, which I could not find documented. I ran the following code on Windows with R 4.3 and on Linux with R 4.2. The example uses a vector of 150 query IDs, but I saw the same behavior with several such vectors from my data.
library(UniProt.ws)
query.v <- c('R6C8N8', 'A0A3A9HRI6', 'R6BVD9', 'R6BMZ2', 'A0A358M5A6', 'R6BRP4', 'R6C9B3', 'R6BLD8', 'A5ZAW9', 'R6CBY1', 'R6BMB3', 'A0A1Y3XLA9', 'R6BBV5', 'A0A0E2SWM2', 'A0A316PRR4', 'A0A1C6AL72', 'R6BSU9', 'R6CC32', 'R6BEV4', 'R6BST6', 'A0A358M7H1', 'A0A374P5I8', 'A0A088F5T2', 'A5ZAX1', 'A0A358M8X5', 'A0A396EI11', 'R6BJS0', 'C6LD63', 'R6BMD0', 'R6BFY9', 'A0A088F5S8', 'R6BQG6', 'A0A3D4MUP0', 'A0A358M6E8', 'R6BDQ2', 'A0A0E2RTL1', 'A0A0E2Q1M3', 'R6C8M4', 'A5Z504', 'R6BD02', 'R6C6P2', 'A5ZAX0', 'A0A373KUA3', 'E4VXQ7', 'R6BPD2', 'R6CAK0', 'R6BLS1', 'R6BCW4', 'R6BIA0', 'R6BD68', 'R6CA17', 'R6BCI1', 'A0A1C6HY36', 'A0A088F9N8', 'R6BS32', 'A0A358M8P8', 'R6BCJ7', 'R6BL83', 'R6BLK9', 'R6BE07', 'A0A0E2SS73', 'A0A0E2T0J7', 'R6C899', 'A0A174XYI8', 'R6BPE5', 'R6BKA0', 'R6BAX6', 'R6BD43', 'UPI000E4BFED0', 'R6C6S3', 'R6BSM4', 'R6BP28', 'R6BNU1', 'A0A374X5R5', 'R6BQI3', 'A0A373KQQ2', 'A0A316T132', 'R6BKT7', 'A0A174GIS1', 'R6CBF6', 'R6BKF0', 'R6BH03', 'R6BS29', 'A0A316T1V8', 'R6CD19', 'A0A380LAC8', 'R6BDS8', 'D4JBQ7', 'R6C9Z2', 'K1GDK7', 'R6BLZ3', 'R6BJ26', 'D1JQF5', 'R6C7G4', 'R6BLL3', 'R6BLE0', 'R6BFA2', 'R6BIF3', 'R6BLD5', 'R6CBT3', 'R6BM30', 'R6BD16', 'R6BRP9', 'R6BLM0', 'A0A3E2TAM6', 'R6BJ41', 'I9VME5', 'R6BM25', 'R6BT30', 'J1H0Z1', 'R6BPQ1', 'R6BG04', 'R6BF42', 'A0A174GZ57', 'R6BSR7', 'R6C7R4', 'A0A1C6BTV1', 'R6BDN0', 'A0A316Q377', 'A0A373KZS3', 'A0A174RJT6', 'R6BCA5', 'A0A078S5I3', 'R6BEA3', 'A0A1H6VIB0', 'A0A174D2Y3', 'R6BVQ7', 'R6BE99', 'B0P2S6', 'R6C8N5', 'A0A1D3TUX2', 'UPI000C837F06', 'R6BSS0', 'R6BDH1', 'R6BRB4', 'R6BVK4', 'R6BF87', 'R6BER1', 'N2B8C0', 'A0A143X6X3', 'R6C7V5', 'R6BSL5', 'R6BER0', 'R6BCS1', 'R6BVU5', 'R6UPC4', 'R6C7N1', 'A0A3D4G7Z6')
qup <- queryUniProt(query = query.v, fields = c("accession", "id", "gene_names", "protein_name"))
stopifnot(nrow(qup) > 25) # always produces an error
and here's my Windows session info
sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] UniProt.ws_2.40.0 RSQLite_2.3.1 BiocGenerics_0.46.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 gtable_0.3.3 xfun_0.39 ggplot2_3.4.2 Biobase_2.60.0 rjsoncons_1.0.0 vctrs_0.6.2
[8] tools_4.3.0 bitops_1.0-7 generics_0.1.3 stats4_4.3.0 curl_5.0.0 tibble_3.2.1 fansi_1.0.4
[15] AnnotationDbi_1.62.1 blob_1.2.4 pkgconfig_2.0.3 BiocBaseUtils_1.2.0 dbplyr_2.3.2 S4Vectors_0.38.1 lifecycle_1.0.3
[22] GenomeInfoDbData_1.2.10 compiler_4.3.0 Biostrings_2.68.0 progress_1.2.2 munsell_0.5.0 GenomeInfoDb_1.36.0 htmltools_0.5.5
[29] RCurl_1.98-1.12 yaml_2.3.7 pillar_1.9.0 crayon_1.5.2 cachem_1.0.8 tidyselect_1.2.0 digest_0.6.31
[36] dplyr_1.1.2 pander_0.6.5 fastmap_1.1.1 grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[43] utf8_1.2.3 prettyunits_1.1.1 filelock_1.0.2 scales_1.2.1 bit64_4.0.5 rmarkdown_2.21 XVector_0.40.0
[50] httr_1.4.6 bit_4.0.5 png_0.1-8 hms_1.1.3 memoise_2.0.1 evaluate_0.21 knitr_1.42
[57] IRanges_2.34.0 BiocFileCache_2.8.0 rlang_1.1.1 Rcpp_1.0.10 glue_1.6.2 DBI_1.1.3 rstudioapi_0.14
[64] jsonlite_1.8.4 R6_2.5.1 zlibbioc_1.46.0
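If the 25-row cap described above applies per call, one workaround is to send the IDs in batches of at most 25 and combine the results. This is a minimal sketch under that assumption: `chunk_ids`, `query_in_chunks`, and `mock_fetch` are hypothetical names, and the mock stands in for the real request so the code runs offline.

```r
# Split a vector of IDs into batches of at most `size` elements.
chunk_ids <- function(ids, size = 25L) {
    unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Run `fetch` on each batch and row-bind the results.
# `fetch` is any function taking a character vector and returning a data.frame.
query_in_chunks <- function(ids, fetch, size = 25L) {
    do.call(rbind, lapply(chunk_ids(ids, size), fetch))
}

# Stand-in for the real request (illustrative only).
mock_fetch <- function(batch) {
    data.frame(accession = batch, stringsAsFactors = FALSE)
}

ids <- sprintf("ID%03d", 1:150)
res <- query_in_chunks(ids, mock_fetch)
```

With the package loaded, `fetch` could wrap the call from the post, for example `function(b) queryUniProt(query = b, fields = c("accession", "id"))`, although that has not been tested here.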
Hello again,
#Load Packages
library(UniProt.ws)
library(dplyr)
library(annotables)
library(tibble)
options(url.method="libcurl")
#-----------------------------------
#load uniprot object taxId 9606 is for human
hu <- UniProt.ws(taxId = 9606)
#extract all the ENTREZ_gene names from hu
egs = keys(hu, "ENTREZ_GENE")
Getting mapping data for 28976 ... and ACC
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
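The final error is a generic base R message: it appears when column names are assigned to something that is not a matrix or data frame, which would be consistent with the earlier "no results" message leaving an empty object behind. The sketch below reproduces it and shows a guard with a more direct message; `set_mapping_names` is a hypothetical helper, not package code.

```r
# An empty character vector has no dimensions, so it cannot take column names.
empty <- character(0)
err <- tryCatch(
    {
        colnames(empty) <- c("from", "to")
        empty
    },
    error = function(e) conditionMessage(e)
)

# Hypothetical guard: report the real problem instead of the colnames error.
set_mapping_names <- function(x, nms) {
    if (length(dim(x)) < 2L)
        stop("no mapping data returned by the service; nothing to label")
    colnames(x) <- nms
    x
}
```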
I am trying to query UniProtKB using the Entry Name and retrieve the corresponding ID (i.e. Entry, as defined on the UniProt website). I know that Entry Name is not "stable", but unfortunately it is all I have been given for my data. Is this possible with the current implementation?
database <- UniProt.ws(taxId = 9606)
select(x = database,
keys = "ALBU_HUMAN",
columns = c("ENTRY-NAME", "ENTRY"),
keytype = "UNIPROTKB")
Thanks!
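When a data set mixes identifier types, it can help to classify them before choosing a keytype. The sketch below uses the accession pattern published in the UniProt help pages; the entry-name pattern and the `classify_id` helper are assumptions made for illustration.

```r
# Accession pattern as documented by UniProt.
accession_rx <- "^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
# Entry-name pattern (mnemonic, underscore, species code): an assumption.
entry_name_rx <- "^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$"

# Hypothetical helper: label each identifier by the pattern it matches.
classify_id <- function(x) {
    ifelse(grepl(accession_rx, x), "accession",
           ifelse(grepl(entry_name_rx, x), "entry_name", "other"))
}

kinds <- classify_id(c("P04217", "ALBU_HUMAN", "A0A3A9HRI6", "UPI000E4BFED0"))
```

This only sorts the identifiers; it does not by itself convert an entry name into an accession.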
Hi, great package generally, but I see that, for example, the tomato ENSEMBL_GENOME ID Solyc01g090660.2, which can be found on the UniProt website and has the UniProt ID D0UDC1, is lost when using the package. Any idea why that might be? Is the website more up to date than the package? In fact, nearly 1000 IDs are lost when using the package, at least some of which can be found on the website.
Thanks,
Fred.
A simple script
library(UniProt.ws)
up <- UniProt.ws(taxId=9606)
results in the error
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
InternetOpenUrl failed: 'A connection with the server could not be established'
I can install any R package, so the internet connection on my machine is fine. Any suggestions on how to continue debugging?
This error is related to the one shown in http://bioconductor.org/checkResults/release/bioc-LATEST/UniProt.ws/tokay2-buildsrc.html
url <- "https://www.uniprot.org/mapping/"
params <- c(from = "P_ENTREZGENEID", to = "ACC", format = "tab", query = "1 2")
RCurl::getForm(url, .params = params, .opts = list(FOLLOWLOCATION = TRUE))
#> Error in function (type, msg, asError = TRUE) : error:14077102:SSL routines:SSL23_GET_SERVER_HELLO:unsupported protocol
Created on 2021-08-27 by the reprex package (v2.0.1)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.1 (2021-08-10)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2021-08-27
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> AnnotationDbi 1.54.1 2021-06-08 [1] Bioconductor
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
#> Biobase 2.52.0 2021-05-19 [1] Bioconductor
#> BiocFileCache 2.0.0 2021-05-19 [1] Bioconductor
#> BiocGenerics * 0.38.0 2021-05-19 [1] Bioconductor
#> Biostrings 2.60.2 2021-08-05 [1] Bioconductor
#> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0)
#> blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.0)
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1)
#> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
#> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
#> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
#> dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.0)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
#> GenomeInfoDb 1.28.1 2021-07-01 [1] Bioconductor
#> GenomeInfoDbData 1.2.6 2021-08-27 [1] Bioconductor
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
#> IRanges 2.26.0 2021-05-19 [1] Bioconductor
#> KEGGREST 1.32.0 2021-05-19 [1] Bioconductor
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
#> memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0)
#> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> png 0.1-7 2013-12-03 [1] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.0)
#> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
#> RCurl * 1.98-1.4 2021-08-17 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
#> rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.1.0)
#> RSQLite * 2.2.8 2021-08-21 [1] CRAN (R 4.1.1)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> S4Vectors 0.30.0 2021-05-19 [1] Bioconductor
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
#> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> tibble 3.1.4 2021-08-25 [1] CRAN (R 4.1.1)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
#> UniProt.ws * 2.33.0 2021-08-27 [1] Bioconductor
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
#> xfun 0.24 2021-06-15 [1] CRAN (R 4.1.0)
#> XVector 0.32.0 2021-05-19 [1] Bioconductor
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#> zlibbioc 1.38.0 2021-05-19 [1] Bioconductor
#>
#> [1] C:/Users/XXXXX/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.1/library
In comparison, this works:
params <- c(from = "P_ENTREZGENEID", to = "ACC", format = "tab", query = "1 2")
httr::GET("https://www.uniprot.org/mapping/", query = as.list(params))
#> Response [https://www.uniprot.org/mapping/M20210827F248CABF64506F29A91F8037F07B67D1148B2ER.tab]
#> Date: 2021-08-27 16:04
#> Status: 200
#> Content-Type: text/plain
#> Size: 35 B
#> From To
#> 1 P04217
#> 1 V9HWD8
#> 2 P01023
Created on 2021-08-27 by the reprex package (v2.0.1)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.1 (2021-08-10)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2021-08-27
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
#> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
#> rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
#> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
#> xfun 0.24 2021-06-15 [1] CRAN (R 4.1.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] C:/Users/mramo/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.1/library
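The tab-separated body returned by the working httr request above can be read into a data frame with base R. A minimal sketch, with the three mapping rows from the response hard-coded so that it runs offline:

```r
# Response body copied from the output above (tab-separated, with header).
body <- "From\tTo\n1\tP04217\n1\tV9HWD8\n2\tP01023\n"

# Read both columns as character so gene IDs are not coerced to numbers.
mapping <- read.delim(text = body, colClasses = "character")

# Group the accessions by the gene ID they were mapped from.
by_gene <- split(mapping$To, mapping$From)
```

With a live response, the same `read.delim(text = ...)` call could take the text from `httr::content(resp, as = "text")`.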
Hi,
I have recently switched to using this very useful package for retrieving GO annotations for my UniProt IDs, it is so far the most complete method of those I have tested, so thanks a lot for making this.
A few days ago I started encountering a strange issue that made me think the code may have recently changed and introduced an infinite loop that only affects large queries. However, it may also be that I am not using it properly, as I was never sure I was doing it right.
My code goes:
up <- UniProt.ws(taxId = tx) # tx = taxon ID
res <- UniProt.ws::select(up, keys = proteins, columns = c("GO","GO-ID"), keytype = "UNIPROTKB") # proteins = non-redundant character vector of ALL UniProt accessions in the species' proteome (removed the "-2", etc... isoform suffix).
Getting extra data for D6RG00,A0A1B0GTD4,F5H7Q5,E0YMJ1,J3KSB8,J3KQ18
'select()' returned 1:1 mapping between keys and columns
...
This works nicely with small numbers of IDs, and used to take about 1-2 h for all proteins in a proteome. Not ideal, but decent. However, a few days ago the same script stopped terminating. When I looked at the protein IDs printed for each batch of 6 IDs retrieved, I noticed that the script was endlessly looping through the same IDs, even though my proteins character vector is non-redundant. I have changed my code to use a loop that queries only 6 IDs at a time, but this looks like a bug to me.
On a side note: if this is not the correct way to do this, I will be happy to correct my code.
To speed up the process and reduce the amount of data I have to collect, I am considering logging all downloaded mappings between accessions and GO annotations in a local database, so that in future I would only have to query the service for new IDs. However, I am unsure how often UniProt updates GO annotations. Is there any benefit in checking once in a while that, say, the GO annotations for accession "D6RG00" are unchanged since last time?
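The caching idea can be sketched with an environment as an in-memory store, so that only IDs not seen before are sent to the service. All names here (`go_cache`, `cached_lookup`, `mock_fetch`) are hypothetical, and the mock replaces the real `select()` call so the example runs offline; persisting the cache to disk is left out.

```r
# Environment used as an in-memory cache keyed by accession.
go_cache <- new.env(parent = emptyenv())

# Fetch only the IDs that are not cached yet, then answer from the cache.
cached_lookup <- function(ids, fetch, cache = go_cache) {
    new_ids <- setdiff(ids, ls(cache))
    if (length(new_ids) > 0) {
        fresh <- fetch(new_ids)  # named list: accession -> annotation
        for (id in names(fresh)) assign(id, fresh[[id]], envir = cache)
    }
    mget(ids, envir = cache, ifnotfound = list(NA_character_))
}

# Offline stand-in that records how many IDs were actually requested.
n_requested <- 0
mock_fetch <- function(ids) {
    n_requested <<- n_requested + length(ids)
    setNames(as.list(paste0("GO-for-", ids)), ids)
}

first  <- cached_lookup(c("D6RG00", "F5H7Q5"), mock_fetch)
second <- cached_lookup(c("D6RG00", "F5H7Q5", "E0YMJ1"), mock_fetch)
```

In the second call only the one new accession is requested. A cache like this never notices upstream changes, so some refresh policy would still be needed.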
> library(UniProt.ws)
> UniProt.ws()
Error in file(con, "r") :
cannot open the connection to 'https://www.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id'
In addition: Warning message:
In file(con, "r") :
cannot open URL 'https://rest.uniprot.org/uniprotkb/query=organism:9606&format=tab&columns=id': HTTP status was '400 Bad Request'
>
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] UniProt.ws_2.36.0 BiocGenerics_0.42.0 RSQLite_2.2.14
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 pillar_1.7.0 compiler_4.2.1 BiocManager_1.30.18 dbplyr_2.2.1
[6] GenomeInfoDb_1.32.2 XVector_0.36.0 bitops_1.0-7 tools_4.2.1 zlibbioc_1.42.0
[11] bit_4.0.4 tibble_3.1.7 lifecycle_1.0.1 memoise_2.0.1 BiocFileCache_2.4.0
[16] pkgconfig_2.0.3 png_0.1-7 rlang_1.0.3 DBI_1.1.3 cli_3.3.0
[21] filelock_1.0.2 curl_4.3.2 fastmap_1.1.0 GenomeInfoDbData_1.2.8 httr_1.4.3
[26] dplyr_1.0.9 rappdirs_0.3.3 Biostrings_2.64.0 generics_0.1.2 S4Vectors_0.34.0
[31] vctrs_0.4.1 IRanges_2.30.0 tidyselect_1.1.2 stats4_4.2.1 bit64_4.0.5
[36] glue_1.6.2 Biobase_2.56.0 R6_2.5.1 fansi_1.0.3 AnnotationDbi_1.58.0
[41] purrr_0.3.4 magrittr_2.0.3 blob_1.2.3 ellipsis_0.3.2 KEGGREST_1.36.2
[46] assertthat_0.2.1 utf8_1.2.2 RCurl_1.98-1.7 cachem_1.0.6 crayon_1.5.1
Announcement on the new UniProt website:
I started encountering this error today and I believe it is due to a change in the UniProt API coinciding with the launch of the new UniProt website. I cannot find any other announcement from UniProt about the new API. The new API syntax breaks the UniProt.ws package and appears to have slightly different behavior in its return values.
The package is trying to connect to https://www.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id, which is no longer valid with the new API and returns an HTTP 400 status code.
UniProt's new API (https://rest.uniprot.org/uniprotkb/) breaks the package. The closest functioning API call to the above that I can determine is: https://rest.uniprot.org/uniprotkb/search?query=organism_id:9606&format=tsv&fields=accession
The old website (https://legacy.uniprot.org/) will return the old result, but I suspect the legacy API will be retired at some point. The website says the legacy website will be available until the 2022_03 release, at which point the legacy API would also be shut down. https://legacy.uniprot.org/uniprot/?query=organism:9606&format=tab&columns=id
I believe all the API calls in the package will need to be updated to support the new UniProt API following this documentation: https://www.uniprot.org/help/return_fields
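To make the difference between the old and new query syntax concrete, the sketch below assembles the new-style search URL quoted above from its parts. `build_search_url` is a hypothetical helper, not a package function, and it does no URL encoding.

```r
# Hypothetical helper: build a new-style UniProtKB search URL.
# New syntax: `organism_id:` instead of `organism:`, `format=tsv` instead of
# `format=tab`, and `fields=` instead of `columns=`.
build_search_url <- function(taxid, fields,
                             base = "https://rest.uniprot.org/uniprotkb/search") {
    paste0(base,
           "?query=organism_id:", taxid,
           "&format=tsv",
           "&fields=", paste(fields, collapse = ","))
}

url <- build_search_url(9606, "accession")
```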
Hi,
I am encountering a strange error. I had wrapped the package's UniProt.ws and UniProt.ws::select functions in a larger function and noticed that it had become slower. I assumed this was due to changes in the underlying code, since I had updated to a newer version (cf. an issue I reported a few months ago, now solved). However, I am observing that running the function's body as a script is much faster (maybe 5 times or so!) than calling my wrapper function. Moreover, the function sometimes fails where the script itself succeeds. Is there any reason why calling these functions inside a larger function would be much slower than running them directly?
Kind regards,
Armel
While the UniProt.org web site will map terms like APOB_HUMAN to EntrezID, HGNC etc, the UniProt.ws tool doesn't support this field.
library(UniProt.ws)
up <- UniProt.ws(taxId=9606)
types <- keytypes(up); types # 92 are returned
here is the query
keys <- c('APOB_HUMAN','THRB_HUMAN')
columns <- c("HGNC",'ENTREZ_GENE')
kt <- 'UNIPROTKB'
res <- select(up, keys, columns, kt)
If we have a gene name, e.g. BRCA1, we cannot retrieve any information using UniProt.ws because the keytypes do not include GENENAME, a gene symbol (such as hgnc_symbol), or a similar ID format. However, the UniProt database does offer Gene name in its ID retrieval/mapping tool. In addition, I wish columns and keytypes could be grouped into categories, as the database shows them.
Consider the following query:
up.ws <- UniProt.ws::UniProt.ws(taxId=9606)
res <- UniProt.ws::select(x=up.ws, keys=c("HGNC:417", "HGNC:30235", "HGNC:19732", "HGNC:51504","HGNC:51514","HGNC:51513"), columns="ENSEMBL", keytype="HGNC")
Using version 2.22.0 of the UniProt.ws (on R version 3.5.2), this returns:
HGNC ENSEMBL
1 HGNC:417 ENSG00000136872
2 HGNC:30235 NA
3 HGNC:19732 NA
4 NA NA
5 NA NA
6 NA NA
Why is the result NA for HGNC:30235 and HGNC:19732? These HGNC IDs are linked to Ensembl genes by HGNC. And why are HGNC IDs 51504, 51514, and 51513 returned as NA's? These are valid HGNC IDs.
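The two kinds of missing result in this output can be separated with a couple of base R calls: keys that were dropped from the output entirely, and keys that are present but have no ENSEMBL value. The data frame below is rebuilt by hand from the printed result.

```r
keys <- c("HGNC:417", "HGNC:30235", "HGNC:19732",
          "HGNC:51504", "HGNC:51514", "HGNC:51513")

# Result as printed above, rebuilt by hand.
res <- data.frame(
    HGNC = c("HGNC:417", "HGNC:30235", "HGNC:19732", NA, NA, NA),
    ENSEMBL = c("ENSG00000136872", NA, NA, NA, NA, NA),
    stringsAsFactors = FALSE
)

# Keys that do not appear in the output at all.
dropped <- setdiff(keys, res$HGNC)

# Keys that appear in the output but have no ENSEMBL mapping.
unmapped <- res$HGNC[!is.na(res$HGNC) & is.na(res$ENSEMBL)]
```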
I followed the code you provided:
Two errors occur. Is it unable to connect to the database?
#########
up
"UniProt.ws" object:
An interface object for UniProt web services
Current Taxonomy ID:
9606
Current Species name:
Homo sapiens
To change Species see: help('availableUniprotSpecies')
egs=keys(up,"ENTREZ_GENE")
Getting mapping data for Q00266 ... and P_ENTREZGENEID
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
#########
keys<-c("1","2")
columns<-c("PDB","HGNC","SEQUENCE")
kt<-"ENTREZ_GENE"
res<-select(up,keys,columns,kt)
Getting mapping data for 1 ... and ACC
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
##########
Columns such as Function[CC] are programmatically accessible from UniProt, but are currently not available in the package. See this post. Would it be possible to add further columns? Thanks.
I am attempting to extract selected UniProt fields using Entrez Gene IDs in R. The output appears strange: in the first few rows only the ID column is populated, and in the next few rows all columns except the ID column are populated.
Here is what I am trying:
ss <- select(up,
             keys = c("5243","5244"),
             columns = c("GENES","ENTRY-NAME", "REVIEWED"),
             keytype = "ENTREZ_GENE")
Getting mapping data for 5243 ... and ACC
Getting extra data for A4D1D2,P21439
'select()' returned 1:many mapping between keys and columns
The dataframe looks like this:
ss.xlsx
I am using this package in R version 3.4.3 in Rstudio 1.0.153.
I uninstalled and reinstalled the package. Still the same error. What am I doing wrong? The same code worked back in June.
library(UniProt.ws)
res <- select(up,
keys = c("22627","22629"),
columns = c("PDB","UNIGENE","SEQUENCE"),
keytype = "ENTREZ_GENE")
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "UniProt.ws"
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.3.1 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.0.1 purrr_0.3.2
[6] readr_1.3.1 tidyr_0.8.3 tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1
[11] UniProt.ws_2.22.0 BiocGenerics_0.28.0 RCurl_1.95-4.12 bitops_1.0-6 RSQLite_2.1.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lubridate_1.7.4 lattice_0.20-38 assertthat_0.2.0 digest_0.6.18
[6] utf8_1.1.4 BiocFileCache_1.6.0 R6_2.4.0 cellranger_1.1.0 plyr_1.8.4
[11] backports_1.1.3 stats4_3.5.3 BiocInstaller_1.32.1 httr_1.4.0 pillar_1.3.1
[16] rlang_0.3.1 lazyeval_0.2.2 curl_3.3 rstudioapi_0.9.0 blob_1.1.1
[21] S4Vectors_0.20.1 bit_1.1-14 munsell_0.5.0 broom_0.5.1 compiler_3.5.3
[26] modelr_0.1.4 pkgconfig_2.0.2 tidyselect_0.2.5 IRanges_2.16.0 fansi_0.4.0
[31] crayon_1.3.4 dbplyr_1.3.0 withr_2.1.2 rappdirs_0.3.1 grid_3.5.3
[36] nlme_3.1-137 jsonlite_1.6 gtable_0.2.0 DBI_1.0.0 magrittr_1.5
[41] scales_1.0.0 cli_1.0.1 stringi_1.4.3 xml2_1.2.0 generics_0.0.2
[46] tools_3.5.3 bit64_0.9-7 Biobase_2.42.0 glue_1.3.1 hms_0.4.2
[51] yaml_2.2.0 AnnotationDbi_1.44.0 colorspace_1.4-0 BiocManager_1.30.4 rvest_0.3.2
[56] memoise_1.1.0 haven_2.1.0
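The session info above shows dplyr and the tidyverse attached alongside UniProt.ws, and `select_` is a dplyr function, so a likely cause (not confirmed in the thread) is that `dplyr::select` masks the `select()` intended for the `UniProt.ws` object. Writing `UniProt.ws::select(up, ...)` or `AnnotationDbi::select(up, ...)`, as other posts here do, avoids the clash. The offline sketch below shows the same mechanism using `stats::filter` in place of the real packages.

```r
# A local definition that masks stats::filter, analogous to dplyr::select
# masking the select() used with UniProt.ws objects.
filter <- function(x, ...) stop("masked version called")

x <- c(1, 2, 3, 4)

# The unqualified call reaches the masking definition and fails.
unqualified <- tryCatch(
    filter(x, rep(1 / 2, 2)),
    error = function(e) conditionMessage(e)
)

# The namespace-qualified call reaches the intended function.
qualified <- stats::filter(x, rep(1 / 2, 2))
```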
The information I’m interested in doesn’t seem to be retrieved. When I fetch the “COMMENTS” column, I get something useless like:
> AnnotationDbi::select(up, entrez_ids, 'COMMENTS', 'ENTREZ_GENE')$COMMENTS
'Developmental stage (1); Function (1); Induction (1); Sequence similarities (1); Subcellular location (1); Tissue specificity (1)'
“Subcellular location” seems to be there, but not everything.
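The returned string reads like a summary of comment topics with a count for each, rather than the comment text itself. Under that reading (an interpretation, not something the package documentation is quoted as saying here), it can at least be parsed into a named vector to see which topics exist for an entry:

```r
comments <- "Developmental stage (1); Function (1); Induction (1); Sequence similarities (1); Subcellular location (1); Tissue specificity (1)"

# Split into "Topic (n)" parts, then separate the topic from its count.
parts  <- strsplit(comments, "; ", fixed = TRUE)[[1]]
topics <- sub(" \\([0-9]+\\)$", "", parts)
counts <- as.integer(sub("^.* \\(([0-9]+)\\)$", "\\1", parts))
names(counts) <- topics
```

The actual text of each comment would still have to come from a dedicated column or a separate request.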
Following is my code
#load uniprot object taxId 9606 is for human
hu <- UniProt.ws(taxId = 9606)
#extract all the ENTREZ_gene names from hu
egs = keys(hu, "ENTREZ_GENE")
res <- UniProt.ws::select(hu, keys = keylist, columns = c("ENTRY-NAME","SEQUENCE"),keytype = "ENTREZ_GENE")
I get the following error
Getting mapping data for 28976 ... and ACC
error while trying to retrieve data in chunk 1:
no results after 5 attempts; please try again later
continuing to try
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Session info
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-11 gsubfn_0.7 proto_1.0.0 annotables_0.1.91
[5] tibble_3.1.3 dplyr_1.0.7 UniProt.ws_2.32.0 BiocGenerics_0.38.0
[9] RCurl_1.98-1.3 RSQLite_2.2.7
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 prettyunits_1.1.1 png_0.1-7
[4] ps_1.6.0 Biostrings_2.60.2 assertthat_0.2.1
[7] rprojroot_2.0.2 utf8_1.2.2 BiocFileCache_2.0.0
[10] chron_2.3-56 R6_2.5.0 GenomeInfoDb_1.28.1
[13] stats4_4.1.0 httr_1.4.2 pillar_1.6.2
[16] zlibbioc_1.38.0 rlang_0.4.11 curl_4.3.2
[19] rstudioapi_0.13 callr_3.7.0 blob_1.2.2
[22] S4Vectors_0.30.0 desc_1.3.0 devtools_2.4.2
[25] stringr_1.4.0 bit_4.0.4 compiler_4.1.0
[28] pkgconfig_2.0.3 pkgbuild_1.2.0 tcltk_4.1.0
[31] tidyselect_1.1.1 KEGGREST_1.32.0 GenomeInfoDbData_1.2.6
[34] IRanges_2.26.0 fansi_0.5.0 crayon_1.4.1
[37] dbplyr_2.1.1 withr_2.4.2 bitops_1.0-7
[40] rappdirs_0.3.3 lifecycle_1.0.0 DBI_1.1.1
[43] magrittr_2.0.1 cli_3.0.1 stringi_1.7.3
[46] cachem_1.0.5 XVector_0.32.0 fs_1.5.0
[49] remotes_2.4.0 testthat_3.0.4 ellipsis_0.3.2
[52] filelock_1.0.2 generics_0.1.0 vctrs_0.3.8
[55] tools_4.1.0 bit64_4.0.5 Biobase_2.52.0
[58] glue_1.4.2 purrr_0.3.4 processx_3.5.2
[61] pkgload_1.2.1 fastmap_1.1.0 AnnotationDbi_1.54.1
[64] sessioninfo_1.1.1 memoise_2.0.0 usethis_2.0.1
Hi,
I'm trying to retrieve FUNCTION and/or ENTREZ_GENE from UniProt on R 3.5.1 using UNIPROTKB as keys,
using this simple code:
select(up, keys = xkeys, columns = c("FUNCTION"), keytype = "UNIPROTKB")
However, I'm getting more than 300 NAs out of the ~6000 keys used.
The keys themselves are valid on UniProt if I do a manual search, and FUNCTION and ENTREZ_GENE are available for the keys that return NAs!
Help?