bioconductor / organism.dplyr

Home Page: https://bioconductor.org/packages/Organism.dplyr


organism.dplyr's Introduction

The package creates an on-disk SQLite database that holds data for an organism, combining the identifiers of an 'org' package (e.g., org.Hs.eg.db) with the genome coordinate functionality of a 'TxDb' package (e.g., TxDb.Hsapiens.UCSC.hg38.knownGene). It aims to provide an integrated presentation of identifiers and genomic coordinates.


organism.dplyr's Issues

Use AnnotationFilters to avoid naming conflicts

https://github.com/Biocondcutor/AnnotationFilters

GRangesFilter needs a show method

Filters

There are filter concepts in S4Vectors, ensembldb, and now here. Shouldn't we have just one? One thing that drove us to implement our own filters rather than re-using those in ensembldb was the ability to easily generate them programmatically, whereas these are all 'hand-crafted' in ensembldb.

Appropriate table structure

Organism.dplyr simplifies the bimap and table structure of org and TxDb packages to a small number of tables, but what is the optimal arrangement and membership of tables? Organism.dplyr is already much more user-friendly than the org / TxDb / Homo.sapiens packages, so it is valuable for that reason alone. Note that the genes(), transcripts(), exons(), and cds() verbs are already contracted to return a GRanges; we have genes_tbl() etc. returning tibbles.

atlantic salmon annotation

Could you please add Atlantic salmon as an annotation file under Organism in Bioconductor?
https://bioconductor.org/packages/release/BiocViews.html#___Organism
Many thanks in advance

Bioconductor BBS: Organism.dplyr / BioC 3.18, 10/27/93

Hi Organism.dplyr maintainer,

According to the Multiple platform build/check report for BioC 3.18,
the Organism.dplyr package has the following problem(s):

o ERROR for 'R CMD build' on nebbiolo2. See the details here:
https://master.bioconductor.org/checkResults/3.18/bioc-LATEST/Organism.dplyr/nebbiolo2-buildsrc.html

Please take the time to address this by committing and pushing
changes to your package at git.bioconductor.org

Notes:

This was the status of your package at the time this email was sent to you.
Given that the online report is updated daily (in normal conditions) you
could see something different when you visit the URL(s) above, especially if
you do so several days after you received this email.

It is possible that the problems reported in this report are false positives,
either because another package (from CRAN or Bioconductor) breaks your
package (if yours depends on it) or because of a Build System problem.
If this is the case, then you can ignore this email.

Please check the report again 24h after you've committed your changes to the
package and make sure that all the problems have gone.

If you have questions about this report or need help with the
maintenance of your package, please use the Bioc-devel mailing list:

https://bioconductor.org/help/support/

(all package maintainers are requested to subscribe to this list)

For immediate notification of package build status, please
subscribe to your package's RSS feed. Information is at:

https://bioconductor.org/developers/rss-feeds/

Thanks for contributing to the Bioconductor project!

do schema tables have consistent names?

e.g., entrez_omim_pm vs ensembl_pm

persistent cache for result of src_organism()?

From the current vignette:

Running src_organism() without a given path will save the sqlite file to a tempdir():

...

It might be more convenient if the default behavior were to write the sqlite file into a folder
like the one used for AnnotationHub, with src_organism() checking for a relevant database
when invoked. It takes a bit of time to build the database, and if I understand the default behavior correctly, it will be lost when the session ends.
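
The request above could be approximated on the user side with a per-user cache directory. The following is a minimal sketch in base R, not the package's behavior: `cache_dir()` and `cached_db_path()` are hypothetical helpers, and the commented-out call assumes that src_organism() reuses an existing file at `dbpath`.

```r
# Sketch only: cache_dir() and cached_db_path() are hypothetical helpers,
# not part of Organism.dplyr.
cache_dir <- function(
    base = tools::R_user_dir("Organism.dplyr", which = "cache")
) {
    # Create the per-user cache directory on first use
    if (!dir.exists(base))
        dir.create(base, recursive = TRUE)
    base
}

cached_db_path <- function(txdb_name, base = cache_dir()) {
    # One sqlite file per TxDb package name
    file.path(base, paste0(txdb_name, ".sqlite"))
}

# With Organism.dplyr installed, the same path could be passed on every
# session (assumption: an existing file at dbpath is reused, not rebuilt):
# path <- cached_db_path("TxDb.Hsapiens.UCSC.hg38.knownGene")
# src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", dbpath = path)
```

tools::R_user_dir() requires R 4.0.0 or later; on older versions a fixed directory chosen by the user would serve the same purpose.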

`exonsBy()` and others don't return same columns as GRanges version

E.g., in test-src_organims-class.R, "mouse" test

> names(mcols(exonsBy(src)[[1]]))
Joining, by = c("ensembl", "tx_id")
[1] "tx_id"     "exon_id"   "exon_name" "exon_rank"
> names(mcols(exonsBy(txdb)[[1]]))
[1] "exon_id"   "exon_name" "exon_rank"
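
The difference between the two results can be made explicit with a set comparison. This sketch uses only the column names copied from the output above; in a unit test the two vectors would come from names(mcols(...)) on each result.

```r
# Column names copied from the two outputs above
src_cols  <- c("tx_id", "exon_id", "exon_name", "exon_rank")
txdb_cols <- c("exon_id", "exon_name", "exon_rank")

# Columns returned by the src version but not by the TxDb version
extra_in_src <- setdiff(src_cols, txdb_cols)

# Columns returned by the TxDb version but missing from the src version
missing_in_src <- setdiff(txdb_cols, src_cols)
```

Here `extra_in_src` contains only "tx_id" and `missing_in_src` is empty, so the mismatch is a single extra column.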

Avoid Camel_snake filter names

remove redundant sqlite file

I created two sqlite databases for testing, and somehow missed example.sqlite! I'll fix this, probably by removing example.sqlite and updating the vignette / documentation.

`filter=` arguments should accept a single filter, as well as list of filters

`fiveUTRsByTranscript(, filter=)` does not return values for all transcripts

> transcripts_tbl(src, filter=list(SymbolFilter("ADA")))
Joining, by = "entrez"
Source:   query [?? x 7]
Database: sqlite 3.11.1 [/home/mtmorgan/a/Organism.dplyr/inst/extdata/light.hg38.knownGene.sqlite]

  tx_chrom tx_start   tx_end tx_strand  tx_id    tx_name symbol
     <chr>    <int>    <int>     <chr>  <int>      <chr>  <chr>
1    chr20 44619522 44626491         - 169786 uc061xfj.1    ADA
2    chr20 44619522 44651742         - 169787 uc002xmj.4    ADA
3    chr20 44619810 44651691         - 169789 uc061xfl.1    ADA
> fiveUTRsByTranscript(src, filter = list(SymbolFilter("ADA")))
Joining, by = "entrez"
Joining, by = "entrez"
GRangesList object of length 1:
$169787 
GRanges object with 1 range and 5 metadata columns:
      seqnames               ranges strand |     tx_id   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <integer> <character>
  [1]    chr20 [44651608, 44651742]      - |    169787    501401        <NA>
      exon_rank      symbol
      <integer> <character>
  [1]         1         ADA

-------

src_organism not working

Hello, every time I run:

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

I get the error:

Error in collect():
! Failed to collect lazy table.
Caused by error in db_collect():
! Arguments in ... must be used.
✖ Problematic argument:
• ..1 = Inf
ℹ Did you misspell an argument name?

Not sure how to resolve this.

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: FlexiBLAS OPENBLAS; LAPACK version 3.10.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] tools stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] tinytex_0.45 viridis_0.6.5 viridisLite_0.4.2
[4] rtracklayer_1.60.0 tidylog_1.0.2 data.table_1.15.4
[7] janitor_2.2.0 stringr_1.5.0 stringi_1.7.12
[10] forcats_1.0.0 readODS_2.2.0 patchwork_1.2.0
[13] ggrepel_0.9.3 ggplot2_3.5.0 RColorBrewer_1.1-3
[16] karyoploteR_1.26.0 regioneR_1.32.0 DOSE_3.26.1
[19] TxDb.Hsapiens.UCSC.hg38.knownGene_3.17.0 GenomicFeatures_1.52.1 GenomicRanges_1.52.0
[22] GenomeInfoDb_1.36.1 AnnotationDbi_1.62.2 IRanges_2.34.1
[25] S4Vectors_0.38.1 Biobase_2.60.0 BiocGenerics_0.46.0
[28] Organism.dplyr_1.28.0 AnnotationFilter_1.24.0 dplyr_1.1.4
[31] biomaRt_2.56.1 BiocManager_1.30.22

loaded via a namespace (and not attached):
[1] rstudioapi_0.14 magrittr_2.0.3 rmarkdown_2.22 BiocIO_1.10.0
[5] zlibbioc_1.46.0 vctrs_0.6.5 memoise_2.0.1 Rsamtools_2.16.0
[9] RCurl_1.98-1.12 base64enc_0.1-3 htmltools_0.5.5 S4Arrays_1.0.4
[13] progress_1.2.3 curl_5.2.1 Formula_1.2-5 htmlwidgets_1.6.2
[17] plyr_1.8.9 lubridate_1.9.3 cachem_1.0.8 GenomicAlignments_1.36.0
[21] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.6-5 R6_2.5.1
[25] fastmap_1.1.1 snakecase_0.11.1 GenomeInfoDbData_1.2.10 MatrixGenerics_1.12.2
[29] digest_0.6.31 colorspace_2.1-0 pkgload_1.3.2 bezier_1.1.2
[33] Hmisc_5.1-0 RSQLite_2.3.1 org.Hs.eg.db_3.17.0 filelock_1.0.2
[37] timechange_0.3.0 fansi_1.0.4 httr_1.4.6 compiler_4.3.1
[41] withr_2.5.0 bit64_4.0.5 htmlTable_2.4.1 backports_1.4.1
[45] BiocParallel_1.34.2 DBI_1.2.2 rappdirs_0.3.3 DelayedArray_0.26.6
[49] rjson_0.2.21 HDO.db_0.99.1 foreign_0.8-84 zip_2.3.0
[53] nnet_7.3-19 glue_1.6.2 restfulr_0.0.15 GOSemSim_2.26.0
[57] grid_4.3.1 checkmate_2.3.1 cluster_2.1.4 reshape2_1.4.4
[61] fgsea_1.26.0 generics_0.1.3 gtable_0.3.4 BSgenome_1.68.0
[65] tidyr_1.3.1 ensembldb_2.24.0 hms_1.1.3 xml2_1.3.4
[69] utf8_1.2.3 XVector_0.40.0 pillar_1.9.0 splines_4.3.1
[73] BiocFileCache_2.8.0 lattice_0.22-6 bit_4.0.5 biovizBase_1.48.0
[77] tidyselect_1.2.1 GO.db_3.17.0 Biostrings_2.68.1 knitr_1.46
[81] gridExtra_2.3 ProtGenerics_1.32.0 SummarizedExperiment_1.30.2 xfun_0.43
[85] matrixStats_1.3.0 lazyeval_0.2.2 yaml_2.3.7 evaluate_0.21
[89] codetools_0.2-20 tibble_3.2.1 qvalue_2.32.0 cli_3.6.1
[93] rpart_4.1.23 munsell_0.5.1 dichromat_2.0-0.1 Rcpp_1.0.10
[97] dbplyr_2.5.0 png_0.1-8 XML_3.99-0.14 parallel_4.3.1
[101] blob_1.2.4 prettyunits_1.1.1 bitops_1.0-7 VariantAnnotation_1.46.0
[105] scales_1.3.0 openxlsx_4.2.5.2 purrr_1.0.1 crayon_1.5.2
[109] clisymbols_1.2.0 bamsignals_1.32.0 rlang_1.1.1 cowplot_1.1.1
[113] fastmatch_1.1-3 KEGGREST_1.40.0

support GRangesFilter() without being in a list

The following should work

src = src_organism(dbpath=hg38light())
filter = GRangesFilter(GenomicRanges::GRanges("chr8:18391245-18401218"))
exons(src, filter)

but currently requires

exons(src, list(filter))
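
One way to accept both forms is to normalize the argument before it is used. The sketch below is an illustration in base R, not the package's implementation: `as_filter_list()` is a hypothetical helper and `ToyFilter` is a stand-in S4 class used so the example runs without Bioconductor installed.

```r
library(methods)

# ToyFilter is a hypothetical stand-in for an S4 filter object such as the
# one returned by GRangesFilter(); it is not part of any package.
setClass("ToyFilter", representation(value = "character"))

# Hypothetical helper: wrap a single filter in a list, pass lists through,
# and treat a missing filter as an empty list.
as_filter_list <- function(filter = NULL) {
    if (is.null(filter))
        return(list())
    if (is.list(filter))
        return(filter)
    list(filter)
}

single <- new("ToyFilter", value = "chr8:18391245-18401218")

from_single <- as_filter_list(single)        # single filter gets wrapped
from_list   <- as_filter_list(list(single))  # list is returned unchanged
```

This relies on is.list() returning FALSE for an S4 object that does not extend "list"; a filter class that does extend "list" would need an explicit class check instead.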

Error regarding new dbplyr (v1.3.0.9000) changes

The tests fail because of recent changes to dbplyr. They give the following:

         ERROR
        Running the tests in ‘tests/testthat.R’ failed.
        Last 13 lines of output:
          [32] 95456 - 95461 == -5
          [33] 95456 - 95461 == -5
          [34] 95456 - 95461 == -5
          [35] 95456 - 95461 == -5
          [47] 95461 - 95456 ==  5
          ...
   
          ══ testthat results ═════════════════════════════════════════════════════════════════════════
          OK: 200 SKIPPED: 0 FAILED: 1
          1. Failure: select (@test-src_organism-select.R#44)
   
          Error: testthat unit tests failed
          In addition: Warning message:
          call dbDisconnect() when finished working with a connection
          Execution halted

build and check within Bioc timelimits

`transcripts()` does not populate metadata

hg38light <- system.file(
    package="Organism.dplyr", "extdata", "light.hg38.knownGene.sqlite"
)
src <- src_organism(dbpath=hg38light)
transcripts(src)@metadata

table id_go_all has unappealing field names

It is unpleasant to have to modify select statements when switching to
id_go_all (which has goall, evidenceall, etc.) from id_go (which has go, evidence, ...).

I'd propose using the same simple field names for both tables.
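
Until the schema changes, the names can be aligned on the user side. This is a minimal sketch that assumes the id_go_all fields differ from the id_go fields only by a trailing "all", as the two examples in the issue suggest; the full schema is not shown here and may contain other fields.

```r
# Field names mentioned in the issue; the full id_go_all schema may have more
go_all_fields <- c("goall", "evidenceall")

# Strip the trailing "all" so the names match those used by id_go
simple_fields <- sub("all$", "", go_all_fields)
```

The same substitution could be applied to the column names of a collected table, e.g. names(tbl) <- sub("all$", "", names(tbl)), where `tbl` is a hypothetical data frame holding rows from id_go_all.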