DataSpaceR

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

DataSpaceR is an R interface to the CAVD DataSpace, a data sharing and discovery tool that facilitates exploration of HIV immunological data from pre-clinical and clinical HIV vaccine studies.

This package is intended for use by immunologists, bioinformaticians, and statisticians in HIV vaccine research, or anyone interested in the analysis of HIV immunological data across assays, studies, and time.

The package simplifies access by taking advantage of the database's standardization to hide all Rlabkey-specific code from the user, and it lets users access study-specific datasets through an object-oriented interface.

Examples & Documentation

For more detailed examples and documentation, see the introductory vignette and the pkgdown site.

For a quick guide to using the API, see our cheat sheet.

Installation

Install from CRAN:

install.packages("DataSpaceR")

You can install the latest development version from GitHub with devtools:

# install.packages("devtools")
devtools::install_github("ropensci/DataSpaceR")

Register and set DataSpace credentials

The database is accessed with the user’s credentials. A netrc file storing login and password information is required.

  1. Create an account and read the terms of use
  2. On your R console, create a netrc file using a function from DataSpaceR:
library(DataSpaceR)
writeNetrc(
  login = "[email protected]", 
  password = "yourSecretPassword",
  netrcFile = "/your/home/directory/.netrc" # use getNetrcPath() to get the default path 
)

This will create a netrc file in your home directory.

Alternatively, you can manually create a netrc file on the computer running R.

  • On Windows, this file should be named _netrc
  • On UNIX, it should be named .netrc
  • The file should be located in the user’s home directory, and its permissions should make it unreadable to everyone except the owner
  • To determine the home directory, run Sys.getenv("HOME") in R

The following three entries must be included in the .netrc or _netrc file, separated by either whitespace (spaces, tabs, or newlines) or commas. Multiple such blocks can exist in one file.

machine dataspace.cavd.org
login [email protected]
password supersecretpassword

See here for more information about netrc.
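The manual route described above can be sketched in base R. This is a minimal sketch and not part of DataSpaceR: it writes to a temporary directory so it cannot overwrite real credentials, and the login and password are placeholders.

```r
# Pick the file name described above for the current platform.
netrc_name <- if (.Platform$OS.type == "windows") "_netrc" else ".netrc"

# Where the real file would live (shown for reference only; nothing is written here).
real_path <- file.path(Sys.getenv("HOME"), netrc_name)

# Write to a throwaway directory instead of the home directory.
demo_dir <- tempfile("netrc_demo_")
dir.create(demo_dir)
demo_path <- file.path(demo_dir, netrc_name)

writeLines(
  c(
    "machine dataspace.cavd.org",
    "login user@example.org",        # placeholder login
    "password supersecretpassword"   # placeholder password
  ),
  demo_path
)

# Restrict access to the owner (on Windows only the read-only flag is affected).
Sys.chmod(demo_path, mode = "0600")

written <- readLines(demo_path)
print(written)
```

To create the real file, the same calls would be pointed at `real_path` instead of `demo_path`.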

Usage

The general idea is that the user:

  1. creates an instance of the DataSpaceConnection class via connectDS
  2. browses available studies and groups in the instance via availableStudies and availableGroups
  3. creates a connection to a specific study via getStudy or a group via getGroup
  4. retrieves datasets by name via getDataset

For example:

library(DataSpaceR)
#> By exporting data from the CAVD DataSpace, you agree to be bound by the Terms of Use available on the CAVD DataSpace sign-in page at https://dataspace.cavd.org

con <- connectDS()
con
#> <DataSpaceConnection>
#>   URL: https://dataspace.cavd.org
#>   User: [email protected]
#>   Available studies: 273
#>     - 77 studies with data
#>     - 5049 subjects
#>     - 423195 data points
#>   Available groups: 6
#>   Available publications: 1530
#>     - 12 publications with data

connectDS() will create a connection to DataSpace.

Available studies can be listed with the availableStudies field:

knitr::kable(head(con$availableStudies))
| study_name | short_name | title | type | status | stage | species | start_date | strategy | network | data_availability | ni_data_availability |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cor01 | NA | The correlate of risk targeted intervention study (CORTIS): A randomized, partially-blinded, clinical trial of isoniazid and rifapentine (3HP) therapy to prevent pulmonary tuberculosis in high-risk individuals identified by a transcriptomic correlate of risk | Phase III | Inactive | Assays Completed | Human | NA | NA | GH-VAP | NA | NA |
| cvd232 | Parks_RV_232 | Limiting Dose Vaginal SIVmac239 Challenge of RhCMV-SIV vaccinated Indian rhesus macaques. | Pre-Clinical NHP | Inactive | Assays Completed | Rhesus macaque | 2009-11-24 | Vector vaccines (viral or bacterial) | CAVD | NA | NA |
| cvd234 | Zolla-Pazner_Mab_test1 Study | Zolla-Pazner_Mab_Test1 | Antibody Screening | Inactive | Assays Completed | Non-Organism Study | 2009-02-03 | Prophylactic neutralizing Ab | CAVD | NA | NA |
| cvd235 | mAbs potency | Weiss mAbs potency | Antibody Screening | Inactive | Assays Completed | Non-Organism Study | 2008-08-21 | Prophylactic neutralizing Ab | CAVD | NA | NA |
| cvd236 | neutralization assays | neutralization assays | Antibody Screening | Active | In Progress | Non-Organism Study | 2009-02-03 | Prophylactic neutralizing Ab | CAVD | NA | NA |
| cvd238 | Gallo_PA_238 | HIV-1 neutralization responses in chronically infected individuals | Antibody Screening | Inactive | Assays Completed | Non-Organism Study | 2009-01-08 | Prophylactic neutralizing Ab | CAVD | NA | NA |

Available groups can be listed with the availableGroups field:

knitr::kable(con$availableGroups)
| group_id | label | original_label | description | created_by | shared | n | studies |
|---|---|---|---|---|---|---|---|
| 216 | mice | mice | NA | readjk | FALSE | 75 | cvd468, cvd483, cvd316, cvd331 |
| 217 | CAVD 242 | CAVD 242 | This is a fake group for CAVD 242 | readjk | FALSE | 30 | cvd242 |
| 220 | NYVAC durability comparison | NYVAC_durability | Compare durability in 4 NHP studies using NYVAC-C (vP2010) and NYVAC-KC-gp140 (ZM96) products. | ehenrich | TRUE | 78 | cvd281, cvd434, cvd259, cvd277 |
| 224 | cvd338 | cvd338 | NA | readjk | FALSE | 36 | cvd338 |
| 228 | HVTN 505 case control subjects | HVTN 505 case control subjects | Participants from HVTN 505 included in the case-control analysis | drienna | TRUE | 189 | vtn505 |
| 230 | HVTN 505 polyfunctionality vs BAMA | HVTN 505 polyfunctionality vs BAMA | Compares ICS polyfunctionality (CD8+, Any Env) to BAMA mfi-delta (single Env antigen) in the HVTN 505 case control cohort | drienna | TRUE | 170 | vtn505 |

Note: A group is a curated collection of participants obtained by filtering on treatments, products, studies, or species. Groups are created in the DataSpace App.

Check out the reference page of DataSpaceConnection for all available fields and methods.
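To find the groups that include a given study, the studies column can be searched group by group. The sketch below rebuilds a small stand-in for the table above in base R instead of calling DataSpace, and it assumes that studies is a list column holding one character vector per group.

```r
# Stand-in for con$availableGroups, using the group IDs and studies listed above.
groups <- data.frame(group_id = c(216, 217, 220, 224, 228, 230))
groups$studies <- list(
  c("cvd468", "cvd483", "cvd316", "cvd331"),
  "cvd242",
  c("cvd281", "cvd434", "cvd259", "cvd277"),
  "cvd338",
  "vtn505",
  "vtn505"
)

# TRUE for every group whose study list contains the study of interest.
has_study <- vapply(groups$studies, function(s) "vtn505" %in% s, logical(1))

matching_ids <- groups$group_id[has_study]
print(matching_ids)
```

With a live connection, the same `vapply()` call could be applied to `con$availableGroups$studies`, provided the column has the assumed shape.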

Create an instance of the cvd408 study:

cvd408 <- con$getStudy("cvd408")
cvd408
#> <DataSpaceStudy>
#>   Study: cvd408
#>   URL: https://dataspace.cavd.org/CAVD/cvd408
#>   Available datasets:
#>     - Binding Ab multiplex assay
#>     - Demographics
#>     - Intracellular Cytokine Staining
#>     - Neutralizing antibody
#>   Available non-integrated datasets:
class(cvd408)
#> [1] "DataSpaceStudy" "R6"

Available datasets can be listed with the availableDatasets field:

knitr::kable(cvd408$availableDatasets)
| name | label | n | integrated |
|---|---|---|---|
| BAMA | Binding Ab multiplex assay | 1080 | TRUE |
| Demographics | Demographics | 20 | TRUE |
| ICS | Intracellular Cytokine Staining | 3720 | TRUE |
| NAb | Neutralizing antibody | 540 | TRUE |

This prints a table of the available datasets.

The Neutralizing Antibody dataset (NAb) can be retrieved with:

NAb <- cvd408$getDataset("NAb")
dim(NAb)
#> [1] 540  33
colnames(NAb)
#>  [1] "participant_id"      "participant_visit"   "visit_day"          
#>  [4] "assay_identifier"    "summary_level"       "specimen_type"      
#>  [7] "antigen"             "antigen_type"        "virus"              
#> [10] "virus_type"          "virus_insert_name"   "clade"              
#> [13] "neutralization_tier" "tier_clade_virus"    "target_cell"        
#> [16] "initial_dilution"    "titer_ic50"          "titer_ic80"         
#> [19] "response_call"       "nab_lab_source_key"  "lab_code"           
#> [22] "exp_assayid"         "titer_id50"          "titer_id80"         
#> [25] "nab_response_id50"   "nab_response_id80"   "slope"              
#> [28] "vaccine_matched"     "study_prot"          "virus_full_name"    
#> [31] "virus_species"       "virus_host_cell"     "virus_backbone"

Check out the reference page of DataSpaceStudy for all available fields and methods.

Note: The package uses an R6 class to represent the connection to a study and to get around some of R’s copy-on-modify behavior.
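The difference this note refers to can be illustrated without R6 itself. R6 objects are environments, so this base-R sketch uses a plain environment to show reference semantics next to the usual copy-on-modify behavior of a list. It illustrates the general mechanism only and says nothing about how DataSpaceR is implemented internally.

```r
# Ordinary R values are copied when modified: changing `b` leaves `a` alone.
a <- list(n = 1)
b <- a
b$n <- 2
list_original <- a$n

# Environments are shared by reference: changing `e2` also changes `e1`.
e1 <- new.env()
e1$n <- 1
e2 <- e1
e2$n <- 2
env_original <- e1$n

print(c(list = list_original, environment = env_original))
```

Sharing by reference is what lets an object such as a study connection keep retrieved data around without it being copied on every modification.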

Meta

  • Please report any issues or bugs.
  • License: GPL-3
  • Get citation information for DataSpaceR in R by running citation(package = 'DataSpaceR')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


dataspacer's People

Contributors

helenmiller16, jeroen, jmtaylor-fhcrc, juyeongkim, seaaan

dataspacer's Issues

Duplicate records found in mAb object's `mabs` value

The mAb object's mabs value contains duplicate records. In the example below, calling unique() on mabs returns fewer rows than mabs itself.

library(DataSpaceR)
con <- connectDS()
con$filterMabGrid("mab_mixture", c("PGT121", "PGDM1400"))
mab <- con$getMab()

nrow(mab$mabs)
nrow(unique(mab$mabs))
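Until the duplicates are fixed at the source, they can be counted and dropped on the client side. The sketch below uses a small made-up table in place of mab$mabs; the column names are borrowed from the filter names used elsewhere in this document, and the rows are illustrative only.

```r
# Made-up stand-in for mab$mabs with one repeated record.
mabs <- data.frame(
  mab_mixture = c("PGT121", "PGDM1400", "PGT121"),
  isotype = c("IgG", "IgG", "IgG"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row.
is_dup <- duplicated(mabs)
n_dup <- sum(is_dup)

# Keep only the first occurrence of each record.
mabs_unique <- mabs[!is_dup, , drop = FALSE]

print(n_dup)
print(nrow(mabs_unique))
```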

`availableGroups$studies` showing incorrectly parsed study IDs

Issue

availableGroups$studies shows incorrectly parsed study IDs.

Reproducible example

> con$availableGroups$studies
[[1]]
character(0)

[[2]]
[1] "cvd338 Pooled" "cvd317"        "cvd338"        "cvd305"        "cvd317 Pooled" "cvd324"        "cvd320"       

[[3]]
[1] "cvd483" "cvd316" "cvd331"

[[4]]
character(0)

The second group should contain cvd338, cvd317, cvd305, cvd324, and cvd320. The "Pooled" suffix is not stripped when parsing cvd338 and cvd317.
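One possible fix is to strip the suffix and then de-duplicate. This is a minimal sketch applied to the values from the second group above, and it is not necessarily how the package resolves the issue.

```r
# Study IDs as returned for the second group in the report above.
raw_ids <- c(
  "cvd338 Pooled", "cvd317", "cvd338", "cvd305",
  "cvd317 Pooled", "cvd324", "cvd320"
)

# Remove a trailing " Pooled" and keep each study ID once, in order of appearance.
clean_ids <- unique(sub("[[:space:]]+Pooled$", "", raw_ids))

print(clean_ids)
```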

Session info

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
                                                                                                                                                                                  
attached base packages:                                                                                                                                                           
[1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                              
                                                                                                                                                                                  
other attached packages:                                                                                                                                                          
[1] DataSpaceR_0.5.1                                                                                                                                                              
                                                                                                                                                                                  
loaded via a namespace (and not attached):                                                                                                                                        
[1] httr_1.3.1          compiler_3.4.3      rjson_0.2.15                                                                                                                          
[4] assertthat_0.2.0    Rlabkey_2.2.1       R6_2.2.2                                                                                                                              
[7] Rcpp_0.12.16        data.table_1.10.4-3 digest_0.6.15         

Can't retrieve datasets over 100,000 rows

Issue

The getDataset() method can't retrieve all rows of datasets with more than 100,000 rows.

Reproducible example

> library(DataSpaceR)
> con <- connectDS()
> sdy <- con$getStudy("")
> sdy$availableDatasets
           name                           label      n
1:         BAMA      Binding Ab multiplex assay  81128
2: Demographics                    Demographics   1874
3:      ELISPOT        Enzyme-Linked ImmunoSpot   5610
4:          ICS Intracellular Cytokine Staining 137963
5:          NAb           Neutralizing antibody  46507
> ICS <- sdy$getDataset("ICS")
> dim(ICS)
[1] 100000     29
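A truncated result like this one can be caught by comparing the retrieved row count with the n reported in availableDatasets. The helper below is hypothetical (it is not a DataSpaceR function), and it is exercised on a stub data frame with the row counts from the report above.

```r
# Hypothetical helper: TRUE when the retrieved table has at least the reported rows.
has_all_rows <- function(reported_n, data) {
  retrieved <- nrow(data)
  if (retrieved < reported_n) {
    warning(sprintf(
      "Retrieved %d of %d reported rows; the result looks truncated.",
      retrieved, as.integer(reported_n)
    ))
  }
  retrieved >= reported_n
}

# Stub for the ICS result: 100,000 rows retrieved, 137,963 reported.
ics_stub <- data.frame(row_id = seq_len(100000))

complete <- suppressWarnings(has_all_rows(137963, ics_stub))
print(complete)
```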

Session info

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
                                                                                                                                                                                  
attached base packages:                                                                                                                                                           
[1] stats     graphics  grDevices utils     datasets  methods   base                                                                                                              
                                                                                                                                                                                  
other attached packages:                                                                                                                                                          
[1] DataSpaceR_0.5.1                                                                                                                                                              
                                                                                                                                                                                  
loaded via a namespace (and not attached):                                                                                                                                        
[1] httr_1.3.1          compiler_3.4.3      rjson_0.2.15                                                                                                                          
[4] assertthat_0.2.0    Rlabkey_2.2.1       R6_2.2.2                                                                                                                              
[7] Rcpp_0.12.16        data.table_1.10.4-3 digest_0.6.15         

Username not displayed when comments found in netrc file

When the netrc file contains comments, the username is not parsed correctly, so it is missing when the connection object is printed.

This causes a test in test-connection.R (line 43 as of the posting date of this issue) to fail.
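A comment-tolerant parser could drop everything after a # before tokenizing. The function below is a hypothetical sketch, not the package's parser, and the login is a placeholder. It would also truncate a password that contains #, so a real fix would need to handle that case.

```r
# Hypothetical helper: return the login stored for `machine`, ignoring comments.
parse_netrc_login <- function(lines, machine) {
  lines <- sub("#.*$", "", lines)  # drop comments (naive: breaks on "#" in values)
  tokens <- unlist(strsplit(paste(lines, collapse = " "), "[[:space:],]+"))
  tokens <- tokens[nzchar(tokens)]

  # Locate the "machine <name>" pair that starts the wanted block.
  starts <- which(tokens == "machine")
  starts <- starts[starts < length(tokens)]
  starts <- starts[tokens[starts + 1] == machine]
  if (length(starts) == 0) return(NA_character_)

  # The block runs until the next "machine" token or the end of the file.
  start <- starts[1]
  later <- which(tokens == "machine")
  later <- later[later > start]
  end <- if (length(later) > 0) later[1] - 1 else length(tokens)
  block <- tokens[start:end]

  k <- match("login", block)
  if (is.na(k) || k == length(block)) return(NA_character_)
  block[k + 1]
}

netrc_lines <- c(
  "# DataSpace credentials",
  "machine dataspace.cavd.org",
  "login user@example.org  # placeholder account",
  "password supersecretpassword"
)

login <- parse_netrc_login(netrc_lines, "dataspace.cavd.org")
print(login)
```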

Make Monoclonal Antibodies (NAbMAb) data available

App workflow:

  • Navigate to MAb Grid
  • Select "MAb/Mixture"
  • Click "Export CSV" or "Export Excel" to download MAb data

Potential workflow with DataSpaceR:

devtools::install_github("CAVDDataSpace/DataSpaceR", ref = "fb_mAb")

# connect to DataSpace
library(DataSpaceR)
con <- connectDS()

# explore mAb Grid and decide on which MAb to pull data
con$mAbGrid

# filter mAb grid by mixture name, donor species, isotype, hxb2 location, viruses, clades, tiers, curve IC50, or studies
con$filterMabGrid(using = "isotype", value = "IgG")
con$filterMabGrid(using = "donor_species", value = "llama")

# get mAb dataset of mAb mixtures from filtered mAb grid
myMab <- con$getMab()

# explore mAb object (mimics the contents of the exported Excel or CSV files)
myMab$mabs
myMab$nabMab
myMab$studies
myMab$assays
myMab$metadata
myMab$variableDefinitions
myMab$studyAndMabs

  • Implement mAbGrid
  • Implement filterMAb() and clearMAbfilter()
  • Implement DataSpaceMAb class
  • Implement getMAb()
  • Write tests
  • Document
  • Write a vignette

https://github.com/LabKey/cds/blob/bdc9346651346029f8d623e99617f2cc53533450/webapp/Connector/src/utility/MabQuery.js#L493

Dataset variable formatting suggestion

Suggestion

Datasets returned by DataSpaceR (DSR) have mixed casing in their column names.

con <- connectDS()
vtn <- con$getStudy("vtn505")
ics <- vtn$getDataset("ICS")
names(ics)

...returns...

 [1] "ParticipantId"                      "ParticipantVisit/Visit"            
 [3] "visit_day"                          "assay_identifier"                  
 [5] "summary_level"                      "cell_type"                         
 [7] "cell_name"                          "antigen"                           
 [9] "antigen_type"                       "peptide_pool"                      
[11] "protein"                            "protein_panel"                     
[13] "protein_panel_protein"              "protein_panel_protein_peptide_pool"
[15] "specimen_type"                      "functional_marker_name"            
[17] "functional_marker_type"             "clade"                             
[19] "vaccine_matched"                    "response_call"                     
[21] "pctpos"                             "pctpos_adj"                        
[23] "pctpos_neg"                         "lab_code"                          
[25] "exp_assayid"                        "ics_lab_source_key"                
[27] "response_method"                    "control"                           
[29] "pooled_info"                        "study_prot"                        
[31] "functionality_score"                "polyfunctionality_score"

Here, ParticipantId and ParticipantVisit/Visit are formatted differently from the rest of the variables. Consider formatting the names of these two fields to match the others.
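The suggested renaming could be done with a small helper. The function below is a hypothetical sketch and not part of DataSpaceR. It assumes that the part after the slash can be dropped, which gives the participant_id and participant_visit names seen in the NAb dataset earlier in this document.

```r
# Hypothetical helper: convert mixed-case column names to snake_case.
to_snake_case <- function(x) {
  x <- sub("/.*$", "", x)                       # drop the "/Visit" suffix
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # "_" at lower-to-upper boundaries
  tolower(x)
}

old_names <- c("ParticipantId", "ParticipantVisit/Visit", "visit_day", "pctpos_adj")
new_names <- to_snake_case(old_names)

print(new_names)
```

Names that are already in snake_case pass through unchanged, so the helper could be applied to every column with `names(ics) <- to_snake_case(names(ics))`.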
