Overview of the distributed package

Introduction

In distributed data networks such as Sentinel and PCORnet, minimizing the amount of data shared across data partners is important for reducing the risk of privacy breaches. In this simulation project, we examined the performance of data analysis methods at various levels of data sharing: meta-analysis, summary table data, risk set data, and individual-level data. A preliminary version of the results can be found in the presentation slides. The final results are in the corresponding publication (Pharmacoepidemiology and Drug Safety 2018 [in press]). More information on privacy-protecting analytic and data-sharing methods is available at www.distributedanalysis.org.

System Requirements

Running the entire simulation requires a working installation of R as well as UNIX SAS that can be called via the sas command. Most of the simulation is in pure R, but the experimental weighted risk set analysis is implemented in SAS. When the sas command is not found, this part of the simulation is skipped gracefully.
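
Before starting a long run, it can help to confirm which of the required commands are on the PATH. The following is a minimal sketch, not part of the package; the helper name have is hypothetical, and the check only reports availability without changing how the simulation behaves.

```shell
#!/bin/sh
# have: hypothetical helper that succeeds when a command is on the PATH
have() {
    command -v "$1" >/dev/null 2>&1
}

# Rscript runs the R scripts; sas is only needed for the
# experimental weighted risk set analysis.
for cmd in Rscript sas; do
    if have "$cmd"; then
        echo "$cmd: found"
    else
        echo "$cmd: not found"
    fi
done
```

If sas is reported as not found, the pure R parts of the simulation should still run, as described above.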

Installation

The package can be installed as follows from the shell if you have an archive file. If it asks for dependencies, you may need to install these required packages first.

R CMD INSTALL distributed_0.3.0.tar.gz

Another way is to install the package directly from GitHub within R.

## Install devtools (if you do not have it already)
install.packages("devtools")
## Install directly from github (develop branch)
devtools::install_github(repo = "kaz-yos/distributed")

Using devtools may require some preparation. Please see the following link for information.

http://www.rstudio.com/projects/devtools/

Overview of Package Contents

The distributed package contains the functions used to generate, prepare, and analyze data. It also contains the script files used to run the simulation study. The functions can be loaded in R using library(distributed). The scripts are found in the inst subfolder of this repository. The unarchived folder contains the following R and shell scripts. You need to create the subfolders data, log, log_odyssey, and summary before execution.

./scripts
├── 01.GenerateData.R
├── 01.GenerateData_odyssey.sh
├── 02.PrepareData.R
├── 02.PrepareData_odyssey.sh
├── 03.AnalyzeData.R
├── 03.AnalyzeData_odyssey.sh
├── 04.AggregateResults.R
├── 04.AggregateResults_odyssey.sh
├── 05.AssessMethodsByScenario.R
└── 06.AssessMethodsByScenarioSeries.R
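
The four working subfolders named above can be created in one step. This is a minimal sketch that assumes it is run from the folder holding the scripts, since the later commands refer to ./data relative to that location.

```shell
#!/bin/sh
# Create the working subfolders the scripts expect.
# -p makes the command safe to rerun when the folders already exist.
mkdir -p data log log_odyssey summary
```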

Running Simulation

The simulation has the following distinct phases.

  • Data Generation
  • Data Preparation
  • Data Analysis
  • Result Aggregation
  • Method Assessment

Data Generation

Running the following will generate files containing simulated distributed data networks under the data subfolder. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 01.GenerateData.R 4

Multiple files are generated for each scenario to lessen the resource requirement. Each new file generated under the data subfolder has a name such as ScenarioRaw001_part001_R50.RData, where the first number indicates the scenario, the part number indicates the file's position in the series of files for this scenario, and R50 indicates the number of iterations included in the file.
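
The naming scheme can be unpacked with plain shell parameter expansion, which is handy when checking that all parts of a scenario are present. This is an illustrative sketch only; the variable names are hypothetical, and it assumes file names follow the pattern shown above exactly.

```shell
#!/bin/sh
file="ScenarioRaw001_part001_R50.RData"

base="${file%.RData}"               # strip the extension
scenario="${base%%_*}"              # keep text before the first underscore
scenario="${scenario#ScenarioRaw}"  # drop the prefix, leaving the scenario number
rest="${base#*_}"                   # text after the first underscore
part="${rest%%_*}"                  # the partNNN field
part="${part#part}"                 # drop the prefix, leaving the part number
iters="${base##*_R}"                # text after the final _R: iteration count

echo "scenario=$scenario part=$part iterations=$iters"
```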

If you are running the simulation in a Linux cluster environment with a job manager, the following script can help dispatch the data generation job to a node. This shell script is designed for Harvard University’s Odyssey cluster (SLURM job manager). Thus, the script will require modification according to the configuration of the cluster system you are using.

sh 01.GenerateData_odyssey.sh

Data Preparation

This step fits summary score models to the data and performs matching, stratification, and weighting by these estimated summary scores. Conceptually, this part corresponds to what each site does in a distributed data network. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 02.PrepareData.R ./data/ScenarioRaw001_part001_R50.RData 4

This will generate a new file named ScenarioPrepared001_part001_R50.RData under the data subfolder. This process can be repeated for each file via a for loop, but it is better suited to a cluster system. The following script dispatches the data preparation job on each file to a separate node, thereby allowing parallel execution. Again, the included scripts are specialized for the cluster the authors used and need modification before use on a different system.

sh 02.PrepareData_odyssey.sh ./data/ScenarioRaw*
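
For local execution without a cluster, the per-file command shown above can be wrapped in a for loop. The following is a minimal sketch under the assumption that it is run from the scripts folder; CORES and count are hypothetical variable names, and files are processed one at a time.

```shell
#!/bin/sh
# Number of CPU cores passed to each run (4 matches the example above)
CORES=4
count=0

for f in ./data/ScenarioRaw*.RData; do
    # When no raw files exist the pattern is left unexpanded; skip it
    [ -e "$f" ] || continue
    # Count a file only when its preparation run succeeds
    if Rscript 02.PrepareData.R "$f" "$CORES"; then
        count=$((count + 1))
    fi
done

echo "Prepared $count file(s)"
```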

Data Analysis

This step conducts the actual analysis of prepared data for the treatment effect of interest. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 03.AnalyzeData.R ./data/ScenarioPrepared001_part001_R50.RData 4

This will generate a new file named ScenarioAnalyzed001_part001_R50.RData under the data subfolder. Again, this can be repeated using a for loop or dispatched to multiple nodes in a cluster system.

sh 03.AnalyzeData_odyssey.sh ./data/ScenarioPrepared*
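
When looping over the prepared files locally, it can be useful to skip files whose analysis output already exists, so that an interrupted run can be resumed. This is a sketch only; the helper analyzed_name is hypothetical and simply applies the ScenarioPrepared to ScenarioAnalyzed renaming described above.

```shell
#!/bin/sh
# analyzed_name: hypothetical helper mapping a prepared-data path
# to the analyzed-data path the analysis step is described as writing
analyzed_name() {
    echo "./data/ScenarioAnalyzed${1#./data/ScenarioPrepared}"
}

for f in ./data/ScenarioPrepared*.RData; do
    # When no prepared files exist the pattern is left unexpanded; skip it
    [ -e "$f" ] || continue
    out=$(analyzed_name "$f")
    if [ -e "$out" ]; then
        echo "Skipping $f ($out already exists)"
        continue
    fi
    Rscript 03.AnalyzeData.R "$f" 4
done
```

The skip check relies only on the output file being present, so a partially written file from a crashed run would need to be deleted by hand before resuming.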

Result Aggregation

This step aggregates the analysis results into a summary file. The following will load all data files with names containing ScenarioAnalyzed (analysis result files), and output assessment results in the summary subfolder.

sh 04.AggregateResults_odyssey.sh

An R data file named analysis_summary_data.RData will be generated under the data subfolder.

Method Assessment

The following steps are less computationally intensive and are designed for local execution with the analysis_summary_data.RData file in the data subfolder.

Rscript 05.AssessMethodsByScenario.R
Rscript 06.AssessMethodsByScenarioSeries.R
