Overview of the distributed package

Introduction

In distributed data networks such as Sentinel and PCORnet, minimizing the amount of data shared across data partners is important for reducing the risk of privacy breaches. In this simulation project, we examined the performance of data analysis methods at various levels of data sharing such as meta-analysis, summary table data, risk set data, and individual-level data. A preliminary version of the results can be found in the presentation slides. The final results are in the corresponding publication (Pharmacoepidemiology and Drug Safety 2018 [in press]). More information on privacy-protecting analytic and data-sharing methods is available at www.distributedanalysis.org.

System Requirements

Running the entire simulation requires a working installation of R as well as UNIX SAS that can be called via the sas command. Most of the simulation is in pure R, but the experimental weighted risk set analysis is implemented in SAS. When the sas command is not found, this part of the simulation is skipped gracefully.

Installation

The package can be installed from the shell as follows if you have an archive file. If the installation reports missing dependencies, install those required packages first.

R CMD INSTALL distributed_0.3.0.tar.gz

Another way to install the package is to install it directly from GitHub within R.

## Install devtools (if you do not have it already)
install.packages("devtools")
## Install directly from github (develop branch)
devtools::install_github(repo = "kaz-yos/distributed")

Using devtools may require some preparation. Please see the following link for more information.

http://www.rstudio.com/projects/devtools/

Overview of Package Contents

The distributed package contains the functions used to generate, prepare, and analyze data. It also contains the script files used to run the simulation study. The functions can be loaded in R using library(distributed). The scripts are found in the inst subfolder of this repository. Before running them, create the subfolders data, log, log_odyssey, and summary. The unarchived folder contains the following R and shell scripts.

./scripts
├── 01.GenerateData.R
├── 01.GenerateData_odyssey.sh
├── 02.PrepareData.R
├── 02.PrepareData_odyssey.sh
├── 03.AnalyzeData.R
├── 03.AnalyzeData_odyssey.sh
├── 04.AggregateResults.R
├── 04.AggregateResults_odyssey.sh
├── 05.AssessMethodsByScenario.R
└── 06.AssessMethodsByScenarioSeries.R
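
The subfolders named above can be created in one step. The following is a minimal sketch, assuming it is run from the folder that holds the scripts. The function name make_sim_dirs is hypothetical and is not part of the package.

```shell
#!/bin/sh
# make_sim_dirs is a hypothetical helper (not part of the package).
# It creates the working subfolders that the text above asks for.
make_sim_dirs() {
    base="${1:-.}"                       # target folder, defaults to the current one
    for d in data log log_odyssey summary; do
        mkdir -p "$base/$d" || return 1  # -p: no error if the folder already exists
    done
}

# Usage (from the scripts folder):
#   make_sim_dirs .
```

Because mkdir -p is used, running the helper again on an existing layout leaves it unchanged.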

Running Simulation

The simulation has the following distinct phases.

Data Generation
Data Preparation
Data Analysis
Result Aggregation
Method Assessment

Data Generation

Running the following generates files containing simulated distributed data networks under the data subfolder. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 01.GenerateData.R 4

Multiple files are generated for each scenario to reduce the resource requirement. Each new file generated under the data subfolder has a name such as ScenarioRaw001_part001_R50.RData, where the first number indicates the scenario, the part number indicates the file's position in the series of files for this scenario, and R50 indicates the number of iterations included in the file.
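
The naming scheme can be unpacked with plain shell string handling. This is a sketch only, assuming every file name follows the pattern of the example above. The function parse_scenario_file is a hypothetical helper, not something the package provides.

```shell
#!/bin/sh
# parse_scenario_file is a hypothetical helper (not part of the package).
# It splits a name such as ScenarioRaw001_part001_R50.RData into its fields.
parse_scenario_file() {
    name=$(basename "$1" .RData)   # drop folder and extension
    stem=${name%%_*}               # e.g. ScenarioRaw001
    rest=${name#*_}                # e.g. part001_R50
    part=${rest%%_*}               # e.g. part001
    iter=${rest#*_}                # e.g. R50
    # Strip the leading non-digit prefix to keep only the scenario number
    scenario=$(printf '%s' "$stem" | sed 's/^[^0-9]*//')
    printf 'scenario=%s part=%s iterations=%s\n' \
        "$scenario" "${part#part}" "${iter#R}"
}

parse_scenario_file ./data/ScenarioRaw001_part001_R50.RData
# prints: scenario=001 part=001 iterations=50
```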

If you are running the simulation in a Linux cluster environment with a job manager, the following script can help dispatch the data generation job to a node. This shell script is designed for Harvard University's Odyssey cluster (SLURM job manager), so it will require modification to match the configuration of the cluster system you are using.

sh 01.GenerateData_odyssey.sh

Data Preparation

This step fits summary score models to the data and performs matching, stratification, and weighting by these estimated summary scores. Conceptually, this part corresponds to what each site does in a distributed data network. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 02.PrepareData.R ./data/ScenarioRaw001_part001_R50.RData 4

This will generate a new file named ScenarioPrepared001_part001_R50.RData under the data subfolder. This process can be repeated for each file with a for loop, but it is better suited to a cluster system. The following script dispatches the data preparation job for each file to a separate node, thereby allowing parallel execution. Again, the included scripts are specialized for the cluster the authors used and need modification before use on a different system.

sh 02.PrepareData_odyssey.sh ./data/ScenarioRaw*
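
For a single machine without a job manager, the for-loop approach mentioned above might look like the sketch below. It assumes the data files sit in ./data and that the R scripts take a file path followed by a core count, as in the commands shown earlier. The function run_on_each and the DRY_RUN variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# run_on_each is a hypothetical wrapper (not part of the package).
# It runs one R script on every ./data file whose name starts with a prefix,
# one file after another. Set DRY_RUN=1 to print the commands without running them.
run_on_each() {
    script=$1
    prefix=$2
    cores=${3:-4}
    for f in ./data/"$prefix"*.RData; do
        [ -e "$f" ] || continue              # the glob matched nothing
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "Rscript $script $f $cores"
        else
            Rscript "$script" "$f" "$cores" || return 1   # stop at the first failure
        fi
    done
}

# Usage:
#   run_on_each 02.PrepareData.R ScenarioRaw 4
```

The same wrapper can be pointed at 03.AnalyzeData.R with the ScenarioPrepared prefix. Files are processed sequentially, so this trades the cluster's parallelism across files for simplicity.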

Data Analysis

This step conducts the actual analysis of prepared data for the treatment effect of interest. The process has to be run on each data file as follows. The 4 at the end of the command specifies the number of CPU cores to use.

Rscript 03.AnalyzeData.R ./data/ScenarioPrepared001_part001_R50.RData 4

This will generate a new file named ScenarioAnalyzed001_part001_R50.RData under the data subfolder. Again, this can be repeated using a for loop or dispatched to multiple nodes in a cluster system.

sh 03.AnalyzeData_odyssey.sh ./data/ScenarioPrepared*
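
When jobs are spread over many nodes, it may be worth confirming that every prepared file has produced its analyzed counterpart before moving on. The sketch below relies only on the ScenarioPrepared and ScenarioAnalyzed naming shown above. The function list_missing_analyzed is a hypothetical helper, not part of the package.

```shell
#!/bin/sh
# list_missing_analyzed is a hypothetical helper (not part of the package).
# For each ScenarioPrepared file in a folder, it prints the path of the
# expected ScenarioAnalyzed file if that file does not exist yet.
list_missing_analyzed() {
    dir=${1:-./data}
    for f in "$dir"/ScenarioPrepared*.RData; do
        [ -e "$f" ] || continue                      # no prepared files found
        base=$(basename "$f")
        want="$dir/ScenarioAnalyzed${base#ScenarioPrepared}"
        [ -e "$want" ] || echo "$want"
    done
}

# Usage:
#   list_missing_analyzed ./data
```

An empty result means each prepared file has a matching analyzed file by name. It does not check that the analysis itself finished without errors.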

Result Aggregation

This step aggregates the analysis results into a summary file. The following will load all data files with names containing ScenarioAnalyzed (analysis result files), and output assessment results in the summary subfolder.

sh 04.AggregateResults_odyssey.sh

An R data file named analysis_summary_data.RData will be generated under the data subfolder.

Method Assessment

The following steps are less computationally intensive and designed for local execution with the analysis_summary_data.RData file in the data subfolder.

Rscript 05.AssessMethodsByScenario.R
Rscript 06.AssessMethodsByScenarioSeries.R
