sirad's Introduction

Secure Infrastructure for Research with Administrative Data (SIRAD)

sirad is an integration framework for data from administrative systems. It deidentifies administrative data by removing and replacing personally identifiable information (PII) with a global anonymized identifier, allowing researchers to securely join data on an individual from multiple tables without knowing the individual's identity. It is developed by Research Improving People's Lives (RIPL).
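The general idea of joining tables on an anonymized identifier can be sketched as follows. This is a minimal illustration, not sirad's actual algorithm: the `pseudonym` helper, the HMAC-SHA256 construction, the normalization step, and the sample records are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudonym(value: str, salt: bytes) -> str:
    """Hypothetical helper: keyed SHA-256 digest of a normalized PII value."""
    normalized = value.strip().upper().encode("utf-8")
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()


# Assumption: in practice the salt would be a long random secret kept private.
SALT = b"example-secret-salt"

# Two made-up tables that share an identifying column.
table_a = [{"ssn": "123-45-6789", "income": 52000}]
table_b = [{"ssn": "123-45-6789 ", "score": 710}]

# Replace the identifying column with the pseudonym in each table.
a_deid = [{"id": pseudonym(r["ssn"], SALT), "income": r["income"]} for r in table_a]
b_deid = [{"id": pseudonym(r["ssn"], SALT), "score": r["score"]} for r in table_b]

# A researcher can join on "id" without ever seeing the original value.
scores = {r["id"]: r["score"] for r in b_deid}
joined = [{**r, "score": scores.get(r["id"])} for r in a_deid]
```

Because both tables are hashed with the same secret salt, matching individuals receive the same identifier, while anyone without the salt cannot easily recompute it from a guessed value.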

For a worked example using synthetic data, please see sirad-example.

More detailed documentation of the sirad configuration file and layout file formats is available in the wiki.

To learn more about the motivation for creating this package and its potential uses, please see our articles in Communications of the ACM and Software Impacts:

J.S. Hastings, M. Howison, T. Lawless, J. Ucles, P. White. (2019). Unlocking Data to Improve Public Policy. Communications of the ACM 62(10): 48-53. doi:10.1145/3335150

M. Howison, M. Goggins. (2022). SIRAD: Secure Infrastructure for Research with Administrative Data. Software Impacts 12: 100245. doi:10.1016/j.simpa.2022.100245

Installation

Requires Python 3.7 or later.

To install from PyPI using pip:
pip install sirad

To install a development version from the current directory:
pip install -e .

Running

There is a single command line script included, sirad.

sirad supports the following arguments:

  • process - split raw data files into data and PII files
  • research - create a versioned set of research files with a unique anonymous identifier

Configuration

To set configuration options, create a file called sirad_config.py and place it either in the directory where you run the sirad command or elsewhere on your Python path. See _options in config.py for a complete list of options and their default values.

The following options are available:

  • DATA_SALT: secret salt used for hashing data values. Do not share it. A warning is issued if it is not set. Defaults to None.

  • PII_SALT: secret salt used for hashing PII values. Do not share it. A warning is issued if it is not set. Defaults to None.

  • LAYOUTS: directory that contains layout files. Defaults to layouts/.

  • RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to the directories that hold the original data, the processed files, and the research files.

  • VERSION: the current version number of the processed and research files.
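A sirad_config.py using the options above might look like the following sketch. The option names come from the list above; every value is an assumption chosen for illustration, including the environment variable names, the directory paths, and the use of an integer for VERSION.

```python
# sirad_config.py (illustrative values only)
import os

# Read the secret salts from the environment so they are not committed to
# version control. The variable names SIRAD_DATA_SALT and SIRAD_PII_SALT are
# hypothetical; os.environ.get returns None when a variable is unset.
DATA_SALT = os.environ.get("SIRAD_DATA_SALT")
PII_SALT = os.environ.get("SIRAD_PII_SALT")

# Directory that contains the layout files (the documented default).
LAYOUTS = "layouts/"

# Assumed directory layout; adjust to match your project.
RAW_DIR = "raw/"
DATA_DIR = "build/data/"
PII_DIR = "build/pii/"
LINK_DIR = "build/link/"
RESEARCH_DIR = "build/research/"

# Version of the processed and research files (assumed here to be an integer).
VERSION = 1
```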

Layout files

sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed. More documentation to come on this YAML format.

The following file formats are supported:

  • csv - change delimiter with delimiter option
  • fixed width
  • xlsx (xls not currently supported)

Development

Sample test data is randomly generated using Faker; none of the information identifies real individuals.

  • tax.txt - sample tax return data. Includes first name, last name, DOB, and SSN.
  • credit_scores.txt - sample credit score information. Includes first name, last name, and DOB, but no SSN.

Run unit tests as:

python -m unittest discover

Contributors

  • Mark Howison
  • Ted Lawless
  • John Ucles
  • Preston White
  • Marcelle Goggins

sirad's Issues

Improve performance of research command

The groupby command can be very slow in Pandas for large tables with many distinct combinations.
Replace groupby in research.py with drop_duplicates in the stage that matches DOB/names to distinct valid SSN.

Add delimiter detection

For CSV input files, print an INFO message based on a regex search of the first line to try to determine a potential delimiter to help with debugging/troubleshooting.
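One way the proposed detection could work is sketched below. This is not code from sirad: the `guess_delimiter` function, the candidate set (comma, tab, pipe, semicolon), and the logger name are assumptions made for illustration.

```python
import logging
import re
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("delimiter_sketch")  # hypothetical logger name

# Assumed set of candidate delimiters: comma, tab, pipe, semicolon.
CANDIDATES = re.compile(r"[,\t|;]")


def guess_delimiter(first_line):
    """Return the most frequent candidate delimiter in the line, or None."""
    counts = Counter(CANDIDATES.findall(first_line))
    if not counts:
        return None
    delimiter, n = counts.most_common(1)[0]
    log.info("Potential delimiter %r (found %d times in first line)", delimiter, n)
    return delimiter
```

For example, `guess_delimiter("first|last|dob|ssn")` returns `"|"`. A simple frequency count can be fooled by delimiters that appear inside quoted fields, which is why the result is only logged as a hint rather than applied automatically.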
