IBM Data Privacy Toolkit (DPT)

The IBM Data Privacy Toolkit (formerly known as Privacy Masking and Anonymization, or PRIMA) is a toolkit for data type identification, privacy risk assessment, data masking, and data anonymization. It is exposed as a Java/Scala library and as a REST API. The toolkit consists of four main components:

  • Type identification
  • Masking providers
  • Privacy risk assessment
  • Anonymization providers

Each component is briefly described in the remainder of this document.

Type identification

Our toolkit provides a type identification mechanism that frees the user from the burden of specifying the types of the fields in the data to be analysed. This is a core feature, because the toolkit was designed to operate in modern infrastructures where dynamic and heterogeneous schemas coexist. Our toolkit identifies more than 30 types, from general types like name, race, date and time to industry-specific attributes like ICD codes (medical diseases) and credit card numbers. The type identification mechanism is extensible: it provides a programming interface for users to implement their own identifiers based on either regular expressions or dictionaries.
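
As a rough illustration of regular-expression and dictionary based identifiers, the following Python sketch labels a column with the type that matches most of its values. It is not the toolkit's API: the names `is_email`, `is_country`, `IDENTIFIERS` and `identify_column`, the toy country dictionary and the 0.8 match ratio are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical regex-based identifier (simplified e-mail pattern).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical dictionary-based identifier (toy dictionary, lower-cased).
COUNTRIES = {"ireland", "italy", "greece", "india"}


def is_email(value: str) -> bool:
    return EMAIL_RE.match(value.strip()) is not None


def is_country(value: str) -> bool:
    return value.strip().lower() in COUNTRIES


# Registry of identifiers; adding a new type means adding one entry here.
IDENTIFIERS = {"EMAIL": is_email, "COUNTRY": is_country}


def identify_column(values, min_ratio=0.8):
    """Return the type matched by at least min_ratio of non-empty values, else None."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return None
    counts = Counter()
    for value in non_empty:
        for name, matcher in IDENTIFIERS.items():
            if matcher(value):
                counts[name] += 1
    if not counts:
        return None
    best, hits = counts.most_common(1)[0]
    return best if hits / len(non_empty) >= min_ratio else None
```

Calling `identify_column` on a list of e-mail addresses would return `"EMAIL"`, while a column of free text that matches no identifier would return `None`.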

Masking providers

Our toolkit provides masking functionality for all supported data types. The masking providers are designed and implemented with utility preservation in mind. For example, masking a credit card number in a utility-preserving way means maintaining the vendor information (VISA, MasterCard, etc.) and randomising the unique identifier.
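
The credit card example can be sketched as follows. This is a minimal illustration, not the toolkit's masking provider: it assumes that the vendor information is carried by the first six digits, and it additionally recomputes the Luhn check digit so that the masked number still passes checksum validation. The function names and the `keep` parameter are hypothetical.

```python
import random


def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks its check digit."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check that the last digit of number is the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card(number: str, rng=None, keep=6) -> str:
    """Keep the first `keep` digits, randomise the rest, and fix the check digit."""
    rng = rng or random.Random()
    digits = number.replace(" ", "").replace("-", "")
    prefix = digits[:keep]
    # All digits between the preserved prefix and the final check digit.
    middle_len = len(digits) - keep - 1
    middle = "".join(str(rng.randrange(10)) for _ in range(middle_len))
    payload = prefix + middle
    return payload + luhn_check_digit(payload)
```

For instance, `mask_card("4111 1111 1111 1111")` returns a 16-digit number that still starts with `411111` and still satisfies `luhn_valid`, while the account digits are random.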

Our masking framework supports two important properties:

  • Consistent data masking: the same input value is always masked to the same output value, which is required for correlation across different masking operations.
  • Compound data masking: our toolkit identifies relationships between data attributes and uses them as input to the masking process. For example, a city attribute is linked with the country attribute: whenever the city attribute is masked, the country attribute is updated to reflect the correct country for the new city value. Compound data masking is essential for creating more realistic masked output. Our toolkit detects and masks relationships for location data (city, country, continent), dates, and numerical properties (sum, product, approximate sum).
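
Both properties can be illustrated together with a small Python sketch. It makes several assumptions that are not taken from the toolkit: consistency is obtained with a keyed hash (HMAC-SHA256) that deterministically selects a replacement, the city-to-country relationship comes from a toy lookup table, and the names `consistent_choice` and `mask_location` are hypothetical.

```python
import hashlib
import hmac

# Toy lookup table (an assumption for this sketch): city -> country.
CITY_TO_COUNTRY = {
    "Dublin": "Ireland",
    "Rome": "Italy",
    "Athens": "Greece",
    "Zurich": "Switzerland",
}
CITIES = sorted(CITY_TO_COUNTRY)


def consistent_choice(value: str, choices, key: bytes) -> str:
    """Deterministically map value to one of choices using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return choices[int.from_bytes(digest[:8], "big") % len(choices)]


def mask_location(record: dict, key: bytes) -> dict:
    """Mask 'city' consistently, then rewrite 'country' to match the new city."""
    masked = dict(record)  # leave the input record untouched
    new_city = consistent_choice(record["city"], CITIES, key)
    masked["city"] = new_city
    masked["country"] = CITY_TO_COUNTRY[new_city]
    return masked
```

With the same key, every record whose city is `"Dublin"` receives the same replacement city, and its country field always agrees with that replacement. Note that this simple scheme may map a city to itself; a production masking provider would need to handle that case.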

Privacy risk identification

The privacy risk identification component identifies combinations of values in the data that appear fewer times than a given threshold. For example, a combination of age, gender and ZIP code may appear only twice in a dataset, which in practice means that an adversary can narrow down a re-identification attack to two people. Our toolkit includes a set of algorithms that perform privacy risk identification with scalability and performance in mind.
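
The idea can be shown with a brute-force count over one fixed set of attributes. This sketch is only an illustration under assumed names (`risky_combinations`, the sample `rows`); it is not one of the toolkit's scalable algorithms, and it does not search over subsets of attributes.

```python
from collections import Counter


def risky_combinations(rows, quasi_identifiers, k):
    """Return {combination: count} for value combinations seen fewer than k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical sample data for the age + gender + ZIP code example.
rows = [
    {"age": 34, "gender": "F", "zip": "10001"},
    {"age": 34, "gender": "F", "zip": "10001"},
    {"age": 34, "gender": "F", "zip": "10001"},
    {"age": 51, "gender": "M", "zip": "10002"},
    {"age": 51, "gender": "M", "zip": "10002"},
]

risky = risky_combinations(rows, ["age", "gender", "zip"], k=3)
print(risky)  # {(51, 'M', '10002'): 2}
```

With a threshold of 3, the combination shared by only two records is reported, while the combination shared by three records is not.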

Anonymization providers

Masking providers protect unique identifiers such as names and e-mail addresses. Combinations of attributes that lead to privacy risks (also known as quasi-identifiers), such as those detected by the privacy risk identification algorithms of our toolkit, require a separate anonymization process. Our toolkit includes anonymization algorithms that can be applied to a dataset to protect the quasi-identifiers by replacing them with generalised values.
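
A minimal sketch of generalisation, assuming two hand-written rules that are not taken from the toolkit: ages are replaced by ten-year ranges and ZIP codes keep only their first three characters. The function names and the fixed rules are hypothetical; real anonymization algorithms choose the level of generalisation needed to reach a target such as k-anonymity.

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with the range of `width` years that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and replace the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def anonymize(rows):
    """Return copies of rows with the 'age' and 'zip' quasi-identifiers generalised."""
    return [
        {**row, "age": generalize_age(row["age"]), "zip": generalize_zip(row["zip"])}
        for row in rows
    ]
```

After generalisation, records such as age 34 with ZIP 10001 and age 37 with ZIP 10002 both become `30-39` with `100**`, so they can no longer be told apart by these attributes.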

References

Academic References:

  • M. Kesarwani, A. Kaul, S. Braghin, N. Holohan, and S. Antonatos, "Secure k-anonymization over encrypted databases," 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021, pp. 20-30, doi: 10.1109/CLOUD53861.2021.00015.
  • M. Pachilakis, S. Antonatos, K. Levacher, and S. Braghin, "PrivLeAD: Privacy Leakage Detection on the Web," Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1250. Springer, Cham. doi: 10.1007/978-3-030-55180-3_32.
  • S. Braghin, J. H. Bettencourt-Silva, K. Levacher, and S. Antonatos, "An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures," in MEDINFO 2019: Health and Wellbeing e-Networks for All (pp. 1140-1144). IOS Press. doi: 10.3233/SHTI190404.
  • S. Antonatos, S. Braghin, N. Holohan, and P. Mac Aonghusa, "AnonTokens: tracing re-identification attacks through decoy records." arXiv preprint arXiv:1906.09829 (2019).
  • S. Antonatos, S. Braghin, N. Holohan, Y. Gkoufas and P. Mac Aonghusa, "PRIMA: An End-to-End Framework for Privacy at Scale," 2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018, pp. 1531-1542, doi: 10.1109/ICDE.2018.00171.
  • N. Holohan, S. Antonatos, S. Braghin, and P. Mac Aonghusa, "(k, ϵ)-anonymity: k-anonymity with ϵ-differential privacy." arXiv preprint arXiv:1710.01615 (2017).
  • A. Gkoulalas-Divanis and S. Braghin, "IPV: A system for identifying privacy vulnerabilities in datasets," in IBM Journal of Research and Development, vol. 60, no. 4, pp. 14:1-14:10, July-Aug. 2016, doi: 10.1147/JRD.2016.2576818.
  • A. Gkoulalas-Divanis, S. Braghin and S. Antonatos, "FPVI: A scalable method for discovering privacy vulnerabilities in microdata," 2016 IEEE International Smart Cities Conference (ISC2), 2016, pp. 1-8, doi: 10.1109/ISC2.2016.7580849.
  • A. Gkoulalas-Divanis and S. Braghin, "Efficient algorithms for identifying privacy vulnerabilities," 2015 IEEE First International Smart Cities Conference (ISC2), 2015, pp. 1-8, doi: 10.1109/ISC2.2015.7366170.

Citing this Project

If you use the DPT code, please consider citing:

@software{dpt-code,
  author = {DPT Team},
  month = {12},
  title = {{IBM Data Privacy Toolkit}},
  url = {https://github.com/IBM/data-privacy-toolkit},
  version = {main},
  year = {2022}
}

and

@inproceedings{antonatos2018prima,
  title={{PRIMA: an end-to-end framework for privacy at scale}},
  author={Antonatos, Spiros and Braghin, Stefano and Holohan, Naoise and Gkoufas, Yiannis and Mac Aonghusa, Pol},
  booktitle={2018 IEEE 34th International Conference on Data Engineering (ICDE)},
  pages={1531--1542},
  year={2018},
  organization={IEEE}
}

License

The DPT project is licensed under the Apache 2.0 license.
