Analyzing Leakage of Personally Identifiable Information in Language Models

This repository contains the official code for our IEEE S&P 2023 paper. It builds on GPT-2 language models and Flair Named Entity Recognition (NER) models, and allows fine-tuning (i) undefended, (ii) differentially private, and (iii) scrubbed language models on the ECHR and Enron datasets and attacking them using the attacks presented in our paper.

Publication

Analyzing Leakage of Personally Identifiable Information in Language Models. Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. IEEE Symposium on Security and Privacy (S&P '23), San Francisco, CA, USA.

Build & Run

We recommend setting up a conda environment for this project.

$ conda create -n pii-leakage python=3.10
$ conda activate pii-leakage
$ pip install -e .

Usage

The repository implements the following functions. The scripts are in the ./examples folder and run configurations are in the ./configs folder; a short sketch after the list below illustrates the inputs each attack consumes.

  • Fine-Tune: Fine-tune a pre-trained LM on a dataset (optionally with DP or scrubbing).
  • PII Extraction: Given a fine-tuned LM, return a set of PII.
  • PII Reconstruction: Given a fine-tuned LM and a masked sentence, reconstruct the most likely PII candidate for the masked tokens.
  • PII Inference: Given a fine-tuned LM, a masked sentence and a set of PII candidates, choose the most likely candidate.
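
To make the interfaces concrete, the sketch below shows what each attack takes as input. The function names and the <MASK> placeholder are illustrative, not the repository's actual API.

# Illustrative attack interfaces; names and the <MASK> token are
# hypothetical, not the repository's actual API.
from typing import List, Set

def extract_pii(lm) -> Set[str]:
    """PII extraction: only the fine-tuned LM is given."""
    ...

def reconstruct_pii(lm, masked_sentence: str) -> str:
    """PII reconstruction: e.g. 'The applicant, <MASK>, was born in 1964.'"""
    ...

def infer_pii(lm, masked_sentence: str, candidates: List[str]) -> str:
    """PII inference: additionally given candidates such as ['John Doe', 'Jane Roe']."""
    ...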

Fine-Tuning

We demonstrate how to fine-tune a GPT-2 small (Hugging Face) model on the ECHR dataset (i) without defenses, (ii) with scrubbing, and (iii) with differentially private training (ε=8).

No Defense

$ python fine_tune.py --config_path ../configs/fine-tune/echr-gpt2-small-undefended.yml
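
For orientation, the undefended run is standard causal-LM training. A minimal sketch with Hugging Face transformers follows; the toy dataset and hyperparameters are illustrative, since the repository drives everything from the YAML configs.

# Minimal causal-LM fine-tuning sketch (Hugging Face transformers).
# The toy dataset and hyperparameters are illustrative; the repository
# reads them from the YAML configs instead.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

texts = ["The applicant was born in 1964 and lives in Vienna."]  # stand-in for ECHR
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="echr_undefended", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()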

With Scrubbing

Note: All PII will be scrubbed from the dataset. Scrubbing is a one-time operation that first requires tagging all PII in the dataset, which can take many hours depending on your setup. We do not provide tagged datasets.

$ python fine_tune.py --config_path ../configs/fine-tune/echr-gpt2-small-scrubbed.yml
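
Conceptually, scrubbing tags PII spans with an NER model and replaces them with a placeholder. A minimal sketch using Flair, where the mask string and example sentence are illustrative:

# Sketch of NER-based scrubbing with Flair: tag PII spans, then replace
# them with a placeholder token. Mask string and example are illustrative.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-large")  # pre-trained NER model

def scrub(text: str, mask: str = "[MASK]") -> str:
    sentence = Sentence(text)
    tagger.predict(sentence)
    # Replace spans right-to-left so earlier character offsets stay valid.
    for span in sorted(sentence.get_spans("ner"),
                       key=lambda s: s.start_position, reverse=True):
        text = text[:span.start_position] + mask + text[span.end_position:]
    return text

print(scrub("Mr John Doe lodged an application against Austria."))
# e.g. -> "Mr [MASK] lodged an application against [MASK]."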

With DP (ε=8.0)

Note: We use the dp-transformers wrapper around PyTorch's opacus library.

$ python fine_tune.py --config_path ../configs/fine-tune/echr-gpt2-small-dp8.yml
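
For intuition, here is a bare-bones DP-SGD sketch using opacus directly on a toy model; the repository itself goes through the dp-transformers wrapper, and all values below are illustrative.

# Bare-bones DP-SGD sketch with opacus on a toy model. The repository
# uses the dp-transformers wrapper; all values here are illustrative.
import torch
from opacus import PrivacyEngine

model = torch.nn.Linear(10, 2)  # stand-in for GPT-2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.utils.data.TensorDataset(torch.randn(64, 10),
                                      torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(data, batch_size=8)

model, optimizer, loader = PrivacyEngine().make_private_with_epsilon(
    module=model, optimizer=optimizer, data_loader=loader, epochs=3,
    target_epsilon=8.0,  # the privacy budget used in this config
    target_delta=1e-6,
    max_grad_norm=1.0)   # per-sample gradient clipping bound
# Training then proceeds as usual; per-sample gradients are clipped and noised.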

Attacks

Assuming your fine-tuned model is located at ../echr_undefended, run the following attacks. Otherwise, edit the model_ckpt attribute in the ../configs/<ATTACK>/echr-gpt2-small-undefended.yml file to point to the location of your model.

PII Extraction

This will extract PII from the model's generated text.

$ python extract_pii.py --config_path ../configs/pii-extraction/echr-gpt2-small-undefended.yml
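
In essence, the attack samples text from the fine-tuned model and tags PII in the samples with an NER model. A simplified sketch, where the checkpoint path, sample count, and entity filter are illustrative:

# Simplified PII-extraction sketch: sample from the fine-tuned LM, then
# tag PII in the samples with Flair. Paths and parameters are illustrative.
from flair.data import Sentence
from flair.models import SequenceTagger
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../echr_undefended")
model = AutoModelForCausalLM.from_pretrained("../echr_undefended")
tagger = SequenceTagger.load("flair/ner-english-large")

extracted = set()
for _ in range(100):  # more samples surface more (and rarer) PII
    ids = model.generate(do_sample=True, max_length=128, top_k=40,
                         pad_token_id=tokenizer.eos_token_id)
    sample = Sentence(tokenizer.decode(ids[0], skip_special_tokens=True))
    tagger.predict(sample)
    extracted.update(span.text for span in sample.get_spans("ner")
                     if span.tag == "PER")  # keep person names, for example
print(extracted)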

PII Reconstruction

This will reconstruct PII from the model given a target sequence.

$ python reconstruct_pii.py --config_path ../configs/pii-reconstruction/echr-gpt2-small-undefended.yml
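
At its core, reconstruction ranks candidate fills by how likely the model finds the completed sequence. A minimal perplexity-ranking sketch (not the repository's exact attack; the checkpoint path, mask token, and candidates are illustrative):

# Minimal perplexity-ranking sketch for PII reconstruction: fill the mask
# with each candidate and keep the fill the model finds most likely.
# Checkpoint path, mask token and candidates are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../echr_undefended")
model = AutoModelForCausalLM.from_pretrained("../echr_undefended").eval()

def nll(text: str) -> float:
    """Average negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def rank(masked: str, candidates: list[str]) -> str:
    return min(candidates, key=lambda c: nll(masked.replace("<MASK>", c)))

print(rank("The applicant, <MASK>, was born in 1964.",
           ["John Doe", "Jane Roe", "Max Mustermann"]))

For reconstruction, the candidate set first has to be generated, e.g. by sampling from the model; PII inference (next) applies the same scoring to a candidate set the attacker already has.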

PII Inference

This will infer PII from the model given a target sequence and a set of PII candidates.

$ python reconstruct_pii.py --config_path ../configs/pii-inference/echr-gpt2-small-undefended.yml

Evaluation

Use the evaluate.py script to evaluate our privacy attacks against the LM.

$ python evaluate.py --config_path ../configs/evaluate/pii-extraction.yml

This will compute the precision/recall for PII extraction and accuracy for PII reconstruction/inference attacks.
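
Concretely, extraction is scored as set precision/recall between the extracted PII and the ground-truth PII in the training data; a small sketch with illustrative values:

# Sketch of the extraction metrics: set precision/recall of extracted PII
# against the ground-truth PII of the training data. Values are illustrative.
def precision_recall(extracted: set[str], true_pii: set[str]):
    hits = len(extracted & true_pii)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(true_pii) if true_pii else 0.0
    return precision, recall

p, r = precision_recall({"John Doe", "Vienna"}, {"John Doe", "Jane Roe"})
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50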

Datasets

The provided ECHR dataset wrapper tags all PII in the dataset automatically. PII tagging uses the Flair NER modules and can take several hours depending on your setup, but it is a one-time operation whose results are cached and reused in subsequent runs.

Fine-Tuned Models

Unfortunately, we do not provide fine-tuned model checkpoints. The repository does support loading models remotely: provide a URL instead of a local path in the model_ckpt attribute of the configuration files.

Citation

Please consider citing the following paper if you find our work useful.

@InProceedings{lukas2023analyzing,
  title      = {Analyzing Leakage of Personally Identifiable Information in Language Models},
  author     = {Lukas, Nils and Salem, Ahmed and Sim, Robert and Tople, Shruti and Wutschitz, Lukas and Zanella-B{\'e}guelin, Santiago},
  booktitle  = {2023 IEEE Symposium on Security and Privacy (SP)},
  year       = {2023},
  publisher  = {IEEE Computer Society},
  pages      = {346--363},
  doi        = {10.1109/SP46215.2023.00154}
}
