
privacypanda's Introduction


PrivacyPanda

PrivacyPanda is a package for detecting and removing personal, private data (such as names and addresses) from pandas dataframes.

Why privacypanda

The volume of available information - personal, private information - of each and every one of us is vast and growing. This information can be used to build such a clear picture of who you are that bad actors can know you better than your partner does. In the wrong hands, this data can influence the way you shop, the way you vote, the way you think...

A necessary step to protect ourselves is to anonymize data - to strip it of any identifying features like our names or addresses. While many of the people handling private data are trustworthy, honest and ethical, we can't always trust that they will successfully scrub a dataset of any information which may be used against us.

privacypanda aims to make data anonymization a little bit easier by providing tools to detect identifying features in pandas dataframes and expunge them.

How to install

privacypanda requires Python 3.7 or above and pandas >= 1.0.0.

privacypanda can be installed via pip with

pip install privacypanda

Alternatively, to install from source:

  1. clone the repository
  2. navigate to the project folder
  3. run pip install -e .

How to use

See the example notebooks for more extensive usage. Click this link to run the example notebooks online.

With privacypanda you can audit the privacy of your dataframe:

import pandas as pd
import privacypanda as pp

data = pd.DataFrame(
    {
        "privateData":
            [
                "[email protected]",
                "AB1 1AB",
                "Some other data",
            ],
        "nonPrivateData":
            [
                1,
                2,
                3,
            ],
    }
)

print(pp.report_privacy(data))

This prints the names of any columns in the data which break privacy, and the ways in which privacy is broken.

>>> "privateData": ["address", "email"]

Contributing

All contributions are important and welcomed. Please see the contributing guide for more information.

License

The PrivacyPanda project is licensed with Apache 2.0. Please refer to the license for more information.

privacypanda's People

Contributors

ttitcombe

privacypanda's Issues

Create privacy check decorator

There should be a function to check the privacy of a dataframe which can be used as a decorator.

E.g.

>>> @checkprivacy
... def func_that_returns_privacy_breach():
...     return df_containing_private_data
>>> func_that_returns_privacy_breach()
PrivacyError
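One way such a decorator could look is sketched below. This is only an illustration of the idea: the `PrivacyError` class, the body of `checkprivacy` and the toy email pattern are assumptions made for the example, not privacypanda's actual implementation.

```python
import functools
import re

import pandas as pd

# Assumption: a toy email pattern standing in for the package's real checks.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE)


class PrivacyError(Exception):
    """Hypothetical error raised when a returned dataframe looks private."""


def checkprivacy(func):
    """Raise PrivacyError if the wrapped function returns a dataframe with emails."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, pd.DataFrame):
            for column in result.columns:
                # Only string values are searched; other types are ignored.
                hits = result[column].map(
                    lambda v: isinstance(v, str) and EMAIL.search(v) is not None
                )
                if hits.any():
                    raise PrivacyError(f"Column {column!r} contains private data")
        return result

    return wrapper


@checkprivacy
def load_safe():
    return pd.DataFrame({"count": [1, 2, 3]})


@checkprivacy
def load_unsafe():
    return pd.DataFrame({"contact": ["alice@example.com"]})
```

Calling `load_safe()` returns the dataframe unchanged, while calling `load_unsafe()` raises `PrivacyError`.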

Add binder badge

Once example notebooks have been added (#13), a binder badge should be added to the readme to allow users to try PrivacyPanda without setting up an environment

Create simple report

A user should be able to query a dataframe and see a list of privacy breaching columns and their reason for breaching privacy.

Basic functionality is to print a list of column: reason pairs

Identify simple names

The issue

A key feature of privacypanda is to identify names in a dataframe. For the first implementation, we should aim to identify common western "Firstname [Middlename] LastName" names.

Proposed solution

There should exist a function which, when called with a dataframe, returns the names of columns which contain any names

Things to consider

Are there libraries containing common names?
Do not identify column/row pairs at the moment
Do not be overly concerned with efficiency at the moment

Remove privacy-breaching columns

The issue
The simplest way to handle edge cases is to remove a column which contains any breach of privacy.

Proposed solution
There should exist a function which identifies columns in a dataframe which contain privacy-breaching data and returns the dataframe with those columns removed
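A minimal sketch of the proposed behaviour, assuming a single toy postcode pattern in place of the package's real checks. The names `is_private` and `drop_private_columns` are hypothetical.

```python
import re

import pandas as pd

# Assumption: a simplified UK postcode pattern used only for illustration.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s\d[A-Z]{2}\b")


def is_private(value):
    """Return True if a cell is a string containing something postcode-like."""
    return isinstance(value, str) and POSTCODE.search(value) is not None


def drop_private_columns(df):
    """Return a copy of df without any column that has at least one breach."""
    breaching = [col for col in df.columns if df[col].map(is_private).any()]
    return df.drop(columns=breaching)


data = pd.DataFrame(
    {
        "privateData": ["AB1 1AB", "Some other data"],
        "nonPrivateData": [1, 2],
    }
)
cleaned = drop_private_columns(data)
```

Here `cleaned` keeps only the `nonPrivateData` column, because a single postcode is enough to remove `privateData`.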

Remove rows containing private data

The issue:
#7 introduced a function for removing columns which contain at least 1 privacy breach. While this reduced the scope for false negatives, it can remove a lot of useful data

Proposed solution:
anonymize should accept a boolean flag (default False) which, if True, would drop rows which contain a privacy breach
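The flag could behave roughly as follows. This is a sketch under assumptions: the parameter name `drop_rows`, the helper `is_private` and the toy postcode pattern are invented for the example, and the real `anonymize` signature may differ.

```python
import re

import pandas as pd

# Assumption: a simplified UK postcode pattern used only for illustration.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s\d[A-Z]{2}\b")


def is_private(value):
    return isinstance(value, str) and POSTCODE.search(value) is not None


def anonymize(df, drop_rows=False):
    """Drop breaching columns by default, or breaching rows if drop_rows is True."""
    # Boolean frame with the same shape as df: True where a cell breaches privacy.
    breaches = df.apply(lambda column: column.map(is_private)).astype(bool)
    if drop_rows:
        return df[~breaches.any(axis=1)]
    return df.loc[:, ~breaches.any(axis=0).to_numpy()]


data = pd.DataFrame(
    {
        "privateData": ["AB1 1AB", "Some other data", "More data"],
        "nonPrivateData": [1, 2, 3],
    }
)
rows_kept = anonymize(data, drop_rows=True)
columns_kept = anonymize(data)
```

With `drop_rows=True` only the first row is lost and both columns survive; with the default, the whole `privateData` column is removed.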

Allow users to specify search strictness

The issue:
Currently the rules for what is and isn't a breach of privacy are hardcoded. I.e. "10 Downing Street" is a breach but "london" is not. Users should be able to specify a severity level for considering something a breach

Proposed solution:
Check and anonymization functions to take an enum which specifies a preset level of severity.
E.g.
check_addresses(severity=STRICT_ADDRESSES) where STRICT_ADDRESSES considers towns or countries a breach of privacy.

Similar checks could be made for requiring a first name AND a last name, a streetname without a house number etc.
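As a rough illustration of the enum idea, the sketch below works on single strings. The `AddressSeverity` enum, the `is_address` function and the tiny place list are all assumptions made for the example.

```python
import enum
import re


class AddressSeverity(enum.Enum):
    """Hypothetical preset severity levels for address checks."""

    STANDARD = "standard"  # only "number + street name" counts as a breach
    STRICT = "strict"      # town and country names count as well


STREET = re.compile(
    r"\b\d+\s+(?:\w+\s+)+(?:street|road|way|avenue)\b", re.IGNORECASE
)
# Assumption: a toy list; a real check would need a proper gazetteer.
PLACES = {"london", "england"}


def is_address(text, severity=AddressSeverity.STANDARD):
    """Return True if text breaches privacy at the given severity level."""
    if STREET.search(text):
        return True
    if severity is AddressSeverity.STRICT:
        return text.strip().lower() in PLACES
    return False


standard_result = is_address("london")
strict_result = is_address("london", severity=AddressSeverity.STRICT)
```

At the standard level "london" passes, while the strict level flags it; "10 Downing Street" is flagged at both levels.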

Fuzzy search for private data

The feature
Small typos in names and places should be recognised as private data. We should introduce fuzzy searching of data. This should introduce a "privacy likelihood" with a greater resemblance to a privacy breaking keyword (e.g. "street", "@gmail.com") producing a greater score

Example
"10 Downing Stret" should be recognised as an address

Considered implementation
Use fuzzywuzzy
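The scoring idea can be sketched without extra dependencies. Note that this example swaps in `difflib.SequenceMatcher` from the Python standard library rather than fuzzywuzzy, and the keyword list and the `privacy_likelihood` name are assumptions.

```python
from difflib import SequenceMatcher

# Assumption: a small keyword list taken from the street types discussed here.
KEYWORDS = ("street", "road", "avenue", "way")


def privacy_likelihood(text, keywords=KEYWORDS):
    """Return the best similarity (0.0 to 1.0) between any word and any keyword."""
    words = text.lower().split()
    return max(
        (
            SequenceMatcher(None, word, keyword).ratio()
            for word in words
            for keyword in keywords
        ),
        default=0.0,
    )


typo_score = privacy_likelihood("10 Downing Stret")
plain_score = privacy_likelihood("Some other data")
```

The misspelt "Stret" scores close to 1.0 against "street", whereas unrelated text scores much lower, so a threshold on this score could decide what counts as a breach.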

Simplify check code

The code for the various checks is largely the same. The only things that change are the regex patterns which are searched and, when datetimes are implemented, the dataframe column datatypes which are searched for breaches.
It would be cleaner to create a generated "check" function, and supply it with the various regex patterns
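One possible shape for such a function factory is sketched below. The name `make_check` and both patterns are assumptions for illustration and do not reflect the package's actual code.

```python
import re

import pandas as pd


def make_check(pattern):
    """Build a check function that returns the columns matching pattern."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def check(df):
        breaches = []
        for column in df.columns:
            # Missing values are skipped; everything else is searched as text.
            values = df[column].dropna().astype(str)
            if values.map(lambda v: compiled.search(v) is not None).any():
                breaches.append(column)
        return breaches

    return check


# Each concrete check differs only in the pattern it is given.
check_emails = make_check(r"[^@\s]+@[^@\s]+\.[a-z]{2,}")
check_postcodes = make_check(r"\b[a-z]{1,2}\d[a-z\d]?\s\d[a-z]{2}\b")

data = pd.DataFrame(
    {
        "contact": ["bob@example.com", "none given"],
        "location": ["AB1 1AB", "unknown"],
        "count": [1, 2],
    }
)
```

Here `check_emails(data)` returns `["contact"]` and `check_postcodes(data)` returns `["location"]`.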

Convert names into unique ids

There should be a function which takes a given column and converts the entries to unique ids. The function should inspect the content of the column to check that the conversion is valid. I.e. if it consists of names then go ahead, but if it consists of sex then the function may suggest onehot encoding instead

How does this enhance privacy?
Turning e.g. Alice -> 0; Bob -> 1 removes information which directly links data entries to a person, but still identifies entries, so multiple entries by the same person can be linked
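The mapping itself can be sketched with `pandas.factorize`, which assigns integer codes in order of first appearance. The function name `to_unique_ids` is hypothetical, and the validity check on the column content is left out of this sketch.

```python
import pandas as pd


def to_unique_ids(series):
    """Replace each distinct entry with an integer id, in order of appearance."""
    codes, _uniques = pd.factorize(series)
    return pd.Series(codes, index=series.index, name=series.name)


names = pd.Series(["Alice", "Bob", "Alice"], name="name")
ids = to_unique_ids(names)
```

Both "Alice" entries receive the same id, so repeated entries by one person stay linked. Missing values are coded as -1 by `pandas.factorize`.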

Identify simple UK phone numbers

Phone numbers should be considered a breach of privacy

Phone numbers differ from region to region. To begin with, we should consider UK mobile numbers, of the form 07 + nine digits, with +44 possibly replacing the leading 0
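The format described above translates fairly directly into a regular expression. The sketch below assumes the number is written without spaces or punctuation; the name `is_uk_mobile` is hypothetical.

```python
import re

# "07" followed by nine digits, with "+44" optionally replacing the leading 0.
UK_MOBILE = re.compile(r"(?:\+44|0)7\d{9}")


def is_uk_mobile(text):
    """Return True if the whole string is a simple UK mobile number."""
    return UK_MOBILE.fullmatch(text.strip()) is not None


national = is_uk_mobile("07123456789")
international = is_uk_mobile("+447123456789")
too_short = is_uk_mobile("0712345678")
```

Both the national and the international form match, while the number with only eight digits after "07" does not.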

Identify simple addresses

The issue

We must be able to identify columns containing simple addresses. Addresses can get very complicated and there are many edge cases that will need considering at some point; however, for this first implementation we should only consider "Housenumber streetname" and "post/zipcode" addresses.

British postcodes are of the format LetterLetterDigit[Character] DigitLetterLetter e.g. SW1A 2AA. Common street suffixes are street, road, way, avenue... any others?

Proposed solution

There should exist a function which, when called with a dataframe, returns the names of columns containing simple addresses

Things to consider

  • Do not consider address edge cases at the moment.
  • Do not consider town names or greater at the moment. I.e. "London" should not be considered a breach of privacy.
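The two address forms described in this issue could be matched along these lines. Both patterns are simplified assumptions built from the formats quoted above, and `is_simple_address` is a hypothetical name.

```python
import re

# LetterLetterDigit[Character] DigitLetterLetter, e.g. "SW1A 2AA".
POSTCODE = re.compile(r"\b[A-Z]{2}\d[A-Z\d]?\s\d[A-Z]{2}\b")
# House number, one or more name words, then a common street suffix.
STREET = re.compile(
    r"\b\d+\s+(?:\w+\s+)+(?:street|road|way|avenue)\b", re.IGNORECASE
)


def is_simple_address(text):
    """Return True if text contains a simple street address or a UK postcode."""
    return bool(POSTCODE.search(text) or STREET.search(text))


examples = ["10 Downing Street", "SW1A 2AA", "London"]
flags = [is_simple_address(example) for example in examples]
```

The first two examples are flagged and "London" is not, in line with the decision to ignore town names for now.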

Create example notebooks

Jupyter notebooks should be created to demonstrate the basic functionality of PrivacyPanda. These notebooks should be placed in the examples folder

Support addresses without building numbers

We currently support building number + street name addresses, however the street name alone is often enough to breach privacy. This is especially true in rural areas where streets can have only a handful of buildings.

We should identify street names alone as an address. E.g. "Downing Street" as well as "10 Downing Street". The numberless address should be a different severity level, according to #8

Identify street name abbreviations

#6 introduced the capability to detect full street names, with street types "street", "road", "avenue", "way".

Addresses can be abbreviated e.g. 10 Downing St. instead of 10 Downing Street. Abbreviated street types should be included in an address search
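Extending a street pattern with abbreviations might look like the sketch below. The abbreviations "St", "Rd" and "Ave" are assumptions chosen for the example, as is the function name.

```python
import re

# Full street types plus assumed abbreviations, with an optional trailing dot.
STREET = re.compile(
    r"\b\d+\s+(?:\w+\s+)+(?:street|st|road|rd|avenue|ave|way)\b\.?",
    re.IGNORECASE,
)


def is_street_address(text):
    """Return True if text contains a numbered street, abbreviated or not."""
    return STREET.search(text) is not None


abbreviated = is_street_address("10 Downing St.")
full = is_street_address("10 Downing Street")
lookalike = is_street_address("10 Downing Stadium")
```

The word boundary after the street type keeps words that merely start with an abbreviation, such as "Stadium", from being flagged.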

Clean private data

The issue:
#4 and #9 remove lots of data if private data is detected. Another solution would be to replace private data with a non-private placeholder

Proposed solution:
anonymize to take an optional boolean argument (default False) which, if True, would replace data with a preset placeholder. The current idea is to replace private data with a placeholder specific to the data type, i.e. "AB1 1AB" -> "POSTCODE"; "Joe Bloggs" -> "Firstname Lastname". Although, this may still be a breach of privacy and we should consider a generic placeholder

This boolean flag would take precedence over #9.
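Placeholder replacement could be sketched as follows, assuming a single toy postcode pattern. The `PLACEHOLDERS` mapping and the function names are hypothetical, and the type-specific placeholder is only one of the options discussed above.

```python
import re

import pandas as pd

# Assumption: placeholder text mapped to a toy pattern for that data type.
PLACEHOLDERS = {
    "POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s\d[A-Z]{2}\b"),
}


def replace_private(value):
    """Swap any private substring of a string cell for its placeholder."""
    if not isinstance(value, str):
        return value
    for placeholder, pattern in PLACEHOLDERS.items():
        value = pattern.sub(placeholder, value)
    return value


def clean(df):
    """Return a copy of df with private data replaced cell by cell."""
    return df.apply(lambda column: column.map(replace_private))


data = pd.DataFrame(
    {
        "privateData": ["AB1 1AB", "Some other data"],
        "nonPrivateData": [1, 2],
    }
)
cleaned = clean(data)
```

No rows or columns are lost: "AB1 1AB" becomes "POSTCODE" and every other cell is left as it was.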

Identify emails

Email addresses should be considered a breach of privacy.

While we could naively assume .*@.* is an email, this would lead to a lot of false positives. To begin with, we should identify emails using a whitelist of domains e.g. gmail, hotmail and .co.uk, .com, .org, .edu etc.
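A whitelist-based check might be sketched like this. The pattern restricts matches to the domain endings listed above; the exact pattern and the name `contains_email` are assumptions for illustration.

```python
import re

# Local part, "@", domain labels, then one of the whitelisted endings.
EMAIL = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.(?:co\.uk|com|org|edu)\b",
    re.IGNORECASE,
)


def contains_email(text):
    """Return True if text contains an email with a whitelisted ending."""
    return EMAIL.search(text) is not None


gmail = contains_email("alice@gmail.com")
co_uk = contains_email("bob@example.co.uk")
mention = contains_email("follow @someone for updates")
```

The two real-looking addresses match, while a bare "@" mention does not, which is the case the looser .*@.* rule would get wrong.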

Onehot private data

The issue
Rather than removing private categorical variables such as sex, it would be useful to have a function which converts them into onehot vectors with nondescript names e.g. "Categorical var 1" and "Categorical var 2". This enhances privacy slightly by making it more difficult to identify personal parameters for somebody who is not intimately familiar with the data or how it was anonymised.

There could be an option to randomly flip the onehot encoding order i.e. sometimes "male" is [1, 0], sometimes it's [0, 1]
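A minimal sketch of this idea using `pandas.get_dummies`. The function name `anonymous_onehot` and the `seed` parameter, which stands in for the optional random flip, are assumptions made for the example.

```python
import random

import pandas as pd


def anonymous_onehot(series, prefix="Categorical var", seed=None):
    """Onehot encode a column and hide the category names behind generic ones."""
    dummies = pd.get_dummies(series).astype(int)
    order = list(dummies.columns)
    if seed is not None:
        # Optionally shuffle which category ends up in which output column.
        random.Random(seed).shuffle(order)
    dummies = dummies[order]
    dummies.columns = [f"{prefix} {i}" for i in range(1, len(order) + 1)]
    return dummies


sex = pd.Series(["male", "female", "male"])
encoded = anonymous_onehot(sex)
```

The result has columns "Categorical var 1" and "Categorical var 2", with exactly one 1 per row. Passing a `seed` may change which category each column represents.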
