NIST Privacy Collaborative Research

The Data-Centric AI Community just launched a small community project to experiment with the NIST Challenge!

Goal: To learn about Synthetic Data and how it can be used to prepare sensitive private data for public release!
Dates: From April to July. You can also join at any time, follow the weekly plan, and post questions on our Discord.
Where: 🤖-nist-challenge channel in our Discord Server
Touch Points: We meet every Friday around 4 PM GMT on the 🧠-code-with-me channel to discuss the project.

Overview

🎯 The overall goal of the project is to explore synthetic data to prepare sensitive private data for public release.

📀 NIST has launched a benchmark of 3 datasets, MA (Massachusetts), TX (Texas), and NATIONAL, which you can use in the project.

📊 To provide an evaluation of the de-identified data against the target/real data, NIST has created the sdnist package that can be installed according to the instructions below.

💻 To create the de-identified data, we'll use the ydata-synthetic package, explore different model settings, and study their effect on the final results.

🧭 Learning Outcomes

Week	What you will learn
1	Goal and objectives of the project. You'll connect with other learners in the DCAI Discord Server and be added to the NIST Team to access the 🤖-nist-challenge channel and receive permissions to collaborate on the GitHub project.
2	Basics of Synthetic Data. You will learn what synthetic data is, how it is generated, and what its main applications are.
3	Basics of Data Profiling. You will learn what data profiling is, how to understand your data with descriptive statistics, and what the common data quality issues are. You will also explore the NIST datasets with `ydata-profiling` and preprocess the data according to your findings.
4 & 5	Generation of Synthetic Data. You will explore Deep Learning models (Generative Adversarial Networks -- GAN) to generate realistic synthetic data using `ydata-synthetic`.
6 & 7	Basics of Evaluating Synthetic Data. You will explore some strategies to evaluate synthetic data and investigate possible improvements to your solution. We will explore the `sdnist` package to evaluate our synthetic data.
8	Project Showcase. You will learn how to best showcase and publicize your project in your data portfolio, CV, GitHub, or Medium Account.

🔨 Tasks

Week 1:

Read the instructions and information about the challenge
Learn about the benchmark data released -- The NIST Diverse Communities Data Excerpts
Post questions and ideas on the 🤖-nist-challenge channel

Week 2:

Learn about the basic aspects of Synthetic Data:
Post questions and comments on the 🤖-nist-challenge channel
Meet us on Friday (May 5) to discuss what you've learned (check the available slots on our 📅 Discord Calendar)

Week 3:

Learn about the basic aspects of Data Profiling:
- 📺 Auditing Data Quality with ydata-profiling: learn what data profiling is, which data quality issues are common in real-world domains (can you spot a few in the NIST datasets?), and how ydata-profiling can help you diagnose and overcome them
- 📖 Awesome Data Science Tools to Master in 2023: Data Profiling Edition: learn more about data profiling and existing open source tools to understand your data to the fullest!
- 📖 Auditing Data Quality with YData Profiling: an overview of ydata-profiling functionalities and how-to's
Start profiling the NIST data:
- Install ydata-profiling (check the Installation Instructions below) and don't forget to star it, thank you! ⭐️
- Choose one of the NIST datasets (MA, TX, or NATIONAL):
  - The datasets are available here
  - Run a Profile Report on your data (check the Installation Instructions below)
  - Create an Excel file to record what you learn. Suggested columns: Feature Name | Data Type (Numeric/Categorical) | Missing Values (Y/N) | Notes/Observations. Base your observations on the profiling report and on the feature descriptions provided with the data
Post questions and comments on the 🤖-nist-challenge channel.
Meet us on Friday (May 12) to discuss what you've learned (check the available slots on our 📅 Discord Calendar). Don't forget to bring your Excel file with the data description and your profiling report!
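A first draft of the data description file can be generated from the DataFrame itself and then completed by hand. The snippet below is a minimal sketch that assumes pandas is installed; the function name `build_data_dictionary` and the toy columns are made up for illustration. Note that it infers the type from the column dtype only, so categorical features stored as integer codes will show up as Numeric and need to be corrected manually.

```python
import pandas as pd


def build_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: name, rough data type, and whether values are missing."""
    rows = []
    for column in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[column])
        rows.append({
            "Feature Name": column,
            "Data Type": "Numeric" if is_numeric else "Categorical",
            "Missing Values": "Y" if df[column].isna().any() else "N",
            "Notes/Observations": "",  # fill in by hand from the profiling report
        })
    return pd.DataFrame(rows)


# Toy data standing in for a NIST excerpt (column names and values are made up)
toy = pd.DataFrame({
    "AGE": [34, 51, None, 27],
    "SEX": ["F", "M", "F", "M"],
})
dictionary = build_data_dictionary(toy)
print(dictionary)
```

To open the result in Excel, you could save it with `dictionary.to_csv("data_dictionary.csv", index=False)`.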

Weeks 4 & 5:

Investigate ydata-synthetic and some of the models used to Generate Synthetic Data:
Start experimenting with ydata-synthetic (check the Installation Instructions below and don't forget to star it, thank you! ⭐️). If you prefer a UI experience, you can also leverage the Streamlit App in version 1.0.0:
Compare your synthetic data with the real data using the .compare() functionality of ydata-profiling:
- 📖 How to compare 2 datasets with ydata-profiling. What are the obtained results? Are there any aspects that you can improve?
Post questions and comments on the 🤖-nist-challenge channel! You can upload your profiling reports to the channel so that we can discuss changes and improvements.
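Alongside the visual comparison, a single number per column can make it easier to track whether a change in model settings helped. The sketch below computes the total variation distance between the category frequencies of a real and a synthetic column. It is an illustrative metric chosen for this example, not the score reported by sdnist or ydata-profiling, and the function name and sample values are made up.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values) -> float:
    """Return 0.0 when category frequencies match and 1.0 when no category is shared."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = sum(real_counts.values())
    n_synth = sum(synth_counts.values())
    if n_real == 0 or n_synth == 0:
        raise ValueError("both columns need at least one value")
    categories = set(real_counts) | set(synth_counts)
    # Half the sum of absolute differences between relative frequencies
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Made-up values for a single categorical column
real = ["A", "A", "B", "C"]
synthetic = ["A", "B", "B", "B"]
print(total_variation_distance(real, synthetic))
```

The function accepts any iterable of hashable values, so a column without missing values from a real and a synthetic DataFrame can be passed in directly.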

⚙️ Installation Instructions

📦 How to create and use Virtual Environments?

A lot of troubleshooting comes from mismatches between environments and package requirements. If you're new to data science development, you may simply be installing packages into your global Python environment. This can cause a lot of headaches when the requirements of different projects conflict.

Virtual Environments are ideal to overcome this issue: they isolate your installations from the "global" environment, so that you don't have to worry about conflicts. If you've never used virtual environments for your data science projects, you can start by installing Anaconda. If you need a little convincing that this is a nice tool to have in your toolbelt, then check this post comparing conda with pip, venv, and pyenv.

Once Anaconda is installed, creating a new environment is as easy as running this in your shell:

conda create --name synth-env python=3.10

This creates a new environment called synth-env with Python version 3.10.X. You can then switch to this environment by activating it:

conda activate synth-env

In this new environment, you can still call pip to install Python packages, such as ydata-synthetic:

pip install ydata-synthetic

Now you can open up your Python editor or Jupyter Notebook and use synth-env as your development environment, without having to worry about conflicting versions or packages between projects! Once you're done, you can deactivate the environment (the command takes no environment name):

conda deactivate
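To confirm that your editor or notebook is really running inside the environment you created, a quick check from Python can help. The sketch below uses only the standard library; the function name `describe_environment` is made up for illustration, and `CONDA_DEFAULT_ENV` is the variable conda sets on activation, so it will be `None` outside conda.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Folder of the active environment (for conda, it usually ends in the env name)
        "prefix": sys.prefix,
        # Interpreter version as "major.minor"
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Set by `conda activate`; None when no conda environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the reported environment or Python version is not the one you expect, the editor or notebook kernel is probably still pointing at a different interpreter.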

Suggested Materials

📖 Environments, Conda, Pip, aaaaah!: How to manage Python Environments without a headache
📺 How to "pip install ydata-synthetic" without errors!: How to install anaconda, create a virtual environment using conda, install packages with pip, and use the virtual environments in PyCharm or Jupyter Notebooks

📊 How to install ydata-profiling and create a Profiling Report?

You may start by creating your virtual environment and installing the package:

conda create -n synth-env python=3.10
conda activate synth-env
pip install ydata-profiling==4.1.2

Then, in your Jupyter Notebook or other editor (e.g., PyCharm), load your data into a pandas DataFrame as you normally would. Generating the profiling report then takes only a few lines:

import pandas as pd
from ydata_profiling import ProfileReport

# Read the data from a csv file (NIST "MA" data in the example)
df = pd.read_csv("ma2019.csv")

# Generate the data profiling report 
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")

You can then navigate the report to investigate the data quality issues it flags, and study the basic descriptive statistics of your data!

Additional Materials

📚 Examples with real-world datasets: A list of examples and data profiling reports and usage of ydata-profiling
🙇🏽‍♂️ Read the Docs: Documentation: from installation and quickstart to integrations and advanced usage

🤖 How to install ydata-synthetic and create a synthesizer?

You may use your previous virtual environment (synth-env). Activate it and install the package:

conda activate synth-env
pip install ydata-synthetic==1.1.0

Then, you can leverage one of the models available in the package. In this example, we will be using CTGAN:

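# Imports (fetch_data comes from the pmlb package, which provides the example dataset)
from pmlb import fetch_data
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load data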
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

# Defining the training and model parameters
batch_size = 500
epochs = 500+1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

# Create and train the model
ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)

synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Generate new samples
synth_data = synth.sample(1000)

print(synth_data)

You can also check further examples with other models.
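When you adapt the example to your own data, you will need to decide which columns go into `num_cols` and which into `cat_cols`. The sketch below is one possible heuristic, assuming pandas is installed: it treats non-numeric columns and numeric columns with few distinct values as categorical. The function name `split_columns`, the threshold of 20, and the toy columns are all assumptions for illustration, so review the result against the feature descriptions before training.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, max_categories: int = 20):
    """Heuristic split: non-numeric or low-cardinality columns count as categorical."""
    num_cols, cat_cols = [], []
    for column in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[column])
        if is_numeric and df[column].nunique() > max_categories:
            num_cols.append(column)
        else:
            cat_cols.append(column)
    return num_cols, cat_cols


# Toy frame with made-up columns; replace it with your own DataFrame
toy = pd.DataFrame({
    "income": range(100),               # many distinct numeric values
    "region_code": [1, 2, 3, 4] * 25,   # integer codes standing for categories
    "sex": ["F", "M"] * 50,
})
num_cols, cat_cols = split_columns(toy)
print(num_cols, cat_cols)
```

The two lists can then be passed to `synth.fit(...)` in place of the hand-written ones used for the adult dataset.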
