
data-centric-ai-community / nist-crc-2023


NIST Collaborative Research Cycle on Synthetic Data. Learn about Synthetic Data week by week!

Home Page: https://pages.nist.gov/privacy_collaborative_research_cycle/pages/participate.html

License: MIT License


nist-crc-2023's Introduction


NIST Privacy Collaborative Research

The Data-Centric AI Community just launched a small community project to experiment with the NIST Challenge!

  • Goal: To learn about Synthetic Data and how it can be used to prepare sensitive private data for public release!
  • Dates: From April to July. You can also join at any time, follow the weekly plan, and post questions on our Discord.
  • Where: 🤖-nist-challenge channel in our Discord Server
  • Touch Points: We meet every Friday around 4 PM GMT on the 🧠-code-with-me channel to discuss the project.

Overview

🎯 The overall goal of the project is to explore synthetic data as a way to prepare sensitive private data for public release.

📀 NIST has launched a benchmark of 3 datasets, MA (Massachusetts), TX (Texas), and NATIONAL, which you can use in the project.

📊 To evaluate the de-identified data against the target/real data, NIST has created the sdnist package, which can be installed according to the instructions below.

💻 To create the de-identified data, we'll use the ydata-synthetic package, explore different model settings, and study the effect these have on the final results.

🧭 Learning Outcomes

Week | What you will learn
1 | Goal and objectives of the project. You'll connect with other learners in the DCAI Discord Server and be added to the NIST Team to access the 🤖-nist-challenge channel and receive permissions to collaborate on the GitHub project.
2 | Basics of Synthetic Data. You will learn what synthetic data is, how it is generated, and what its main applications are.
3 | Basics of Data Profiling. You will learn what data profiling is, how to understand your data with descriptive statistics, and which data quality issues are common. You will also explore the NIST datasets with ydata-profiling and preprocess the data according to your findings.
4 & 5 | Generation of Synthetic Data. You will explore Deep Learning models (Generative Adversarial Networks, or GANs) to generate realistic synthetic data using ydata-synthetic.
6 & 7 | Basics of Evaluating Synthetic Data. You will explore some strategies to evaluate synthetic data and investigate possible improvements to your solution. We will use the sdnist package to evaluate our synthetic data.
8 | Project Showcase. You will learn how to best showcase and publicize your project in your data portfolio, CV, GitHub, or Medium Account.

🔨 Tasks

Week 1:

Week 2:

Week 3:

  • Learn about the basic aspects of Data Profiling:

  • Start profiling the NIST data:

    • Install ydata-profiling (check the Installation Instructions below) and don't forget to star it, thank you! ⭐️
    • Choose one of the NIST datasets (MA, TX, or NATIONAL):
      • The datasets are available here
      • Run a Profile Report on your data (check the Installation Instructions below)
      • Create an Excel file to record what you learn. Suggested columns: Feature Name | Data Type (Numeric/Categorical) | Missing Values (Y/N) | Notes/Observations. Base your observations on the profiling report and on the feature descriptions provided with the data.
  • Post questions and comments on the 🤖-nist-challenge channel.

  • Meet us on Friday (May 12) to discuss what you've learned (check the available slots on our 📅 Discord Calendar). Don't forget to bring your Excel file with the data description and your profiling report!
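The feature log suggested above can be pre-filled automatically before you add your own notes. The following is a minimal sketch using only the Python standard library; the helper names (`build_feature_log`, `write_feature_log`), the tiny in-memory sample, and the set of strings treated as missing are all assumptions for illustration and are not part of the NIST data or of ydata-profiling.

```python
import csv
import io

# Assumption: these strings mark a missing value. Adjust them to match
# the data dictionary of the dataset you chose.
MISSING_TOKENS = frozenset({"", "NA", "NaN"})


def is_number(text):
    """Return True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def build_feature_log(rows, missing_tokens=MISSING_TOKENS):
    """Summarize each column of an iterable of dict rows (e.g. csv.DictReader)."""
    values_by_column = {}
    for row in rows:
        for name, value in row.items():
            values_by_column.setdefault(name, []).append(value)

    log = []
    for name, values in values_by_column.items():
        # None appears when a row has fewer fields than the header.
        present = [v for v in values if v is not None and v not in missing_tokens]
        numeric = bool(present) and all(is_number(v) for v in present)
        log.append({
            "Feature Name": name,
            "Data Type": "Numeric" if numeric else "Categorical",
            "Missing Values": "Y" if len(present) < len(values) else "N",
            "Notes/Observations": "",
        })
    return log


def write_feature_log(log, handle):
    """Write the log as CSV, a format Excel opens directly."""
    writer = csv.DictWriter(handle, fieldnames=list(log[0].keys()))
    writer.writeheader()
    writer.writerows(log)


# Hypothetical three-column sample standing in for a real CSV file.
sample = io.StringIO("AGE,SEX,INCOME\n34,1,52000\n51,2,NA\n")
feature_log = build_feature_log(csv.DictReader(sample))
for entry in feature_log:
    print(entry)
```

Note that integer-coded categories (such as the hypothetical SEX column here) are reported as Numeric by this simple rule, so the Data Type column still needs a manual pass against the feature descriptions.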

Weeks 4 & 5:

⚙️ Installation Instructions

📦 How to create and use Virtual Environments?

Much troubleshooting comes down to mismatches between an environment and the packages a project requires. If you're new to data science development, you may be installing every package into your global Python environment, which causes headaches as soon as two projects need conflicting versions.

Virtual Environments solve this: they isolate your installations from the "global" environment, so you don't have to worry about conflicts. If you've never used virtual environments for your data science projects, you can start by installing Anaconda. If you need a little convincing that this is a nice tool to have on your belt, check this post comparing conda with pip, venv, and pyenv.

Once anaconda is installed, creating a new environment is as easy as running this on your shell:

conda create --name synth-env python=3.10

This creates a new environment called synth-env with Python version 3.10.X. You can then switch to this environment by activating it:

conda activate synth-env

In this new environment, you can still call pip to install python packages, such as ydata-synthetic:

pip install ydata-synthetic

Now you can open your Python editor or Jupyter Notebook and use synth-env as your development environment, without having to worry about conflicting versions or packages between projects! Once you're done, you can deactivate the environment (the command takes no environment name):

conda deactivate
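A frequent source of confusion is a notebook or editor that is still running the global interpreter instead of synth-env. The minimal sketch below, meant to be run from Python while the environment is active, reports which interpreter is in use. The function name `describe_environment` is made up for this example; `CONDA_DEFAULT_ENV` is the environment variable that conda sets on activation.

```python
import os
import sys


def describe_environment():
    """Collect basic facts about the interpreter that is currently running."""
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `conda_env` does not show synth-env, or the interpreter path points outside the environment's folder, the editor or notebook kernel is probably configured to use a different Python.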

Suggested Materials

📊 How to install ydata-profiling and create a Profiling Report?

You may start by creating your virtual environment and installing the package:

conda create -n synth-env python=3.10
conda activate synth-env
pip install ydata-profiling==4.1.2

Then, in your Jupyter Notebook or other editor (e.g., PyCharm), load your Pandas DataFrame as you normally would. Generating the profiling report takes only a few lines:

import pandas as pd
from ydata_profiling import ProfileReport

# Read the data from a csv file (NIST "MA" data in the example)
df = pd.read_csv("ma2019.csv")

# Generate the data profiling report 
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")

You can then navigate the report to investigate the data quality issues it flags and to study the basic descriptive statistics of your data!

Additional Materials

🤖 How to install ydata-synthetic and create a synthesizer?

You may use your previous virtual environment (synth-env). Activate it and install the package:

conda activate synth-env
pip install ydata-synthetic==1.1.0
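Because the install commands in this README pin exact versions, it can help to confirm that the pinned versions are the ones actually installed in the active environment. This is a small sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `check_pins` are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the pip commands in this README.
EXPECTED = {"ydata-profiling": "4.1.2", "ydata-synthetic": "1.1.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(expected):
    """Map each package name to (installed_version, matches_pin)."""
    result = {}
    for name, pin in expected.items():
        found = installed_version(name)
        result[name] = (found, found == pin)
    return result


report = check_pins(EXPECTED)
for name, (found, ok) in report.items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found} expected={EXPECTED[name]} [{status}]")
```

A MISMATCH line with `installed=None` usually means the package was installed into a different environment than the one currently running.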

Then, you can leverage one of the models available in the package. In this example, we will be using CTGAN:

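# Imports for this example (fetch_data is provided by the pmlb package)
from pmlb import fetch_data
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer

# Load data
data = fetch_data('adult')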
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

# Defining the training and model parameters
batch_size = 500
epochs = 500+1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

# Create and train the model
ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)

synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Generate new samples
synth_data = synth.sample(1000)

print(synth_data)

You can also check further examples with other models.
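Once samples have been generated, a quick sanity check is to compare how often each category appears in the real and in the synthetic data. The sketch below computes the total variation distance between two categorical columns using only the standard library. It is an illustrative check, not the metric used by sdnist, and the function names and the toy lists are assumptions made for this example.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares of an empty column")
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories
    )


# Toy columns standing in for one categorical feature of each dataset.
real_column = ["Male", "Male", "Female", "Female"]
synthetic_column = ["Male", "Male", "Male", "Female"]
distance = total_variation_distance(real_column, synthetic_column)
print(f"TVD: {distance:.2f}")  # prints "TVD: 0.25"
```

With the DataFrames from the example above, the same function could be called on a shared categorical column, for instance `total_variation_distance(data["sex"], synth_data["sex"])`. Values near 0 mean the synthetic column reproduces the real category mix closely.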

Contributors

agent007, miriamspsantos
