
Fair Selective Classification via Sufficiency

This is the repository for our entry in the ML Reproducibility Challenge 2021 Fall Edition, where we reproduce the paper "Fair Selective Classification via Sufficiency".

Downloading Datasets

The evaluation is done on four different datasets: Adult, CelebA, Civil Comments and CheXpert. This section describes how to obtain each of them. If you just want to test the model in general, the Adult dataset is the easiest to obtain and train on. The final data directory should look something like this:

data
    / adult
        / adult.data
        / adult.train
    / celeba
        / img_align_celeba
        / list_attr_celeba.txt
        / list_eval_partition.txt
    / chexpert
        / CheXpert-v1.0small
            / train
            / valid
            / train.csv
            / valid.csv
    / civil
        / all_data.csv
        / identity_individual_annotations.csv
        / sample_submission.csv
        / test.csv
        / test_private_expanded.csv
        / test_public_expanded.csv
        / toxicity_individual_annotations.csv
        / train.csv

Adult

The Adult dataset can be downloaded for free, without any registration, using the Adult dataset link. The dataset we used was preprocessed using this repo and is available in the data directory.

CelebA

The CelebA dataset is available for download through Google Drive via their website.

Civil Comments

The Civil Comments dataset is available on Kaggle. To download it you will need to register on Kaggle, verify your account with your telephone number and accept the competition rules. You can then download an API key from the profile section; this file is called "kaggle.json" and should be placed in the ".kaggle" folder.
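If you prefer to script the download, the official kaggle Python package can fetch the competition files once kaggle.json is in place. This is a minimal sketch; the competition slug below is our assumption for the Civil Comments competition on Kaggle.

    # Sketch: download the Civil Comments data with the kaggle package.
    # Assumes kaggle.json sits in ~/.kaggle and that the data comes from the
    # "Jigsaw Unintended Bias in Toxicity Classification" competition (assumed slug).
    import kaggle  # authenticates from ~/.kaggle/kaggle.json on import

    kaggle.api.competition_download_files(
        "jigsaw-unintended-bias-in-toxicity-classification",
        path="data/civil")  # downloads a zip archive; unzip it into data/civil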

CheXpert

The CheXpert dataset is available after registering. After receiving the download link, download the small dataset to data/chexpert and unzip it there.

Dataset preprocessing

The preprocessing for each of the datasets is done in data.py, in the individual __init__ functions. This section goes over the exact preprocessing steps per dataset. Since we aim to replicate the work of [CITE], this part is not designed to be changed, except for the attribute on which the sufficiency criterion is applied.

Adult

For the Adult dataset we normalize the columns with continuous values to have zero mean and unit variance, and one-hot encode the columns with categorical values. The test data is missing one category for the "native-country" column, namely "Holand-Netherlands", which is added as an all-zero column to stay compatible with the model trained on the training split. For some continuous variables (i.e. 'fnlwgt', 'capital-gain' and 'capital-loss') a few outliers heavily skew the mean and variance, which should be investigated further.
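The sketch below illustrates these steps on the raw UCI file; the actual code lives in the Adult __init__ in data.py, so treat the column names and file path as assumptions.

    import pandas as pd

    # Standard UCI Adult schema (assumed; the preprocessed files may differ).
    names = ["age", "workclass", "fnlwgt", "education", "education-num",
             "marital-status", "occupation", "relationship", "race", "sex",
             "capital-gain", "capital-loss", "hours-per-week", "native-country",
             "income"]
    continuous = ["age", "fnlwgt", "education-num",
                  "capital-gain", "capital-loss", "hours-per-week"]

    train = pd.read_csv("data/adult/adult.data", names=names, skipinitialspace=True)

    # Zero mean and unit variance for the continuous columns.
    train[continuous] = (train[continuous] - train[continuous].mean()) / train[continuous].std()

    # One-hot encode the categorical columns (the label is kept separate).
    categorical = [c for c in names if c not in continuous + ["income"]]
    train = pd.get_dummies(train, columns=categorical)

    # For the test split, add any one-hot column missing relative to training
    # (e.g. "Holand-Netherlands") as all zeros:
    # test = test.reindex(columns=train.columns, fill_value=0)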

CheXpert

The images of the CheXpert dataset are cropped to 224 by 224 pixels (information at the bottom and right side is thrown out), and this greyscale image is stacked three times to simulate R, G and B channels (since DenseNet121 expects three input channels). Pleural Effusion is the target attribute for this dataset and has three possible values that loosely correspond to 'positive', 'negative' and 'undecided'. Every value that was not one, i.e. not 'positive', was mapped to zero to make this a binary attribute.
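A minimal sketch of this transform, assuming channels-first arrays and that the crop keeps the top-left corner; the actual transforms live in data.py.

    import numpy as np
    from PIL import Image

    def load_chexpert_image(path: str) -> np.ndarray:
        img = Image.open(path).convert("L")       # load as greyscale
        arr = np.asarray(img)[:224, :224]         # crop: drop bottom rows and right columns
        return np.stack([arr, arr, arr], axis=0)  # stack to fake R, G and B channels

    def binarize_pleural_effusion(label: float) -> int:
        # 1 ('positive') stays 1; 'negative' and 'undecided' map to 0.
        return 1 if label == 1 else 0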

CelebA

The images of the CelebA dataset are resized to 224 by 224 pixels by scaling. Both the attributes and the dataset partition are read from text files. The attribute text file contains a non-uniform amount of whitespace between columns, which is important to keep in mind when parsing it with pandas (see the sketch below).
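A sketch of the parsing, assuming the standard CelebA file layout (an image-count line, then a header line, then one row per image):

    import pandas as pd
    from PIL import Image

    # sep=r"\s+" handles the non-uniform whitespace; skiprows=1 skips the
    # image-count line so the attribute names become the header. Because the
    # data rows have one more field (the filename) than the header, pandas
    # uses the filename as the index.
    attrs = pd.read_csv("data/celeba/list_attr_celeba.txt", sep=r"\s+", skiprows=1)

    # Resize an image to the 224x224 input size.
    img = Image.open("data/celeba/img_align_celeba/000001.jpg").resize((224, 224))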

Civil Comments

The Civil Comments dataset contains many rows where the attribute "Christian" is not defined. Because this is the sensitive attribute we use, all rows where it is missing were removed. The toxicity value is a continuous value between 0 and 1; therefore a threshold was set at 0.5, with all values below it treated as non-toxic and all values above it as toxic.
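A minimal sketch of the filtering and thresholding, assuming the Kaggle train.csv column names ("christian" for the sensitive attribute, "target" for the toxicity score):

    import pandas as pd

    df = pd.read_csv("data/civil/train.csv")

    # Drop rows where the sensitive attribute is undefined.
    df = df.dropna(subset=["christian"])

    # Binarize toxicity at the 0.5 threshold.
    df["toxic"] = (df["target"] >= 0.5).astype(int)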

Running the code

train_model.py can be used to train a model with the specified parameters, e.g. python train_model.py --dataset adult --lmbda 0.7 --optimizer adam --seed 42 --progress_bar. Using the default settings will train a regularized model on the specified dataset for 20 epochs (this should be adjusted for the specific dataset). Please see our paper for more details on hyperparameter tuning.

A trained model can be copied to the models directory, and then evaluated using the evaluate function in results.ipynb.
