feedzai / fairgbm

Train Gradient Boosting models that are both high-performance *and* Fair!

Home Page: https://arxiv.org/abs/2209.07850
License: Other
`label_negative_weight` appears in the `GBM::Train` function and in `ConstrainedRecallObjective::GetGradients`, which deals with label-negative samples and, in theory, should not be needed when optimizing for recall.

Release FairGBM's python package in PyPI under the `fairgbm` name.
This is done with a GitHub workflow via an organization-wide secret API key (`PYPI_API_TOKEN`).
See the following examples:
Blocked by #38
We need to rethink and implement a versioning system for FairGBM.
Currently we're still using the version from MSFT LightGBM as of the day the code-bases diverged (`3.2.1.99`).
- Start FairGBM's own versioning (e.g., `1.0.0`);
- Mark each version (kept in the `VERSION.txt` file) with a git tag and a GitHub release;
Currently we need to use larger multipliers for larger datasets, and smaller multipliers for smaller datasets.
We could potentially just multiply the gradient flowing to each sample from the constraint loss by the number of data points (or simply not divide this loss by the number of data points).
This isn't exactly theoretically sound AFAIK, as the true gradient from FPR or FNR constraints does depend on the number of data points... We could just implement it and see if performance is affected.
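To make the scaling issue concrete, here is a toy illustration (my own, not FairGBM's actual code) of why per-sample constraint gradients shrink with dataset size: a group-wise rate such as FPR is a mean over the group, so its gradient with respect to any single score carries a 1/N factor, which is exactly the factor the proposal above would drop. The sigmoid-based soft rate below is only a stand-in for the real proxy loss.

```python
import numpy as np

def soft_fpr_per_sample_grad(scores):
    """d(soft FPR)/d(score_i) over label-negative samples.

    Soft FPR here is mean(sigmoid(score)) over the group, so each
    sample's gradient is sigmoid'(score_i) / N -- the 1/N factor the
    issue proposes dropping (or compensating with larger multipliers).
    """
    s = np.asarray(scores, dtype=float)
    sig = 1.0 / (1.0 + np.exp(-s))
    return sig * (1.0 - sig) / len(s)

small = soft_fpr_per_sample_grad(np.zeros(100))      # each grad: 0.25 / 100
large = soft_fpr_per_sample_grad(np.zeros(10_000))   # each grad: 0.25 / 10000
# the per-sample constraint gradient is 100x weaker on the larger dataset
```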
Using a FairGBM model as a randomized classifier is described in detail in the FairGBM paper.
However, this library only allows the use of the last FairGBM iterate --- this should achieve similar performance with faster predictions, but it would be interesting to still be able to use the randomized classifier predictions for comparison and future research.
These randomized classifier predictions are generated by matching each input row with a boosting iterate selected uniformly at random (with replacement), and using that iterate to generate the row's predictions.
This could even be done only on the Python package part, by adding a new method to the `FairGBMClassifier` class, named `predict_randomized` / `predict_proba_randomized`, or by adding a new flag `randomized=True` to the existing `predict` and `predict_proba` methods.
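A rough sketch of what such a `predict_proba_randomized` could look like, assuming a fitted sklearn-style model whose `predict_proba` accepts a `num_iteration` argument (lightgbm's sklearn API does); the helper name and the batching-by-iterate trick are illustrative, not existing FairGBM API:

```python
import numpy as np

def predict_proba_randomized(model, X, n_iterations, seed=None):
    """Randomized-classifier prediction: each row is scored by a boosting
    iterate drawn uniformly at random (with replacement)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    out = np.empty(len(X), dtype=float)
    # Draw one iterate per row, then batch rows that share an iterate so we
    # only call predict_proba once per distinct iterate.
    sampled = rng.integers(1, n_iterations + 1, size=len(X))
    for k in np.unique(sampled):
        mask = sampled == k
        out[mask] = model.predict_proba(X[mask], num_iteration=int(k))[:, 1]
    return out
```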
When importing fairgbm on a Mac I immediately get the following error message:
OSError: dlopen(<env-dir>/python3.9/site-packages/fairgbm/lib_lightgbm.so, 0x0006): tried: '<env-dir>/python3.9/site-packages/fairgbm/lib_lightgbm.so' (not a mach-o file)
I have tried this with Python 3.8, 3.9, and 3.10 (I haven't tried it with earlier Python versions due to incompatibility with arm64 CPUs, but I have no reason to believe the bug would be exclusive to this architecture).
The error is triggered simply by running `import fairgbm`, with `fairgbm==0.9.13` installed.
It also does not work in a Linux python3.9 environment via Docker containers (on macOS).
With this setup, the following (different but related to the same file) error is shown:
OSError: /usr/local/lib/python3.9/site-packages/fairgbm/lib_lightgbm.so: cannot open shared object file: No such file or directory
LightGBM allegedly handles missing values in the features (e.g., represented as NaN
).
We should also be able to handle missing values in the constraint group column (sensitive attribute).
For instance, we could do imputation with the majority group for all rows with unknown sensitive attribute.
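A minimal sketch of that imputation strategy (hypothetical helper name; in practice this would plug into the Python package's handling of the constraint group column):

```python
from collections import Counter

def impute_constraint_group(groups):
    """Replace missing sensitive-attribute values (None) with the majority group."""
    observed = [g for g in groups if g is not None]
    majority, _ = Counter(observed).most_common(1)[0]
    return [majority if g is None else g for g in groups]

impute_constraint_group(["A", "A", "B", None, "B", "A"])
# -> ["A", "A", "B", "A", "B", "A"]
```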
Update the readme file
The project README currently describes the project as LightGBM. This is understandably a consequence of the fork, but it should be updated.
I just found the file helpers/parameter_generator.py, which states at the beginning:
This script generates LightGBM/src/io/config_auto.cpp file
with list of all parameters, aliases table and other routines
along with parameters description in LightGBM/docs/Parameters.rst file
from the information in LightGBM/include/LightGBM/config.h file.
meaning that all the changes currently made directly to config_auto.cpp in fairgbm need to be moved out of there and into the original LightGBM/include/LightGBM/config.h file.
EDIT: unlike #9, this issue is focused on splitting the ObjectiveFunction and ConstrainedObjectiveFunction logic, so as to isolate all of our FairGBM-induced changes to the codebase.
Issue #9 is more focused on optimization while this one is more focused on code maintainability.
The file `include/LightGBM/objective_function.h` must be heavily refactored.
First, the file should be split, so as to have a `constrained_objective_function.h` + `.cpp` pair, plus a new `objective_functions.h` file that includes both the original `objective_function.h` and the new one. After that is done, the whole codebase should include `objective_functions.h` instead of the original header.
After that refactoring, there are plenty of optimization opportunities in the `ConstrainedObjectiveFunction` class, which will require extensive work: from making runtime code indirection cheaper, to re-implementing the code with cache-friendlier data structures, so as to reduce the +80% train time when compared to vanilla LightGBM training.
Currently the lightgbm scikit-learn API has an `LGBMClassifier`.
We would like an equivalent `FairGBMClassifier` that enforces the use of fairness constraints.
This would simply be a subclass of `LGBMClassifier` that hard-codes the objective function to `constrained_cross_entropy` (and perhaps some other kwargs should be enforced as well).
Also, the lightgbm package provides another (non-sklearn) API; can we create a FairGBM alias for that API as well?
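The subclass could look roughly like the sketch below. A stand-in base class keeps the snippet self-contained; in reality the base would be `lightgbm.LGBMClassifier`, and a proper scikit-learn estimator would declare its parameters explicitly instead of via `**kwargs`:

```python
class LGBMClassifierStub:
    """Stand-in for lightgbm.LGBMClassifier (illustration only)."""
    def __init__(self, objective="binary", **kwargs):
        self.objective = objective
        self.kwargs = kwargs

class FairGBMClassifier(LGBMClassifierStub):
    """Subclass that hard-codes the constrained objective,
    forwarding all other keyword arguments unchanged."""
    def __init__(self, **kwargs):
        kwargs["objective"] = "constrained_cross_entropy"  # enforced
        super().__init__(**kwargs)

clf = FairGBMClassifier(n_estimators=200)
# clf.objective is always "constrained_cross_entropy"
```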
We are trying to use FairGBM to classify a certain feature, `severity_score_class`, while using `districts` as the constraint group. After trying to train on X, Y and S with `fairgbm_clf.fit(X_train, Y_train, constraint_group=S)`, the following error arises: `LightGBMError: Input data type error or field not found`. After many attempts to fix this issue, it still persists.
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import lightgbm
from fairgbm import FairGBMClassifier

data = pd.read_csv('total_df_final_for_models.csv').drop(columns=['Column1', 'Column2'])

TARGET_COL = "severity_score_class"
SENSITIVE_COL = "district"

def retrieve_X(data):
    ignored_cols = [TARGET_COL, SENSITIVE_COL, "severity_score"]
    feature_cols = [col for col in data.columns if col not in ignored_cols]
    X = data[feature_cols]
    return X

def retrieve_Y(data):
    Y = data[TARGET_COL]
    return Y

def retrieve_S(data):
    data["district"] = data["district"].astype('category')
    data["district_encoding"] = data["district"].cat.codes
    S = data["district_encoding"]
    return S

X = retrieve_X(data)
Y = retrieve_Y(data)
S = retrieve_S(data)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=16)

fairgbm_clf = FairGBMClassifier(
    constraint_type="FNR",   # constraint on equal group-wise TPR (equal opportunity)
    n_estimators=200,        # core parameters from vanilla LightGBM
    random_state=16,
)
fairgbm_clf.fit(X_train, Y_train, constraint_group=S)
```
The Y variable is multiclass, as opposed to the binary targets that FairGBM makes use of. Y consists of three levels, which might be a problem if multiclass classification is not possible with FairGBM. The constraint group S consists of 69 districts. Maybe these are the reasons for the LightGBMError. Every line of code works until the `fairgbm_clf.fit()` call.
Data used: total_df_final_for_models.csv
The current LICENSE is the same as the one used on the TimeSHAP open-source repo.
TODO:
Related to #36
Change the maintainer and email address to a mailing list.
According to our `perf` and `valgrind` benchmarks, a large percentage of CPU time is spent on synchronization between separate threads during training.
The net outcome of multi-threading is still positive; however, when using `OMP_NUM_THREADS=4` our code only consistently uses 2 threads, seemingly unable to fully parallelize.
The term `constraint_group` alludes to constrained optimization, but the main use-case for FairGBM is enhancing fairness, so a better kwarg name should probably be chosen.
Suggestions:
- `sensitive_attributes`
- `protected_attributes`
- `constraint_groups` (note the plural)

NOTE: this is a breaking change and will need a corresponding PR in the feedzai-openml-java repository, etc.
We need to assess the side-effects of not having a proper `ConstrainedCrossEntropy::ToString` method.
From a quick run through the code, it seems the `ToString` method is used to pass information/configs to the `Objective` class. Currently we have this commented out for all `ConstrainedObjectiveFunction` sub-classes. For example:
If we uncomment this, the following error is thrown:
This is also related to the fact that we cannot resume training from a previously trained FairGBM model.
Related to #10
Including:
- Change the python package's name from `lightgbm` to `fairgbm` (see `setup.py`, line 335).

TODO: check what the implications of this change are.
Currently, the function `ConstrainedObjectiveFunction::GetLagrangianGradientsWRTMultipliers` ignores the `this->weights_` variable.
TODO: implement weighting -- although this may interfere with the constrained optimization process.
A test that compares the persisted txt file with the txt file of a model created with a previous lightgbm version is currently failing.
Files for version v3.0.0:
4f.txt
42f.txt
Files for version v3.2.1-fairgbm:
4f.txt
42f.txt
The diff seems to come down to the `is_linear` field: model files from v3.2.1-fairgbm include an `is_linear=0\n` line that the v3.0.0 files lack.
When using `constrained_cross_entropy` as an objective, FairGBM also tries to use it as a metric (to report training progress). The issue here is that, currently, no such metric exists.
Everything still works, since a metric is not necessary for instantiating or training a FairGBM model, but it would be nice to have a proper metric.
We should enable creating group-wise constraints on the percentage of positive predictions (a.k.a., predicted prevalence).
This enables the popular Demographic Parity fairness metric.
TODO
- Add `constraint_type=PP` to the available group-wise constraint options;
- Add a `PP` constraint type for global constraints.
EDIT: Minor notes for the future PP constraint implementation:
83de2426-ea6e-4308-9f0d-5ddc3b001580.pdf
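For reference, the quantity a PP constraint equalizes across groups is the group-wise fraction of positive predictions (predicted prevalence); a minimal sketch with a hypothetical helper:

```python
import numpy as np

def groupwise_predicted_prevalence(y_pred, groups):
    """Fraction of positive predictions per sensitive group
    (the rates a Demographic Parity / PP constraint would equalize)."""
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    return {g: float(y_pred[groups == g].mean()) for g in np.unique(groups)}

groupwise_predicted_prevalence([1, 0, 1, 1], ["a", "a", "b", "b"])
# -> {"a": 0.5, "b": 1.0}
```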
Several CPU-bound FairGBM functions are currently single threaded.
TODO:
- Parallelize with `#pragma omp parallel for schedule(static)` where possible;
- `ConstrainedObjectiveFunction::GetConstraintGradientsWRTModelOutput` should be the focus, as it is where most CPU time is spent.