This package implements the UpliftRandomForest algorithm (the same algorithm as in causalml, but faster and less memory-consuming), and may later implement more machine learning algorithms designed for uplift modeling.
Reference for details: Piotr Rzepakowski and Szymon Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments. Knowl. Inf. Syst., 32(2):303–327, August 2012.
uplift-kit is published on PyPI; install it with pip install uplift-kit.
from uplift_kit.trees import UpliftRandomForestModel
import pandas as pd
model = UpliftRandomForestModel(
    n_estimators=10,  # number of uplift trees
    max_features=10,  # maximum number of features considered in one split
    max_depth=10,  # maximum depth of a single tree
    min_sample_leaf=100,  # minimum number of samples assigned to a leaf
    eval_func="ED",  # split evaluation function, supports `ED, KL, CHI`
    max_bins=10,  # maximum number of bins considered when calculating the best split
    balance=False,  # whether to use a weighted average when calculating the score (False: do not)
    regularization=True,  # whether to add a regularization term
    alpha=0.9,  # parameter of the regularization term
)
data = pd.read_parquet("../train.parquet")
x_names = list(data.columns[:-2])
# The model uses the columns listed in `x_names` as features.
# treatment_col should contain 0,1,2,...,k, where 0 indicates a control sample and 1~k indicate treatments 1~k.
# outcome_col should contain only the integer values 0 and 1, i.e. a binary outcome.
model.fit(
    data,
    x_names=x_names,
    treatment_col="treats",
    outcome_col="outcome",
    n_threads=8,
)
# During prediction, the model automatically selects the feature columns (x_names) from the input dataframe.
# It returns a numpy array with k columns per sample, one uplift value per treatment.
test = pd.read_parquet("../test.parquet")
res = model.predict(data=test[x_names], n_threads=8)
print(res[:10])
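The eval_func values ED, KL and CHI used in the constructor above appear to correspond to the Euclidean distance, Kullback-Leibler divergence and chi-squared divergence split criteria from the referenced paper. The following is a minimal sketch of those three divergences for a binary outcome, plus the unnormalized split gain (weighted divergence after the split minus divergence before it). All function names here are hypothetical and this is not uplift-kit's internal code; the clipping constant eps is an assumption added only to avoid division by zero.

```python
import math


def _binary_dist(p, eps=1e-6):
    """Return [P(y=0), P(y=1)], with p clipped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return [1.0 - p, p]


def euclidean_distance(p_t, p_c):
    """Squared Euclidean distance between treatment and control outcome distributions."""
    P, Q = _binary_dist(p_t), _binary_dist(p_c)
    return sum((a - b) ** 2 for a, b in zip(P, Q))


def kl_divergence(p_t, p_c):
    """Kullback-Leibler divergence of the treatment distribution from the control one."""
    P, Q = _binary_dist(p_t), _binary_dist(p_c)
    return sum(a * math.log(a / b) for a, b in zip(P, Q))


def chi_squared_divergence(p_t, p_c):
    """Chi-squared divergence of the treatment distribution from the control one."""
    P, Q = _binary_dist(p_t), _binary_dist(p_c)
    return sum((a - b) ** 2 / b for a, b in zip(P, Q))


def split_gain(divergence, parent, children):
    """Unnormalized gain of a candidate split.

    parent   -- (p_t, p_c): outcome rates in treatment and control before the split
    children -- list of (weight, p_t, p_c), where the weights are the sample
                fractions falling into each branch and sum to 1
    """
    after = sum(w * divergence(p_t, p_c) for w, p_t, p_c in children)
    return after - divergence(*parent)


# A split that separates a positive-uplift group from a negative-uplift group
# out of a parent node with no overall uplift.
gain = split_gain(
    euclidean_distance,
    parent=(0.5, 0.5),
    children=[(0.5, 0.8, 0.2), (0.5, 0.2, 0.8)],
)
print(gain)
```

A larger gain means the split makes the treatment and control outcome distributions differ more within the branches than they did in the parent node, which is what an uplift tree looks for.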
Values in treatment_col must be [0,1,2...], where 0 specifically represents a control sample and [1,2...k] indicate k types of treatment.
Values in outcome_col indicate the outcome of the treatments/control and must be 0/1 (binary outcome).
Values in the x_names columns can be either numeric or categorical (str values). The model handles both properly.
The predict method returns the uplift value of each treatment as a np.array of shape (n_samples, k). res[i][k] represents the uplift value of treatment k for item i.
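The rules above for the treatment and outcome columns can be checked before calling fit. The helper below is a minimal sketch, not part of uplift-kit: the name check_uplift_columns is hypothetical, and requiring the treatment values to be exactly 0 through k with no gaps is one reading of "[0,1,2...]".

```python
import numbers


def check_uplift_columns(treatments, outcomes):
    """Validate treatment and outcome values; return k, the number of treatments.

    treatments -- iterable of treatment labels (0 = control, 1..k = treatments)
    outcomes   -- iterable of binary outcomes (integer 0 or 1)
    Raises ValueError when a rule is broken.
    """
    treatment_values = set(treatments)
    outcome_values = set(outcomes)

    if not treatment_values:
        raise ValueError("treatment column is empty")

    # Both columns must hold integers (bool and numpy integer types also pass).
    for name, values in (("treatment", treatment_values), ("outcome", outcome_values)):
        if not all(isinstance(v, numbers.Integral) for v in values):
            raise ValueError(f"{name} values must be integers, found {values}")

    if not outcome_values <= {0, 1}:
        raise ValueError(f"outcome values must be 0 or 1, found {outcome_values}")

    k = max(treatment_values)
    if k < 1:
        raise ValueError("need a control group (0) and at least one treatment")

    # Assumption: labels must be exactly 0, 1, ..., k without gaps.
    if treatment_values != set(range(k + 1)):
        raise ValueError(f"treatment values must be 0..{k}, found {treatment_values}")

    return k


k = check_uplift_columns([0, 1, 2, 1, 0, 2], [0, 1, 0, 1, 1, 0])
print(k)
```

With a dataframe like the one in the example, the two arguments would be data["treats"] and data["outcome"].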
You can save a trained model and load it elsewhere for prediction.
model.fit(...)
model.save("saved_model.json")
new_model = UpliftRandomForestModel()
new_model.load("saved_model.json")
new_model.predict(...)
In the basic example, the predict function uses multiple threads by default to predict a large dataset. The predict_row function, by contrast, is suited to predicting a single sample:
res = model.predict_row([1,2,"ASIA",...]) # input a list of features, consistent with `x_names`
res will be a list of k uplift values for k treatments.
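A common next step is to pick the treatment with the largest uplift for that sample. The sketch below assumes that position j in the returned list corresponds to treatment j + 1 (the README does not state the ordering explicitly, so verify this against your own data) and falls back to control when no treatment has a positive uplift. The function name best_treatment is hypothetical.

```python
def best_treatment(uplifts):
    """Return (treatment_id, uplift) for the largest positive uplift.

    uplifts -- sequence of k uplift values, one per treatment.
    Assumption: uplifts[j] belongs to treatment j + 1.
    Returns (0, 0.0), i.e. control, when no uplift is positive or the input is empty.
    """
    if len(uplifts) == 0:
        return 0, 0.0

    # Index of the largest uplift value.
    best_index = max(range(len(uplifts)), key=lambda j: uplifts[j])

    if uplifts[best_index] <= 0:
        return 0, 0.0
    return best_index + 1, uplifts[best_index]


choice = best_treatment([0.01, 0.05, -0.02])
print(choice)
```

The same function can be applied to each row of the array returned by predict.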