input file
label column name
k value (for kNN creation)
using header
save path
file description
read/write (read to retrieve a previously stored kNN DataFrame, write to store it)
read/write path
currently unused
sampling methods (all remaining parameters chosen from: none, undersample, oversample, smote, smotePlus)
~/covtype/covtype10k.csv label 5 yes ~/test/results covtype10k write ~/covtype10k/minorityDF 1 none,smote
edu.vcu.sleeman.Classifier
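To make the positional order of the parameters easier to follow, here is a minimal Python sketch that maps the example argument line above onto named fields. It is only an illustration of the argument layout, not the project's own parser: the `RunConfig` class, the `parse_args` function, and all field names are hypothetical. It also assumes that `yes` means the input file has a header row and that sampling methods may be separated by commas, spaces, or both.

```python
from dataclasses import dataclass
from typing import List

# Sampling method names taken from the parameter list above.
VALID_SAMPLING = {"none", "undersample", "oversample", "smote", "smotePlus"}


@dataclass
class RunConfig:
    """Hypothetical container for the ten positional parameters."""
    input_file: str
    label_column: str
    k: int                      # k value used for kNN creation
    use_header: bool
    save_path: str
    file_description: str
    knn_mode: str               # "read" or "write"
    knn_path: str
    unused: str                 # currently unused, kept as given
    sampling_methods: List[str]


def parse_args(argv: List[str]) -> RunConfig:
    """Map positional arguments onto named fields (illustrative only)."""
    if len(argv) < 10:
        raise ValueError("expected at least 10 positional arguments")
    if argv[6] not in ("read", "write"):
        raise ValueError(f"read/write must be 'read' or 'write', got {argv[6]!r}")

    # Assumption: every remaining argument may hold one or more
    # comma-separated sampling method names.
    methods = [m for arg in argv[9:] for m in arg.split(",") if m]
    unknown = [m for m in methods if m not in VALID_SAMPLING]
    if unknown:
        raise ValueError(f"unknown sampling methods: {unknown}")

    return RunConfig(
        input_file=argv[0],
        label_column=argv[1],
        k=int(argv[2]),
        use_header=(argv[3] == "yes"),  # assumption: "yes" enables the header
        save_path=argv[4],
        file_description=argv[5],
        knn_mode=argv[6],
        knn_path=argv[7],
        unused=argv[8],
        sampling_methods=methods,
    )


example = (
    "~/covtype/covtype10k.csv label 5 yes ~/test/results covtype10k "
    "write ~/covtype10k/minorityDF 1 none,smote"
)
config = parse_args(example.split())
print(config)
```

With the example line, `k` is parsed as the integer 5, the kNN DataFrame mode is `write`, and the sampling methods are `none` and `smote`.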
Experiments have been performed using the following datasets:
https://archive.ics.uci.edu/ml/datasets/covertype
https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
https://catalog.data.gov/dataset/traffic-violations-56dda
SEER: https://seer.cancer.gov/
Cancer type predictions were performed on data from the SEER cancer registry. Using that data requires requesting access from SEER and agreeing to their terms of use.
data/sensors.csv
The original source (http://db.csail.mit.edu/labdata/labdata.html) from the Intel Berkeley Research Lab appears to be offline, so the version of the data used has been saved in the data directory of this repository.
Full results and a discussion of minority type instance difficulty can be found in our recently published paper:
Sleeman IV, William C., and Bartosz Krawczyk. "Multi-class imbalanced big data classification on Spark." Knowledge-Based Systems (2020): 106598.
Please consider citing this work if it has been helpful in your research. BibTeX reference:
@article{sleeman2020multi,
title={Multi-class imbalanced big data classification on Spark},
author={Sleeman IV, William C and Krawczyk, Bartosz},
journal={Knowledge-Based Systems},
pages={106598},
year={2020},
publisher={Elsevier}
}