Our gating algorithm requires a Samasource-generated JSON file.
To run it, enter:
python runner.py json_data_file.json
##Data
Our data files are:
data/Getty_Training1.json
data/Getty_Training2.json
data/Getty_Validation.json
These files were generated by running Training_Validation_Generator.py on a master download from Samasource's data warehouse. The training datasets can be accessed in this repository, while the validation dataset can be downloaded here.
An optional batch number may be passed on the command line after the keyword "batch":
python runner.py data/Getty_Training1.json batch 892
If no batch is specified, or no batch with that number exists, the script defaults to the batch with the most tasks in the branch.
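The fallback rule above can be sketched as follows. This is only an illustration, not the code in runner.py: the function name `choose_batch` and the `"batch"` key on each task record are assumptions, since the actual layout of the JSON file is not described here.

```python
from collections import Counter


def choose_batch(tasks, requested=None):
    """Return the requested batch id if it exists, else the largest batch.

    `tasks` is assumed to be a list of dicts that each carry a "batch" key
    (a hypothetical field name used for illustration only).
    """
    if not tasks:
        raise ValueError("no tasks to choose a batch from")
    counts = Counter(task["batch"] for task in tasks)
    if requested in counts:
        return requested
    # Fallback: the batch id that appears on the most tasks.
    return counts.most_common(1)[0][0]


# Made-up records: batch 892 has two tasks, batch 17 has one.
tasks = [{"batch": 892}, {"batch": 892}, {"batch": 17}]
```

With these records, asking for batch 17 returns 17, while asking for a missing batch (or none at all) falls back to 892.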
You can also run the centroid selection in a loop of 30 iterations to estimate the profit gains over the course of a full project:
python runner.py data/Getty_Training1.json run-loop
If this parameter is on, the centroid selection iterates through a loop 30 times. This loop represents the number of gold batches all users must complete in order to finish an entire project of 300,000 tasks. The functions must also be run 30 times to reduce the variability introduced by the k-means function, which generates random centroids. Running the loop estimates the amount of gold saved, the cost savings, and the increased profit Samasource would receive if it implemented the smart gating algorithm.
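To illustrate why repeating a randomly initialised k-means helps, here is a minimal pure-Python sketch. It is not the project's implementation (the runner depends on scikit-learn, and the gold and profit estimates are not reproduced here). The helpers `kmeans_1d` and `averaged_inertia` and the sample scores are hypothetical.

```python
import random
import statistics


def kmeans_1d(points, k, rng, iterations=20):
    """Tiny 1-D k-means with randomly sampled initial centroids.

    Returns the inertia (sum of squared distances to the nearest centroid).
    """
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


def averaged_inertia(points, k, runs=30, seed=0):
    """Run k-means `runs` times and summarise the spread of the results."""
    rng = random.Random(seed)
    results = [kmeans_1d(points, k, rng) for _ in range(runs)]
    return statistics.mean(results), statistics.stdev(results)


# Made-up scores for illustration only.
points = [0.1, 0.2, 0.15, 0.8, 0.85, 0.9, 0.5, 0.55]
mean_inertia, spread = averaged_inertia(points, k=2)
```

Averaging over the 30 runs gives one summary figure that depends less on any single random start, and `spread` indicates how much the individual runs differ.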
The complete loop file runs in under 2 minutes for the training sets.
Plottings contains a number of functions that can be called to plot graphs of the existing data.
##Dependencies
Our runner script has the following dependencies:
- NumPy
- SciPy
- scikit-learn
- Matplotlib
##Old Code: the project's graveyard
A Monte Carlo simulation was developed for the threshold parameters; though not currently used, it can be run to inspect its output. Other unused bits of code can be found in the `old` folder.