ipeirotis / get-another-label
Quality control code for estimating the quality of the workers in crowdsourcing environments
In the object-probabilities and in the summary, we should report data quality as a number between 0% (random) and 100% (perfect).
Data quality (estimated according to DS_Exp metric): 1 - DS_Exp_Cost / NoVote_Exp_Cost
Data quality (estimated according to MV_Exp metric): 1 - MV_Exp_Cost / NoVote_Exp_Cost
Data quality (estimated according to DS_Opt metric): 1 - DS_Opt_Cost / NoVote_Opt_Cost
Data quality (estimated according to MV_Opt metric): 1 - MV_Opt_Cost / NoVote_Opt_Cost
Similarly, we should report data quality results from the evaluation data:
Data quality, DS algorithm, maximum likelihood: 1 - Eval_Cost_DS_ML / NoVote_Opt_Cost
Data quality, DS algorithm, soft label: 1 - Eval_Cost_DS_Soft / NoVote_Opt_Cost
Data quality, naive majority voting algorithm: 1 - Eval_Cost_MV_ML / NoVote_Opt_Cost
Data quality, naive soft label: 1 - Eval_Cost_MV_Soft / NoVote_Opt_Cost
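Each of the lines above applies the same normalization; a minimal sketch (class and method names here are hypothetical, not from the codebase):

```java
public class DataQuality {
    /**
     * Maps a misclassification cost into a quality score:
     * 0.0 = no better than the no-vote baseline, 1.0 = perfect.
     */
    static double quality(double cost, double baselineCost) {
        return 1.0 - cost / baselineCost;
    }
}
```

For example, with DS_Exp_Cost = 0.095 and NoVote_Exp_Cost = 0.43 (the values in the summary below), the DS_Exp data quality is about 77.9%.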
Overall Statistics
Categories: 2 ==> OK
Objects in Data Set: 1000 ==> OK
Workers in Data Set: 83 ==> OK
Labels Assigned by Workers: 5000 ==> OK
Data Statistics
Average[Object]: 500.5 ==> REMOVE
Average[DS_Pr[1]]: 0.313 ==> DS estimate for prior probability of category [1]
Average[DS_Pr[0]]: 0.687 ==> DS estimate for prior probability of category [0]
Average[DS_Category]: 0.297 ==> REMOVE
Average[MV_Pr[1]]: 0.281 ==> Majority Vote estimate for prior probability of category [1]
Average[MV_Pr[0]]: 0.719 ==> Majority Vote estimate for prior probability of category [0]
Average[MV_Category]: 0.261 ==> Majority Vote estimate for prior probability of category [1]
Average[DS_Exp_Cost]: 0.095 ==> Expected misclassification cost (for EM algorithm)
Average[MV_Exp_Cost]: 0.192 ==> Expected misclassification cost (for Majority Voting algorithm)
Average[NoVote_Exp_Cost]: 0.43 ==> Expected misclassification cost (random classification)
Average[DS_Opt_Cost]: 0.064 ==> Minimized misclassification cost (for EM algorithm)
Average[MV_Opt_Cost]: 0.139 ==> Minimized misclassification cost (for Majority Voting algorithm)
Average[NoVote_Opt_Cost]: 0.313 ==> Minimized misclassification cost (random classification)
Average[Correct_Category]: 0.491 ==> REMOVE
Average[Eval_Cost_MV_ML]: 0.304 ==> Classification cost for naive single-class classification, using majority voting (evaluation data)
Average[Eval_Cost_DS_ML]: 0.286 ==> Classification cost for single-class classification, using EM (evaluation data)
Average[Eval_Cost_MV_Soft]: 0.33 ==> Classification cost for naive soft-label classification (evaluation data)
Average[Eval_Cost_DS_Soft]: 0.296 ==> Classification cost for soft-label classification, using EM (evaluation data)
Worker Statistics
Average[Est. Quality (Expected)]: 30.771% ==> Worker quality (expected_quality metric, EM algorithm estimates)
Average[Est. Quality (Optimized)]: 33.361% ==> Worker quality (optimized_quality metric, EM algorithm estimates)
Average[Eval. Quality (Expected)]: 25.152% ==> Worker quality (expected_quality metric, evaluation data)
Average[Eval. Quality (Optimized)]: 29.439% ==> Worker quality (optimized_quality metric, evaluation data)
Average[Number of Annotations]: 60.241 ==> Average number of labels assigned per worker
Average[Gold Tests]: 0.0 ==> Average number of gold tests per worker
I am not sure that I like the approach in SummaryReport:
public <T> Object getAverage(FieldAccessor fieldAcessor, Iterable<T> objects) {
    Double accumulator = 0d;
    long count = 0;
    boolean evalP = fieldAcessor instanceof EvalDatumFieldAccessor;
    for (T object : objects) {
        if (evalP) {
            Datum datum = (Datum) object;
            if (!datum.isEvaluation())
                continue;
        }
Perhaps we can enforce a convention of returning empty data as "null", independently of the idea of EvaluationData. The special case of checking instanceof and then calling a Datum-specific method looks very brittle.
All the print* methods should be placed in a different class. DawidSkene should expose getters that return the Sets of object and worker ids, plus getters that return individual Datum and Worker objects, so that printWorkerScore, printObjectClassProbabilities, etc. can operate outside DawidSkene.
Right now, when writing the output files object-probabilities.txt, worker-statistics-detailed.txt, and so on, we first create the content of the file in memory (using a StringBuffer) and then write the file to disk.
Although the proper solution is the one we use in DSaS (Project Troia), a database backend, we need a simpler solution for the CLI-oriented Get-Another-Label: simply write line-by-line to disk, without first creating a huge string in memory.
I uploaded two files in "Downloads" that illustrate the problem. They both load and process the data relatively fast, but a huge amount of time is wasted in creating the huge StringBuffers and then writing the files.
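A minimal sketch of the line-by-line approach (the Writer plumbing here is hypothetical; the real report classes would supply their own rows and sinks):

```java
import java.io.PrintWriter;
import java.io.Writer;

public class StreamingReport {
    // Writes each report row directly to the sink (e.g., a FileWriter for
    // object-probabilities.txt) instead of accumulating a StringBuffer.
    static void writeRows(Iterable<String> rows, Writer sink) {
        PrintWriter out = new PrintWriter(sink);
        for (String row : rows)
            out.println(row);   // one line at a time; O(1) extra memory
        out.flush();
    }
}
```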
I see there are two normalizing methods for the confusion matrix: one uses Laplace smoothing, the other does not. Is there any difference between them? I see that normalizeLaplacean is not used. Is there any problem if I use normalizeLaplacean? I ask because I want to get rid of the NaN values.
In Worker.getWorkerCost we use as normalization values either getSpammerCost or getMinSpammerCost:
if (method == Worker.EXP_COST_EVAL) {
    return cost / Helper.getSpammerCost(categories);
} else if (method == Worker.MIN_COST_EVAL) {
    return cost / Helper.getMinSpammerCost(categories);
} else if (method == Worker.EXP_COST_EST) {
    return cost / Helper.getSpammerCost(categories);
} else if (method == Worker.MIN_COST_EST) {
    return cost / Helper.getMinSpammerCost(categories);
} else {
    // We should never reach this
    System.err.println("Error: We should have never reached this in getWorkerCost");
    return Double.NaN;
}
We should change that to use only getMinSpammerCost, but also use three different values for the priors as in issue #19
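For reference, a self-contained sketch of what the minimal-spammer baseline computes, with semantics inferred from the "strategic spammer" discussion below (the real Helper.getMinSpammerCost may differ in its details):

```java
public class SpammerBaseline {
    /**
     * Expected cost of a "strategic spammer" who always answers the single
     * category that minimizes expected misclassification cost.
     * cost[truth][guess] is the cost of labeling a `truth` object as `guess`.
     */
    static double minSpammerCost(double[] priors, double[][] cost) {
        double best = Double.POSITIVE_INFINITY;
        for (int guess = 0; guess < priors.length; guess++) {
            double expected = 0.0;
            for (int truth = 0; truth < priors.length; truth++)
                expected += priors[truth] * cost[truth][guess];
            best = Math.min(best, expected);
        }
        return best;
    }
}
```

With 0/1 costs and priors (0.7, 0.3), always guessing the majority class costs 0.3, which is the baseline a worker must beat.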
We should have an obvious option for downloading the jar file so that people can run the program easily.
Expecting people to know how to download Maven and compile the project is not realistic for the vast majority of users, who just want to use this program, even if it is command-line and rather esoteric.
I want to create a summary report after each execution, reporting the average cost for the objects, and quality for the workers.
Effectively compute the average of the columns in the object-probabilities.txt and in the worker-statistics.txt and report it in a summary file (and in the terminal output).
An example summary would look like this:
X categories
Y objects in data set
Z workers in data set
W labels assigned by the workers
K gold labels for the objects
DS_Exp_Cost metric, estimated: AA%
DS_Exp_Cost metric, evaluation: AB%
DS_Opt_Cost metric, estimated: AC%
DS_Opt_Cost metric, evaluation: AD%
....
[[Note: see https://github.com/ipeirotis/Get-Another-Label/wiki/How-to-Run-Get-Another-Label#wiki-objectprobabilitiestxt for the metrics]]
Expected Quality, non-weighted, estimated: WA%
Expected Quality, weighted, estimated: WB%
Expected Quality, non-weighted, evaluation data: WC%
Expected Quality, weighted, evaluation data: WD%
Optimized Quality, non-weighted, estimated: WE%
Optimized Quality, weighted, estimated: WF%
...
The following metrics should be computed using a weighted average. (See #25)
[WorkerQuality_Estm_DS_Exp_w]
[WorkerQuality_Estm_DS_Min_w]
[WorkerQuality_Eval_DS_Exp_w]
[WorkerQuality_Eval_DS_Min_w]
Currently, the reported number does not make much sense (it needs to be divided by the "Number of Labels" to be properly normalized). Once divided, the values are the same as the ones reported in WorkerQuality_Estm_Weighted_* by DS_ACCESSOR.
It would be better to implement #25, and report the proper values within the WORKER_ACCESSOR.
From both the paper and the code, it seems that we can set MisclassificationCost to an arbitrary number other than 0 or 1. However, when I set "porn notporn" to 1000 or 0.1, the results don't change at all. I also set all four cases (porn-porn, notporn-notporn, porn-notporn, notporn-porn) to 0, but the results still don't change. I have a larger dataset, and no matter how I vary MisclassificationCost across the different categories, the results remain the same. Therefore, I suspect MisclassificationCost might have a bug.
Is there a quick answer to this issue? In the meantime, I'll try to do some debugging work too.
This is an esoteric check:
In principle, the quality of the data and the quality of the workers should eventually converge to the same point. We should first calculate the quality of the workers (we already do that) and then estimate what that "worker quality" implies under the level of redundancy that we observe in the data. This gives another estimate of the data quality, and it should be pretty close to the actual quality.
NoVote_Min_Cost uses the prior probabilities to define the baseline cost of a "strategic spammer".
One key issue is the prior probabilities, which can be estimated in different ways:
I would put an advanced switch in the command line to determine what type of prior to use for the normalization. By default it should be (1), with a secondary preference for (2). Option (3) [the current implementation when we do not have fixed priors, which uses the DS priors] should come with a warning.
We report DS_Category and MV_Category (the maximum likelihood estimates), and Correct_Category.
We should also report the minimum cost categories for MV and DS, in addition to the maximum likelihood.
It would be nice to add an option to pass a file with the correct labels for (some of) the objects.
Currently, we also allow loading "gold data" for the purpose of aiding and speeding up the estimation of worker quality. The evaluation data will not be used in the same way we currently use gold data: specifically, the evaluation data will never be used during training. Instead, we will use the evaluation data to estimate the quality of the estimates generated by the algorithm.
Right now, we create the files:
dawid-skene-results.txt
naive-majority-vote.txt
differences-with-majority-vote.txt
object-probabilities.txt
For each of these files, we should also add a column with the correct label, from the evaluation data.
We should also add columns with the classification cost for each example.
The classification cost is computed as the cost of misclassifying an object of class A (taken from the evaluation data) into class B (taken from the label(s) assigned by the workers). The cost is based on the costs file. The classification cost can be computed in multiple ways:
a. Using the maximum-likelihood category from DawidSkene (EvalCost_DS_ML)
b. Using the maximum-likelihood category from Majority (EvalCost_MV_ML)
c. Using the "soft label" category from DawidSkene (EvalCost_DS_Soft)
d. Using the "soft label" category from Majority (EvalCost_MV_Soft)
For a and b, we simply take the class (as reported in dawid-skene-results.txt and naive-majority-vote.txt) and print the cost. For cases c and d, we use the "soft label" for the object and compute the weighted cost. For example, if the object has evaluation class A and the algorithm returns 60% A, 30% B, 10% C, with costs A->A = 0, A->B = 1, A->C = 2, then the classification cost is 0.6*0 + 0.3*1 + 0.1*2 = 0.5. For an object with 90% A, 9% B, 1% C, the classification cost is 0.9*0 + 0.09*1 + 0.01*2 = 0.11.
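The worked example above can be sketched directly (class and method names are hypothetical, not the project's API):

```java
import java.util.Map;

public class SoftLabelCost {
    /**
     * Soft-label classification cost: the probability-weighted sum of
     * misclassification costs, where the true class comes from the
     * evaluation data and the soft label comes from the algorithm.
     */
    static double evalCost(String trueClass,
                           Map<String, Double> softLabel,
                           Map<String, Map<String, Double>> costs) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : softLabel.entrySet())
            total += e.getValue() * costs.get(trueClass).get(e.getKey());
        return total;
    }
}
```

With the example's numbers (true class A; soft label 60% A, 30% B, 10% C; costs A->A = 0, A->B = 1, A->C = 2), this returns 0.5.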
We should also generate an object-label-accuracy.txt report that will report:
a. The confusion matrix of each technique (dawid-skene maxlikelihood, dawid-skene soft, majority maxlikelihood, majority soft)
b. The average misclassification cost of each algorithm.
We should create an extra confusion matrix for each worker, which should be based solely on the assigned labels and the actual evaluation data. Then we can list the estimated quality of the worker based on the evaluation data, next to the estimates that we have for the confusion matrix, the quality of the worker, etc.
We should modify these files accordingly:
worker-statistics-detailed.txt
worker-statistics-summary.txt
To compute the evaluation-data confusion matrix of a worker, we go through the objects labeled by this worker; we check what is the evaluation label of each object and what is the label assigned by the worker. Based on these, we compute the confusion matrix of the worker, the quality (expected and optimized) of the worker, etc.
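A minimal sketch of this computation, assuming parallel arrays of evaluation labels and worker-assigned labels (names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class WorkerEvalMatrix {
    /** Counts [evaluation label -> assigned label] pairs for one worker. */
    static Map<String, Map<String, Integer>> confusionMatrix(
            String[] evalLabels, String[] assignedLabels) {
        Map<String, Map<String, Integer>> cm = new HashMap<>();
        for (int i = 0; i < evalLabels.length; i++) {
            cm.computeIfAbsent(evalLabels[i], k -> new HashMap<>())
              .merge(assignedLabels[i], 1, Integer::sum);  // increment cell
        }
        return cm;
    }
}
```

Normalizing each row of these counts gives the worker's evaluation-based error rates, from which the expected and optimized quality scores follow.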
That is the simplest part. We should list in the priors.txt file not only the estimated priors, but also the actual priors, based on the prevalence of each category in the evaluation data.
Hello,
I'm working on my dissertation with Get-Another-Label, and I'd like to access objects, workers, and categories directly from DawidSkene instead of parsing the output text files. To do that, we just need to add some public getters. I'd be happy to contribute a patch, if that's OK. Thanks.
Daniel Zhou, PhD student
University of Michigan School of Information
Do a global search for class/variable/method names and replace accordingly. Minor issue.
When I try to follow the tutorial, I find there is no input.txt in BarzanMozafari. Where can I get this data?
Objects, Labels and Workers must be enforced to start with a lowercase alpha character.
Metrics should be uppercase
In #18, we implemented confusion matrices for the different classification methods. We used the evaluation data as the "from" side.
We can also do the same with the estimated confusion matrices, which capture the belief of the algorithm about its own performance. A well-calibrated result will report an estimated confusion matrix pretty close to the evaluation-based one.
To compute the estimated confusion matrix, we will use as "from" the category reported for DS-Soft and MV-Soft (for the DS_* and MV_* algorithms, respectively).
Since DS_Soft and MV_Soft are probability vectors and not a single category, we need a for loop that goes over the categories. Also, when we add the error to the confusion matrix (cm.adderror), we will add the product of the "from" probability multiplied by the "to" probability. (Right now we add only the "to" probability.)
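A sketch of the proposed update rule, with categories indexed by position (names hypothetical; the real cm.adderror signature may differ):

```java
public class SoftConfusion {
    /**
     * Accumulates one object's contribution into cm[from][to]: instead of
     * adding only the "to" probability, add the product of the "from" and
     * "to" probabilities over all category pairs.
     */
    static void addSoftError(double[][] cm, double[] softFrom, double[] softTo) {
        for (int from = 0; from < softFrom.length; from++)
            for (int to = 0; to < softTo.length; to++)
                cm[from][to] += softFrom[from] * softTo[to];
    }
}
```

Because each soft label sums to 1, every object still contributes a total mass of 1 to the matrix, so the row sums remain interpretable after normalization.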
We seem to have too many metrics being reported now in the summary and in the object-probabilities.
Perhaps we should report only a small fraction of those. I think the following metrics are more than enough:
Categories: 2
Objects in Data Set: 1000
Workers in Data Set: 83
Labels Assigned by Workers: 5000
[Number of labels] Labels per worker: 60.2410
[Gold Tests] Gold tests per worker: 1.8072
DataQuality_Estm_DS_Min
DataQuality_Eval_DS_Min [only when we pass an evaluation file]
WorkerQuality_Estm_DS_Min_n
WorkerQuality_Estm_DS_Min_w
WorkerQuality_Eval_DS_Min_n [only when we pass an evaluation file]
WorkerQuality_Eval_DS_Min_w [only when we pass an evaluation file]
We should similarly limit the number of columns in the results/ (object-probabilities, worker-statistics, etc)
Right now we display the averages of worker quality, treating each worker as equal to the others. However, since each worker contributes a different number of labels, we should also report the weighted average of worker quality. For that, we multiply each quality metric by the worker's "number of annotations" (i.e., assigned labels), and report the sum of these products divided by the total number of labels assigned by all workers.
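A minimal sketch of this weighted average (names hypothetical); note that with all weights equal to 1 it reduces to the plain average currently reported:

```java
public class WeightedQuality {
    /**
     * Weighted average of per-worker quality scores, weighted by the
     * number of labels each worker contributed.
     */
    static double weightedAverage(double[] quality, double[] numLabels) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < quality.length; i++) {
            num += quality[i] * numLabels[i];  // sum of quality * weight
            den += numLabels[i];               // total labels assigned
        }
        return num / den;
    }
}
```

For example, a worker of quality 1.0 with 3 labels averaged against a worker of quality 0.0 with 1 label yields 0.75, not the unweighted 0.5.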
It would be nice to see the complete confusion matrix of each technique, i.e., the types of errors made by each one.
It should be based on evaluation data.
Methods:
DS_MaxLikelihood
MV_MaxLikelihood
DS_MinCost
MV_MinCost
DS_Soft
MV_Soft
Is this project under the MIT license, the GPL, or some other open source license? Please include license information. I'd like to contribute patches directly to this project instead of sub-classing, but I'd like to make sure I can still legally use the code later on. Thanks.
MIT license: http://opensource.org/licenses/mit-license.php/
I guess adding a "license.txt" file under the root directory would be fine. Thanks.
We need to revise the documentation in https://github.com/ipeirotis/Get-Another-Label/wiki/How-to-Run-Get-Another-Label to reflect the changes made in Issue #1 and in particular #1 (comment)
Most of the links from this project are broken, including the standalone download page (https://github.com/ipeirotis/Get-Another-Label/downloads)
In the distribution zipball in the downloads, the Unix executable script in the bin/ folder is called get-another-label and appears to be a binary instead of a shell script. I would rename it get-another-label.sh to make clear that it is an executable script.
It seems strange to have the weighted and the unweighted worker-quality reports under different decorators (unweighted -> Worker, weighted -> DawidSkene).
I would recommend allowing the "weighted averages" functionality in the Decorators and SummaryReport classes.
We will need a getWeightedAverage method in SummaryReport. In the weighted average, we need to make the following changes compared to getAverage:
In fact, once you have a weightedAverage implementation, getAverage is just a special case where the weights are all equal to 1.