ipeirotis / get-another-label
Quality control code for estimating the quality of the workers in crowdsourcing environments
In the object-probabilities and in the summary, we should report data quality as a number between 0% (random) and 100% (perfect).
Data quality (estimated according to DS_Exp metric): 1 - DS_Exp_Cost / NoVote_Exp_Cost
Data quality (estimated according to MV_Exp metric): 1 - MV_Exp_Cost / NoVote_Exp_Cost
Data quality (estimated according to DS_Opt metric): 1 - DS_Opt_Cost / NoVote_Opt_Cost
Data quality (estimated according to MV_Opt metric): 1 - MV_Opt_Cost / NoVote_Opt_Cost
Similarly, we should report data quality results from the evaluation data:
Data quality, DS algorithm, maximum likelihood: 1 - Eval_Cost_DS_ML / NoVote_Opt_Cost
Data quality, DS algorithm, soft label: 1 - Eval_Cost_DS_Soft / NoVote_Opt_Cost
Data quality, naive majority voting algorithm: 1 - Eval_Cost_MV_ML / NoVote_Opt_Cost
Data quality, naive soft label: 1 - Eval_Cost_MV_Soft / NoVote_Opt_Cost
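Each of the lines above applies the same normalization; a minimal sketch (class and method names here are hypothetical, not from the codebase):

```java
public class DataQuality {
    /**
     * Maps a misclassification cost into a quality score:
     * 0.0 = no better than the no-vote baseline, 1.0 = perfect.
     */
    static double quality(double cost, double baselineCost) {
        return 1.0 - cost / baselineCost;
    }
}
```

For example, with DS_Exp_Cost = 0.095 and NoVote_Exp_Cost = 0.43 (the values in the summary below), the DS_Exp data quality is about 77.9%.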
Overall Statistics
Categories: 2 ==> OK
Objects in Data Set: 1000 ==> OK
Workers in Data Set: 83 ==> OK
Labels Assigned by Workers: 5000 ==> OK
Data Statistics
Average[Object]: 500.5 ==> REMOVE
Average[DS_Pr[1]]: 0.313 ==> DS estimate for prior probability of category [1]
Average[DS_Pr[0]]: 0.687 ==> DS estimate for prior probability of category [0]
Average[DS_Category]: 0.297 ==> REMOVE
Average[MV_Pr[1]]: 0.281 ==> Majority Vote estimate for prior probability of category [1]
Average[MV_Pr[0]]: 0.719 ==> Majority Vote estimate for prior probability of category [0]
Average[MV_Category]: 0.261 ==> Majority Vote estimate for prior probability of category [1]
Average[DS_Exp_Cost]: 0.095 ==> Expected misclassification cost (for EM algorithm)
Average[MV_Exp_Cost]: 0.192 ==> Expected misclassification cost (for Majority Voting algorithm)
Average[NoVote_Exp_Cost]: 0.43 ==> Expected misclassification cost (random classification)
Average[DS_Opt_Cost]: 0.064 ==> Minimized misclassification cost (for EM algorithm)
Average[MV_Opt_Cost]: 0.139 ==> Minimized misclassification cost (for Majority Voting algorithm)
Average[NoVote_Opt_Cost]: 0.313 ==> Minimized misclassification cost (random classification)
Average[Correct_Category]: 0.491 ==> REMOVE
Average[Eval_Cost_MV_ML]: 0.304 ==> Classification cost for naive single-class classification, using majority voting (evaluation data)
Average[Eval_Cost_DS_ML]: 0.286 ==> Classification cost for single-class classification, using EM (evaluation data)
Average[Eval_Cost_MV_Soft]: 0.33 ==> Classification cost for naive soft-label classification (evaluation data)
Average[Eval_Cost_DS_Soft]: 0.296 ==> Classification cost for soft-label classification, using EM (evaluation data)
Worker Statistics
Average[Est. Quality (Expected)]: 30.771% ==> Worker quality (expected_quality metric, EM algorithm estimates)
Average[Est. Quality (Optimized)]: 33.361% ==> Worker quality (optimized_quality metric, EM algorithm estimates)
Average[Eval. Quality (Expected)]: 25.152% ==> Worker quality (expected_quality metric, evaluation data)
Average[Eval. Quality (Optimized)]: 29.439% ==> Worker quality (optimized_quality metric, evaluation data)
Average[Number of Annotations]: 60.241 ==> Average number of labels assigned per worker
Average[Gold Tests]: 0.0 ==> Average number of gold tests per worker
I am not sure that I like the approach in SummaryReport:
public <T> Object getAverage(FieldAccessor fieldAcessor, Iterable<T> objects) {
    Double accumulator = 0d;
    long count = 0;
    boolean evalP = fieldAcessor instanceof EvalDatumFieldAccessor;
    for (T object : objects) {
        if (evalP) {
            Datum datum = (Datum) object;
            if (!datum.isEvaluation())
                continue;
        }
Perhaps we can enforce a convention of returning empty data as "null", independently of the idea of EvaluationData. The special case of checking instanceof and then calling a Datum-specific method looks very brittle.
All the print* methods should be placed in a different class. DawidSkene should expose getters that return the Sets of object and worker ids, plus getters that return individual Datum and Worker objects, so that printWorkerScore, printObjectClassProbabilities, etc. can operate outside DawidSkene.
Right now, when writing the output files object-probabilities.txt, worker-statistics-detailed.txt, and so on, we first create the content of the file in memory (using a StringBuffer) and then write the file to disk.
Although the proper solution is the one we use in DSaS (Project Troia), a database backend, we need a simpler solution for the CLI-oriented Get-Another-Label: simply write line-by-line to disk, without first creating a huge string in memory.
I uploaded two files in "Downloads" that illustrate the problem. They both load and process the data relatively fast, but a huge amount of time is wasted in creating the huge StringBuffers and then writing the files.
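A minimal sketch of the line-by-line approach (the Writer plumbing here is hypothetical; the real report classes would supply their own rows and sinks):

```java
import java.io.PrintWriter;
import java.io.Writer;

public class StreamingReport {
    // Writes each report row directly to the sink (e.g., a FileWriter for
    // object-probabilities.txt) instead of accumulating a StringBuffer.
    static void writeRows(Iterable<String> rows, Writer sink) {
        PrintWriter out = new PrintWriter(sink);
        for (String row : rows)
            out.println(row);   // one line at a time; O(1) extra memory
        out.flush();
    }
}
```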
I see there are two normalizing methods for the confusion matrix: one uses Laplace smoothing, the other does not. Is there any difference between them? I see that normalizeLaplacean is not used. Is there any problem if I use normalizeLaplacean? I ask because I want to get rid of the NaN values.
In Worker.getWorkerCost we use as normalization values either getSpammerCost or getMinSpammerCost:
if (method == Worker.EXP_COST_EVAL) {
    return cost / Helper.getSpammerCost(categories);
} else if (method == Worker.MIN_COST_EVAL) {
    return cost / Helper.getMinSpammerCost(categories);
} else if (method == Worker.EXP_COST_EST) {
    return cost / Helper.getSpammerCost(categories);
} else if (method == Worker.MIN_COST_EST) {
    return cost / Helper.getMinSpammerCost(categories);
} else {
    // We should never reach this
    System.err.println("Error: We should have never reached this in getWorkerCost");
    return Double.NaN;
}
We should change that to use only getMinSpammerCost, but also use three different values for the priors as in issue #19
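For reference, a self-contained sketch of what the minimal-spammer baseline computes, with semantics inferred from the "strategic spammer" discussion below (the real Helper.getMinSpammerCost may differ in its details):

```java
public class SpammerBaseline {
    /**
     * Expected cost of a "strategic spammer" who always answers the single
     * category that minimizes expected misclassification cost.
     * cost[truth][guess] is the cost of labeling a `truth` object as `guess`.
     */
    static double minSpammerCost(double[] priors, double[][] cost) {
        double best = Double.POSITIVE_INFINITY;
        for (int guess = 0; guess < priors.length; guess++) {
            double expected = 0.0;
            for (int truth = 0; truth < priors.length; truth++)
                expected += priors[truth] * cost[truth][guess];
            best = Math.min(best, expected);
        }
        return best;
    }
}
```

With 0/1 costs and priors (0.7, 0.3), always guessing the majority class costs 0.3, which is the baseline a worker must beat.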
We should have an obvious option for downloading the jar file so that people can run the program easily.
Expecting people to know how to download Maven and compile the project is not realistic for the vast majority of users, who just want to use this program, even if it is command-line and rather esoteric.
I want to create a summary report after each execution, reporting the average cost for the objects, and quality for the workers.
Effectively compute the average of the columns in the object-probabilities.txt and in the worker-statistics.txt and report it in a summary file (and in the terminal output).
An example summary would look like this:
X categories
Y objects in data set
Z workers in data set
W labels assigned by the workers
K gold labels for the objects
DS_Exp_Cost metric, estimated: AA%
DS_Exp_Cost metric, evaluation: AB%
DS_Opt_Cost metric, estimated: AC%
DS_Opt_Cost metric, evaluation: AD%
....
[[Note: see https://github.com/ipeirotis/Get-Another-Label/wiki/How-to-Run-Get-Another-Label#wiki-objectprobabilitiestxt for the metrics]]
Expected Quality, non-weighted, estimated: WA%
Expected Quality, weighted, estimated: WB%
Expected Quality, non-weighted, evaluation data: WC%
Expected Quality, weighted, evaluation data: WD%
Optimized Quality, non-weighted, estimated: WE%
Optimized Quality, weighted, estimated: WF%
...
The following metrics should be computed using a weighted average. (See #25)
[WorkerQuality_Estm_DS_Exp_w]
[WorkerQuality_Estm_DS_Min_w]
[WorkerQuality_Eval_DS_Exp_w]
[WorkerQuality_Eval_DS_Min_w]
Currently, the reported number does not make much sense (it needs to be divided by the "Number of Labels" to be properly normalized). Once divided, the values are the same as the ones reported in WorkerQuality_Estm_Weighted_* by DS_ACCESSOR.
It would be better to implement #25, and report the proper values within the WORKER_ACCESSOR.
From both the paper and the code, it seems that we can set MisclassificationCost to an arbitrary number other than 0 or 1. However, when I set "porn notporn" to 1000 or 0.1, the results don't change at all. I also set all four cases (porn-porn, notporn-notporn, porn-notporn, notporn-porn) to 0, but the results still don't change. I have a larger dataset, and no matter how I vary MisclassificationCost across the different categories, the results remain the same. Therefore, I suspect MisclassificationCost might have a bug.
Is there a quick answer to this issue? In the meantime, I'll try to do some debugging work too.
This is an esoteric check:
In principle, the quality of the data and the quality of the workers should eventually converge to the same point. We should first calculate the quality of the workers (we already do that) and then estimate what that "worker quality" implies under the level of redundancy that we observe in the data. This gives another estimate of the data quality, and it should be pretty close to the actual quality.
NoVote_Min_Cost uses the prior probabilities to define the baseline cost of a "strategic spammer".
One key issue is the prior probabilities, which can be estimated in different ways:
I would put an advanced switch in the command line to determine what type of prior to use for the normalization. By default it should be (1), with a secondary preference for (2). Option (3) [the current implementation when we do not have fixed priors, which uses the DS priors] should come with a warning.
We report DS_Category and MV_Category (the maximum likelihood estimates), and Correct_Category.
We should also report the minimum cost categories for MV and DS, in addition to the maximum likelihood.
It would be nice to add an option to pass a file with the correct labels for (some of) the objects.
Currently, we also allow loading "gold data" for the purpose of aiding and speeding up the estimation of worker quality. The evaluation data will not be used in the same way we currently use gold data: specifically, the evaluation data will never be used during training. Instead, we will use the evaluation data to estimate the quality of the estimates generated by the algorithm.
Right now, we create the files:
dawid-skene-results.txt
naive-majority-vote.txt
differences-with-majority-vote.txt
object-probabilities.txt
For each of these files, we should also add a column with the correct label, from the evaluation data.
We should also add columns with the classification cost for each example.
The classification cost is computed as the cost of misclassifying an object of class A (taken from the evaluation data) into class B (taken from the label(s) assigned by the workers). The cost is based on the costs file. The classification cost can be computed in multiple ways:
a. Using the maximum-likelihood category from DawidSkene (EvalCost_DS_ML)
b. Using the maximum-likelihood category from Majority (EvalCost_MV_ML)
c. Using the "soft label" category from DawidSkene (EvalCost_DS_Soft)
d. Using the "soft label" category from Majority (EvalCost_MV_Soft)
For a and b, we simply take the class (as reported in dawid-skene-results.txt and naive-majority-vote.txt) and print the cost. For cases c and d, we use the "soft label" for the object and compute the weighted cost. For example, if the object has evaluation class A and the algorithm returns 60% A, 30% B, 10% C, with costs A->A = 0, A->B = 1, A->C = 2, then the classification cost is 0.6*0 + 0.3*1 + 0.1*2 = 0.5. For an object with 90% A, 9% B, 1% C, the classification cost is 0.9*0 + 0.09*1 + 0.01*2 = 0.11.
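The worked example above can be sketched directly (class and method names are hypothetical, not the project's API):

```java
import java.util.Map;

public class SoftLabelCost {
    /**
     * Soft-label classification cost: the probability-weighted sum of
     * misclassification costs, where the true class comes from the
     * evaluation data and the soft label comes from the algorithm.
     */
    static double evalCost(String trueClass,
                           Map<String, Double> softLabel,
                           Map<String, Map<String, Double>> costs) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : softLabel.entrySet())
            total += e.getValue() * costs.get(trueClass).get(e.getKey());
        return total;
    }
}
```

With the example's numbers (true class A; soft label 60% A, 30% B, 10% C; costs A->A = 0, A->B = 1, A->C = 2), this returns 0.5.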
We should also generate an object-label-accuracy.txt report that will report:
a. The confusion matrix of each technique (dawid-skene maxlikelihood, dawid-skene soft, majority maxlikelihood, majority soft)
b. The average misclassification cost of each algorithm.
We should create an extra confusion matrix for each worker, which should be based solely on the assigned labels and the actual evaluation data. Then we can list the estimated quality of the worker based on the evaluation data, next to the estimates that we have for the confusion matrix, the quality of the worker, etc.
We should modify these files accordingly:
worker-statistics-detailed.txt
worker-statistics-summary.txt
To compute the evaluation-data confusion matrix of a worker, we go through the objects labeled by this worker; we check what is the evaluation label of each object and what is the label assigned by the worker. Based on these, we compute the confusion matrix of the worker, the quality (expected and optimized) of the worker, etc.
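A minimal sketch of this computation, assuming parallel arrays of evaluation labels and worker-assigned labels (names hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class WorkerEvalMatrix {
    /** Counts [evaluation label -> assigned label] pairs for one worker. */
    static Map<String, Map<String, Integer>> confusionMatrix(
            String[] evalLabels, String[] assignedLabels) {
        Map<String, Map<String, Integer>> cm = new HashMap<>();
        for (int i = 0; i < evalLabels.length; i++) {
            cm.computeIfAbsent(evalLabels[i], k -> new HashMap<>())
              .merge(assignedLabels[i], 1, Integer::sum);  // increment cell
        }
        return cm;
    }
}
```

Normalizing each row of these counts gives the worker's evaluation-based error rates, from which the expected and optimized quality scores follow.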
That is the simplest part. We should list in the priors.txt file not only the estimated priors, but also the actual priors, based on the prevalence of each category in the evaluation data.
Hello,
I'm working on my dissertation with Get-Another-Label, and I'd like to access objects, workers, and categories directly from DawidSkene instead of parsing the output text files. To do that, we just need to add some public getters. I'd be happy to contribute a patch, if that's OK. Thanks.
Daniel Zhou, PhD student
University of Michigan School of Information
Do a global search for class/variable/method names and replace accordingly. Minor issue.
When I try to follow the tutorial, I find there is no input.txt in BarzanMozafari. Where can I get this data?
Objects, Labels and Workers must be enforced to start with a lowercase alpha character.
Metrics should be uppercase
In #18, we implemented confusion matrices for the different classification methods. We used the evaluation data as the "from" side.
We can also do the same with the estimated confusion matrices, which capture the belief of the algorithm about its own performance. A well-calibrated result will report an estimated confusion matrix pretty close to the evaluation-based one.
To compute the estimated confusion matrix, we will use as "from" the category reported for DS-Soft and MV-Soft (for the DS_* and MV_* algorithms, respectively).
Since DS_Soft and MV_Soft are probability vectors and not a single category, we need a for loop that goes over the categories. Also, when we add the error to the confusion matrix (cm.adderror), we will add the product of the "from" probability multiplied by the "to" probability. (Right now we add only the "to" probability.)
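A sketch of the proposed update rule, with categories indexed by position (names hypothetical; the real cm.adderror signature may differ):

```java
public class SoftConfusion {
    /**
     * Accumulates one object's contribution into cm[from][to]: instead of
     * adding only the "to" probability, add the product of the "from" and
     * "to" probabilities over all category pairs.
     */
    static void addSoftError(double[][] cm, double[] softFrom, double[] softTo) {
        for (int from = 0; from < softFrom.length; from++)
            for (int to = 0; to < softTo.length; to++)
                cm[from][to] += softFrom[from] * softTo[to];
    }
}
```

Because each soft label sums to 1, every object still contributes a total mass of 1 to the matrix, so the row sums remain interpretable after normalization.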
We seem to have too many metrics being reported now in the summary and in the object-probabilities.
Perhaps we should report only a small fraction of those. I think the following metrics are more than enough:
Categories: 2
Objects in Data Set: 1000
Workers in Data Set: 83
Labels Assigned by Workers: 5000
[Number of labels] Labels per worker: 60.2410
[Gold Tests] Gold tests per worker: 1.8072
DataQuality_Estm_DS_Min
DataQuality_Eval_DS_Min [only when we pass an evaluation file]
WorkerQuality_Estm_DS_Min_n
WorkerQuality_Estm_DS_Min_w
WorkerQuality_Eval_DS_Min_n [only when we pass an evaluation file]
WorkerQuality_Eval_DS_Min_w [only when we pass an evaluation file]
We should similarly limit the number of columns in the results/ (object-probabilities, worker-statistics, etc)
Right now we display the averages of worker quality, treating each worker as equal to the others. However, since each worker contributes a different number of labels, we should also report the weighted average of worker quality. For that, we multiply each quality metric by the worker's "number of annotations" (i.e., assigned labels), and report the sum of these products divided by the total number of labels assigned by all workers.
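A minimal sketch of this weighted average (names hypothetical); note that with all weights equal to 1 it reduces to the plain average currently reported:

```java
public class WeightedQuality {
    /**
     * Weighted average of per-worker quality scores, weighted by the
     * number of labels each worker contributed.
     */
    static double weightedAverage(double[] quality, double[] numLabels) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < quality.length; i++) {
            num += quality[i] * numLabels[i];  // sum of quality * weight
            den += numLabels[i];               // total labels assigned
        }
        return num / den;
    }
}
```

For example, a worker of quality 1.0 with 3 labels averaged against a worker of quality 0.0 with 1 label yields 0.75, not the unweighted 0.5.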
It would be nice to see the complete confusion matrix of each technique, i.e., the types of errors made by each one.
It should be based on evaluation data.
Methods:
DS_MaxLikelihood
MV_MaxLikelihood
DS_MinCost
MV_MinCost
DS_Soft
MV_Soft
Is this project under the MIT license, the GPL, or some other open source license? Please include license information. I'd like to contribute patches directly to this project instead of sub-classing, but I'd like to make sure I can still legally use the code later on. Thanks.
MIT license: http://opensource.org/licenses/mit-license.php/
I guess adding a "license.txt" file under the root directory would be fine. Thanks.
We need to revise the documentation in https://github.com/ipeirotis/Get-Another-Label/wiki/How-to-Run-Get-Another-Label to reflect the changes made in Issue #1 and in particular #1 (comment)
Most of the links from this project are broken, including the standalone download page (https://github.com/ipeirotis/Get-Another-Label/downloads)
In the distribution zipball in the downloads, the Unix executable script in the bin/ folder is called get-another-label and appears to be a binary instead of a shell script. I would rename it get-another-label.sh to make clear that it is an executable script.
It seems strange to have the weighted and the unweighted worker-quality reports under different decorators (unweighted -> Worker, weighted -> DawidSkene).
I would recommend allowing the "weighted averages" functionality in the Decorators and SummaryReport classes.
We will need a getWeightedAverage method in SummaryReport. In the weighted average, we need to make the following changes compared to getAverage:
In fact, once you have a weightedAverage implementation, getAverage is just a special case where the weights are all equal to 1.