
lbjava's Introduction

Learning Based Java


Compiling the whole package

From the root directory run the following command:

  • Just compile all examples: mvn compile

  • Compile and train all examples: mvn compile -P train-all-examples. See more details here.

  • Test the whole project: mvn test

Compiling LBJava Core

  • mvn compile -pl lbjava

External Links

Here is LBJava's homepage.

Licensing

To see the full license for this software, see LICENSE or visit the download page for this software and press "Download". The next screen displays the license.

lbjava's People

Contributors

adamvollrath, bhargav, christos-c, cowchipkid, danr-ccg, kordjamshidi, mayhewsw, mssammon, shyamupa, slash0bz, yj14n9xyz


lbjava's Issues

Balas bug

From email:

I'm using your open-sourced implementation of the Balas algorithm. Thank you for making it open. It contains a bug at LBJ2/infer/BalasHook.java:321:

for (int i = 0; i < variables; ++i)
  if (negated[i]) {
    x[i] = 1 - x[i];   

On that last line, it must be solution instead of x, because x has already been copied into solution, so changes to x no longer affect solution. As a result, the solution accessed through getBooleanValue will be wrong (in the case of maximization with positive objective coefficients).
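To illustrate the reporter's point, here is a minimal self-contained Java sketch (hypothetical arrays, not the actual BalasHook code) showing why flipping x after it has been copied cannot affect the reported solution:

```java
public class AliasDemo {
    public static void main(String[] args) {
        int[] x = {0, 1, 0};
        boolean[] negated = {true, false, true};

        // Copy the current assignment into the reported solution,
        // as BalasHook reportedly does before un-negating variables.
        int[] solution = new int[x.length];
        System.arraycopy(x, 0, solution, 0, x.length);

        // Un-negating x afterwards has no effect on solution,
        // because arrays are copied by value here, not aliased.
        for (int i = 0; i < x.length; ++i)
            if (negated[i])
                x[i] = 1 - x[i];

        System.out.println(java.util.Arrays.toString(solution)); // prints [0, 1, 0]
        // The reporter's fix: apply the flip to solution instead of x.
    }
}
```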
Not a bug, but weird: LBJ2/infer/BalasHook.java:342

for (int j = 0; j < Ac.size(i); ++j)
  lhs += x[Av.get(i, j)] * Ac.get(i, j);

These two lines do nothing, because x contains only zeros at that point.

Alternate Model and Lexicon formats (Human-readable or diff-able)

We frequently see issues (especially in Saul) related to discrepancies in a model's performance after saving and loading it. The binary format makes it difficult to compare two models reliably.

The proposal is to allow alternate model and lexicon formats that are human-readable and diff-able (e.g., .json).
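As a sketch of what such a format could look like (the JsonModelSketch class and toJson helper below are hypothetical, not an existing LBJava API), a sorted, line-per-entry JSON dump makes two saved models directly comparable with diff:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class JsonModelSketch {
    // Hypothetical sketch: write a feature -> weight map as sorted,
    // one-entry-per-line JSON so saved models are human-readable and
    // two snapshots can be compared with a plain `diff`.
    static String toJson(Map<String, Double> weights) {
        return new TreeMap<>(weights).entrySet().stream()
            .map(e -> String.format("  \"%s\": %.6f", e.getKey(), e.getValue()))
            .collect(Collectors.joining(",\n", "{\n", "\n}"));
    }

    public static void main(String[] args) {
        Map<String, Double> w = Map.of("bias", 0.5, "word=dog", -1.25);
        System.out.println(toJson(w));
    }
}
```

Sorting the keys is what makes the output diff-able: the entry order no longer depends on hash-map iteration order.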

Not able to generate the lbj files via `mvn lbjava:generate`

Currently we generate the lbjava files via the 2nd step mentioned here.
It would be much better to change this slightly so that the files are generated via the lbjava-maven-plugin.

For example, could we do this in the root folder?

mvn -pl lbjava-examples lbjava:generate

Or this, inside the examples folder lbjava-examples?

mvn lbjava:generate

Getting Fatal Error while Executing Xgboost in R

I am using the xgboost function in R to build a model file.

For a dataset of around 1,000 rows, the function works correctly.

When I increased the dataset to around 300k rows with 400 features, I got a fatal error in R while executing the line below:

fit <- xgboost(data = traindata, label = NULL, verbose = 2, nthread = 2, eta = 0.3, nround = 10, objective = "multi:softmax", num_class = length(names) + 1, silent = 1)

Note: traindata is of type xgb.DMatrix.

I also tried varying the parameters eta and nround, but I still get a fatal error and the R session aborts.

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
LC_MONETARY=English_India.1252

[4] LC_NUMERIC=C LC_TIME=English_India.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.3.1
I also executed the same code in the RGui console, and got a runtime error there too.

What is the reason for this issue?

Why sort inside vectors

[LBJava-Examples] Issue in Entity Relation example datastructures

In the Entity Relation example data structures, calling .hashCode() on any of the items causes a StackOverflowError due to unbounded recursion.

  • ConllRawSentence - Calls .hashCode() on the tokens and relations in the sentence.
  • ConllRawToken - Calls .hashCode() on its sentence object.
  • ConllRelation - Calls .hashCode() on its two entity tokens and its sentence object.

This leads to unbounded recursion, so these instances cannot be added to a HashSet or used as HashMap keys.
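One conventional fix, sketched below with hypothetical Sentence/Token classes (not the actual Conll* data structures), is to break the cycle: a token hashes only its own stable fields and never follows the back-reference to its sentence.

```java
import java.util.Objects;

// Sentence hashes its tokens; each Token hashes only its own fields.
// Because Token.hashCode never calls Sentence.hashCode, there is no
// mutual recursion and no StackOverflowError.
class Sentence {
    final Token[] tokens;

    Sentence(String... words) {
        tokens = new Token[words.length];
        for (int i = 0; i < words.length; i++)
            tokens[i] = new Token(this, words[i], i);
    }

    @Override public int hashCode() {
        int h = 1;
        for (Token t : tokens) h = 31 * h + t.hashCode();
        return h;
    }
}

class Token {
    final Sentence sentence;  // back-reference kept, but NOT hashed
    final String word;
    final int index;

    Token(Sentence s, String w, int i) { sentence = s; word = w; index = i; }

    @Override public int hashCode() { return Objects.hash(word, index); }
}

public class HashDemo {
    public static void main(String[] args) {
        Sentence s = new Sentence("a", "b");
        System.out.println(s.hashCode() == new Sentence("a", "b").hashCode()); // prints true
    }
}
```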

remove weka dependency

Weka is licensed under GPLv2, which is not compatible with academic use. Can we remove, or at least separate out, the Weka dependency?

Add micros/macro average F1 to TestDiscrete

Whenever we use TestDiscrete it prints something like this:

[info]  Label   Precision Recall    F1   LCount PCount
[info] -----------------------------------------------
[info] false       96.874 100.000 98.412  61178  63152
[info] true       100.000   0.804  1.595   1990     16
[info] -----------------------------------------------
[info] Accuracy    96.875    -      -      -     63168

I am suggesting we add macro-averaged scores (simply averaging F1, precision, etc. over the labels) and micro-averaged scores (computed from counts pooled across all labels). This would save some time later.
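For reference, a minimal Java sketch of both averages (macro: unweighted mean of per-label F1; micro: F1 over pooled counts), using hypothetical per-label counts close to the output above:

```java
public class F1Averages {
    // F1 from raw counts: true positives, false positives, false negatives.
    static double f1(double tp, double fp, double fn) {
        double p = tp / (tp + fp), r = tp / (tp + fn);
        return (p + r == 0) ? 0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts for two labels ("false" and "true").
        double[] tp = {61178, 16}, fp = {1974, 0}, fn = {0, 1974};

        // Macro average: mean of the per-label F1 scores.
        double macro = (f1(tp[0], fp[0], fn[0]) + f1(tp[1], fp[1], fn[1])) / 2;

        // Micro average: F1 over counts pooled across labels.
        double micro = f1(tp[0] + tp[1], fp[0] + fp[1], fn[0] + fn[1]);

        System.out.printf("macro-F1 = %.3f, micro-F1 = %.3f%n", macro, micro);
        // prints: macro-F1 = 0.500, micro-F1 = 0.969
    }
}
```

Note the two can diverge sharply on imbalanced data, as here: the rare "true" label drags the macro average down while barely moving the micro average.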

Lexicon not thread-safe

The Lexicon class is not thread-safe: the lexicon HashMap inside it can, under some circumstances, change and grow, but there is no attempt to synchronize access to it.
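A minimal sketch of how such access could be made safe (the SafeLexicon class below is hypothetical, not the existing Lexicon API): ConcurrentHashMap.computeIfAbsent makes the look-up-or-assign-a-fresh-id step atomic, so concurrent feature extractors never hand out duplicate ids.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SafeLexicon {
    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    // Returns the id for a feature, atomically assigning a new one
    // the first time that feature is seen.
    public int lookup(String feature) {
        return ids.computeIfAbsent(feature, f -> next.getAndIncrement());
    }

    public static void main(String[] args) {
        SafeLexicon lex = new SafeLexicon();
        int a = lex.lookup("word=dog");
        int b = lex.lookup("word=dog");
        System.out.println(a == b); // prints true: same feature, same id
    }
}
```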

Lexicon equality

The Java spec requires that when two objects compare equal via equals, their hashCode values must also be equal. In the case of the Lexicon class, this contract is not upheld.
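A minimal sketch of a pair that does satisfy the contract (the FeatureKey class is hypothetical, for illustration only): derive equals and hashCode from exactly the same fields, so equal objects necessarily hash alike.

```java
import java.util.Objects;

public class FeatureKey {
    final String name, value;

    FeatureKey(String n, String v) { name = n; value = v; }

    // equals and hashCode both depend on (name, value) and nothing else,
    // which is what guarantees: a.equals(b) implies equal hash codes.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FeatureKey)) return false;
        FeatureKey k = (FeatureKey) o;
        return name.equals(k.name) && value.equals(k.value);
    }

    @Override public int hashCode() { return Objects.hash(name, value); }

    public static void main(String[] args) {
        FeatureKey a = new FeatureKey("word", "dog"), b = new FeatureKey("word", "dog");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // prints true
    }
}
```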

(Inference) To return zero scores rather than throwing error.

The lines here in ILPInference are problematic for Saul joint training when we start with classifiers whose labelLexicon is not yet complete (e.g., because they have not seen any examples), so the ILP inference cannot return any score. The default should be to return a zero score rather than throw an error. Can anyone address this issue? https://github.com/IllinoisCogComp/lbjava/blob/master/lbjava/src/main/java/edu/illinois/cs/cogcomp/lbjava/infer/ILPInference.java#L173

Mixed feature accuracy table

NewsGroup (table for single real feature)

Condition\Algorithm SparseAveragedPerceptron SparseWinnow PassiveAggressive SparseConfidenceWeighted BinaryMIRA
1 round w/o real features 48.916 92.597 19.038 33.739
1 round w/ real features 47.753 92.491 23.268 32.364
10 rounds w/o real features 82.390 91.539 24.802 76.891
10 rounds w/ real features 82.126 91.529 12.427 75.939
50 rounds w/o real features 84.823 91.592 14.120 77.208
50 rounds w/ real features 85.299 91.433 19.566 76.891
100 rounds w/o real features 85.828 91.433 12.956 76.574
100 rounds w/ real features 84.770 91.486 15.442 61.026

NewsGroup (table for the same amount of Gaussian random real features as discrete ones)

Condition\Algorithm SparseAveragedPerceptron SparseWinnow PassiveAggressive BinaryMIRA
1 round w/o real features 51.454 92.597 12.057 33.739
1 round w/ real features 17.980 6.081 14.913 14.225
10 rounds w/o real features 82.813 91.539 22.369 76.891
10 rounds w/ real features 52.829 42.517 45.743
50 rounds w/o real features 84.294 91.592 21.100 77.208
50 rounds w/ real features 75.727 67.054 75.198
100 rounds w/o real features 85.506 91.433 17.768 76.574
100 rounds w/ real features 77.631 74.828 74.194

Badges (table for single real feature)

Condition\Algorithm SparsePerceptron SparseWinnow NaiveBayes
1 round w/o real features 100.0 95.745 100.0
1 round w/ real features 100.0 95.745 100.0
10 rounds w/o real features 100.0 100.0 100.0
10 rounds w/ real features 100.0 100.0 100.0
50 rounds w/o real features 100.0 100.0 100.0
50 rounds w/ real features 100.0 100.0 100.0
100 rounds w/o real features 100.0 100.0 100.0
100 rounds w/ real features 100.0 100.0 100.0

Badges (table for same amount of constant real features as discrete features)

Condition\Algorithm SparsePerceptron SparseWinnow NaiveBayes
1 round w/o real features 100.0 95.745 100.0
1 round w/ real features 74.468 100.0 100.0
10 rounds w/o real features 100.0 100.0 100.0
10 rounds w/ real features 78.723 100.0 100.0
50 rounds w/o real features 100.0 100.0 100.0
50 rounds w/ real features 100.0 100.0 100.0
100 rounds w/o real features 100.0 100.0 100.0
100 rounds w/ real features 100.0 100.0 100.0

Badges (table for same amount of random Gaussian real features as discrete features)

Condition\Algorithm SparsePerceptron SparseWinnow NaiveBayes
1 round w/o real features 100.0 95.745 100.0
1 round w/ real features 55.319 56.383 100.0
10 rounds w/o real features 100.0 100.0 100.0
10 rounds w/ real features 62.766 100.0 100.0
50 rounds w/o real features 100.0 100.0 100.0
50 rounds w/ real features 74.468 87.234 100.0
100 rounds w/o real features 100.0 100.0 100.0
100 rounds w/ real features 86.170 100.0 100.0

make the setWeight for SparseNetwork public

I need to use setWeight in Saul, but it is not accessible here. I tried some workarounds using scaledAdd, but I see that has introduced other bugs... Could someone change this urgently, please?

Some suggested algorithms to add

Bug in AdaGrad implementation

AdaGrad does not increase the size of the weight vector while learning. The weight vector's dimensionality should grow when new features are seen during feature extraction on unseen training examples.

Cause:
https://github.com/IllinoisCogComp/lbjava/blob/master/lbjava/src/main/java/edu/illinois/cs/cogcomp/lbjava/learn/AdaGrad.java#L189

Example feature:

discrete MyTestFeature(MyData d) <- {
    return d.isCapitalized() ? "YES" : "NO";
}

For this example, the weight vector should have size 3 (YES, NO, and the bias term), but exampleFeatures.length is only 1 here.

Compare with implementation of StochasticGradientDescent.
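A hedged sketch of the missing step (the GrowingWeights class is hypothetical, not the actual AdaGrad or StochasticGradientDescent code): grow the weight array whenever an example references a feature index beyond its current size.

```java
import java.util.Arrays;

public class GrowingWeights {
    double[] w = new double[1];

    // Resize the weight vector so every feature index in the example
    // has a slot; new slots default to 0.0, which is the right
    // initialization for a weight never updated before.
    void ensureCapacity(int[] exampleFeatures) {
        int max = 0;
        for (int f : exampleFeatures) max = Math.max(max, f);
        if (max >= w.length)
            w = Arrays.copyOf(w, max + 1);
    }

    public static void main(String[] args) {
        GrowingWeights g = new GrowingWeights();
        g.ensureCapacity(new int[] {0, 1, 2}); // e.g. "YES", "NO", bias term
        System.out.println(g.w.length); // prints 3
    }
}
```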

Xalan from Stanford kills Java CUP

When I import core-utils and lbjava in the same project, Java CUP breaks and cannot work.
The reason is that core-utils brings in Stanford CoreNLP, which in turn brings in the Xalan jar.
Xalan and Java CUP do not go well together.

Tuning with respect to F1?

Is there a systematic way to tune a classifier's parameters (say, the output threshold) to maximize its F1?
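The thread has no built-in answer, but one common approach is a threshold sweep on held-out data: try each candidate score as the decision threshold, compute F1, and keep the best. A self-contained sketch (hypothetical code, not an LBJava API):

```java
public class ThresholdTuner {
    // Returns the threshold (drawn from the observed scores) that
    // maximizes F1 on the given held-out labels.
    static double bestThreshold(double[] scores, boolean[] gold) {
        double best = Double.NEGATIVE_INFINITY, bestF1 = -1;
        for (double t : scores) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < scores.length; i++) {
                boolean pred = scores[i] >= t;
                if (pred && gold[i]) tp++;
                else if (pred) fp++;
                else if (gold[i]) fn++;
            }
            // F1 = 2*tp / (2*tp + fp + fn), avoiding a separate P/R step.
            double f1 = (2.0 * tp) / (2.0 * tp + fp + fn);
            if (f1 > bestF1) { bestF1 = f1; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.4, 0.2};
        boolean[] gold = {true, true, false, false};
        System.out.println(bestThreshold(scores, gold)); // prints 0.8
    }
}
```

Only thresholds equal to observed scores need to be tried, since F1 is constant between consecutive scores.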

How to handle files ending with lbj when running program from cog comp of Illinois

Hi, I have downloaded the source code of one of your colleagues, which comes with some .lbj files:

http://cogcomp.cs.illinois.edu/page/resource_view/107

and so I installed lbjava from here. After finishing a Maven install of lbjava, I am still stuck.

The source comes with the following script:
java -Xmx5G -XX:MaxPermSize=5G -cp $CLASSPATH:class LBJ2.Main -sourcepath src -gsp lbj -d class article.lbj
javac -classpath $CLASSPATH:class -sourcepath lbj:src -d class src/esl/*.java

The first line of the script runs fine after I replace $CLASSPATH with the path to LBJ2.jar.

Now, when running the second line, one of the Java sources in src/esl/*.java fails to compile because it contains a call to a function in article.lbj.

May I know how to get the second line to succeed? Currently, $CLASSPATH in the second line is exactly the same as in the first, pointing to LBJ2.jar.

Any help would be appreciated.

Unable to compile LBJava-examples

Errors like:

[ERROR] could not parse error message:   symbol:   class BadgeClassifier                                                                                  
[ERROR] location: class FeatureWeightCalculator                                                                                                           
[ERROR] /home/christod/workspace/lbjava-cogcomp/lbjava-examples/src/main/java/edu/illinois/cs/cogcomp/lbjava/examples/badges/FeatureWeightCalculator.java:15: error: cannot find symbol                                                                                                                             
[ERROR] BadgeClassifier bc = new BadgeClassifier();     

license on top of files?

@mssammon @christos-c what do you think about adding this snippet on top of each file?

/*******************************************************************************
 * University of Illinois/Research and Academic Use License 
 * Copyright (c) 2016, 
 *
 * Developed by:
 * The Cognitive Computations Group
 * University of Illinois at Urbana-Champaign
 * http://cogcomp.cs.illinois.edu/
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal with the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 *
 * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimers.
 * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution.
 * Neither the names of the Cognitive Computations Group, nor the University of Illinois at Urbana-Champaign, nor the names of its contributors may be used to endorse or promote products derived from this Software without specific prior written permission.
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS WITH THE SOFTWARE.
 *     
 *******************************************************************************/

More details needed in the tutorial where it explains the `Classifier`s


Here is a summary I created from here. Perhaps this should be added in the form of tables:

  • extending Learner:
    StochasticGradientDescent
    WekaWrapper
    SupportVectorMachine
    SparseMIRA
    SparseNetworkLearner
    NaiveBayes
    MuxLearner
    LinearThresholdUnit
    AdaBoost
  • extending LinearThresholdUnit:
    SparseWinnow
    SparsePerceptron
    SparseConfidenceWeighted
    PassiveAggressive
  • extending SparsePerceptron:
    SparseAveragedPerceptron
  • extending SparseNetworkLearner:
    MultiLabelLearner
  • extending Normalizer:
    Softmax
    IdentityNormalizer

@YimingJiang wanna take this?

OJalgoHook is too verbose

There are lots of print statements that should be gated behind a debug mode.
Also, the message "Good news!: the optimizatin solution is optimal" is 1) too informal and 2) contains a spelling error (IntelliJ catches these 😄).
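A minimal sketch of the requested change (the QuietHook class is hypothetical, not the actual OJalgoHook): route every status message through a logger that is silent unless a debug flag is set.

```java
public class QuietHook {
    private final boolean debug;

    QuietHook(boolean debug) { this.debug = debug; }

    // Prints only in debug mode; returns whether the message was emitted.
    boolean log(String msg) {
        if (debug) System.out.println(msg);
        return debug;
    }

    boolean onSolved() { return log("The optimization solution is optimal."); }

    public static void main(String[] args) {
        System.out.println(new QuietHook(false).onSolved()); // prints false: silent by default
    }
}
```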

Using inference project as dependency

The inference package offers a nice abstraction over various ILP (and non-ILP) inference algorithms and hooks (just ported oj! there).

It would be nice to have LBJava's inference classes drawn from this package, since it'll be easier to update and maintain.

ILP inference fails when classifier is loaded from model files

I have an ILP Inference classifier that relies on a base classifier. When the classifier is initialized and trained, the ILP inference classifier works fine.

However, if I initialize the base classifier with pre-trained model and lexicon files, the ILP classifier gives me the error LBJava ERROR: Classifier relation_classifier did not return any scores. ILP inference cannot be performed.

The error message is produced from https://github.com/CogComp/lbjava/blob/434cf0a40e4f2ae08c96d3ae1b96f319eb531d67/lbjava/src/main/java/edu/illinois/cs/cogcomp/lbjava/infer/ILPInference.java#L173-L174

Getting rid of `compileLBJ.sh` script and using the lbj-maven-plugin

There is this script compileLBJ.sh in lbjava-examples/ which is used to generate the Java files for the examples inside this module, before the whole project can be compiled.

This is really a temporary solution, and we should be able to bypass this step with the lbjava-maven-plugin.

Proposal: remove the script and just use the lbjava maven plugin.

@mayhewsw do you think you could take a stab at this?

FYI @christos-c

copy dependencies necessary for java-cup?

The comment here says the copy-dependencies stage is necessary for Java CUP. I commented it out and ran mvn test, and nothing broke. Is it fine to drop this goal?

Why am I asking? Suppose we add another module on which lbjava depends (e.g., inference). If that dependency is not already deployed, copy-dependencies would fail.

Minor changes to documentation

  • Change all "LBJ"s to "LBJava"s in the master documentation.
  • Add proper documentation for the examples: what each example is, what its expected input, output, and performance are, and how to run it (some of these can be links to the master documentation).
