
inferbeddings's People

Contributors

pminervini, riedelcastro, rockt, tdmeeste

inferbeddings's Issues

arXiv submission

There are a few things that still need to be done that don't justify another paper, but would round things out and could be helpful in case of rejection.

What are these? For example, running on more datasets, and maybe more models.

More faithful implications

One potential problem with the current formulation is that the implication/clause losses aren't fully consistent with their logical counterparts. In logic, a clause is automatically true when its body is false. In our formulation, a body can have a very low score (effectively false), but the head score still needs to be larger than the body score. For example, if the body score is -15 and the head score is -20, the model would still need to push the head score up hard, even though at this point both expressions are essentially false.

Notice that this reasoning only really makes sense with the standard loss, not with the pairwise loss. The problem with the pairwise loss is that one loses any sense of "trueness"...
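
One way to make the loss more faithful would be to gate it on the body actually being "true". A minimal sketch, assuming a score threshold tau (both the gating scheme and the names here are hypothetical, not current code):

import tensorflow as tf

# Hypothetical sketch: only penalise an implication body => head when the
# body is "effectively true", i.e. when its score exceeds a threshold tau.
# body_scores, head_scores: [batch] tensors of clause body/head scores.
def masked_implication_loss(body_scores, head_scores, tau=0.0):
    violation = tf.nn.relu(body_scores - head_scores)  # head should score at least as high as the body
    mask = tf.cast(body_scores > tau, tf.float32)      # zero out clauses whose body is "false"
    return tf.reduce_sum(mask * violation)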

Compare with KALE

At the moment we do not have any comparison with results published in other papers.

A solution is to compare directly against KALE, using their training/validation/test sets: http://www.aclweb.org/anthology/D/D16/D16-1019.pdf

Their model is based on TransE: with a model based on ComplEx we should be able to show some improvements.

Integrate Sampled Grounding as alternative

This could be done through a refactoring of the code in which the "optimise adversarial" step can be filled in by sampling from the actual entity embeddings. Another view is to enable optimisation from grounded initialisation, and then do 0 steps of optimisation for old-school grounding (though this isn't quite as elaborate as the sampling scheme in NAACL).
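
A minimal sketch of what the sampling-based step could look like, assuming a violation function over candidate (subject, object) embedding pairs (all names here are hypothetical):

import numpy as np

# Hypothetical sketch: instead of optimising adversarial embeddings,
# sample candidate violators directly from the entity embedding matrix.
# entity_embeddings: [num_entities, dim]; violation(xs, ys) returns a
# violation score per sampled (subject, object) pair.
def sample_grounding(entity_embeddings, violation, nb_samples=100, seed=0):
    rng = np.random.RandomState(seed)
    n = entity_embeddings.shape[0]
    xs = entity_embeddings[rng.randint(n, size=nb_samples)]
    ys = entity_embeddings[rng.randint(n, size=nb_samples)]
    scores = violation(xs, ys)
    best = int(np.argmax(scores))  # most violating sampled grounding
    return xs[best], ys[best]

Grounded initialisation with 0 optimisation steps then corresponds to taking the sampled pairs as-is, without the search for the maximal violator.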

Blog Post

Regardless of whether the UAI submission comes through, I think we should publish the paper on arXiv at some point soon. For this, I think it's extremely important to do a bit of PR. I have been thinking about a blog post that:

  1. gives a short overview and motivation for KBP
  2. presents neural link predictors
  3. presents our method
  4. shows a simple 2D interactive animation of the adversarial learning process (and of the default learning process), using a simple model and problem. If we use a model like DistMult, we should be able to implement this relatively easily in JavaScript (a minimal scorer sketch appears at the end of this issue). I have some ideas for this based on my talk slides.

We should be careful not to overdo this, and make sure this comes out with the paper. But I do think there is a "gap in the market" regarding nice neural link prediction articles and visualisations, and I think it's doable.

Is there a standard place where people put such posts (besides distill.pub)? We could just upload it to the uclmr webpage, I guess.
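
For reference, the scorer the animation needs is tiny. A minimal DistMult scorer (NumPy here for illustration; the post itself would use JavaScript):

import numpy as np

# DistMult scores a triple (s, r, o) with a trilinear dot product:
# score(s, r, o) = sum_i e_s[i] * w_r[i] * e_o[i]
def distmult_score(e_s, w_r, e_o):
    return float(np.sum(e_s * w_r * e_o))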

Experiments (first AMIE+ phase)

Which datasets, rules, settings, hyperparameter ranges, hyperparameter search methods, etc.?

IIRC @tdmeeste was looking into datasets and rules (especially some criterion for generating/selecting the rules to include in the process) - how is it going?

Experiments & Datasets

Note: experiments on FB15k can be a bit slow (they take several hours); e.g. try this:

$ python3 ./bin/adv-cli.py --train data/fb15k/freebase_mtr100_mte100-train.txt --valid data/fb15k/freebase_mtr100_mte100-valid.txt --test data/fb15k/freebase_mtr100_mte100-test.txt --clauses data/fb15k/clauses/clauses_0.999.pl --nb-epochs 100 --lr 0.1 --nb-batches 10 --model TransE --similarity l2 --margin 1 --embedding-size 150 --adv-lr 0.1 --adv-init-ground --adversary-epochs 0 --discriminator-epochs 10 --adv-weight 1000 --adv-batch-size 1

Consider using other datasets, e.g. YAGO or DBpedia.

ICML Paper Title & Abstract

I would like us to tune the high-level story of the paper a little. One way is to sell this just as "better rule injection". Another is to think more about "Reasoning with Low Rank Representations", or "Reason and Represent", etc. I think there is a deeper story, in which rule weight learning would fit in well. Let's use this space to hone our story. I will add one version of an abstract and title here later.

Adapt Algorithm to new formulation

I copied over the algorithm environment from the old version. Needs adaptation.

This would also include generalising the "entity normalisation" step, which may use a unit ball, a unit box, or nothing at all (instead relying on regularisation). Alternatively, we can avoid generalising, and explicitly say in the text body that other mechanisms are possible but we stick to one for simplicity of exposition.
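
For concreteness, the three variants could look as follows; a minimal NumPy sketch, assuming the unit box is [0, 1]^k as in the lifted-loss setting:

import numpy as np

# Hypothetical sketch of the three "entity normalisation" variants:
def project_entities(emb, mode="unit-ball"):
    if mode == "unit-ball":
        # rescale rows with norm > 1 back onto the unit ball
        norms = np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1.0)
        return emb / norms
    if mode == "unit-box":
        # clip each coordinate into [0, 1]
        return np.clip(emb, 0.0, 1.0)
    return emb  # no projection; rely on regularisation instead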

Decide the best ruleset for FB15k

For generating candidate rule sets, try e.g.:

$ ./tools/amie-to-clauses.py -t 0.9 data/fb15k/rules/fb15k-rules_mins=1000_minis=1000.txt

Closed form adversaries - next steps

Next steps:

  • add a first section to closed_form.text with a table containing all currently written out closed forms.
  • implement some of the closed form expressions
    • First: DistMult, unit box, simple implications (then we can redo the EMNLP experiments with the lifted loss); see the sketch after this list
    • Simple implications with the unit-ball restriction: we expect this will work less well. Maybe we won't have to derive all of the more complicated unit-ball formulas.
    • Then: all the others. Especially try them out on synthetic data, as an extra test besides verifying the maths. Let's do this soon!
  • write out remaining closed form expressions
    • alternative ComplEx case, as in the other paper
    • implications with conjunctions
    • symmetry clauses
    • implications with inversion of arguments
    • Transitivity clause with a single relation
    • ComplEx, general transitive clause
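
As a concrete instance of the first item above, here is a sketch of the closed form for a simple implication p(X, Y) => q(X, Y) under DistMult with entities in the unit box [0, 1]^k (my derivation, to be double-checked against the write-up): the violation score_p(x, y) - score_q(x, y) = sum_i (p_i - q_i) x_i y_i decomposes per dimension, and each x_i y_i ranges over [0, 1], so the maximum is attained coordinate-wise:

import numpy as np

# Hypothetical sketch: maximal implication violation for DistMult on the
# unit box, maximised in closed form over all entity pairs (x, y):
#   max_{x, y in [0,1]^k} sum_i (p_i - q_i) * x_i * y_i = sum_i max(0, p_i - q_i)
def lifted_implication_loss(r_p, r_q):
    return float(np.sum(np.maximum(r_p - r_q, 0.0)))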

Visualisations of Adversarial Inference Mechanism

I think there are some ways to visualise, in the paper, how the method works. We should design one. Here is one option (a plotting sketch follows the list):

  • Consider link prediction with a single relation r, and some small number of entities
  • Inject the transitivity formula
  • After training the discriminator, project entities to 2D space. Plot edges between them if the discriminator thinks they are related.
  • After training the generator, plot the 3 points it found to violate the transitivity clause the most, again in the 2D graph. Ideally they lie somewhere close to real entities that violate the clause
  • Iterate
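
A minimal sketch of the projection-and-edges steps (PCA via SVD; score_fn and the threshold tau are hypothetical stand-ins for the trained discriminator):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sketch: project entities to 2D with PCA and draw an edge
# (i, j) whenever the discriminator scores r(i, j) above a threshold.
def plot_relation(entity_emb, score_fn, tau):
    centred = entity_emb - entity_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pts = centred @ vt[:2].T  # 2D PCA projection
    plt.scatter(pts[:, 0], pts[:, 1], c="grey")
    n = pts.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and score_fn(i, j) > tau:  # predicted link i -> j
                plt.plot(pts[[i, j], 0], pts[[i, j], 1], "b-", alpha=0.3)
    plt.show()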

We can use this space to discuss other ideas.

Learning "Rule Weights"

I believe that our framework can be extended relatively easily to learn rule weights. I feel this is low-hanging fruit, and may lead to better results without having to worry about where to get the rules from. If @tdmeeste or @rockt have any cycles, maybe this is something for them to look at. If we have the datasets already prepared, it's a matter of extending the TF loss. I have some ideas of how the loss would look (see the sketch after the todo list below). Maybe I'll find some time to hack this in as well.

Generally, I am looking for low-hanging fruit that adds heft to the paper and makes us less reliant on improvements from rule injection (which may or may not materialise).

Todo:

  • adapt syntax for clause parser to define rule weights and learnable rule weights
  • implement weighted loss (and negated weighted loss)
  • provide way to easily print out weights per clause (might need dictionary from clauses to variables)
  • run on synthetic dataset with partial transitivity to validate whether a non-0.5 weight is learned.
  • "dynamic weights" based on relation representations

Collect Hypotheses to Test

For the paper and the experiment section, it would be good to be precise about the hypotheses we'd like to test, and how to test them. Here is a start:

  • Adversarial learning is more efficient than random sampling (NAACL). Tests:
    • lower ground-rule violation after the same amount of training time (or less time for the same ground-rule violation count), ideally on real and synthetic datasets
    • better accuracy after the same amount of time (this test somewhat conflates things)
  • Adversarial learning is more general than the EMNLP approach ...
  • and this generality is useful in practice
    • Test: show some improvements using types of formulae not supported by EMNLP, ideally over the SOTA but at least for ZSL
  • Adversarial learning for rules works: by finding "synthetic" violators and pushing them down, real violators disappear (presumably because they are similar to the synthetic violators)

Feel free to comment, edit and add more...

With DistMult or ComplEx, enforcing p => q results in emb(p) ~ emb(q) and 50% ground errors

Here's the code for replicating the issue:

$ ./bin/adv-cli.py --train data/synth/simple-tiny/data.tsv --lr 0.1 --model DistMult --similarity dot --margin 1 --embedding-size 30 --nb-epochs 1000 --clauses data/synth/simple-tiny/clauses.pl --adv-lr 0.1 --adv-ground-samples 100 --adv-weight 1000000 --adversary-epochs 10 --discriminator-epochs 1 --debug

Here are the embeddings of p and q after some epochs (if you run the code, the output is a coloured Hinton diagram):

┌──────────────────────────────────────────────────────────────────────────────────────────┐
│ ▇  ▁  ▅     ▂  ▂  ▂  ▁  ▄        █  ▂  ▅  ▇  ▃  ▂     ▂  ▄  ▃  ▃  ▆  ▇  ▇  ▁  ▃  ▄  █  ▁ │
│ ▇  ▁  ▅     ▂  ▃  ▂  ▁  ▄     ▁  █  ▂  ▅  ▇  ▃  ▂     ▂  ▄  ▃  ▃  ▆  ▇  ▇  ▂  ▃  ▄  █  ▁ │
└──────────────────────────────────────────────────────────────────────────────────────────┘

Using TransE results in ~0% ground errors, for some reason.

Other Generator Distributions

Currently, we use only point-mass distributions for the generator. This means there is a bit of a disconnect between what we do and the more typical GAN applications. That's completely fine. However, to make this connection stronger in the paper, I'd recommend trying some more standard GAN approaches as well. They don't have to work better, so this is relatively fail-safe; we just want to compare.
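
For instance, a Gaussian generator over adversarial entity embeddings, trained with the reparameterisation trick. A minimal TF sketch; violation_fn stands in for the clause-violation score we already use for point-mass adversaries:

import tensorflow as tf

# Hypothetical sketch: a Gaussian generator instead of a point mass.
dim, nb_samples = 100, 16                   # illustrative sizes
mu = tf.Variable(tf.zeros([dim]))           # generator mean
log_sigma = tf.Variable(tf.zeros([dim]))    # generator log standard deviation

def expected_violation(violation_fn):
    eps = tf.random_normal([nb_samples, dim])
    samples = mu + tf.exp(log_sigma) * eps  # reparameterised samples
    # Monte Carlo estimate of the expected clause violation (to be maximised)
    return tf.reduce_mean(violation_fn(samples))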

Results (08/02/2017)

Some early results are available here:

http://data.neuralnoise.com/inferbeddings/logs_08022017.tar.gz

Just decompress the file in the inferbeddings directory.

Those results are generated by jobs on the UCLCS cluster - the scripts generating the jobs have a UCL_ prefix and are available here:

https://github.com/uclmr/inferbeddings/tree/master/scripts/wn18
https://github.com/uclmr/inferbeddings/tree/master/scripts/fb15k

For checking the results, I've written a script that:

  • Looks for the best hyperparameter settings for each metric (in the filtered setting, as in the ComplEx paper) on the validation set, and
  • Reports the corresponding results on the test set.

For example - results on WN18 with and without including rules:

  • With rules:
$ ./tools/parse_results_filtered.sh logs/ucl_wn18_adv_v1/*.log
1080
Best MR, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=1_adv_epochs=1_adv_lr=0.1_adv_weight=100_batches=10_disc_epochs=10_embedding_size=200_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l2.log
Test - Best Filt MR: 140.9154

Best MRR, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=1_adv_epochs=10_adv_lr=0.1_adv_weight=10000_batches=10_disc_epochs=10_embedding_size=100_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt MRR: 0.493

Best H@1, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=1_adv_epochs=10_adv_lr=0.1_adv_weight=10000_batches=10_disc_epochs=10_embedding_size=100_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@1: 32.78%

Best H@3, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=10_adv_epochs=10_adv_lr=0.1_adv_weight=100_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@3: 84.57%

Best H@5, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=10_adv_epochs=10_adv_lr=0.1_adv_weight=100_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@5: 90.78%

Best H@10, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=10_adv_epochs=10_adv_lr=0.1_adv_weight=100_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@10: 93.06%

  • Without rules:

$ ./tools/parse_results_filtered.sh logs/ucl_wn18_adv_v1/*_adv_weight=0_*.log
180
Best MR, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=1_adv_epochs=0_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=10_embedding_size=200_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l2.log
Test - Best Filt MR: 146.8016

Best MRR, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=100_adv_epochs=1_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt MRR: 0.372

Best H@1, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=100_adv_epochs=1_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=1_embedding_size=20_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l2.log
Test - Best Filt Hits@1: 16.62%

Best H@3, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=100_adv_epochs=1_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@3: 60.31%

Best H@5, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=100_adv_epochs=1_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@5: 70.16%

Best H@10, Filt: logs/ucl_wn18_adv_v1/ucl_wn18_adv_v1.adv_batch_size=1_adv_epochs=1_adv_lr=0.1_adv_weight=0_batches=10_disc_epochs=10_embedding_size=50_epochs=100_lr=0.1_margin=1_model=TransE_optimizer=adagrad_similarity=l1.log
Test - Best Filt Hits@10: 79.39%

Please note that the experiments in logs/ucl_fb15k_adv_v?.2 are still running (and most log files are incomplete). Those are experiments with a new ruleset I'm trying for FB15k, using clauses with higher support (the minimum support here is 1000 instead of 100): this is related to #11.
