
ukge's Introduction

Embedding Uncertain Knowledge Graphs

This repository includes the code of UKGE and data used in the experiments.

Install

Make sure your local environment has the following installed:

Python3
tensorflow >= 1.5.0
scikit-learn

Install the dependencies using:

pip install -r requirements.txt

Run the experiments

To run the experiments, use:

python ./run/run.py

or

python ./run/run.py --data ppi5k --model rect --batch_size 1024 --dim 128 --epoch 100 --reg_scale 5e-4

You can use --model logi to switch to the UKGE(logi) model.

Data is available at: https://drive.google.com/file/d/1UJQ8hnqPGv1O9pYglfNF5lY_sgDQkleS/view?usp=sharing

Reference

Please refer to our paper: Xuelu Chen, Muhao Chen, Weijia Shi, Yizhou Sun, Carlo Zaniolo. Embedding Uncertain Knowledge Graphs. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019.

@inproceedings{chen2019ucgraph,
    title={Embedding Uncertain Knowledge Graphs},
    author={Chen, Xuelu and Chen, Muhao and Shi, Weijia and Sun, Yizhou and Zaniolo, Carlo},
    booktitle={Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI)},
    year={2019}
}

ukge's People

Contributors

stasl0217

ukge's Issues

python run.py error

Thank you for sharing your code and data. When I run 'python run.py', it gives me a 'No train.csv' error. The Google Drive link provided in the README does not contain those data files. Where can I get them? Thank you.

Hyperparams for Table 4

Hi, can you please share the hyperparameters that were used to obtain the results of the UKGE(logi) and UKGE(rect) models in Table 4 of the paper?

Some problems in data

More than 2,000 triples appear in both train.tsv and test.tsv with different confidence scores.
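Such overlaps can be detected with a small script along these lines (a sketch, not part of the repository: it assumes tab-separated files with head, relation, tail, and confidence columns, and the helpers `load_triples` and `conflicting_triples` are hypothetical names):

```python
import csv

def load_triples(path):
    """Read a tab-separated file into a dict: (head, rel, tail) -> confidence."""
    triples = {}
    with open(path) as f:
        for head, rel, tail, conf in csv.reader(f, delimiter="\t"):
            triples[(head, rel, tail)] = float(conf)
    return triples

def conflicting_triples(train, test):
    """Triples present in both splits but with differing confidence scores."""
    return {t for t in train.keys() & test.keys() if train[t] != test[t]}
```

Running `conflicting_triples(load_triples("train.tsv"), load_triples("test.tsv"))` would list exactly the triples the issue describes.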

psl batch size

Hi!

soft_h_index, soft_r_index, soft_t_index, soft_w_index = self.batchloader.gen_psl_samples() # length: param.n_psl

In every epoch of training, for each batch, a PSL batch of random triple(s) is generated, and its default size is 1. What is the exact PSL batch size used for the experiments in the paper? And why is it randomly generated?

Negative sampling size?

What are the negative sampling sizes for the best hyper-parameter combinations given in the paper?

Also, the parameter n_neg is documented as "Number of negative samples per (h,r,t)" in the code, in run.py.
However, in the implementation both the head and the tail are corrupted separately for each triple, which yields twice that many, i.e. 2 * n_neg negative samples per (h,r,t).
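The doubling can be seen in a minimal standalone sketch (this is my own illustration, not the repository's code; the function name `corrupt` and the use of NumPy's `Generator` are assumptions):

```python
import numpy as np

def corrupt(h, r, t, n_neg, num_entities, rng):
    """Corrupt head and tail separately: yields 2 * n_neg negatives per positive."""
    neg_heads = rng.integers(0, num_entities, size=n_neg)  # replacements for h
    neg_tails = rng.integers(0, num_entities, size=n_neg)  # replacements for t
    return ([(int(h2), r, t) for h2 in neg_heads]     # head-corrupted (h', r, t)
            + [(h, r, int(t2)) for t2 in neg_tails])  # tail-corrupted (h, r, t')

rng = np.random.default_rng(0)
negatives = corrupt(3, 1, 7, n_neg=10, num_entities=100, rng=rng)
print(len(negatives))  # 20: one positive produces 2 * n_neg negatives
```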

A question about softlogic.tsv

Hi, thank you for sharing your code and data!
I would like to know how you obtained the data in softlogic.tsv. Did you use the PSL implementation from LINQS to compute the values, or did you write your own logic? Since the LINQS implementation of PSL is fairly complex, and the method in your paper does not use many of its features, I would like to know whether you have a simpler way to implement this restricted form of PSL.
Looking forward to your reply!

Confusion about the computation of nDCG

def ndcg(self, h, r, tw_truth):

In the calculation of iDCG, the optimal ranking is assumed to be the natural-number sequence 1, 2, 3, 4, ...
When the code computes the actual rank of a tail entity, it counts how many entities score higher than the target entity. However, if several tail entities of a query (a head entity plus a relation) share the same score, their ranks tie, the optimal ranking is no longer the sequence 1, 2, 3, 4, ..., and the resulting nDCG may exceed 1.

I'm not sure if this problem exists in the code, thanks.
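The tie effect can be reproduced with a short sketch that mirrors the ranking rule described above (rank = 1 + number of strictly higher scores) against an iDCG built from the sequence 1, 2, 3, ...; the function `ndcg_with_ties` is my own illustration, not the repository's code:

```python
import numpy as np

def ndcg_with_ties(scores, gains):
    """Rank each gold tail as 1 + #(strictly higher scores); ties share a rank."""
    ranks = np.array([1 + np.sum(scores > s) for s in scores])
    dcg = np.sum(gains / np.log2(ranks + 1))
    # iDCG assumes the ideal ranking is the natural-number sequence 1, 2, 3, ...
    idcg = np.sum(np.sort(gains)[::-1] / np.log2(np.arange(2, len(gains) + 2)))
    return dcg / idcg

# Three gold tails with identical scores all receive rank 1, so DCG > iDCG.
print(ndcg_with_ties(np.array([0.8, 0.8, 0.8]), np.array([1.0, 1.0, 1.0])) > 1)  # True
```

With distinct scores the ranks are 1, 2, 3 and the ratio is exactly 1, which is the behavior one would expect.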

About the importance of PSL

Hi, thank you for sharing your code and data.

I am curious about the importance of PSL, and I conducted several experiments on the ppi5k data provided here. I find that the lower the value of self._p_psl in models.py, the higher the nDCG for both the linear and exp versions of UKGE(rect).

For example, training with early stopping, batch_size=1024, embedding_dim=128, and _p_psl=0.2, I get nDCG_linear = 0.951133 and nDCG_exp = 0.950328; with _p_psl=0, I get nDCG_linear = 0.960531 and nDCG_exp = 0.50403.

I think this contradicts the results reported in the paper:
http://web.cs.ucla.edu/~yzsun/papers/2019_AAAI_UKG.pdf

In fact, I think PSL should be beneficial, since it describes uncertainty through rules. But the results above puzzle me. Can you shed some light on this?

About negative sampling in training process

UKGE corrupts a training batch with the following implementation:

def corrupt_batch(self, h_batch, r_batch, t_batch):

    def corrupt_batch(self, h_batch, r_batch, t_batch):
        N = self.this_data.num_cons()  # number of entities

        # head-corrupted negatives: random indices without filtering
        neg_hn_batch = np.random.randint(0, N, size=(self.batch_size, self.neg_per_positive))
        neg_rel_hn_batch = np.tile(r_batch, (self.neg_per_positive, 1)).transpose()  # copy relations
        neg_t_batch = np.tile(t_batch, (self.neg_per_positive, 1)).transpose()

        # tail-corrupted negatives
        neg_h_batch = np.tile(h_batch, (self.neg_per_positive, 1)).transpose()
        neg_rel_tn_batch = neg_rel_hn_batch
        neg_tn_batch = np.random.randint(0, N, size=(self.batch_size, self.neg_per_positive))

        return neg_hn_batch, neg_rel_hn_batch, neg_t_batch, neg_h_batch, neg_rel_tn_batch, neg_tn_batch

However, is it possible for this unfiltered random sampling to draw true positive triples as negatives, which could hinder learning?
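A common remedy for this, though not necessarily what this repository does, is filtered negative sampling: resample until the corrupted triple is not a known positive. A minimal sketch for a single tail corruption (the helper `corrupt_tail_filtered` is hypothetical):

```python
import numpy as np

def corrupt_tail_filtered(h, r, t, num_entities, known_triples, rng):
    """Resample the tail until the corrupted triple is not a known positive."""
    while True:
        t_neg = int(rng.integers(0, num_entities))
        if (h, r, t_neg) not in known_triples:
            return (h, r, t_neg)

# Toy example: entities 0..9, and every tail except 9 is a known positive for (0, 0, ?),
# so the only valid negative is (0, 0, 9).
known = {(0, 0, i) for i in range(9)}
rng = np.random.default_rng(0)
print(corrupt_tail_filtered(0, 0, 5, 10, known, rng))  # (0, 0, 9)
```

For uncertain KGs the trade-off is subtler, since "known positives" carry confidence scores rather than binary labels, which may be why the unfiltered variant was used.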

Is the data preprocessing code open source?

Hi there. I noticed that the processed data, i.e., the triples represented by entity and relation IDs, already exists in this repo and can conveniently be used to evaluate the model's performance.
However, for further exploration I would like to understand the data preprocessing. Could the code that converts the original data into this processed form be open-sourced?
Thank you for your contribution to the community.

Test sets missing

In the paper, it is mentioned as:
" To test if our model can correctly interpret negative links, we add the same amount of negative links as existing relation facts into the test sets."
Where are these test sets, or could you share how you produced them?
Unfortunately, I couldn't reproduce any of your results.
