uma-pi1 / kge Goto Github PK

View Code? Open in Web Editor NEW

755.0 755.0 124.0 1.89 MB

LibKGE - A knowledge graph embedding library for reproducible research

License: MIT License

Python 98.30% Shell 1.70%

kge's People

Contributors

Stargazers

Watchers

Forkers

nzteb adrianks cthoyt-forks-and-packages yangzhou12 samuelbroscheit shashankg7 vonvogelstein rhatem semanticbeeng fardon apoorvumang tsafavi dbcrwk zahramahani cs-dlut cuent allensmile pbloem freekang namisan bobycv06fpm unmeshvrije saatviga-healx fighting41love saroyehun guixiangyu1 hagridthick lemonqc sidewinderak47 chqlee leiloong deng-fankang vbharadwaj-bk diison nluedema moqingxinai alexrenz lfxx123 daocalendar lb0828 sfschouten team-project-okg ngo010 aihill sjvkishore wangdongde rishirelan assassinsurvivor aspirincode adbmd maybeee18 nitishajain jelly-pawpaw zeta1999 gaoxyingm mayo42 huamichaelchen beyondacm manojprabhakar chandrahasd jerry185 nicoledjy sallychang walker-zd helenswate rui9812 tillgeissler pwforks lj171 moon-xm lezhongwen matthewgleeson ashispapu miaohu-pro owenyoung75 forthappy anddoit lukasdegitz iml4e tauboga208 kangfengjian jiehu-cv sailfish009 simonpre filco306 sailera19 oliverry aylerh at0m1x seanahmad aahmadai mmamun1 khinepisi xiaoqiongxia lindakwan fniesel kenkoko bhanu77prakash johnson7788 tinginde

kge's Issues

Generated config should only include options relevant to the model

E.g., no need to have negative_sampling settings in the config when training a model with 1toN.

Code review: reimplementation of auto_initialization

Please have a look at 6e6f004. auto_initialization is now a type of initializer. In practice this essentially means the auto_initialize flag which existed in some models has been replaced with lookup_embedder.initialize="lookup_embedder". I think nothing has been broken, except that old config files with the auto_initialize flag will no longer work.

Also, Samuel suggested to export all auto_initialization logic to a separate function in, say, util.py. But this would require considerations for the differences across models.

Provide "packaged" models

When sharing models, we currently need to share both dataset and checkpoint. For prediciton, the dataset is solely used to obtain a mapping between entity and relation indexes and their ids or mentions, however.

A better approach may be to support "packaged models", where a package contains the checkpoint and just the relevant part of the dataset (which is much smaller than the entire dataset). With this, models can be deployed right away without having to have the dataset around.

Computation of penalty terms

Commit e667cf9 changes the way penalties are interpreted for many models. The penalty term is currently computed only once per embedding, but with this change it's computed twice if subject and object embedder are the same (a common case). Instead of calling penalty twice, the code should check whether they are the same and, if so, call penalty only once.

Improve scalability of unweighted regularization

Right now, weighted regularization (which regularizes batch entities) scales better then unweigthed regularization (which regularizes all entities) when negative sampling is used.

Whether we regularize only the batch entities or all entities, and whether regularization is unweighted or weighted should be done independently.

Code review of negative sampling

Sampler:

Merge _sample and call (no need to have both)
Use config keys negative_sampling.sampler (for the type) and negative_sampling.num_e. We can add num_{s,r,o} later in case we really need it. Document the keys in the config file.
The sampling API does not look ideal. I suggest to give it a batch directly (nx3 tensor) and should output the samples in an (nx (3+2*num_e)) tensor. This allows for a faster implementation.

Negative sampling job:

Do not call model.prepare_job and set is_prepared (done in superclass)
For collate, I suggest columns s,p,o,neg_s,neg_o (n x 3+2num_e tensor). This would then match the sampler when changed as above. Also, no labels needed any more (each row has exactly one positive and 2num_e negatives).
As for computing the loss with ranking, it may be most efficient to use a custom loss function.

Loss:

Move NotImplementedError to constructor.
I suggest to have compute_1toN and compute_negative_sampling functions (with different arguments and a separater documentation).

Improve error message for config typos

For a config like

job.type: train
dataset.name: toy
model: complex

complex:
  relation_embedders: # <--- typo relation_embedder*s* 
    regularize_args:
      p: 1

the error message is very technical

KeyError: 'complex.relation_embedders.regularize_args cannot be set because creation of complex.relation_embedders is not permitted'

which could be improved to say

KeyError: 'Key "complex.relation_embedders" does not exist. Parent key "complex" does not allow creation of new keys. If the creation of "complex.relation_embedders" was intended, then "complex" should have the "+++" attribute.'

Rewrite entity ranking evaluation to sample sp/po pairs and add loss

https://github.com/rufex2001/kge/blob/9d83e43f5085e4a0d30d70536e4c1772389907cd/kge/job/entity_ranking.py#L72

Support evaluation directly on a checkpoint

Right now, only possible with folder structure as used in training.

Problem with ConvE and "KgeModel.load_from_checkpoint(checkpoint)"

I'm having some problems but I cannot get fully behind it and I'm not sure if Iam doing something wrong so I rather ask for help.

Training toy-conve-train.yaml and then loading it from any checkpoint with KgeModel.load_from_checkpoint(checkpoint) results in a long torch error about a dimension mismatch (Resuming the ConvE job with the normal resume functionality is no problem and does work, it does not use this function).

The error is raised here https://github.com/uma-pi1/kge/blob/master/kge/model/kge_model.py#L368

I could find out that this config
https://github.com/uma-pi1/kge/blob/master/kge/model/kge_model.py#L364
seems to be different in the key "entity_embedder.dim" from the original config with which the job was run. For example, when I substitute it and by hand load the original config, then it seems to work.
Somehow the problem might be connected with the lines here
https://github.com/uma-pi1/kge/blob/master/kge/model/conve.py#L116
because I tried to outcomment them and it seemed to work.

Do not have one shared entity_embedder hard coded into KgeModel

https://github.com/rufex2001/kge/blob/bdea002209f3c9ce1e510c5d8ee61dd1eebb10d5/kge/model/kge_model.py#L33
https://github.com/rufex2001/kge/blob/bdea002209f3c9ce1e510c5d8ee61dd1eebb10d5/kge/model/kge_model.py#L39
https://github.com/rufex2001/kge/blob/bdea002209f3c9ce1e510c5d8ee61dd1eebb10d5/kge/model/kge_model.py#L41
...

Please have a look at my implementation ...

Numba support for frequency-based sampling

See discussion in #64.

Implement freex model

Currently working for two blocks only.

Use int instead of long to save memory

E.g. for entity/relation ids in large data structures on the CPU side (most notably in Dataset).

Serialize dataset and indexes

When loading a dataset, LibKGE parses the text files holding the raw data and creates indexes. This may take while.

To speed things up, datasets and indexes should be pickled to the dataset's folder once loaded for the first time. One subsequent loads, we can directly use the pickled files, which should be much faster.

Trace every job with a unique run id

4233993 introduced a git commit field into the configuration. We should remove this field from there.

The field is misleading in the configuration because (i) our code does not ensure that exactly this commit is actually used and (ii) a model may be resumed with a different commit.

I suggest to clearly separate configuration and environment:

Give every job a unique id. Add this id to each trace entry automatically.
Whenever a job starts, add a trace entry which has the job id, type, etc. as well as other environment variables (start time, git commit, user, machine, ...). This way, we can keep track of changing environments (e.g, when a job is resumed).

Add label smoothing as an option to trainer

Add batch normalization to lookup embedders

Output partial results in grid job

Grid search validation outputs are currently copied to the trace file of the grid search job only once grid search completes. This makes it unnecessarily hard to track the state of the search; all validation results should be forwarded immediately as they are computed.

Enhance usability for 'dump trace --keys and --keysfile' option

Documentation for the 'dump trace --keys and --keysfile' option are needed. Its not clear how to use them.

Also helpful for this would be a 'dump trace --list_keys' option.

API changes Config

Get rid of get_option so that we use one API for accessing the config everywhere and don't have two names "config/option" for the same thing.
Make config_key an optional kwarg for config.get_default and do the check from get_option there.
Merge get and get_default
Rename default to resolve_type and make a boolean argument get(..., resolve_type=True, ...) as it is done for set(..., create=True, ...). Then we have one get and one set with either an option to have a resolve_type/create behaviour or not.

Fix dbpedia500 dataset because of s,o,r

n/t

Streamlining min_threshold configuration keys

valid.early_stopping.min_threshold.epochs should be w.r.t. to validation runs and be renamed (e.g., to valid.early_stopping.metric_value_threshold.after_validations or so.

Add option for memory-saving ComplEx implementation

Right now, scorer uses about 4x memory than needed but is faster.

Add support for job reordering to ManualSearchJob

Most notably, a hyperparameter optimization package may be used to determine which configurations appear most promising and evaluate those first.

Add other forms of initialization

Currently only "normal", done in KgeBase.

Batch size and penalty terms

7aa82ce introduces a patch that divides penalty terms by the number of batches to keep the penelty terms consistent. This needs discussion.

In particular: we average example losses/gradients over the batch. Thus, before 7aa82ce:

E[gradient] = E[gradient of a random example] + gradient of penatly term

That's independent of the number of batches. After 7aa82ce :

E[gradient] = E[gradient of a random example] + gradient of penalty term / num_batches

That's dependent on the number of batches. This patch thus seems to introduce what it tries to avoid.

fixed_parameters may be ignored

The new fixed_parameters feature below is only applied when num_sobol_trials is set. Perhaps the best approach is to change our implementation to always create the generation_strategy manually (i.e., also when num_sobol_trials=-1).

kge/kge/config-default.yaml

Line 212 in 0871f68

fixed_parameters: []

Code review of ConvE implementation

Pending.

Two folders named data

One on the root folder, one inside the kge folder. Perhaps rename the one inside kge to kge/dataset?

Bug resuming ConvE

resume-conve-bug.yaml.zip

I attach a ConvE file which can be used to reproduce the resuming bug. Start with:

kge start resume-conve-bug.yaml --folder experiments/resume-conve-bug

The stop after the first trial is finished, and resume with:

kge resume experiments/resume-conve-bug/config.yaml

Clean-up support for learning rate scheduler

Inverse relation models only works with ConvE

The inverse relations model is broken for models other than ConvE. The commit d90358d broke it. It removes the automatic addition of the type entry in embedders when missing. I am look into fixing it now, but if anyone has a quick suggestion, that is welcome. I attach a config file to reproduce the error.

toy-complex-inverse-train.yaml.zip

Add devicepool to parallel search

Should use a devicepool to distribute parallel search jobs to different devices, e.g.

device_pool = ['cpu', 'gpu#1', 'gpu#1', 'gpu#1', 'gpu#2', 'gpu#2', 'gpu#2', ]

Why ignore test during validation?

https://github.com/rufex2001/kge/blob/030320d83c10ae379b321fd1101d1e1b976a3185/kge/job/entity_ranking.py#L45

Let auto search output best run

Currently only a parameter estimate and a metric estimate is output. It is helpful if information about the best actual run would also be output.

Improve seeding with ax

Make SOBOAL seed configurable as ax_search.sobol_seed, default to 0.

Also, seed 0 is currently used twice in the code. It looks like that's for different purposes, which is problematic (not sure).

Use entry_points to make vanity CLI script

Rather than providing kge.py, you can use the console_scripts entry point to define a CLI that automatically gets installed and made available in the shell with pip install .

Would be happy to send a PR!

Config bug: Trying to read type from embedder leads to exception for inverse_relations_model

When model is set to 'model': 'inverse_relations_model' the following code

if config.get(self.configuration_key + ".relation_embedder.type") == 'projection_embedder':
in sd_rescal throws the error

Traceback (most recent call last):
  File "/home/sbrosche/anaconda3/lib/python3.7/concurrent/futures/process.py", line 232, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/search.py", line 126, in _run_train_job
    job = Job.create(train_job_config, search_job.dataset, parent_job=search_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/job.py", line 48, in create
    job = TrainingJob.create(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 88, in create
    return TrainingJob1toN(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 396, in __init__
    super().__init__(config, dataset, parent_job)
  File "/home/sbrosche/PycharmProjects/kge/kge/job/train.py", line 34, in __init__
    self.model = KgeModel.create(config, dataset)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/kge_model.py", line 324, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/inverse_relations_model.py", line 33, in __init__
    config, alt_dataset, self.configuration_key + ".base_model"
  File "/home/sbrosche/PycharmProjects/kge/kge/model/kge_model.py", line 324, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/sbrosche/PycharmProjects/kge/kge/model/sd_rescal.py", line 224, in __init__
    if config.get(self.configuration_key + ".relation_embedder.type") == 'projection_embedder':
  File "/home/sbrosche/PycharmProjects/kge/kge/config.py", line 43, in get
    result = result[name]
KeyError: 'type'

label smoothing broken

The default value in config-default of -1.0 (to disable it) is not accepted anymore. This breaks all configurations which do not explicitely set it. It's also unclear hwo to disable label smoothing right now.

Probably introduced by 2b082b7

Support "half-ranks" during evaluation

During evaluation: If a true answer has a half-rank (such as 1.5), it is currently counted as rank 2.

Example: scores are (3,+10,*10,5). If the true answer is + or *, returned rank should be 1.5 in both cases (right now, it's 2 in both cases).

Internally, fixing this problem requires a change of the histogram layout.

'dump trace' for a search folder should contain search's child process folder and search's child process job_id

right now kge dump trace for a search folder gives no clue which search folder contains a specific job's files.

Cannot set different dropout, initialization ( or for Tucker embedding sizes ) for entities and relations with current config style

I propose the following schema.

model:
  type: complex               
  class_name: ComplEx
  entity_embedder: <lookup_embedder>
  relation_embedder: <lookup_embedder>

  grid_search_tied_attributes: [
      [ 'model.entity_embedder.dim', 'model.relation_embedder.dim' ],
      [ 'model.entity_embedder.sparse', 'model.relation_embedder.sparse' ],
      [ 'model.entity_embedder.normalize', 'model.relation_embedder.normalize' ],
      [ 'model.entity_embedder.initialize', 'model.relation_embedder.initialize' ],
  ]

where CONFIG_ITEM: <NAME> triggers copying the NAME config from the core into CONFIG_ITEM, e.g. <lookup_embedder> from core into entity_embedder, so automatically expanded this looks like

model:
  type: complex               
  class_name: ComplEx
  entity_embedder:
    type: lookup_embedder 
    dim: 100                    # entity dimensionality or [ entity, relation ] dimensionality
    initialize: normal          # xavier, uniform, normal
    initialize_arg: 0.1         # gain for Xavier, range for uniform, stddev for Normal
    dropout: 0.                 # dropout used for embeddings
    sparse: False               # ??
    normalize: ''               # alternatively: normalize '', L2
  relation_embedder: 
    type: lookup_embedder 
    dim: 100                    # entity dimensionality or [ entity, relation ] dimensionality
    initialize: normal          # xavier, uniform, normal
    initialize_arg: 0.1         # gain for Xavier, range for uniform, stddev for Normal
    dropout: 0.                 # dropout used for embeddings
    sparse: False               # ??
    normalize: ''               # alternatively: normalize '', L2

  grid_search_tied_attributes: [
      [ 'model.entity_embedder.dim', 'model.relation_embedder.dim' ],
      [ 'model.entity_embedder.sparse', 'model.relation_embedder.sparse' ],
      [ 'model.entity_embedder.normalize', 'model.relation_embedder.normalize' ],
      [ 'model.entity_embedder.initialize', 'model.relation_embedder.initialize' ],
  ]

This also relieves us from having many if-elses, like f.ex. the "if embedder == 'lookup'" ...

grid_search_tied_attributes tells grid search that if we do grid search over values for model.entity_embedder.dim then we copy/tie them to model.relation_embedder.dim. This is what I am currently in my code.

Should not encourage people to eval on test all the time

https://github.com/rufex2001/kge/blob/9d83e43f5085e4a0d30d70536e4c1772389907cd/kge/job/entity_ranking.py#L10

Allow to select checkpoint for resuming

This is necessary, for example, to run test evaluation on the best checkpoint instead of the last.

Suggested implementation:

add an optional --checkpoint option to kge resume
when present, use this checkpoint id (such as "100" or "best")
when absent, use "best" if present and it's an eval job, else use latest checkpoint by default

Support type of ranking (competition, fractional, ...) explicitly

https://en.wikipedia.org/wiki/Ranking

Revert 26a4d2a

Revert 26a4d2a. This commit creates a configuration mess and all possible initializers and options need to be specified aprior. An alternative to obtain this behavior (why is it needed, though?) might be to use

initialize_args.x -> to pass along option x
initialize_args.normal_.x -> to pass along option x only when initialize is normal

This way, initializers and their options do not need to be listed in the config file (and they should not be listed there).

Config does not work as expected

sd_rescal_tucker3.yaml:

import: [lookup_embedder, projection_embedder]

sd_rescal_tucker3:
  class_name: SparseDiagonalRescal
  blocks: -1
  block_size: -1
  entity_embedder:
    type: lookup_embedder
    dim: -1  # determine automatically
    +++: +++
  relation_embedder:
    type: projection_embedder
    dim: -1  # determine automatically
    +++: +++

toy-sdrescal-tucker3.yaml

import: sd_rescal_tucker3
model: sd_rescal_tucker3

sd_rescal_tucker3:
  class_name: SparseDiagonalRescal
  blocks: 4
  block_size: 16
  entity_embedder:
    initialize: auto_initialization
  relation_embedder:
    initialize: auto_initialization
    dim: -1  # determine automatically

This throws the an error

           relation_embedder = ".base_embedder.relation_embedder"

...

            config.set(
                self.configuration_key + relation_embedder + ".initialize",
                "normal_",
                log=True,
            )

Error:

Traceback (most recent call last):
  File "/home/samuel/PycharmProjects/kge/kge.py", line 200, in <module>
    job = Job.create(config, dataset)
  File "/home/samuel/PycharmProjects/kge/kge/job/job.py", line 48, in create
    job = TrainingJob.create(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 86, in create
    return TrainingJob1toN(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 394, in __init__
    super().__init__(config, dataset, parent_job)
  File "/home/samuel/PycharmProjects/kge/kge/job/train.py", line 32, in __init__
    self.model = KgeModel.create(config, dataset)
  File "/home/samuel/PycharmProjects/kge/kge/model/kge_model.py", line 322, in create
    model = getattr(module, class_name)(config, dataset, configuration_key)
  File "/home/samuel/PycharmProjects/kge/kge/model/sd_rescal.py", line 261, in __init__
    log=True,
  File "/home/samuel/PycharmProjects/kge/kge/config.py", line 143, in set
    create = create or "+++" in data[splits[i]]
KeyError: 'base_embedder

Its not clear why this doesn't work.

Verify sparse regularization

The sparse regularization implementation in lookup embedders looks fishy: it seems to assume that embed is only called once per batch.

Also, it may also lead to incorrect results for 1toN, where every entity is embedded but only the entities/relations in the batch triples should be regularized.

A better design would be to pass the set of used entities to the penalty function (e.g., as a kw argument so_indexes and p_indexes); the lookup emebdder can then pick it up there.

Add weighted regularization

Add possibility to regularize each entity/relation embedding proportional to its inverse frequency in the training data in lookup_embedder.penalty().

This may be controlled with a Boolean option lookup_embedder.regularize_weighted or so (default: False).

Technically, LookupEmbedder should take an additional argument vocab_weights, defaulting to None.