pykeen / bloom-filterer-benchmark

Benchmark the bloom filterer at https://pykeen.github.io/bloom-filterer-benchmark/

Home Page: https://pykeen.github.io/bloom-filterer-benchmark/

License: MIT License

Python 100.00%
bloom-filter pykeen benchmarking negative-sampling machine-learning knowledge-graphs knowledge-graph-embeddings knowledge-graph-embedding-models


bloom-filterer-benchmark

Negative sampling is necessary when training knowledge graph embedding models because knowledge graphs typically contain only positive examples. Negative examples are usually derived from positive ones by corruption: replacing the head or tail of a positive triple with a random entity. Unfortunately, corruption can produce false negatives — corrupted triples that are actually already in the knowledge graph. Depending on the dataset, model, and learning task, this may not matter much. PyKEEN provides an exact algorithm for filtering out false negatives, but it is quite slow.
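The corruption procedure and the exact (set-based) filter can be sketched in a few lines. The toy triples, entity names, and the `corrupt` helper below are illustrative inventions, not PyKEEN's implementation:

```python
import random

# Toy knowledge graph: (head, relation, tail) triples -- all positives.
triples = {
    ("brussels", "capital_of", "belgium"),
    ("paris", "capital_of", "france"),
    ("paris", "located_in", "france"),
}
entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})

def corrupt(triple, rng):
    """Derive a candidate negative by replacing the head or tail at random."""
    head, relation, tail = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), relation, tail)
    return (head, relation, rng.choice(entities))

# Some corrupted triples are still in the graph: false negatives that an
# exact filterer removes with a plain set-membership test.
rng = random.Random(0)
candidates = [corrupt(("paris", "capital_of", "france"), rng) for _ in range(20)]
filtered = [c for c in candidates if c not in triples]
```

The set-membership test is exact, but on large graphs the set itself becomes expensive to hold and query, which motivates the approximate alternative below.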

An alternative filterer based on the bloom filter, an approximate set-membership index structure, was introduced in PyKEEN #401 by Max Berrendorf (@mberr). We ran this benchmarking study to show that it yields a large speedup while maintaining a low error rate.
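A bloom filter answers "possibly present" or "definitely absent": it can report false positives (here, a genuine negative wrongly discarded as known) but never false negatives. The following is a minimal pure-Python sketch of the idea, not PyKEEN's implementation, with sizing taken from the textbook formulas:

```python
import hashlib
import math

class BloomFilter:
    """Minimal bloom filter sketch (illustrative, not PyKEEN's code)."""

    def __init__(self, items, error_rate=0.01):
        n = max(len(items), 1)
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.m = max(1, math.ceil(-n * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)
        for item in items:
            for i in self._indices(item):
                self.bits[i // 8] |= 1 << (i % 8)

    def _indices(self, item):
        # Derive k independent hash positions by salting blake2b.
        for seed in range(self.k):
            digest = hashlib.blake2b(
                repr(item).encode(), salt=seed.to_bytes(8, "big")
            ).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))
```

Because every stored item sets all of its bits, a stored triple can never be reported absent; only a small, tunable fraction of absent triples is wrongly reported present.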

The code and artifacts are available on GitHub. The benchmark can be rerun with python benchmark.py --force. A tutorial on how to enable the filtering of negative samples, as well as specifics on the exact filterer and the bloom filterer, is available on Read the Docs.

Benchmarking

Benchmarking over several datasets of varying size suggests that the relationship between the bloom filter's error_rate parameter and the error actually observed on the testing or validation sets does not depend strongly on dataset size.

As expected, the time for checking the triples decreases with an increased nominal error rate.

Bloom filters over datasets with more triples take longer to create. The creation time also decreases as the nominal error rate increases.

The size of the bloom filter increases with the number of training triples and also varies with the error rate. The observed relationship is roughly log(size) ~ log(triples) + log(error rate).
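This scaling is consistent with the textbook sizing formula for an optimal bloom filter (an assumption here; the benchmark measures actual sizes): the bit count is linear in the number of triples and grows as the desired error rate shrinks. The function name below is hypothetical:

```python
import math

def bloom_filter_bits(num_triples, error_rate):
    """Textbook optimal bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
    return -num_triples * math.log(error_rate) / math.log(2) ** 2

# Doubling the triples doubles the size; tightening the error rate
# from 1% to 0.1% multiplies it by ln(0.001) / ln(0.01) = 1.5.
small = bloom_filter_bits(100_000, 0.01)
large = bloom_filter_bits(200_000, 0.01)
strict = bloom_filter_bits(100_000, 0.001)
```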

Comparison

The bloom filterer is compared to the exact implementation based on Python sets. Several error rates for the bloom filter are shown simultaneously to illustrate the speed-accuracy tradeoff available at each.

The setup times show that the bloom filterer is faster to construct for larger datasets, though construction only happens once per training run. Both implementations take well under a second, so setup cost can be considered negligible.

The Python-based implementation performs well on smaller datasets, but the bloom filterer clearly wins on larger ones. Because the lookup operation is repeated throughout training, these times do add up.

The error rates make the most sense on a log scale, but the exact implementation's constant error rate of zero cannot be shown on one. Therefore, all error rates were adjusted by adding 1 / (number of triples), the minimum possible nonzero error for a given lookup task. In this chart, the "adjusted error rate" reported for the exact algorithm can be read as a lower bound. Note that several data points are shown for the bloom filter, corresponding to the different "desired error rates" that are parameterizable in the code.
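The adjustment is a simple shift. The function name and the numbers below are hypothetical, for illustration only:

```python
def adjusted_error_rate(num_false_negatives, num_lookups, num_triples):
    """Observed error rate shifted by 1 / num_triples, the smallest nonzero
    error possible for the task, so zero-error results stay visible on a
    log-scaled axis."""
    return num_false_negatives / num_lookups + 1 / num_triples

# The exact filterer makes no errors, so its adjusted rate is exactly the
# lower bound 1 / num_triples; any bloom filter sits above it.
exact = adjusted_error_rate(0, 10_000, 5_000)
bloom = adjusted_error_rate(12, 10_000, 5_000)
```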

The following chart shows the tradeoff between the implementations (with all error rates of the bloom filter shown) across all datasets. Again, think of the adjusted error rate reported for the exact algorithm as a lower bound.

Contributors

@cthoyt and @mberr
