
lipp's People

Contributors

jiacheng-wu, oxooo, pohchaichon, zhczhong


lipp's Issues

Licensing and attribution of this implementation of LIPP

hi LIPP team,

I'm looking into benchmarking, and potentially modifying, LIPP if the license allows for it. However, the VLDB paper specifies a CC license which, while clear about the handling of the paper itself, leaves the code's licensing status unclear. Would you consider adding an appropriate Open Source or Source Available license? If you wish to continue hard-enforcing a non-commercial clause, MongoDB's SSPL would technically work, but a truly open source license such as the GPL would be much better. That way no legal confusion can arise from application or use of the software.

RT_ASSERT Error happened when inserting

Hi Jiacheng,

Thanks for sharing your excellent work. We want to evaluate several kinds of learned index. When we inserted 100,000,000 key-value pairs into LIPP, an RT_ASSERT error occurred, but inserting 80,000,000 pairs worked fine. Is there a limit on the total number of key-value pairs?

#include <lipp.h>

int main() {
    LIPP<int, int> lipp{};
    // Insert 100,000,000 sequential key-value pairs one at a time.
    for (int i = 0; i < 100000000; i++) {
        lipp.insert(std::pair<int, int>(i, i));
    }
}

Output
RT_ASSERT Error at ./index/lipp/lipp.h:805,path_size < MAX_DEPTH
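Until the assertion is diagnosed, a possible workaround at this scale (a sketch only, and untested against the assertion; it relies on the bulk_load interface that already appears in this repo's examples) is to build the index from a sorted array instead of issuing 100,000,000 individual inserts:

#include <lipp.h>
#include <utility>
#include <vector>

int main() {
    // Keys are generated sorted and unique, which bulk_load expects.
    const size_t n = 100000000;
    std::vector<std::pair<int, int>> data;
    data.reserve(n);
    for (size_t i = 0; i < n; i++) {
        data.push_back({static_cast<int>(i), static_cast<int>(i)});
    }

    LIPP<int, int> lipp;
    lipp.bulk_load(data.data(), data.size());  // one shot instead of per-key inserts
    return 0;
}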

replicate experiments in Figure 5

Hi Jiacheng,

I want to reproduce the experiment of Fig. 5 in your paper, Updatable Learned Index with Precise Positions, but in my benchmark LIPP does not perform as well as the paper reports.

[figure: benchmark results omitted]

Would it be possible to share your micro-benchmark code so I can reproduce that figure?

Thanks in advance!

Regards,
Zhicong

Unable to handle 'update' operation.

Hi Jiacheng, thanks for sharing your excellent work.
I am using LIPP to test some key-value workloads. How can I update the value of an existing key? I cannot find an update interface, so I use 'insert' instead. When I evaluate LIPP with a workload of 50% reads and 50% updates generated by YCSB, it reports "RT_ASSERT Error at lipp/src/core/lipp.h:377, key1 < key2". Does this mean LIPP does not support the update operation?

Out-of-memory exception

When running your LIPP source code on a dataset of mixed SOSD data (1.6 GB), we found a potential bug that makes LIPP run out of our memory (756 GB).

terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
zsh: abort (core dumped)  ./example_bulk_load

The code below reproduces the case. The data can be fetched from https://drive.google.com/file/d/1d3JomhUHXBCMGnRx5fTtwPrmQk-adxJS/view?usp=sharing

Would you mind taking a look and fixing it? Also, I could not find a range query interface in your code; could you tell me how to do a range query with LIPP, or provide a range query interface?

Thank you very much in advance.

Best regards,
Zhicong

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "../test.data");

    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + bulk_load_size);
    auto last = std::unique(keys, keys + bulk_load_size);
    bulk_load_size = last - keys;
    for (size_t i = 0; i < bulk_load_size; i ++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), bulk_load_size);

    // Insert the keys that were not bulk loaded, one at a time.
    for (size_t i = 0; i < data_size - bulk_load_size; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }

    delete[] keys;
    return 0;
}
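A detail that may matter here (an observation about the snippet, not a confirmed diagnosis): after std::unique shrinks bulk_load_size, the insert loop starts at keys[bulk_load_size], which points into the leftover tail of the deduplicated range, so keys that were already bulk loaded can be inserted again. Other issues in this thread suggest LIPP does not tolerate duplicate keys. A sketch that deduplicates the whole array before splitting it:

// Replacement for the "prepare data" step above (sketch; assumes
// duplicate keys trigger the blow-up, which is unconfirmed).
std::sort(keys, keys + data_size);
size_t unique_size = std::unique(keys, keys + data_size) - keys;
size_t bulk = std::min(bulk_load_size, unique_size);
// bulk load keys[0 .. bulk), then insert keys[bulk .. unique_size)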

CANNOT pass "isfinite" check

Hi,

When I run LIPP on the SOSD data sets, it fails the "isfinite" check in the bulk loading algorithm.
Why does the code require this check, and how can this failure be fixed?

Baotong
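While waiting for a fix, it may help to pre-check the input. The sketch below assumes (based on the other issues in this thread) that duplicate adjacent keys produce a zero-width segment and hence a non-finite model slope; that assumption about the failure mode is not confirmed:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Scan sorted keys for duplicates before bulk loading.
bool keys_look_safe(const uint64_t *keys, size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (keys[i] == keys[i - 1]) {
            std::fprintf(stderr, "duplicate key at index %zu\n", i);
            return false;
        }
    }
    return true;
}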

Out-of-memory exception when loading SOSD benchmark datasets

Hi,

there appears to be an out-of-memory exception when loading the SOSD benchmark datasets.

Example dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7&version=10.0

I'm running my experiments with 32 GB of RAM and have been able to successfully load the datasets with ALEX, STX-BTree and Dynamic-PGM, so the available memory should be sufficient.

Code to reproduce the issue:

#include <vector>
#include <fstream>
#include <iostream>
#include "lipp.h"

template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;

    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);

    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));

    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);

    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();

    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/wiki_ts_200M_uint64");

    std::vector<std::pair<uint64_t, uint64_t>>  key_value_pairs;
    key_value_pairs.reserve(keys.size());

    for (auto& key : keys) {
      key_value_pairs.push_back(std::make_pair(key, 100));
    }
    std::cout << "Keys loaded" << std::endl;

    /* Build LIPP. */
    LIPP<uint64_t, uint64_t> lipp;
    lipp.bulk_load(key_value_pairs.data(), key_value_pairs.size());
    std::cout << "LIPP index built succesfully" << std::endl;
}

Output:

bachfischer@learned-indexes-benchmark:~/lipp$ ./build/experiments 
Keys loaded
Killed
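One plausible cause (an assumption, not verified against LIPP's internals): wiki_ts_200M_uint64 contains duplicate timestamps, and several issues in this thread indicate LIPP requires unique keys, so deduplicating before bulk_load may avoid the blow-up. A sketch, to be placed between loading the keys and building key_value_pairs in the snippet above (requires <algorithm>):

// Deduplicate the keys before building the pairs to bulk load
// (sketch; assumes duplicates in wiki_ts are the trigger).
std::sort(keys.begin(), keys.end());
keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
std::cout << "Unique keys: " << keys.size() << std::endl;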

Segmentation Fault in bulk loading

Hi,

A segmentation fault occurs when bulk loading the osm_cellids_200M_uint64 dataset from SOSD. The code below reproduces the error; the dataset can be fetched by running the script at https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh.

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 100000000;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "./osm_cellids_200M_uint64");
    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + data_size);
    for (size_t i = 0; i < data_size; i ++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), data.size());

    // normal insert (note: the key type is uint64_t, so -100 wraps
    // around to a very large unsigned value)
    lipp.insert(-100, 5);
    lipp.insert(187688, 4);

    // verify LIPP data structure
    lipp.verify();

    cout << "bulk load success" << endl;

    return 0;
}

Output

Program received signal SIGSEGV, Segmentation fault.
0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_bulk_fmcd(unsigned long*, unsigned long*, int) ()
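One thing to check (an observation from comparing the two loaders in this thread, not a confirmed cause of the crash): SOSD files downloaded by that script begin with an 8-byte key count, which the raw load_binary_data above reads as if it were the first key, shifting every subsequent key by one slot. A loader sketch that skips the header:

#include <cstdint>
#include <fstream>
#include <string>

// Read a SOSD-format file: an 8-byte key count followed by the keys.
template<class T>
bool load_sosd_data(T data[], size_t length, const std::string &file_path) {
    std::ifstream is(file_path, std::ios::binary | std::ios::in);
    if (!is.is_open()) return false;
    uint64_t n_keys = 0;
    is.read(reinterpret_cast<char *>(&n_keys), sizeof(uint64_t));  // skip header
    if (n_keys < length) return false;  // file holds fewer keys than requested
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    return static_cast<bool>(is);
}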

Segmentation Fault when inserting keys

Hi Jiacheng,

We want to test distribution shift across different learned indexes in SOSD, but when I run LIPP from the develop branch, it throws a segmentation fault, even though I have already removed duplicate keys. The code below reproduces the error; the datasets can be fetched by running the script at https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh. Would you mind taking a look and fixing it? Also, what is the estimated timeline for the range query interface?

#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>

using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();

    return true;
}


template<typename T>
T *unique_data(T *key1, size_t &size1, T *key2, size_t &size2) {
    size_t ptr1 = 0;
    size_t ptr2 = 0;

    std::sort(key1, key1 + size1);
    size1 = std::unique(key1, key1 + size1) - key1;
    std::sort(key2, key2 + size2);
    size2 = std::unique(key2, key2 + size2) - key2;

    size_t result = 0;

    // Remove from key2 every key that also appears in key1.
    while (ptr1 < size1 && ptr2 < size2) {
        while (ptr1 < size1 && key1[ptr1] < key2[ptr2]) {
            ptr1++;
        }
        if (ptr1 < size1 && key1[ptr1] == key2[ptr2]) {
            ptr2++;
            continue;
        }
        key2[result++] = key2[ptr2++];
    }

    while (ptr2 < size2) {
        key2[result++] = key2[ptr2++];
    }

    size2 = result;
    // Note: std::random_shuffle was removed in C++17; use std::shuffle
    // with an explicit RNG on newer standards.
    std::random_shuffle(key2, key2 + size2);

    return &key2[result];
}

int main() {
    LIPP <uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;
    size_t to_insert = data_size - bulk_load_size;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    cout << "Loading data" << endl;
    load_binary_data(keys, bulk_load_size, "../fb_200M_uint64");
    load_binary_data(keys + bulk_load_size, to_insert, "../books_200M_uint64");
    cout << "Make data Unique" << endl;
    unique_data(keys, bulk_load_size, keys + bulk_load_size, to_insert);

    vector <pair<uint64_t, uint64_t>> data;
    cout << "bulk load size is " << bulk_load_size << endl;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    cout << "Begin to bulkload" << endl;

    // bulk load
    lipp.bulk_load(data.data(), bulk_load_size);

    cout << "Finish bulkload" << endl;
    cout << "Begin insert" << endl;
    for (int i = 0; i < to_insert; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }

    cout << "Finish insert" << endl;

    return 0;
}

The gdb backtrace is as follows:

#0  0x0000000000411100 in LIPP<unsigned long, unsigned long, true>::insert_tree (this=0x7fffffff3140, _node=0xb5d098e0, key=@0x7fffc75b1a58: 772277387847372864, 
    value=@0x7fffffff3238: 73) at lipp/src/core/lipp.h:795
#1  0x000000000041015c in LIPP<unsigned long, unsigned long, true>::insert (this=0x7fffffff3140, key=@0x7fffc75b1a58: 772277387847372864, value=@0x7fffffff3238: 73)
    at lipp/src/core/lipp.h:110
#2  0x000000000040f55f in main () at lipp/src/examples/example_bulk_load.cpp:85

Range Query Code

Hi,

It is strange that you evaluate range queries in your paper but do not release that code in this repo.
Could you upload it as soon as possible?
BTW, I noticed that you tend to close issues yourself, but in general an issue should be closed by the person who raised it.

Baotong
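For readers who need range scans before an official interface lands, the expected shape is roughly the following. The signature is hypothetical, not LIPP's API; since the repo exposes no scan interface yet, this sketch answers the query from a sorted copy of the data kept alongside the index:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical range query semantics: return all (key, value) pairs
// with lo <= key <= hi, answered from an external sorted vector.
std::vector<std::pair<uint64_t, uint64_t>>
range_query(const std::vector<std::pair<uint64_t, uint64_t>> &sorted_data,
            uint64_t lo, uint64_t hi) {
    auto cmp = [](const std::pair<uint64_t, uint64_t> &p, uint64_t k) {
        return p.first < k;
    };
    auto it = std::lower_bound(sorted_data.begin(), sorted_data.end(), lo, cmp);
    std::vector<std::pair<uint64_t, uint64_t>> out;
    for (; it != sorted_data.end() && it->first <= hi; ++it) {
        out.push_back(*it);
    }
    return out;
}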
