jiacheng-wu / lipp
Updatable Learned Index with Precise Positions
License: MIT License
Hi LIPP team,
I'm looking into benchmarking and potentially modifying LIPP at some point, if the license allows for it. However, while the VLDB paper itself specifies a CC license that is clear about the handling of the paper, the code's licensing status is unclear. Would you consider adding an appropriate open-source or source-available license, so that no legal confusion could arise from applying or using the software? (If you wish to keep hard-enforcing a non-commercial clause, MongoDB's SSPL would technically work, but something truly open source like the GPL would be a lot better.)
Hi Jiacheng,
Thanks for sharing your excellent work. We want to evaluate several kinds of learned indexes. When we inserted 100,000,000 key-value pairs into LIPP, an RT_ASSERT error occurred, but inserting 80,000,000 pairs worked fine. Is there a limit on the total number of key-value pairs?
#include <lipp.h>

int main() {
    LIPP<int, int> lipp{};
    for (int i = 0; i < 100000000; i++) {
        lipp.insert(std::pair<int, int>(i, i));
    }
}
Output
RT_ASSERT Error at ./index/lipp/lipp.h:805,
path_size < MAX_DEPTH
Hi Jiacheng,
I want to reproduce the experiment of Fig. 5 in your paper Updatable Learned Index with Precise Positions, but in my benchmark LIPP is not as strong as the paper shows.
Would it be convenient to share your micro-benchmark code to help me reproduce that figure?
Thanks in advance!
Regards,
Zhicong
Hi Jiacheng, thanks for sharing your excellent work.
I am using your LIPP to test some key-value workloads. I would like to know how to update the value of an existing key, since I cannot find an update interface and have been using 'insert' instead. When I evaluate LIPP with a YCSB-generated workload of 50% reads and 50% updates, it reports "RT_ASSERT Error at lipp/src/core/lipp.h:377, key1 < key2". Does this mean that LIPP does not support the update operation?
When running your LIPP source code on a dataset of mixed SOSD data (1.6 GB), we found a potential bug that makes LIPP run out of memory (756 GB).
terminate called after throwing an instance of 'St9bad_alloc'
what(): std::bad_alloc
zsh: abort (core dumped) ./example_bulk_load
The code below reproduces the case. The data can be fetched from https://drive.google.com/file/d/1d3JomhUHXBCMGnRx5fTtwPrmQk-adxJS/view?usp=sharing
Would you mind taking a look and fixing it? Besides, I could not find a range-query interface in your code; could you please tell me how to do a range query with your code, or provide a new range-query interface?
Thank you very much in advance.
Best regards,
Zhicong
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;

    // prepare data: load, then sort and deduplicate the bulk-load prefix
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "../test.data");
    std::sort(keys, keys + bulk_load_size);
    auto last = std::unique(keys, keys + bulk_load_size);
    bulk_load_size = last - keys;

    vector<pair<uint64_t, uint64_t>> data;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load, then insert the remaining keys one by one
    lipp.bulk_load(data.data(), bulk_load_size);
    for (size_t i = 0; i < data_size - bulk_load_size; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }
    delete[] keys;
    return 0;
}
Hi,
When I run LIPP on SOSD datasets, it fails the "isfinite" check in the bulk-loading algorithm.
I am wondering why the code requires such a check, and how to fix this bug?
Baotong
Hi,
there appears to be an out-of-memory exception when loading the SOSD benchmark datasets.
Example dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7&version=10.0
I'm running my experiments with 32 GB of RAM and have been able to successfully load the datasets with ALEX, STX-BTree and Dynamic-PGM, so the available memory should be sufficient.
Code to reproduce the issue:
#include <vector>
#include <fstream>
#include <iostream>
#include "lipp.h"

template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;
    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);
    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));
    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);
    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();
    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/wiki_ts_200M_uint64");
    std::vector<std::pair<uint64_t, uint64_t>> key_value_pairs;
    key_value_pairs.reserve(keys.size());
    for (auto& key : keys) {
        key_value_pairs.push_back(std::make_pair(key, 100));
    }
    std::cout << "Keys loaded" << std::endl;
    /* Build LIPP. */
    LIPP<uint64_t, uint64_t> lipp;
    lipp.bulk_load(key_value_pairs.data(), key_value_pairs.size());
    std::cout << "LIPP index built successfully" << std::endl;
}
Output:
bachfischer@learned-indexes-benchmark:~/lipp$ ./build/experiments
Keys loaded
Killed
Hi,
A segmentation fault occurs when bulk loading the osm_cellids_200M_uint64 dataset from SOSD. The code below reproduces the error, and the dataset can be fetched by running the script from https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh.
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 100000000;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "./osm_cellids_200M_uint64");
    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + data_size);
    for (size_t i = 0; i < data_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), data.size());

    // normal insert (note: -100 wraps around to a large uint64_t key)
    lipp.insert(-100, 5);
    lipp.insert(187688, 4);

    // verify LIPP data structure
    lipp.verify();
    cout << "bulk load success" << endl;
    delete[] keys;
    return 0;
}
Output
Program received signal SIGSEGV, Segmentation fault.
0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_bulk_fmcd(unsigned long*, unsigned long*, int) ()
Hi Jiacheng,
We want to test learned indexes under distribution shift using SOSD. But when I run LIPP from the develop branch, it throws a segmentation fault, even though I have already removed duplicate keys. The code below reproduces the error, and the datasets can be fetched by running the script from https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh. Would you mind taking a look and fixing it? Besides, what is the estimated timeline for the range-query interface?
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <random>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

// Deduplicate both arrays, then remove from key2 every key that also
// appears in key1, so inserts never collide with bulk-loaded keys.
template<typename T>
T *unique_data(T *key1, size_t &size1, T *key2, size_t &size2) {
    size_t ptr1 = 0;
    size_t ptr2 = 0;
    std::sort(key1, key1 + size1);
    size1 = std::unique(key1, key1 + size1) - key1;
    std::sort(key2, key2 + size2);
    size2 = std::unique(key2, key2 + size2) - key2;
    size_t result = 0;
    while (ptr1 < size1 && ptr2 < size2) {
        while (ptr1 < size1 && key1[ptr1] < key2[ptr2]) {  // bound ptr1 to avoid overrun
            ptr1++;
        }
        if (ptr1 < size1 && key1[ptr1] == key2[ptr2]) {
            ptr2++;
            continue;
        }
        key2[result++] = key2[ptr2++];
    }
    while (ptr2 < size2) {
        key2[result++] = key2[ptr2++];
    }
    size2 = result;
    // std::random_shuffle was removed in C++17; use std::shuffle instead.
    std::mt19937 gen(std::random_device{}());
    std::shuffle(key2, key2 + size2, gen);
    return &key2[result];
}

int main() {
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;
    size_t to_insert = data_size - bulk_load_size;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    cout << "Loading data" << endl;
    load_binary_data(keys, bulk_load_size, "../fb_200M_uint64");
    load_binary_data(keys + bulk_load_size, to_insert, "../books_200M_uint64");
    cout << "Make data unique" << endl;
    unique_data(keys, bulk_load_size, keys + bulk_load_size, to_insert);

    vector<pair<uint64_t, uint64_t>> data;
    cout << "bulk load size is " << bulk_load_size << endl;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    cout << "Begin to bulk load" << endl;
    lipp.bulk_load(data.data(), bulk_load_size);
    cout << "Finish bulk load" << endl;

    cout << "Begin insert" << endl;
    for (size_t i = 0; i < to_insert; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }
    cout << "Finish insert" << endl;
    delete[] keys;
    return 0;
}
The gdb backtrace info is as follows:
#0  0x0000000000411100 in LIPP<unsigned long, unsigned long, true>::insert_tree (this=0x7fffffff3140, _node=0xb5d098e0, key=@0x7fffc75b1a58: 772277387847372864,
    value=@0x7fffffff3238: 73) at lipp/src/core/lipp.h:795
#1  0x000000000041015c in LIPP<unsigned long, unsigned long, true>::insert (this=0x7fffffff3140, key=@0x7fffc75b1a58: 772277387847372864, value=@0x7fffffff3238: 73)
    at lipp/src/core/lipp.h:110
#2  0x000000000040f55f in main () at lipp/src/examples/example_bulk_load.cpp:85
Hi,
It is odd that you evaluate range queries in your paper but do not release that code in this repo.
Could you upload it as soon as possible?
BTW, I noticed that you tend to close issues raised by users; in general, an issue should be closed by the person who raised it.
Baotong