jiacheng-wu / lipp
Updatable Learned Index with Precise Positions
License: MIT License
Hi LIPP team,
I'm looking into benchmarking and potentially modifying LIPP at some point, if the license allows for it. However, while the VLDB paper itself specifies a CC license that is clear about the handling of the paper, the code's licensing status is unclear. Would you consider adding an appropriate open-source or source-available license, so that no legal confusion could arise from applying or using the software? (If you wish to keep hard-enforcing a non-commercial clause, MongoDB's SSPL would technically work, but something truly open source like the GPL would be a lot better.)
Hi Jiacheng,
Thanks for sharing your excellent work. We want to evaluate several kinds of learned indexes. When we inserted 100,000,000 key-value pairs into LIPP, an RT_ASSERT error occurred, but inserting 80,000,000 pairs worked fine. Is there a limit on the total number of key-value pairs?
#include <lipp.h>

int main() {
    LIPP<int, int> lipp{};
    for (int i = 0; i < 100000000; i++) {
        lipp.insert(std::pair<int, int>(i, i));
    }
}
Output
RT_ASSERT Error at ./index/lipp/lipp.h:805,
path_size < MAX_DEPTH
Hi Jiacheng,
I want to reproduce the experiment of Fig. 5 in your paper Updatable Learned Index with Precise Positions, but in my benchmark LIPP is not as strong as the paper shows.
Would it be convenient to share your micro-benchmark code to help me reproduce that figure?
Thanks in advance!
Regards,
Zhicong
Hi Jiacheng, thanks for sharing your excellent work.
I am using your LIPP to test some key-value workloads. I would like to know how to update the value of an existing key, since I cannot find an update interface and have been using 'insert' instead. When I evaluate LIPP with a YCSB-generated workload of 50% reads and 50% updates, it reports "RT_ASSERT Error at lipp/src/core/lipp.h:377, key1 < key2". Does this mean that LIPP does not support the update operation?
When running your LIPP source code on a dataset of mixed SOSD data (1.6 GB), we found a potential bug that makes LIPP run out of memory (756 GB).
terminate called after throwing an instance of 'St9bad_alloc'
what(): std::bad_alloc
zsh: abort (core dumped) ./example_bulk_load
The code below reproduces the case. The data can be fetched from https://drive.google.com/file/d/1d3JomhUHXBCMGnRx5fTtwPrmQk-adxJS/view?usp=sharing
Would you mind taking a look and fixing it? Besides, I could not find a range-query interface in your code; could you please tell me how to do a range query with your code, or provide a new range-query interface?
Thank you very much in advance.
Best regards,
Zhicong
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;

    // prepare data: load, then sort and deduplicate the bulk-load prefix
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "../test.data");
    std::sort(keys, keys + bulk_load_size);
    auto last = std::unique(keys, keys + bulk_load_size);
    bulk_load_size = last - keys;

    vector<pair<uint64_t, uint64_t>> data;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load, then insert the remaining keys one by one
    lipp.bulk_load(data.data(), bulk_load_size);
    for (size_t i = 0; i < data_size - bulk_load_size; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }
    delete[] keys;
    return 0;
}
Hi,
When I run LIPP on SOSD datasets, it fails the "isfinite" check in the bulk-loading algorithm.
I am wondering why the code requires such a check, and how to fix this bug?
Baotong
Hi,
there appears to be an out-of-memory exception when loading the SOSD benchmark datasets.
Example dataset: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/JGVF9A/EATHF7&version=10.0
I'm running my experiments with 32 GB of RAM and have been able to successfully load the datasets with ALEX, STX-BTree and Dynamic-PGM, so the available memory should be sufficient.
Code to reproduce the issue:
#include <vector>
#include <fstream>
#include <iostream>
#include "lipp.h"

template<typename Key>
std::vector<Key> load_data(const std::string &filename) {
    using key_type = Key;
    /* Open file. */
    std::ifstream in(filename, std::ios::binary);
    if (!in.is_open())
        exit(EXIT_FAILURE);
    /* Read number of keys. */
    uint64_t n_keys;
    in.read(reinterpret_cast<char*>(&n_keys), sizeof(uint64_t));
    /* Initialize vector. */
    std::vector<key_type> data;
    data.resize(n_keys);
    /* Read keys. */
    in.read(reinterpret_cast<char*>(data.data()), n_keys * sizeof(key_type));
    in.close();
    return data;
}

int main(int argc, char *argv[])
{
    /* Load keys. */
    auto keys = load_data<uint64_t>("data/wiki_ts_200M_uint64");
    std::vector<std::pair<uint64_t, uint64_t>> key_value_pairs;
    key_value_pairs.reserve(keys.size());
    for (auto& key : keys) {
        key_value_pairs.push_back(std::make_pair(key, 100));
    }
    std::cout << "Keys loaded" << std::endl;
    /* Build LIPP. */
    LIPP<uint64_t, uint64_t> lipp;
    lipp.bulk_load(key_value_pairs.data(), key_value_pairs.size());
    std::cout << "LIPP index built successfully" << std::endl;
}
Output:
bachfischer@learned-indexes-benchmark:~/lipp$ ./build/experiments
Keys loaded
Killed
Hi,
A segmentation fault occurs when bulk loading the osm_cellids_200M_uint64 dataset from SOSD. The code below reproduces the error, and the dataset can be fetched by running the script from https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh.
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

int main()
{
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 100000000;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    load_binary_data(keys, data_size, "./osm_cellids_200M_uint64");
    vector<pair<uint64_t, uint64_t>> data;
    std::sort(keys, keys + data_size);
    for (size_t i = 0; i < data_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    // bulk load
    lipp.bulk_load(data.data(), data.size());

    // normal insert (note: -100 wraps around to a large uint64_t key)
    lipp.insert(-100, 5);
    lipp.insert(187688, 4);

    // verify LIPP data structure
    lipp.verify();
    cout << "bulk load success" << endl;
    delete[] keys;
    return 0;
}
Output
Program received signal SIGSEGV, Segmentation fault.
0x00000000004048d1 in LIPP<unsigned long, unsigned long, true>::build_tree_bulk_fmcd(unsigned long*, unsigned long*, int) ()
Hi Jiacheng,
We want to test learned indexes under distribution shift using SOSD. But when I run LIPP from the develop branch, it throws a segmentation fault, even though I have already removed duplicate keys. The code below reproduces the error, and the datasets can be fetched by running the script from https://github.com/learnedsystems/SOSD/blob/master/scripts/download.sh. Would you mind taking a look and fixing it? Besides, what is the estimated timeline for the range-query interface?
#include <lipp.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <algorithm>
#include <random>
#include <fstream>
using namespace std;

template<class T>
bool load_binary_data(T data[], int length, const std::string &file_path) {
    std::ifstream is(file_path.c_str(), std::ios::binary | std::ios::in);
    if (!is.is_open()) {
        return false;
    }
    is.read(reinterpret_cast<char *>(data), std::streamsize(length * sizeof(T)));
    is.close();
    return true;
}

// Deduplicate both arrays, then remove from key2 every key that also
// appears in key1, so inserts never collide with bulk-loaded keys.
template<typename T>
T *unique_data(T *key1, size_t &size1, T *key2, size_t &size2) {
    size_t ptr1 = 0;
    size_t ptr2 = 0;
    std::sort(key1, key1 + size1);
    size1 = std::unique(key1, key1 + size1) - key1;
    std::sort(key2, key2 + size2);
    size2 = std::unique(key2, key2 + size2) - key2;
    size_t result = 0;
    while (ptr1 < size1 && ptr2 < size2) {
        while (ptr1 < size1 && key1[ptr1] < key2[ptr2]) {  // bound ptr1 to avoid overrun
            ptr1++;
        }
        if (ptr1 < size1 && key1[ptr1] == key2[ptr2]) {
            ptr2++;
            continue;
        }
        key2[result++] = key2[ptr2++];
    }
    while (ptr2 < size2) {
        key2[result++] = key2[ptr2++];
    }
    size2 = result;
    // std::random_shuffle was removed in C++17; use std::shuffle instead.
    std::mt19937 gen(std::random_device{}());
    std::shuffle(key2, key2 + size2, gen);
    return &key2[result];
}

int main() {
    LIPP<uint64_t, uint64_t> lipp;
    size_t data_size = 200000000;
    size_t bulk_load_size = 100000000;
    size_t to_insert = data_size - bulk_load_size;

    // prepare data
    uint64_t *keys = new uint64_t[data_size];
    cout << "Loading data" << endl;
    load_binary_data(keys, bulk_load_size, "../fb_200M_uint64");
    load_binary_data(keys + bulk_load_size, to_insert, "../books_200M_uint64");
    cout << "Make data unique" << endl;
    unique_data(keys, bulk_load_size, keys + bulk_load_size, to_insert);

    vector<pair<uint64_t, uint64_t>> data;
    cout << "bulk load size is " << bulk_load_size << endl;
    for (size_t i = 0; i < bulk_load_size; i++) {
        data.push_back({keys[i], keys[i]});
    }

    cout << "Begin to bulk load" << endl;
    lipp.bulk_load(data.data(), bulk_load_size);
    cout << "Finish bulk load" << endl;

    cout << "Begin insert" << endl;
    for (size_t i = 0; i < to_insert; i++) {
        lipp.insert(keys[bulk_load_size + i], i);
    }
    cout << "Finish insert" << endl;
    delete[] keys;
    return 0;
}
The gdb backtrace info is as follows:
#0  0x0000000000411100 in LIPP<unsigned long, unsigned long, true>::insert_tree (this=0x7fffffff3140, _node=0xb5d098e0, key=@0x7fffc75b1a58: 772277387847372864,
    value=@0x7fffffff3238: 73) at lipp/src/core/lipp.h:795
#1  0x000000000041015c in LIPP<unsigned long, unsigned long, true>::insert (this=0x7fffffff3140, key=@0x7fffc75b1a58: 772277387847372864, value=@0x7fffffff3238: 73)
    at lipp/src/core/lipp.h:110
#2  0x000000000040f55f in main () at lipp/src/examples/example_bulk_load.cpp:85
Hi,
It is odd that you evaluate range queries in your paper but do not release that code in this repo.
Could you upload it as soon as possible?
BTW, I noticed that you tend to close issues raised by users; in general, an issue should be closed by the person who raised it.
Baotong