vigna / sux Goto Github PK
View Code? Open in Web Editor NEWSuccinct data structures in C/C++
License: Other
Succinct data structures in C/C++
License: Other
I am trying to get the mapping from keys to unique indices out of a RecSplit MPHF. I created a file with 4 strings and passed it to recsplit_dump_8
, creating an MPHF. I modified recsplit_load.cpp
(shown below) to display the mapping. However, the mapping is not a bijection. I also tried with a million keys and could not get an MPHF.
I've only been working with the tool for a few hours today, but I can't see how I'm using the interface incorrectly. Any help would be appreciated.
Here is the output of my run:
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: benchmark/function/recsplit_load.cpp
Untracked files:
(use "git add <file>..." to include in what will be committed)
test.keys
test.mphf
no changes added to commit (use "git add" and/or "git commit -a")
$ make recsplit
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump.cpp -o bin/recsplit_dump_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump128.cpp -o bin/recsplit_dump128_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load.cpp -o bin/recsplit_load_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load128.cpp -o bin/recsplit_load128_8
$ cat test.keys
fish
cat
dog
horse
$ bin/recsplit_dump_8 test.keys 2 test.mphf
Building...
Elias-Fano cumul sizes: 3.000000 bits/bucket
Elias-Fano cumul bits: 5.000000 bits/bucket
Elias-Fano cumul sizes: 1.500000 bits/key
Elias-Fano cumul bits: 2.500000 bits/key
Rice-Golomb descriptors: 1.500000 bits/key
Total bits: 5.500000 bits/key
Construction time: 0.000 s, 15366 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
fish -> 1
cat -> 1
1 Duplicate!!!
dog -> 1
1 Duplicate!!!
horse -> 2
0 Missing!!!
3 Missing!!!
Here is my modification torecsplit_load.cpp
to print the mapping.
$ cat benchmark/function/recsplit_load.cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <random>
#include <sux/function/RecSplit.hpp>
#define SAMPLES (11)
using namespace std;
using namespace sux::function;
int main(int argc, char **argv) {
if (argc < 3) {
fprintf(stderr, "Usage: %s <keys> <mphf>\n", argv[0]);
return 1;
}
ifstream fin(argv[1]);
string str;
vector<string> keys;
while (getline(fin, str)) keys.push_back(str);
fin.close();
fstream fs;
RecSplit<LEAF, ALLOC_TYPE> rs;
fs.exceptions(fstream::failbit | fstream::badbit);
fs.open(argv[2], std::fstream::in | std::fstream::binary);
fs >> rs;
fs.close();
uint8_t *seen = (uint8_t *)calloc(keys.size(), sizeof(uint8_t));
for (int k = 0; k < keys.size(); k++) {
uint64_t h = rs(keys[k]);
printf("%s -> %lu\n", keys[k].c_str(), h);
if(seen[h] == 1) {
printf("%lu Duplicate!!!\n", h);
} else {
seen[h] = 1;
}
}
for (int k = 0; k < keys.size(); k++) {
if(seen[k] == 0) {
printf("%u Missing!!!\n", k);
}
}
return 0;
}
Here I try to build an MPFH for 2^30 keys using recsplit. I'm running on an Amazon EC2 instance of type m5.4xlarge. Such an instance has 16 cores and 64 GiB of memory.
$ wc -l test.keys
1073741823 test.keys
$ bin/recsplit_load_8 test.keys test.mphf
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
I want to call operator() of RecSplit from multiple threads, but I think the function is not currently thread safe as it modifies some internal data structure (e.g., descriptors). Is it possible to make it thread safe?
Is there a way to discard duplicate keys at construction time? Rooting them out before constructing recsplit can be expensive.
Hello,
I have the following code:
#include "sux/function/RecSplit.hpp"
using namespace sux::function;
int main() {
vector<hash128_t> keys;
for (int i = 0; i < 10000; i++) {
keys.push_back(hash128_t(0, i));
}
RecSplit<8> RS(keys, 100);
}
Compiled with g++ -std=c++17
, this crashes with a segmentation fault. If the loop goes up to only 1000, it works. Why is that? Thanks.
This repository would be easier to find on Github if it had topics.
I don't understand exactly what do you mean by "Succinct data structures in C/C++" and so will most people.
A short description of the project would help to find it and research about it.
In EliasFano.hpp, line 119, upper_bits.size has the term (num_ones + (num_bits >> l) + 1), which is converted to the number of words by adding 63 and dividing by 64. To be consistent, shouldn't lines 137 and 138 call SimpleSelectHalf and SimpleSelectZeroHalf by also adding 1 to (num_ones + (num_bits >> l)? That way, in SimpleSelectHalf, line 83, and SimpleSelectZeroHalf, line 83, the value of num_words will be equal to upper_bits.size.
Hi: Thank you for this!
There is one thing though is that the in C++ your templates would always be statically linked with the code that uses it. And therefore my understanding is that the LGPL requirements would extend to the calling code (in contrast with Java where the linking is different)
Have you consider to use the GPL with a runtime exception instead like this is done in libstdc++ rather than an LGPL license?
See https://gcc.gnu.org/onlinedocs/libstdc++/faq.html#faq.license.what and https://gcc.gnu.org/onlinedocs/libstdc++/manual/license.html
Thank you for your kind consideration!
After playing around with RecSplit a bit, I came to realize that the lookup semantics (the operator()) is different from that of BBHash. That is, in RecSplit a key that is not in the map may return any value. In BBHash, a key that is not in the map will return a zero value, but may also return any value (a false positive). In my experience with BBHash, false positives are rare.
I'm wondering if RecSplit could be adjusted to a similar semantics; that is to return
0
if it is known that the key is not in the map.1, 2, ..., n
corresponding to the n
keys.To be clear, I'm not asking that you change your implementation in this repository, only whether or not it is an inherent limitation with RecSplit's underlying data structures. I've looked at the code and the paper, but on first glance it was not trivial to understand.
I would like to use this library (specifically RecSplit for creating MPHFs) for some experiments in our Erigon project (if you are interested I can describe the possible use case): https://github.com/ledgerwatch/erigon
But I do need Golang binding to do so. I assume there are no bindings yet, and I was thinking about creating a bounty for someone to develop it.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.