Giter VIP home page Giter VIP logo

sux's Issues

RecSplit MPHF mapping

I am trying to get the mapping from keys to unique indices out of a RecSplit MPHF. I created a file with 4 strings and passed it to recsplit_dump_8, creating an MPHF. I modified recsplit_load.cpp (shown below) to display the mapping. However, the mapping is not a bijection. I also tried with a million keys and could not get an MPHF.

I've only been working with the tool for a few hours today, but I can't see how I'm using the interface incorrectly. Any help would be appreciated.

Here is the output of my run:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   benchmark/function/recsplit_load.cpp

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	test.keys
	test.mphf

no changes added to commit (use "git add" and/or "git commit -a")
$ make recsplit
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump.cpp -o bin/recsplit_dump_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump128.cpp -o bin/recsplit_dump128_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load.cpp -o bin/recsplit_load_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load128.cpp -o bin/recsplit_load128_8
$ cat test.keys
fish
cat
dog
horse
$ bin/recsplit_dump_8 test.keys 2 test.mphf
Building...
Elias-Fano cumul sizes:  3.000000 bits/bucket
Elias-Fano cumul bits:   5.000000 bits/bucket
Elias-Fano cumul sizes:  1.500000 bits/key
Elias-Fano cumul bits:   2.500000 bits/key
Rice-Golomb descriptors: 1.500000 bits/key
Total bits:              5.500000 bits/key
Construction time: 0.000 s, 15366 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
fish -> 1
cat -> 1
1 Duplicate!!!
dog -> 1
1 Duplicate!!!
horse -> 2
0 Missing!!!
3 Missing!!!

Here is my modification torecsplit_load.cpp to print the mapping.

$ cat benchmark/function/recsplit_load.cpp 
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <random>
#include <sux/function/RecSplit.hpp>

#define SAMPLES (11)

using namespace std;
using namespace sux::function;

int main(int argc, char **argv) {
	if (argc < 3) {
		fprintf(stderr, "Usage: %s <keys> <mphf>\n", argv[0]);
		return 1;
	}

	ifstream fin(argv[1]);
	string str;
	vector<string> keys;
	while (getline(fin, str)) keys.push_back(str);
	fin.close();

	fstream fs;
	RecSplit<LEAF, ALLOC_TYPE> rs;

	fs.exceptions(fstream::failbit | fstream::badbit);
	fs.open(argv[2], std::fstream::in | std::fstream::binary);
	fs >> rs;
	fs.close();

	uint8_t *seen = (uint8_t *)calloc(keys.size(), sizeof(uint8_t));
	
	for (int k = 0; k < keys.size(); k++) {
	  uint64_t h = rs(keys[k]);
	  printf("%s -> %lu\n", keys[k].c_str(), h);
	  if(seen[h] == 1) {
	    printf("%lu Duplicate!!!\n", h);
	  } else {
	    seen[h] = 1;
	  }
	}

	for (int k = 0; k < keys.size(); k++) {
	  if(seen[k] == 0) {
	    printf("%u Missing!!!\n", k);
	  }
	}
	
	return 0;
}

Recsplit core dump on large keyfiles

Here I try to build an MPFH for 2^30 keys using recsplit. I'm running on an Amazon EC2 instance of type m5.4xlarge. Such an instance has 16 cores and 64 GiB of memory.

$ wc -l test.keys 
1073741823 test.keys
$ bin/recsplit_load_8 test.keys test.mphf
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

Making RecSplit MPHF thread safe

I want to call operator() of RecSplit from multiple threads, but I think the function is not currently thread safe as it modifies some internal data structure (e.g., descriptors). Is it possible to make it thread safe?

Duplicate keys in RecSplit

Is there a way to discard duplicate keys at construction time? Rooting them out before constructing recsplit can be expensive.

Segmentation fault in RecSplit

Hello,

I have the following code:

#include "sux/function/RecSplit.hpp"

using namespace sux::function;

int main() {
	vector<hash128_t> keys;
	for (int i = 0; i < 10000; i++) {
		keys.push_back(hash128_t(0, i));
	}
	RecSplit<8> RS(keys, 100);
}

Compiled with g++ -std=c++17, this crashes with a segmentation fault. If the loop goes up to only 1000, it works. Why is that? Thanks.

Improve description

I don't understand exactly what do you mean by "Succinct data structures in C/C++" and so will most people.
A short description of the project would help to find it and research about it.

Creating select_upper and selectz_upper in EliasFano.hpp

In EliasFano.hpp, line 119, upper_bits.size has the term (num_ones + (num_bits >> l) + 1), which is converted to the number of words by adding 63 and dividing by 64. To be consistent, shouldn't lines 137 and 138 call SimpleSelectHalf and SimpleSelectZeroHalf by also adding 1 to (num_ones + (num_bits >> l)? That way, in SimpleSelectHalf, line 83, and SimpleSelectZeroHalf, line 83, the value of num_words will be equal to upper_bits.size.

Licensing

Hi: Thank you for this!
There is one thing though is that the in C++ your templates would always be statically linked with the code that uses it. And therefore my understanding is that the LGPL requirements would extend to the calling code (in contrast with Java where the linking is different)
Have you consider to use the GPL with a runtime exception instead like this is done in libstdc++ rather than an LGPL license?
See https://gcc.gnu.org/onlinedocs/libstdc++/faq.html#faq.license.what and https://gcc.gnu.org/onlinedocs/libstdc++/manual/license.html

Thank you for your kind consideration!

Lookup semantics

After playing around with RecSplit a bit, I came to realize that the lookup semantics (the operator()) is different from that of BBHash. That is, in RecSplit a key that is not in the map may return any value. In BBHash, a key that is not in the map will return a zero value, but may also return any value (a false positive). In my experience with BBHash, false positives are rare.

I'm wondering if RecSplit could be adjusted to a similar semantics; that is to return

  • 0 if it is known that the key is not in the map.
  • index values: 1, 2, ..., n corresponding to the n keys.
  • false positives may be allowed.

To be clear, I'm not asking that you change your implementation in this repository, only whether or not it is an inherent limitation with RecSplit's underlying data structures. I've looked at the code and the paper, but on first glance it was not trivial to understand.

Are they any bindings for Golang?

I would like to use this library (specifically RecSplit for creating MPHFs) for some experiments in our Erigon project (if you are interested I can describe the possible use case): https://github.com/ledgerwatch/erigon
But I do need Golang binding to do so. I assume there are no bindings yet, and I was thinking about creating a bounty for someone to develop it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.