xxsds / dynamic Goto Github PK
View Code? Open in Web Editor NEWDynamic succinct/compressed data structures
License: MIT License
Dynamic succinct/compressed data structures
License: MIT License
Hi, I wanted to use multiple of rle_str, but I get segmentation fault when constructing the vector. Here is MWE:
#include <dynamic.hpp>
using namespace std;
using namespace dyn;
int main() {
vector<rle_str> strs(2,rle_str(5));
return 0;
}
Reproduced the bug on both clang-8 and gcc-8.
When constructing without the specified alphabet size vector<rle_str> strs(2)
everything works fine.
What is strange that this
pointer is NULL
which was assigned in node(const node& other)
constructor.
Thank you for the effort of creating this library.
Jan
Stacktrace:
dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > >::node::operator= wt_string.hpp:305
dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > >::node::operator= wt_string.hpp:305
dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > >::node::node wt_string.hpp:301
dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > >::wt_string wt_string.hpp:32
dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >::rle_string rle_string.hpp:28
std::_Construct<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >, dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > const&> stl_construct.h:75
std::__uninitialized_fill_n<false>::__uninit_fill_n<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >*, unsigned long, dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > > stl_uninitialized.h:210
std::uninitialized_fill_n<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >*, unsigned long, dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > > stl_uninitialized.h:255
std::__uninitialized_fill_n_a<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >*, unsigned long, dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >, dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > > stl_uninitialized.h:366
std::vector<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >, std::allocator<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > > >::_M_fill_initialize stl_vector.h:1480
std::vector<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > >, std::allocator<dyn::rle_string<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >, dyn::wt_string<dyn::succinct_bitvector<dyn::spsi<dyn::packed_vector, 8192u, 16u> > > > > >::vector stl_vector.h:430
main test_size.cpp:7
__libc_start_main 0x00007f4a41191830
_start 0x0000000000429099
rle-lz77-2 returns wrong number of lz77 phrases. Check last commits.
It would be useful (I think) to have a way to make a rle_str
from a vector<int>
, or to insert a whole vector<int>
into an rle_str
at some position.
Would breaking those operations out as batch operations make them any more efficient than just inserting each value one at a time? I don't know much about the internals, but I suspect a workload of inserts all at the very end, or all one after the other, might produce tree balancing problems.
The main dynamic.hpp
wants to include internal files as #include <internal/whatever.hpp>
. To satisfy this when installing DYNAMIC into a prefix, we have to dump DYNAMIC's internal
directory straight into the root of the prefix (or tell downstream projects to use -I<prefix>/dynamic
). Either solution causes trouble if any other e.g. internal/bwt.hpp
exists in the project's include paths.
DYNAMIC ought to reference its internal includes via a path that has the library's name in it, to prevent filename conflicts and people scratching their heads at what package is responsible for a directory like /usr/local/include/internal
.
See https://github.com/nicolaprezza/DYNAMIC/blob/master/include/internal/spsi.hpp#L1017-L1025
The running total of bytes written is not being updated here.
Orthogonal to #5, it would be nice to have (and probably easier to implement) batch construction of wavelet trees which is agnostic to the underlying bitvector implementation. While still making serial inserts (push back) to the individual bitvectors, acceleration is possible as follows:
Given a string of symbols to batch-create a new wavelet tree, first scan the string and encode all the symbols using the chosen alphabet encoder. Build an "empty" wavelet tree with nodes for each of the encoded symbols. For each node, precompute the symbols which belong to its partition, as well as whether those symbols belong to its left or right child. Now, the bitvectors of each node can be populated in parallel by iterating symbols in the string, and:
Since the empty tree is pre-constructed, and the bitvectors are all independent, there is no synchronization to worry about, and all threads can read from one copy of the string in memory. Run one thread per node up to a max number of concurrent threads.
Conceptually it is that simple but there are a few things to optimize, e.g. you can wait to create a node and initiate its thread until coming across the first symbol belonging to it, so you don't actually have to pre-scan the string first.
This would also work for push_back_many
or even insert_many
, where you have threads per node that scan the whole vector of symbols to push back, or the pairs of (symbol, position) to insert, and only do the push back or insert if the symbol belongs to its partition.
There should be support for serialization and deserialization.
The dynamic.hpp
header has some template specializations in it, which means that if it is included in multiple compilation units, it produces a multiple definitions linker error when you go to link them together (because the object code for the specializations gets generated twice). If you want to use template specializations like that, you need to put them in a .cpp file and generate an object file to hold them.
I think an alternative may be to make the specializations themselves still templates, somehow. Or to inline the specializations.
Hello,
ran into
$ obj-$(dpkg-architecture -qDEB_TARGET_GNU_TYPE)/benchmark -g 1000000 0.01
size = 1000000. P = 0.01
Benchmarking gap bitvector
insert ... done.
access ... done.
rank 0 ... done.
rank 1 ... done.
select 0 ... done.
select 1 ... done.
remove ... benchmark: /home/moeller/git/med-team/dynamic/dynamic-1.0~alpha.1/include/dynamic/internal/packed_vector.hpp:465: void dyn::packed_vector::remove(uint64_t): Assertion `(size_ / int_per_word_ == words.size() || !(words[size_ / int_per_word_] >> ((size_ % int_per_word_) * width_))) && "uninitialized non-zero values in the end of the vector"' failed.
Aborted (core dumped)
And peeking a boo in gdb:
Core was generated by `obj-x86_64-linux-gnu/benchmark -g 1000000 0.01'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fa19f15e55b in __GI_abort () at abort.c:79
#2 0x00007fa19f15e42f in __assert_fail_base (fmt=0x7fa19f2c4b48 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=0x557aacd476a0 "(size_ / int_per_word_ == words.size() || !(words[size_ / int_per_word_] >> ((size_ % int_per_word_) * width_))) && \"uninitialized non-zero values in the end of the vector\"",
file=0x557aacd47058 "/home/moeller/git/med-team/dynamic/xxsds-dynamic-1.0~alpha.1/include/dynamic/internal/packed_vector.hpp", line=465, function=<optimized out>) at assert.c:92
#3 0x00007fa19f16d092 in __GI___assert_fail (
assertion=0x557aacd476a0 "(size_ / int_per_word_ == words.size() || !(words[size_ / int_per_word_] >> ((size_ % int_per_word_) * width_))) && \"uninitialized non-zero values in the end of the vector\"",
file=0x557aacd47058 "/home/moeller/git/med-team/dynamic/xxsds-dynamic-1.0~alpha.1/include/dynamic/internal/packed_vector.hpp", line=465, function=0x557aacd477e0 "void dyn::packed_vector::remove(uint64_t)") at assert.c:101
#4 0x0000557aacd3867c in dyn::packed_vector::remove (this=0x557aae434630, i=<optimized out>) at ./include/dynamic/internal/packed_vector.hpp:981
#5 0x0000557aacd4373a in dyn::spsi<dyn::packed_vector, 256u, 16u>::node::remove (this=0x557aae4232c0, i=3) at ./include/dynamic/internal/packed_vector.hpp:591
#6 0x0000557aacd46a16 in dyn::spsi<dyn::packed_vector, 256u, 16u>::remove (this=0x7ffd24959ba0, i=3) at ./include/dynamic/internal/spsi.hpp:276
#7 dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >::delete1 (i=123, this=0x7ffd24959ba0) at ./include/dynamic/internal/gap_bitvector.hpp:253
#8 dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> >::remove (i=123, this=0x7ffd24959ba0) at ./include/dynamic/internal/gap_bitvector.hpp:197
#9 benchmark_bv<dyn::gap_bitvector<dyn::spsi<dyn::packed_vector, 256u, 16u> > > (size=1000000, p=<optimized out>) at ./benchmark.cpp:122
#10 0x0000557aacd373f5 in main (argc=<optimized out>, argv=0x7ffd24959d48) at ./benchmark.cpp:165
but found it to reproduce only every 2nd to 5th invocation. Spotted with version 1.0-alpha.1.
Anything you want me to do about it?
Cheers,
Steffen
{
dyn::succinct_spsi spsi;
spsi.insert(0,0);
spsi.remove(0);
spsi.insert(0,0);
spsi.insert(0,0);
spsi.insert(2,1);
spsi.insert(1,1);
spsi.remove(3);
for(size_t i = 0; i < spsi.size(); ++i)
std::cout << spsi[i] << ",";
std::cout << std::endl;
spsi.insert(3,1);
for(size_t i = 0; i < spsi.size(); ++i)
std::cout << spsi[i] << ",";
std::cout << std::endl;
}
outputs
0,1,0,
0,0,0,1,
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.