simongog / sdsl-lite
Succinct Data Structure Library 2.0
License: Other
Include the following LCP construction algorithms again:
Hard-coded lookup tables are not nice.
Usually it is hard to debug your code with gdb, since it does not show the contents of std containers or sdsl types. However, it is possible to write a gdb script that displays the contents.
Here is an example for the std containers:
http://www.yolinux.com/TUTORIALS/src/dbinit_stl_views-1.03.txt
We should make the same for sdsl ;)
Complex structures like WTs, CSAs, and CSTs can be constructed by calling the construct method. The object created by construct is determined solely by the type of the object and the input; it is not possible to pass values to the constructors of sub data structures. Therefore parameters like block size, sample densities, and so on should be template arguments, so that
You Get What You Declare :)
@tb38 Currently the method returns rank(c, i), and the numbers of smaller/greater symbols are returned by reference. With C++11 we have the possibility to return a three-value tuple with no overhead (move constructor!).
This would, IMHO, make the method easier to use.
Another point: wt_int should have the same functionality as wt_pc for order-preserving shapes. Anyone keen to do the coding?
This class is only used in the construction of wt_int. However, we can also use int_vector_buffer there. I'll replace it and remove temp_write_read_buffer.hpp if nobody else needs it.
The WtByteTests alone take more than 30 minutes to run on Travis CI; especially the IntervalSymbols method is very slow.
Most temporary results during the construction process are stored on and reread from disk. This reduces the memory footprint of the construction. However, for small inputs, reading from and writing to disk produces an overhead compared to in-memory processing.
I would like to solve this by implementing a simple RAM-file system. It would be necessary to encapsulate the file-IO operations to be able to use the RAM-file system.
RAM-files should start with a prefix like "RAM://"...
The old library implemented two versions of the PHI algorithm:
the first uses 5n bytes, the second 4n bytes.
We decided to integrate only the first one, since it is faster and its memory peak is not much higher than that of the fast suffix array construction.
Planned optimization: keep track of the largest LCP value and use only n log(max_lcp) bits instead of n log(n) bits for the resulting LCP array.
The repository contains a few files that are not needed anymore or are not up to date:
anything else?
The content of the algorithms.hpp file looks like it should be moved somewhere else (the namespace is also called algorithm, not algorithms).
It is cleaner to return a pair of rank(i, c) and the determined character c, instead of passing c by reference.
select_support_scan can replace select_support_dummy, since both have the same serialized representation (= nothing is written). The _scan version even supports select queries. Therefore I will remove the _dummy version.
For some bitvectors the mcl select support is not properly constructed.
It would be nice if int_vector overloaded the following operators:
operator++
operator--
operator+=
operator-=
Until now, inverse permutations like the inverse suffix array (inverse of the suffix array) and LF (inverse of psi) have been accessed through operator(). This has led to user confusion several times. We will change the access to isa[] and lf[] to avoid this in the future.
Working with the library, I feel that I'm constrained by the construction workflow.
My understanding of the construction workflow is:
I have multiple concerns with this workflow:
Simple solutions to these concerns are possible, e.g.: (1) one could simply not write the int_vector to disk (as it is now, it is completely loaded into memory anyway). (2) The extension to int_vector_file_buffer is the most extensive code work of these suggestions, but should be decoupled enough to only involve the int_vector_file_buffer implementation. (3) Though it affects the library more, the library interface could rely on a "construct" function in every structure instead of a constructor. This would save memory during construction, since we would not need the temporary object, and it would make it possible to change the parameters of the structure at run time, before the actual construction. This change should be extremely simple, but it affects many structures.
So, what are your thoughts on the suggested changes? I think they would give more flexibility, and at the same time save some resources at construction time.
It would be more comfortable if methods like cache_file_name, register_cache_file, and cache_file_exists also accepted std::string instead of const char*.
The cheatsheet provides a succinct ;) description of the library functionality. For a release, this documentation should exactly match the released library version. Also, the cheatsheet should include a footnote stating which library version it describes.
I have replaced the fixed-size integer vector by an int_vector in Larsson's algorithm. Everything works fine for inputs <= 2G elements, but it breaks for > 2G elements.
It is possibly an unsigned/signed problem.
Probably you moved code from wt.hpp to somewhere else, but forgot to commit it?
Searching for NULL in the code still yields a lot of occurrences. As we are now C++11 compatible, these should all be changed to nullptr.
I'm in favor of renaming good old bit_magic to simply bits. Also, often-used methods like b1Cnt (one-bits count = popcount), i1BP (i-th 1-bit position = select), r1BP (rightmost 1-bit position), and l1BP (leftmost 1-bit position) should get better names.
How about that:
bit_magic::b1Cnt -> bits::cnt
bit_magic::i1BP  -> bits::sel
bit_magic::r1BP  -> bits::lo
bit_magic::l1BP  -> bits::hi
?
It would be nice to have a logger for the allocated space, which outputs the currently allocated space each time an int_vector is created, resized, or deleted.
I will add this functionality to the memory_management class mm.
You can then assign an output stream to mm, and the information will be written into that stream.
Besides bit_magic, lots of the code contains magic numbers, which makes it hard for "newcomers" to get into. For example:
for (size_type i=0; i < 511; ++i)
m_nodes[i] = wt.m_nodes[i];
for (size_type i=0; i<256; ++i)
m_c_to_leaf[i] = wt.m_c_to_leaf[i];
for (size_type i=0; i<256; ++i) {
m_path[i] = wt.m_path[i];
}
or
size_type sb = (m_arg_cnt+4095)>>12;
Maybe we should spend some time on making the code easier to read in this respect.
C++11 introduces the concept of moving objects. The library currently implements this concept only in the int_vector<> class, I think. As we advertise to be C++11 compatible, we should make sure that the main objects (CSA/CST/WT) can be moved efficiently.
Signature of range_search_2d: currently two pointers to vectors are passed to range_search_2d. The found index/value pairs are stored in these two vectors in case the pointers are not nullptr. The number of found index/value pairs is returned.
I would suggest replacing the two pointers in the parameter list by a bool generate_pairs, which indicates whether an index/value pair vector should be created. The return value would then be a pair consisting of the number of found pairs and a vector of index/value pairs. This vector has size zero in the case generate_pairs == false.
This source file basically contains the methods algorithm::count, algorithm::locate, and algorithm::extract.
I suggest moving them to suffix_array_helper.hpp and suffix_tree_helper.hpp, since they expect such a data structure as input, and the user then does not have to include another header besides suffix_[array|tree]s.hpp. Extraction should use backward decoding on a csa_wt CSA but forward decoding on a csa_sada one; this can be done by using the same technique as in the construct methods. This would also make the algorithm namespace obsolete.
I would love to see approximate string matching in SDSL-lite. Especially, I am interested in using compressed suffix trees with wildcard characters in the indexed string.
Example:
Indexed string T=AC*CA, where * is a wildcard character matching any symbol
Query Q=CTC
T has a match of Q at position 1, since the wildcard could be replaced by T. In the same way, there is a match for CGC, CTC, etc.
Note that I do not ask for approximate queries, but only approximate indexed strings. However, approximate queries (for instance wildcards in queries) would be nice too :-)
Thanks,
Sebastian
The following method names are quite long
and should be replaced by
since it is clear from the context that node v represents a subtree.
We have to delete all installed library files on the Jenkins server after the build and test run, since this is not done automatically by Jenkins and causes problems when include files are removed from the library.
There are many "defunct" classes and functions that do not do anything or should not be used in practice.
For example, a library user does not easily see that i1BP is the right function to call instead of the many others (k1BP/j1BP).
The same goes for classes/files like wt_fixed_block.hpp, which are incomplete.
To avoid segfaulting when loading an index created with a different class, we could serialise the hash of the demangle() / util::class_to_hash() output. During loading we compare the stored hash with the hash of the object we are loading and abort if they do not match.
I also suggest we write the size first, and if it is small (say, fewer than 1024 entries) we do not write the hash, as it might contribute to the size of small data structures. For larger data structures the 64-bit hash value is negligible.
It should be possible to determine whether the given install path is relative and then translate it to the corresponding absolute path...
The stopwatch class is not very portable and should be replaced with the more portable C++11 timing classes.
A cheat sheet would make it easier to get an overview of the functionality of the library.
Let's have automatic builds tested using Travis CI.
Inserting assert() into every "access" function would be very useful for debugging purposes, while not affecting runtime performance when compiled with NDEBUG.
The util namespace/class is very messy (in my opinion).
For example, there are many functions in different places that read something from disk, mixed with other functions that are just "helper" functions.
There are also other classes, for example testutils::file, which also perform I/O.
A consistent namespace, maybe sdsl::io, where all file reading/writing is performed, would increase the usability of the library. (Maybe make load/serialize of each class private and force the user to use the load_from_disk/store_to_disk functions?)
I accidentally added multiple binary blobs, which should be removed from the history.
There are quite a few places where we could speed things up using simple OpenMP primitives (similar to what libdivsufsort does). For example:
void construct_init_rank_select() {
util::init_support(m_tree_rank, &m_tree);
util::init_support(m_tree_select0, &m_tree);
util::init_support(m_tree_select1, &m_tree);
}
Of course this would have to be disabled when you do "construction time experiments".
We have a quite powerful memory manager in the library, which could be enhanced to support quite a number of functions.
Some sdsl data structures use std::vectors to store data. Currently there are several helper functions that can be used to serialize and load the content stored in these vectors. Unfortunately, the current implementation is very slow, as each element in the vector is serialised individually. There might be a faster way to do this for POD types.
With C++11 we can now create variadic template functions which handle the execution of the mentioned operations. This will simplify the code.
Currently, the paths to testcases are hard coded into the sources of the test programs. So adding a test case requires recompilation.
It would be more convenient to read the test cases from config files.
I was trying to construct an LCP array and could not figure out how to do it. There seems to be no construct() method which constructs the LCP array from a file. There are also no examples showing how to construct an LCP array. I had to construct a CST and use cst.lcp[] to access the LCP array.
The construct_lcp classes all require cache configs, which I don't know how to use. Do I just point to the SA/text on disk using constants like KEY_BWT? Do these have to be stored in sdsl format or as plain uint8_t or uint64_t? Can I somehow specify the num_bytes type of each input in the cache_config file? For example, KEY_SA with num_bytes=8 would be an uncompressed SA on disk using 64-bit integers.
Another comment: the different values of num_bytes should be defined as constants, similar to KEY_SA.
int_vector_buffer is used most of the time for reading int_vectors buffered from disk.
So I would suggest having std::ios::in as the default value for the second constructor parameter.
Agreed, @tb38?
Releasing a new version is a good time for a license change.
Some files that need to be changed:
As it is done in the benchmarks now: make the code of the test programs simpler, and let make do the job of reading the instances from config files.
Make will also download the test cases from an online source if they are not already present.
Files such as test_index_performance.hpp contain large segments of commented-out code that are not needed anymore. As we are all gentlemen programmers, we should clean up after ourselves and remove these unneeded code segments before release.