simongog / sdsl-lite
Succinct Data Structure Library 2.0
License: Other
Include the following LCP construction algorithms again:
Hard-coded lookup tables are not nice.
Usually it is hard to debug your code with gdb, since it does not show the contents of std containers or sdsl types. However, it is possible to write a gdb script that displays the contents.
Here is an example for the std containers:
http://www.yolinux.com/TUTORIALS/src/dbinit_stl_views-1.03.txt
We should make the same for sdsl ;)
Complex structures like WTs, CSAs, and CSTs can be constructed by calling the construct method. The object created by construct is determined solely by the type of the object and the input; it is not possible to pass values to the constructors of sub data structures. Therefore parameters like block size, sample densities, and so on should be template arguments, so that
You Get What You Declare :)
@tb38 Currently the method returns rank(c, i), and the numbers of smaller/greater symbols are returned by reference. With C++11 we have the possibility to return a three-value tuple with no overhead (move constructor!).
This would, IMHO, make the method easier to use.
Another point: wt_int should have the same functionality as wt_pc for order-preserving shapes. Anyone keen to do the coding?
This class is only used in the construction of wt_int. However, we can also use int_vector_buffer there. I'll replace it and remove temp_write_read_buffer.hpp if nobody else needs it.
The WtByteTests alone take more than 30 minutes to run on Travis CI; especially the IntervalSymbols method is very slow.
Most temporary results during the construction process are stored on and reread from disk. This reduces the memory footprint of the construction. However, for small inputs, reading from and writing to disk produces an overhead compared to in-memory processing.
I would like to solve this by implementing a simple RAM-file system. It would be necessary to encapsulate the file-IO operations to be able to use the RAM-file system.
RAM-files should start with a prefix like "RAM://"...
The old library implemented two versions of the PHI algorithm:
the first uses 5n bytes, the second 4n bytes.
We decided to integrate only the first one, since it is faster and its memory peak is not much higher than that of the fast suffix array construction.
Planned optimization: keep track of the largest LCP value and use only n log(max_lcp) bits instead of n log(n) bits for the resulting LCP array.
The repository contains a few files that are not needed anymore or are not up to date:
anything else?
The content of the algorithms.hpp file looks like it should be moved somewhere else (the namespace is also called algorithm, not algorithms).
It is cleaner to return a pair of rank(i, c) and the determined character c, instead of passing c by reference.
select_support_scan can replace select_support_dummy, since both have the same serialized representation (= nothing is written). The _scan version even supports select queries. Therefore I will remove the _dummy version.
For some bitvectors the mcl select support is not properly constructed.
It would be nice if int_vector overloaded the following operators:
operator++
operator--
operator+=
operator-=
Until now, inverse permutations like the inverse suffix array (inverse of the suffix array) and LF (inverse of psi) have been accessed through operator(). This has led to user confusion several times. We will change the access to isa[] and lf[] to avoid this in the future.
Working with the library, I feel that I'm constrained by the construction workflow.
My understanding of the construction workflow is:
I have multiple concerns with this workflow:
Simple solutions to these concerns are possible, e.g.: (1) one could simply not write the int_vector to disk (as it is now, it is completely loaded into memory anyway). (2) The extension to int_vector_file_buffer is the most extensive code work of these suggestions, but should be decoupled enough to only involve the int_vector_file_buffer implementation. (3) Though it affects the library more, the library interface could rely on a "construct" function in every structure instead of a constructor. This would save memory during construction, since we would not need the temporary object, and it would make it possible to change the parameters of the structure at run time, before the actual construction. This change should be extremely simple, but it affects many structures.
So, what are your thoughts on the suggested changes? I think they would give more flexibility, and at the same time save some resources at construction time.
It would be more comfortable if methods like cache_file_name, register_cache_file, and cache_file_exists also accepted std::string instead of const char*.
The cheatsheet provides a succinct ;) description of the library functionality. For a release, this documentation should exactly match the released library version. Also, the cheatsheet should include a footnote stating which library version it describes.
I have replaced the fixed-size integer vector by an int_vector in Larsson's algorithm. Everything works fine for inputs <= 2G elements, but it breaks for > 2G elements.
It is possibly an unsigned/signed problem.
Probably you moved code from wt.hpp to somewhere else, but forgot to commit it?
Searching for NULL in the code still yields a lot of occurrences. As we are now C++11 compatible, these should all be changed to nullptr.
I'm in favor of renaming good old bit_magic to simply bits. Also, often-used methods like b1Cnt (one-bits count = popcount), i1BP (i-th 1-bit position = select), r1BP (rightmost 1-bit position), and l1BP (leftmost 1-bit position) should get better names.
How about that:
bit_magic::b1Cnt -> bits::cnt
bit_magic::i1BP  -> bits::sel
bit_magic::r1BP  -> bits::lo
bit_magic::l1BP  -> bits::hi
?
It would be nice to have a logger for the allocated space, which outputs the currently allocated space each time an int_vector is created, resized, or deleted.
I will add this functionality to the memory_management class mm.
You can then assign an output stream to mm, and the information will be written into that stream.
Besides bit_magic, lots of the code contains magic numbers, which makes it hard for "newcomers" to get into. For example:
for (size_type i=0; i < 511; ++i)
m_nodes[i] = wt.m_nodes[i];
for (size_type i=0; i<256; ++i)
m_c_to_leaf[i] = wt.m_c_to_leaf[i];
for (size_type i=0; i<256; ++i) {
m_path[i] = wt.m_path[i];
}
or
size_type sb = (m_arg_cnt+4095)>>12;
Maybe we should spend some time on making the code easier to read in this respect.
C++11 introduces the concept of moving objects. The library currently implements this concept only in the int_vector<> class, I think. As we advertise to be C++11 compatible, we should make sure that the main objects (CSA/CST/WT) can be moved efficiently.
Signature of range_search_2d: currently two pointers to vectors are passed to range_search_2d. The found index/value pairs are stored in these two vectors in case the pointers are not nullptr. The number of found index/value pairs is returned.
I would suggest replacing the two pointers in the parameter list by a bool generate_pairs, which indicates whether an index/value pair vector should be created. The return value would then be a pair consisting of the number of found pairs and a vector of index/value pairs. This vector has size zero in the case generate_pairs == false.
This source file basically contains the methods algorithm::count, algorithm::locate, and algorithm::extract.
I suggest moving them to suffix_array_helper.hpp and suffix_tree_helper.hpp, since they expect such a data structure as input, and the user then does not have to include another header besides suffix_[array|tree]s.hpp. Extraction should use backward decoding on a csa_wt CSA but forward decoding on a csa_sada one; this can be done by using the same technique as in the construct methods. This would also make the algorithm namespace obsolete.
I would love to see approximate string matching in SDSL-lite. Especially, I am interested in using compressed suffix trees with wildcard characters in the indexed string.
Example:
Indexed string T=AC*CA, where * is a wildcard character matching any symbol
Query Q=CTC
T has a match of Q at position 1, since the wildcard could be replaced by T. In the same way, there is a match for CGC, CTC, etc.
Note that I do not ask for approximate queries, but only approximate indexed strings. However, approximate queries (for instance wildcards in queries) would be nice too :-)
Thanks,
Sebastian
The following method names are quite long
and should be replaced by
since it is clear from the context that node v represents a subtree.
We have to delete all installed library files on the Jenkins server after the build and test run, since this is not done automatically by Jenkins and causes problems when include files are removed from the library.
There are many "defunct" classes and functions that do not do anything or should not be used in practice.
For example, a library user does not easily see that i1BP is the right function to call instead of the many others (k1BP/j1BP).
The same goes for classes/files like wt_fixed_block.hpp, which are incomplete.
To avoid segfaulting when loading an index created with a different class, we could serialise the hash of the demangle() / util::class_to_hash() output. During loading we compare the stored hash with the hash of the object we are loading and abort if they do not match.
I also suggest we write the size first, and if it is small (say, fewer than 1024 entries) we do not write the hash, as it might contribute to the size of small data structures. For larger data structures the 64-bit hash value is negligible.
It should be possible to determine whether the given install path is relative and then translate it to the corresponding absolute path...
The stopwatch class is not very portable and should be replaced with the more portable C++11 timing classes.
A cheat sheet would make it easier to get an overview of the functionality of the library.
Let's have automatic builds tested using Travis CI.
Inserting assert() into every "access" function would be very useful for debugging purposes, while not affecting runtime performance when compiled with NDEBUG.
The util namespace/class is very messy (in my opinion).
For example, there are many functions in different places that read something from disk, mixed with other functions that are just "helper" functions.
There are also other classes, for example testutils::file, which also perform I/O.
A consistent namespace, maybe sdsl::io, where all file reading/writing is performed, would increase the usability of the library. (Maybe make load/serialize of each class private and force the user to use the load_from_disk/store_to_disk functions?)
I accidentally added multiple binary blobs, which should be removed from the history.
There are quite a few places where we could speed things up using simple OpenMP primitives (similar to what libdivsufsort does). For example:
void construct_init_rank_select() {
util::init_support(m_tree_rank, &m_tree);
util::init_support(m_tree_select0, &m_tree);
util::init_support(m_tree_select1, &m_tree);
}
Of course this would have to be disabled when you do "construction time experiments".
We have a quite powerful memory manager in the library, which could be enhanced to support quite a number of functions.
Some sdsl data structures use std::vectors to store data. Currently there are several helper functions that can be used to serialize and load the content stored in these vectors. Unfortunately, the current implementation is very slow, as each element in the vector is serialised individually. There might be a faster way to do this for POD types.
With C++11 we can now create variadic template functions which handle the execution of the mentioned operations. This will simplify the code.
Currently, the paths to testcases are hard coded into the sources of the test programs. So adding a test case requires recompilation.
It would be more convenient to read the test cases from config files.
I was trying to construct an LCP array and could not figure out how to do it. There seems to be no construct() method which constructs the LCP array from a file. There are also no examples showing how to construct an LCP array. I had to construct a CST and use cst.lcp[] to access the LCP array.
The construct_lcp classes all require cache configs, which I don't know how to use. Do I just point to the SA/text on disk using constants like KEY_BWT? Do these have to be stored in sdsl format or as plain uint8_t or uint64_t? Can I somehow specify the num_bytes type of each input in the cache_config file? For example, KEY_SA with num_bytes=8 would be an uncompressed SA on disk using 64-bit integers.
Another comment: the different values of num_bytes should be defined as constants, similar to KEY_SA.
int_vector_buffer is used most of the time for reading int_vectors buffered from disk.
So I would suggest having std::ios::in as the default value for the second constructor parameter.
Agreed, @tb38?
Releasing a new version is a good time for a license change.
Some files that need to be changed:
As it is done in the benchmarks now: make the code of the test programs simpler, and let make do the job of reading the instances from config files.
Make will also download the test cases from an online source if they are not already present.
Files such as test_index_performance.hpp contain large segments of commented-out code that are not needed anymore. As we are all gentlemen programmers, we should clean up after ourselves and remove these unneeded code segments before release.