Giter VIP home page Giter VIP logo

ilyagrebnov / libsais Goto Github PK

View Code? Open in Web Editor NEW
168.0 15.0 20.0 390 KB

libsais is a library for linear time suffix array, longest common prefix array and burrows wheeler transform construction based on induced sorting algorithm.

License: Apache License 2.0

C 99.88% CMake 0.12%
bwt burrows-wheeler-transform saca suffix-array sais divsufsort suffixarray lcp lcp-array longest-common-prefix

libsais's People

Contributors

atgoogat avatar ilyagrebnov avatar schafferde avatar tansy avatar zvezdochiot avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libsais's Issues

feature request: binary BWT

Hello. I'm grateful for your library.
However, I would really appreciate if you would add function to perform BWT over binary strings with 1/0 alphabet. Sometimes it's important for entropy analysis. But it require a lot of memory and computation time. Possibly, algorithm could be optimized for such case.

Feature request?

When dealing with large files, for space reasons I often need a partial suffix array, that is, the suffix array of every other (or every n-th) positions of the file. I am guessing it would be much faster than sorting the whole file. Would this be a sensible feature or it would require a considerate amount of work?

intel compiler: libsais64 doesn't produce correct SuffixArray

Using the Intel compiler and calling the libsais64 does not result in correct suffix array.

How to reproduce

  • Checkout newest commit #d5e74eb93

  • create test.c file
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #include <libsais.h>
    #include <libsais64.h>
    
    int main() {
        // input data
        char *Text = "abracadabra";
        int n = strlen(Text);
        int i, j;
    
        // allocate
        int64_t *SA = (int64_t *)malloc(n * sizeof(int64_t));
    
        // sort
        libsais64((unsigned char *)Text, SA, n, 0, NULL);
    
        // output
        for(i = 0; i < n; ++i) {
            printf("SA[%2d] = %2ld: ", i, SA[i]);
            for(j = SA[i]; j < n; ++j) {
                printf("%c", Text[j]);
            }
            printf("$\n");
        }
    
        // deallocate
        free(SA);
    
        return 0;
    }
  • compiled with icx -O2 -DNDEBUG src/libsais64.c src/libsais.c test.c
  • run ./a.out

Output:

SA[ 0] = 10: a$                                                                                                                                                
SA[ 1] =  7: abra$                                                                                                                                             
SA[ 2] =  0: abracadabra$                                                                                                                                      
SA[ 3] =  3: acadabra$                                                                                                                                         
SA[ 4] =  0: abracadabra$                                                                                                                                      
SA[ 5] =  0: abracadabra$                                                                                                                                      
SA[ 6] =  3: acadabra$                                                                                                                                         
SA[ 7] =  0: abracadabra$                                                                                                                                      
SA[ 8] =  0: abracadabra$                                                                                                                                      
SA[ 9] =  0: abracadabra$                                                                                                                                      
SA[10] =  0: abracadabra$       

Expected Output:

SA[ 0] = 10: a$
SA[ 1] =  7: abra$
SA[ 2] =  0: abracadabra$
SA[ 3] =  3: acadabra$
SA[ 4] =  5: adabra$
SA[ 5] =  8: bra$
SA[ 6] =  1: bracadabra$
SA[ 7] =  4: cadabra$
SA[ 8] =  6: dabra$
SA[ 9] =  9: ra$
SA[10] =  2: racadabra$

Version

icx --version

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icx.cfg

Things I know:

  • using -O1 works, code works with -O2 under clang and gcc.
  • libsais works but libsais64 doesn't

Thank you for all the help!

Makefile missing

Hello, it looks like the Makefile was deleted in 3847654. Is this intentional? Is there a new preferred way to build this library now? Thanks.

segmentation fault when building the suffix array for some large texts

With some large texts I get a segmentation fault in libsais64 and libsais64_omp (see Screenshot 2023-04-13 100612). Here is one of the texts the issue occurs with: https://drive.google.com/file/d/1UJLAHUW9KZVrW7Y8zekAJLnVp6Au_epc/view?usp=sharing. I prepared an example repository "test_libsais" on my github to test this. I can confirm that the issue occurs on Ubuntu 20.04 in WSL2 (with i7-12700K and i7-1160G7) and on Ubuntu 18.04 (with 2x AMD EPYC 7452). I tested with GCC 9.4.0 and 11.1.0 and libomp-dev (5.0.1-1).

Crash for file size close to 2GB

As an old user looking forward to switch from libdivsufsort, I noticed libsais would crash as file size approaches 2GB (with or without giving extra space), while divsufsort won't as long as file size is strictly under 2G (210241024*1024). I wonder what is the max size doable without switching to 64-bit version.

strange bug in combination with malloc_count

I am not sure where the root of this issue lies, but on some operating systems, a strange bug occurs when using libsais with OpenMP enabled in combination with malloc_count (https://github.com/bingmann/malloc_count) and gcc. I have not encountered any issues when using malloc_count in combination with any other library yet, so I suspected the issue might be linked to libsais and decided to post it here. I prepared a repository to illustrate it: https://github.com/LukasNalbach/libsais-malloc_count-bug

test-1.cpp uses malloc_count and libsais: Screenshot 2023-07-23 120025
test-2.cpp uses libsais without malloc_count and test-3.cpp uses malloc_count without libsais, no errors.

The bug does not occur when using clang. It also occurs in the sigle-thread version, as long as LIBSAIS_USE_OPENMP is set to ON. It seems to occurr regardless of the gcc and OpenMP versions used. For example, on Ubuntu 18.04 and 22.04 the bug occurs, but on 20.04 and 23.04 it does not. The issue does not occur when executing the program with valgrind with the options --tool=memcheck --leak-check=full.

Even just linking against (not including or calling) libsais when also using malloc_count can cause the same issue in a simple "#pragma omp parallel for" region, regardless of whether LIBSAIS_USE_OPENMP is set to ON. I have not been able to replicate this issue in the example repository, but for example, it occurs here: https://github.com/LukasNalbach/move-r/blob/b615972ca07ea4031416b7450bea65c762cc9e7d/include/move_r/algorithms/construction/modes/common.cpp#L12

edit: Setting THREAD_SAFE_GCC_INTRINSICS to 1 does not solve the issue.

multiple vulnerabilities in bzip3

Hello,

I reported multiple vulnerabilities in bzip3, and some of these looks to be related to libsais.

bzip3 uses its own version of libsais, but you may want to see the report to understand if those issues affect your libsais as well.

As a side note:
If you have C tests that uses libsais to parse untrusted input, I can help discover bugs with fuzzing.

Overlapping parameters are passed to memcpy

I was playing around with libsais and Silesia corpus when I encountered a possibility for overlapping parameters passed to memcpy. This can happen precisely at libsais_reconstruct_compacted_lms_suffixes_32s_2k_omp libsais/src/libsais.c:3912.

Whether this causes problems or not, seems to depend on environment. In an environment where calls to memcpy are redirected to memmove this does not seem to cause problems and is only detected with AddressSanitizer. However in an environment where memcpy is actually utilized, this causes either a corrupted SA and/or segmentation fault, and is detected also with Valgrind.

I was able to isolate the part of the payload causing this error and wrote a minimal reproducer for it. Payload is a 32kiB block extracted from concatenation of Silesia corpus contents in alphabetical order.

How to use reproducer:
Compile with gcc -o reproducer -g -fsanitize=address reproducer.c libsais.c
Run with ./reproducer payload

If there is some information that I could provide additionally, please feel free to ask.

UB in libsais_unbwt

Hi! The following C code will trigger undefined behaviour in libsais' unbwt code:

unsigned char arr[] = { 0xFB, 0xB7, 0x46, 0xA8, 0x13, 0xBC, 0x88, 0xC8, 0x9B, 0xBC, 0x97, 0xCB, 0x1A, 0xA6, 0xAE, 0x96, 0xBC, 0xBD, 0x13, 0xB7, 0xA3, 0xE2, 0x95, 0x88, 0x9B, 0xB6, 0xC2, 0x87, 0x65, 0x77, 0xF7, 0xB8, 0x8E, 0xCE, 0xE1, 0xCB, 0x9F, 0x63, 0x9B, 0xF3, 0xCB, 0x63, 0x42, 0x26, 0x14, 0x2F, 0xC4, 0xCE, 0x43 };
int size = 49; int bwt_idx = 1;

int main(void) {
    s32 * A = (s32 *) malloc(sizeof(s32) * (size + 1));
    libsais_unbwt(arr, arr, A, size, NULL, bwt_idx);
    printf("%d\n", arr[0]);
}

The Valgrind output follows:

==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5C2: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5EA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5A8: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Conditional jump or move depends on uninitialised value(s)
==2044781==    at 0x11D5CA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D607: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)

After a closer investigation, this happens on the if statement below:

static void libsais_unbwt_decode_1(u8 * RESTRICT U, sa_uint_t * RESTRICT P, sa_uint_t * RESTRICT bucket2,
                                   u16 * RESTRICT fastbits, fast_uint_t shift, fast_uint_t * i0, fast_uint_t k) {
    u16 * RESTRICT U0 = (u16 *)(void *)U;

    fast_uint_t i, p0 = *i0;

    for (i = 0; i != k; ++i) {
        u16 c0 = fastbits[p0 >> shift];
        if (bucket2[c0] <= p0) {
            do {
                c0++;
            } while (bucket2[c0] <= p0);
        }
        p0 = P[p0];
        U0[i] = libsais_bswap16(c0);
    }

    *i0 = p0;
}

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.