The string-algorithms from krzysztof-turowski

Fix a bug in Four Russians LCS algorithm

The following test:

approximate_string_matching.lcs.four_russians(
    '#bbbaaab', '#aaaabbb', 7, 7, approximate_string_matching.distance.INDEL_DISTANCE)

returns 3 instead of 4.

There must be a bug somewhere in four_russians.py.
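The expected value of 4 can be checked independently with the textbook quadratic dynamic program. This is a minimal sketch, not the repository's code: `lcs_length` is a hypothetical helper, and it assumes the leading `#` in the test strings is a padding character for 1-based indexing and that the two 7s are the string lengths.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, textbook O(|a| * |b|) DP."""
    prev = [0] * (len(b) + 1)  # DP row for the prefix of `a` processed so far
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Assumption: '#' is padding, so it is stripped before comparing.
print(lcs_length('#bbbaaab'[1:], '#aaaabbb'[1:]))  # prints 4 (e.g. "aaab")
```

An oracle like this can also be run against random short strings to locate the smallest input on which the Four Russians implementation disagrees.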

Implement alternative indices for standard text problems

Example publications:

LZ-index from Navarro - Indexing text using the Ziv-Lempel trie,
FM-index from Ferragina, Manzini - Opportunistic data structures with applications,
succinct replacements for LCP and suffix array from Sadakane - Succinct representations of lcp information and improvements in the compressed suffix arrays,
suffix trays and suffix trists from Cole, Kopelowitz, Lewenstein - Suffix Trays and Suffix Trists: Structures for Faster Text Indexing.

Implement shortest common superstring algorithms

a 5/2-approximation algorithm from Sweedyk - A 2 1/2-approximation algorithm for shortest superstring
a 7/2-approximation bound for the greedy algorithm from Kaplan, Shafrir - The greedy algorithm for shortest superstrings
an 8/3-approximation algorithm from Armen, Stein - A 2 2/3-Approximation Algorithm for the Shortest Superstring Problem
an 8/3-approximation algorithm from Breslauer, Jiang, Jiang - Rotations of Periodic Strings and Short Superstrings
a 109/42-approximation algorithm from Breslauer, Jiang, Jiang - Rotations of Periodic Strings and Short Superstrings
a 57/23-approximation algorithm from Mucha - Lyndon Words and Short Superstrings
a 71/30-approximation algorithm from Paluch - Better Approximation Algorithms for Maximum Asymmetric Traveling Salesman and Shortest Superstring
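As a baseline for comparing the algorithms listed above, the greedy heuristic named in the Kaplan, Shafrir title can be sketched as follows: repeatedly merge the two strings with the largest overlap. The function names are hypothetical and the implementation is deliberately naive (cubic or worse), intended only as a reference point.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: merge the pair with maximum overlap until one string is left."""
    items = list(dict.fromkeys(strings))  # drop duplicates, keep input order
    # Strings contained in another input never need to be merged explicitly.
    items = [s for s in items if not any(s != t and s in t for t in items)]
    while len(items) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = items[i] + items[j][k:]
        items = [s for idx, s in enumerate(items) if idx not in (i, j)]
        items.append(merged)
    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))  # prints abcde
```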

Implement algorithms for distance and longest common subsequence problem

Example publications for LCS:

Apostolico, Guerra - The longest common subsequence problem revisited
Apostolico, Browne, Guerra - Fast linear-space computations of longest common subsequences
Chin, Poon - A fast algorithm for computing longest common subsequences of small alphabet size
Eppstein, Galil, Giancarlo, Italiano - Sparse dynamic programming I: Linear cost functions

Implement longest common prefix algorithms

Fischer - Inducing the LCP-Array
Gog, Ohlebusch - Compressed Suffix Trees: Efficient Computation and Storage of LCP-Values (conference version: Fast and Lightweight LCP-Array Construction Algorithms)
Manzini - Two Space Saving Tricks for Linear Time LCP Array Computation
Puglisi, Turpin - Space-Time Tradeoffs for Longest-Common-Prefix Array Computation
Sadakane - Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays

Implementation and comparison of BWT encoding/decoding algorithms

Overall, the aim is to implement several BWT and IBWT transformation algorithms, and to compare them in terms of efficiency (number of character accesses) and running time.
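A naive reference pair is useful as a correctness baseline for such a comparison. The sketch below is an assumption-laden illustration, not one of the algorithms from the literature: `bwt` sorts all rotations explicitly, `inverse_bwt` walks the LF mapping, and both assume a sentinel character (`$` here) that does not occur in the text.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform by explicitly sorting all rotations (quadratic space)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the transform by following the LF mapping from the sentinel row."""
    n = len(last)
    # A stable sort of positions by character yields the first column;
    # lf[i] is the row whose first character corresponds to last[i].
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for row, i in enumerate(order):
        lf[i] = row
    # The row ending with the sentinel is the original string text + sentinel.
    row = last.index(sentinel)
    reversed_chars = []
    for _ in range(n):
        reversed_chars.append(last[row])
        row = lf[row]
    # Collected order: sentinel, then the text from its last character to its first.
    return "".join(reversed(reversed_chars))[:-1]


print(bwt("banana"))           # prints annb$aa
print(inverse_bwt("annb$aa"))  # prints banana
```

To measure character accesses as the issue proposes, one option is to wrap the input in a small class that counts calls to `__getitem__`.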

Selected relevant literature:
[1] Adjeroh, Bell, Mukherjee - The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching
[2] Burrows, Wheeler - A Block-sorting Lossless Data Compression Algorithm (especially algorithms C and D, and Move-To-Front coding and decoding)
[3] Kärkkäinen - Fast BWT in small space by blockwise suffix sorting
[4] Yokoo - Notes on Block-Sorting Data Compression
[5] Lauther, Lukovszki - Space Efficient Algorithms for the Burrows-Wheeler Backtransformation
[6] Kärkkäinen, Puglisi - Medium-Space Algorithms for Inverse BWT - algorithms LR-B and VLR-B

Implement approximate string matching algorithms

Approximate string matching with respect to the Hamming distance:

Galil, Giancarlo - Improved String Matching with k Mismatches - also requires computing LCP(i, j) in $O(1)$ time from the LCP and SA arrays, e.g. using Fischer, Heun - Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE
Schoenmeyr, Zhang - FFT-based algorithms for the string matching with mismatches problem
Liu, Chen, Borneman, Jiang - A Fast Algorithm for Approximate String Matching on Gene Sequences
Salmela, Tarhio, Kalsi - Approximate Boyer-Moore String Matching for Small Alphabets (both Hamming and edit distance)
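The constant-time LCP(i, j) query mentioned in the Galil, Giancarlo entry reduces to a range-minimum query over the LCP array between the ranks of the two suffixes. The sketch below illustrates that reduction with a sparse table, which is a simpler substitute for the Fischer, Heun structure: queries are O(1), but preprocessing takes O(n log n) space, and the suffix and LCP arrays are built naively. `build_lce` is a hypothetical name.

```python
def build_lce(text: str):
    """Return lce(i, j): length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    def common_prefix(i: int, j: int) -> int:
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        return k

    # lcp[r] is the common prefix length of the suffixes at ranks r - 1 and r.
    lcp = [0] + [common_prefix(sa[r - 1], sa[r]) for r in range(1, n)]

    # Sparse table: table[k][i] = min(lcp[i : i + 2**k]).
    table = [lcp]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        table.append([min(prev[i], prev[i + half]) for i in range(n - (1 << k) + 1)])
        k += 1

    def lce(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(lcp[lo + 1 .. hi]) for the original lo
        k = (hi - lo + 1).bit_length() - 1
        return min(table[k][lo], table[k][hi - (1 << k) + 1])

    return lce


lce = build_lce("banana")
print(lce(1, 3))  # prints 3 ("anana" and "ana" share "ana")
```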

Approximate string matching with respect to the edit distance:

Matching with wildcards:

Fischer, Paterson - String-matching and other products
Muthukrishnan, Palem - Non-standard Stringology: Algorithms and Complexity
Indyk - Faster algorithms for string matching problems: matching the convolution bound
Kalai - Efficient Pattern-Matching with Don't Cares

Implement suffix tree and suffix array construction algorithms

Example publications for suffix tree:

Breslauer, Italiano - Near real-time suffix tree construction via the fringe marked ancestor problem
Giegerich, Kurtz, Stoye - Efficient implementation of lazy suffix trees
Andersson, Nilsson - Efficient Implementation of Suffix Trees

Example publications for suffix array:

Sadakane - A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation,
Nong - Practical Linear-Time O(1)-Workspace Suffix Sorting for Constant Alphabets,
Kim, Sim, Park, Park - Linear-time construction of suffix arrays,
Rajasekaran, Nicolae - An elegant algorithm for the construction of suffix arrays,
Abouelhoda, Kurtz, Ohlebusch - Replacing suffix trees with enhanced suffix arrays,
Manzini, Ferragina - Engineering a Lightweight Suffix Array Construction Algorithm,
Itoh, Tanaka - An efficient method for in memory construction of suffix arrays.

Implement exact string matching algorithms

Comparison of LZ77/78 decomposition and compression algorithms

Overall, the aim is to implement several LZ77/78 decomposition algorithms and the corresponding decoding methods, and to compare them in terms of efficiency (number of character accesses) and running time.
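For orientation, a dictionary-based LZ78 round trip can be sketched in a few lines. This is a minimal illustration under assumptions: the names `lz78_encode` and `lz78_decode` are hypothetical, phrases are stored in a plain `dict` rather than a trie, and a trailing phrase that is already in the dictionary is flushed as a (prefix index, last character) pair.

```python
def lz78_encode(text: str):
    """LZ78 parsing: emit (index of longest known prefix phrase, next character) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # The dictionary is prefix-closed, so phrase[:-1] is always present.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs) -> str:
    """Rebuild the text by replaying the phrase dictionary."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_encode("abababa")
print(pairs)               # prints [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
print(lz78_decode(pairs))  # prints abababa
```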

Selected relevant literature:
[1] Ziv, Lempel - A universal algorithm for sequential data compression
[2] Ziv, Lempel - Compression of individual sequences via variable-rate coding
[3] Na, Apostolico, Iliopoulos, Park - Truncated suffix trees and their application to data compression (truncated suffix tree structure and its usage in LZ77)
[4] Chen, Puglisi, Smyth - Lempel–Ziv Factorization Using Less Time & Space
[5] Ohno et al. - A faster implementation of online RLBWT and its application to LZ77 parsing
[6] Rodeh, Pratt, Even - Linear Algorithm for Data Compression via String Matching
[7] Kärkkäinen, Kempa, Puglisi - Lazy Lempel-Ziv Factorization Algorithms
