Giter VIP home page Giter VIP logo

dgst's Introduction

DGST

DGST (Distributed Generalized Suffix Tree) is an efficient and scalable generalized suffix tree construction algorithm on distributed data-parallel platforms. DGST supports indexing all suffixes for a variety of strings, ranging from a single long string to multiple strings of varied lengths. In addition, DGST is built on the widely-used distributed data-parallel platform Apache Spark.

News

DGST won the first place in the 3rd National University Cloud Computing Application Innovation Competition Big Data Skills Challenge held in 2016-2017, China.

Prerequisites

  • Apache Spark: As DGST is built on top of Spark, you need to get the Spark installed first. If you are not clear how to setup Spark, please refer to the guidelines here. Currently, DGST is developed on the APIs of Spark 1.6.3 version.
  • Apache HDFS: DGST uses HDFS as the distributed file system. If you are not clear how to setup HDFS, please refer to the guidelines here. The HDFS version is 2.6.5.
  • Java: The JDK version is 1.7.

Compile

DGST is built using Apache Maven. To build DGST, Run mvn scala:compile compile package in the root directory.

The compiled jar is available at target/DGST-1.0-SNAPSHOT.jar

Note: A precompiled jar is provided in the lib directory.

Run

The entry of DGST is defined in the scala class GST.Main.

Run DGST with the following command:

spark-submit --master [SPARK_MASTER_ADDRESS] \
--executor-memory 2G \
--driver-memory 20G \
--total-executor-cores [Total cores in the Spark cluster] \ 
--conf spark.memory.fraction=0.9 \
--conf spark.memory.storageFraction=0.1 \
--driver-class-path [Configuration directory] \
--class GST.Main \
DGST-1.0-SNAPSHOT.jar \
<Input path> <Output path> <Temporary path> \
<Range size> <Maximum sub-tree size> <Parallelism>

The run command contains the following parameters:

  • <Input path>: The input data path on HDFS or local file system.
  • <Output path>: The output data path on HDFS or local file system.
  • <Temporary path>: The tempoaray data path on HDFS or local file system.
  • <Range size>: The size of range in the LCP-Range structure
  • <Maximum sub-tree size>: The Maximum sub-tree size (i.e., the maximum S-prefix frequency)
  • <Parallelism>: The computation parallelism on Spark

Note: --driver-class-path is the directory where the configuration file is. The parameters configured in the run command have higher priority than those in the configuration file. To find more information about the configuration file, please refer to the Configuration Guide.

Note: We also port the state-of-the-art ERa algorithm to Spark. Please replace the main class with GST.EraMain to run the ERa algorithm. The performance comparison on Spark is here.

Demo

The dataset in demo directory contais two strings. Each file represents a string.

The generalized suffix tree for the two demo strings can be found here.

Licence

Please contact the authors for the licence info of the source code.

dgst's People

Contributors

zhutongxue avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

dgst's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.