This project forked from kavgan/phrase-at-scale

Detect common phrases in large amounts of text using a data-driven approach. Discovered phrases can be of arbitrary size, and the approach can be used in languages other than English.

Home Page: http://kavita-ganesan.com/how-to-incorporate-phrases-into-word2vec-a-text-mining-approach/

Phrase-At-Scale

Phrase-At-Scale provides a fast and easy way to discover phrases in large text corpora using PySpark.

[Figure: example of phrases extracted from a review dataset]

Features

  • Discover most common phrases in your text
  • Size of discovered phrases can be arbitrary (typically: bigrams and trigrams)
  • Adjust configuration to control quality of phrases
  • Can be used in languages other than English
  • Can be run locally using multiple threads, or in parallel on multiple machines
  • Annotate your corpora with the phrases discovered
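To make the idea behind these features concrete, here is a minimal pure-Python sketch of stop-word-bounded phrase discovery and annotation. It is not the repository's PySpark implementation: the function names, the `max_len` parameter, the whitespace tokenization, and the choice to join phrase words with underscores are all assumptions made for illustration.

```python
import re
from collections import Counter


def candidate_phrases(line, stop_words, max_len=3):
    """Yield n-grams of 2..max_len words that never cross a stop word."""
    run = []
    # The trailing None flushes the final run of non-stop-words.
    for token in line.lower().split() + [None]:
        if token is None or token in stop_words:
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    yield " ".join(run[i:i + n])
            run = []
        else:
            run.append(token)


def discover_phrases(lines, stop_words, min_phrase_count=2, max_len=3):
    """Count candidates over all lines and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        counts.update(candidate_phrases(line, stop_words, max_len))
    return {p: c for p, c in counts.items() if c >= min_phrase_count}


def annotate(line, phrases):
    """Join each discovered phrase with underscores, longest phrases first."""
    out = line.lower()
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        out = re.sub(pattern, phrase.replace(" ", "_"), out)
    return out


if __name__ == "__main__":
    docs = [
        "the front desk staff was great",
        "front desk staff at the hotel",
        "great front desk",
    ]
    stops = {"the", "was", "at"}
    found = discover_phrases(docs, stops, min_phrase_count=2)
    print(found)
    print(annotate(docs[0], found))
```

In a Spark job the counting step would typically be distributed across partitions, but the boundary and threshold logic would follow the same shape.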

Quick Start

Run locally

To re-run phrase discovery using the default dataset:

  1. Install Spark

  2. Clone this repo and move into its top-level directory.

    git clone git@github.com:kavgan/phrase-at-scale.git
    
  3. Run the spark job:

    <your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py 
    

This will use settings (including input data files) as specified in config.py.

  4. You can monitor the progress of your job at http://localhost:4040/

Notes:

  • The above command runs the job in Spark's local mode, using the number of threads given in local[num_of_threads].
  • This job produces two outputs:
    1. the list of phrases, in top-opinrank-phrases.txt
    2. the annotated corpora, under data/tagged-data/
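If you launch the job from a script, the command above can be assembled programmatically. The sketch below only builds the argument list, using the same `--master` and `--driver-memory` flags shown in the Quick Start. The helper name `spark_submit_command`, the `/opt/spark` path, and defaulting the thread count to the machine's CPU count are assumptions for illustration.

```python
import os


def spark_submit_command(spark_home, num_threads=None,
                         driver_memory="4G", script="phrase_generator.py"):
    """Build the spark-submit argument list for a local-mode run."""
    if num_threads is None:
        # Assumption: one worker thread per logical core is a sensible default.
        num_threads = os.cpu_count() or 1
    return [
        f"{spark_home}/bin/spark-submit",
        "--master", f"local[{num_threads}]",
        "--driver-memory", driver_memory,
        script,
    ]


if __name__ == "__main__":
    # "/opt/spark" is a placeholder; substitute your own Spark install path.
    cmd = spark_submit_command("/opt/spark", num_threads=4)
    print(" ".join(cmd))
    # To actually launch the job, pass cmd to subprocess.run(cmd, check=True).
```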

Configuration

To change configuration, just edit the config.py file.

| Config | Description |
| --- | --- |
| input_file | Path to your input data files. This can be a file or a folder of files. The default assumption is one text document (of any size) per line. This can be one sentence per line, one paragraph per line, etc. |
| output-folder | Path where the annotated corpora are written. Can be a local path or a path on HDFS. |
| phrase-file | Path to the file that should hold the list of discovered phrases. |
| stop-file | Stop-words file used to indicate phrase boundaries. |
| min-phrase-count | Minimum number of occurrences for a phrase. Guidelines: use 50 for < 300 MB of text, 100 for < 2 GB, and larger values for a much larger dataset. |
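The min-phrase-count guideline can be turned into a small helper that suggests a value from the corpus size. This is a sketch, not part of the repository: the function names are hypothetical, sizes are interpreted as binary megabytes and gigabytes (an assumption), and because the guideline gives no number for very large corpora, the helper returns a caller-supplied value (or `None`) in that case.

```python
import os

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def corpus_size_bytes(path):
    """Total size of a single file, or of all files under a folder."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def suggest_min_phrase_count(corpus_bytes, large_corpus_count=None):
    """Map corpus size to a min-phrase-count following the guideline above."""
    if corpus_bytes < 300 * MB:
        return 50
    if corpus_bytes < 2 * GB:
        return 100
    # The guideline only says "larger values" here, so the caller decides.
    return large_corpus_count
```

For example, `suggest_min_phrase_count(corpus_size_bytes("data/"))` would give a starting value for a corpus stored under a hypothetical `data/` folder; treat the result as a first guess to tune against the quality of the phrases you get.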

Dataset

The default configuration uses a subset of the OpinRank dataset, consisting of about 255,000 hotel reviews. You can use the following to cite the dataset:

@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer} 
}

Contact

This repository is maintained by Kavita Ganesan. Please send me an e-mail or open a GitHub issue if you have questions.
