TextRank

Description

TextRank is an unsupervised keyword extraction algorithm based on PageRank. Other keyword extraction strategies generally rely either on statistics (such as term frequency and inverse document frequency), which ignore context, or on machine learning, which requires a corpus of training data that is unlikely to suit every application. TextRank has been found to produce superior results in many situations at minimal computational cost.
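As a rough illustration of how text becomes a graph that PageRank can score, the sketch below builds a word co-occurrence graph, the construction described in the TextRank paper. The helper name `cooccurrence_graph` and the `window` parameter are assumptions made for this example and are not part of the gem's API.

```ruby
# Illustrative only: `cooccurrence_graph` and `window` are hypothetical names,
# not part of the text_rank gem.
#
# Two tokens are linked when they appear within `window` positions of each
# other. Edge weights count how often the pair co-occurs.
def cooccurrence_graph(tokens, window: 2)
  graph = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  tokens.each_with_index do |token, i|
    # Look ahead at the next `window - 1` tokens.
    tokens[(i + 1)...(i + window)].to_a.each do |neighbor|
      next if neighbor == token

      # Store the edge in both directions so the graph is undirected.
      graph[token][neighbor] += 1
      graph[neighbor][token] += 1
    end
  end
  graph
end

tokens = %w[castle westphalia baron castle baron youth]
graph = cooccurrence_graph(tokens, window: 2)
```

Words that co-occur with many other well-connected words end up with higher PageRank scores, which is what lets the algorithm use context without training data.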

Features

  • Multiple PageRank implementations, so you can choose the one best suited to the performance needs of your application
  • Framework for adding additional PageRank implementations (e.g. a native implementation)
  • Extensible architecture to customize how text is filtered
  • Extensible architecture to customize how text is tokenized
  • Extensible architecture to customize how tokens are filtered
  • Extensible architecture to customize how keyword ranks are filtered/processed
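The order in which these extension points apply can be sketched with plain lambdas. Everything below is a hypothetical stand-in (the lambdas, `run_pipeline`, and the hard-coded scores are assumptions for illustration) and does not reflect the gem's actual filter classes.

```ruby
# Hypothetical stand-ins for the four extension points. These lambdas are NOT
# the gem's classes; they only illustrate the order of the stages.
char_filters = [
  ->(text) { text.downcase },
  ->(text) { text.gsub(/[^a-z\s]/, ' ') }
]
tokenizer     = ->(text) { text.split }
token_filters = [->(tokens) { tokens.reject { |t| %w[a of the].include?(t) } }]
rank_filters  = [->(ranks) { ranks.sort_by { |_, score| -score }.first(2).to_h }]

# Character filters run on the raw text, then the tokenizer splits it,
# then token filters run on the token list.
def run_pipeline(text, char_filters, tokenizer, token_filters)
  filtered = char_filters.reduce(text) { |acc, filter| filter.call(acc) }
  token_filters.reduce(tokenizer.call(filtered)) { |acc, filter| filter.call(acc) }
end

tokens = run_pipeline('The Baron of Westphalia, a castle!',
                      char_filters, tokenizer, token_filters)

# Rank filters run last, on the scored keywords. The scores here are made up;
# in practice they would come from PageRank.
scores = { 'baron' => 0.5, 'castle' => 0.3, 'westphalia' => 0.2 }
top = rank_filters.reduce(scores) { |acc, filter| filter.call(acc) }
```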

Installation

gem install text_rank

Requirements

  • Ruby 3.0.0 or higher
  • engtagger gem is optional but required for TextRank::TokenFilter::PartOfSpeech
  • nokogiri gem is optional but required for TextRank::CharFilter::StripHtml

Usage

TextRank

require 'text_rank'

text = <<-END
  In a castle of Westphalia, belonging to the Baron of Thunder-ten-Tronckh, lived
  a youth, whom nature had endowed with the most gentle manners. His countenance
  was a true picture of his soul. He combined a true judgment with simplicity of
  spirit, which was the reason, I apprehend, of his being called Candide. The old
  servants of the family suspected him to have been the son of the Baron's
  sister, by a good, honest gentleman of the neighborhood, whom that young lady
  would never marry because he had been able to prove only seventy-one
  quarterings, the rest of his genealogical tree having been lost through the
  injuries of time.
END

# Default, basic keyword extraction.  Try this first:
keywords = TextRank.extract_keywords(text)

# Keyword extraction with all of the bells and whistles:
keywords = TextRank.extract_keywords_advanced(text)

# Fully customized extraction:
extractor = TextRank::KeywordExtractor.new(
  strategy:   :sparse,  # Specify PageRank strategy (dense or sparse)
  damping:    0.85,     # The probability of following the graph vs. randomly choosing a new node
  tolerance:  0.0001,   # The desired accuracy of the results
  char_filters: [...],  # A list of filters to be applied prior to tokenization
  tokenizers: [...],    # A list of tokenizers to perform tokenization
  token_filters: [...], # A list of filters to be applied to each token after tokenization
  graph_strategy: ...,  # A class or strategy instance for producing a graph from tokens
  rank_filters: [...],  # A list of filters to be applied to the keyword ranks after keyword extraction
)

# Add another filter to the end of the char_filter chain
extractor.add_char_filter(:AsciiFolding)

# Add a part of speech filter to the token_filter chain BEFORE the Stopwords filter
pos_filter = TextRank::TokenFilter::PartOfSpeech.new(parts_to_keep: %w[nn])
extractor.add_token_filter(pos_filter, before: :Stopwords)

# Perform the extraction with at most 100 iterations
extractor.extract(text, max_iterations: 100)

PageRank

It is also possible to use this gem for PageRank only.

require 'page_rank'

PageRank.calculate(strategy: :sparse, damping: 0.8, tolerance: 0.00001) do
  add('node_a', 'node_b', weight: 3.2)
  add('node_b', 'node_d', weight: 2.1)
  add('node_b', 'node_e', weight: 4.7)
  add('node_e', 'node_a', weight: 1.3)
end

There are currently two pure Ruby implementations of PageRank:

  1. sparse: A sparsely-stored strategy which performs multiplication proportional to the number of edges in the graph. For graphs with few edges relative to the number of nodes, this will perform better in a pure Ruby setting. It is the recommended strategy until native implementations are available.
  2. dense: A densely-stored matrix strategy which performs matrix multiplications until the tolerance is reached or max_iterations is exceeded. This is the more canonical implementation and is fine for small or dense graphs, but it is not advised for large, sparse graphs because Ruby is not fast at matrix multiplication. Each iteration is O(N^3) where N is the number of graph nodes.
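To make the roles of damping, tolerance and max_iterations concrete, here is a minimal edge-list power iteration in plain Ruby. It is a sketch of the general sparse technique, not the gem's implementation: the method name `sparse_page_rank`, the edge format, and the choice to spread the rank of nodes without outgoing edges evenly over all nodes are assumptions.

```ruby
# Illustrative sketch, not the gem's code. `sparse_page_rank` is a hypothetical
# name; edges are given as [source, target, weight] triples.
def sparse_page_rank(edges, damping: 0.85, tolerance: 0.0001, max_iterations: 100)
  nodes = edges.flat_map { |from, to, _| [from, to] }.uniq
  out_weight = Hash.new(0.0)
  edges.each { |from, _, weight| out_weight[from] += weight }

  # Start from a uniform distribution.
  ranks = nodes.to_h { |node| [node, 1.0 / nodes.size] }

  max_iterations.times do
    # Assumption: rank held by nodes with no outgoing edges is spread evenly.
    dangling = nodes.sum { |node| out_weight[node].zero? ? ranks[node] : 0.0 }
    base = (1.0 - damping) / nodes.size + damping * dangling / nodes.size
    new_ranks = nodes.to_h { |node| [node, base] }

    # One pass over the edges: work per iteration is proportional to edge count.
    edges.each do |from, to, weight|
      new_ranks[to] += damping * ranks[from] * weight / out_weight[from]
    end

    # Stop once the total change between iterations falls below the tolerance.
    delta = nodes.sum { |node| (new_ranks[node] - ranks[node]).abs }
    ranks = new_ranks
    break if delta < tolerance
  end
  ranks
end

edges = [
  ['node_a', 'node_b', 3.2],
  ['node_b', 'node_d', 2.1],
  ['node_b', 'node_e', 4.7],
  ['node_e', 'node_a', 1.3]
]
ranks = sparse_page_rank(edges, damping: 0.8, tolerance: 0.00001)
```

A dense variant would store the same weights in an N×N matrix and multiply on every iteration, which is why it becomes costly as the node count grows even when most entries are zero.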

License

MIT. See the LICENSE file.

References

R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” in Proceedings of EMNLP 2004. Association for Computational Linguistics, 2004, pp. 404–411.

S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, vol. 30, pp. 107–117, 1998.
