text-summarizer-2's Introduction

Description

A text summarization tool, built as a project for the Artificial Intelligence course at BITS Pilani.

Algorithms

1. Automatic Extraction

This approach treats a document as an ordered sequence of sentences and each sentence as an ordered sequence of words. It follows four fundamental steps:

  • Weighting of words
  • Weighting of sentences
  • Choosing all sentences above a certain weight threshold
  • Ordering the selected sentences as they appear in the original article

The weighting is based on frequencies. Every word (term) is assigned a weight using the tf-idf (term frequency-inverse document frequency) approach:

weight of a term = term frequency * inverse document frequency

Term frequency is the number of times a word occurs within a document. Inverse document frequency is 1 / (number of documents the word appears in).
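
The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `tf_idf_weights`, the whitespace tokenization, and the sample documents are assumptions made for the example. It uses the plain 1 / document-frequency form described above rather than a logarithmic variant.

```python
from collections import Counter

def tf_idf_weights(documents, target_index):
    """Weight each term of one document as tf * (1 / df)."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Term frequency: occurrences of each term in the target document.
    tf = Counter(tokenized[target_index])
    return {term: count / df[term] for term, count in tf.items()}

# Hypothetical sample corpus.
docs = [
    "the wizard brews a potion",
    "the student studies magic",
    "the wizard teaches the student",
]
weights = tf_idf_weights(docs, 2)
```

In this corpus, "teaches" occurs in only one document and keeps its full weight of 1.0, while "the" occurs twice in the target document but in all three documents, so its weight drops to 2/3.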

Additionally, the score incorporates parameters such as the location of the word, the syntactic structure of the sentence in which it appears, and the presence of the word in the title. Each sentence is assigned a weight equal to the sum of the weights of its words. Once all sentences are weighted, they are sorted in descending order of weight. A threshold is set on the weight a sentence needs to be included in the summary, and sentences below it are filtered out. The remaining sentences are put back in the order in which they appear in the document.

This is a statistical method that relies purely on the term-level content of the story. It involves preprocessing of terms, such as removing stop words, normalizing terms, and replacing synonyms.
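
The sentence-level steps (sum the word weights, sort, apply a threshold, restore document order) might look roughly like the sketch below. The function `summarize`, the regex-based sentence splitter, and the hand-picked `word_weights` are assumptions for illustration. The extra parameters mentioned above (word location, title presence, syntactic structure) and the preprocessing are left out.

```python
import re

def summarize(text, word_weights, threshold):
    """Keep sentences whose summed word weight exceeds threshold, in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        weight = sum(word_weights.get(w, 0.0) for w in words)
        scored.append((weight, position, sentence))

    # Sort by descending weight, filter by threshold, then restore document order.
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    kept = [item for item in ranked if item[0] > threshold]
    kept.sort(key=lambda item: item[1])
    return [sentence for _, _, sentence in kept]

# Hypothetical word weights; in practice these would come from tf-idf.
word_weights = {"wizard": 2.0, "potion": 1.5, "magic": 1.0}
text = "The wizard brews a potion. It rained all day. Magic fills the wizard tower."
summary = summarize(text, word_weights, threshold=2.5)
```

Here the first and third sentences score 3.5 and 3.0 and are kept in their original order, while the second sentence scores 0 and is dropped.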

2. Query-Based Extraction

This approach weights sentences based on the incoming user query. The weight of a sentence is calculated using a mixture of two parameters.

First, the weight of the sentence obtained by tf-idf weighting.

Second, the weight of the query, calculated using the number of sentences the query occurs in.

The combination of these two retrieves the phrases in the story that are most relevant to the user query. One application is a story search engine: a user can query for a story subject such as “student wizard magic potions”, and the search engine would present books like Harry Potter, each with a summary extracted using this query. The user thus sees the relevant stories and the parts of those stories they are interested in.
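
The description does not say how the two parameters are mixed, so the sketch below fills that in with assumptions: a linear blend controlled by a hypothetical `alpha`, and a query weight in which each matching query term contributes 1 / (number of sentences containing it). The function name `query_scores` and the sample data are also illustrative only.

```python
def query_scores(sentences, base_weights, query, alpha=0.5):
    """Blend a per-sentence base weight with a query-match weight.

    Assumptions (not from the original project): the linear blend and the
    1 / sentence-frequency contribution of each matching query term.
    """
    tokenized = [set(s.lower().split()) for s in sentences]
    query_terms = set(query.lower().split())

    # Number of sentences each query term occurs in.
    sentence_freq = {
        term: sum(term in tokens for tokens in tokenized) for term in query_terms
    }

    scores = []
    for tokens, base in zip(tokenized, base_weights):
        # Only terms present in this sentence are summed, so the frequency is >= 1.
        query_weight = sum(1.0 / sentence_freq[term] for term in query_terms & tokens)
        scores.append(alpha * base + (1 - alpha) * query_weight)
    return scores

# Hypothetical sentences; base weights would normally be tf-idf sentence weights.
sentences = [
    "a student wizard learns magic",
    "the castle is old",
    "the wizard brews potions",
]
base_weights = [1.0, 1.0, 1.0]
scores = query_scores(sentences, base_weights, "student wizard magic potions")
```

With equal base weights, the first sentence ranks highest because it matches three query terms, two of which appear in no other sentence.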

3. Information Extraction

This method works in two phases: selection of useful information, and generation of a summary from that information. It improves upon the naïve automatic extraction technique by adding summary composition to create a more readable and coherent summary.

The steps in this algorithm are:

  • Preprocess the document to remove punctuation and bracket symbols, and to expand the short forms of some words.
  • Annotate each word with its part of speech, for example whether it is a noun, verb, or adjective.
  • Extract subject-verb-object triplets from each sentence.
  • Filter out triplets where the verb is a past participle, an infinitive, or part of a conditional clause. This filtering also causes false rejections, which need to be controlled by correctly identifying triplets in multi-verb sentences.
  • Generate noun phrases to represent the subject, object, and indirect object of the sentences.
  • Generate a verb complement if no subject is present, and generate prepositional phrases to complement the noun phrases.
  • Generate a verb phrase to link all components together.
  • Rank the generated sentences using the document frequency of their terms.
  • Merge the sentences to form a summary. This is done by combining the generated sentences, then greedily pulling out subjects to see if the sentence can be reduced. This step uses a natural language generation engine.
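
Two of the steps above, preprocessing and triplet filtering, can be illustrated without a parser. The sketch below is a simplification under stated assumptions: `SHORT_FORMS` is a tiny hypothetical lookup table, the triplets are assumed to be already extracted and tagged with Penn Treebank verb tags, and the base-form tag `VB` is used as a rough stand-in for infinitives. Conditional-clause detection and the generation steps are not covered.

```python
import re

# Hypothetical short-form table; a real system would need a much larger one.
SHORT_FORMS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(text):
    """Expand short forms, then drop bracket symbols and punctuation."""
    for short, full in SHORT_FORMS.items():
        text = text.replace(short, full)  # case-sensitive, for simplicity
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return " ".join(cleaned.split())

# Penn Treebank tags: VBN is a past participle, VB a base form.
# Assumption: treating VB as "infinitive" is a rough approximation.
REJECTED_VERB_TAGS = {"VBN", "VB"}

def filter_triplets(triplets):
    """Keep (subject, (verb, tag), object) triplets whose verb tag is not rejected."""
    return [t for t in triplets if t[1][1] not in REJECTED_VERB_TAGS]

clean = preprocess("Harry (the student) can't brew potions!")

# Hypothetical, hand-tagged triplets.
triplets = [
    ("Harry", ("brews", "VBZ"), "potion"),
    ("potion", ("brewed", "VBN"), "Harry"),
]
kept = filter_triplets(triplets)
```

In practice the part-of-speech tags and the triplets would come from a tagger and a dependency parser; they are hard-coded here only to keep the example self-contained.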

This method captures the meaning of sentences through grammatical analysis, which is why it is better than the automatic extraction approach. It can help in the development of intelligent agents that gain better semantic knowledge of text.

text-summarizer-2's People

Contributors

agarwaltanmay
