I'm working with a collection of about 1200 short texts, ranging from a single sentence to a full paragraph. They are mostly descriptions of scholarships and who the recipients should be, and most contain language stating that preference will be given to certain students. The data below are fabricated for privacy reasons but are realistic. One of our main goals is to move beyond keyword search, so at this point this is purely exploratory analysis.
I've gotten the function to work on other data I have access to and am impressed with the results. I don't fully understand the technical reasons why it might not be working here, but I wonder whether I'm doing something it wasn't designed to do. In the joboffer example, the input was a single document of reasonable length. None of my texts are that long, so summarizing them individually makes little sense, and perhaps concatenating them the way I did isn't an appropriate method for applying the function to my texts.
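In case it matters, one alternative I've considered is skipping the concatenation and keeping each text as its own document. If I'm reading the udpipe documentation correctly, the names of a character vector passed to udpipe() become the doc_id values, so something along these lines might be closer to the intended usage (a sketch only; I haven't tested it against the error below):

# Hypothetical alternative: one doc_id per scholarship text instead of one big document
alt <- funds_abbr$TEXT
names(alt) <- funds_abbr$NBR_TEXT
x_alt <- udpipe(alt, object = udmodel$language)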
Everything runs smoothly until I get to this chunk of code:
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
The error message that I'm getting is:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 6728000 rows; more than 116000 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
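As far as I can tell, this is the generic data.table complaint about a many-to-many join blowing up. A minimal sketch of the same error with plain data.table (nothing to do with textrank, just to show what I understand the message to mean):

library(data.table)
a <- data.table(k = rep(1L, 400), v = 1)
b <- data.table(k = rep(1L, 400), w = 1)
setkey(a, k)
a[b]  # 400 x 400 = 160000 result rows > nrow(a) + nrow(b), so the same vecseq() error

So duplicated key values on both sides of the join seem to be the trigger.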
I'm a little confused about how to overcome this because the merge that seems to be causing the problem is called internally by the function. The error's suggestions to use by = .EACHI or allow.cartesian = TRUE aren't arguments the function accepts, which isn't too surprising. I also haven't been able to pre-merge the data outside of textrank_candidates_lsh() using those suggestions. I did try asking ChatGPT about it; its suggestion was to remove duplicates, but the solution was messy enough that I wasn't able to integrate it.
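For completeness, the tidiest version of that deduplication idea I could reconstruct is below. I'm not sure it targets the right join, and dropping repeated (textrank_id, lemma) pairs also changes the input to the minhashing, so treat it as a sketch rather than a fix:

# Drop repeated (textrank_id, lemma) pairs; the error message suggests
# duplicate key values in the internal data.table join are the trigger
terminology_dedup <- unique(terminology[, c("textrank_id", "lemma")])
candidates <- textrank_candidates_lsh(x = terminology_dedup$lemma,
                                      sentence_id = terminology_dedup$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)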
Sample fabricated data are below; although short, they faithfully reproduce the error.
funds_abbr <- data.table(
  NBR_TEXT = c(1234, 2345, 3456, 4567, 5678),
  TEXT = c(
    "The purpose of this scholarship is to help undergraduate or graduate students in the Computer Science department at the State University. Preference shall be given to students demonstrating financial need. The scholarship is renewable for 4 years, provided students remain in good academic standing.",
    "The purpose of this scholarship is to help undergraduate students at the State University earn their degree without taking on excessive student loan debt. Preference for this scholarship will be given to women.",
    "To provide scholarships to underrepresented students studying agriculture.",
    "The purpose of this scholarship is to help students studying Business Analytics at the Business School of the State University. Preference will be given to students who come from minority backgrounds that are traditionally underrepresented at the State University or in the corporate world.",
    "The purpose of this fund is to help first-generation undergraduate students studying American Studies at the State University. Preference will be given to students who come from minority backgrounds and who are traditionally underrepresented at the State University."
  )
)
The full code for reproducibility is below:
library(data.table)  # needed for data.table() and the [, ] syntax below
library(textrank)
library(tm)
library(tokenizers)
library(igraph)
library(proxy)
library(udpipe)
library(textreuse)
setwd("path/to/my/directory")
# The udmodel is stored in an Rdata file and accessed with the 'object = udmodel$language' argument in the udpipe() function
# I had to get special permission to access the internet to download the file, which I can't do each time I work on this type of analysis
load("english_udmodel.Rdata")
# My data are likewise stored in an Rdata file as a data.table so I can easily knit the R Markdown file (when working interactively, I read them from a database instead)
load("unique_text_purpose.Rdata")
# Retaining the NBR_TEXT and TEXT columns is an artefact of earlier work to meet the requirements of the DocumentTermMatrix() function; here I'm only keeping the unique IDs and the associated text
funds_abbr <- data[, c("NBR_TEXT", "TEXT")]
# The sample data consist of only 5 texts; on my machine I ran this on a 500-text subset
# (funds_abbr[1:500]$TEXT, out of ~1200 texts), which took ~2 hours
x <- paste(funds_abbr$TEXT, collapse = "\n")
x <- udpipe(x, object = udmodel$language)
# Process data
x$textrank_id <- unique_identifier(x, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(x[, c("textrank_id", "sentence")])
terminology <- subset(x, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
tr <- textrank_sentences(data = sentences, terminology = terminology)
lsh_probability(h = 500, b = 500, s = 0.1) # A 10 percent Jaccard overlap will be detected well
minhash <- minhash_generator(n = 500, seed = 999)
# Rebuild terminology (same columns as above, this time via select =)
terminology <- subset(x, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
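One more detail that may or may not be relevant: with n = 500 hashes and bands = 500, each band holds a single minhash, so (if I understand the banding scheme correctly) any pair of sentences sharing even one minhash value becomes a candidate pair, which would inflate the internal join. Comparing detection probabilities for a few band counts that divide 500 evenly:

lsh_probability(h = 500, b = 500, s = 0.1)  # 1 row per band: ~1, nearly every pair becomes a candidate
lsh_probability(h = 500, b = 100, s = 0.1)  # 5 rows per band: far fewer candidates, ~0.001 detection at 10% overlap
lsh_probability(h = 500, b = 50, s = 0.1)   # 10 rows per band: essentially never flags a 10% overlap

I don't know whether a candidate explosion from bands = 500 is what produces the vecseq() error, but given the 6728000-row join in the message it seems worth asking about.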