I'm working with a collection of about 1200 short texts, ranging from a single sentence to a full paragraph. They are mostly descriptions of scholarships and who the recipients should be, and most contain language stating that preference will be given to certain students. The data below are fabricated for privacy reasons but are realistic. One of our main goals is to move beyond keyword search, so at this point this is purely exploratory analysis.
I've gotten the function to work on other data I have access to and am impressed with the results. I don't fully understand the technical reasons why it might not be working here, but I wonder whether I'm doing something it wasn't designed to do. In the joboffer example, the input was a single document of reasonable length. None of my texts are that long, so summarizing them individually makes little sense, and perhaps concatenating them the way I did isn't an appropriate method for applying the function to my texts.
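In case it matters, one alternative I've considered is skipping the concatenation and keeping each text as its own document. If I'm reading the udpipe documentation correctly, the names of a character vector passed to udpipe() become the doc_id values, so something along these lines might be closer to the intended usage (a sketch only; I haven't tested it against the error below):

# Hypothetical alternative: one doc_id per scholarship text instead of one big document
alt <- funds_abbr$TEXT
names(alt) <- funds_abbr$NBR_TEXT
x_alt <- udpipe(alt, object = udmodel$language)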
Everything runs smoothly until I get to this chunk of code:
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
The error message that I'm getting is:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 6728000 rows; more than 116000 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
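As far as I can tell, this is the generic data.table complaint about a many-to-many join blowing up. A minimal sketch of the same error with plain data.table (nothing to do with textrank, just to show what I understand the message to mean):

library(data.table)
a <- data.table(k = rep(1L, 400), v = 1)
b <- data.table(k = rep(1L, 400), w = 1)
setkey(a, k)
a[b]  # 400 x 400 = 160000 result rows > nrow(a) + nrow(b), so the same vecseq() error

So duplicated key values on both sides of the join seem to be the trigger.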
I'm a little confused about how to overcome this because the merge that seems to be causing the problem is called internally by the function. The error's suggestions to use by = .EACHI or allow.cartesian = TRUE aren't arguments the function accepts, which isn't too surprising. I also haven't been able to pre-merge the data outside of textrank_candidates_lsh() using those suggestions. I did try asking ChatGPT about it; its suggestion was to remove duplicates, but the solution was messy enough that I wasn't able to integrate it.
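For completeness, the tidiest version of that deduplication idea I could reconstruct is below. I'm not sure it targets the right join, and dropping repeated (textrank_id, lemma) pairs also changes the input to the minhashing, so treat it as a sketch rather than a fix:

# Drop repeated (textrank_id, lemma) pairs; the error message suggests
# duplicate key values in the internal data.table join are the trigger
terminology_dedup <- unique(terminology[, c("textrank_id", "lemma")])
candidates <- textrank_candidates_lsh(x = terminology_dedup$lemma,
                                      sentence_id = terminology_dedup$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)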
Sample fabricated data are below; although short, they faithfully reproduce the error.
funds_abbr <- data.table(
  NBR_TEXT = c(1234, 2345, 3456, 4567, 5678),
  TEXT = c(
    "The purpose of this scholarship is to help undergraduate or graduate students in the Computer Science department at the State University. Preference shall be given to students demonstrating financial need. The scholarship is renewable for 4 years, provided students remain in good academic standing.",
    "The purpose of this scholarship is to help undergraduate students at the State University earn their degree without taking on excessive student loan debt. Preference for this scholarship will be given to women.",
    "To provide scholarships to underrepresented students studying agriculture.",
    "The purpose of this scholarship is to help students studying Business Analytics at the Business School of the State University. Preference will be given to students who come from minority backgrounds that are traditionally underrepresented at the State University or in the corporate world.",
    "The purpose of this fund is to help first-generation undergraduate students studying American Studies at the State University. Preference will be given to students who come from minority backgrounds and who are traditionally underrepresented at the State University."
  )
)
The full code for reproducibility is below:
library(data.table)  # needed for data.table() and the [, ] syntax below
library(textrank)
library(tm)
library(tokenizers)
library(igraph)
library(proxy)
library(udpipe)
library(textreuse)
setwd("path/to/my/directory")
# The udmodel is stored in an Rdata file and accessed with the 'object = udmodel$language' argument in the udpipe() function
# I had to get special permission to access the internet to download the file, which I can't do each time I work on this type of analysis
load("english_udmodel.Rdata")
# My data are likewise stored in an Rdata file as a data.table so I can easily knit the R Markdown file (when working interactively, I read them from a database instead)
load("unique_text_purpose.Rdata")
# Retaining the NBR_TEXT and TEXT columns is an artefact of earlier work to meet the requirements of the DocumentTermMatrix() function; here I'm only keeping the unique IDs and the associated text
funds_abbr <- data[, c("NBR_TEXT", "TEXT")]
# The sample data consist of only 5 texts; on my machine I ran this on a 500-text subset
# (funds_abbr[1:500]$TEXT, out of ~1200 texts), which took ~2 hours
x <- paste(funds_abbr$TEXT, collapse = "\n")
x <- udpipe(x, object = udmodel$language)
# Process data
x$textrank_id <- unique_identifier(x, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(x[, c("textrank_id", "sentence")])
terminology <- subset(x, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
tr <- textrank_sentences(data = sentences, terminology = terminology)
lsh_probability(h = 500, b = 500, s = 0.1) # A 10 percent Jaccard overlap will be detected well
minhash <- minhash_generator(n = 500, seed = 999)
# Rebuild terminology (same columns as above, this time via select =)
terminology <- subset(x, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
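One more detail that may or may not be relevant: with n = 500 hashes and bands = 500, each band holds a single minhash, so (if I understand the banding scheme correctly) any pair of sentences sharing even one minhash value becomes a candidate pair, which would inflate the internal join. Comparing detection probabilities for a few band counts that divide 500 evenly:

lsh_probability(h = 500, b = 500, s = 0.1)  # 1 row per band: ~1, nearly every pair becomes a candidate
lsh_probability(h = 500, b = 100, s = 0.1)  # 5 rows per band: far fewer candidates, ~0.001 detection at 10% overlap
lsh_probability(h = 500, b = 50, s = 0.1)   # 10 rows per band: essentially never flags a 10% overlap

I don't know whether a candidate explosion from bands = 500 is what produces the vecseq() error, but given the 6728000-row join in the message it seems worth asking about.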