Comments (13)
Thanks for using LitStudy!
Looks like build_corpus expects a DocumentSet, and it seems that docs_springer is not a DocumentSet but something else. Could you maybe provide the rest of the notebook, or the line that creates docs_springer?
refine_scopus returns two document sets: one for the documents found on Scopus and one for the documents not found on Scopus.
You would need to do something like this:
docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")
Hi,
Great, thanks stijnh.
Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?
Thanks, as always,
S
This is the complete table of all ngrams, that is, all the words that contain a _ after processing (that is what the .filter(like="_") does). Remove the .filter(...) part to see the complete word distribution.
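To make the filter concrete, here is a toy example with a made-up word-count table (the real table comes from compute_word_distribution, and whether it needs axis=0 depends on whether the words sit in the index, which is an assumption here):

```python
import pandas as pd

# Toy stand-in for the word-distribution table; the counts are invented.
# Words merged into ngrams during preprocessing contain an underscore.
dist = pd.DataFrame(
    {"count": [120, 35, 20, 18]},
    index=["nature", "nature_solutions", "climate_change", "policy"],
)

# Keep only rows whose index label contains "_", i.e. the detected ngrams.
ngrams_only = dist.filter(like="_", axis=0)
print(ngrams_only.index.tolist())  # ['nature_solutions', 'climate_change']

# Dropping the .filter(...) call leaves the full distribution of all words.
print(len(dist))  # 4
```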
@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?
Sorry for the question, but I can't seem to work this out on my own and it would be good to know how LitStudy is working here.
Thanks,
S
The parameter ngram_threshold determines how sensitive the preprocessing is to detecting bigrams (also called ngrams). The value is passed as the threshold to gensim, where a higher value means fewer bigrams will be detected. A bigram is a pair of words that frequently appear next to each other (for example, think of words like "data processing", "social media", "human rights", "United States").
The actual processing is done by gensim; see the threshold parameter in its documentation: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
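The threshold is compared against a phrase score, not against character overlap. As a rough pure-Python sketch, gensim's documented default scoring rule for Phrases looks like this (the counts below are invented, and which scoring function litstudy actually configures is an assumption to check):

```python
# Sketch of gensim's default bigram score ("original_scorer", from the
# Mikolov et al. word2vec paper). Toy counts only, for illustration.
def bigram_score(count_a, count_b, count_ab, min_count, vocab_size):
    # score = (count(a,b) - min_count) * |V| / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Assumed corpus statistics: "nature" appears 50 times, "solutions" 40 times,
# the pair "nature solutions" 30 times, in a vocabulary of 1000 words.
score = bigram_score(count_a=50, count_b=40, count_ab=30,
                     min_count=5, vocab_size=1000)
print(score)  # 12.5

# The pair is merged into "a_b" only if score > threshold, so raising the
# threshold makes detection stricter and yields fewer bigrams.
```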
Hi stijnh thanks for the quick response.
Sure, here we go:
I have defined docs_springer as a DocumentSet in my case, and that seems to have resolved the error: the output is no longer an AttributeError, but instead (as below):
Does this look correct to you?
Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?
Thanks again,
Sam
Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:
In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.
Thanks, as always, for your patience and advice,
Sam
The thing returned by compute_word_distribution is a regular pandas DataFrame. You can use pandas' I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html
For example, you can add ...sort_index().to_csv("word_distribution.csv")
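A minimal end-to-end sketch of that export, using a made-up table in place of the one returned by compute_word_distribution:

```python
import os
import tempfile

import pandas as pd

# Invented counts standing in for the word-distribution table.
dist = pd.DataFrame({"count": [120, 35]}, index=["nature", "nature_solutions"])

# Any pandas DataFrame can be written out with the standard I/O methods.
path = os.path.join(tempfile.mkdtemp(), "word_distribution.csv")
dist.sort_index().to_csv(path)

print(open(path).read())
```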
Not sure about this one. Maybe sometimes nature is followed by solutions and it is interpreted as the bigram nature_solutions. You can disable bigram detection by removing the ngram_threshold= option from build_corpus.
Good luck!
Thanks for sharing this @stijnh - one (final) question which isn't clear to me from the guidance, how can we change the parameters to search for trigrams? I have a feeling that the top scoring bigram below "nature_solutions" is actually "nature-based solutions" or "nature based solutions", and would like to capture this in the word distribution output.
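With gensim itself, the common approach to trigrams is to apply phrase detection twice, so an already-merged bigram can merge with a following word; whether litstudy exposes a direct option for this is an assumption to verify. A toy illustration of the two-pass idea (the merge rule here is a hard-coded stand-in for gensim's statistical one):

```python
# Toy phrase merger: joins adjacent token pairs from a fixed set. In real
# gensim, the Phrases model chooses the pairs from corpus statistics instead.
def merge_phrases(tokens, known_pairs):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

doc = ["nature", "based", "solutions", "for", "cities"]
# Pass 1 merges "nature based" into a bigram...
pass1 = merge_phrases(doc, {("nature", "based")})
# ...and pass 2 can then merge the bigram with "solutions" into a trigram.
pass2 = merge_phrases(pass1, {("nature_based", "solutions")})
print(pass2)  # ['nature_based_solutions', 'for', 'cities']
```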
Thanks @stijnh , although I can't seem to get pandas to write the DataFrame to a .csv, here's what I'm doing:
There's no error returned, but nothing being written to the .csv either...
Replace
DataFrame = pd.DataFrame()
by
DataFrame = litstudy.compute_word_distribution(corpus).sort_index()
You were creating an empty DataFrame and then calling to_excel on that one.
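To see why the original snippet silently produced a blank file, compare an empty DataFrame with a populated one (toy numbers, standing in for the word-distribution table):

```python
import pandas as pd

# The bug in miniature: an empty DataFrame next to a populated one.
empty = pd.DataFrame()
full = pd.DataFrame({"count": [120, 35]}, index=["nature", "nature_solutions"])

print(len(empty), len(full))  # 0 2
# Exporting `empty` raises no error, but the output contains no data rows,
# which is why the file looked blank; export the populated frame instead.
```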
Related Issues (20)
- 'No Edges Given' for Network Analysis
- ValueError: n_components must be < n_features; got 50 >= 47
- `build_corpus` always removes words having a frequency below 5
- Different results from unique() and difference of deduplicated set
- module 'networkx' has no attribute 'to_scipy_sparse_matrix'
- Incompability with gensim 4
- Unexpected results from litstudy.plot_author_histogram()
- Listing document titles
- Support for google scholar
- refine_scopus - low it/s speed; necessary to refine every time?
- TypeError: object of type 'method' has no len()
- Saving language models
- Documentation on search_ function queries
- Search_semanticscholar with list
- Scopus400Error: Error translating query - Refining results with "source title" query argument
- train_lda_model() fails to access gensim
- Scopus400Error: Exceeds the maximum number allowed for the service level.
- Scopus exceeds csv field limit
- SemanticScholar search optimization
- DocumentIdentifier.matches() is case-sensitive