
Comments (13)

stijnh commented on August 10, 2024

Thanks for using LitStudy!

Looks like build_corpus expects a DocumentSet and it seems that docs_springer is not a DocumentSet but something else.

Could you maybe provide the rest of the notebook, or do you have the line that creates docs_springer?


stijnh commented on August 10, 2024

refine_scopus returns two document sets: One for the document found on scopus and one for the documents not found on scopus.

You would need to do something like this:

docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")


stijnh commented on August 10, 2024

On whether the full results of the word distribution can be viewed, since the table output seems to provide only a snapshot:

This is the complete table of all ngrams, i.e. all the words that contain a _ after processing (that is what the .filter(like="_") selects).

Remove the .filter(...) part to see the complete word distribution.
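As an illustration, here is roughly what that filter does. The counts and the column name are made up, and I am assuming the distribution is a pandas DataFrame indexed by word (as mentioned elsewhere in the thread):

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by compute_word_distribution:
# index = token, one column of document counts (names here are assumptions).
dist = pd.DataFrame(
    {"count": [120, 85, 60, 40]},
    index=["nature", "nature_solutions", "climate_change", "policy"],
)

# .filter(like="_", axis=0) keeps only index labels containing "_", i.e. the ngrams.
ngrams_only = dist.filter(like="_", axis=0)
print(ngrams_only.index.tolist())  # ['nature_solutions', 'climate_change']

# Dropping the .filter(...) call shows the full word distribution.
print(dist.index.tolist())
```

Note that DataFrame.filter defaults to filtering columns, so axis=0 is needed here to filter on the word index.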

On what an ngram_threshold of 0.8 actually does (whether it matches an ngram when 80% of its characters agree with a reference ngram):

The parameter ngram_threshold determines how sensitive the preprocessing is to detecting bigrams (also called ngrams). The higher the value, the more bigrams will be detected. A bigram is a pair of words that frequently appear next to each other (for example, "data processing", "social media", "human rights", "United States"). Detection is based on how often the two words co-occur, not on character-level similarity.

The actual processing is done by gensim; see the threshold parameter in its documentation: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
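For intuition, here is a pure-Python sketch of gensim's default bigram score (the formula from the Phrases documentation). All the counts below are invented, and how litstudy maps ngram_threshold onto gensim's parameter is not shown here:

```python
def bigram_score(count_a, count_b, count_ab, min_count, vocab_size):
    """gensim's default scorer:
    (count_ab - min_count) * vocab_size / (count_a * count_b);
    the pair (a, b) is merged into a_b when the score passes the threshold."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Suppose "data" occurs 50 times, "processing" 40 times, and the pair
# "data processing" 30 times in a vocabulary of 1000 distinct words:
score = bigram_score(count_a=50, count_b=40, count_ab=30,
                     min_count=5, vocab_size=1000)
print(score)  # (30 - 5) * 1000 / (50 * 40) = 12.5
```

The key point for the question above: the decision is driven by co-occurrence counts, not by how many characters two words share.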


SS159 commented on August 10, 2024

Hi stijnh thanks for the quick response.

Sure, here we go:

[screenshot]


SS159 commented on August 10, 2024

I have now defined docs_springer as a DocumentSet, and that seems to have resolved the error: the output is no longer an AttributeError, but instead (as below):

[screenshot]

Does this look correct to you?


SS159 commented on August 10, 2024

Great, thanks stijnh.

Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?

[screenshot]

Thanks, as always,

S


SS159 commented on August 10, 2024

@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?

Sorry for the question, but I can't seem to clarify this on my own, and it would be good to know how LitStudy is working here.

Thanks,

S


SS159 commented on August 10, 2024

Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?

[screenshot]

Thanks again,

Sam


SS159 commented on August 10, 2024

Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:

In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.

[screenshot]

Thanks, as always, for your patience and advice,

Sam


stijnh commented on August 10, 2024

On how to view/export/download the full word list in its entirety:

The object returned by compute_word_distribution is a regular pandas DataFrame. You can use the pandas I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html

For example, you can append ...sort_index().to_csv("word_distribution.csv")
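A minimal sketch of that export; the toy DataFrame and its "count" column are stand-ins for whatever compute_word_distribution actually returns:

```python
import pandas as pd

# Illustrative stand-in for the word-distribution DataFrame.
dist = pd.DataFrame({"count": [3, 7, 5]}, index=["alpha", "beta", "gamma"])

# Sort by word and write the whole table to disk; no row limit applies here,
# unlike the truncated notebook display.
dist.sort_index().to_csv("word_distribution.csv")

# Read it back to confirm everything was written.
roundtrip = pd.read_csv("word_distribution.csv", index_col=0)
print(len(roundtrip))  # 3
```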

On the word 'nature' appearing across only 35% of the documents despite being a key search term:

Not sure about this one. Maybe 'nature' is sometimes followed by 'solutions' and the pair is interpreted as the bigram nature_solutions. You can disable bigram detection by removing the ngram_threshold= option from build_corpus.

Good luck!


SS159 commented on August 10, 2024


Thanks for sharing this @stijnh - one (final) question that isn't clear to me from the guidance: how can we change the parameters to search for trigrams? I have a feeling that the top-scoring bigram below, "nature_solutions", is actually "nature-based solutions" or "nature based solutions", and I would like to capture this in the word distribution output.

[screenshot]


SS159 commented on August 10, 2024


Thanks @stijnh, although I can't seem to get pandas to write the DataFrame to a .csv; here's what I'm doing:

[screenshots]

There's no error returned, but nothing is being written to the .csv either...


stijnh commented on August 10, 2024

There's no error returned, but nothing is being written to the .csv either...

Replace

DataFrame = pd.DataFrame()

by

DataFrame = litstudy.compute_word_distribution(corpus).sort_index()

You were creating an empty DataFrame and then calling to_excel on that one.
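To make the difference concrete, here is a runnable sketch with a stub standing in for the litstudy call (the stub and its contents are invented; in the real notebook you would call litstudy.compute_word_distribution(corpus) instead):

```python
import pandas as pd

# Hypothetical stand-in for litstudy.compute_word_distribution(corpus).
def compute_word_distribution_stub():
    return pd.DataFrame({"count": [4, 9]}, index=["beta", "alpha"])

# Wrong: this creates an *empty* DataFrame, so exporting it writes no rows.
empty = pd.DataFrame()
assert empty.empty

# Right: assign the computed distribution, then export that object.
dist = compute_word_distribution_stub().sort_index()
dist.to_csv("word_distribution.csv")
print(len(dist))  # 2
```

Using a distinct name such as dist also avoids shadowing the pd.DataFrame class itself.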

