Comments (13)
Thanks for using LitStudy!
Looks like build_corpus expects a DocumentSet, and it seems that docs_springer is not a DocumentSet but something else. Could you maybe provide the rest of the notebook, or the line that creates docs_springer?
refine_scopus returns two document sets: one for the documents found on Scopus and one for the documents not found on Scopus.
You would need to do something like this:
docs_springer, docs_not_found = litstudy.refine_scopus(docs_springer)
print(len(docs_springer), "papers found on Scopus")
print(len(docs_not_found), "papers NOT found on Scopus")
Hi,
Great, thanks stijnh.
Separately, I was wondering whether the full results from the word distribution can somehow be viewed, as the table output seems to provide only a snapshot?
Thanks, as always,
S
This is the complete table of all ngrams, that is, all the words that contain a _ after processing (that is what the .filter(like="_") does). Remove the .filter(...) part to see the complete word distribution.
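To make the filter concrete, here is a toy example with a made-up word-count table (the real table comes from compute_word_distribution, and whether it needs axis=0 depends on whether the words sit in the index, which is an assumption here):

```python
import pandas as pd

# Toy stand-in for the word-distribution table; the counts are invented.
# Words merged into ngrams during preprocessing contain an underscore.
dist = pd.DataFrame(
    {"count": [120, 35, 20, 18]},
    index=["nature", "nature_solutions", "climate_change", "policy"],
)

# Keep only rows whose index label contains "_", i.e. the detected ngrams.
ngrams_only = dist.filter(like="_", axis=0)
print(ngrams_only.index.tolist())  # ['nature_solutions', 'climate_change']

# Dropping the .filter(...) call leaves the full distribution of all words.
print(len(dist))  # 4
```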
@stijnh I'm also a bit confused about the ngram_threshold, even after reading the guidance documents. An ngram_threshold of 0.8 does what exactly? Classifies something as agreeing with/matching that ngram if 80% of its characters are the same as the reference ngram (included in the corpus)?
Sorry for the question, but I can't seem to work this out on my own and it would be good to know how LitStudy is working here.
Thanks,
S
The parameter ngram_threshold determines how sensitive the preprocessing is to detecting bigrams (also called ngrams). The value is passed as the threshold to gensim, where a higher value means fewer bigrams will be detected. A bigram is a pair of words that frequently appear next to each other (for example, think of words like "data processing", "social media", "human rights", "United States").
The actual processing is done by gensim; see the threshold parameter in its documentation: https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases
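The threshold is compared against a phrase score, not against character overlap. As a rough pure-Python sketch, gensim's documented default scoring rule for Phrases looks like this (the counts below are invented, and which scoring function litstudy actually configures is an assumption to check):

```python
# Sketch of gensim's default bigram score ("original_scorer", from the
# Mikolov et al. word2vec paper). Toy counts only, for illustration.
def bigram_score(count_a, count_b, count_ab, min_count, vocab_size):
    # score = (count(a,b) - min_count) * |V| / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Assumed corpus statistics: "nature" appears 50 times, "solutions" 40 times,
# the pair "nature solutions" 30 times, in a vocabulary of 1000 words.
score = bigram_score(count_a=50, count_b=40, count_ab=30,
                     min_count=5, vocab_size=1000)
print(score)  # 12.5

# The pair is merged into "a_b" only if score > threshold, so raising the
# threshold makes detection stricter and yields fewer bigrams.
```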
Hi stijnh thanks for the quick response.
Sure, here we go:
I have defined docs_springer as a DocumentSet in my case, and that seems to have resolved the error: the output is no longer an AttributeError, but instead (as below):
Does this look correct to you?
Great, thanks for your help. I have removed the .filter(like="_") and am obviously presented with a larger list. My question is how I can view/export/download this list in its entirety?
Thanks again,
Sam
Hi @stijnh another quick question from me which might have a simple answer, hence why I am not opening it as a new issue:
In the word distribution plot which has been produced below, is the highest result saying that the word 'nature' only appears across 35% of the documents? I am asking because it was a key search term used in the original Scopus search, so in theory all of the documents (that is, 100%) should include the word 'nature'.
Thanks, as always, for your patience and advice,
Sam
The thing returned by compute_word_distribution is a regular pandas DataFrame. You can use pandas' I/O functions to export it to a file: https://pandas.pydata.org/docs/reference/io.html
For example, you can add ...sort_index().to_csv("word_distribution.csv")
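A minimal end-to-end sketch of that export, using a made-up table in place of the one returned by compute_word_distribution:

```python
import os
import tempfile

import pandas as pd

# Invented counts standing in for the word-distribution table.
dist = pd.DataFrame({"count": [120, 35]}, index=["nature", "nature_solutions"])

# Any pandas DataFrame can be written out with the standard I/O methods.
path = os.path.join(tempfile.mkdtemp(), "word_distribution.csv")
dist.sort_index().to_csv(path)

print(open(path).read())
```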
Not sure about this one. Maybe sometimes nature is followed by solutions and it is interpreted as the bigram nature_solutions. You can disable bigram detection by removing the ngram_threshold= option from build_corpus.
Good luck!
Thanks for sharing this @stijnh - one (final) question which isn't clear to me from the guidance, how can we change the parameters to search for trigrams? I have a feeling that the top scoring bigram below "nature_solutions" is actually "nature-based solutions" or "nature based solutions", and would like to capture this in the word distribution output.
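With gensim itself, the common approach to trigrams is to apply phrase detection twice, so an already-merged bigram can merge with a following word; whether litstudy exposes a direct option for this is an assumption to verify. A toy illustration of the two-pass idea (the merge rule here is a hard-coded stand-in for gensim's statistical one):

```python
# Toy phrase merger: joins adjacent token pairs from a fixed set. In real
# gensim, the Phrases model chooses the pairs from corpus statistics instead.
def merge_phrases(tokens, known_pairs):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known_pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

doc = ["nature", "based", "solutions", "for", "cities"]
# Pass 1 merges "nature based" into a bigram...
pass1 = merge_phrases(doc, {("nature", "based")})
# ...and pass 2 can then merge the bigram with "solutions" into a trigram.
pass2 = merge_phrases(pass1, {("nature_based", "solutions")})
print(pass2)  # ['nature_based_solutions', 'for', 'cities']
```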
Thanks @stijnh , although I can't seem to get pandas to write the DataFrame to a .csv, here's what I'm doing:
There's no error returned, but nothing being written to the .csv either...
Replace
DataFrame = pd.DataFrame()
by
DataFrame = litstudy.compute_word_distribution(corpus).sort_index()
You were creating an empty DataFrame and then calling to_excel on that one.
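To see why the original snippet silently produced a blank file, compare an empty DataFrame with a populated one (toy numbers, standing in for the word-distribution table):

```python
import pandas as pd

# The bug in miniature: an empty DataFrame next to a populated one.
empty = pd.DataFrame()
full = pd.DataFrame({"count": [120, 35]}, index=["nature", "nature_solutions"])

print(len(empty), len(full))  # 0 2
# Exporting `empty` raises no error, but the output contains no data rows,
# which is why the file looked blank; export the populated frame instead.
```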
Related Issues (20)
- 'No Edges Given' for Network Analysis
- ValueError: n_components must be < n_features; got 50 >= 47
- `build_corpus` always removes words having a frequency below 5
- Different results from unique() and difference of deduplicated set
- module 'networkx' has no attribute 'to_scipy_sparse_matrix'
- Incompability with gensim 4
- Unexpected results from litstudy.plot_author_histogram()
- Listing document titles
- Support for google scholar
- refine_scopus - low it/s speed; necessary to refine every time?
- TypeError: object of type 'method' has no len()
- Saving language models
- Documentation on search_ function queries
- Search_semanticscholar with list
- Scopus400Error: Error translating query - Refining results with "source title" query argument
- train_lda_model() fails to access gensim
- Scopus400Error: Exceeds the maximum number allowed for the service level.
- Scopus exceeds csv field limit
- SemanticScholar search optimization
- DocumentIdentifier.matches() is case-sensitive