jkeirstead / scholar

Analyse citation data from Google Scholar

License: Other

R 98.81% Makefile 1.19%

scholar's Issues

compare_scholar_careers capped at 8 years

Looks like compare_scholar_careers() is capped at 8 years. E.g. for Feynman and Hawking the output is:

             id year cites career_year               name
1  B7vSqZsAAAAJ 2009  3161           0    Richard Feynman
2  B7vSqZsAAAAJ 2010  3375           1    Richard Feynman
3  B7vSqZsAAAAJ 2011  3471           2    Richard Feynman
4  B7vSqZsAAAAJ 2012  4060           3    Richard Feynman
5  B7vSqZsAAAAJ 2013  4146           4    Richard Feynman
6  B7vSqZsAAAAJ 2014  4039           5    Richard Feynman
7  B7vSqZsAAAAJ 2015  3843           6    Richard Feynman
8  B7vSqZsAAAAJ 2016  4069           7    Richard Feynman
9  B7vSqZsAAAAJ 2017  2409           8    Richard Feynman
10 qj74uXkAAAAJ 2009  4942           0 Stephen W. Hawking
11 qj74uXkAAAAJ 2010  4860           1 Stephen W. Hawking
12 qj74uXkAAAAJ 2011  5477           2 Stephen W. Hawking
13 qj74uXkAAAAJ 2012  5730           3 Stephen W. Hawking
14 qj74uXkAAAAJ 2013  5853           4 Stephen W. Hawking
15 qj74uXkAAAAJ 2014  6207           5 Stephen W. Hawking
16 qj74uXkAAAAJ 2015  5944           6 Stephen W. Hawking
17 qj74uXkAAAAJ 2016  6196           7 Stephen W. Hawking
18 qj74uXkAAAAJ 2017  4084           8 Stephen W. Hawking

The function description mentions the bar chart in scholar profiles as a source; however, clicking on that chart reveals citations for all years of a career, and these don't seem to be used here. Is this expected behaviour? It limits the utility of this function quite drastically.

get_publications returns empty pubid

This is what I would like to do:

run get_publications(author)
for each of the publications: get_article_cite_history(author,pubid)

However, get_publications(author) returns void(0) values in the pubid column, which means I cannot run get_article_cite_history.

Any idea what is not working? The same script was working a couple of months ago, so could it be caused by the recent Google Scholar update?

Clarify which authors are returned by coauthor_network and friends

@cimentadaj as mentioned in discussion of #54, I think it would be best if you made it clear at the appropriate place in the docs and vignette that coauthor_network and friends only return co-authors listed on the google scholar profile, not from all retrieved publications.

get_coauthors: font family not found & arguments imply differing number of rows: 0, 1

@cimentadaj I just tested the new PR (#54), it's super cool, but I've encountered these three issues:

A coauthor appears as "truncated"?

coauthors <- scholar::get_coauthors("bg0BZ-QAAAAJ", n_coauthors=5, n_deep=1)
scholar::plot_coauthors(coauthors)

As you can see, the coauthor Pascale Piolino seems to be lacking her coauthors. This does not appear to change when modifying n_coauthors and n_deep.

The same command returns the following warning:

Warning messages:
1: In grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y,  :
  font family not found in Windows font database

This might be caused by a non-standard font used for the plot?

Changing n_deep to 2 throws the following error:

Error in data.frame(author = author_name, author_url = url, coauthors = coauthors,  : 
  arguments imply differing number of rows: 0, 1

Thanks ;)

H-index predicts negative values

As reported by email, predict_h_index can give negative numbers. This is because the underlying method is a regression analysis calibrated on neuroscientists so those in other fields can get weird answers (see the documentation).

While this isn't a bug per se, it might be preferable to have the method restrict future h-index values to be greater than or equal to the current h-index. Any objections?
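
A minimal sketch of the restriction proposed above, assuming the predictions are available as a plain numeric vector. The helper name clamp_h_predictions is hypothetical and not part of scholar; it only illustrates flooring each predicted value at the current h-index.

```r
# Hypothetical helper (not part of the scholar package):
# never report a predicted h-index below the current one.
clamp_h_predictions <- function(current_h, predicted) {
  # pmax() compares element-wise, recycling current_h across predictions
  pmax(current_h, predicted)
}

# Example with made-up numbers: a current h-index of 5 and three predictions
clamp_h_predictions(5, c(4.2, -1, 7))
```

Flooring keeps the regression output untouched wherever it is already plausible, so only the out-of-range values change.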

Add option to limit number of publications

In the examples, compare_scholars and compare_scholar_careers (src) take ages to run because they pull down hundreds of publications for Stephen Hawking (damn those prolific scientists). I've changed this to Isaac Newton for the moment and made the slowest compare_scholars example DONTRUN but that's a rather temporary fix.

Fields not filled correctly from get_profile()

The get_profile() function returns total citations in the h_index slot, h_index in the i10_index slot, etc. Output and a sessionInfo dump below.

get_profile('0ryVFl8AAAAJ')
$id
[1] "0ryVFl8AAAAJ"

$name
[1] "Chris Miller"

$affiliation
[1] "The Genome Institute at Washington University"

$total_cites
[1] NA

$h_index
[1] 1257

$i10_index
[1] 11

$fields
[1] "cancer genomics" "computational biology" "systems biology"

$homepage
[1] "http://www.chrisamiller.com/"

Warning message:
In get_profile("0ryVFl8AAAAJ") : NAs introduced by coercion

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] scholar_0.1.0

loaded via a namespace (and not attached):
[1] R.cache_0.9.0 R.methodsS3_1.5.2 R.oo_1.15.8 R.utils_1.27.1
[5] XML_3.2-0 digest_0.4.2 plyr_1.8 stringr_0.4

error for get_article_cite_history

I extracted the publication list for one author and tried to get the article citation history, but got an error message. The article tried here is from 2012 and has 86 citations, which is not enough to crash a function.

article1 <- get_article_cite_history('fYQY8Y8AAAAJ', '7591188124196201684')
Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf

get_article_cite_history

I get an 'empty' object with 3 variables (year, cites, pubid) and zero observations with the command
get_article_cite_history(id, article)
I obtained the article ids by using the get_publications functions and I have tried different article ids without success.
There is no problem with: get_citation_history(id)

I am using Version: 0.1.6

Is this a bug or am I doing something wrong?

Cannot get `pubid` after Google Scholar redesign

Google Scholar changed its design recently and the pubid values returned by scholar::get_publications() are now "void(0)".

Incomplete author list for publication

I observed that Google Scholar does not list all authors if a publication has more than 5 or 6 authors ("..." appears at the end of the author list).

To get them, we need to parse the publication page instead of the profile page for any such publication. I wrote some simple code to achieve this. Can you incorporate it in the package so other people can benefit?

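library(XML)  # provides htmlTreeParse(), xpathApply(), xmlValue()

getCompleteAuthors = function(id, pubid)
{
  # Build the URL of the publication page, which lists every author
  url_template = "http://scholar.google.com/citations?view_op=view_citation&citation_for_view=%s:%s"
  url = sprintf(url_template, id, pubid)

  print("parsing html")
  tree = htmlTreeParse(url, useInternalNodes = TRUE)

  print("finding authors")
  # The first "gsc_value" field is taken to be the author list;
  # XPath attribute names are case-sensitive, so match on @class
  auths = xpathApply(tree, '//*/div[@class="gsc_value"]', xmlValue)[[1]]

  return(auths)
}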

Get URL or DOI

Would it be possible for scholar::get_publications() to also find the DOI / URL? That would be very useful!

Thanks for that package!

get_article_cite_history -- one citation in future year

I have found by trial and error that get_article_cite_history fails when there are 0 citations to the article, giving the error message below. I know you're aware of this, so please keep reading after the error message.

Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf

I have run into another case where I get the same error when there is one citation to the article having a publication year in the future. Consider the following example

get_article_cite_history("xXHEaAUAAAAJ","Wp0gIr-vW9MC")

There is one citing article, with publication year 2016. Looking at the source code for this function, there is an attempt to read the citation bar chart provided by GS; however, GS reports no bar chart for the citations to this article, thus the error and perhaps no easy fix, although a check and a graceful return from the function would be welcome.

2015 is almost over, so this particular example will fix itself soon :)

Here is another example with two citations, one in 2016 and one with no year. The same error occurs

get_article_cite_history("0pYNftwAAAAJ","M3ejUd6NZC8C")
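
The check and graceful return requested above could look roughly like the sketch below. It assumes the years and counts parsed from the bar chart arrive as two vectors; the helper name fill_cite_years is hypothetical and this is not the package's actual code.

```r
# Hypothetical helper (not part of the scholar package): build a complete
# year/cites table, returning a zero-row data frame when no chart data
# could be parsed instead of failing inside min(years):max(years).
fill_cite_years <- function(years, cites) {
  years <- suppressWarnings(as.integer(years))
  keep <- !is.na(years)            # drop entries without a usable year
  years <- years[keep]
  cites <- cites[keep]
  if (length(years) == 0) {
    return(data.frame(year = integer(0), cites = numeric(0)))
  }
  # One row per year in the observed range; years without a bar get 0
  all_years <- data.frame(year = seq(min(years), max(years)))
  out <- merge(all_years, data.frame(year = years, cites = cites), all.x = TRUE)
  out$cites[is.na(out$cites)] <- 0
  out
}
```

With empty input this returns an empty data frame with the usual columns, so downstream code can test nrow() rather than catching an error.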

Support for private Scholar profiles

Some users may choose to make their profiles private, in which case the current code is unable to get this data. Investigate whether it's possible to support such profiles.

better error message for bad article ID?

This StackOverflow question clearly reflects user error/confusion, but the error message that you get when putting in a bogus article ID,

Error in min(years):max(years) : result would be too long a vector
In addition: Warning messages:
1: In min(years) : no non-missing arguments to min; returning Inf
2: In max(years) : no non-missing arguments to max; returning -Inf

is not terribly transparent to new users, and could probably be clearer ...

Number of publications

Very nice package!

But I found some problems. I am doing
id <- "xLd8lNoAAAAJ&hl"
augusto <- get_profile(id)
get_publications(id)
get_num_articles(id)

The result is showing only 20 of my papers (not all of them). Am I doing something wrong?

Thanks for your help.

Prepare for scholar 0.1.6

Switch maintainer in DESCRIPTION
Remove missing maintainer note in README
merge develop to master and bump version in DESCRIPTION
tag and prepare github release
release to CRAN

Additionally

travis?
add rstudio project?

@GuangchuangYu are you OK to review develop and take this the next step?

Reason for difference in cite counts from get_citation_history and get_publications

Is there a reason why the count of cites using the two methods should be different? Thanks.
id="B7vSqZsAAAAJ"
citationsbyyear=get_citation_history(id)
sum(citationsbyyear$cites)
pubs=get_publications(id)
sum(pubs$cites)

get_publications only works once

Hello, I am trying to pull publications for multiple authors. However, get_publications() only works with the first ID it is used on in a session. If I try to use it with a subsequent ID, the following is returned:

[1] title author journal number cites year cid pubid
<0 rows> (or 0-length row.names)

Any thoughts? I've carefully checked the IDs, and they are valid and all work- as long as they are the first one tried in a particular session. To do another one, I have to end the session and reopen R, which is very inefficient!

SCImago Journal

Just discovered this nice package for extracting the SJR index of journals' prestige. It could potentially be a nice addition to the impact_factor related functions.

Specifically, as it includes the SJR index for different years, it would provide a unique opportunity to compute this index for each author's publication at their time of publication. Could be interesting for developing new authors' impact metrics.

get_publications() gives error

I get the following error when trying to run get_publications():

Error in assign(name, value, envir = attr(static, ".env")) : use of NULL environment is defunct

Error: is.handle(handle) is not TRUE in get_profile() ... and other functions

Like others, I was having problems with the cookies / 2 requests issues just recently fixed. Was happy to see the new version posted fixing the error.

With the new version, however, everything I try gives me the following error:
Error: is.handle(handle) is not TRUE

This includes all the simple examples on the Readme doc, such as the following code:

# Define the id for Richard Feynman
id <- 'B7vSqZsAAAAJ'

# Get his profile and print his name
l <- get_profile(id)
l$name 

# Get his citation history, i.e. citations to his work in a given year 
get_citation_history(id)

# Get his publications (a large data frame)
get_publications(id)

Zero counts in a year halts get_citation_history(

Greetings,

The following reveals get_citation_history is barfing on a year with zero counts... latest versions of R, RStudio, and all packages installed...

library(scholar)
library(ggplot2)
cit <- get_citation_history('juybEFMAAAAJ&hl=en')
Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 5, 4

Thanks!

tidy_id : Error in curl::curl_fetch_memory(url, handle = handle) : Problem with the SSL CA cert (path? access rights?)

Seems the tidy_id function is not working anymore because of the "https" call in the "sample_url" variable.

Replacing "https" with "http" in the "sample_url" variable solves the issue; however, it requires recompiling the package.

Error retrieving article citation history

Hi.

I love this package!

But I get errors getting some citation history. I get this very inconsistent error. That is, the error is always the same, but it occurs at differing times (sometimes it will loop through all publications I want without error, other times it will fail at varying papers, i.e. it does not always fail at the same paper!):

Error in if (zero_range(from) || zero_range(to)) { :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In max(as.numeric(cit$year)) :
no non-missing arguments to max; returning -Inf
2: In max(cit$cites) : no non-missing arguments to max; returning -Inf

This happens when trying to get cite data for all of this user's publications (not counting the ones that have NAs in them).

I'll try to give you a reproducible example (but as I said, it does not always fail!)

library(scholar)
library(reshape)  # provides cast()

USR="yiSLTAcAAAAJ"

### Publications and citation history ####
Pubs = get_publications(USR)  #get publication information
Pubs$cites = as.numeric(as.character(Pubs$cites))
Pubs = Pubs[complete.cases(Pubs),]

Years = seq(from=min(Pubs$year,na.rm=T),to=max(Pubs$year,na.rm=T),by=1)
CiteYears = as.data.frame(matrix(0, ncol=length(Years), nrow=nrow(Pubs)))
names(CiteYears) = as.character(Years)

###Get each article's cite history
pb = txtProgressBar(min = 1, max = nrow(CiteYears), initial = 1, style=3) ##Set progressbar
for(i in 1:nrow(CiteYears)){
  setTxtProgressBar(pb,i)
  CitePub = get_article_cite_history(USR,Pubs$pubid[i])
  CitePub = cast(CitePub, value="cites", ~year)[,-1]
  CiteYears[i, grepl(paste(names(CitePub),collapse="|"),names(CiteYears))] = CitePub
}

Pubs = cbind(Pubs,CiteYears)

Odd behaviour using compare_scholars()

sorry - should've read the documentation before posting this issue!

Rob

Add google article id to table in `get_publications(id)`

It would be great if this table also included the ID used by google for each article.

This information is available next to citation count in the form of a link, e.g. for this link http://scholar.google.com/scholar?oi=bibs&hl=en&cites=6728932339706166581
the article id is 6728932339706166581.

As far as I can tell, this should be relatively straight forward, just need to extract an extra element at this point in code.

There are three advantages of having the article id:

Allows you to easily link to list of citations for any article using a link of format: http://scholar.google.com/scholar?cites=6728932339706166581
Allows you to link out to article on google using a link of format http://scholar.google.com/scholar?cluster=6728932339706166581
Provides a single unique identifier for the reference that you can record and search against.

I'll have a go at this sometime soon, but don't let that discourage anyone who already knows how to do it.

Update code to work with new page layouts

Google Scholar has changed the layout of the page so the xpaths no longer extract the correct elements.

Push the latest release to CRAN

@GuangchuangYu @jefferis The CRAN version of scholar is still v0.1.4, built on 2015-11-21. Some recent issues (#47, #48) are strange and probably come from the older build. Please push the latest release to CRAN so people can update with update.packages() or install.packages("scholar").

Integrating function from `coauthornetwork` package.

Hi! Thanks a lot for scholar, it's a very interesting package. Inadvertently, I created the coauthornetwork package which does a very simple thing: extracts your coauthor network and visualizes a network of coauthorship.

Of course the package would be much better if it used the already existing functions from the scholar package. Would you be open to integrating the function from coauthornetwork in scholar? @mkiang suggested the idea here and it made sense to me, as this is purely a two-function package which will probably fit much better within the already existing structure of scholar.

I'd adapt the code to match the style/dependencies of the package, of course.

You can check out the package here.

get_citation_history(id) gives error for some id's

Hi. For some reason when I try to get my citation history I get the following error:
Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 9, 8
This doesn't happen for many other IDs that I've tried. My id is 'xpECwJQAAAAJ'... do you know why it doesn't work? Too few citations? ;-)

compare_scholar_careers bug?

If I execute this code:

ids <- c('dYWNWicAAAAJ', 'R2ZrHtsAAAAJ', 'yg0LY3QAAAAJ', 'OixfQOcAAAAJ', 'k6b-lLYAAAAJ', 'g-U6tyoAAAAJ', 'ixGcu5gAAAAJ', 'docGNYEAAAAJ', 'kU2jvOMAAAAJ', '_XJMeP0AAAAJ', 'aqsaHZwAAAAJ', 'wEU99lsAAAAJ')
cmp2 <- compare_scholar_careers(ids)

I get:

Error in data.frame(year = years, cites = vals) :
arguments imply differing number of rows: 7, 6
Calls: compare_scholar_careers ... lapply -> FUN -> cbind -> get_citation_history -> data.frame
Execution halted

PS. these same instructions worked before but I guess some changes in the google data triggers this error.

Error occurring after several consecutive commands.

I am trying to retrieve current h-indices for about 200 scholars at once. I have the ID list and have run a loop that calls predict_h_index() on every ID. It worked fine for the first 70 or so and then ran into Error in tables[[1]] : subscript out of bounds, which also occurred when I was running a similar loop for getting said IDs. It seems that the first time Google blocked me for scraping (when I went on Scholar manually it asked me to verify I wasn't a robot), and switching to a different network solved it.

However, several minutes later, I tried rerunning the command and am now receiving:
Error in if (any(diff(h.vals) < 0)) warning(paste0("Decreasing h-values predicted. ", :
missing value where TRUE/FALSE needed
In addition: Warning message:
In min(papers$year, na.rm = TRUE) :
no non-missing arguments to min; returning Inf
I can't tell if I'm being blocked again, because as I said the command worked fine 70-ish times in a row and nothing changed, but now it doesn't work on any of the IDs. If this is literally Google blocking me every time for scraping, even though I did put in Sys.sleep(), what would you recommend? If not, how might I solve this problem?

Edit: Hello. After further inspections I found that only certain IDs prompted this new error, even though the data manually inspected on Scholar looks fine. The rest work as intended. I made a workaround with tryCatch but I'd love a solution for these particular IDs. Examples of ones that prompt errors: 5JserkUAAAAJ and EdV8gVgAAAAJ.
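
For anyone hitting the same thing, a tryCatch workaround of the kind mentioned above might be sketched as follows. The helper name map_ids_safely and its pause argument are hypothetical, and the pause length that keeps Google from blocking requests is not known here.

```r
# Hypothetical helper (not part of the scholar package): apply `fun` to
# each ID, pausing between calls and recording NA instead of stopping
# the whole loop when one ID fails.
map_ids_safely <- function(ids, fun, pause = 0) {
  out <- vector("list", length(ids))
  names(out) <- ids
  for (i in seq_along(ids)) {
    out[[i]] <- tryCatch(fun(ids[i]), error = function(e) NA)
    if (i < length(ids)) Sys.sleep(pause)  # no need to wait after the last ID
  }
  out
}

# Intended use (needs network access, so left commented out):
# h <- map_ids_safely(ids, scholar::predict_h_index, pause = 5)
```

The failing IDs can then be listed with names(h)[is.na(h)] and inspected one by one.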

Add function to flush the cache

Caching is used to avoid hammering Google's servers when making multiple requests for a scholar's data. However it would be useful to have an option to flush the cache, at least for development.

how to find out the author ID given the name

Hello,

In the given example, you have started with an ID for the author. Is there a way to find out the author ID given his/her name?

Thanks

Single citations

Is there a way to download not an aggregate of citations for a paper of a person but a detailed list of who is citing that paper and, ideally, when the citation came in? This would open many doors for great things in terms of building networks and analyzing the "rate of adoption" (how long it takes for the first citation to drop, etc.).

ERROR: dependency ‘XML’ is not available for package ‘scholar’

I tried install.packages("scholar", dependencies=T) but eventually I get the error above. I'm using R version 3.0.1 (2013-05-16) under Linux Ubuntu 13.10 (64 bit).

Get citation history for a single article

In addition to giving citation histories for an author, Google Scholar also lets you view the results for a single article. It would be nice to be able to retrieve these values as well.

Getting ID of author by function

Is there a way to get ID of the scholar by entering his name? Imagine I have a large character vector of scholar names and I want to get their IDs without searching for them manually one by one. Thanks.

Note limitation on get_citation_history function

Scholar only provides citation history values for the past 9 years. This should be clarified in the documentation.

get_coauthors: normalize authors names

@cimentadaj A minor suggestion: I've noticed some people have their name written in full uppercase (Serge NICOLAS) while some others don't (Dominique Makowski).

Wouldn't it be more neat-looking (especially in the network graph) to homogenise the author names?

It can be easily done by title-casing the names vector:

stringr::str_to_title("Serge NICOLAS")
stringr::str_to_title("Dominique Makowski")

What do you think?

get_oldest_article() returns Inf when article does not have year

get_oldest_article() returns Inf when some articles do not have year.

like in this profile: get_oldest_article("QW5aIMgAAAAJ")

I would like to get multiple authors' information using purrr:map(ids, get_oldest_article) and the function stops due to Inf result.
Is it possible to return NA or "smallest year available" ?
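
One possible shape for the NA-returning behaviour asked about above, as a sketch only. The helper name safe_min_year is hypothetical and works on a plain vector of years rather than on a profile ID.

```r
# Hypothetical helper (not part of the scholar package): smallest
# available year, or NA when no article has a year at all.
safe_min_year <- function(years) {
  years <- years[!is.na(years)]
  if (length(years) == 0) NA_integer_ else min(years)
}

safe_min_year(c(1999L, NA, 2005L))  # uses the years that are present
safe_min_year(c(NA, NA))            # NA rather than Inf
```

Returning NA keeps the result usable inside a map over many IDs, since missing values can be filtered afterwards.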

Best wishes

get_publications() returns only 20 publications.

It seems get_publications() is not retrieving the first 100 papers as mentioned in the literature but only the first 20.

library(scholar)
id = "xJaxiEEAAAAJ" # Isaac Newton
get_num_articles(id)
[1] 20
packageVersion("scholar")
[1] ‘0.1.2’

It would be great if get_publications() could retrieve all papers, or a specific number of papers (with some safe defaults).

get_publications: no publication IDs returned

Hello, Thank you for your package.
I found that publication IDs are not returned by get_publications. Without publication IDs, it is impossible to proceed with commands like get_article_cite_history.

chaoyisheng

Error: Service Unavailable (HTTP 503)

I ran:

sen <- get_publications(id="sLNFo0sAAAAJ", cstart = 0, pagesize = 10, flush = FALSE)

I am getting the following error message:

Error in read_xml.response(x, encoding, ..., as_html = TRUE) :
Service Unavailable (HTTP 503).

Am I getting this because I am being banned by Google Scholar? If so, how can I slow-down the downloading?

Thanks
Adel

aggregate or group h-index would be great

What would be really useful is an addition that allows the calculation of overall lifetime h-index for a group of scholars- such as a lab, a department, and so forth....allow the user to input all ids within the group, and spit out the total number of citations and h-index for the group as a whole. G-index would be a welcome addition to this proposed module or existing ones.

No content is retrieved, potential error at the readHTMLTable stage.

As of today (3AM CET, as the earliest measured occurrence):

get_profile() returns an error with tables:
Error in tables[[1]] : subscript out of bounds
get_citation_history() returns an empty df
[1] year cites
<0 rows> (or 0-length row.names)

I suspect something has changed in the google API?

I tried to figure it out (though I'm not that skilled); pulling the page content with RCurl like so

getURL('https://scholar.google.com/citations?hl=en&user=qZLGnroAAAAJ')

returns a bunch of source code but with an error message at the end:

We're sorry but it appears that there has been an internal server error while processing your request. Our engineers have been notified and are working to resolve the issue.<p>Please try again later.</p>

I'll let you decide if this is worth closing this case. I imagine it is, but there may be more to it that someone more expert can check out.

ideas for g index, i10, i50, i100

Just a simple idea, but

pubs<- get_publications(id, cstart = 0, pagesize = 400, flush = FALSE)

pubs$cumsum <- cumsum(pubs$cites)

pubs$citerank <- get_num_articles(id) - rank(pubs$cites, ties.method = "last") +1

pubs$htest <- (pubs$cites - pubs$citerank) >=0
pubs$hvalue <- sum(pubs$htest)

pubs$gtest <- (pubs$cumsum -pubs$citerank^2) >=0
pubs$gvalue <- sum(pubs$gtest)

# i10, i50, i100: publications with at least 10, 50, 100 citations
pubs$i10 <- pubs$cites >= 10
pubs$i10value <- sum(pubs$i10)

pubs$i50 <- pubs$cites >= 50
pubs$i50value <- sum(pubs$i50)

pubs$i100 <- pubs$cites >= 100
pubs$i100value <- sum(pubs$i100)

You could probably do similar things to get the more exotic indices that Harzing's PoP produces
https://harzing.com/pophelp/metrics.htm
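
The same idea can be tried offline on a plain vector of citation counts. This is a sketch under the usual definitions (h: largest rank r whose paper has at least r citations; g: largest rank r whose top r papers together have at least r^2 citations, capped at the number of papers; i10: papers with at least 10 citations). The function name citation_indices is hypothetical.

```r
# Hypothetical helper (not part of the scholar package): compute h, g and
# i10 from a vector of per-publication citation counts.
citation_indices <- function(cites) {
  # Sort from most to least cited so that position equals citation rank
  cites <- sort(cites[!is.na(cites)], decreasing = TRUE)
  ranks <- seq_along(cites)
  list(
    h   = sum(cites >= ranks),            # papers with at least `rank` cites
    g   = sum(cumsum(cites) >= ranks^2),  # top-r papers with >= r^2 cites in total
    i10 = sum(cites >= 10)                # papers with at least 10 cites
  )
}

# Example with made-up counts
citation_indices(c(10, 8, 5, 4, 3))
```

Sorting first means the result does not depend on the order in which the publications were retrieved.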

Unicode support

There appear to be some issues with Unicode support. See jaumebonet@e90bd0e and problems parsing the number of citations for struck-through values (e.g. when citations are grouped with another article like 'Le cours de physique de Feynman')

scholar_call_home fails when package is not attached

> scholar::tidy_id("cuXoCA8AAAAJ")
Error in if (getOption("scholar_call_home")) { : 
  argument is of length zero
> library(scholar)
> scholar::tidy_id("cuXoCA8AAAAJ")
[1] "cuXoCA8AAAAJ"

This is relevant if the scholar package is imported by another package.
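
The error arises because getOption() returns NULL for an option that was never set, and if (NULL) fails. A defensive check might look like the sketch below. The helper name call_home_enabled is hypothetical, and treating an unset option as FALSE is an assumption, not necessarily the package's intended default.

```r
# Hypothetical helper (not part of the scholar package): read the option
# without failing when it has never been set.
call_home_enabled <- function() {
  # getOption() returns `default` when the option is unset;
  # isTRUE() also guards against NA or non-logical values
  isTRUE(getOption("scholar_call_home", default = FALSE))
}

options(scholar_call_home = NULL)  # simulate the package not being attached
call_home_enabled()
```

Another route would be to set the option when the namespace is loaded rather than when the package is attached, so that scholar:: calls see it too.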
