ckreibich / scholar.py

A parser for Google Scholar, written in Python

Python 100.00%

scholar.py's Issues

scholar.py not working at all, always TypeError happens

I always get this error when trying to run scholar.py:

C:>python scholar.py --phrase "quantum" -c 1
Traceback (most recent call last):
File "scholar.py", line 1272, in
sys.exit(main())
File "scholar.py", line 1255, in main
querier.send_query(query)
File "scholar.py", line 983, in send_query
html = self._get_http_response(url=query.get_url(),
File "scholar.py", line 823, in get_url
urlargs[key] = quote(encode(val))
File "C:\Python34\lib\urllib\parse.py", line 694, in quote
return quote_from_bytes(string, safe)
File "C:\Python34\lib\urllib\parse.py", line 719, in quote_from_bytes
raise TypeError("quote_from_bytes() expected bytes")
TypeError: quote_from_bytes() expected bytes

No matter what parameters I pass, it does not work at all. I'm running it under Windows 7 with Python 3.4.2.
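One plausible cause, which the traceback alone does not confirm: in Python 3, urllib.parse.quote() accepts str or bytes, and any other value (an int such as the result count, or None) falls through to quote_from_bytes() and raises exactly this TypeError. Below is a minimal sketch of a defensive wrapper. The name safe_quote is hypothetical and is not part of scholar.py.

```python
from urllib.parse import quote


def safe_quote(val):
    """Percent-encode a value, converting non-text values to str first.

    Hypothetical helper for illustration; not part of scholar.py.
    """
    if val is None:
        return ''
    if isinstance(val, (bytes, bytearray)):
        return quote(val)
    # ints and other objects are stringified before quoting
    return quote(str(val))


print(safe_quote('quantum theory'))
print(safe_quote(1))
```

Calling quote(1) directly, without the conversion, raises the TypeError shown in the report.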

Missing 'pdf_url' in search results

I ran a few tests on several papers I know and for some of them the download url for the pdf couldn't be fetched, although it is available.

Proxy Support

I cannot connect to Google Scholar with this script because of network restrictions at our company.
It would be great if the program could add proxy support.
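A minimal sketch of how proxy support could look with the standard library, assuming the HTTP requests go through a urllib opener. The function name make_proxy_opener and the proxy address are hypothetical, and how scholar.py would accept such an opener is not shown in this thread.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS through one proxy.

    Hypothetical helper; proxy_url is e.g. 'http://proxy.example.com:8080'.
    """
    handler = ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return build_opener(handler)


opener = make_proxy_opener('http://proxy.example.com:8080')
# opener.open(url) would now send the request via the proxy.
```

Without any code change, urllib also picks up the http_proxy and https_proxy environment variables by default, which may be enough in some corporate networks.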

More Than one Author search argument

Hello,
Does this tool support searching for more than one author, like this: --author "Akihiro Shibayama, Shosuke Sato, Fumihiko Imamura"?

How to deal with "Please show you're not a robot"

I found that the HTML returned contains only "Please show you're not a robot". How can I avoid that?

delete me

same name

For authors who share a name with other authors, the search returns more articles than expected.
For example, when I use this query:
scholar.py --txt-globals -a "Dustin Tran"
it finds 20 results, but his profile page, https://scholar.google.com/citations?user=wVazIm8AAAAJ&hl=en,
lists only 15 articles.

-t option raises QueryArgumentError, other options do not

I am getting a QueryArgumentError when I am using the -t flag but not the -p flag:

$ python scholar.py -c 1 -p "On the mechanism of DNA replication in mammalian chromosomes"
         Title On the mechanism of DNA replication in mammalian chromosomes
           URL http://www.sciencedirect.com/science/article/pii/0022283668900132
          Year 1968
     Citations 933
      Versions 3
    Cluster ID 16701884832670113656
Citations list http://scholar.google.com/scholar?cites=16701884832670113656&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=16701884832670113656&hl=en&as_sdt=0,5

$ python scholar.py -c 1 -t "On the mechanism of DNA replication in mammalian chromosomes"
Traceback (most recent call last):
  File "scholar.py", line 1068, in <module>
    sys.exit(main())
  File "scholar.py", line 1051, in main
    querier.send_query(query)
  File "scholar.py", line 809, in send_query
    html = self._get_http_response(url=query.get_url(),
  File "scholar.py", line 639, in get_url
    raise QueryArgumentError('search query needs more parameters')
__main__.QueryArgumentError: search query needs more parameters

I do not get this error with the -A flag either:

$ python scholar.py -c 1 -A "On the mechanism of DNA replication in mammalian chromosomes"
         Title On the mechanism of DNA replication in mammalian chromosomes
           URL http://www.sciencedirect.com/science/article/pii/0022283668900132
          Year 1968
     Citations 933
      Versions 3
    Cluster ID 16701884832670113656
Citations list http://scholar.google.com/scholar?cites=16701884832670113656&as_sdt=2005&sciodt=0,5&hl=en
 Versions list http://scholar.google.com/scholar?cluster=16701884832670113656&hl=en&as_sdt=0,5
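A possible explanation, offered as an assumption: elsewhere in this list, -t appears without a value and the options dump shows 'title_only': False, which suggests -t is a boolean flag. If so, the quoted string after -t is not consumed as a search term, and the query ends up with no parameters. The sketch below reproduces that behaviour with optparse. The option definitions are a simplified stand-in, not scholar.py's actual parser.

```python
from optparse import OptionParser


def build_parser():
    """Simplified, hypothetical stand-in for the command-line options."""
    parser = OptionParser()
    parser.add_option('-c', '--count', type='int', default=None)
    parser.add_option('-p', '--phrase', default=None)
    # A boolean flag: it never consumes the argument that follows it.
    parser.add_option('-t', '--title-only', action='store_true', default=False)
    return parser


options, leftover = build_parser().parse_args(
    ['-c', '1', '-t', 'On the mechanism of DNA replication'])
print(options.title_only, options.phrase, leftover)
```

Under that assumption, combining the flag with a search option, for example -t -p "...", would restrict the phrase search to titles.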

given a paper, can it give all authors?

Given a paper as input, can this API provide a list of all authors associated with it (each with a unique ID)?

BibTex citation works

Hi,
thanks a lot for the tool! I would like to ask whether anyone has had problems with citations in BibTeX format. I have used the option "--citation bt", but sometimes it works and sometimes it doesn't, for no apparent reason.
Thanks in advance!

Giovanni

Allow paging

Allow paging to receive results >20. Can be done with Google Scholar's search parameter 'start'.
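A rough sketch of what paging with 'start' could look like: one URL per page, each offset by the page size. The base URL is taken from the outputs quoted in other issues. The 'q' and 'num' parameter names and the helper page_urls are assumptions for illustration, not scholar.py code.

```python
from urllib.parse import urlencode

SCHOLAR_URL = 'http://scholar.google.com/scholar'  # as seen in quoted outputs
PAGE_SIZE = 20  # per-page maximum reported in this thread


def page_urls(words, total, page_size=PAGE_SIZE):
    """Return one search URL per page needed to cover `total` results.

    Hypothetical helper; 'q' and 'num' are assumed parameter names.
    """
    urls = []
    for start in range(0, total, page_size):
        params = {
            'q': words,
            'start': start,
            'num': min(page_size, total - start),
        }
        urls.append(SCHOLAR_URL + '?' + urlencode(params))
    return urls


for url in page_urls('quantum theory', 45):
    print(url)
```

Each extra page is another request, so paging makes it more likely to hit the query limit discussed in other issues.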

How to get around being blocked permanently? (Persistent 503 error)

I wrote an automated script using scholar.py (not realizing that Google Scholar has a query limit). Now my program consistently runs into a 503 error even though I've successfully done the captcha in my web browser. I have some questions about this:

When will the ban usually be lifted?
I've seen some mention cookies as a solution to this - can anyone tell me the details on how to do this?

Thank you and thanks for making a great API!
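How long a block lasts is not documented in this thread, so the sketch below only shows a generic way to avoid hammering the server after a failure: retry with an exponentially growing delay. The function retry_with_backoff and its parameters are hypothetical, and fetch stands for whatever call performs one query and returns None on failure.

```python
import time


def retry_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None result.

    The delay doubles after every failed attempt. Returns None if all
    attempts fail. Hypothetical helper for illustration.
    """
    delay = base_delay
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(delay)
            delay *= 2
    return None
```

The sleep argument is injectable so the function can be exercised without actually waiting. In real use the base delay would likely need to be far larger than one second.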

Program not working

I tested the program, but it shows zero results for any query.
When running on python3 I get an error:

TypeError: quote_from_bytes() expected bytes

--citations-only doesn't exist

Used this command:

./scholar.py --phrase "Online Clustering of Bandits" --citations-only --citation bt

Got this error:

scholar.py: error: no such option: --citations-only

list index out of range

When I run this code:
title = 'correlating equations for laminar and turbulent free convection from a vertical plate'
paper = next(sch.search_pubs_query(title)).fill()
I get an error that says this:
  File "citation_info.py", line 5, in <module>
    paper = next(sch.search_pubs_query(title)).fill()
  File "/usr/local/lib/python2.7/site-packages/scholarly.py", line 183, in fill
    bibtex = _get_page(self.url_scholarbib)
  File "/usr/local/lib/python2.7/site-packages/scholarly.py", line 62, in _get_page
    img_url = img_url_soup.findAll(alt='scholarly_captcha')[0].get('src')
IndexError: list index out of range
"citation_info.py" is the script that calls the two lines of code. Thanks in advance for any help.

--citation=FORMAT now produces blank output

"scholar.py -c 1 --txt --author einstein quantum --citation=bt"

gives no output, while

"scholar.py -c 1 --txt --author einstein quantum"

produces the correct output as seen in the example documentation.

This has only been a recent problem, "--citation=bt" used to work for me.

Sort by available PDFs

Hi,
I would like a query option that returns only journal articles with PDFs that I have access to.
Is this possible?

Thanks,
Brad

How to use from within Python for "searching with author/word/phrase"

What is the proper way to use the code from within Python (instead of from the command line)? I tried the following code, but it gives me an empty array.

import scholar

# Create the querier once and apply default settings
querier = scholar.ScholarQuerier()
settings = scholar.ScholarSettings()
querier.apply_settings(settings)

def searchScholar(searchphrase):
    # Build a word-based search query and send it
    query = scholar.SearchScholarQuery()
    query.set_words(searchphrase)
    querier.send_query(query)
    print(len(querier.articles))

searchScholar('Evaluating technologies for education')

Queries with counts above 19 come back blank

When I try to run queries with counts over 19, the script returns nothing and goes straight back to the prompt. What is the reason for this behavior? Does Google block queries for more than 19 results? Or is it because the script only looks at the first page of results? Is there an easy workaround, or would I have to code this functionality myself?

License

Hi @ckreibich - great library!
I'm considering writing a golang library like this for Google Scholar
by using the functions in this library as a reference.
Of course, I will attribute and link to this project,
but I think you hold all of the copyright without a license.
A reference for this is: https://help.github.com/articles/open-source-licensing/

Would you be willing to add a license so I can do this?

-Brandon.

scholar.py is not working in Amazon Web Services (AWS)

I tried the example script and get nothing, the same script works in my local computer. Any ideas?

python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"

Get citations of specific user

Hi! Maybe this is not an issue, but I want to use your script for my personal webpage. When I run it with "--author=My Name" it works well, but it returns the articles of everyone else in the world who has the same name as me.
I need to get only my own papers. I have found that Google uses a "user=HASH" parameter on a user's profile page, which lists the papers for which that user has claimed authorship.
Is there a way to fetch this particular page?

Thanks in advance!

Google scholar limit query rate ?

Hi, thank you very much for the tool.
Does anyone have a rough idea of Google Scholar's query rate limits? It would help to respect them.
Thanks
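The actual limits are not stated anywhere in this thread, so any number is a guess. What a script can do regardless is enforce a minimum pause between queries. Below is a minimal sketch. The Throttle class and its interval are hypothetical and not part of scholar.py.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, None before the first

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Usage would be to create one Throttle, for example Throttle(30.0), and call its wait() method before every query. The 30-second value is only an illustration.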

cannot run python scholar.py : We need BeautifulSoup

I tried to run
$ python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
and the output is: We need BeautifulSoup, sorry...

I have installed BeautifulSoup (pip install beautifulsoup).
Thanks for any solution :-)
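A quick way to narrow this down is to check which Beautiful Soup package, if any, the interpreter running the script can actually import. The sketch below assumes the script looks for either the bs4 module (Beautiful Soup 4) or the older BeautifulSoup module. The helper find_soup_package is hypothetical.

```python
import importlib.util


def find_soup_package(candidates=('bs4', 'BeautifulSoup')):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print(find_soup_package())
```

If this prints None, the package was probably installed for a different Python installation than the one running scholar.py. Running pip as "python -m pip install beautifulsoup4" with the same interpreter is one common way to rule that out.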

BeautifulSoup Parser Warning

BeautifulSoup complains:

/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

"More Results" option

Use from within Python

What is a good way to use the code from within Python?
Using the package from the command line is a nuisance for me.
Thanks for the package!

Author field in output

Except for bibtex output, it is not possible to output author fields. Don't know whether it's deliberate, but would be really nice.

citation option

Does anyone else have problems with --citation option not working anymore?

Abstract extraction in CSV not correctly handled

The following search is an example of an abstract that is being incorrectly split into multiple fields such that there are more resulting CSV fields than headers.

$ ./scholar.py -p 'Sensible Scenes: Visual Understanding of Complex Structures through Causal Analysis.' -t --csv-header

title|url|year|num_citations|num_versions|cluster_id|url_pdf|url_citations|url_versions|url_citation|excerpt
Sensible Scenes: Visual Understanding of Complex Structures through Causal Analysis.|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.3770&rep=rep1&type=pdf|1993|31|5|13569529201397445945|None|http://scholar.google.com/scholar?cites=13569529201397445945&as_sdt=2005&sciodt=0,5&hl=en|http://scholar.google.com/scholar?cluster=13569529201397445945&hl=en&as_sdt=0,5|None|Abstract An important result of visual understanding is an explanation of a scene's causal structure: How action| usually motion| is originated, constrained, and prevented, and how this determines what will happen in the immediate future. To be useful for a purposeful  ...
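The excerpt itself contains the '|' separator, so a plain join produces extra fields. A minimal sketch of one way around this is to let Python's csv module quote any field that contains the delimiter. The helper names are hypothetical and this is not how scholar.py currently writes its output.

```python
import csv
import io


def to_csv_line(fields, sep='|'):
    """Serialize one row, quoting fields that contain the separator."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, quoting=csv.QUOTE_MINIMAL,
                        lineterminator='\n')
    writer.writerow(fields)
    return buf.getvalue()


def from_csv_line(line, sep='|'):
    """Parse one row written by to_csv_line back into a list of fields."""
    return next(csv.reader(io.StringIO(line), delimiter=sep))


row = ['Sensible Scenes', '1993', 'How action| usually motion| is originated']
line = to_csv_line(row)
print(line, end='')
print(from_csv_line(line) == row)
```

A consumer then has to read the file with a CSV parser rather than splitting on '|' by hand.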

HTTP503 error when running from server

Why do I get an HTTP 503 error when I run this from a server?

Python's built-in HTMLParser cannot parse the given document.

/usr/lib/python2.6/site-packages/bs4/builder/_htmlparser.py:163: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.

Trouble running the parser from webserver

Hi,

I'm trying to run this script directly from a web server. I can get it to run some basic options like --help, but I get only empty output when running an actual query. Any suggestions?

Thanks,
Roy

Giving a cluster ID without additional arguments should just pull up all the articles in the cluster

Some queries retrieve nothing

Hi,

I want to report a problem.
I use scholar.py to collect citations of papers. Generally, it works fine, but some queries return no results at all. (They should have results, because I tried them manually in Google Scholar.)

Below is an example. The name of the paper is "The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine":

python scholar.py -c 5 --phrase "The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine"

{'count': 5, 'none': None, 'after': None, 'author': None, 'cookie_file': None, 'citation': None, 'some': None, 'title_only': False, 'pub': None, 'allw': None, 'version': False, 'cluster_id': None, 'debug': 0, 'phrase': 'The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine', 'csv_header': None, 'txt': None, 'csv': None, 'before': None}

(No search results follow.)

Does anyone know what is happening?
Thanks.

[wishlist]: query based on a known "cluster" id

E.g. if I already know an ID google assigned to an article, e.g.

http://scholar.google.com/scholar?cluster=2738650369888597462

count still could be applied. Thank you in advance! ;)

how to save output in csv

I want to save the data in CSV format. Is it possible to save the output?

Automatic CV annotator?

Has anyone used this to write a tool that parses PDF CVs and annotates them with the number of Scholar citations? That would be very useful! Kind of hard to do, I guess...

List of papers cited by a paper

List papers citing a paper

Xavi Anguera has suggested making the list of papers citing a paper queryable via the API. This needs a bit more thinking about the notion of paper identity (cluster ID) vs presentation to the user, but shouldn't be a big problem otherwise.

Google Policy on Scraping Google Scholar

I know that Google Scholar imposes a query limit, but does it have any explicit policy prohibiting automated scraping of Google Scholar results? Applications like Harzing's Publish or Perish openly scrape Google Scholar and have been operating for years.

Getting APA/MLA citation for a result

Hi, nice work on the script! Apart from getting citations in BibTeX, is there a way I could directly get the citation for a given result in another style (e.g. APA or MLA)?

Get h-index, i-10 index

Doesn't handle CAPTCHAs from Google

Currently, scholar.py just returns blank if a captcha is displayed by google. Could a method for displaying the captcha be added so it can be solved?

Queries seem limited to the first 20 results provided by Scholar

Hello,

I executed the following command line:

python scholar.py scholar.py -c 100 --after=2007 --author "brice morin" --citation bt > ref.bib

My goal was to extract a bibtex file containing all my papers. It however seems that what I get as a result (despite the -c 100 option) is the bibtex entries of all my papers provided on the first page on my Google Scholar page, basically the first 20 papers.

It would be nice if I (and I guess it makes sense for many other users of your tool) would be able to get all entries, not just the first page provided by Google Scholar.

Thank you

UnicodeEncodeError: 'ascii' codec can't encode character

python3.3 scholar.py -c 5 -a "albert einstein" -t --none "quantum theory" --after 1970
Traceback (most recent call last):
File "scholar.py", line 1275, in
sys.exit(main())
File "scholar.py", line 1267, in main
txt(querier, with_globals=options.txt_globals)
File "scholar.py", line 1098, in txt
print(encode(art.as_txt()) + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 824: ordinal not in range(128)
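The error means the terminal's encoding (here ASCII) cannot represent a character in the result, in this case the copyright sign '\xa9'. Below is a minimal workaround sketch: replace unencodable characters before printing. The helper printable is hypothetical and not part of scholar.py.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by '?'. Defaults to the encoding of sys.stdout."""
    enc = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(enc, errors='replace').decode(enc)


print(printable('\xa9 2014 Elsevier', 'ascii'))
```

Setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another option that keeps the original characters, provided the terminal can display UTF-8.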

Newly added features and some clarifications

Hello @ckreibich
Have you incorporated all the modifications and features proposed in the posted issues into the original scholar.py?
Also, there is another tool that queries Google Scholar. I wonder how it differs from yours? See this link: https://github.com/hildensia/scholar

Lastly, what rate limit does scholar.py enforce when querying Google Scholar?

Thanks for your support

MAX_PAGE_RESULTS = 20 # Current maximum for per-page results