ckreibich / scholar.py
A parser for Google Scholar, written in Python
I always get this error when trying to run scholar.py:
C:>python scholar.py --phrase "quantum" -c 1
Traceback (most recent call last):
File "scholar.py", line 1272, in <module>
sys.exit(main())
File "scholar.py", line 1255, in main
querier.send_query(query)
File "scholar.py", line 983, in send_query
html = self._get_http_response(url=query.get_url(),
File "scholar.py", line 823, in get_url
urlargs[key] = quote(encode(val))
File "C:\Python34\lib\urllib\parse.py", line 694, in quote
return quote_from_bytes(string, safe)
File "C:\Python34\lib\urllib\parse.py", line 719, in quote_from_bytes
raise TypeError("quote_from_bytes() expected bytes")
TypeError: quote_from_bytes() expected bytes
No matter what parameters I pass, it does not work at all. I'm running it on Windows 7 with Python 3.4.2.
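One possible cause of the TypeError above: on Python 3, urllib.parse.quote accepts only str or bytes, so any other value (an integer, for example) falls through to quote_from_bytes and is rejected. The following is a minimal sketch of a workaround under that assumption. The helper name quote_any is hypothetical and is not part of scholar.py.

```python
from urllib.parse import quote


def quote_any(val):
    """URL-quote val, converting non-text values to str first.

    Hypothetical helper for illustration; not part of scholar.py.
    """
    if isinstance(val, (bytes, bytearray)):
        return quote(val)   # quote() handles bytes-like input directly
    return quote(str(val))  # str() covers ints and other non-text values


print(quote_any("quantum theory"))
print(quote_any(1))
```

Converting to str before quoting avoids the bytes-only code path without changing how ordinary string arguments are quoted.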
I ran a few tests on several papers I know, and for some of them the download URL for the PDF couldn't be fetched, although it is available.
I cannot connect to Google Scholar with this script because of network restrictions at our company.
It would be great if the program added proxy support.
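As a rough illustration of what proxy support could look like, here is a sketch using the standard library's urllib.request.ProxyHandler. The proxy address and the function name build_proxy_opener are placeholders, and whether scholar.py can be pointed at such an opener without modification is an assumption that has not been checked.

```python
import urllib.request

# Hypothetical proxy address; replace with the real one for your network.
PROXY = "http://proxy.example.com:8080"


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY)
# opener.open("https://scholar.google.com/") would now be sent via the proxy.
```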
Hello
Does this tool support searching for more than one author, like this: --author "Akihiro Shibayama, Shosuke Sato, Fumihiko Imamura"
I found that the HTML returned is only "Please show you're not a robot". How can I avoid that?
When several authors share the same name, the search returns more articles than belong to one author.
For example, when I use this query:
scholar.py --txt-globals -a "Dustin Tran"
it finds 20 results, but his page https://scholar.google.com/citations?user=wVazIm8AAAAJ&hl=en
lists only 15 articles.
I am getting a QueryArgumentError when I am using the -t flag but not the -p flag:
$ python scholar.py -c 1 -p "On the mechanism of DNA replication in mammalian chromosomes"
Title On the mechanism of DNA replication in mammalian chromosomes
URL http://www.sciencedirect.com/science/article/pii/0022283668900132
Year 1968
Citations 933
Versions 3
Cluster ID 16701884832670113656
Citations list http://scholar.google.com/scholar?cites=16701884832670113656&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=16701884832670113656&hl=en&as_sdt=0,5
$ python scholar.py -c 1 -t "On the mechanism of DNA replication in mammalian chromosomes"
Traceback (most recent call last):
File "scholar.py", line 1068, in <module>
sys.exit(main())
File "scholar.py", line 1051, in main
querier.send_query(query)
File "scholar.py", line 809, in send_query
html = self._get_http_response(url=query.get_url(),
File "scholar.py", line 639, in get_url
raise QueryArgumentError('search query needs more parameters')
__main__.QueryArgumentError: search query needs more parameters
I do not get this error with the -A flag either:
$ python scholar.py -c 1 -A "On the mechanism of DNA replication in mammalian chromosomes"
Title On the mechanism of DNA replication in mammalian chromosomes
URL http://www.sciencedirect.com/science/article/pii/0022283668900132
Year 1968
Citations 933
Versions 3
Cluster ID 16701884832670113656
Citations list http://scholar.google.com/scholar?cites=16701884832670113656&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=16701884832670113656&hl=en&as_sdt=0,5
Given a paper as input, can this API provide a list of all authors associated with it (each with a unique ID)?
Hi,
Thanks a lot for the tool! I would like to ask whether anyone has had problems with citations in BibTeX format. I have used the option "--citation bt", but it only works some of the time, for no apparent reason.
Thanks in advance!
Giovanni
Allow paging to receive results >20. Can be done with Google Scholar's search parameter 'start'.
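A sketch of how paging with the 'start' parameter might work: build one search URL per page, advancing 'start' by the page size each time. The function name page_urls is hypothetical, and the 'q' and 'num' parameter names are assumptions about Google Scholar's URL format rather than documented API.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"
PAGE_SIZE = 20  # per-page maximum reported in these issues


def page_urls(query, total, page_size=PAGE_SIZE):
    """Return the search URLs needed to cover `total` results."""
    urls = []
    for start in range(0, total, page_size):
        # The last page may need fewer than page_size results.
        num = min(page_size, total - start)
        params = {"q": query, "start": start, "num": num}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


for url in page_urls("quantum theory", 50):
    print(url)
```

Each URL would still count as a separate query against Google Scholar, so paging makes the query limit easier to hit.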
I wrote an automated script using scholar.py (not realizing that Google Scholar has a query limit). Now my program consistently runs into a 503 error even though I've successfully done the captcha in my web browser. I have some questions about this:
Thank you and thanks for making a great API!
I tested the program, but it shows zero results for any query.
When running on python3 I get an error:
TypeError: quote_from_bytes() expected bytes
Used this command:
./scholar.py --phrase "Online Clustering of Bandits" --citations-only --citation bt
Got this error:
scholar.py: error: no such option: --citations-only
When I run this code:
title = 'correlating equations for laminar and turbulent free convection from a vertical plate'
paper = next(sch.search_pubs_query(title)).fill()
I get an error that says this:
File "citation_info.py", line 5, in <module>
paper = next(sch.search_pubs_query(title)).fill()
File "/usr/local/lib/python2.7/site-packages/scholarly.py", line 183, in fill
bibtex = _get_page(self.url_scholarbib)
File "/usr/local/lib/python2.7/site-packages/scholarly.py", line 62, in _get_page
img_url = img_url_soup.findAll(alt='scholarly_captcha')[0].get('src')
IndexError: list index out of range
"citation_info.py" is the script that calls the two lines of code. Thanks in advance for any help.
"scholar.py -c 1 --txt --author einstein quantum --citation=bt"
gives no output, while
"scholar.py -c 1 --txt --author einstein quantum"
produces the correct output as seen in the example documentation.
This has only been a recent problem, "--citation=bt" used to work for me.
Hi,
I would like a query option that returns only journal articles with PDFs that I have access to.
Is this possible?
Thanks,
Brad
What is the proper way to use the code from within Python (instead of from the command line)? I tried the following code, but it's giving me an empty array.
import scholar

querier = scholar.ScholarQuerier()
settings = scholar.ScholarSettings()
querier.apply_settings(settings)

def searchScholar(searchphrase):
    # Build a query from free-text words and send it.
    query = scholar.SearchScholarQuery()
    query.set_words(searchphrase)
    querier.send_query(query)
    # Number of articles parsed from the response.
    print(len(querier.articles))

searchScholar('Evaluating technologies for education')
When I try to do queries with counts over 19, it just goes straight to the next input. What is the reason for this behavior? Does Google block queries over 19 results? Alternatively, is it because the script only looks at the first page of results? Is there an easy workaround in this case, or would I have to code this functionality myself?
Hi @ckreibich - great library!
I'm considering writing a golang library like this for Google Scholar
by using the functions in this library as a reference.
Of course, I will attribute and link to this project,
but I think you hold all of the copyright without a license.
A reference for this is: https://help.github.com/articles/open-source-licensing/
Would you be willing to add a license so I can do this?
-Brandon.
scholar.py is not working in Amazon Web Services (AWS)
I tried the example script and get nothing; the same script works on my local computer. Any ideas?
python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
Hi! Maybe this is not an issue, but I want to use your plugin for my personal webpage. When I run your script with "--author=My Name" it works well, but it returns all the articles published by everyone else in the world who has the same name as me.
I need to get only my own papers. I have found that Google uses a "user=HASH" parameter on a user's profile page, which lists the papers that user has claimed authorship of.
Is there a way of getting this particular page?
Thanks in advance!
Hi, thank you very much for the tool.
Does anyone have a rough idea of Google Scholar's query rate limits? It would help to respect them.
Thanks
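Since the actual limits are not stated anywhere in this thread, one cautious approach is to pause for a random interval between queries. The sketch below assumes delays of 30 to 60 seconds, which are illustrative guesses and not known Google Scholar limits. The function name throttled is hypothetical.

```python
import random
import time


def throttled(items, min_delay=30.0, max_delay=60.0, sleep=time.sleep):
    """Yield items one at a time, pausing a random interval between them.

    The default delays are guesses, not documented limits. The sleep
    argument exists so the pause can be replaced in tests.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield item


# Usage (each loop iteration after the first waits before running):
# for phrase in throttled(["quantum theory", "relativity"]):
#     run_query(phrase)  # run_query is a placeholder for your own code
```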
I tried to run
$python scholar.py -c 1 --author "albert einstein" --phrase "quantum theory"
The output is: We need BeautifulSoup, sorry...
I have installed BeautifulSoup (pip install beautifulsoup)
Thanks for the solution :-)
BeautifulSoup complains:
/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "lxml")
What is a good way to use code from within Python?
Using the package from cmd is a nuisance for me.
Thanks for the package!
Except for BibTeX output, it is not possible to output author fields. I don't know whether that's deliberate, but it would be really nice to have.
Does anyone else have problems with --citation option not working anymore?
The following search is an example of an abstract that is being incorrectly split into multiple fields such that there are more resulting CSV fields than headers.
$ ./scholar.py -p 'Sensible Scenes: Visual Understanding of Complex Structures through Causal Analysis.' -t --csv-header
title|url|year|num_citations|num_versions|cluster_id|url_pdf|url_citations|url_versions|url_citation|excerpt
Sensible Scenes: Visual Understanding of Complex Structures through Causal Analysis.|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.3770&rep=rep1&type=pdf|1993|31|5|13569529201397445945|None|http://scholar.google.com/scholar?cites=13569529201397445945&as_sdt=2005&sciodt=0,5&hl=en|http://scholar.google.com/scholar?cluster=13569529201397445945&hl=en&as_sdt=0,5|None|Abstract An important result of visual understanding is an explanation of a scene's causal structure: How action| usually motion| is originated, constrained, and prevented, and how this determines what will happen in the immediate future. To be useful for a purposeful ...
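The row above breaks because the excerpt itself contains the '|' delimiter. One way to keep such fields intact is to let Python's csv module quote any field that contains the delimiter. This is a sketch of that technique only. It is not how scholar.py currently builds its CSV output, and the function names write_row and read_row are hypothetical.

```python
import csv
import io


def write_row(fields, delimiter="|"):
    """Serialize one row, quoting any field that contains the delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(fields)
    return buf.getvalue()


def read_row(line, delimiter="|"):
    """Parse one serialized row back into its fields."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


row = ["Sensible Scenes", "1993", "How action| usually motion| is originated"]
line = write_row(row)
print(line)
print(len(read_row(line)))  # field count matches the input row
```

A consumer that parses the output with a CSV reader using the same delimiter then sees exactly as many fields as there are headers.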
Why is there an HTTP 503 error when I run this from a server?
/usr/lib/python2.6/site-packages/bs4/builder/_htmlparser.py:163: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
Hi,
I'm trying to run this script directly from a web server. I can get it to run some basic options like --help,
but I get only empty output when running an actual query. Any suggestions?
Thanks,
Roy
Hi,
I want to report a problem.
I use scholar.py to collect citations of papers. Generally, it works fine, but some queries return no results (they should have results, because I tried them in Google Scholar manually).
Below is an example. The name of the paper is "The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine":
python scholar.py -c 5 --phrase "The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine"
{'count': 5, 'none': None, 'after': None, 'author': None, 'cookie_file': None, 'citation': None, 'some': None, 'title_only': False, 'pub': None, 'allw': None, 'version': False, 'cluster_id': None, 'debug': 0, 'phrase': 'The chemoattractant chemerin suppresses melanoma by recruiting natural killer cell antitumor defenses Chemerin is a natural tumor-suppressive cytokine', 'csv_header': None, 'txt': None, 'csv': None, 'before': None}
(No search results follow)
Does anyone know what happened?
Thanks.
For example, if I already know the ID Google assigned to an article, such as
http://scholar.google.com/scholar?cluster=2738650369888597462
the count option could still be applied. Thank you in advance! ;)
I want to save the data in CSV format. Is it possible to save the output?
Has anyone used this to write a tool to parse and annotate PDF CVs with the number of Scholar citations? That would be very useful! Kind of hard to do, I guess...
Xavi Anguera has suggested making the list of papers citing a paper queryable via the API. This needs a bit more thinking about the notion of paper identity (cluster ID) vs presentation to the user, but shouldn't be a big problem otherwise.
I know that Google Scholar imposes a query limit, but does it have any explicit policy prohibiting automated scraping of Google Scholar results? Applications like Harzing's Publish or Perish openly scrape Google Scholar and have been operating for years.
Hi, nice work on the script! Apart from getting citations in BibTeX, is there a way I could directly get the citation for a given result (e.g. in APA or MLA)?
Currently, scholar.py just returns blank if a captcha is displayed by google. Could a method for displaying the captcha be added so it can be solved?
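A first step toward handling this would be to detect the captcha page instead of silently returning nothing. The sketch below checks the returned HTML for the message quoted in another report in this thread ("Please show you're not a robot"). The function name looks_like_captcha is hypothetical, and the exact wording of Google's page may differ or change.

```python
# Marker text taken from a report in this thread; Google may change it.
CAPTCHA_MARKER = "Please show you're not a robot"


def looks_like_captcha(html):
    """Return True if the response HTML appears to be a captcha page."""
    return CAPTCHA_MARKER.lower() in html.lower()


sample = "<html><body>Please show you're not a robot</body></html>"
if looks_like_captcha(sample):
    print("Captcha page detected; no results can be parsed from it.")
```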
Hello,
I executed the following command line:
python scholar.py scholar.py -c 100 --after=2007 --author "brice morin" --citation bt > ref.bib
My goal was to extract a bibtex file containing all my papers. It however seems that what I get as a result (despite the -c 100
option) is the bibtex entries of all my papers provided on the first page on my Google Scholar page, basically the first 20 papers.
It would be nice if I (and I guess it makes sense for many other users of your tool) would be able to get all entries, not just the first page provided by Google Scholar.
Thank you
python3.3 scholar.py -c 5 -a "albert einstein" -t --none "quantum theory" --after 1970
Traceback (most recent call last):
File "scholar.py", line 1275, in <module>
sys.exit(main())
File "scholar.py", line 1267, in main
txt(querier, with_globals=options.txt_globals)
File "scholar.py", line 1098, in txt
print(encode(art.as_txt()) + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 824: ordinal not in range(128)
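The error above occurs when the output stream can only encode ASCII and the text contains a character such as '\xa9' (the copyright sign). A minimal sketch of one workaround is to replace characters the console cannot encode before printing. The helper name console_safe is hypothetical and is not part of scholar.py. Setting the PYTHONIOENCODING environment variable to utf-8 is another option when the terminal can display UTF-8.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by '?'."""
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, errors="replace").decode(enc)


print(console_safe("Copyright \xa9 1970", encoding="ascii"))
```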
Hello @ckreibich
Did you incorporate all the modifications and features proposed in the posted issues into the original scholar.py?
Also, there is another tool that queries Google Scholar. I wonder how it differs from yours? See this link: https://github.com/hildensia/scholar
Lastly, what rate limit does scholar.py enforce when querying Google Scholar?
Thanks for your support
Can you provide an example Python script that prints the BibTeX citation?
The result is empty now. I've confirmed the results were OK about a week ago. I've also checked on a different network. Google Scholar may have been updated.
I want to get all of the articles, but it gives me only 20 articles as a result.
I changed this line too, but the limitation is still there:
MAX_PAGE_RESULTS = 20 # Current maximum for per-page results