Scopus_crawler
Crawl information of papers (and citing articles) searched byadvancedqueryonScopus (www.scopus.com) via Selenium.
Table of Contents
- Table of Contents
- Introduction
- Installation
- Tutorial
- Troubleshooting
- Release History
- Acknowledgements
Introduction
Crawl information (citation, bibliography, abstract, fund and other information) of papers
& citing articles
searched by advanced query on Scopus (www.scopus.com) via Selenium.
Specially, this program subdivides YEARS into subyear(s), then combines your QUERY and a subyear repeatedly for advanced search, given that we can only manually download at most 2000 results per batch on Scopus.
Workflow of this program:
try:
# Main workflow
except: #(i.e. If failed)
# Try main workflow again
Crawling process shown by the console (when you successfully run this program):
Installation
git clone https://github.com/matrixChimera/Scopus_crawler.git
Tutorial
-
Before executing this program, please manually log in Scopus (you have to be able to manually log in, or rather, this program cannnot help you escape the access requirement on Scopus).
-
Before executing this program, please install packages within your python environment:
- prettytable
- selenium
- webdriver
- Before executing this program, please define/modify parameters in settings.py:
3.1 Define the way you log in:
ACCESS = 'institution' # or 'cookies'
3.2 Define information about logging in:
- If you log in Scopus via your institution (ACCESS = 'institution'), please define your username, password, and institution name:
# ★★★Define your username:
USERNAME = ''
# ★★★Define your password:
PASSWORD = ''
# ★★★Define your institution:
INSTITUTION = ''
- If you log in Scopus via cookies (ACCESS = 'cookies'), please define Chrome cookies of the URL with Scopus' access:
chrome_cookies = 'scopusSessionUUID=9461fa86-9c06-4f0c-b;screenInfo="640:1024";SCSessionID=9D8BCB4DD7A64A57C24BFAE9B43FF959.wsnAw8kcdt7IPYLO0V48gA;# ... #;xmlHttpRequest=true'
(You can extract cookies via EditThisCookie (an extension of Chrome) after you manually logged in Scopus.)
3.3 Define the permissible query (without PUBYEAR) for advanced search:
QUERY = 'TITLE-ABS-KEY(neurofibroma) AND LANGUAGE(english) AND DOCTYPE(ar)'
3.4 Define start_year & end_year for advanced search:
start_year = 2018
end_year = 2019
3.5 If necessary (especially when your network is slowed down by the Great Firewall in Mainland China), modify time limits according to your aims and the performance of this program (after you tried this program):
# Define the longest time (second) of implicitly wait of Selenium' execution:
WAIT_TIME = 30
# Define the time limit to downloading:
DOWNLOAD_TIMEOUT = 90
# Define the sleep time for waiting for the rendering (of HTML/JavaScript):
# (If necessary, please prolong the sleep time,
# especially when your network is slowed down by the Great Firewall in Mainland China)
SLEEPTIME_LONG = 10 # Generally for waiting for redirecting/loading of the search page of Scopus
SLEEPTIME_MEDIUM = 5 # Generally for waiting for interacting with elements shown via rendering of JavaScript
SLEEPTIME_SHORT = 2 # Generally for waiting for interacting with elements shown via rendering of HTML
# Times limit to trying (to download) again:
TRY_AGAIN_TIMES = 5
-
Run advanced_query_articles.py to crawl information of papers. Output: one folder, named articles, including .ris files will appear.
-
Run advanced_query_citingArticles.py to crawl information of citing articles. Output: one folder, named citingArticles, including .ris files will appear.
-
If neccesary (after you crawled .ris files), you can run merge_ris.py to merge all .ris files in the articles/citingArticles folder into one .ris file.
Troubleshooting
-
You may see the following tips when you failed to log in (perhaps due to wrong information of logging in, low speed of your network, or low time limits):
-
You may see the following tips when you triggered the TryAgain (perhaps due to low speed of your network, or low time limits):
Release History
- 1.0
- 2020/04/24
- Create: settings.py, advanced_query_articles.py, advanced_query_citingArticles.py, and merge_ris.py
Acknowledgements
- Thanks for inspiration for the crawler from @tomleung1996.
- Thanks for a tool from GitHub Wiki TOC generator.