
Scopus_crawler

Crawl information of papers (and citing articles) returned by an advanced query on Scopus (www.scopus.com) via Selenium.

Table of Contents

Introduction

Crawl information (citation, bibliography, abstract, funding, and other information) of papers and citing articles returned by an advanced query on Scopus (www.scopus.com) via Selenium.

Specifically, this program subdivides the year range into subyears, then repeatedly combines your QUERY with each subyear for advanced search, because Scopus only allows manually downloading at most 2000 results per batch.

Workflow of this program:

try:
    # Main workflow
except:  # i.e. if it failed
    # Try the main workflow again

Main_workflow.png
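The try/except loop above can be sketched as a simple retry wrapper (a minimal illustration; `run_with_retries` and its `workflow` argument are hypothetical names, not functions from this repo):

```python
def run_with_retries(workflow, max_retries=5):
    """Run `workflow` and, if it fails, try it again — mirroring the loop above."""
    for attempt in range(1, max_retries + 1):
        try:
            return workflow()  # Main workflow
        except Exception as exc:  # i.e. if failed, try the main workflow again
            print(f'Attempt {attempt} failed: {exc}')
    raise RuntimeError(f'Workflow still failing after {max_retries} attempts')
```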

Crawling process shown in the console (when you successfully run this program): articles_citingArticles.png

Installation

  1. Clone via GitHub Desktop or Download .zip CloneOrDownload.png

  2. Clone from git

git clone https://github.com/matrixChimera/Scopus_crawler.git

Tutorial

  1. Before executing this program, please manually log in to Scopus (you must be able to log in manually; this program cannot bypass Scopus's access requirement).

  2. Before executing this program, please install the required packages in your Python environment:
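The package list itself is missing here; at minimum the crawler needs Selenium (plus a ChromeDriver matching your Chrome version on your PATH), since that is what drives the browser:

```shell
pip install selenium
```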

  3. Before executing this program, please define/modify the parameters in settings.py:

3.1 Define the way you log in:

ACCESS = 'institution'  # or 'cookies'

3.2 Define information about logging in:

  • If you log in to Scopus via your institution (ACCESS = 'institution'), please define your username, password, and institution name:
# ★★★Define your username:
USERNAME = ''
# ★★★Define your password:
PASSWORD = ''
# ★★★Define your institution:
INSTITUTION = ''
  • If you log in to Scopus via cookies (ACCESS = 'cookies'), please define the Chrome cookies of a URL with Scopus access:
chrome_cookies = 'scopusSessionUUID=9461fa86-9c06-4f0c-b;screenInfo="640:1024";SCSessionID=9D8BCB4DD7A64A57C24BFAE9B43FF959.wsnAw8kcdt7IPYLO0V48gA;# ... #;xmlHttpRequest=true'

(You can extract the cookies via EditThisCookie (a Chrome extension) after you have manually logged in to Scopus.)
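If you take the cookie route, the raw `name=value;name=value` Chrome cookie string has to be split into name/value pairs before Selenium can use it. A minimal sketch, assuming the Selenium `driver.add_cookie()` API (`parse_cookie_string` is an illustrative name, not a function from this repo):

```python
def parse_cookie_string(chrome_cookies):
    """Split a 'name=value;name=value' Chrome cookie string into dicts
    in the shape accepted by Selenium's driver.add_cookie()."""
    cookies = []
    for pair in chrome_cookies.split(';'):
        # partition on the first '=' so values containing '=' stay intact
        name, _, value = pair.strip().partition('=')
        if name:
            cookies.append({'name': name, 'value': value})
    return cookies

# After driver.get('https://www.scopus.com'), each cookie would be added with:
# for cookie in parse_cookie_string(chrome_cookies):
#     driver.add_cookie(cookie)
```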

3.3 Define a valid query (without PUBYEAR) for advanced search:

QUERY = 'TITLE-ABS-KEY(neurofibroma) AND LANGUAGE(english) AND DOCTYPE(ar)'

3.4 Define start_year & end_year for advanced search:

start_year = 2018
end_year = 2019

3.5 If necessary (especially when your network is slowed down by the Great Firewall in Mainland China), modify the time limits according to your aims and the program's observed performance (after you have tried it):

# Define the longest time (seconds) of Selenium's implicit wait:
WAIT_TIME = 30
# Define the time limit (seconds) for downloading:
DOWNLOAD_TIMEOUT = 90
# Define the sleep times (seconds) for waiting for rendering (of HTML/JavaScript):
# (If necessary, please prolong these sleep times,
# especially when your network is slowed down by the Great Firewall in Mainland China)
SLEEPTIME_LONG = 10  # Generally for waiting for redirecting/loading of the Scopus search page
SLEEPTIME_MEDIUM = 5  # Generally for interacting with elements rendered via JavaScript
SLEEPTIME_SHORT = 2  # Generally for interacting with elements rendered via HTML
# Maximum number of times to try (downloading) again:
TRY_AGAIN_TIMES = 5
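DOWNLOAD_TIMEOUT is the kind of limit typically enforced by polling the download folder until the exported .ris file appears. A hedged sketch of that pattern (`wait_for_download` is illustrative, not a function from this repo):

```python
import os
import time

DOWNLOAD_TIMEOUT = 90  # seconds

def wait_for_download(path, timeout=DOWNLOAD_TIMEOUT, poll=0.5):
    """Poll until `path` exists and is non-empty, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True  # download finished
        time.sleep(poll)
    return False  # timed out — the caller would trigger a TryAgain
```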
  4. Run advanced_query_articles.py to crawl information of papers. Output: a folder named articles, containing .ris files.

  5. Run advanced_query_citingArticles.py to crawl information of citing articles. Output: a folder named citingArticles, containing .ris files.

  6. If necessary (after you have crawled the .ris files), you can run merge_ris.py to merge all .ris files in the articles/citingArticles folder into one .ris file.
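Merging .ris files amounts to concatenating their records. A minimal sketch of what merge_ris.py presumably does (the repo's actual implementation may differ; `merge_ris` here is an illustrative name):

```python
import glob
import os

def merge_ris(folder, output):
    """Concatenate every .ris file in `folder` into a single `output` file.
    Returns the number of files merged."""
    paths = sorted(glob.glob(os.path.join(folder, '*.ris')))
    with open(output, 'w', encoding='utf-8') as out:
        for path in paths:
            with open(path, encoding='utf-8') as f:
                # ensure each file's records end with a newline before the next
                out.write(f.read().rstrip('\n') + '\n')
    return len(paths)
```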

Troubleshooting

  1. You may see the following tips when you fail to log in (perhaps due to incorrect login information, a slow network, or time limits set too low): FailedToLogIn.png

  2. You may see the following tips when TryAgain is triggered (perhaps due to a slow network or time limits set too low): TryAgain_console.png

Release History

  • 1.0
    • 2020/04/24
    • Create: settings.py, advanced_query_articles.py, advanced_query_citingArticles.py, and merge_ris.py

Acknowledgements
