sort-google-scholar's Introduction

Sort Google Scholar by the Number of Citations

sortgs is a Python tool for ranking Google Scholar publications by the number of citations. It is useful for finding relevant papers in a specific field. The data acquired from Google Scholar includes Title, Citations, Links, Rank, and a new column with the number of citations per year. In the background, it first tries to fetch results using Python requests. If that fails, it falls back to Selenium to fetch the results.
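
The citations-per-year column can be illustrated with a small sketch. The exact formula sortgs uses is not stated here, so the function below is an assumption: it divides the citation count by the number of years since publication, counting the publication year itself. The name citations_per_year and the sample papers are hypothetical.

```python
from datetime import date
from typing import Optional


def citations_per_year(citations: int, pub_year: int,
                       current_year: Optional[int] = None) -> float:
    """Average citations per year since publication (assumed formula)."""
    if current_year is None:
        current_year = date.today().year
    # Assumption: the publication year counts as one full year, so a paper
    # published this year is divided by 1 rather than by 0.
    years = max(current_year - pub_year + 1, 1)
    return citations / years


# Hypothetical sample data, not real Google Scholar results.
papers = [
    {"title": "Paper A", "citations": 900, "year": 2010},
    {"title": "Paper B", "citations": 300, "year": 2021},
]

# Rank by citations per year instead of raw citation count.
ranked = sorted(
    papers,
    key=lambda p: citations_per_year(p["citations"], p["year"], current_year=2023),
    reverse=True,
)
```

Under this assumed formula, the newer paper with fewer total citations ranks first, which is the kind of reordering the "cit/year" sort is meant to surface.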

Try on Google Colab:

  • No install requirements! Limitations: Can't handle robot checking, so use it carefully.

Installation

You can now install sortgs directly using pip:

pip install sortgs

This will install the latest version of sortgs and its dependencies.

Usage

Once installed, you can run sortgs directly from the command line:

sortgs "your keyword"

Replace "your keyword" with any keyword you'd like to search for. A CSV file with the name your_keyword.csv will be created in your current directory.
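
Once the CSV exists, it can be inspected with the standard library. The sketch below assumes that spaces in the keyword become underscores in the file name (as in "your keyword" giving your_keyword.csv) and that the file has Title and Citations columns; the helper names are hypothetical.

```python
import csv
from pathlib import Path


def expected_csv_name(keyword: str) -> str:
    # Assumption: spaces become underscores, as in "your keyword" -> your_keyword.csv
    return keyword.replace(" ", "_") + ".csv"


def top_cited(csv_path: Path, n: int = 5) -> list:
    """Return the n rows with the highest value in the Citations column."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Assumption: the Citations column holds plain integers.
    rows.sort(key=lambda row: int(row["Citations"]), reverse=True)
    return rows[:n]
```

For example, top_cited(Path(expected_csv_name("machine learning")), 3) would return the three most-cited rows, provided the file was written to the current directory.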

Misc

For feedback, send me an email: fernando [dot] wittmann [at] gmail [dot] com

Command Line Arguments

usage: sortgs [-h] [--sortby SORTBY] [--nresults NRESULTS] [--csvpath CSVPATH]
              [--notsavecsv] [--plotresults] [--startyear STARTYEAR]
              [--endyear ENDYEAR] [--debug] kw

positional arguments:
  kw                    Keyword to be searched. Use double quote followed by
                        simple quote for an exact keyword. 
                        Example: sortgs "'exact keyword'"

optional arguments:
  -h, --help            show this help message and exit
  --sortby SORTBY       Column to be sorted by. Default is "Citations". To sort
                        by citations per year, use --sortby "cit/year"
  --nresults NRESULTS   Number of articles to search on Google Scholar. Default
                        is 100. (careful with robot checking if value is high)
  --csvpath CSVPATH     Path to save the exported csv file. Default is the 
                        current folder
  --notsavecsv          By default, results are exported to a csv file. Select
                        this option to just print results but not store them
  --plotresults         Use this flag to plot results with the original rank on
                        the x-axis and the number of citations on the y-axis.
                        Default is False
  --startyear STARTYEAR
                        Start year when searching. Default is None
  --endyear ENDYEAR     End year when searching. Default is current year
  --debug               Debug mode. Used for unit testing. It will get pages
                        stored on web archive
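
The help text above maps naturally onto an argparse parser. The following is a sketch reconstructed from that help output, not the project's actual source: the argument types (for example, int for the years) and the "." default for --csvpath are assumptions.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented sortgs options (types are assumed)."""
    parser = argparse.ArgumentParser(prog="sortgs")
    parser.add_argument("kw", type=str, help="Keyword to be searched")
    parser.add_argument("--sortby", type=str, default="Citations",
                        help='Column to sort by, e.g. "cit/year"')
    parser.add_argument("--nresults", type=int, default=100,
                        help="Number of articles to search")
    parser.add_argument("--csvpath", type=str, default=".",
                        help="Where to save the exported csv file")
    parser.add_argument("--notsavecsv", action="store_true",
                        help="Print results without storing them")
    parser.add_argument("--plotresults", action="store_true",
                        help="Plot original rank against citations")
    parser.add_argument("--startyear", type=int, default=None)
    parser.add_argument("--endyear", type=int, default=date.today().year)
    parser.add_argument("--debug", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["machine learning", "--sortby", "cit/year", "--startyear", "2005"]
)
```

Flags declared with action="store_true" default to False, which matches the documented "Default is False" for --plotresults.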

Examples

  1. Default Search:

    sortgs "machine learning"

    This command searches for the top 100 results related to "machine learning" and saves them as a CSV file.

  2. Sort by Citations per Year:

    sortgs "machine learning" --sortby "cit/year"

    Search for "machine learning" and sort by the number of citations per year.

  3. Specify Date Range:

    sortgs "machine learning" --startyear 2005 --endyear 2015

    Search for papers from 2005 to 2015.

  4. Search for an Exact Keyword:

    sortgs "'machine learning'"

  5. Save Results in a Specific Path:

    sortgs 'neural networks' --csvpath './examples/'

    This will save the results under a subfolder called 'examples'.

  6. Multiple Keywords:

    sortgs '"deep learning" OR "neural networks" OR "machine learning"' --sortby "cit/year"
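
When the list of phrases is generated programmatically, the nested quoting in the multiple-keyword example is easy to get wrong. The sketch below builds the OR query and lets shlex handle the shell quoting; the helper name or_query is hypothetical, and shlex.join requires Python 3.8 or later.

```python
import shlex


def or_query(phrases):
    """Join phrases into one query, each wrapped in double quotes."""
    return " OR ".join(f'"{phrase}"' for phrase in phrases)


query = or_query(["deep learning", "neural networks", "machine learning"])

# Argument list suitable for subprocess.run(command) once sortgs is installed.
command = ["sortgs", query, "--sortby", "cit/year"]

# Shell-safe rendering of the same command, for copy and paste.
print(shlex.join(command))
```

Passing the list form to subprocess avoids shell quoting altogether, since each element reaches sortgs as a single argument.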

Output Example

While running, sortgs will provide updates in the terminal:

❯ sortgs "'machine learning'"
Running with the following parameters:
Keyword: 'machine learning', Number of results: 100, Save database: True, Path: /Users/wittmann/sort-google-scholar, Sort by: Citations, Plot results: False, Start year: None, End year: 2023, Debug: False
Loading next 10 results
Loading next 20 results
...

Step-by-Step Installation

  1. Install Python 3 and its dependencies from Requirements (suggestion: use Anaconda https://www.anaconda.com/distribution/)
  2. In the terminal (or cmd if using Windows), run pip install sortgs
  3. Use the command sortgs "your keyword" (replace "your keyword" with any keyword that you'd like to search)
  4. A CSV file with the name your_keyword.csv should be created.

If those steps are too complicated for you, send me an email with a list of keywords that you'd like ranked: fernando [dot] wittmann [at] gmail [dot] com

Running Project Using Docker

This guide will walk you through the process of installing Docker, pulling the fernandowittmann/sort-google-scholar Docker image, and running the project.

Step 1: Install Docker

Windows or Mac

  1. Download Docker Desktop: Go to the Docker Desktop website and download the appropriate installer for your operating system.
  2. Install Docker Desktop: Run the installer and follow the on-screen instructions.
  3. Verify Installation: Open a terminal (or command prompt on Windows) and run docker --version to verify that Docker has been installed successfully.

Linux

  1. Update Package Index: Run sudo apt-get update to update your package index.
  2. Install Docker: Run sudo apt-get install docker-ce docker-ce-cli containerd.io to install Docker.
  3. Start Docker: Run sudo systemctl start docker to start the Docker daemon.
  4. Verify Installation: Run docker --version to ensure Docker is installed correctly.

Step 2: Pull the Docker Image

  1. Pull Image: Run the following command to pull the fernandowittmann/sort-google-scholar image from Docker Hub:

    docker pull fernandowittmann/sort-google-scholar

Step 3: Run the Project

  1. Create a Results Directory: Create a directory on your host machine where you want the results to be saved. For example, mkdir ~/results.

  2. Run the Docker Container: Use the following command to run the container. This command mounts your results directory to the /results directory in the container and starts the sorting process for Google Scholar results based on your specified parameters.

    docker run -v "$PWD/results:/results" -it fernandowittmann/sort-google-scholar ./sortgs.py --kw "machine learning" --sortby "cit/year" --csvpath /results

    Replace $PWD/results with the absolute path to your results directory if you are not in the parent directory of results.

Contributing

Just run:

$ python -m unittest

and check that all tests pass. Alternatively, send a PR; GitHub Actions will run the tests for you.

LICENSE

  • MIT

sort-google-scholar's People

Contributors

anjukan, gardner, hadisfr, j-planet, mahmoudelsayad, mielverkerken, syoukera, wittmannf

sort-google-scholar's Issues

Does not work on colab

Hello,

I'd love to incorporate this in a project I'm working on.

On colab you need to pip install selenium to make it work, and then chromedriver does not succeed, even after grabbing the binary and adding to $PATH in Colab. Would be great if this worked on colab as is, or could work out of the box on CMD line. Could be an amazing tool for the academic field.

I understand if you're no longer working on this though.

Thanks.

What's wrong?

Sometimes I get errors like:

File "C:\sort-google-scholar\sortgs.py", line 283, in <module>
main()
File "C:\sort-google-scholar\sortgs.py", line 268, in main
print(data_ranked)
File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\encodings\cp1253.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 2451: character maps to <undefined>

or

File "C:\sort-google-scholar\sortgs.py", line 283, in <module>
main()
File "C:\sort-google-scholar\sortgs.py", line 268, in main
print(data_ranked)
File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\encodings\cp1253.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0131' in position 204: character maps to <undefined>
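
Both tracebacks come from printing text that the Windows console code page (cp1253 here) cannot represent. A possible workaround, sketched below and not confirmed as the project's fix, is either to switch the output stream to UTF-8 or to replace the unencodable characters before printing. The function names are hypothetical.

```python
import sys


def printable(text: str, encoding: str) -> str:
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def use_utf8_stdout() -> None:
    """Switch stdout to UTF-8 where the stream supports it (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")


# The two characters from the tracebacks above, forced through cp1253.
print(printable("M\u00e4kel\u00e4", "cp1253"))
print(printable("\u0131", "cp1253"))
```

Setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another option that needs no code change, although a console that cannot render the characters may still display them incorrectly.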

Add requirements.txt

Hi,

Could you please add requirements.txt for more convenient setup?
So, anyone can easily use just 1 command to finish it:
pip3 install -r requirements.txt

Problem run in Colab

Loading next 10 results
Robot checking detected, handling with selenium (if installed)
Loading...
sortgs.py:137: DeprecationWarning: use options instead of chrome_options
driver = webdriver.Chrome(chrome_options=chrome_options)
No success. The following error was raised:
Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/chromium-browser is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

No selenium module

I tested sortgs on Colab and succeeded in running pip install sortgs.
When I run sortgs "deep learning" --sortby "cit/year", I get:

Robot checking detected, handling with selenium (if installed)
No module named 'selenium'
Please install Selenium and chrome webdriver for manual checking of captchas
Loading...
No success. The following error was raised:
local variable 'Options' referenced before assignment

Thank you for maintaining.

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

input:
python sortgs.py --kw "recommender system survey" OR "recommentation system survey" --startyear 2021

output:
Loading next 10 results
Loading next 20 results
Loading next 30 results
Loading next 40 results
Loading next 50 results
Loading next 60 results
Loading next 70 results
Loading next 80 results
Loading next 90 results
Loading next 100 results
Traceback (most recent call last):
File "C:\Users\ktash\Downloads\sort-google-scholar-master\sort-google-scholar-master\sortgs.py", line 313, in <module>
main()
File "C:\Users\ktash\Downloads\sort-google-scholar-master\sort-google-scholar-master\sortgs.py", line 285, in main
data['cit/year']=data['cit/year'].round(0).astype(int)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1257, in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1168, in astype_nansafe
return astype_float_to_int_nansafe(arr, dtype, copy)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 1213, in astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

My OS is windows. I run code by anaconda.
Thank you.
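
The error is raised because astype(int) cannot represent NaN or infinity. Why those values appear in this particular run is not established here; the sketch below only shows one defensive way to cast such a column, using made-up values and assuming that 0 is an acceptable stand-in for missing entries.

```python
import numpy as np
import pandas as pd

# Made-up column containing the two kinds of values that break astype(int).
data = pd.DataFrame({"cit/year": [12.4, np.nan, np.inf]})

# Map infinities to NaN, fill NaN with 0 (an assumption), then round and cast.
data["cit/year"] = (
    data["cit/year"]
    .replace([np.inf, -np.inf], np.nan)
    .fillna(0)
    .round(0)
    .astype(int)
)

print(data)
```

If 0 would be misleading for missing values, pandas' nullable integer dtype ("Int64") is an alternative that keeps them as missing once infinities have been replaced.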

Maximum results returned

For every search I make, it never returns more than 111 results. Is this a bug? I'm using Colab.

Missing a parenthesis in the jupyter notebook

In the file Example-Python27.ipynb, this line

            links.append('Look manually at: https://scholar.google.com/scholar?start='+str(n)+'&q'+keyword.replace(' ','+')

A closing ) is missing at the end.

Can't run the code on Colab

Not sure what happened, but I get an error when running on Colab:

No success. The following error was raised:
local variable 'Options' referenced before assignment
Loading next 20 results

I don't have enough experience with this, but it could be just that

    chrome_options = Options()

should be inside the try above:

    try:
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options
        from selenium.common.exceptions import StaleElementReferenceException
    except Exception as e:
        print(e)
        print("Please install Selenium and chrome webdriver for manual checking of captchas")
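
The "referenced before assignment" message is what Python reports when a name bound inside a failed try block is used later. One way to avoid it, sketched here as an illustration and not as the project's actual fix, is to record whether the optional import succeeded and check that before use. The helper names are hypothetical.

```python
import importlib


def optional_import(name: str):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        print(f"{name} is not available: {exc}")
        return None


# None when Selenium is missing, instead of leaving Options undefined.
selenium_options = optional_import("selenium.webdriver.chrome.options")


def make_chrome_options():
    """Build Chrome options, failing with a clear message if Selenium is absent."""
    if selenium_options is None:
        raise RuntimeError(
            "Install Selenium and a Chrome webdriver to handle captchas"
        )
    return selenium_options.Options()
```

With this pattern the caller sees a single explicit error about the missing dependency instead of an unrelated-looking variable error.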

Cannot run advanced searches using source feature

I wish to get all articles from Neural Information Processing Systems 31 (NIPS 2018) sorted by citations.

Google Scholar has an advanced search feature with Return articles published in field, where typing Advances in Neural Information Processing modifies the search query to:
source:Advances source:in source:Neural source:Information source:Processing
I also modified the dates to custom range 2018-2018 and I can correctly see 1020 results returned (there are supposed to be 1011, close enough). These seem to be already somewhat sorted by citations in Google Scholar, but I was still hoping to use your code to have these articles sorted exactly and saved in a csv. Yet, I can only get about 110 results when executing:

python3 sortgs.py --kw "source:Advances source:in source:Neural source:Information source:Processing" --sortby "cit/year" --startyear 2018 --endyear 2018

Any idea if I can achieve this? Thanks!

Sorting a paper's citations according to their own citation/year ratio

Hi,
I'm going over a Google Scholar citation list for a specific paper. When there are many citations, it's difficult to find the most relevant ones.
It would be useful to sort this citation list according to their own citation/year ratio.

Can the Sort-Google-Scholar tool perform this operation?

thanks

A couple of issues with Python 3.X?

Hi,

I received an error and changed lines 97-99 to:

    # Create a dataset and sort by the number of citations
    data = pd.DataFrame(list(zip(author, title, citations, year, links)), index = rank[1:],
                        columns=['Author', 'Title', 'Citations', 'Year', 'Source'])

by following the advice from https://stackoverflow.com/questions/26121009/python-3-zip-is-an-iterator-in-a-pandas-dataframe

There were then further errors:

Python sortby_citations_google_scholar.py
Traceback (most recent call last):
File "sortby_citations_google_scholar.py", line 102, in <module>
...
...
AttributeError: 'DataFrame' object has no attribute 'sort'

I commented out the lines related to the variable data_ranked, and it then worked and output a csv.

Thanks,
Sam
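
The AttributeError arises because DataFrame.sort was removed from pandas in favor of sort_values. Rather than commenting out the ranking step, the sort can be rewritten as sketched below. The sample rows are made up, and the column used for ranking is assumed to be Citations.

```python
import pandas as pd

# Made-up sample data standing in for the scraped lists.
author = ["A. Author", "B. Author"]
title = ["First paper", "Second paper"]
citations = [10, 250]
year = [2015, 2012]
links = ["https://example.org/1", "https://example.org/2"]
rank = [0, 1, 2]  # leading placeholder mirrors the rank[1:] slice in the issue

# In Python 3, zip returns an iterator, so wrap it in list() as in the issue.
data = pd.DataFrame(
    list(zip(author, title, citations, year, links)),
    index=rank[1:],
    columns=["Author", "Title", "Citations", "Year", "Source"],
)

# sort_values replaces the removed DataFrame.sort method.
data_ranked = data.sort_values(by="Citations", ascending=False)
print(data_ranked)
```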

Install incomplete and UnicodeEncodeError

Thank you for sharing this project. I encountered two issues:

  1. The installation of sortgs in a freshly created anaconda environment was incomplete: requests, bs4, pandas... were missing, so I installed them manually using the provided requirements.
  2. UnicodeEncodeError occurs when trying to print pandas df:
    sortgs "impedance telemetry cochlear implant" returns
    File "C:\Users\whoami\.conda\envs\literature\Lib\site-packages\sortgs.py", line 305, in main
    print(data_ranked)
    File "C:\Users\whoami\.conda\envs\literature\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    UnicodeEncodeError: 'charmap' codec can't encode character '\u0218' in position 830: character maps to <undefined>

Conda 24.3.0
Python 3.12.3
sortgs==1.0.3
pandas==2.2.2

No module named 'selenium'. Please install Selenium and chrome webdriver for manual checking of captchas

Seems that Google scholar has configured some captcha to avoid these requests... Any ideas on how to bypass this?

$ python3 sortgs.py --kw "microcredentials" --csvpath /Users/tmp
Loading next 10 results
Robot checking detected, handling with selenium (if installed)
No module named 'selenium'
Please install Selenium and chrome webdriver for manual checking of captchas
Loading...
No success. The following error was raised:
local variable 'Options' referenced before assignment
Loading next 20 results
Robot checking detected, handling with selenium (if installed)
No module named 'selenium'
Please install Selenium and chrome webdriver for manual checking of captchas
Loading...
...

Sorting by "cit/year" doesn't work

The script says

Column name to be sorted not found. Sorting by the number of citations...
'cit/year'

Command: python sortgs.py --kw '...' --sortby "cit/year"
And the CSV output is indeed sorted by number of citations.


Python 3.8.2 (Arch Linux), fresh clone of the repo

Searching for keyword in abstract

Dear Fernando,
Thank you for having developed this tool. I have run the code in Colab smoothly, but there is something I believe would make your tool much stronger.
Would it be possible to search for the given keywords also in the abstracts?
Thank you in advance,
Irene

exclude words

I am trying to add an option to exclude words from the search. The key used in Google Scholar is &as_eq=, so I changed GSCHOLAR_URL to this value:

GSCHOLAR_URL = 'https://scholar.google.com/scholar?start={}&q={}&as_eq={}&hl=en&as_sdt=0,5'

Then, where the url address is defined inside main, I changed the line to this one:

url = GSCHOLAR_MAIN_URL.format(str(n), keyword.replace(' ','+'), exclude.replace(' ','+'))

where exclude is passed from argparse in a similar way as the other input parameters. If I print the url at that point, I get this string:

https://scholar.google.com/scholar?start=90&q=bankruptcy+prediction+neural+networks&as_eq=review&hl=en&as_sdt=0,5&as_ylo=2005

but the search is not excluding the word review in this case. Do you know what I am doing wrong?
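
For comparison, the same URL can be assembled with urllib.parse.urlencode, which takes care of escaping and makes each parameter explicit. This is only a sketch of the URL construction: the parameter names (start, q, as_eq, hl, as_sdt, as_ylo) are taken from the strings above, the function name is hypothetical, and whether Google Scholar honors as_eq in this combination is not verified here.

```python
from urllib.parse import urlencode


def scholar_url(start, keyword, exclude="", start_year=None):
    """Build a Google Scholar search URL from explicit parameters."""
    params = {"start": start, "q": keyword, "hl": "en", "as_sdt": "0,5"}
    if exclude:
        # Words to exclude, as in the &as_eq= key described above.
        params["as_eq"] = exclude
    if start_year is not None:
        params["as_ylo"] = start_year
    # urlencode turns spaces into '+' and escapes reserved characters.
    return "https://scholar.google.com/scholar?" + urlencode(params)


url = scholar_url(
    90, "bankruptcy prediction neural networks", exclude="review", start_year=2005
)
print(url)
```

Printing the URL and pasting it into a browser is a quick way to check whether the exclusion behaves the same outside the script.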

Sort by date

Thanks for the tool. Would it be possible to add more sorting mechanisms? I am particularly thinking of sorting by date, which is extremely valuable for tracking publications. Moreover, can the start year and end year be made more specific, e.g. 12-04-2021 to 18-04-2021, for the same reason of running a scraper to pick up the latest content? Thanks again for the nice tool.
