MLcites

Repository Overview

This repository contains:

  1. Citation and author affiliation data for papers published in NeurIPS (2014 to 2018) and ICML (2017, 2018), collected on March 31st, 2019.
  2. The script used to collect this data, which can be rerun or repurposed to collect data from other publication venues and years.

Blogpost and Jupyter Notebook

For a detailed overview of the data and statistics on top cited papers, citation distributions, topic trends and academia/industry splits, check out the blog post!

The repository also contains a Jupyter notebook with the code for the simple analyses outlined in the blog post.

Some Example Results: Accepted Papers, Top Cited Papers, Topic Trends

Below are some of the simple results we can compute, taken from the blog post and Jupyter notebook.

Accepted Papers and Finding Citation Information

Collecting the data is a somewhat involved and noisy process (described in detail below), and we can't find citation information for every paper. But we don't do too badly. Below is a plot of the total number of accepted papers and the number we find citation information for:

[Figure: total number of accepted papers and number of papers with citation information found]

There are roughly 50 papers in each conference for which we don't find citation information.

Top Cited Papers

We can also see what the top cited papers in our collected data are. Below is a picture of the top cited papers in our dataset from NeurIPS 2014, 2015 and 2016:

[Figure: top cited papers in the dataset from NeurIPS 2014, 2015 and 2016]

We can see the original GAN paper, seq2seq, Faster R-CNN for object detection and several others.

Topic Trends

We can use the title information to study trends in different topics through the years. GANs are a nice case study, because as we saw above, the original paper was published in NeurIPS 2014, and since then the topic has been of huge interest to the community.

[Figure: total number of GAN papers and fraction of GAN papers per year in NeurIPS]

Above are plots showing the total number of GAN papers and the fraction of GAN papers in each year of NeurIPS. In 2014 there is only one paper, the original GAN paper, but by 2017 there are around 25 GAN papers, comprising 3.5% of the conference!
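As a rough illustration of how such a title-based trend can be computed, here is a minimal sketch. The keyword pattern, the `topic_trend` helper and the example titles are all assumptions made for illustration; the notebook may match titles differently.

```python
import re
from collections import defaultdict

# Hypothetical keyword pattern; the exact matching rule used in the notebook may differ.
GAN_PATTERN = re.compile(r"\b(gans?|generative adversarial)\b", re.IGNORECASE)


def topic_trend(papers, pattern=GAN_PATTERN):
    """Return {year: (matching_count, fraction_of_year)} for (year, title) pairs."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for year, title in papers:
        totals[year] += 1
        if pattern.search(title):
            matches[year] += 1
    return {
        year: (matches[year], matches[year] / totals[year])
        for year in sorted(totals)
    }


# Made-up titles, used only to exercise the function.
example_papers = [
    (2014, "A Made-Up Paper on Generative Adversarial Training"),
    (2014, "A Made-Up Paper on Kernel Methods"),
    (2017, "Made-Up Stable GANs"),
    (2017, "A Made-Up GAN for Images"),
    (2017, "A Made-Up Paper on Organ Segmentation"),
]

trend = topic_trend(example_papers)
for year, (count, fraction) in trend.items():
    print(year, count, round(fraction, 3))
```

The word boundaries in the pattern keep titles such as "Organ Segmentation" from being counted as GAN papers.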

Further Results

Additional results on citation distributions and academia/industry breakdowns can be found in the blogpost and jupyter notebook.

Details on the Data and Scraping Method

Here we describe the data, the scraping method and the code.

Data

The data was collected on March 31st, 2019 (by running get_statistics_allyears.py) and is stored in files of the form results_<conference name>_<year>.p, which are pickled pandas dataframes.
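A minimal sketch of loading one of these files and exporting it to CSV is given below. The helper names (`results_path`, `load_results`, `export_csv`) are hypothetical, and the exact spelling of the conference name inside the filename (for example `NeurIPS`) is an assumption that should be checked against the files in the repository.

```python
from pathlib import Path


def results_path(conference, year, data_dir="."):
    """Build the path results_<conference name>_<year>.p inside data_dir."""
    return Path(data_dir) / f"results_{conference}_{year}.p"


def load_results(conference, year, data_dir="."):
    """Load one pickled DataFrame (pandas is imported here, where it is needed)."""
    import pandas as pd

    return pd.read_pickle(results_path(conference, year, data_dir))


def export_csv(conference, year, data_dir="."):
    """Write the same table next to the pickle as a .csv file and return its path."""
    df = load_results(conference, year, data_dir)
    out = results_path(conference, year, data_dir).with_suffix(".csv")
    df.to_csv(out, index=False)
    return out


print(results_path("NeurIPS", 2016))
```

Pickles written by an older pandas release may not load cleanly under a much newer one, so exporting to CSV once can make the data easier to share. As with any pickle, only load files from a source you trust.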

Scraping the Data

For each paper, the citation information and author affiliation are scraped from Google Scholar. Scraping paper information directly is somewhat tricky (it involves parsing JavaScript), but author profiles are much easier to search. The script therefore looks up the authors of a paper on Google Scholar one at a time and goes through each author's publication list to see whether it contains a publication matching the title of the conference paper. Note that the number of publications to search has to be specified; currently the script looks at the 200 most recent papers.

If a match is found, the script updates the paper with the citation count and the found author's affiliation. To save time, the script does not look up all the authors of a paper, only the minimum number needed to find a citation. Note that the affiliation of the author is their current affiliation, not necessarily the affiliation they had when they wrote the paper. We study this in detail when analyzing academia/industry breakdowns in the blogpost.
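The lookup loop described above can be sketched roughly as follows. This is not the code from get_statistics_allyears.py: `find_citation`, `normalize_title` and the `fetch_profile` callable are hypothetical names, the profile layout is an assumption, and exact matching on normalized titles is a simplification of whatever matching the script performs.

```python
import re

MAX_PUBLICATIONS = 200  # number of publications per author to check, as stated above


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_citation(paper_title, authors, fetch_profile, max_publications=MAX_PUBLICATIONS):
    """Try authors in order and stop at the first profile that lists the paper.

    fetch_profile is a hypothetical callable mapping an author name to a dict with
    'affiliation' and 'publications' (a list of (title, citation_count) pairs),
    or to None when no profile is found.
    """
    target = normalize_title(paper_title)
    for name in authors:
        profile = fetch_profile(name)
        if profile is None:
            continue
        for title, cited_by in profile["publications"][:max_publications]:
            if normalize_title(title) == target:
                return {
                    "citations": cited_by,
                    "affiliation": profile["affiliation"],
                    "matched_author": name,
                }
    return None  # no author profile listed the paper


# Stand-in for Google Scholar, with made-up data, so the sketch runs offline.
fake_profiles = {
    "B. Author": {
        "affiliation": "Example University",
        "publications": [("An Example Paper!", 12)],
    }
}
lookups = []


def fake_fetch(name):
    lookups.append(name)
    return fake_profiles.get(name)


result = find_citation(
    "An example paper", ["A. Author", "B. Author", "C. Author"], fake_fetch
)
print(result, lookups)
```

In this toy run the third author is never looked up, which mirrors the time-saving behaviour described above: the search stops as soon as one author's profile yields a match.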

The script that performs this scraping, get_statistics_allyears.py, broadly consists of two parts. Two of the functions, get_html and get_name_and_authors, are used to parse the accepted papers page of NeurIPS and ICML. The get_citations function then tries to find citation and affiliation information for each paper in turn.

Note that it should be relatively easy to scrape paper lists from other conferences, such as CVPR or ICLR. The get_citations function can then be used in conjunction with these to find citation information about these papers too.

Code Dependencies and Setup

The code to scrape the data is written for Python 3 and, besides standard packages such as pandas, relies on the scholarly package. Note: unfortunately, several functions in scholarly no longer seem to work because of changes in Google Scholar. The instructions below include an edit to the installed package so that it can retrieve author information.

Instructions for Scraping the Data:

  1. Install scholarly and its dependencies.
  2. In the Author class in /usr/local/lib/python3.5/dist-packages/scholarly.py, edit all occurrences of gsc_oai_<text> to gs_ai_<text>.
  3. Run the scraping script: e.g. python3 get_statistics_allyears.py --conference=NeurIPS --year=2016
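The edit in step 2 can also be scripted. The sketch below is one possible way to do it, under two assumptions of ours: that Author is a top-level class in scholarly.py, and that its body ends at the next top-level class or def statement. The helper name `patch_author_class` is hypothetical.

```python
import re


def patch_author_class(source):
    """Replace gsc_oai_ with gs_ai_ inside the top-level Author class only."""
    start = re.search(r"^class Author\b", source, flags=re.MULTILINE)
    if start is None:
        raise ValueError("no top-level Author class found")
    # Assume the class body ends at the next top-level class/def, or at end of file.
    end = re.search(r"^(?:class|def)\s", source[start.end():], flags=re.MULTILINE)
    stop = start.end() + end.start() if end else len(source)
    patched_body = source[start.start():stop].replace("gsc_oai_", "gs_ai_")
    return source[:start.start()] + patched_body + source[stop:]


# Small made-up module text used to exercise the helper.
sample = (
    "class Publication(object):\n"
    "    x = 'gsc_oai_keep'\n"
    "\n"
    "class Author(object):\n"
    "    y = 'gsc_oai_name'\n"
    "\n"
    "def search():\n"
    "    return 'gsc_oai_keep'\n"
)
patched = patch_author_class(sample)
print(patched)
```

To apply it to the installed file, read the text of scholarly.py from the path given in step 2, pass it through `patch_author_class`, and write the result back, keeping a backup copy of the original first.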

Open Questions and Future Work

We discuss several open questions in the blog post. Two are specific to the scraper. First, it would be very interesting to see this analysis applied to other publishing venues. Second, it would be interesting to see whether affiliations at the time of publication could be scraped, to get a better sense of the shift between academia and industry.

I hope this is a useful resource for the community!

mlcites's Issues

CSV Format?

Hi Maithra, I'm using your citation data for a talk at ICLR reproducibility workshop tomorrow. I was going to push this to my fork as a csv and was just curious why they were pickled. Thanks!