

OpenReviewCrawler

This project is a crawler for OpenReview submissions. It serves as the basis for another project that matches comments and reviews to specific changes between revisions.

This project was created by Lena Held, Erik Schwan and Gregor Geigle, three students at Technische Universität Darmstadt, as a course project for the UKP lab.

Table of Contents

  1. About the OpenReview API Model
  2. Setup
  3. Usage
  4. Config
  5. Output
  6. Labeling Approach
  7. Relational Database
  8. Comments as Tree
  9. Statistics of the Data

License for the Data

As far as we are aware, OpenReview provides no license for the data. Neither https://openreview.net/terms nor https://openreview.net/about gives any licensing information.

About the OpenReview API Model

We present the way the OpenReview API models the data with invitations, notes and content in depth here.

TL;DR:
The data model with notes, invitations and content allows customization for each venue.
Notes are containers for data (submissions, reviews, comments), invitations determine the type of a note, and the content field of a note contains its textual content (title, abstract and authors for a paper; review text, score and confidence for a review; and so on).
Invitations and content are specified by each venue as needed or wanted and can thus vary between venues in name and extent.
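
To make this concrete, below is an invented sketch (written as a Python dictionary) of what a review note could look like. The invitation name and the content fields are venue-specific, and all values here are placeholders:

  # Hypothetical review note; invitation and content fields vary per venue
  note = {
      "id": "B1xYz...",      # placeholder note id
      "forum": "ryGs6...",   # placeholder id of the submission it belongs to
      "invitation": "ICLR.cc/2019/Conference/-/Paper1/Official_Review",
      "content": {
          "title": "Review of Paper 1",
          "review": "The paper proposes ...",
          "rating": "6: Marginally above acceptance threshold",
          "confidence": "4",
      },
  }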

Setup

We recommend at least Python 3.6 to use this project.

Run pip install -r requirements.txt to install all required packages.

Usage

Run python crawler.py to start the crawler with the default config ./config.json.

If you want to specify the path to the config, use -c | --config {path}. If you don't want your password in the config, use -p | --password {password}.

To get a list of all possible venues, run python crawler.py --help_venues.

python crawler.py --help will display all possible arguments.
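
For example, to combine both options (the config path and password are placeholders):

  python crawler.py --config ./my_config.json --password mysecret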

Config

You specify the venues and years to crawl in the config. Check out the example config here.

Use year: "all" to crawl an entire venue.

Leaving username and password empty uses the guest access.

The boolean variable acceptance_labeling determines if the data will be annotated with the acceptance decision prior to storing.

The boolean variable output_json determines if the output will be saved as a JSON file in the outdir.

The boolean variable output_SQL determines if the data will be inserted into the database, which is configured here.

output_SQL and output_json are independent of each other. Both can be true.

The boolean variable skip_pdf_download determines if the PDF download is skipped.

The boolean variable threaded_download determines if the PDFs will be downloaded with multiple threads. This increases the download speed significantly. However, this feature was developed to run robustly on Linux machines; we advise Windows and OSX users to switch it off.

logging_level changes the logging level. The default is INFO.
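
Putting these options together, a config could look roughly like the following sketch. The exact key names and the nesting of the venue entries are assumptions on our part; the example config linked above is authoritative:

  {
    "venues": {
      "ICLR.cc": {"year": ["2019", "2020"]},
      "NIPS.cc": {"year": "all"}
    },
    "username": "",
    "password": "",
    "acceptance_labeling": true,
    "output_json": true,
    "output_SQL": false,
    "outdir": "./output",
    "skip_pdf_download": false,
    "threaded_download": true,
    "logging_level": "INFO"
  }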

Output

The program goes over the venues by year, downloads the PDFs for the revisions and outputs a dictionary of the data. This data can then be stored as JSON or in an SQL database.

For the JSON, the information about submissions, comments and reviews is formatted as Notes as specified by OpenReview (for more information, see About the OpenReview API Model). An example JSON can be found here.

In the JSON, each submission also has the field revisions, which contains a list of all previous revisions of the submission, and notes, which contains a list of all comments, reviews and decisions of the submission. Each note in this list also has a field revisions for its previous iterations.

PDFs are stored in the format {forum}_{revision_number}.pdf. revision_number is the position in the revision array of a submission. Note that the array is sorted from newest to oldest.
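
As a hedged illustration, the PDF filename of each revision can be reconstructed from the JSON output like this. The top-level "submissions" key is an assumption about the layout; see the example JSON linked above for the real structure:

  import json

  with open("output.json") as f:  # hypothetical output file from crawler.py
      data = json.load(f)

  for submission in data["submissions"]:
      forum = submission["forum"]
      # revisions are sorted from newest to oldest, so index 0 is the latest PDF
      for revision_number, revision in enumerate(submission["revisions"]):
          print("{}_{}.pdf".format(forum, revision_number))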

The database uses a similar format. More on this in the section about the database.

Acceptance Labeling

How a submission is marked as accepted or rejected varies from venue to venue, from year to year and even within a single year of one venue.

We present acceptance_labeling.py, which uses string-matching rules to label the acceptance status of all submissions.

Usage

Run python acceptance_labeling.py -f {output_from_crawler.json} to label the output JSON from crawler.py.

Each submission will receive a new field acceptance_tag that indicates its status. acceptance_tag can take one of the values accepted, rejected, withdrawn or unknown.

The result is written to the input JSON by default, but a new output file can be specified with the -w or --write_new_file argument.

python acceptance_labeling.py --help will display all possible arguments.
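
As a quick sanity check, the resulting tags can be tallied with a few lines of Python. This is a sketch; the top-level "submissions" key is again an assumption about the JSON layout:

  import json
  from collections import Counter

  with open("output_from_crawler.json") as f:  # labeled output (hypothetical name)
      data = json.load(f)

  # Tally the acceptance_tag field added by acceptance_labeling.py
  counts = Counter(sub["acceptance_tag"] for sub in data["submissions"])
  print(counts)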

If an existing JSON data file is present at the output path, this file is reloaded and all venues it already contains are skipped.

Venues that are already stored in the SQL database, in contrast, are overwritten rather than skipped. However, we made sure that the ID keys remain the same. This makes it possible to update only certain venues without loading all venues from the database.

Labeling Approach

The labeling is done with string matching of the invitation names and note content.

A submission is labeled as withdrawn if its invitation contains 'withdraw'.

Otherwise, we search the submission's notes for a decision note. Decision notes contain 'decision' or 'acceptance' in their invitation.

We search each content field (except title) of the decision note for 'accept' or 'reject'. If both are found in a field, the submission is labeled unknown; otherwise it is labeled accepted or rejected. If neither word is found in the decision note, we also label the submission as accepted. This is because many venues only state what a submission is accepted as (e.g. poster or workshop). All rejected papers that we have seen contain 'reject' in their decision note (except ICLR 2014), so we do not expect false positives from this.
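
These rules translate into roughly the following sketch. It is a simplification for illustration, not the exact code in acceptance_labeling.py:

  def label_submission(submission):
      # Withdrawn submissions are recognized by their invitation
      if "withdraw" in submission["invitation"].lower():
          return "withdrawn"
      # Look for a decision note among the submission's notes
      for note in submission["notes"]:
          invitation = note["invitation"].lower()
          if "decision" not in invitation and "acceptance" not in invitation:
              continue
          for field, value in note["content"].items():
              if field == "title":
                  continue
              text = str(value).lower()
              if "accept" in text and "reject" in text:
                  return "unknown"
              if "reject" in text:
                  return "rejected"
              if "accept" in text:
                  return "accepted"
          # Neither word found: many venues only state what a submission
          # is accepted as, so default to accepted
          return "accepted"
      return "unknown"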

Results

We present the results of our approach on all venues and years with config_all.json [Date: 03.02.2020].

Total  Unknown  Withdrawn  Accepted  Rejected
 9801     2412        784      3008      3597

We manually verified that the label is correct for each venue. The approach and results can be found here.

Relational Database

Database Connection

The crawler is able to convert the dictionary data into a structured SQL database format. To do so, it utilizes the popular SQLAlchemy library, which can interface with various popular database systems.

This database can be configured here.

Since the database follows normalization standards, it cannot store lists directly. It is therefore theoretically possible that the input data dictionary contains more data than the database stores. However, the implementation is careful not to omit the key data about the paper-review process.
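
As a minimal sketch of the SQLAlchemy entry point (the actual connection URL and table definitions live in the crawler's database configuration):

  from sqlalchemy import create_engine

  # SQLAlchemy addresses the database via a URL, so the same code can target
  # SQLite, PostgreSQL, MySQL and others; the URL below is only an example.
  engine = create_engine("sqlite:///openreview.db")
  with engine.connect() as connection:
      print(connection.closed)  # False while the connection is open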

Database Values

Most values are intuitive or can be looked up here. The following variables, however, benefit from a short explanation:

cdate (int, optional) – Creation date

tcdate (int, optional) – True creation date

tmdate (int, optional) – Modification date

ddate (int, optional) – Deletion date
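
These timestamps are plain integers. Assuming they count milliseconds since the Unix epoch (the convention OpenReview data commonly uses), they can be converted to readable dates like this:

  from datetime import datetime, timezone

  cdate = 1580684400000  # example value, not real data
  print(datetime.fromtimestamp(cdate / 1000, tz=timezone.utc))
  # 2020-02-02 23:00:00+00:00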

UML Diagram

[UML diagram of the database schema]

Error Explanation

Request Error for ID ...

During our test period, this error was raised when the internet bandwidth was too low to handle all requests in an appropriate time. The request then times out and raises an error.

If this error is raised, no data is downloaded, so the PDF for the submission will be missing.

Threading

If threading is turned on in the configuration file, the PDFs are downloaded with multiple threads. The number of threads is not limited, so the download capacity is limited only by your bandwidth and/or your hardware. OpenReview confirmed on request that we do not need to limit the requests per second to their servers.

Furthermore, our crawler has a separate thread to communicate with the database. This database communicator is queue-based and inserts data in FIFO order.

All PDF download threads push their resulting data into the queue, and the database thread then inserts it into the database. We wait until all PDF data has been inserted into the database before inserting the JSON data; this avoids insertion conflicts.
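
A minimal sketch of this producer-consumer pattern; insert_into_database is a hypothetical placeholder for the real SQLAlchemy insert:

  import queue
  import threading

  def insert_into_database(item):
      pass  # hypothetical placeholder for the real database insert

  work_queue = queue.Queue()

  def database_worker():
      # Consume items in FIFO order until the sentinel arrives
      while True:
          item = work_queue.get()
          if item is None:
              break
          insert_into_database(item)
          work_queue.task_done()

  db_thread = threading.Thread(target=database_worker)
  db_thread.start()
  # ... download threads call work_queue.put(pdf_data) here ...
  work_queue.put(None)  # signal that all downloads are done
  db_thread.join()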

Comments as Tree

The comments of a submission are stored as a list.
However, in reality they form a forest of top-level comments and their replies.

For use cases where the comments should be stored as this forest, we created comment_tree.py.

It is used by executing python comment_tree.py -f {input.json}.
{input.json} is the output JSON from crawler.py, and the output is a new JSON {input}_comment_tree.json in which the notes field of each submission is changed to contain the comment forest. Each comment note there then also contains a new field replies, which holds a list of all replying comment notes.
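
Conceptually, the forest can be built from the flat note list via each note's replyto field. The sketch below assumes that a top-level comment replies to the forum itself (i.e. its replyto does not point to another note in the list):

  def build_comment_forest(notes):
      by_id = {note["id"]: note for note in notes}
      roots = []
      for note in notes:
          note.setdefault("replies", [])
      for note in notes:
          parent = by_id.get(note.get("replyto"))
          if parent is not None:
              parent["replies"].append(note)
          else:
              roots.append(note)  # top-level comment
      return roots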

Statistics of the Data

We present a graphical exploration of distributions behind the submissions, comments and revisions here.


Issues

Presentation

Create a presentation about the project.
Dates are: 23.1., 30.1., 6.2.
Express preference for first slot.

Acquired Data

Hand in the acquired data in a suitable format. It should be fed to the Chat Bot.

Required Data (as stated in project goal):

  • comments
  • paper revisions
  • timestamp of the discussion comment
  • timestamp of revisions of the paper
  • review content
  • comment content
  • review author
  • review scores
  • metadata about the authors

The data should include rejected papers, full discussion threads and author information.

Parse comments into tree

A function that takes notes and builds the comment tree from them would be nice. It will presumably be helpful for 2B.
Also, comments, reviews, author comments etc. are currently in their own invitations. With a tree, they are grouped together.

TypeError: iterget_invitations() got an unexpected keyword argument 'expired'

I am getting an error after cloning, installing requirements and adding username and password to config.json:

  $ python crawler.py

  /anaconda3/lib/python3.7/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
    warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
  Login as was successful
  Current Download: NIPS.cc in 2018
  Traceback (most recent call last):
    File "crawler.py", line 172, in <module>
      results = crawl(client, config, log)
    File "crawler.py", line 39, in crawl
      invitations_iterator = openreview.tools.iterget_invitations(client, regex="{}/{}/".format(venue, year), expired=True)
  TypeError: iterget_invitations() got an unexpected keyword argument 'expired'

Code Documentation

  • Comment Code with Inline Style

  • Update README.md

  • Explain usage with example

Define Output Format

Communicate with Group B about the output format of the data.
Define what the JSON should look like in the next meeting.

Project Report

End Report for the whole project.
Details needed: format, scope, minimum pages
