Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia


The documentation is written as a wiki: DOCUMENTATION

A dataset of citations was extracted from English Wikipedia (dump date: May 2020); it covers 35 different citation templates such as cite news and cite web.

The full dataset contains 29.276 million citations. From it, a subset of 3.92 million citations with identifiers was prepared; this subset covers only the DOI, ISBN, PMC, PMID and arXiv identifiers.
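For a quick look at either dataset, the parquet files can be read with pandas. The sketch below is illustrative only: the part-file name is a placeholder and the identifier column name ID_list is an assumption, so check the wiki documentation for the actual schema.

import pandas as pd

# Read one parquet part of the full citation dataset (placeholder path).
df = pd.read_parquet("data/citations_separated/part-00000.parquet")

# Keep only rows whose identifier field mentions a DOI.
# NOTE: the column name "ID_list" is an assumption; see the wiki for the schema.
with_doi = df[df["ID_list"].astype(str).str.contains("DOI", na=False)]
print(len(with_doi), "citations with a DOI in this part")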

Along with the two citation datasets, two frameworks are provided to train on the citations and classify whether a citation is scientific or not. Anyone is welcome to build models or run experiments using the extracted datasets and improve on our results!

Please use the notebook minimal_dataset_demo.ipynb to play with minimal_dataset.zip, which can be downloaded from Zenodo (http://doi.org/10.5281/zenodo.3940692).
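If you prefer to script the download, a minimal sketch with requests and zipfile follows; the direct file URL is an assumption derived from the Zenodo record behind the DOI, so fall back to the landing page if it no longer resolves.

import io
import zipfile
import requests

# Assumed direct file URL for the Zenodo record behind the DOI above.
url = "https://zenodo.org/record/3940692/files/minimal_dataset.zip"

resp = requests.get(url)
resp.raise_for_status()

# Inspect the archive contents, then extract next to the demo notebook.
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    print(zf.namelist())
    zf.extractall("minimal_dataset")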

Running the repository

Assuming that Python is already installed (tested with version >= 2.7), the dependencies are listed in requirements.txt and can be installed with:

pip install -r requirements.txt

The notebooks can be accessed using:

jupyter notebook
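To regenerate the datasets, the modified parser libraries bundled under libraries/ (see Contents below) also need to be installed. Assuming the directory layout matches the library names and each ships its own setup.py, a local install along these lines should work:

pip install ./libraries/mwparserfromhell
pip install ./libraries/wikiciteparser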

Contents

  • README.md: this file.
  • data/
    • citations_separated: Dataset containing all citations from Wikipedia, with each of the column keys separated, compressed in parquet format (pre-lookup).
    • citations_ids: Subset of the above dataset containing all citations which have a valid identifier such as DOI, ISBN, PMC, PMID or arXiv.
    • top300_templates: A CSV file which contains the top 300 citation templates as calculated by DLAB-EPFL.
  • libraries/: Contains the libraries mwparserfromhell and wikiciteparser, which have been modified for the scope of this project. To regenerate all the datasets, the user needs to install these versions of the libraries.
  • lookup/: Contains two scripts, run_metadata.py and get_apis.py, which can be used to query CrossRef and Google Books. run_metadata.py runs asynchronously and currently supports only CrossRef; get_apis.py uses the requests library and suits short batches of metadata queries (a sketch of such a query follows this list). The other files relate to the CrossRef evaluation used to determine the best matching heuristic and confidence threshold.
  • notebooks/: Contains the notebooks which:
    • run sanity checks against other similar work (Sanity_Check_Citations.ipynb)
    • explore and analyze the features (Feature_Data_Analysis.ipynb)
    • build the hybrid network model which performs the classification and walks through all the steps (citation_network_model_3_labels.ipynb)
    • collect the results we get from the lookup and the corresponding labels we classify them into (results_predication_lookup.ipynb)
    • perform post-lookup steps such as linking citations labelled as potential journals with their corresponding metadata (wild_examples_lookup_journal.ipynb)
  • scripts/: Contains all the scripts used to generate the dataset and features; a description is given at the top of each script. All file paths are currently the absolute paths used when the scripts were run, so please change them before running the scripts yourself.
  • tests/: Some tests to check that the data-generation scripts do what they are supposed to. More tests will be added in the future to cover the whole pipeline.
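To illustrate the kind of short metadata query get_apis.py performs, here is a minimal sketch against the public CrossRef REST API using requests. The helper name and the fields printed are our own choices, not the script's; only the /works/{doi} endpoint comes from the CrossRef API.

import requests

def crossref_lookup(doi):
    # Hypothetical helper; see lookup/get_apis.py for the project's version.
    resp = requests.get("https://api.crossref.org/works/" + doi, timeout=10)
    resp.raise_for_status()
    return resp.json()["message"]

# Example: the DOI of Watson & Crick (1953), registered with CrossRef.
meta = crossref_lookup("10.1038/171737a0")
print(meta.get("title"), meta.get("container-title"))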

How to cite

@misc{singh2020wikipedia,
    title={Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia},
    author={Harshdeep Singh and Robert West and Giovanni Colavizza},
    year={2020},
    eprint={2007.07022},
    archivePrefix={arXiv},
    primaryClass={cs.DL}
}
