
harshdeep1996 / cite-classifications-wiki

24 stars · 2 watchers · 5 forks · 749.38 MB

Citation Classification using hybrid neural network model for Wikipedia References

License: Creative Commons Zero v1.0 Universal

Python 35.39% Jupyter Notebook 39.22% Makefile 0.28% C 9.11% C++ 0.08% Shell 0.25% Batchfile 0.10% Lua 15.57%
dataset dataset-generation citations deep-learning wikipedia

cite-classifications-wiki's People

Contributors: harshdeep1996


cite-classifications-wiki's Issues

Align licenses

Please use CC BY 4.0 here and in Zenodo consistently.

Different charset encodings for parquet output columns

Hi @Harshdeep1996!
I recently discovered some annoying problems in the output dataset, which is stored in the parquet format.

First problem

The 'metadata_file' column is stored as a byte array instead of a proper Python unicode str.
At least, this is what I get when importing that column.
Example: b"_1234.json" instead of "_1234.json"
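Until the column is re-exported as text, consumers can normalize on read. A minimal sketch (the `to_text` helper name is mine, not part of the project; UTF-8 is assumed as the encoding):

```python
def to_text(value, encoding="utf-8"):
    """Decode a parquet cell that may arrive as raw bytes instead of str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(to_text(b"_1234.json"))  # _1234.json
```

The proper fix would be to annotate the column as UTF-8 string in the parquet schema so readers get `str` directly.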

Second problem

In some columns (I'm not sure I've identified all of them), strings are stored with literal unicode escape sequences inside.
What I mean is that the actual bytes stored persistently in the parquet file correspond to a string like 12\u201345, where the \u2013 sequence is not interpreted as an en dash character but stored as the six literal characters ['\', 'u', '2', '0', '1', '3']. It should instead be stored as a proper Python unicode str.

From my findings, these are the columns affected by this problem (but I'm not 100% sure, I need your help here):

  • Authors
  • Chapter
  • Date
  • ID_list
  • Issue
  • Pages
  • Periodical
  • PublisherName
  • Title
  • Volume

I currently assume every other column to be stored as a proper Python unicode str.
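Until those columns are regenerated, affected values can be repaired on read. A minimal sketch (the `unescape` helper is hypothetical; note that Python's `unicode_escape` codec assumes Latin-1 input, so it should only be applied to cells that actually contain literal `\uXXXX` sequences):

```python
import codecs

def unescape(value):
    """Turn literal \\uXXXX escape sequences into the characters they encode."""
    if isinstance(value, str) and "\\u" in value:
        return codecs.decode(value, "unicode_escape")
    return value

print(unescape("12\\u201345"))  # 12–45 (the \u2013 becomes an en dash)
```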

Duplicate rows found in the parent dataset

Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).

I found some duplicated rows (approximately 2,000 per parquet partition file): they share the same 'id' and the same 'citations' value. Because of the project's workflow, the entire rows are identical.

These duplicated rows should be removed in the next edition of the dataset.
As a suggestion, a small deduplication step could be added at some point during the workflow.
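Such a step could be a first-occurrence filter on the ('id', 'citations') pair, sketched here in plain Python (in the project's actual Spark pipeline the equivalent would presumably be a `dropDuplicates` call on those columns):

```python
def dedupe(rows):
    """Keep only the first occurrence of each (id, citations) pair."""
    seen = set()
    out = []
    for row in rows:
        key = (row["id"], row["citations"])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"id": 1, "citations": "{{cite book|...}}"},
    {"id": 1, "citations": "{{cite book|...}}"},  # exact duplicate
    {"id": 2, "citations": "{{cite web|...}}"},
]
print(len(dedupe(rows)))  # 2
```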

Add labels to citations

Add two columns to the full and minimal datasets:

  • label (book, journal article, web content)
  • source (ground truth, classifier)
  • perhaps also add the confidence from the classifier, if applicable
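A sketch of how the extended schema could look per record (the column names come from the bullets above; the helper itself is hypothetical):

```python
def add_label_columns(record, label, source, confidence=None):
    """Return a copy of a citation record with the proposed label columns."""
    assert label in {"book", "journal article", "web content"}
    assert source in {"ground truth", "classifier"}
    out = dict(record)
    out["label"] = label
    out["source"] = source
    if confidence is not None:
        out["confidence"] = confidence  # only meaningful for classifier output
    return out

row = add_label_columns({"id": 42}, "book", "classifier", confidence=0.93)
print(row["label"], row["source"])  # book classifier
```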
