
harshdeep1996 / cite-classifications-wiki

24 stars · 2 watchers · 5 forks · 749.38 MB

Citation Classification using hybrid neural network model for Wikipedia References

License: Creative Commons Zero v1.0 Universal

Python 35.39% Jupyter Notebook 39.22% Makefile 0.28% C 9.11% C++ 0.08% Shell 0.25% Batchfile 0.10% Lua 15.57%
dataset dataset-generation citations deep-learning wikipedia

cite-classifications-wiki's People

Contributors: harshdeep1996


cite-classifications-wiki's Issues

Align licenses

Please use CC BY 4.0 here and in Zenodo consistently.

Different charset encodings for parquet output columns

Hi @Harshdeep1996!
I recently discovered some annoying problems in the output dataset, which is stored in the parquet format.

First problem

The 'metadata_file' column is stored as a byte array instead of a proper Python unicode str.
At least, this is what I get when importing that column.
Example: b"_1234.json" instead of "_1234.json"
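Until the column is re-exported as text, consumers can normalize on read. A minimal sketch (the `to_text` helper name is mine, not part of the project; UTF-8 is assumed as the encoding):

```python
def to_text(value, encoding="utf-8"):
    """Decode a parquet cell that may arrive as raw bytes instead of str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

print(to_text(b"_1234.json"))  # _1234.json
```

The proper fix would be to annotate the column as UTF-8 string in the parquet schema so readers get `str` directly.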

Second problem

In some columns (I'm not sure I've identified all of them), strings are stored with literal unicode escape sequences inside.
What I mean is that the actual bytes stored persistently in the parquet file correspond to a string like 12\u201345, where the \u2013 sequence is not interpreted as an en dash character but stored as the six literal characters ['\', 'u', '2', '0', '1', '3']. It should instead be stored as a proper Python unicode str.

From my findings, these are the columns affected by this problem (but I'm not 100% sure, I need your help here):

  • Authors
  • Chapter
  • Date
  • ID_list
  • Issue
  • Pages
  • Periodical
  • PublisherName
  • Title
  • Volume

I currently assume every other column to be stored as a proper Python unicode str.
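Until those columns are regenerated, affected values can be repaired on read. A minimal sketch (the `unescape` helper is hypothetical; note that Python's `unicode_escape` codec assumes Latin-1 input, so it should only be applied to cells that actually contain literal `\uXXXX` sequences):

```python
import codecs

def unescape(value):
    """Turn literal \\uXXXX escape sequences into the characters they encode."""
    if isinstance(value, str) and "\\u" in value:
        return codecs.decode(value, "unicode_escape")
    return value

print(unescape("12\\u201345"))  # 12–45 (the \u2013 becomes an en dash)
```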

Duplicate rows found in the parent dataset

Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).

I found some duplicated rows (approximately 2,000 per parquet partition file): they share the same 'id' and the same 'citations' value. Because of the project's workflow, the entire rows are identical.

These duplicated rows should be removed in the next edition of the dataset.
As a suggestion, a small deduplication step could be added at some point during the workflow.
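Such a step could be a first-occurrence filter on the ('id', 'citations') pair, sketched here in plain Python (in the project's actual Spark pipeline the equivalent would presumably be a `dropDuplicates` call on those columns):

```python
def dedupe(rows):
    """Keep only the first occurrence of each (id, citations) pair."""
    seen = set()
    out = []
    for row in rows:
        key = (row["id"], row["citations"])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"id": 1, "citations": "{{cite book|...}}"},
    {"id": 1, "citations": "{{cite book|...}}"},  # exact duplicate
    {"id": 2, "citations": "{{cite web|...}}"},
]
print(len(dedupe(rows)))  # 2
```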

Add labels to citations

Add two columns to the full and minimal datasets:

  • label (book, journal article, web content)
  • source (ground truth, classifier)
  • perhaps also add the confidence from the classifier, if applicable
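A sketch of how the extended schema could look per record (the column names come from the bullets above; the helper itself is hypothetical):

```python
def add_label_columns(record, label, source, confidence=None):
    """Return a copy of a citation record with the proposed label columns."""
    assert label in {"book", "journal article", "web content"}
    assert source in {"ground truth", "classifier"}
    out = dict(record)
    out["label"] = label
    out["source"] = source
    if confidence is not None:
        out["confidence"] = confidence  # only meaningful for classifier output
    return out

row = add_label_columns({"id": 42}, "book", "classifier", confidence=0.93)
print(row["label"], row["source"])  # book classifier
```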
