harshdeep1996 / cite-classifications-wiki
Citation Classification using hybrid neural network model for Wikipedia References
License: Creative Commons Zero v1.0 Universal
Please use CC BY 4.0 here and in Zenodo consistently.
Same to Zenodo
Hi @Harshdeep1996!
I recently discovered some annoying problems in the output dataset, which is stored in the parquet format.
The 'metadata_file' column is stored as a byte array instead of a proper Python unicode str. At least, this is what I get when importing that column.
Example: b"_1234.json" instead of "_1234.json"
In some columns (I'm not sure I have identified all of them), strings are stored with unicode escape sequences inside.
What I mean is that the bytes stored persistently in the parquet file correspond to a unicode string like 12\u201345, where the \u2013 sequence is not interpreted as an en dash character but as the string composed of the following chars: ['\', 'u', '2', '0', '1', '3']. It should instead be stored as a proper Python unicode str.
From my findings, these are the columns affected by this problem (but I'm not 100% sure, I need your help here):
I currently assume every other column to be stored as a proper Python unicode str.
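As a possible workaround on the reader's side, the two problems above could be handled with small helpers like the ones below. This is only a minimal sketch: the function names `to_text` and `unescape_unicode` are hypothetical, UTF-8 is an assumed encoding, and the escape handling covers only four-digit `\uXXXX` sequences.

```python
import re

# Matches a literal backslash, the letter u, and four hex digits,
# i.e. the six characters of an uninterpreted escape such as \u2013.
_ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a byte string.

    The encoding is an assumption; other values pass through unchanged.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value


def unescape_unicode(value):
    """Replace literal \\uXXXX sequences in a str with the character they name.

    Limitation: surrogate pairs written as two escapes are not recombined.
    """
    if not isinstance(value, str):
        return value
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), value)


if __name__ == "__main__":
    print(to_text(b"_1234.json"))
    print(unescape_unicode("12\\u201345"))
```

With pandas, these could be applied per column, for example `df["metadata_file"].map(to_text)`, though which columns need which fix is exactly the open question in this report.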
Hi @Harshdeep1996, I'm working on the parent dataset (the 'citations_from_wikipedia.zip' file available on Zenodo).
I found some duplicated rows (approximately two thousand per parquet partition file), meaning that they have the same 'id' and the same 'citations' value. Given the workflow of this project, such rows are identical in every column.
Those duplicated lines should be removed from the next edition of the dataset.
As a suggestion, these lines of code could be used at some point during the workflow.
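For illustration only, a deduplication step keyed on 'id' and 'citations' might look like the sketch below. It is not the snippet from the original report: it assumes rows are available as Python dictionaries, and the helper name `drop_duplicate_rows` is hypothetical.

```python
def drop_duplicate_rows(rows, keys=("id", "citations")):
    """Keep the first row for each distinct combination of the key columns.

    rows is an iterable of dicts; the values under keys must be hashable.
    """
    seen = set()
    unique = []
    for row in rows:
        marker = tuple(row[k] for k in keys)
        if marker in seen:
            continue  # same 'id' and 'citations' already kept
        seen.add(marker)
        unique.append(row)
    return unique


if __name__ == "__main__":
    sample = [
        {"id": 1, "citations": "{{cite book|title=A}}"},
        {"id": 1, "citations": "{{cite book|title=A}}"},
        {"id": 1, "citations": "{{cite web|title=B}}"},
    ]
    print(len(drop_duplicate_rows(sample)))
```

If the partitions are loaded as pandas DataFrames, `df.drop_duplicates(subset=["id", "citations"])` should have the same effect.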
Add two columns to the full and minimal datasets:
(book, journal article, web content)
(ground truth, classifier)