This repository is used to publish all the code used for the following article:
The code has not yet been completely released. This page will be updated once the release is complete.
For any number that appears in our paper, there is a script one can execute to reproduce it. Users should not have to figure out any experimental parameter buried in the paper's content.
The `data` directory contains the preprocessing scripts for all the datasets used in the paper. These datasets are released separately from their processing source code. See below for details.
The following table is a summary of the datasets. Most of them have millions of samples for training.
Dataset | Language | Classes | Train | Test |
---|---|---|---|---|
Dianping | Chinese | 2 | 2,000,000 | 500,000 |
JD full | Chinese | 5 | 3,000,000 | 250,000 |
JD binary | Chinese | 2 | 4,000,000 | 360,000 |
Rakuten full | Japanese | 5 | 4,000,000 | 500,000 |
Rakuten binary | Japanese | 2 | 3,400,000 | 400,000 |
11st full | Korean | 5 | 750,000 | 100,000 |
11st binary | Korean | 2 | 4,000,000 | 400,000 |
Amazon full | English | 5 | 3,000,000 | 650,000 |
Amazon binary | English | 2 | 3,600,000 | 400,000 |
Ifeng | Chinese | 5 | 800,000 | 50,000 |
Chinanews | Chinese | 7 | 1,400,000 | 112,000 |
NYTimes | English | 7 | 1,400,000 | 105,000 |
Joint full | Multilingual | 5 | 10,750,000 | 1,500,000 |
Joint binary | Multilingual | 2 | 15,000,000 | 1,560,000 |
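
The sizes in the two Joint rows are consistent with summing the JD, Rakuten, 11st, and Amazon rows of the same type. The following is a minimal sketch of that arithmetic. It assumes the joint datasets were built by combining exactly those four datasets, which the table suggests but does not state. The variable and function names are illustrative only.

```python
# Train/test sizes copied from the summary table above.
# Assumption: the Joint datasets combine only these four sources.
full = {
    "JD full": (3_000_000, 250_000),
    "Rakuten full": (4_000_000, 500_000),
    "11st full": (750_000, 100_000),
    "Amazon full": (3_000_000, 650_000),
}
binary = {
    "JD binary": (4_000_000, 360_000),
    "Rakuten binary": (3_400_000, 400_000),
    "11st binary": (4_000_000, 400_000),
    "Amazon binary": (3_600_000, 400_000),
}


def totals(sizes):
    """Sum train and test counts over a dict of (train, test) pairs."""
    train = sum(t for t, _ in sizes.values())
    test = sum(s for _, s in sizes.values())
    return train, test


joint_full = totals(full)
joint_binary = totals(binary)
print("Joint full:", joint_full)
print("Joint binary:", joint_binary)
```

The printed totals can be compared against the Joint full and Joint binary rows as a quick sanity check after downloading the data.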
Datasets are released separately from the source code via Google Drive links. These datasets should be used for research purposes only.
Dataset | Train | Test |
---|---|---|
Dianping | Link | Link |
JD full | Link | Link |
JD binary | Link | Link |
Rakuten full | Link | Link |
Rakuten binary | Link | Link |
11st full | Link | Link |
11st binary | Link | Link |
Amazon full | Link | Link |
Amazon binary | Link | Link |
Ifeng | Link | Link |
Chinanews | Link | Link |
NYTimes | Link | Link |
Joint full | Link | Link |
Joint binary | Link | Link |
The `glyphnet` scripts require the GNU Unifont character images to run. The file `unifont-8.0.01.t7b.xz` can be downloaded via this link.
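
The Unifont file is distributed as an `.xz` archive. If the scripts expect the uncompressed `.t7b` file (an assumption; check the scripts for the exact path they load), it could be unpacked with a sketch like the one below, which uses only the Python standard library. The helper name `decompress_xz` is hypothetical and not part of this repository.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress an .xz archive and return the path of the output file.

    If dst is omitted, the output is written next to src with the
    trailing ".xz" suffix dropped.
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Stream the data so a large archive is never held fully in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Assumed location: the archive sits in the current working directory.
archive = Path("unifont-8.0.01.t7b.xz")
if archive.exists():
    print("Wrote", decompress_xz(archive))
```

The command-line tool `xz -d` does the same job where it is installed.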