italkicorpus's Introduction

italkiCorpus

Dataset for our work: On the Development of a Large Scale Corpus for Native Language Identification.

Note: The italki website has moved away from the notebooks used in this project. This code probably won't work anymore, at least until it is updated.

Gathering data

For copyright reasons we don't publish the raw data. Instead, we provide tools to recreate the NLI corpus from the italki website.

To recreate the exact same dataset as collected in 2017, pass the ID list file:

python3 scrape.py recreate 2017_ids.txt

Collect your own new data using:

python3 scrape.py scrape arabic chinese french german hindi italian japanese korean russian spanish turkish

By default, this creates a new folder, italki_data, containing one .txt file per document (named by its document ID) and a CSV label file:

document_id, author_id, L1, english_proficiency
142576, 32162, Turkish, 2
248781, 12987, French, 4
...
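The label file and the per-document text files can be joined with a few lines of Python. The following is a minimal sketch, not part of the repository: `load_corpus` is a hypothetical helper, the label file path is passed in because its name is not given above, and it assumes each document is stored as `<document_id>.txt` in the data folder.

```python
import csv
from pathlib import Path


def load_corpus(data_dir, label_file):
    """Yield one dict per labelled document, with its text attached.

    Hypothetical helper: assumes each document is stored as
    <document_id>.txt inside data_dir and that label_file is a CSV
    with the header shown above.
    """
    data_dir = Path(data_dir)
    with open(label_file, newline="", encoding="utf-8") as f:
        # skipinitialspace drops the blank that follows each comma
        reader = csv.DictReader(f, skipinitialspace=True)
        for row in reader:
            text_path = data_dir / f"{row['document_id']}.txt"
            if not text_path.exists():
                continue  # skip labels that have no matching text file
            yield {
                "document_id": row["document_id"],
                "author_id": row["author_id"],
                "L1": row["L1"],
                "english_proficiency": int(row["english_proficiency"]),
                "text": text_path.read_text(encoding="utf-8"),
            }
```

Because the function is a generator, documents are read one at a time, which keeps memory use low on a large corpus.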

A simple benchmark (WIP)

In the benchmark folder there are 2 scripts:

  1. italki/italki.py - Loads the data using the Hugging Face Datasets library. You can reuse this for your own models.
  2. train.py - Trains a simple BERT model on the dataset.

Feel free to use and adapt these for your own research. To load the data through Hugging Face Datasets in your own script, you can write:

>>> import datasets
>>> ds = datasets.load_dataset("./benchmark/italki", data="../italki_data")
>>> print(ds["train"][0])
{"document": "Today I went to...", "native_language": "French", "proficiency": 5, ...}
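Before training, it is often useful to check how balanced the native-language labels are. The sketch below is an illustration, not part of the benchmark: `language_counts` is a hypothetical helper, and it assumes only that each record is a dict with a `native_language` key, as in the example output above.

```python
from collections import Counter


def language_counts(records, key="native_language"):
    """Count how many records carry each native-language label."""
    return Counter(record[key] for record in records)


if __name__ == "__main__":
    # Made-up records that mimic the shape of the example output above
    sample = [
        {"document": "Today I went to...", "native_language": "French"},
        {"document": "My name is...", "native_language": "Turkish"},
        {"document": "Yesterday I...", "native_language": "French"},
    ]
    for language, count in language_counts(sample).most_common():
        print(language, count)
```

Assuming the split yields dict records when iterated, as Hugging Face datasets normally do, it can be passed directly, for example `language_counts(ds["train"])`.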

Citation

If you use this dataset in your work, please cite:

@inproceedings{hudson2018development,
  title={On the Development of a Large Scale Corpus for Native Language Identification},
  author={Hudson, Thomas G and Jaf, Sardar},
  booktitle={Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13--14, 2018, Oslo University, Norway},
  number={155},
  pages={115--129},
  year={2018},
  organization={Link{\"o}ping University Electronic Press}
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
| --- | --- |
| name | Italki Native Language Identification Dataset |
| alternateName | Italki |
| url | |
| description | Native Language Identification (NLI) is the task of identifying an author’s native language from their writings in a second language. This dataset (italki) consists of large quantities of text from the language learning website italki. The italki website creates a community for language learners to access teaching resources, practice speaking, discuss topics and ask questions in their target language (the English language). We gather free-form ‘Notebook’ documents, which are mainly autobiographical diary entries with connected profiles describing the native language of the author. This repository contains scripts to download the data along with the ids to recreate the 2017 dataset. |
| citation | https://ep.liu.se/ecp/article.asp?issue=155&article=012 |
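The property names in this table match properties of the schema.org `Dataset` type, which dataset search engines read. As an illustration only (the repository does not say it ships such markup), the same metadata could be serialised as JSON-LD. `dataset_jsonld` is a hypothetical helper, and `url` is left out when empty because the table gives no value for it.

```python
import json


def dataset_jsonld(name, alternate_name, description, citation, url=None):
    """Serialise dataset metadata as a schema.org Dataset JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "alternateName": alternate_name,
        "description": description,
        "citation": citation,
    }
    if url:  # the table above leaves url empty, so it is optional here
        record["url"] = url
    return json.dumps(record, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(dataset_jsonld(
        name="Italki Native Language Identification Dataset",
        alternate_name="Italki",
        # First sentence of the description only, to keep the example short
        description=(
            "Native Language Identification (NLI) is the task of identifying "
            "an author's native language from their writings in a second language."
        ),
        citation="https://ep.liu.se/ecp/article.asp?issue=155&article=012",
    ))
```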

italkicorpus's Issues

`XXX not found`

It seems the italki API has changed, so the Python code no longer works. Alternatively, the IDs may have changed, which would result in 404 errors.
