PorSimplesSent

A Portuguese corpus of aligned sentence pairs to investigate sentence readability assessment

NILC

This corpus was created during my master's degree at ICMC-USP and was made possible by the Interinstitutional Center for Computational Linguistics - NILC (Núcleo Interinstitucional de Linguística Computacional), represented by my advisor Dra. Sandra Maria Aluísio and the linguistics specialist Dra. Magali Sanches Duran.

http://www.nilc.icmc.usp.br/nilc/index.php

License

CC BY 4.0

Citation

@inproceedings{leal2018pss,
    author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
    title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
    year = {2018},
    pages = {401–413},
    month = {August},
    date = {20-26},
    address = {Santa Fe, New Mexico, USA},
}

TSV format

All files are in Tab-Separated Values (TSV) format: fields are separated by a tab (also known as char(9) or \t) and rows by a newline (char(10) or \n).
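A minimal sketch of reading this format with Python's standard csv module. The sample rows are made up for illustration, and the README does not say whether the real files start with a header row or which encoding they use, so check both before relying on this.

```python
import csv
import io

# Made-up two-row sample standing in for a TSV file. A real file would be
# opened with open(path, encoding="utf-8", newline="") instead; the utf-8
# encoding is an assumption, not something the README states.
sample = '1\tORI\tA primeira frase.\n1\tNAT\tA frase "simples".\n'

def read_tsv(handle):
    """Return each row as a list of fields split on tab characters."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

rows = read_tsv(io.StringIO(sample))
```

Disabling quote handling matters for sentence data, because a sentence that contains a double quote would otherwise be parsed as a quoted field.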

PorSimples

In this folder you'll find the source corpus used to extract the sentence pairs, already exported in TSV format:

porsimples_sentences.tsv

  • production_id: Each triplet of texts (original, natural, strong) has a unique id, called production_id.
  • level: ORI (1 - Original), NAT (2 - Natural) or STR (3 - Strong).
  • text_id: Unique id for each text.
  • sentence_id: Unique id for each sentence.
  • paragraph: Sequential id for the paragraph in text.
  • sentence_text: The raw text from the sentence.

porsimples_aligns.tsv

  • production_id: See porsimples_sentences.tsv.
  • level: Simplification level ORI->NAT or NAT->STR.
  • text_id_from: Text id on the source side of the simplification.
  • sentence_id_from: Sentence id on the source side of the simplification.
  • text_id_to: Text id on the target side of the simplification.
  • sentence_id_to: Sentence id on the target side of the simplification.
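The two files can be combined by looking up each aligned sentence id in the sentences table. The sketch below uses made-up in-memory rows keyed by the field names listed above; it assumes sentence_id is unique across the whole corpus, as the field description suggests, and that the level value is written literally as "ORI->NAT".

```python
# Hypothetical rows; only the field names come from the README.
sentences = [
    {"sentence_id": "10", "level": "ORI", "sentence_text": "Frase original longa."},
    {"sentence_id": "20", "level": "NAT", "sentence_text": "Frase curta."},
    {"sentence_id": "21", "level": "NAT", "sentence_text": "Outra frase curta."},
]
aligns = [
    {"level": "ORI->NAT", "sentence_id_from": "10", "sentence_id_to": "20"},
    {"level": "ORI->NAT", "sentence_id_from": "10", "sentence_id_to": "21"},
]

def resolve_alignments(sentences, aligns):
    """Replace the sentence ids in each alignment with the sentence texts."""
    text_by_id = {s["sentence_id"]: s["sentence_text"] for s in sentences}
    return [
        (a["level"], text_by_id[a["sentence_id_from"]], text_by_id[a["sentence_id_to"]])
        for a in aligns
    ]

pairs = resolve_alignments(sentences, aligns)
```

Here one source sentence aligns to two target sentences, which is how a split shows up in the alignment table under these assumptions.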

PorSimplesSent (pss)

This folder contains the files with aligned pairs, from pss0 to pss3. They all share the same layout:

  • production_id: See porsimples_sentences.tsv.
  • level: Simplification level ORI->NAT, NAT->STR or ORI->STR.
  • changed: Whether the sentence was changed at this simplification level.
  • split: Whether the sentence was split at this simplification level.
  • sentence_text_from: The raw text of the source sentence.
  • sentence_text_to: The raw text of the target sentence.
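A small sketch of mapping one line of a pss file onto named fields. It assumes the columns appear in the order listed above and that the line is a data row rather than a header. The README does not document how changed and split are encoded, so they are kept as raw strings, and the "S"/"N" values in the sample are invented.

```python
from typing import NamedTuple

class PssRow(NamedTuple):
    production_id: str
    level: str
    changed: str  # raw text; the flag encoding is not documented
    split: str    # raw text; the flag encoding is not documented
    sentence_text_from: str
    sentence_text_to: str

def parse_pss_line(line: str) -> PssRow:
    """Split one tab-separated line into the six documented fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PssRow._fields):
        raise ValueError(f"expected {len(PssRow._fields)} fields, got {len(fields)}")
    return PssRow(*fields)

# Invented sample line for illustration only.
row = parse_pss_line("7\tORI->NAT\tS\tN\tFrase de origem.\tFrase de destino.\n")
```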

pss0 - Split sentences concatenated

Concatenates all sentences resulting from a split on the right side; may be useful for studying the simplification process.

  • pss0_align_concat_ori_nat.tsv
  • pss0_align_concat_nat_str.tsv

pss1 - All splits (1 to n)

Repeats the left-side sentence for each sentence resulting from the split.

  • pss1_align_all_splits_ori_nat.tsv
  • pss1_align_all_splits_nat_str.tsv
  • pss1_align_all_splits_ori_str.tsv
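The difference between the pss0 and pss1 layouts can be illustrated on a single split. This is only a sketch of the idea: the function names are hypothetical, and joining the split sentences with a single space is an assumption about how the concatenation was done.

```python
def concat_pair(source, targets):
    """pss0-style: one pair whose right side joins all split sentences."""
    return (source, " ".join(targets))

def all_split_pairs(source, targets):
    """pss1-style: repeat the source once per resulting sentence."""
    return [(source, target) for target in targets]

# Made-up example of one sentence split into two.
source = "O gato dormiu e o cachorro latiu."
targets = ["O gato dormiu.", "O cachorro latiu."]

pss0_pair = concat_pair(source, targets)
pss1_pairs = all_split_pairs(source, targets)
```

A split into n sentences therefore yields one pair in the pss0 layout and n pairs in the pss1 layout.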

pss2 - Major Length splits (1 to major(n))

Keeps only the resulting sentence with the greatest length and the most token overlap. The left-side sentence is repeated when two resulting split sentences have the same length and overlap.

  • pss2_align_length_ori_nat.tsv
  • pss2_align_length_nat_str.tsv
  • pss2_align_length_ori_str.tsv
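One way to read this selection rule is sketched below. Several details are assumptions rather than the corpus's documented procedure: tokens are obtained by lowercasing and splitting on whitespace, overlap is the number of distinct tokens shared with the source, and length is compared before overlap. The function names are hypothetical.

```python
def tokens(sentence):
    """Whitespace tokenization, lowercased (an assumption for this sketch)."""
    return sentence.lower().split()

def score(source, target):
    """Return (target length in tokens, distinct tokens shared with source)."""
    target_tokens = tokens(target)
    overlap = len(set(tokens(source)) & set(target_tokens))
    return (len(target_tokens), overlap)

def major_length_pairs(source, targets):
    """Keep the best-scoring target; keep every target when scores tie."""
    if not targets:
        return []
    best = max(score(source, t) for t in targets)
    return [(source, t) for t in targets if score(source, t) == best]

# Made-up examples: a clear winner, then a tie.
single = major_length_pairs(
    "o gato preto dormiu e o cachorro latiu",
    ["o gato preto dormiu", "o cachorro latiu"],
)
tied = major_length_pairs(
    "o gato dormiu e o cachorro latiu",
    ["o gato dormiu", "o cachorro latiu"],
)
```

Keeping tied candidates would explain why the pss2 layout can still contain a repeated left-side sentence.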

pss3 - No split sentences (1 to 1)

Only the sentences that were not split.

  • pss3_align_no_splits_ori_nat.tsv
  • pss3_align_no_splits_nat_str.tsv
  • pss3_align_no_splits_ori_str.tsv

PorSimplesSent - Triplets

The file triplets_length.tsv contains sentences from the 3 levels, generated from the pss2_length pairs, in the following layout:

  • production_id: See porsimples_sentences.tsv.
  • level: Fixed - ORI->NAT->STR.
  • changed_ori_nat: Whether the sentence changed from the original to the natural level.
  • changed_nat_str: Whether the sentence changed from the natural to the strong level.
  • original_text: The raw text of the original sentence.
  • natural_text: The raw text of the natural sentence.
  • strong_text: The raw text of the strong sentence.
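The two changed flags place each triplet in one of four groups. The sketch below assumes the flags have already been decoded to booleans, since the README does not document how they are written in the file; the function name, category labels, and sample flags are illustrative.

```python
from collections import Counter

def triplet_category(changed_ori_nat: bool, changed_nat_str: bool) -> str:
    """Name the group a triplet falls into, given its two changed flags."""
    if changed_ori_nat and changed_nat_str:
        return "simplified at both steps"
    if changed_ori_nat:
        return "simplified only ORI->NAT"
    if changed_nat_str:
        return "simplified only NAT->STR"
    return "no simplification"

# Made-up flags for four triplets.
flags = [(True, True), (True, False), (False, False), (True, False)]
counts = Counter(triplet_category(a, b) for a, b in flags)
```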

Statistics

Total sentences Original: 2907
      Zero Hora: 2067
      Caderno Ciencia FSP: 840
Total sentences Natural: 4066
Total sentences Strong: 4971
Total sentences ALL: 11944

Total sentences NO SIMPLIFICATION Original->Natural: 565
Total sentences NO SIMPLIFICATION Natural->Strong: 2619

Total sentences SPLIT Original->Natural: 826
Total sentences SPLIT Natural->Strong: 721

Total sentences Natural from split: 1990
Total sentences Strong from split: 1625

Total sentences SIMPLIFIED (no split) Original->Natural: 1515
Total sentences SIMPLIFIED (no split) Natural->Strong: 729

Total pairs simplified Original->Natural: 2340
Total pairs simplified Natural->Strong: 1450
Total pairs simplified Original->Strong: 1101
Total all pairs simplified: 4891

Total triplets NO SIMPLIFICATION 3 Levels: 393
Total triplets Simplified Only Original->Natural: 1297
Total triplets Simplified Only Natural->Strong: 181
Total triplets Simplified 3 Levels: 1099
Total triplets: 2970

Mean token size of sentences - simplified (no split) - Ori->Nat: 20
Min token size of sentences - simplified (no split) - Ori->Nat: 3
Max token size of sentences - simplified (no split) - Ori->Nat: 69

Mean token size of sentences - simplified (with split) - Ori->Nat: 33
Min token size of sentences - simplified (with split) - Ori->Nat: 6
Max token size of sentences - simplified (with split) - Ori->Nat: 54

Mean token size of sentences - simplified (no split) - Nat->Str: 22
Min token size of sentences - simplified (no split) - Nat->Str: 4
Max token size of sentences - simplified (no split) - Nat->Str: 57

Mean token size of sentences - simplified (with split) - Nat->Str: 24
Min token size of sentences - simplified (with split) - Nat->Str: 5
Max token size of sentences - simplified (with split) - Nat->Str: 49

Mean tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 6
Min tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 26

Mean tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 9
Min tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 1
Max tokens size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 64

Total PSS1 Original->Natural: 3504
Total PSS1 Natural->Strong: 4971
Total PSS1 Original->Strong: 2047
Overall total PSS1: 10522

Total PSS2 Original->Natural: 2370
Total PSS2 Natural->Strong: 1491
Total PSS2 Original->Strong: 1101
Overall total PSS2: 4962

Total PSS3 Original->Natural: 1515
Total PSS3 Natural->Strong: 729
Total PSS3 Original->Strong: 260
Overall total PSS3: 2504
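The grouped totals above can be checked with simple arithmetic. The sketch below copies the figures from the statistics and confirms that each group of parts adds up to its reported total; the group names are labels chosen for this example.

```python
# (parts, reported total) copied from the statistics above.
groups = {
    "original sentences by source": ([2067, 840], 2907),
    "sentences, all levels": ([2907, 4066, 4971], 11944),
    "simplified pairs": ([2340, 1450, 1101], 4891),
    "triplets": ([393, 1297, 181, 1099], 2970),
    "PSS1": ([3504, 4971, 2047], 10522),
    "PSS2": ([2370, 1491, 1101], 4962),
    "PSS3": ([1515, 729, 260], 2504),
}

# Collect the names of groups whose parts do not sum to the reported total.
mismatches = [
    name for name, (parts, total) in groups.items() if sum(parts) != total
]
```

With the figures as listed, mismatches is empty, so every reported total matches the sum of its parts.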
