PorSimplesSent

A Portuguese corpus of aligned sentence pairs to investigate sentence readability assessment

NILC

This corpus was created during my master's degree at ICMC-USP and was made possible by the Interinstitutional Center for Computational Linguistics (NILC, Núcleo Interinstitucional de Linguística Computacional), represented by my advisor, Dra. Sandra Maria Aluísio, and the linguistics specialist Dra. Magali Sanches Duran.

http://www.nilc.icmc.usp.br/nilc/index.php

License

CC BY 4.0

Citation

@inproceedings{leal2018pss,
    author = {Sidney Evaldo Leal and Magali Sanches Duran and Sandra Maria Aluísio},
    title = {A Nontrivial Sentence Corpus for the Task of Sentence Readability Assessment in Portuguese},
    booktitle = {Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018)},
    year = {2018},
    pages = {401--413},
    month = {August},
    date = {20-26},
    address = {Santa Fe, New Mexico, USA},
}

TSV format

All files are in Tab Separated Values (TSV) format: fields are separated by a tab (also known as char(9) or \t) and rows by a newline (char(10) or \n).
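
As a minimal sketch of reading this format, the following uses Python's standard csv module. The sample rows are made up for illustration, and disabling quote handling is an assumption based on the sentences being stored as raw text. Whether the files start with a header row is not stated here, so check the first line before relying on column positions.

```python
import csv
import io


def parse_tsv(stream):
    """Return every row of a tab-separated stream as a list of strings."""
    # QUOTE_NONE treats quotation marks as ordinary characters, so a sentence
    # containing '"' is not mistaken for a quoted field (an assumption here).
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Made-up rows standing in for a real file.
sample = io.StringIO('1\tORI\tEle disse: "sim".\n2\tNAT\tEle disse sim.\n')
rows = parse_tsv(sample)
```

For a file on disk, open it with open(path, encoding="utf-8", newline="") and pass the handle to parse_tsv; the UTF-8 encoding is also an assumption.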

PorSimples

In this folder you'll find the source corpus used to extract the sentence pairs, already exported in TSV format:

porsimples_sentences.tsv

production_id: Each triplet of texts (original, natural, strong) has a unique id, called production_id.
level: ORI (1 - Original), NAT (2 - Natural) or STR (3 - Strong).
text_id: Unique id for each text.
sentence_id: Unique id for each sentence.
paragraph: Sequential id for the paragraph in text.
sentence_text: The raw text from the sentence.

porsimples_aligns.tsv

production_id: See porsimples_sentences.tsv.
level: Simplification level ORI->NAT or NAT->STR.
text_id_from: Text id on the source side of the simplification.
sentence_id_from: Sentence id on the source side of the simplification.
text_id_to: Text id on the target side of the simplification.
sentence_id_to: Sentence id on the target side of the simplification.
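
The two files can be joined through the sentence ids. The sketch below assumes the rows have already been loaded as dictionaries keyed by the field names listed above, that ids are kept as strings, and that a split sentence appears as several alignment rows sharing the same sentence_id_from. The function name and the toy rows are hypothetical.

```python
from collections import defaultdict


def resolve_alignments(sentence_rows, align_rows):
    """Return (level, source_text, [target_texts]) for each aligned source."""
    # sentence_id is described as unique per sentence, so it can key a lookup.
    text_by_id = {row["sentence_id"]: row["sentence_text"] for row in sentence_rows}
    targets_by_source = defaultdict(list)
    for align in align_rows:
        key = (align["level"], align["sentence_id_from"])
        targets_by_source[key].append(text_by_id[align["sentence_id_to"]])
    return [
        (level, text_by_id[source_id], targets)
        for (level, source_id), targets in targets_by_source.items()
    ]


# Toy rows (not taken from the corpus): one original split into two sentences.
sentences = [
    {"sentence_id": "10", "sentence_text": "A casa é grande e o jardim é bonito."},
    {"sentence_id": "20", "sentence_text": "A casa é grande."},
    {"sentence_id": "21", "sentence_text": "O jardim é bonito."},
]
aligns = [
    {"level": "ORI->NAT", "sentence_id_from": "10", "sentence_id_to": "20"},
    {"level": "ORI->NAT", "sentence_id_from": "10", "sentence_id_to": "21"},
]
resolved = resolve_alignments(sentences, aligns)
```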

PorSimplesSent (pss)

This folder contains the files with aligned pairs, from pss0 to pss3. They all have the same layout:

production_id: See porsimples_sentences.tsv.
level: Simplification level ORI->NAT, NAT->STR or ORI->STR.
changed: Whether the sentence was changed at this simplification level.
split: Whether the sentence was split at this simplification level.
sentence_text_from: The raw text of the source sentence.
sentence_text_to: The raw text of the target sentence.

pss0 - Split sentences concatenated

All sentences resulting from a split are concatenated on the right side, which may be useful for studying the simplification process.

pss0_align_concat_ori_nat.tsv
pss0_align_concat_nat_str.tsv
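
The concatenation can be pictured with a small sketch. The function name is hypothetical, and joining the split sentences with a single space is an assumption; the separator actually used in the files is not stated here.

```python
def concat_splits(source, targets, separator=" "):
    """Build one pss0-style pair: the source and all of its splits joined."""
    return source, separator.join(targets)


# Toy example, not taken from the corpus.
pair = concat_splits(
    "A casa é grande e o jardim é bonito.",
    ["A casa é grande.", "O jardim é bonito."],
)
```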

pss1 - All splits (1 to n)

Repeats the left-side sentence for each resulting split sentence.

pss1_align_all_splits_ori_nat.tsv
pss1_align_all_splits_nat_str.tsv
pss1_align_all_splits_ori_str.tsv
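
A minimal sketch of this 1-to-n expansion, with a hypothetical function name and a toy sentence that is not taken from the corpus:

```python
def expand_splits(source, targets):
    """Build pss1-style pairs: the same source repeated for every target."""
    return [(source, target) for target in targets]


pairs = expand_splits(
    "A casa é grande e o jardim é bonito.",
    ["A casa é grande.", "O jardim é bonito."],
)
```

An unsplit sentence has a single target, so it yields exactly one pair.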

pss2 - Major Length splits (1 to major(n))

Keeps only the resulting sentence with the greatest length and the most token overlap with the source. The left-side sentence is repeated when two resulting split sentences have the same size and overlap.

pss2_align_length_ori_nat.tsv
pss2_align_length_nat_str.tsv
pss2_align_length_ori_str.tsv
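
One way to read this selection rule is sketched below. It is an interpretation, not the code used to build the corpus: it assumes lowercase whitespace tokenization, counts overlap as the number of distinct shared tokens, and ranks targets by length first and overlap second. All function names are hypothetical.

```python
def tokens(sentence):
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return sentence.lower().split()


def overlap(source, target):
    """Number of distinct tokens shared by source and target."""
    return len(set(tokens(source)) & set(tokens(target)))


def select_major_splits(source, targets):
    """Keep the target(s) with the best (length, overlap) score."""
    if not targets:
        return []
    scores = [(len(tokens(t)), overlap(source, t)) for t in targets]
    best = max(scores)
    # Ties on both length and overlap keep every tied target,
    # which repeats the source sentence on the left side.
    return [(source, t) for t, score in zip(targets, scores) if score == best]


source = "o menino correu e a menina pulou alto"
single = select_major_splits(source, ["o menino correu", "a menina pulou alto"])
tied = select_major_splits(source, ["o menino correu", "a menina pulou"])
```

In the first call only the four-token target is kept; in the second both targets score the same, so the source appears in two pairs.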

pss3 - No split sentences (1 to 1)

Only the sentences that were not split.

pss3_align_no_splits_ori_nat.tsv
pss3_align_no_splits_nat_str.tsv
pss3_align_no_splits_ori_str.tsv
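
As a sketch, the 1-to-1 filter amounts to keeping the sources that have exactly one target. The input shape, a list of (source, targets) tuples, and the function name are assumptions made for illustration.

```python
def keep_unsplit(grouped):
    """Keep pss3-style pairs: sources aligned to exactly one target."""
    return [(source, targets[0]) for source, targets in grouped if len(targets) == 1]


# Toy groups, not taken from the corpus.
grouped = [
    ("O gato, que era preto, dormia.", ["O gato preto dormia."]),
    ("A casa é grande e o jardim é bonito.", ["A casa é grande.", "O jardim é bonito."]),
]
unsplit = keep_unsplit(grouped)
```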

PorSimplesSent - Triplets

The file triplets_length.tsv contains sentences from the 3 levels, generated from the pss2_length pairs, in the following layout:

production_id: See porsimples_sentences.tsv.
level: Fixed - ORI->NAT->STR.
changed_ori_nat: If the sentence has changes from the original to the natural level.
changed_nat_str: If the sentence has changes from the natural to the strong level.
original_text: The raw text of the original sentence.
natural_text: The raw text of the natural sentence.
strong_text: The raw text of the strong sentence.
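
The sketch below shows one way such triplets could be chained from ORI->NAT and NAT->STR pairs. It is an illustration under assumptions: the pairs are joined on the natural sentence text (the original build may have used sentence ids instead), and the two changed flags are approximated by comparing texts. The function name and the toy sentences are hypothetical.

```python
from collections import defaultdict


def build_triplets(ori_nat_pairs, nat_str_pairs):
    """Chain (original, natural) and (natural, strong) pairs into triplets."""
    strong_by_natural = defaultdict(list)
    for natural, strong in nat_str_pairs:
        strong_by_natural[natural].append(strong)
    triplets = []
    for original, natural in ori_nat_pairs:
        for strong in strong_by_natural.get(natural, []):
            triplets.append({
                "level": "ORI->NAT->STR",
                # Text comparison stands in for the changed flags (assumption).
                "changed_ori_nat": original != natural,
                "changed_nat_str": natural != strong,
                "original_text": original,
                "natural_text": natural,
                "strong_text": strong,
            })
    return triplets


# Toy pairs, not taken from the corpus.
ori_nat = [("O gato, que era preto, dormia.", "O gato preto dormia.")]
nat_str = [("O gato preto dormia.", "O gato preto dormia.")]
triplets = build_triplets(ori_nat, nat_str)
```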

Statistics

Total sentences Original: 2907
      Zero Hora: 2067
      Caderno Ciencia FSP: 840
Total sentences Natural: 4066
Total sentences Strong: 4971
Total sentences ALL: 11944

Total sentences NO SIMPLIFICATION Original->Natural: 565
Total sentences NO SIMPLIFICATION Natural->Strong: 2619

Total sentences SPLIT Original->Natural: 826
Total sentences SPLIT Natural->Strong: 721

Total sentences Natural from split: 1990
Total sentences Strong from split: 1625

Total sentences SIMPLIFIED (no split) Original->Natural: 1515
Total sentences SIMPLIFIED (no split) Natural->Strong: 729

Total pairs simplified Original->Natural: 2340
Total pairs simplified Natural->Strong: 1450
Total pairs simplified Original->Strong: 1101
Total all pairs simplified: 4891

Total triplets NO SIMPLIFICATION 3 Levels: 393
Total triplets Simplified Only Original->Natural: 1297
Total triplets Simplified Only Natural->Strong: 181
Total triplets Simplified 3 Levels: 1099
Total triplets: 2970

Mean token size of sentences - simplified (no split) - Ori->Nat: 20
Min token size of sentences - simplified (no split) - Ori->Nat: 3
Max token size of sentences - simplified (no split) - Ori->Nat: 69

Mean token size of sentences - simplified (with split) - Ori->Nat: 33
Min token size of sentences - simplified (with split) - Ori->Nat: 6
Max token size of sentences - simplified (with split) - Ori->Nat: 54

Mean token size of sentences - simplified (no split) - Nat->Str: 22
Min token size of sentences - simplified (no split) - Nat->Str: 4
Max token size of sentences - simplified (no split) - Nat->Str: 57

Mean token size of sentences - simplified (with split) - Nat->Str: 24
Min token size of sentences - simplified (with split) - Nat->Str: 5
Max token size of sentences - simplified (with split) - Nat->Str: 49

Mean token size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 6
Min token size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 1
Max token size diff of sentences - Originals vs simplified (no split) - Ori->Nat: 26

Mean token size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 9
Min token size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 1
Max token size diff of sentences - Originals vs simplified (with split) - Ori->Nat: 64

Total PSS1 Original->Natural: 3504
Total PSS1 Natural->Strong: 4971
Total PSS1 Original->Strong: 2047
Grand total PSS1: 10522

Total PSS2 Original->Natural: 2370
Total PSS2 Natural->Strong: 1491
Total PSS2 Original->Strong: 1101
Grand total PSS2: 4962

Total PSS3 Original->Natural: 1515
Total PSS3 Natural->Strong: 729
Total PSS3 Original->Strong: 260
Grand total PSS3: 2504
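
The reported totals can be cross-checked with simple sums. The snippet below copies the counts from the statistics above and lists any group whose parts do not add up to its stated total; the dictionary labels are only descriptive names chosen for this sketch.

```python
# Each entry: (parts copied from the statistics, reported total).
checks = {
    "sentences (all levels)": ([2907, 4066, 4971], 11944),
    "original by source": ([2067, 840], 2907),
    "simplified pairs": ([2340, 1450, 1101], 4891),
    "triplets": ([393, 1297, 181, 1099], 2970),
    "PSS1": ([3504, 4971, 2047], 10522),
    "PSS2": ([2370, 1491, 1101], 4962),
    "PSS3": ([1515, 729, 260], 2504),
}

# Collect every group whose parts do not sum to the reported total.
mismatches = {
    name: (sum(parts), total)
    for name, (parts, total) in checks.items()
    if sum(parts) != total
}
print(mismatches)  # prints {}
```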
