mtresearcher / terminology_dataset

Terminology Dataset

License: Apache License 2.0

Python 100.00%

terminology_dataset's Issues

How to limit the amount of added data to approximately 10% of the original data

I am deeply interested in your research paper and would like to do some follow-up research.
Could you explain in detail how you limited the amount of added data to approximately 10% of the original data?

Did you ignore every match of certain specific entries?
-- For example, if the word "thank" is chosen as an entry to ignore, every occurrence of "thank" in the corpus is ignored.
Or did you ignore matched terms on a per-occurrence basis?
-- For example, "thank" in the first sentence is ignored, but "thank" in the second sentence may not be.
Did you decide in advance which sentences would have their matched terms ignored?
In other words, before term matching, did you split the sentences into a 90% portion and a 10% portion and match terms only in the 10% portion?
Or, as a result of ignoring term matches, did the sentences containing term annotations end up amounting to approximately 10% of the original data?
Is it possible to ignore specific matched terms when one sentence contains multiple matched terms?
-- For example, if "thank", "common", and "vote" are all matched in one sentence, is it possible to ignore only "thank"?
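
To make the difference between the first two options concrete, here is a minimal Python sketch. It is only an illustration of the question, not the authors' procedure: the function names, the whitespace tokenization, and the `keep_prob` parameter are all assumptions introduced here.

```python
import random

def annotate_by_entry(sentences, terms, ignored_entries):
    """Entry-level filtering: an ignored entry is skipped everywhere in the corpus."""
    return [
        [t for t in terms if t in s.split() and t not in ignored_entries]
        for s in sentences
    ]

def annotate_by_occurrence(sentences, terms, keep_prob, seed=0):
    """Occurrence-level filtering: each individual match is kept with probability keep_prob."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same selection
    return [
        [t for t in terms if t in s.split() and rng.random() < keep_prob]
        for s in sentences
    ]
```

With entry-level filtering, "thank" is never annotated once it is on the ignore list. With occurrence-level filtering, the same word can be annotated in one sentence and skipped in another, and other matches in the same sentence are decided independently.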

Problem-solving steps for getting the test set

git clone https://github.com/mtresearcher/terminology_dataset.git && cd terminology_dataset

The print_lines.py file still uses Python 2.x syntax. To run it with Python 3.x, replace the line print line with print(line).
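
If you would rather not edit the file by hand, the same change can be sketched as a small Python helper. This assumes the statement appears in the script exactly as a bare `print line` on its own line; the helper name `fix_print_statement` is made up for this example, and the rest of print_lines.py is not shown here.

```python
import re

# Matches a bare Python 2 statement `print line` on its own line,
# capturing the leading indentation so it can be preserved.
PY2_PRINT = re.compile(r"^([ \t]*)print[ \t]+line[ \t]*$", re.MULTILINE)

def fix_print_statement(source):
    """Return the source text with `print line` rewritten as `print(line)`."""
    return PY2_PRINT.sub(r"\1print(line)", source)
```

Read the file contents into a string, pass the string through `fix_print_statement`, and write the result back. Lines that already use `print(line)` are left unchanged.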

Then run this corrected command:

wget http://data.statmt.org/wmt17/translation-task/test.tgz && tar -xvzf test.tgz

Move print_lines.py and all the data from the iate and wiktionary folders into the test folder.

Enter the test folder and run the following commands:

cat newstest2017-ende-src.en.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.en
cat newstest2017-ende-ref.de.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.de

for term_file in iate.{414,581}.terminology.tsv
do
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done

for term_file in wikt.{727,975}.terminology.tsv
do
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done

Now you have the iate and wiktionary newstest2017 test sets.
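
If perl is not available, the two `cat | grep | perl` commands that produce newstest2017.en and newstest2017.de can be mirrored in Python. The sketch below follows the shell pipeline step by step (keep lines containing "seg id", then strip the opening and closing seg tags). The function name `extract_segments` is an assumption for illustration and is not part of the repository.

```python
import re

# Same pattern as the first perl substitution: <seg id="123">
SEG_OPEN = re.compile(r'<seg id="[0-9]*">')

def extract_segments(sgm_text):
    """Return the segment text of an SGM file, one entry per line that contains a seg tag."""
    segments = []
    for line in sgm_text.splitlines():
        if "seg id" not in line:           # equivalent of: grep "seg id"
            continue
        line = SEG_OPEN.sub("", line)      # equivalent of: s/<seg id="[0-9]*">//g
        line = line.replace("</seg>", "")  # equivalent of: s/<\/seg>//g
        segments.append(line)
    return segments
```

Like the shell version, this does not decode XML entities or trim whitespace, so the output should match the pipeline line for line.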

How to get the list of entries occurring in the top 500 most frequent English words

I am deeply interested in your research paper and would like to do some follow-up research.
Could you share the specific list of the top 500 most frequent English words, or explain how to build it?
I tried the following three methods, but all of them failed.

Searching the Internet for well-known lists of the top 500 most frequent English words.
Taking the 500 most frequent English words in the Europarl and News Commentary data (2.2 million sentences in total).
Taking the 500 most frequent English entries (from both Wiktionary and IATE) matched against the Europarl and News Commentary data.
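
For reference, the second method can be sketched in Python roughly as follows. Lowercasing, whitespace tokenization, and both function names are assumptions made for this example. The authors' actual preprocessing is not known, which may be exactly why the resulting list differs from theirs.

```python
from collections import Counter

def top_frequent_words(sentences, n=500):
    """Return the n most frequent lowercased, whitespace-separated tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return [word for word, _ in counts.most_common(n)]

def drop_frequent_entries(entries, frequent_words):
    """Remove terminology entries that appear in the frequent-word list."""
    frequent = set(frequent_words)
    return [entry for entry in entries if entry.lower() not in frequent]
```

Changing the tokenizer, the casing, or the corpus that is counted will change which entries fall inside the top 500, so those details would need to be confirmed with the authors.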