mtresearcher / terminology_dataset
Terminology Dataset
License: Apache License 2.0
I am deeply interested in your research paper, and I would like to do some follow-up research.
Could you explain in detail how you limited the amount of added data to approximately 10% of the original data?
Did you ignore every match of certain specific entries?
-- For example, if the word "thank" is chosen as an entry to ignore, every occurrence of "thank" in the corpus is ignored.
Or did you ignore matched terms on a per-occurrence basis?
-- For example, the word "thank" in the first sentence is ignored, but the word "thank" in the second sentence may not be ignored.
Did you decide in advance in which sentences the matched terms would be ignored?
In other words, before term matching, did you split the sentences into a 90% portion and a 10% portion, and match terms only in the 10% portion?
Or, as a result of ignoring term matches, did the added sentences containing term annotations amount to approximately 10% of the original data?
Is it possible to ignore specific matched terms when there are multiple matched terms in one sentence?
-- For example, if the words "thank", "common", and "vote" are matched in one sentence, is it possible to ignore only "thank"?
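To make the two alternatives in the question concrete, here is a minimal Python sketch. Nothing in it comes from the paper or the repository: the function names (match_terms, ignore_by_entry, ignore_by_occurrence), the whitespace tokenization, and the keep_prob parameter are all hypothetical and only illustrate the difference between ignoring an entry everywhere and ignoring individual matches.

```python
import random


def match_terms(sentence, terms):
    """Return the terms (in the order given) that occur as tokens in the sentence."""
    tokens = sentence.lower().split()
    return [t for t in terms if t in tokens]


def ignore_by_entry(sentences, terms, ignored_entries):
    """Alternative 1: an ignored entry is dropped everywhere in the corpus."""
    kept = [t for t in terms if t not in ignored_entries]
    return [match_terms(s, kept) for s in sentences]


def ignore_by_occurrence(sentences, terms, keep_prob, seed=0):
    """Alternative 2: every individual match is kept with probability keep_prob,
    so the same term may be kept in one sentence and ignored in another."""
    rng = random.Random(seed)
    return [
        [t for t in match_terms(s, terms) if rng.random() < keep_prob]
        for s in sentences
    ]
```

With the first function, "thank" never appears in any annotation once it is listed in ignored_entries. With the second, whether "thank" is annotated is decided separately for each sentence, which also allows ignoring only one of several matches within a sentence.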
git clone https://github.com/mtresearcher/terminology_dataset.git && cd terminology_dataset
The print_lines.py file still uses Python 2.x syntax. To run it with Python 3.x, replace the line
print line
with
print(line)
Then download and unpack the WMT17 test data:
wget http://data.statmt.org/wmt17/translation-task/test.tgz && tar -xvzf test.tgz
Move print_lines.py and all the data from the iate and wiktionary folders into the test folder.
Enter the test folder and run the following commands:
cat newstest2017-ende-src.en.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.en
cat newstest2017-ende-ref.de.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.de
for term_file in iate.{414,581}.terminology.tsv
do
    python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
    python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done
for term_file in wikt.{727,975}.terminology.tsv
do
    python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
    python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done
Now you have the IATE and Wiktionary newstest2017 test sets.
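If perl is not available, the two segment-extraction commands (cat, grep and perl on the .sgm files) can be approximated in Python. The following is a minimal sketch that mirrors those commands, not part of the repository: the function names extract_segments and sgm_to_text are hypothetical. Like the perl version, it keeps only lines containing "seg id", strips the seg tags, and does not unescape XML entities.

```python
import re

# Same patterns as the two perl substitutions in the shell pipeline.
SEG_OPEN = re.compile(r'<seg id="[0-9]*">')
SEG_CLOSE = re.compile(r'</seg>')


def extract_segments(lines):
    """Yield every line containing 'seg id' with its seg tags removed."""
    for line in lines:
        if "seg id" in line:  # equivalent of: grep "seg id"
            line = SEG_OPEN.sub("", line)
            line = SEG_CLOSE.sub("", line)
            yield line


def sgm_to_text(sgm_path, out_path):
    """Write the plain-text segments of an .sgm file to out_path, one per line."""
    with open(sgm_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(extract_segments(src))
```

For example, sgm_to_text("newstest2017-ende-src.en.sgm", "newstest2017.en") would correspond to the first of the two commands, and the same call with the ref.de.sgm file and newstest2017.de to the second.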
I am deeply interested in your research paper, and I would like to do some follow-up research.
Could you share the specific list of the top 500 most frequent English words, or explain how to build it?
I tried the following three methods, but all of them failed.