terminology_dataset's People

Contributors

mtresearcher

terminology_dataset's Issues

How did you limit the added data to approximately 10% of the original data?

I am deeply interested in your research paper and would like to do some follow-up research.
Could you explain in detail how you limited the amount of added data to approximately 10% of the original data?

  1. Did you ignore every matched term that is identical to a specific entry?
    -- For example, if "thank" is chosen as an entry to ignore, every occurrence of "thank" in the corpus is ignored.
    Or did you ignore matched terms on a per-occurrence basis?
    -- For example, "thank" in the first sentence is ignored, but "thank" in the second sentence may not be.

  2. Did you decide in advance in which sentences the matched terms would be ignored?
    In other words, before term matching, did you split the sentences into 90% and 10% and match terms only in the 10%?
    Or, as a result of ignoring term matches, did the sentences containing term annotations end up amounting to approximately 10% of the original data?

  3. Is it possible to ignore specific matched terms when several terms match in one sentence?
    -- For example, if "thank", "common", and "vote" all match in one sentence, is it possible to ignore only "thank"?
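To make the difference between these readings concrete, here is a minimal sketch of two of them. It is not the authors' procedure, which is exactly what this issue asks about. The function names, whole-token matching on whitespace-split text, and the 0.1 ratio are all assumptions made for illustration.

```python
import random


def annotate_by_sentence_sample(sentences, terms, ratio=0.1, seed=0):
    """Reading from question 2: choose ~ratio of the sentences first,
    then match terms only inside the chosen sentences."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sentences)), int(len(sentences) * ratio)))
    result = []
    for index, sentence in enumerate(sentences):
        tokens = sentence.split()
        matched = [t for t in terms if t in tokens] if index in chosen else []
        result.append((sentence, matched))
    return result


def annotate_by_occurrence(sentences, terms, keep_prob=0.1, seed=0):
    """Per-occurrence reading: match in every sentence, but keep each
    individual match only with probability keep_prob."""
    rng = random.Random(seed)
    result = []
    for sentence in sentences:
        tokens = sentence.split()
        matched = [t for t in terms if t in tokens and rng.random() < keep_prob]
        result.append((sentence, matched))
    return result
```

Under the per-occurrence reading, a sentence matching "thank", "common", and "vote" can keep some matches and drop others, which is the situation question 3 asks about. Under the sentence-sample reading, a sentence keeps either all of its matches or none.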

Steps to solve problems when getting the test set

git clone https://github.com/mtresearcher/terminology_dataset.git && cd terminology_dataset

The print_lines.py file still uses Python 2.x syntax. To run it with Python 3.x, replace the line print line with print(line).
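For readers unfamiliar with the change, the sketch below shows the Python 3 form of the print call. The helper select_lines is hypothetical and is not the contents of print_lines.py. It only assumes that the script prints the input lines whose numbers are listed, and that numbering is 1-based.

```python
def select_lines(lines, wanted_numbers):
    """Yield the lines whose 1-based position is in wanted_numbers.

    The 1-based convention is an assumption, not taken from the repository.
    """
    wanted = set(wanted_numbers)
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield line.rstrip("\n")


sample = ["first\n", "second\n", "third\n"]
for line in select_lines(sample, [1, 3]):
    # Python 2 spelling was: print line
    print(line)
```

In Python 3, print is a function, so the Python 2 statement form raises a SyntaxError.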

Then run this corrected command to download and extract the test data:

wget http://data.statmt.org/wmt17/translation-task/test.tgz && tar -xvzf test.tgz

Move print_lines.py and all the data from the iate and wiktionary folders into the test folder.

Enter the test folder and run the following commands:

cat newstest2017-ende-src.en.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.en
cat newstest2017-ende-ref.de.sgm | grep "seg id" | perl -pe "s/<seg id=\"[0-9]*\">//g" | perl -pe "s/<\/seg>//g" > newstest2017.de

for term_file in iate.{414,581}.terminology.tsv
do
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done

for term_file in wikt.{727,975}.terminology.tsv
do
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.en > newstest2017-${term_file}.en
python print_lines.py -l=<(cut -f2 ${term_file}) < newstest2017.de > newstest2017-${term_file}.de
done

Now you have the iate and wiktionary newstest2017 sets.
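If perl is not available, the segment extraction done by the two cat | grep | perl pipelines above can be approximated in Python. This is a sketch that mirrors those commands: it keeps lines containing "seg id", then strips the opening and closing seg tags. The function name extract_segments is an assumption for illustration.

```python
import re

# Same pattern as the perl substitution: <seg id="N">
SEG_OPEN = re.compile(r'<seg id="[0-9]*">')


def extract_segments(sgm_text):
    """Return one string per line that contains a <seg id="N"> element,
    with the opening and closing seg tags removed."""
    segments = []
    for line in sgm_text.splitlines():
        if "seg id" in line:
            line = SEG_OPEN.sub("", line)
            line = line.replace("</seg>", "")
            segments.append(line)
    return segments
```

A possible use is to read newstest2017-ende-src.en.sgm as UTF-8 text, pass it to extract_segments, and write the result to newstest2017.en, one segment per line. Like the shell version, this does not unescape SGML entities.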

How to get the list of entries occurring in the top 500 most frequent English words

I am deeply interested in your research paper and would like to do some follow-up research.
Could you share the specific list of the top 500 most frequent English words, or explain how to build it?
I tried the following three methods, but all of them failed.

  1. Searching the Internet for well-known lists of the top 500 most frequent English words.
  2. Taking the 500 most frequent English words in the Europarl and News Commentary data (2.2 million sentences in total).
  3. Taking the 500 most frequent English entries (from both Wiktionary and IATE) matched against the Europarl and News Commentary data.
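For reference, the second method can be sketched as follows. This is not the authors' list or procedure. Lowercasing, whitespace tokenization, and the function names are assumptions, and a different tokenization would give a different top 500.

```python
from collections import Counter


def top_frequent_words(sentences, n=500):
    """Return the n most frequent lowercased, whitespace-separated tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return [word for word, _ in counts.most_common(n)]


def drop_frequent_entries(entries, frequent_words):
    """Remove terminology entries that appear in the frequent-word list."""
    frequent = set(frequent_words)
    return [entry for entry in entries if entry.lower() not in frequent]
```

For example, top_frequent_words(["the cat", "the dog", "the cat sat"], n=2) returns ["the", "cat"], and drop_frequent_entries(["cat", "vote"], ["the", "cat"]) returns ["vote"].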
