bitextor / bitextor

Bitextor generates translation memories from multilingual websites

Home Page: https://bitextor.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: Shell 16.61%, Perl 0.80%, Python 75.02%, CMake 0.77%, C++ 5.98%, Dockerfile 0.82%
Topics: document-aligner, apertium, dictionaries, crawler, wget, hunalign, sentence-segmentation, tokenizer, bicleaner, tmx

bitextor's Introduction

Bitextor


Bitextor is a tool to automatically harvest bitexts from multilingual websites. To run it, you need to provide:

  1. The source to be searched for parallel data: one or more websites (namely, Bitextor needs website hostnames or WARC files)
  2. The two languages the user is interested in: language IDs must be provided following ISO 639-1
  3. A source of bilingual information between these two languages: either a bilingual lexicon (such as those available in the bitextor-data repository), a machine translation (MT) system, or a parallel corpus to be used to produce either a lexicon or an MT system (depending on the alignment strategy chosen; see below)
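These three inputs come together in the YAML configuration file. Below is a minimal illustrative sketch; the key names shown (hosts, lang1, lang2, dic, and the directory settings) are assumptions based on the configuration docs, so verify them against the exhaustive configuration reference before use:

```yaml
# Hypothetical minimal configuration; check key names against the docs.
permanentDir: ~/bitextor-output     # where final corpora are written
dataDir: ~/bitextor-data            # downloaded and intermediate data
transientDir: /tmp/bitextor         # scratch space

hosts: ["www.example.com"]          # 1. websites to crawl

lang1: en                           # 2. ISO 639-1 language pair
lang2: fr

dic: ~/dictionaries/en-fr.dic       # 3. bilingual lexicon (one option)

tmx: true                           # also produce TMX output
```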

Installation

Bitextor can be installed via Docker, Conda or built from source. See instructions here.

Usage

usage: bitextor [-C FILE [FILE ...]] [-c KEY=VALUE [KEY=VALUE ...]]
                [-j JOBS] [-k] [--notemp] [--dry-run]
                [--forceall] [--forcerun [TARGET [TARGET ...]]]
                [-q] [-h]

launch Bitextor

Bitextor config:
  -C FILE [FILE ...], --configfile FILE [FILE ...]
                        Bitextor YAML configuration file
  -c KEY=VALUE [KEY=VALUE ...], --config KEY=VALUE [KEY=VALUE ...]
                        Set or overwrite values for Bitextor config

Optional arguments:
  -j JOBS, --jobs JOBS  Number of provided cores
  -k, --keep-going      Go on with independent jobs if a job fails
  --notemp              Disable deletion of intermediate files marked as temporary
  --dry-run             Do not execute anything and display what would be done
  --forceall            Force rerun every job
  --forcerun TARGET [TARGET ...]
                        List of files and rules that shall be re-created/re-executed
  -q, --quiet           Do not print job information
  -h, --help            Show this help message and exit

Advanced usage

Bitextor uses Snakemake to define its workflow and manage its execution. Snakemake provides a lot of flexibility in configuring the execution of the pipeline. For advanced users who want to make the most of this tool, the bitextor-full command is provided; it calls the Snakemake CLI with Bitextor's workflow and exposes all of Snakemake's parameters.

Execution on a cluster

To run Bitextor on a cluster managed by job-queueing software, it is recommended to use the bitextor-full command together with Snakemake's cluster configuration.

Bitextor configuration

Bitextor uses a configuration file to define the variables required by the pipeline. Depending on the options defined in this configuration file, the pipeline can behave differently, running alternative tools and functionalities. For more information, consult this exhaustive overview of all the options that can be set in the configuration file and how they affect the pipeline.

Suggestion: a configuration wizard called bitextor-config is installed with Bitextor to help with this task. Furthermore, a minimalist sample configuration file is provided in this repository. You can take it as a starting point, changing all the paths to match your environment.

Bitextor output

Bitextor generates the final parallel corpora in multiple formats. These files are placed in the permanentDir folder and follow the naming pattern {lang1}-{lang2}.{prefix}.gz, where {prefix} is a descriptor of the corresponding format. The files that may be produced are:

  • {lang1}-{lang2}.raw.gz - default (always generated)
  • {lang1}-{lang2}.sent.gz - default
  • {lang1}-{lang2}.not-deduped.tmx.gz - generated if tmx: true
  • {lang1}-{lang2}.deduped.tmx.gz - generated if deduped: true
  • {lang1}-{lang2}.deduped.txt.gz - generated if deduped: true
  • {lang1}-{lang2}.not-deduped.roamed.tmx.gz - generated if biroamer: true and tmx: true
  • {lang1}-{lang2}.deduped.roamed.tmx.gz - generated if biroamer: true and deduped: true

See detailed description of the output files.
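For quick inspection of these outputs, the sketch below assumes the text formats are gzipped, tab-separated files (the exact column layout varies per file variant and is documented in the output description, so the caller must map each row's fields accordingly):

```python
import gzip

def read_bitextor_tsv(path):
    """Stream rows from a gzipped tab-separated Bitextor output file.

    The column layout is NOT fixed here: each row is yielded as the raw
    list of tab-separated fields, to be mapped to the documented columns
    for the specific file variant (.raw, .sent, .deduped.txt, ...).
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n").split("\t")
```

For example, iterating over `read_bitextor_tsv("en-fr.sent.gz")` yields one field list per translation unit.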

Pipeline description

Bitextor is a pipeline that runs a collection of scripts to produce a parallel corpus from a collection of multilingual websites. The pipeline is divided into five stages:

  1. Crawling: documents are downloaded from the specified websites
  2. Pre-processing: downloaded documents are normalized, boilerplates are removed, plain text is extracted, and language is identified
  3. Document alignment: parallel documents are identified. Two strategies are implemented for this stage:
    • one using bilingual lexica and a collection of features extracted from HTML; a linear regressor combines these resources to produce a score in [0,1], and
    • another using machine translation and a TF/IDF strategy to score document pairs
  4. Segment alignment: each pair of documents is processed to identify parallel segments. Again, two strategies are implemented:
    • one using the tool Hunalign, and
    • another using Bleualign, which can only be used with the MT-based document-alignment strategy (machine translations are used by both methods)
  5. Post-processing: final steps that clean the parallel corpus using Bicleaner, deduplicate translation units, and compute additional quality metrics
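To make the second strategy of stage 3 concrete, here is a toy sketch of TF/IDF document scoring (an illustration of the general technique only, not Bitextor's implementation; in practice one side would first be machine-translated into the other language before vectorizing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF/IDF vector (dict: token -> weight) per document."""
    df = Counter()                      # document frequency per token
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency in this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Candidate document pairs with the highest cosine score would then be proposed as parallel.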

The following diagram shows the structure of the pipeline and the different scripts that are used in each stage:


Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.


bitextor's Issues

bleualign executable

At the moment, bleualign is not integrated into the makefile, so the script expects the executable to be in
bleualign-cpp/build
It would be better to copy the executable to
bin
where all the other binaries are

Build fails on zipporah

On a clean checkout, running make gives:

make[1]: Entering directory '/mnt/saga0/bhaddow/code/bitextor/test'
Making all in zipporah
make[2]: Entering directory '/mnt/saga0/bhaddow/code/bitextor/test/zipporah'
make[2]: *** No rule to make target 'generate-bow-xent', needed by 'all-am'.  Stop.
make[2]: Leaving directory '/mnt/saga0/bhaddow/code/bitextor/test/zipporah'
Makefile:507: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/mnt/saga0/bhaddow/code/bitextor/test'
Makefile:390: recipe for target 'all' failed
make: *** [all] Error 2

Removing generate-bow-xent from zipporah/Makefile.am allows the build to succeed, but I am not sure that is the right solution.

INSTALL file says to run configure, but there isn't one.

According to the INSTALL file:

Basic Installation
==================

   Briefly, the shell commands `./configure; make; make install' should
configure, build, and install this package.  The following
more-detailed instructions are generic; see the `README' file for
instructions specific to this package.  Some packages provide this
`INSTALL' file but do not implement all of the features documented
below.  The lack of an optional feature in a given package is not
necessarily a bug.  More recommendations for GNU packages can be found
in *note Makefile Conventions: (standards)Makefile Conventions.

I think one needs to run autogen.sh.

Installation problem

Dear devs

I cloned the code, and when trying to run ./configure I get the error "bash: ./configure: No such file or directory".

Could you please help me install it?

Thanks

Reference files by path relative to self

Bitextor scripts sometimes need to call other scripts. The best practice is for an executable to determine where it is located and then call other scripts relative to that path.

For example foo.sh can call bar.sh in the same source directory:

exec "$(dirname "$0")"/bar.sh

or in python using __file__

There should be no __PREFIX__ in the source.
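The same idiom in Python, using __file__ as suggested above (a generic sketch, not one of Bitextor's scripts; run_sibling and its arguments are hypothetical):

```python
import os
import subprocess

# Directory containing this script, resolved once at import time.
HERE = os.path.dirname(os.path.abspath(__file__))

def run_sibling(script_name, *args):
    """Invoke a script that lives next to this one by absolute path,
    so the call works regardless of the caller's working directory."""
    return subprocess.run([os.path.join(HERE, script_name), *args], check=True)
```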

error in bitextor-identifyMIME.py

Input file attached

bitextor-identifyMIME.py < 18thcenturyblog.tt

...
Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-identifyMIME.py", line 33, in
magicoutput=m.buffer(base64.b64decode(content)).split(" ")
File "/usr/lib/python3.6/base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
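One common workaround for this class of error, sketched here as an assumption rather than the fix the maintainers actually applied, is to restore the missing '=' padding before decoding (base64 input length must be a multiple of 4):

```python
import base64

def b64decode_padded(data):
    """Decode base64 text, tolerating missing '=' padding.

    Truncated payloads raise binascii.Error("Incorrect padding");
    appending the missing '=' characters restores decodability.
    """
    if isinstance(data, str):
        data = data.encode("ascii")
    return base64.b64decode(data + b"=" * (-len(data) % 4))
```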

Can't install bitextor Ubuntu 16.04

Hello,

I can't install bitextor on Ubuntu 16.04 in a virtual env (I'm using python3 -m venv myenv).

Sometimes it gives sudo issues, sometimes it requires a specific version of python-Levenshtein or keras, etc.

Could you please provide clear step-by-step instructions for installing in a virtual env? Many thanks

Best regards

JY_some_problem_in_script_bitextor-lett2lettr

In the script bitextor-lett2lettr, the handle_data(self, data) function of the Parser(html.parser.HTMLParser) class is not language-agnostic: its logic splits the content by the space character (" "), but some languages, such as Chinese, do not put spaces between words.
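A minimal reproduction of the reported limitation (hypothetical snippet, not the script's actual code): splitting on a literal space yields word tokens for English but one unbroken token for Chinese.

```python
def naive_tokens(text):
    """Split on a literal space, as the reported handle_data logic does."""
    return [t for t in text.split(" ") if t]

# English: spaces separate words, so this happens to work.
# Chinese: no spaces between words, so the whole string is one token.
```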

I get the message "line 590: [: too many arguments" when I try to crawl the tutorialspoint website

When I executed this command:
bitextor -u https://www.tutorialspoint.com/numpy/numpy_advanced_indexing.htm -b 1 -v /home/omar/dict.bitextor -x en es

I got:
/usr/local/bin/bitextor: line 590: [: too many arguments
/usr/local/bin/bitextor: line 594: [: too many arguments
It takes a very long time and then outputs an empty TMX string.
Is this a problem with my installation, or is it just a warning?

bitextor-get-html-text.py error

bitextor-get-html-text.py < 0xcc.ttmime
output:

Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-get-html-text.py", line 81, in
document = html5lib.parse(ftfy.fix_text(Cleaner(style=True, links=True, add_nofollow=True,page_structure=False, safe_attrs_only=False).clean_html(base64.b64decode(fields[3]).decode("utf8"))),treebuilder="lxml",namespaceHTMLElements=False)
File "src/lxml/html/clean.py", line 518, in lxml.html.clean.Cleaner.clean_html
File "/usr/local/lib/python3.6/dist-packages/lxml/html/init.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/init.py", line 762, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
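lxml rejects decoded str input that still carries an encoding declaration. A typical workaround, shown as a sketch (the project may have fixed this differently), is to strip the declaration before parsing, or to pass bytes instead:

```python
import re

# Matches a leading <?xml ... ?> declaration, which lxml rejects when the
# input is already a decoded str (the declared encoding would be stale).
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')

def strip_xml_declaration(text):
    """Remove a leading XML declaration so decoded str input parses."""
    return _XML_DECL.sub("", text, count=1)
```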

0xcc.zip

Error in rule mkcls:

/home/hieu/permanent/software/bitextor/bin/mkcls -c50 -n2 -p/tmp/transient/tempcorpuspreproc.fr-en/corpus.clean.fr -V/tmp/transient/tempgizamodel.fr-en/corpus.fr.vcb.classes opt

/home/hieu/permanent/software/bitextor/bin/mkcls: 1: eval: /home/hieu/permanent/software/bitextor/bin/clustercat: not found

pip installation advice incorrect on Ubuntu 16.04

Ubuntu 16.04 is rather inconsistent when it comes to Python versioning.

Just calling python yields 2.7.12:

python --version
Python 2.7.12

But when I call pip it's version 3:

head -n 1 $(which pip)
#!/usr/bin/python3

So when I get this error message:

configure: error: this package requires iso639 for Python. Please, install with 'sudo pip install iso-639' and try again.

running that command installs the package for Python 3, but not 2.7. But configure is checking for 2.7 :-(. To be pedantic, I need to run

sudo pip2.7 install iso-639
cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"

Post-pipeline tasks

It looks like the pipeline is complete. Yay!!! The BLEU scores don't look good for my realistic en-so run, so I'm checking the steps again.

I'd like to start thinking about what we should do to make things better. Feel free to add your own:

  1. Non-NLTK tokenizer. I'll be making sure that the marek aligners can use other tokenizers. Alicante should ensure that their existing tools can use other tokenizers/detokenizers.
  2. Add SMT, for BLEU score comparison and for use in marek's aligners.
  3. Make the NMT rules more general. At the moment, they are hardcoded to use RNN/GRU in Marian. Encourage other people to use the NMT Snakemake file.

Keras not detected

Hello Bitextor team,

I am trying to install Bitextor 5, but I am running into a problem. First of all, "./configure" does not work; I have to use "./autogen.sh", but then I get an error message saying Keras is not installed. However, as you can see from the attached image, it is installed.
image

Can you help me with this issue?

Thanks

bitextor-identifyMIME.py error

cat 1 | ~/permanent/software/bitextor/bitextor-identifyMIME.py

Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-identifyMIME.py", line 41, in
magicoutput.append(base64.b64encode(base64.b64decode(content + "==").decode(magicoutput[1].split("=")[1].replace("unknown-8bit","iso-8859-1").replace('us-ascii','iso-8859-1')).encode("utf8")).decode("utf8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 16161: invalid continuation byte
1.gz

User-specified tokenizer

The user should be able to specify their own tokenizer for each language, instead of the current mishmash of NLTK and Moses tokenizers. This would be necessary, e.g., for Chinese, to customize segmentation. Low-resource languages are also unlikely to have a prepackaged tokenizer.
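One way such pluggability could look, purely as an illustrative sketch (make_tokenizer_registry is a hypothetical helper, not Bitextor's API): map each language code to a tokenizer callable, with a whitespace fallback for languages without a prepackaged tokenizer.

```python
def make_tokenizer_registry(custom=None):
    """Return a per-language tokenizer lookup.

    `custom` maps ISO 639-1 codes to callables taking a str and
    returning a list of tokens. Unknown languages fall back to naive
    whitespace splitting, which is exactly the behaviour one would
    override for e.g. Chinese word segmentation.
    """
    custom = dict(custom or {})

    def tokenize(lang, text):
        return custom.get(lang, str.split)(text)

    return tokenize
```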

httrack error

Running
bitextor-webdir2warc.sh website > /dev/null
I get this error
Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-dir2warc.py", line 23, in
warc_record = warc.WARCRecord(payload=content,headers={"WARC-Target-URI":url.decode("utf8")})
AttributeError: 'NoneType' object has no attribute 'decode'

Tar file of website folder attached

website.zip

More fluent tagline

"Bitextor is a translation memories generator using multilingual websites as a corpus source." -> "Bitextor generates translation memories from multilingual websites."

Remove checks for SRILM

The version of zipporah you have depends on kenlm's query program. However, bitextor's build system still checks for SRILM:

checking for ngram... no
configure: WARNING: You don't have ngram (SRILM) installed. You can't use Zipporah

Neural config variables are mandatory

When trying my config.json without the NMT document aligner, an error pops up requesting the 'marianDir' variable. Bitextor shouldn't use any NMT code if it is not specified; probably one of the recent folder-renaming changes accidentally made it mandatory?

Duplicated and non-upstream split-sentences.perl

I noticed that some resources in the bitextor repository are duplicated and not linked to the original upstream repository:

utils/split-sentences.perl line 77 says chop instead of chomp.
This causes the last character of the last line to be dropped if the file does not end in a newline.
I fixed this upstream in Moses on May 3.

The split-sentences.perl can be replaced with:
https://github.com/moses-smt/mosesdecoder/blob/8c5eaa1a122236bbf927bde4ec610906fea599e6/scripts/ems/support/split-sentences.perl
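To see why this matters, here is the chop/chomp distinction rendered as a Python analogy (illustrative only; the real fix is the upstream Perl change referenced above):

```python
def chop(line):
    # Perl's chop: unconditionally drop the last character.
    return line[:-1]

def chomp(line):
    # Perl's chomp: drop only a trailing newline, if present.
    return line[:-1] if line.endswith("\n") else line
```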

Not able to compile bleualign

I am trying to compile bleualign-cpp, but I get an error related to autotools. I have tried with libtools 1.62, 1.65, and 1.67, but get the same error. I have also tried on Ubuntu 16 and 18, both with the version installed with apt install and with conda. The error I get is:

libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function utils::CompressedWriter::CompressedWriter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)': CompressedWriter.cpp:(.text+0x282): undefined reference to boost::iostreams::lzma::best_compression'
libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function boost::iostreams::detail::lzma_compressor_impl<std::allocator<char> >::lzma_compressor_impl(boost::iostreams::lzma_params const&)': CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEEC2ERKNS0_11lzma_paramsE[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEEC5ERKNS0_11lzma_paramsE]+0x19): undefined reference to boost::iostreams::detail::lzma_base::lzma_base()'
CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEEC2ERKNS0_11lzma_paramsE[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEEC5ERKNS0_11lzma_paramsE]+0x5e): undefined reference to boost::iostreams::detail::lzma_base::~lzma_base()' libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function boost::iostreams::detail::lzma_compressor_impl<std::allocator >::~lzma_compressor_impl()':
CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEED2Ev[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEED5Ev]+0x1e): undefined reference to boost::iostreams::detail::lzma_base::reset(bool, bool)' CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEED2Ev[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEED5Ev]+0x36): undefined reference to boost::iostreams::detail::lzma_base::~lzma_base()'
libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function void boost::iostreams::detail::lzma_base::init<std::allocator<char> >(boost::iostreams::lzma_params const&, bool, boost::iostreams::detail::lzma_allocator<std::allocator<char>, boost::iostreams::detail::lzma_allocator_traits<std::allocator<char> >::type>&)': CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail9lzma_base4initISaIcEEEvRKNS0_11lzma_paramsEbRNS1_14lzma_allocatorIT_NS1_21lzma_allocator_traitsIS9_E4typeEEE[_ZN5boost9iostreams6detail9lzma_base4initISaIcEEEvRKNS0_11lzma_paramsEbRNS1_14lzma_allocatorIT_NS1_21lzma_allocator_traitsIS9_E4typeEEE]+0x66): undefined reference to boost::iostreams::detail::lzma_base::do_init(boost::iostreams::lzma_params const&, bool, void* ()(void, unsigned long, unsigned long), void ()(void, void*), void*)'
libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function boost::iostreams::detail::lzma_compressor_impl<std::allocator<char> >::filter(char const*&, char const*, char*&, char*, bool)': CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x3d): undefined reference to boost::iostreams::detail::lzma_base::before(char const*&, char const*, char*&, char*)'
CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x4e): undefined reference to boost::iostreams::lzma::finish' CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x59): undefined reference to boost::iostreams::lzma::run'
CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x65): undefined reference to boost::iostreams::detail::lzma_base::deflate(int)' CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x81): undefined reference to boost::iostreams::detail::lzma_base::after(char const*&, char*&, bool)'
CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x8b): undefined reference to boost::iostreams::lzma_error::check(int)' CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE6filterERPKcS6_RPcS8_b]+0x92): undefined reference to boost::iostreams::lzma::stream_end'
libbleualign_cpp_lib.a(CompressedWriter.cpp.o): In function boost::iostreams::detail::lzma_compressor_impl<std::allocator<char> >::close()': CompressedWriter.cpp:(.text._ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE5closeEv[_ZN5boost9iostreams6detail20lzma_compressor_implISaIcEE5closeEv]+0x1e): undefined reference to boost::iostreams::detail::lzma_base::reset(bool, bool)'
collect2: error: ld returned 1 exit status
CMakeFiles/bleualign_cpp.dir/build.make:102: recipe for target 'bleualign_cpp' failed
make[2]: *** [bleualign_cpp] Error 1
CMakeFiles/Makefile2:73: recipe for target 'CMakeFiles/bleualign_cpp.dir/all' failed
make[1]: *** [CMakeFiles/bleualign_cpp.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

Any ideas of what may be wrong?

{BITEXTORINSTALL} install error

I'm getting this error

RuleException:
CalledProcessError in line 539 of /home/hieu/workspace/github/paracrawl/bitextor.malign/snakemake/Snakefile:
Command ' set -euo pipefail; /home/hieu/permanent/software/bitextor/bitextor-rank.py -m /home/hieu/permanent/software/bitextor/{BITEXTORINSTALL}/share/bitextor/model/keras.model -w /home/hieu/permanent/software/bitextor/{BITEXTORINSTALL}/share/bitextor/model/keras.weights < /home/hieu/transient/en-fr/activeboard.1.urlsoverlap > /home/hieu/transient/en-fr/activeboard.1.rank ' returned non-zero exit status 1.

I would suggest we leave {BITEXTORINSTALL} until after the pipeline has been completed and thoroughly tested. Installing bitextor should be invisible to the code; at the moment it's causing headaches.

Buggy tokenisation when using NLTK 3.1

In bitextor-lett2idx, bitextor calls the NLTK WordPunctTokenizer. When using version 3.1 of NLTK, this does not handle non-ascii characters, resulting in incorrect tokenisation. Version 3.3 appears to work correctly.

Either bitextor should ensure v3.3 (can this be checked in configure?) or use a different tokeniser.

the bitextor.sh script problem : if [ $(echo $line | grep '\s' | wc -l) -eq 0 ];

Maybe I found a problem in bitextor.sh (I am not sure if I understand correctly):
The purpose of the following command in bitextor.sh is to find blank lines in the file that do not meet the required format.
Line 888 of bitextor.sh: if [ $(echo $line | grep '\s' | wc -l) -eq 0 ];

But I think a line that contains whitespace yet does not follow the required format, e.g. "url " (a trailing space with the ett_file_path missing) instead of "url ett_file_path", would also pass this if check.

Maybe this is just a small problem, rarely encountered in practical applications.
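A stricter check than grepping for '\s' would validate the full two-field format. Here is a sketch of what this report suggests (not the script's current code), in Python for clarity:

```python
import re

# A valid line is "url<whitespace>ett_file_path": two non-empty fields.
_LINE_RE = re.compile(r'^\S+\s+\S+$')

def is_valid_line(line):
    """True only for lines with exactly the two expected fields."""
    return bool(_LINE_RE.match(line.rstrip("\n")))
```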

Problems with sentence splitting in ParaCrawl releases (v1.2 and v2.0)

Per https://github.com/paracrawl/Malign/issues/11

There are some issues with sentence splitting in the English-German v1.2 and v2.0 releases. I looked at the sets that were cleaned with Bicleaner. I noticed that the TMX files contain segments that are not properly sentence split:

  • There are segments that contain 2 or more sentences (I have seen up to 5). The ones that I noticed have the same number of sentences on the source and target side, so they could be split into multiple segments. They are marked as 1:1 alignments in the property type. Examples are the TUIDs 97, 101, 112 and 116 in paracrawl-release1.2.en-de.withstats.filtered-bicleaner.tmx.gz and TUID 368175616 in paracrawl-release2.en-de.withstats.filtered-bicleaner.tmx.gz. There are many more - I ran a sentence splitter that splits segments with an equal number of sentences on the ParaCrawl v1.2 data and obtained about 20 million sentence pairs instead of the 17 million directly extracted from the TMX.
  • There is also the reverse case where a single sentence is distributed over a number of segments - e.g. the last 6 segments in paracrawl-release1.2.en-de.withstats.filtered-bicleaner.tmx.gz 317883035-317883040 contain a single sentence. I understand of course that with the difficulties of source document formatting and other non-sentence data on HTML pages these are hard to merge automatically.

bitextor-get-html-text.py error

cat in.1| .../bitextor-get-html-text.py -x
....
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/init.py", line 762, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/etree.pyx", line 3213, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

in.zip

bitextor-get-html-text.py error

Running
bitextor-get-html-text.py -x < goreapparel.ttmime > goreapparel.xtt
error:
Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-get-html-text.py", line 83, in
cleanhtml=cleaner.clean_html(b64t)
File "src/lxml/html/clean.py", line 518, in lxml.html.clean.Cleaner.clean_html
File "/usr/local/lib/python3.6/dist-packages/lxml/html/init.py", line 876, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/local/lib/python3.6/dist-packages/lxml/html/init.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty

goreapparel.ttmime.tar.gz

bitextor-elrc-filtering.py error

bitextor-elrc-filtering.py -c "url1,url2,seg1,seg2,hunalign" -s < en-fr.segclean > en-fr.elrc

/home/hieu/transient/en-fr/en-fr.elrc
Traceback (most recent call last):
File "/home/hieu/permanent/software/bitextor/bitextor-elrc-filtering.py", line 33, in
if len(fieldsdict["seg2"]) == 0:
KeyError: 'seg2'

en-fr.segclean.zip

Handle submodules in build system

The build system should check out submodules if needed.
Currently things just fail with:

make  all-recursive
make[1]: Entering directory '/home/heafield/bitextor'
Making all in zipporah
make[2]: Entering directory '/home/heafield/bitextor/zipporah'
g++ zipporah/tools/generate-bow-xent.cc -o generate-bow-xent -std=c++11
g++: error: zipporah/tools/generate-bow-xent.cc: No such file or directory
g++: fatal error: no input files
compilation terminated.
Makefile:493: recipe for target 'generate-bow-xent' failed
make[2]: *** [generate-bow-xent] Error 1
make[2]: Leaving directory '/home/heafield/bitextor/zipporah'
Makefile:507: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/heafield/bitextor'
Makefile:390: recipe for target 'all' failed
make: *** [all] Error 2

Killing script compilation: __ENV__ __PYTHON__

Now that #30 is closed, let's talk about removing script compilation entirely. Here's a summary of how bitextor genericizes over env and python:

https://github.com/bitextor/bitextor/blob/master/textsanitizer.py
#!/usr/bin/env python

https://github.com/bitextor/bitextor/blob/master/bitextor-identifyMIME.py
#!__ENV__ __PYTHON__

https://github.com/bitextor/bitextor/blob/master/bitextor-dir2warc.py
#!__ENV__ python3

So bitextor already doesn't genericize consistently over these variables. Can we just assume /usr/bin/env?

JY_about_regex_problem_in_bitextor_script

In the align_documents_and_segments() function of the bitextor script, I found that the following regex expression may have a problem (should '\s' be replaced by '\S', which matches non-whitespace characters?):
tail -n +2 $VOCABULARY | sed -r 's/^([^\s]+)\t([^\s]+)$/\2 @ \1/g' > $HUNALIGN_DIC
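For reference: inside a POSIX bracket expression, '\s' is not a whitespace class, so '[^\s]' likely matches any character except a backslash or a literal 's'; '\S' (or '[^[:space:]]') is probably what was intended. The intended field swap can be expressed unambiguously, sketched here in Python:

```python
import re

# Swap "src\ttrg" into "trg @ src", the dictionary format the sed
# one-liner builds for hunalign. \S (non-whitespace) is what the
# bracketed [^\s] in the sed expression was presumably meant to say.
_PAIR = re.compile(r'^(\S+)\t(\S+)$')

def to_hunalign_entry(line):
    """Return 'trg @ src' for a tab-separated pair, else None."""
    m = _PAIR.match(line.rstrip("\n"))
    return f"{m.group(2)} @ {m.group(1)}" if m else None
```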

Vocabulary file is still obligatory with the ParaCrawl aligner

Using the --paracrawl-aligner-command argument to specify the ParaCrawl aligner, I still get the message THE VOCABULARY FILE WAS NOT SET: USE OPTION -v if I do not supply a dictionary. Supplying an empty dictionary (just the language codes) gets around this, but I do not think a dictionary should be required. It can be used by hunalign, but there it's optional (afaik)

Add a brief "Getting Started" guide

I read through the README, installed the long list of dependencies, and then I got stuck trying to figure out how to crawl a sample website. It would be nice to have some examples of working crawls.

I downloaded a dictionary from the sourceforge site (I could not find the ~/bitextor-dictionaries directory mentioned) and first tried crawling lemonde.fr (yes, probably no parallel data here, but I wanted to see what would happen). bitextor produced no output, but seemed to be writing stuff in /tmp, and I eventually stopped it. I also tried some French tourist sites, and EU press releases, but bitextor found nothing.

Could you post some working examples to get started on?
