languagemachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)

Home Page: https://proycon.github.io/folia

License: GNU General Public License v3.0

Shell 7.74% C++ 87.39% Makefile 0.47% XSLT 0.19% M4 3.82% Dockerfile 0.39% Lex 0.01%
nlp computational-linguistics folia

foliautils's Introduction

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

FoliAutils

(c) CLST/TiCC/CLiPS 2024 https://github.com/LanguageMachines/foliautils

Centre for Language and Speech Technology, Radboud University Nijmegen
Tilburg centre for Cognition and Communication, Tilburg University and
Centre for Dutch Language and Speech, University of Antwerp

This file is part of foliautils. foliautils provides a series of programs that make FoLiA processing easier.

This includes:

  • FoLiA-2text : convert FoLiA documents into plain text.

  • FoLiA-txt : convert plain text documents into FoLiA.

  • FoLiA-page : convert PAGE documents into FoLiA.

  • FoLiA-abby : convert ABBYY FineReader documents into FoLiA.

  • FoLiA-hocr : convert hOCR documents into FoLiA.

  • FoLiA-alto : convert ALTO DIDL files into series of FoLiA documents.

  • FoLiA-langcat : assign language tags to the words in a FoLiA document.

  • FoLiA-idf : count words in a series of FoLiA documents and generate a .tsv file describing the IDF.

  • FoLiA-stats : gather n-gram statistics from series of FoLiA files.

  • FoLiA-collect : collect n-gram statistics of .tsv files produced by FoLiA-stats.

  • FoLiA-clean : cleanup FoLiA documents, removing unused declarations etc.

  • FoLiA-pm : convert Political Mashup documents into FoLiA.

  • FoLiA-correct : correct FoLiA files using correction candidates generated by TICCL-rank (from the ticcltools package).

foliautils is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

Comments and bug-reports are welcome at our issue tracker at https://github.com/LanguageMachines/foliautils/issues or by mailing lamasoftware (at) science.ru.nl.


This software has been tested on:

  • Intel platforms running several versions of Linux, including Ubuntu, Debian, Arch Linux, Fedora (both 32 and 64 bits)
  • Mac platforms running OS X 10.10

Contents of this distribution:

  • Sources
  • Licensing information ( COPYING )
  • Build system based on GNU Autotools
  • Dockerfile

Dependencies: To be able to successfully build foliautils from source, you need the following dependencies:

  • ticcutils
  • libfolia
  • ucto
  • libicu-dev
  • libxml2-dev
  • libexttextcat-dev OR libtextcat-dev (OS dependent)
  • A sane C++ build environment with autoconf, automake, autoconf-archive, make, gcc or clang, libtool, pkg-config

To install ticcutils, libfolia and ucto, first check whether your distribution's package manager has an up-to-date package. If not, you can use the supplied build-deps.sh script to automatically download and install the latest stable versions of these dependencies. You can pass a target directory prefix as the first argument, and you may need to prepend sudo to ensure you can install there.

To compile and install FoLiA-utils manually from source, provided you have all the dependencies installed, do:

$ bash bootstrap.sh
$ ./configure
$ make
$ make install

Container Usage

A pre-made container image can be obtained from Docker Hub as follows:

docker pull proycon/foliautils

You can build a Docker container as follows; make sure you are in the root of this repository:

docker build -t proycon/foliautils .

This builds the latest stable release. If you want to use the latest development version from the git repository instead, do:

docker build -t proycon/foliautils --build-arg VERSION=development .

Run the container interactively as follows:

docker run -t -i proycon/foliautils

Or invoke the tool you want:

docker run proycon/foliautils FoLiA-page

Add the -v /path/to/your/data:/data parameter (before -t) if you want to mount your data volume into the container at /data .

foliautils's People

Contributors

kosloot, martinreynaert, proycon

foliautils's Issues

FoLiA-hocr does not produce an output FoLiA: Found no OCR_LINE nodes

When invoked as part of PICCL/ocr.nf as follows:

$ FoLiA-hocr --prefix "FH-" -O ./ -t 1 "OllevierGeets-5.hocr"
found no OCR_LINE nodes in OllevierGeets-5.hocr
skipped empty result : .//FH-OllevierGeets-5.tif.folia.xml
done
FoLiA-hocr --prefix "FH-" -O ./ -t 1 "OllevierGeets-5.hocr"

Input file is attached and does contain lines with "ocr_line", but perhaps something changed in the upstream format (this is produced by Tesseract 4.1.1):

OllevierGeets-5.hocr.gz

FoLiA-correct cannot be run twice on same file

We discussed this last week on IRC and the conclusion was that this should be considered a bug.

I would like to run FoLiA-correct twice over the same file, using different input files and outputting to a new, separate --outputclass.

This currently does not work. What I tried was this (starting from empty output directories):

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ /exp/sloot/usr/local/bin/FoLiA-correct -t 2 --inputclass HTR --outputclass TICCLA --nums 1 --ngram 2 -e folia.xml -O /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrect/ --unk /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.unk --punct /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.punct --rank NAK.wordfreqlist.1to2.tsv.CLEAN.clean.ANAHASH.anahash.INDEXER.index.LDCALC.ldcalc.RANK.ranked.CHAIN.REVERSEDSELECTION.underscoretoplusminus.chained /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIA >TRAPTEST.20191107.UNRIGGED.stdout 2>TRAPTEST.20191107.UNRIGGED.stderr

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ ls -l /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrect/
total 64
-rw-r--r-- 1 reynaert reynaert 64575 Nov 7 22:56 NL-HlmNHA_1972_469_0527.xml.ticcl.folia.xml

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ /exp/sloot/usr/local/bin/FoLiA-correct --setname="Ticcl2-set" -t 2 --inputclass TICCLA --outputclass TICCLB --nums 1 --ngram 2 -e folia.xml -O /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrectB/ --unk /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.unk --punct /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.punct --rank /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.clean.ANAHASH.anahash.INDEXER.index.LDCALC.ldcalc.RANK.ranked.CHAIN.UnderscoreToPlusmin.chained /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrect/ >TRAPTESTB.20191107.UNRIGGED.stdout 2>TRAPTESTB.20191107.UNRIGGED.stderr

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ ls -l /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrectB/
total 0
-rw-r--r-- 1 reynaert reynaert 0 Nov 7 22:57 NL-HlmNHA_1972_469_0527.xml.ticcl.ticcl.folia.xml

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ mkdir /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrectBclear

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ /exp/sloot/usr/local/bin/FoLiA-correct --setname="Ticcl2-set" --clear -t 2 --inputclass TICCLA --outputclass TICCLB --nums 1 --ngram 2 -e folia.xml -O /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrectBclear --unk /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.unk --punct /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.punct --rank /reddata/NA/proto/FOLIA/TICCL/NAK.wordfreqlist.1to2.tsv.CLEAN.clean.ANAHASH.anahash.INDEXER.index.LDCALC.ldcalc.RANK.ranked.CHAIN.UnderscoreToPlusmin.chained /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrect/ >TRAPTESTB.20191107.UNRIGGED.clear.stdout 2>TRAPTESTB.20191107.UNRIGGED.clear.stderr

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ ls -l /reddata/NA/proto/FOLIA/TICCL/TRAPTEST/FOLIAcorrectBclear/
total 0
-rw-r--r-- 1 reynaert reynaert 0 Nov 7 22:59 NL-HlmNHA_1972_469_0527.xml.ticcl.ticcl.folia.xml

reynaert@violet:/reddata/NA/proto/FOLIA/TICCL$ history >UNRIGGED.unsuccessful.20191107.hist

I want to use this for another purpose too, after which I will need a functioning FoLiA-2text.

Small typo in FoLiA-stats error messages

FoLiA-stats failed on a wrong input format. No problem.

However: it reported:

"no documents were successfull handled!"

Please correct 'successfull' to 'successfully'.

Optional output option for FoLiA-correct

To preserve the one-to-one alignment between the original and (possibly) multiple TICCL correction text classes on the paragraph level, I would like FoLiA-correct to optionally output a character other than a regular space when working with the *punct file.

My optional character for now would be the plus-minus sign '±'.

FoLiA-stats discards tokenization

When performing FoLiA-stats on a sentence like this:

      <s xml:id="VanGinniken.p.1.s.487">
         <w xml:id="VanGinniken.p.1.s.487.w.1" class="PUNCTUATION" space="no">
           <t>†</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.2" class="WORD">
           <t>Stomper</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.3" class="PUNCTUATION" space="no">
           <t>(</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.4" class="WORD" space="no">
           <t>knoeier</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.5" class="WORD">
           <t>)</t>
         </w>

the folia::text() function is used to extract the sentence text, delivering:
†Stomper (knoeier) which FoLiA-stats sees as a bigram.

Maybe this is NOT what was intended!
A reasonable thing to do would be to keep the tokenization giving the 5-gram:
† Stomper ( knoeier )

I suggest adding a parameter to FoLiA-stats: --keep-tokens that does so, if desired.
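
To make the difference concrete, here is a minimal sketch in Python (not the FoLiA-stats implementation, which is C++). It only contrasts n-grams built from whitespace-split sentence text with n-grams built from the individual tokens; the helper name ngrams is hypothetical.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The five <w> tokens of the example sentence, in document order.
tokens = ["†", "Stomper", "(", "knoeier", ")"]

# What you get when the detokenized sentence text is split on whitespace.
detokenized = "†Stomper (knoeier)".split()

print(ngrams(detokenized, 2))  # built from the sentence text
print(ngrams(tokens, 5))       # built from the kept tokens
```

With the sentence text only one bigram is available; with the tokens kept, the same sentence yields a single 5-gram and four bigrams.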

FoLiA-2text : naming output files

For some reason FoLiA-2text repeats the input file's full path after the path one specifies for -o. E.g.
instead of:
/reddata/VlaamsCorpus/OCR-T3/zzz/TXT/h36_brugge_3/h36_brugge_3-154.tif.folia.xml.txt
we get:
/reddata/VlaamsCorpus$ cat OCR-T3/zzz/TXT/h36_brugge_3/reddata/VlaamsCorpus/OCR-T3/zzz/FOLIA/h36_brugge_3/h36_brugge_3-154.tif.folia.xml.txt
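
The expected naming can be sketched as follows. This is an illustration in Python of the behaviour I would expect, not the tool's code; the function name output_name and the ".txt" suffix default are assumptions taken from the example paths above.

```python
from pathlib import PurePosixPath

def output_name(input_file, output_dir, suffix=".txt"):
    """Join only the input file's base name onto the output directory.

    The directory part of input_file is deliberately dropped, so it
    cannot be repeated below output_dir.
    """
    base = PurePosixPath(input_file).name
    return str(PurePosixPath(output_dir) / (base + suffix))

print(output_name(
    "/reddata/VlaamsCorpus/OCR-T3/zzz/FOLIA/h36_brugge_3/h36_brugge_3-154.tif.folia.xml",
    "/reddata/VlaamsCorpus/OCR-T3/zzz/TXT/h36_brugge_3",
))
```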

FoLiA-stats gives wrong counts for tokens and percentages with --max-ngram

FoLiA-stats: with parameter --max-ngram='3', the token counts for 2- and 3-grams and most certainly the percentages calculated for all ngrams are incorrect.
In comparison to the results with --ngram='1', token counts for unigrams appear to be correct.

Example command line: reynaert@black:~$ for year in 1932 ; do nohup /exp/sloot/usr/local/bin/FoLiA-stats -p --max-ngram='3' --class='OCR' -s --hemp=hempfile -t 60 -e gz$ -o /reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten$year /reddata/NLAB/KBkranten/FOLIAlangcat/$year/artikel >/reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten$year.20171106.RED.stdout 2>/reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten$year.RED.20171106.stderr ; done &

Example output:

reynaert@black:~$ tail -n 3 /reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten1900.*.tsv
==> /reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten1900.wordfreqlist.2-gram.tsv <==
✓■TT »g"«,"l«l«in«r. 1 76013696 33.3219
❖ * 1 76013697 33.3219
❖ Het 1 76013698 33.3219

==> /reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten1900.wordfreqlist.3-gram.tsv <==
✓■TT »g"«,"l«l«in«r. Re 1 75189431 32.9606
❖ * * 1 75189432 32.9606
❖ Het voorgenomen 1 75189433 32.9606

==> /reddata/NLAB/KBkranten/FOLIAlangcat/FRQ/FOLIAstats.KBkranten1900.wordfreqlist.tsv <==
✓• 1 76915900 33.7174
✓•15°. 1 76915901 33.7174
✓■TT 1 76915902 33.7174
reynaert@black:~$

Sample test files are to be found here (also accessible from Red):
reynaert@black:~$ ls -l /reddata/NLAB/KBkranten/FOLIAlangcat/TEST
total 332
-rw-r--r-- 1 reynaert reynaert 294944 Apr 25 2017 ddd.010126865.mpeg21.a0001.lang.folia.xml
-rw-r--r-- 1 reynaert reynaert 8354 Apr 25 2017 ddd.010126865.mpeg21.a0002.lang.folia.xml
-rw-r--r-- 1 reynaert reynaert 21523 Apr 25 2017 ddd.010128079.mpeg21.a0001.lang.folia.xml.gz
-rw-r--r-- 1 reynaert reynaert 788 Apr 25 2017 ddd.010128079.mpeg21.a0002.lang.folia.xml.gz

Does FoLiA-benchmark do a good job?

In line 207, FoLiA-benchmark selects all the Word* elements of a Document.
But the result is never used!
I assume the compiler will optimize this selection away, making this test overly optimistic about the speed.

This could be fixed by using the selection in, for instance, an output statement reporting its size.

FoLiA-stats: not yet implemented mode

When running FoLiA-stats like

FoLiA-stats -o something some/folder/*.folia.xml

it gives the following output message:

start processing of XXX files
FoLiA-stats: not yet implemented mode:

Perhaps a slightly more helpful message could be printed. It seems the root cause in this case is the absence of "tags" (I am an absolute noob user, so I don't even know what this means at this point, I just looked at the code here and deduced the tags list must be empty). What would be the best way around this problem? An additional check for correct input options?

FoLiA-2text aborts on a metadata issue

This happened on the Staten-Generaal Digitaal FoLiA, which has been processed by just about every other FoLiA tool before.

[1]+ Aborted nohup /exp/sloot/usr/local/bin/FoLiA-2text -t 120 --class=Ticcl -e ticcl.xml -o /reddata/POLMASH/FOLIALangCatTICCLTXT/d/nl/proc/sgd/ /reddata/POLMASH/FOLIALangCatTICCL/d/nl/proc/sgd/ > /reddata/POLMASH/FOLIALangCatTICCLTXT.sgd.20191112.stdout 2> /reddata/POLMASH/FOLIALangCatTICCLTXT.sgd.20191112.stderr

reynaert@maize:/reddata/POLMASH$ cat /reddata/POLMASH/FOLIALangCatTICCLTXT.sgd.20191112.stderr
nohup: ignoring input
WARNING: foreign-data found in metadata of type 'native'
changing type to 'foreign'
WARNING: foreign-data found in metadata of type 'native'
changing type to 'foreign'
WARNING: foreign-data found in metadata of type 'native'
changing type to 'foreign'
terminate called recursively
terminate called after throwing an instance of 'folia::NoSuchText'
reynaert@maize:/reddata/POLMASH$

I have no idea.

add --inputclass and --outputclass options to FoLiA-correct

FoLiA-correct has a parameter --class to set the output class. The default is 'Ticcl', which is an unwise remnant from the old days...
The input class for text is always 'current'.

To be more flexible in the future we need both --inputclass, to select any input class we would like (default 'current'), and --outputclass, to set the output class just as flexibly (default also 'current').

FoLiA-abby for ABBYY Finereader 11 (Linux) version output

FoLiA-abby was developed on the basis of ABBYY output from the CLARIAH pilot project OpenGazAm. This worked as expected, as far as we are aware today.

In CLARIAH pilot project CCC:DB ABBYY Finereader 11 for Linux was used, which is another OCR-engine from the same stable. This produces different output and FoLiA-abby cannot yet handle this well.

An example:

      <p xml:id="ABBYYtoFOLIAlevi010geil01_01_Deel493.xml.gz.text.div1.p2">
         <part xml:id="ABBYYtoFOLIAlevi010geil01_01_Deel493.xml.gz.text.div1.p2.entry">
           <t class="OCR">gevonden.MetnieuwenmoedtrokhijnaarhetVaalgebied, dochwedertevergeefs.</t>
        </part>

Current output lacks spaces between the words and in at least one instance we have seen words in slightly different variants repeated.

Finereader produces a range of alternatives for each character and for each word, it seems. I take it they are ordered in descending order of probability, but cannot be sure about this.

When we grep on an output file:

reynaert@red:/reddata/PILOTS/DIABOR$ cat /reddata/PILOTS/DIABOR/OCR/levi010geil01_01_Deel500.xml |grep 'Variants'

This also tells you the location of our ABBYY Finereader 11 output. Current output of FoLiA-abby is in:

/reddata/PILOTS/DIABOR/ABBYYtoFOLIA/

We find in this 'grep':

<wordRecVariants>
<charRecVariants>
</charRecVariants>k</charParams><charParams l="661" t="2736" r="684" b="2770" charConfidence="78" serifProbability="255">
<charRecVariants>
</charRecVariants>o</charParams><charParams l="661" t="2736" r="684" b="2770" charConfidence="100" serifProbability="0">
<charRecVariants>
</charRecVariants>n</charParams>
<charRecVariants>
</charRecVariants>k</charParams><charParams l="661" t="2736" r="684" b="2770" charConfidence="78" serifProbability="255">
<charRecVariants>
</charRecVariants>o</charParams><charParams l="661" t="2736" r="684" b="2770" charConfidence="100" serifProbability="0">
<charRecVariants>
</charRecVariants>n</charParams>
</wordRecVariants>
<charRecVariants>
</charRecVariants>k</charParams>
<charRecVariants>
</charRecVariants>o</charParams>
<charRecVariants>
</charRecVariants>n</charParams>
<charRecVariants>
</charRecVariants> </charParams>
<wordRecVariants>

We see the word 'kon' repeated 3 times. The last time, no attributes are given; we take this to be Finereader's 'decision' about the word. This is also followed by the 'missing' space.

Hope this helps to adjust FoLiA-abby (the name is actually spelled with a double 'y') to Finereader 11 output!

FoLiA-stats enhancement towards processing text files per year

Hi Ko,

I have run into a problem in the context of TICCLAT, and I would like your advice and help in working towards a solution.

In TICCLAT we want to build a database containing as much dated vocabulary as possible, based on the various Nederlab FoLiA corpora.

For certain subcorpora in Nederlab it is known from which year or period the texts date. That information, however, is not uniformly available or directly accessible to me (e.g. the Nederlab metadata database).

For important subcorpora such as DBNL and EDBO, however, I already have the year of each book available as a list: file name - year.

For another large corpus (the Staten-Generaal Digitaal 1814-2013) that information is present in the file name for a great many files. For other text types it is in the metadata inside the FoLiA file. Comparable lists can therefore be made fairly easily.

Year lists therefore seem to be 'the way to go'.

The frequency lists should then be built per corpus from all files of a single year. Within TICCLAT we have agreed to process all of these with TICCL-unk as well. The clean files are eventually loaded into the database, per year, on the basis of the year in their file name.

These are often large corpora with large numbers of files. I would rather not move them physically, for example into per-year directories.

My first idea was to build a wrapper that, on the basis of such a year list, feeds the files for a given year to FoLiA-stats with their full file name (+ directory path). The result, however, would be a separate frequency file for each file of a given year. That is not the intention: it would make things too complex and needlessly split over far too many frequency lists. We can already accumulate frequency lists, but that seems an extra step better avoided here.

It is also the case that for e.g. the KB newspapers the files are already neatly sorted and stored per year. They cover the years 1618 to 1940 (as included in Nederlab), however, and so still require starting FoLiA-stats separately hundreds of times. (This has already largely been done, practically completely.) It would therefore be nice if your solution could also work on dated subdirectories, made available to the program in a list: path - year.

The intention is also that the output files get the year derived from the year list in their file name, after, say, the prefix.

I may be overlooking a few things here, but I can surely fill those in later.

I would like you to look into whether FoLiA-stats (and perhaps later TICCL-stats as well) can be adapted to this use case. Many thanks in advance!

Martin
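
A rough sketch of the grouping step, in Python and purely as an illustration of the request (nothing here exists in FoLiA-stats). It assumes a hypothetical year list with one 'path<TAB>year' entry per line, where a path may be a file or a dated subdirectory; the names group_by_year and output_prefix are made up for this example.

```python
from collections import defaultdict

def group_by_year(year_list_lines):
    """Map each year to the paths listed for it.

    Each non-empty line is assumed to look like 'path<TAB>year'
    (a hypothetical layout for the year list).
    """
    groups = defaultdict(list)
    for line in year_list_lines:
        line = line.strip()
        if not line:
            continue
        path, year = line.rsplit("\t", 1)
        groups[year].append(path)
    return dict(groups)

def output_prefix(prefix, year):
    """Place the year right after the user-supplied output prefix."""
    return f"{prefix}.{year}"

year_list = [
    "DBNL/book1.folia.xml\t1850",
    "DBNL/book2.folia.xml\t1851",
    "EDBO/book3.folia.xml\t1850",
]
for year, paths in sorted(group_by_year(year_list).items()):
    print(output_prefix("FOLIAstats", year), paths)
```

One frequency list per year would then be built from all paths in a group, without moving any files.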

FoLiA-langcat does not support archived files

It is rather awkward to have one tool in the middle of a workflow that contrary to the rest does not support archived (*gz and *bz2) files.

I recommend this tool be equipped to also handle archived files forthwith.

MRE
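
For illustration only, this is the kind of transparent handling I mean, sketched in Python with the standard gzip and bz2 modules (the actual tool is C++ and would need its own equivalent). The helper name open_text is hypothetical; the compression type is chosen from the file extension alone.

```python
import bz2
import gzip

def open_text(path):
    """Open a plain, gzip- or bzip2-compressed file as UTF-8 text.

    The decision is based on the file extension only.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A caller can then read "doc.folia.xml", "doc.folia.xml.gz" and "doc.folia.xml.bz2" through the same code path.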

Linker error related to textcat on Mac OS X

foliautils fails on Mac OS X due to:

/bin/sh ../libtool  --tag=CXX   --mode=link g++ -std=c++0x -W -Wall -pedantic -I/Users/proycon/LaMachine/lamachine/include -I/usr/include/libxml2 -I/Users/proycon/LaMachine/lamachine/include -g -O2 -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/55.1/include  -I/Users/proycon/LaMachine/lamachine/include   -o FoLiA-langcat FoLiA-langcat.o  -L/Users/proycon/LaMachine/lamachine/lib -lfolia -L/Users/proycon/LaMachine/lamachine/lib -lticcutils -ltextcat -L/usr/local/Cellar/icu4c/55.1/lib  -licui18n -licuuc -licudata   -licuio  -lxml2
libtool: link: g++ -std=c++0x -W -Wall -pedantic -I/Users/proycon/LaMachine/lamachine/include -I/usr/include/libxml2 -I/Users/proycon/LaMachine/lamachine/include -g -O2 -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/55.1/include -I/Users/proycon/LaMachine/lamachine/include -o FoLiA-langcat FoLiA-langcat.o -Wl,-bind_at_load  -L/Users/proycon/LaMachine/lamachine/lib /Users/proycon/LaMachine/lamachine/lib/libfolia.dylib -L/usr/local/lib -L/usr/local/Cellar/icu4c/55.1/lib -L/usr/local/opt/icu4c/lib /Users/proycon/LaMachine/lamachine/lib/libticcutils.dylib -lbz2 -lboost_regex-mt -ltextcat -licui18n -licuuc -licudata -licuio -lxml2
Undefined symbols for architecture x86_64:
  "textcat_Done(void*)", referenced from:
      _main in FoLiA-langcat.o
  "textcat_Init(char const*)", referenced from:
      _main in FoLiA-langcat.o
  "textcat_Classify(void*, char const*, unsigned long)", referenced from:
      TCdata::procesFile(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-langcat.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [FoLiA-langcat] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

question on using XML_PARSE_HUGE

Sebastian Pipping: @hartwork posted an e-mail questioning the (dangerous) use of XML_PARSE_HUGE:

I ran into your commit [1] using XML_PARSE_HUGE option of libxml2.

I am looking for files that hit related limits with libxml2 in context
of research about billion laughs attacks to add sound protection against
those to libexpat by the end of 2020.
Do you remember why you added XML_PARSE_HUGE?  Do you have files
that run into libxml2 limits or know anyone dealing with "extreme XML
files" that you can connect me to?

The billion laughs attack is a known way to make XML parsers explode: https://en.wikipedia.org/wiki/Billion_laughs_attack

Change FoLiA extension for FoLiA-langcat

The current extension for FoLiA-langcat output is *folia.lc.xml

This is not in line with our convention to have *folia.xml as default extension for FoLiA XML documents.

Also: 'lc' for 'langcat' is confusable with 'lower case'.

I recommend changing the FoLiA-langcat extension to: *lang.folia.xml

MRE

FoLiA-correct should retain trailing punctuation

When correcting, we should take special care of 'PUNCT' words with (only?) trailing standard punctuation.
So 'appel?' should NOT be corrected to 'appel' but the '?' should be kept.
'ongel0felijk!' should be corrected to 'ongelofelijk!', including the '!'
etc.

We should probably limit this to 'true' punctuation, e.g. '.' ',' '?' and '!'
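
A minimal sketch of the idea in Python (not FoLiA-correct's code): strip the trailing punctuation, look up the bare word, and re-attach the punctuation. The names correct_keep_punct and corrections are hypothetical, and the correction table is just a stand-in for the real TICCL candidates.

```python
TRAILING = ".,?!"  # the 'true' punctuation marks named above

def correct_keep_punct(word, corrections):
    """Correct the word without losing its trailing punctuation."""
    core = word.rstrip(TRAILING)
    tail = word[len(core):]
    return corrections.get(core, core) + tail

corrections = {"ongel0felijk": "ongelofelijk"}
print(correct_keep_punct("appel?", corrections))
print(correct_keep_punct("ongel0felijk!", corrections))
```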

FoLiA-correct: resolve HEMPs using FoLiA::Correction

This came up after issue #45

When resolving a HEMP, FoLiA-correct just adds the resolved text to one of the string/word nodes.
I assume using a real Correction would be better.

For example:

    <p xml:id="mwsel.p.1">
      <t class="OCR">•c c•</t>
      <str xml:id="mwsel.p.1.str.1">
        <t class="OCR">•c</t>
      </str>
      <str xml:id="mwsel.p.1.str.2">
        <t class="OCR">c•</t>
      </str>
    </p>

Assuming •c c• is in the PUNCT file as '•c c• cc', this HEMP is resolved as:

   <p xml:id="mwsel.p.1">
      <t>cc</t>
      <t class="OCR">•c c•</t>
      <str xml:id="mwsel.p.1.str.1">
        <t class="OCR">•c</t>
      </str>
      <str xml:id="mwsel.p.1.str.2">
        <t offset="0">cc</t>
        <t class="OCR">c•</t>
      </str>
    </p>

IMHO a much better solution would be:

   <p xml:id="mwsel.p.1">
      <t>cc</t>
      <t class="OCR">•c c•</t>
      <correction xml:id="mwsel.p.1.correction.1">
        <new>
          <str xml:id="mwsel.p.1.str.edit.1">
            <t >cc</t>
          </str>
        </new>
         <original>
          <str xml:id="mwsel.p.1.str.1">
            <t class="OCR">•c</t>
          </str>
          <str xml:id="mwsel.p.1.str.2">
            <t class="OCR">c•</t>
          </str>
        </original>
      </correction>
    </p>

interesting point: HEMP resolution is done before other corrections. I assume that a real correction using the cc will not be performed.

FoLiA-langcat: add languages as <alt> nodes

At the moment, FoLiA-langcat assigns a list of detected languages to the <lang> nodes when the --all option is specified.
It would be better to create 1 <lang> node and a series of <alt><lang>...</lang></alt> nodes.

So NOT:

    <lang class="fra|eng"/>

But

<lang class="fra"/>
<alt>
  <lang class="eng"/>
</alt>
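
The mapping can be sketched like this in Python with xml.etree.ElementTree. It is only an illustration of the proposed structure, not libfolia code; the function name lang_nodes is hypothetical, and attributes a real FoLiA document may require on these nodes (ids, sets) are left out.

```python
import xml.etree.ElementTree as ET

def lang_nodes(detected):
    """Turn 'fra|eng' into one <lang> node plus one <alt><lang/></alt>
    per further language, in the order the languages were detected."""
    codes = [code for code in detected.split("|") if code]
    nodes = []
    for index, code in enumerate(codes):
        lang = ET.Element("lang", {"class": code})
        if index == 0:
            nodes.append(lang)          # best guess stays a plain <lang>
        else:
            alt = ET.Element("alt")     # every other guess is wrapped
            alt.append(lang)
            nodes.append(alt)
    return nodes

for node in lang_nodes("fra|eng"):
    print(ET.tostring(node, encoding="unicode"))
```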

FoLiA-correct fails with text consistency error

Running the TICCL pipeline on mwsel.pdf (via Piroska Lendvai), using text extracted from the PDF, FoLiA-correct fails with a text consistency error:

ticcl.nf --inputdir input --inputtype pdf --lexicon $LM_PREFIX/opt/PICCL/data/int/deu/deu.aspell.dict --alphabet $LM_PREFIX/opt/PICCL/data/int/deu/deu.aspell.dict.lc.chars --charconfus $LM_PREFIX/opt/PICCL/data/int/deu/deu.aspell.dict.c10.d2.confusion
Command output:
  Now using node v14.9.0 (npm v6.14.8)
  start reading variants
  read 180 variants
  start reading unknowns
  read 1 unknown words
  start reading puncts
  read 350 punctuated words
  verbosity = 0


  finished ./mwsel.folia.xml
  outputdir

Command error:
  start 1-gram correcting in file: ./mwsel.folia.xml
  FoLiA error in paragraph mwsel.p.7 of document mwsel
  inconsistent text: settext(cls=current): deeper text differs from attempted
  deeper='b'
  attempted='•a'

Leads to:

  Missing output file(s) `*.ticcl.folia.xml` expected by process `foliacorrect (1)`

Saner defaults for FoLiA-stats?

FoLiA-stats doesn't work out of the box as expected because of the default values. I suggest --lang none and --class current as saner defaults.

FoLiA-pm doesn't handle notes correctly

FoLiA-pm chokes on this note.xml.txt inputfile:

''saving to file test3/FPM-note.folia.xml failed: no such text: ref::text_content()''

This is due to the note node:

<p pm:id="pm11.2.2">AMENDEMENT VAN HET LID PECHTOLD C.S. TER VERVANGING VAN DAT GEDRUKT ONDER NR. 69
  <note pm:ref="v1.1" pm:id="pm11.2.2.2">
    <p pm:id="pm11.2.2.2.1">1</p>
    <p pm:id="pm11.2.2.2.2">Vervanging in verband met wijziging in de ondertekening.</p>
  </note>
</p>

I assume the software doesn't expect 2 paragraphs inside the note.
What should be done:
The first paragraph, with just a '1', must be inlined and the second paragraph, with the text of the note, should be placed at the bottom of the page.

FoLiA-collect: -t parameter does not work

FoLiA-collect appears to use only 1 thread, even when the -t switch is set to a higher number.

Some user feedback: Even so, it did manage to successfully collect the 40 unigram lists derived from the years 1900-1939 of the KB newspapers. With the bigram lists on server Red (256GB of RAM) it went into swap. So I killed it, noticing afterwards only that it had successfully written out 75% of the collected list.

FoLiA-hocr cannot handle tif filenames starting with a number

I am using FoLiA-hocr inside the Lamachine distribution. I start FoLiA-hocr with a hocr input file, which, inside the file, contains a reference to an original .tif file starting with a number.

The top of the hocr file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.04.00' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
  <div class='ocr_page' id='page_1' title='image "00529957.tif"; bbox 0 0 1890 2598; ppageno 0'>

This gives the following output:

$ FoLiA-hocr -O /tmp 00529957.hocr
terminate called after throwing an instance of 'folia::XmlError'
what(): XML error: '00529957.tif' is not a valid NCName.
Afgebroken

When using tif files starting with a letter, there is no problem.
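
One possible workaround, sketched in Python as an illustration only: check the name against the NCName rule and prepend a prefix when it fails. The pattern below is an ASCII-only approximation of the XML NCName production, the function name safe_id is hypothetical, and I am assuming a prefix such as "FH-" (as passed via --prefix elsewhere) would be acceptable in the generated id.

```python
import re

# ASCII-only approximation of an XML NCName: a letter or underscore first,
# then letters, digits, '.', '-' or '_'. Real NCNames allow more characters.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

def safe_id(name, prefix="FH-"):
    """Return name if it is a valid (ASCII) NCName, else prefix + name."""
    if NCNAME.match(name):
        return name
    candidate = prefix + name
    if not NCNAME.match(candidate):
        raise ValueError(f"cannot turn {name!r} into an NCName")
    return candidate

print(safe_id("00529957.tif"))
print(safe_id("page.tif"))
```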

Error related to TextCat dependency

Is this related to #1?

STDERR:

FoLiA-langcat.cxx: In function ‘int main(int, char**)’:
FoLiA-langcat.cxx:303:22: error: no matching function for call to ‘TextCat::TextCat(std::__cxx11::string&)’
TextCat TC( config );
^
In file included from FoLiA-langcat.cxx:48:0:
/home/arte/acme/include/ucto/my_textcat.h:49:3: note: candidate: TextCat::TextCat(const TextCat&)
TextCat( const TextCat& );
^~~~~~~
/home/arte/acme/include/ucto/my_textcat.h:49:3: note: no known conversion for argument 1 from ‘std::__cxx11::string {aka std::__cxx11::basic_string}’ to ‘const TextCat&’
/home/arte/acme/include/ucto/my_textcat.h:48:12: note: candidate: TextCat::TextCat(const string&, TiCC::LogStream*)
explicit TextCat( const std::string&, TiCC::LogStream * );
^~~~~~~
/home/arte/acme/include/ucto/my_textcat.h:48:12: note: candidate expects 2 arguments, 1 provided
make[2]: *** [FoLiA-langcat.o] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Operation 'hemp' parameter in FoLiA-stats

The hemp parameter in FoLiA-stats collects spaced words. It currently breaks on ligatures (see example). It also fails to collect the last letter if this has a trailing punctuation mark, which happens often.

reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal
Z. F r a n k r ij k.
uitgeoefend. Z. F r a n k r ij k.
F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r

1/ ligatures should be seen as single characters.
2/ a final character with a trailing punctuation mark should also be collected.

Perhaps both little issues might be solved by allowing for the 'occasional' two character sequence, given repetitions of single characters in historically emphasised text.
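
A sketch of such a relaxed detector, in Python and only to illustrate the two points above; it is not the FoLiA-stats algorithm. The assumptions are mine: a piece is one or two letters, trailing punctuation closes a run and is dropped, two-letter pieces are only trusted inside a run, and a run needs at least three pieces. The name find_hemps is hypothetical.

```python
PUNCT = ".,?!;:"

def find_hemps(line, min_len=3):
    """Collect spaced-out words from a line, joined with underscores."""
    runs, run = [], []

    def flush():
        # Two-character pieces are only trusted inside a run, so ordinary
        # two-letter words next to a spaced word are not swallowed.
        while run and len(run[0]) == 2:
            run.pop(0)
        while run and len(run[-1]) == 2:
            run.pop()
        if len(run) >= min_len:
            runs.append("_".join(run))
        run.clear()

    for token in line.split():
        core = token.rstrip(PUNCT)
        if core.isalpha() and len(core) <= 2:
            run.append(core)
            if core != token:  # trailing punctuation closes the run
                flush()
        else:
            flush()
    flush()
    return runs

print(find_hemps("uitgeoefend. Z. F r a n k r ij k."))
print(find_hemps("F r a n s c h zal"))
```

On the first line this collects the whole of 'F r a n k r ij k.', including the 'ij' and the final 'k'.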

FoLiA-stats calculates wrong results for nested tags

I am under the impression that FoLiA-stats delivers wrong results for nested tags.
Needs confirmation though.

questionable input:

<text>
  <div>
    <t>Tekst totaal</t>
    <div>
      <t>Tekst</t>
    </div>
    <div>
      <t>totaal</t>
     </div>
  </div>
</text>

FoLiA-stats should count 1 'Tekst' and 1 'totaal'
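
The counting I have in mind can be sketched in Python with xml.etree.ElementTree (illustration only, not the FoLiA-stats code): text is counted at the deepest level that has it, and a parent's own <t> is skipped once any descendant has supplied text. The function name count_deepest_text is hypothetical, and mixed cases where only some children carry text are not handled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_deepest_text(xml_string):
    """Count words from <t> nodes, using only the deepest available text."""
    counts = Counter()

    def visit(elem):
        # Visit structural children first and remember whether any of
        # them already supplied text.
        child_has_text = False
        for child in elem:
            if child.tag != "t" and visit(child):
                child_has_text = True
        if child_has_text:
            return True  # skip this element's own <t>: it would double count
        t = elem.find("t")
        if t is not None and t.text:
            counts.update(t.text.split())
            return True
        return False

    visit(ET.fromstring(xml_string))
    return counts

doc = """<text>
  <div>
    <t>Tekst totaal</t>
    <div><t>Tekst</t></div>
    <div><t>totaal</t></div>
  </div>
</text>"""
print(count_deepest_text(doc))
```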

use the 'hemp' algorithm in FoLiA-correct to solve Historical Emphasis

FoLiA-stats generates a 'hemp' file containing Historical Emphasis like:
N_A_P_O_L_E_O_N
Which is the emphasized form of Napoleon.
Of course there will be variants too:
N_AP_O_LE_ON etc.

We could (some magic involved) use the TICCL tools to generate a translation list of these hemp words:

N_A_P_O_L_E_O_N  Napoleon
N_AP_O_LE_ON     Napoleon

etc.

When running FoLiA-correct, we can then use the same hemp detector that FoLiA-stats uses to find these 'words' again and replace them directly using the translation list.
Probably BEFORE applying other n-gram corrections?

FoLiA-page will fail if no filename is found in the metadata

FoLiA-page assumes that the name of the original scanned document is stored in the metadata.
This is not always the case (see attached example)
0001.xml.txt
This name is used to generate a FoLiA output file name.

A quick fix would be to use the name of the page file as a template for the output name.
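The quick fix could look roughly like this. `output_name` is a hypothetical helper, and the '.folia.xml' suffix is an assumption about the naming scheme, not taken from the FoLiA-page source.

```cpp
#include <cstddef>
#include <string>

// Derive an output file name: prefer the scan name from the metadata,
// fall back to the name of the PAGE file when the metadata has none.
std::string output_name(const std::string& metadata_name,
                        const std::string& page_file) {
  std::string base = metadata_name.empty() ? page_file : metadata_name;

  // Drop any leading directory part.
  const std::size_t slash = base.find_last_of('/');
  if (slash != std::string::npos) base = base.substr(slash + 1);

  // Drop the last extension, e.g. ".xml" or ".tif".
  const std::size_t dot = base.find_last_of('.');
  if (dot != std::string::npos && dot > 0) base = base.substr(0, dot);

  return base + ".folia.xml";  // assumed suffix
}
```

With an empty metadata name and a page file 'pages/0001.xml', this sketch returns '0001.folia.xml'.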

MacOS 10.12.6 linker error icu

/bin/sh ../libtool  --tag=CXX   --mode=link g++ -std=c++11 -W -Wall -pedantic -g -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/58.1/include -I/usr/include/libxml2  -g -O2 -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/58.1/include  -I/Users/antalb/Software/lamachine/lamachine/include   -o FoLiA-stats FoLiA-stats.o  -L/Users/antalb/Software/lamachine/lamachine/lib -lfolia -L/Users/antalb/Software/lamachine/lamachine/lib -lucto -L/Users/antalb/Software/lamachine/lamachine/lib -lticcutils -ltextcat -L/usr/local/Cellar/icu4c/58.1/lib  -licui18n -licuuc -licudata   -licuio  -lxml2 
libtool: link: g++ -std=c++11 -W -Wall -pedantic -g -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/58.1/include -I/usr/include/libxml2 -g -O2 -I/usr/include/libxml2 -I/usr/local/Cellar/icu4c/58.1/include -I/Users/antalb/Software/lamachine/lamachine/include -o FoLiA-stats FoLiA-stats.o -Wl,-bind_at_load  -L/Users/antalb/Software/lamachine/lamachine/lib -L/usr/lib -L/usr/local/lib -L/usr/local/Cellar/icu4c/58.1/lib /Users/antalb/Software/lamachine/lamachine/lib/libucto.dylib -L/usr/local/opt/icu4c/lib /Users/antalb/Software/lamachine/lamachine/lib/libfolia.dylib -lreadline /Users/antalb/Software/lamachine/lamachine/lib/libticcutils.dylib -lz -lbz2 -lboost_regex-mt -ltextcat -licui18n -licuuc -licudata -licuio -lxml2 -pthread
clang: warning: argument unused during compilation: '-pthread' [-Wunused-command-line-argument]
Undefined symbols for architecture x86_64:
  "icu_58::UnicodeString::toLower()", referenced from:
      doc_sent_word_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, 
std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > >, std::__1::allocator<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > >, std::__1::allocator<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, 
std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > > > > > > >&, unsigned int&, unsigned int&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      doc_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_text_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
  "icu_58::UnicodeString::~UnicodeString()", referenced from:
      doc_sent_word_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, 
std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > >, std::__1::allocator<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > >, std::__1::allocator<std::__1::multimap<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, rec, 
std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, rec> > > > > > > >&, unsigned int&, unsigned int&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      doc_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_text_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
  "icu_58::UnicodeString::operator=(icu_58::UnicodeString const&)", referenced from:
      doc_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_text_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
  "vtable for icu_58::UnicodeString", referenced from:
      doc_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_str_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
      par_text_inventory(folia::Document const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, int, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::vector<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, 
std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > >, std::__1::allocator<std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, unsigned int, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, unsigned int> > > > > > > >&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) in FoLiA-stats.o
  NOTE: a missing vtable usually means the first non-inline virtual member function has no definition.
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [FoLiA-stats] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
================ FATAL ERROR ==============
An error occurred during installation!!
foliautils make failed
===========================================
ip-145-116-169-111:lamachine antalb$ 

FoLiA-stats problem with n-grams?

@martinreynaert reports:
I recently ran into what looks like a backward-compatibility problem with FoLiA-stats: it now seemed impossible to obtain 2-gram and 3-gram lists from the SoNaR FoLiA files that are in distribution. Is that possible? To me, at least, it looks like a problem.
