... the ultimate solution for everything: frequency lists
the point: a unified approach based on frequency lists
- 🔥 create plan: how to tackle each cleanness aspect using frequency lists:
  (1) fragments; (2) foreign text; (3) spelling; (4) dedup
- 🔥 implementation based on book-index/freqlists.py:
  (1) take the script; (2) correct it (identify problems); (3) split it into several scripts?; (4) solve all problems with it :)
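The frequency-list core is simple enough to sketch. Below is a minimal illustration, assuming plain one-sentence-per-line text input; the function names are hypothetical, the real logic lives in book-index/freqlists.py:

```python
# Hypothetical sketch of the frequency-list step; the actual
# implementation is book-index/freqlists.py.
from collections import Counter

def build_freqlist(filename, lower=True):
    """Return (word, count) pairs sorted by descending frequency."""
    counter = Counter()
    with open(filename, encoding='utf-8') as f:
        for line in f:
            words = line.split()  # "split by space", cf. tokenization note below
            if lower:
                words = [w.lower() for w in words]
            counter.update(words)
    return counter.most_common()
```

The rank of a word is then simply its 1-based position in this sorted list.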
command: `make sample FILENAME=? SAMPLESIZE=100000 RANDOMSEED=42`
Is there too much foreign text in the corpus?
first attempt: based on the frequency of the most frequent words of certain languages (see the sketch after the results)
command: `scripts/investigate_foreign.sh FILENAME 42`
results:

language | MagyarSzo | KiadokAkademiai | arcj_teljes | MNSZ_nowp |
---|---|---|---|---|
English | very few | some | very few | quite a lot |
German | very few | some | very few | very few |

legend: *very few* = foreign words only at rank > 5000; *some* = rank between 1000 and 5000; *quite a lot* = rank < 1000 (rank in the sample's frequency list)
The result is stable: RANDOMSEED=42 and RANDOMSEED=43 give essentially the same picture.
Looking at the concordance of *the* in MNSZ, it may be that even this much English text is not too much, because most hits are part of small English excerpts! Hm..
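For reference, a hedged sketch of what the rank-based check might boil down to. The 1000/5000 thresholds follow the legend above, but the word lists and everything else are illustrative assumptions, not the actual contents of scripts/investigate_foreign.sh:

```python
# Hypothetical sketch of the rank-based foreign-text check.
# Word lists and logic are assumptions, not the actual script.
ENGLISH_TOP = ['the', 'and', 'of', 'to', 'in']
GERMAN_TOP = ['der', 'die', 'und', 'den', 'von']

def classify(freqlist, foreign_words):
    """Map the best rank of any foreign word in the sample's
    frequency list to the very few / some / quite a lot scale."""
    ranks = {w: r for r, (w, _) in enumerate(freqlist, start=1)}
    best = min((ranks[w] for w in foreign_words if w in ranks), default=None)
    if best is None or best > 5000:
        return 'very few'
    if best > 1000:
        return 'some'
    return 'quite a lot'
```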
command: `scripts/investigate_spelling.sh FILENAME 42`
results:

 | MagyarSzo | KiadokAkademiai | arcj_teljes | MNSZ_nowp |
---|---|---|---|---|
es/ugy/tobb/jo/ev % | 0.13% | 0.10% | 0.09% | 1.06% |
The result is stable: RANDOMSEED=42 and RANDOMSEED=43 give essentially the same figures.
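The row label suggests the measure is the share of the accent-less variants (es, ugy, tobb, jo, ev) among all occurrences of these five very frequent words. A minimal sketch under that assumption; the exact formula in scripts/investigate_spelling.sh may differ:

```python
# Hypothetical sketch of the accent-loss measure. The word pairs
# follow the table's row label; the formula is an assumption.
PAIRS = [('es', 'és'), ('ugy', 'úgy'), ('tobb', 'több'),
         ('jo', 'jó'), ('ev', 'év')]

def accentless_ratio(counts):
    """counts: dict word -> frequency. Return the percentage of
    accent-less variants among all occurrences of the pairs."""
    bad = sum(counts.get(plain, 0) for plain, _ in PAIRS)
    good = sum(counts.get(accented, 0) for _, accented in PAIRS)
    total = bad + good
    return 100.0 * bad / total if total else 0.0
```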
Idea: investigate MNSZ2 by subcorpora?
implementation: https://github.com/sassbalint/utils/blob/main/random_sampler.py (original source)
files (from arcj):

name | rows | words | size |
---|---|---|---|
MagyarSzo_10percent | 5.8 M | 57 M | 0.4 G |
MagyarSzo | 57.6 M | 627 M | 4.0 G |
arcj_teljes | 1067.0 M | 10650 M | 67.0 G |
command: `time make sample FILENAME=? SAMPLESIZE=?`
results (average of 3 measurements, in seconds):

samplesize | MagyarSzo_10percent | MagyarSzo | arcj_teljes |
---|---|---|---|
10 | 0 | 0 | 0 |
10000 | 0 | 0 | 0 |
1000000 | 5 | 6 | 8 |
10000000 | 37 | 48 | 90 |
- ➡️ time complexity: corpus size hardly matters (logarithmic?), samplesize grows sublinearly ➡️ extra efficient stuff! :)
- maybe slower the first time, while the corpus is being loaded into memory
- there can be a limit if the corpus is larger than the memory -- 🔥 to be tested
- con: streaming input is not OK, the whole file is needed on disk
- equally sized lines are best for this algorithm -- a solution can be splitting long sentences into equally sized chunks (see the sketches below)
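These properties (whole file needed on disk, corpus size hardly mattering, equal-sized lines preferred) are all consistent with seeking to random byte offsets instead of scanning the file. A minimal sketch of that idea, as an assumption about how random_sampler.py might achieve this, not a description of its actual code:

```python
# Hypothetical sketch of seek-based line sampling: jump to a random
# byte offset, discard the (likely partial) line, take the next full
# line. No full scan, hence corpus size hardly matters. A line is hit
# with probability proportional to its byte length, which is exactly
# why equally sized lines give the least biased sample.
import os
import random

def sample_lines(filename, samplesize, seed=42):
    random.seed(seed)
    size = os.path.getsize(filename)
    sample = []
    with open(filename, 'rb') as f:
        while len(sample) < samplesize:
            f.seek(random.randrange(size))
            f.readline()         # skip to the next line boundary
            line = f.readline()  # take a full line
            if line:
                sample.append(line.decode('utf-8', errors='replace'))
    return sample
```

Note that this naive version samples with replacement; the real tool may well handle duplicates differently.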
based on the cleanness concept of the OTKA grant proposal, source:
(1) meaningful = no fragmented words, word fragments, or meaningless strings;
(2) Hungarian = no multi-sentence foreign-language passages;
(3) spelling is OK = accents, capitalization, and punctuation are OK (normative orthography is not a requirement);
(4) deduplicated = no identical multi-sentence passages (see the sketch below).
(could boilerplate also be handled via dedup?)
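For (4), a hedged sketch of one possible dedup approach: hash sliding windows of sentences and skip a passage when its window has already been seen. Window size, hashing, and all names here are illustrative assumptions:

```python
# Hypothetical sketch: drop identical multi-sentence passages by
# hashing sliding sentence windows. Window size is an assumption;
# hashing just keeps memory low compared to storing the chunks.
import hashlib

def dedup_passages(sentences, window=3):
    """Yield sentences, skipping window-sized passages seen before."""
    seen = set()
    i = 0
    while i < len(sentences):
        chunk = '\n'.join(sentences[i:i + window])
        h = hashlib.md5(chunk.encode('utf-8')).hexdigest()
        if h in seen:
            i += window  # skip the whole duplicated passage
        else:
            seen.add(h)
            yield sentences[i]
            i += 1
```

Boilerplate repeated across documents would be caught the same way, which is the idea behind the question above.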
tokenization: tok or split by space
sentence splitting: sentsplit or split by ≈100 chars or ≈7 words
➡️ spl (= the split-based options)
➡️ base unit = "sentence" (see the chunking sketch below)
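A minimal sketch of the "≈7 words" chunking option from the note above; apart from the chunk size, the function is a hypothetical illustration:

```python
# Hypothetical sketch: split a line into roughly equal chunks of
# ~7 words each, to serve as the base "sentence" unit (and to keep
# line sizes even for the seek-based sampler).
def split_to_chunks(line, chunk_words=7):
    words = line.split()
    return [' '.join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]

# >>> split_to_chunks('a b c d e f g h i j')
# ['a b c d e f g', 'h i j']
```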