glot500's People

Contributors

ayyoobimani, kargaranamir

glot500's Issues

Inconsistent columns in arrow files on Hugging Face datasets

tl;dr: some shards of the languages listed below (and potentially more) contain the extra column "__index_level_0__", so the dataset cannot be fully loaded.

Thanks for providing a potentially super cool dataset for multilingual NLP research!

While my request for access to the full Glot500-c is still awaiting processing, I thought I would try what's available on Hugging Face and quickly ran into the issue already documented here: https://huggingface.co/datasets/cis-lmu/Glot500/discussions/3
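
For reference, this is roughly how the failure shows up when loading one of the affected configs straight from the Hub (a minimal sketch; the exact error message depends on the datasets version):

from datasets import load_dataset

# Loading a config whose shards disagree on the "__index_level_0__" column
# fails while the train split is being generated.
ds = load_dataset("cis-lmu/Glot500", "afr_Latn", split="train")
# raises datasets.exceptions.DatasetGenerationError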

While I am still loading the rest of the dataset, I found that, within the train splits of the first 139 languages, the arrow files of

afr_Latn
amh_Ethi
ara_Arab
en_Latn
fra_Latn
hau_Latn
mlg_Latn
nya_Latn
sna_Latn
som_Latn
sot_Latn
swa_Latn
zul_Latn

have inconsistent column names. That is, some shards carry "__index_level_0__" as an extra column. The Python script below is slow, but it should eventually fix the problem.

# not mega pretty but gets the job done
from datasets import load_dataset, concatenate_datasets
from pathlib import Path
from datasets.exceptions import DatasetGenerationError

CWD = Path.cwd()  # run from inside the downloaded Glot500 folder
BACKUP = Path("../Glot500_backup")  # original shards are moved here before rewriting
if not BACKUP.exists():
    BACKUP.mkdir()
SPLIT = "train"
# language directories are named like afr_Latn, amh_Ethi, ...
langs = [p for p in CWD.glob("*") if p.is_dir() and "_" in str(p)]


def fix(lang: str, lang_split_dir: str, paths: list[Path]):
    """Rebuild one language split without the stray "__index_level_0__" column."""
    datasets = []
    original_dir = BACKUP / lang / SPLIT
    if not original_dir.exists():
        original_dir.mkdir(parents=True)

    # Move the original shards into the backup directory so the cleaned split
    # can be written back to the original location afterwards.
    for path in paths:
        new_path = original_dir.joinpath(path.name)
        path.rename(new_path)

    # Load each shard on its own so mismatched schemas cannot clash.
    for path in original_dir.glob("*.arrow"):
        datasets.append(
            load_dataset("arrow", data_files={"train": str(path)}, split="train")
        )
    # Drop the extra column wherever it appears.
    col = "__index_level_0__"
    datasets_ = []
    counter = 0
    for d in datasets:
        if col in d.features:
            d_ = d.remove_columns(col)
            counter += 1
        else:
            d_ = d
        datasets_.append(d_)
    print(f"Cleaned up {counter} shards for {SPLIT} of {lang}")
    # Concatenate the now-consistent shards and save them back to the split directory.
    dataset = concatenate_datasets(datasets_)
    dataset.save_to_disk(lang_split_dir)


datasets = {}
for i, lang in enumerate(langs):
    print(f"Processing {i}/{len(langs)}: {lang}")
    lang_train = lang / "train"
    lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
    try:
        # Loading all shards at once fails when their schemas disagree.
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
    except DatasetGenerationError:
        print(f"Fixing {lang}")
        fix(lang.stem, str(lang_train), list(lang_train.glob("*.arrow")))
        # Retry with the cleaned shards written by fix().
        lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )

Proposed fix: I suppose it would be relatively straightforward for you to run a variant of the above script and re-upload a fully loadable dataset.
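
As a sanity check after the script has rewritten a language's train split, the cleaned split can be reloaded directly (a sketch; the path below is just an example of what lang_split_dir looks like for one language):

from datasets import load_from_disk

# save_to_disk() output can be reloaded with load_from_disk(); the extra
# column should be gone and all shards should share one schema.
ds = load_from_disk("afr_Latn/train")  # example path for one fixed language
assert "__index_level_0__" not in ds.features
print(ds)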

I would also highly appreciate getting full access to the dataset :)

Thanks a lot in advance!

Access to Glot500-c

Hello,

I came across your work through a virtual talk by Prof. Schütze and found it to be a valuable resource. I'm particularly interested in the Glot500-c (Glot500 corpus) data.

At the moment, your README mentions that access to the corpus will be given after filling out an online form, and that the form will be available soon. Is there a tentative date for the release of the form? The multilingual corpus would greatly assist my research group in our research on language models.

Thank you for maintaining such a helpful repository; your work is greatly appreciated 😊

NER

How to reproduce the NER evaluation?

tel_Telu data appears to be missing on HF Glot500

Hi there!

Now that I have access to the full Glot500-c and have started my experiments, I realized that the publicly available Glot500 on HF does not seem to include the 41,580,525 tel_Telu sentences (or even a large subset thereof).
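
A quick way to verify is to count the rows of the public config (a sketch; if the config is missing from the Hub entirely, this call will simply fail):

from datasets import load_dataset

# Compare the public row count against the 41,580,525 sentences reported for tel_Telu.
ds = load_dataset("cis-lmu/Glot500", "tel_Telu", split="train")
print(len(ds))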

glot500-large

Are there plans to train a large glot500 model?
Thanks for your work so far!

Dataset not deduplicated

I've re-trained a Mistral (~Llama) language-specific tokenizer on the training portion of the Yoruba samples and noticed strange tokens. As an example:

{"Ìròyìn▁tó▁ṣe▁kókó▁Àbámọ̀▁ni▁yóò▁gbẹ̀yin▁ẹgbẹ́▁Association▁of▁Stingy▁Men▁tí▁kò▁fẹ́▁náwó▁fóbìnrin-▁Akeugbagoldwákàtí▁9▁sẹ́yìn▁Gbọ́,▁Ìṣẹ́jú▁kan▁BBC▁07:00▁UTCwákàtí▁kan▁sẹ́yìn▁Wo▁ohun▁tí▁a▁mọ̀▁nípa▁gbèdéke▁ti▁Sunday▁Igboho▁fún▁àwọn▁Fulani▁ní▁Ibarapa▁àti▁èsì▁tí▁wọ́n▁fún▁unwákàtí▁3▁sẹ́yìn▁Ìwádìí▁kíkún▁lóríi▁kókó▁ìròyìn▁Amad▁Diallo▁darapọ̀▁mọ́▁Manchester▁United8▁Sẹ́rẹ́▁2021▁Èsíò!": 29494}

This token occurs 1832 times in the training split (counted with rg $STRING yoruba_textified.txt | wc -l), each time in an ever so slightly different context (i.e., near duplicates).
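
For context, the tokenizer was retrained along these lines (my own sketch, not the exact command; the base checkpoint and vocabulary size are assumptions):

from datasets import load_dataset
from transformers import AutoTokenizer

# Retrain an existing fast tokenizer's vocabulary on the Yoruba train split.
# Lines that recur as near-identical duplicates can end up merged into single,
# very long tokens like the one above.
base = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed checkpoint
dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")


def text_batches(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


tokenizer = base.train_new_from_iterator(text_batches(), vocab_size=32000)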

I therefore checked for duplicates in the dataset and found abundant hard duplicates: the 4.5M lines reduce to 1.16M unique lines. I understand that datasets for low-resource languages are noisy, but I presume users expect hard duplicates not to occur.

To reproduce:

from datasets import load_dataset
from collections import Counter
import numpy as np
import pandas as pd

# Count how often each line occurs in the yor_Latn train split.
dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")
counter = Counter(dataset["text"])
c = sorted(counter.items(), key=lambda item: item[1])
_, counts = zip(*c)
counts = np.array(counts)
print(
    pd.DataFrame(counts)
    .round(0)
    .describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
    .T
)
# original count is 4.5M
#        count     mean       std  min  10%  25%  50%  75%  90%  95%  99%   max
#    1167327.0  3.84903  1.379304  2.0  2.0  2.0  5.0  5.0  5.0  5.0  5.0  10.0

I have briefly checked the train splits of some other languages, which also contain duplicates to varying degrees.

kin_Latn: nearly correct

Original length:  415405
Deduplicated length:  401856
      count      mean       std  min  10%  25%  50%  75%  90%  95%  99%  max
0  401856.0  1.033716  0.180498  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0

uzb_Latn: OK

Original length:  3182175
Deduplicated length:  3182175
       count  mean  std  min  10%  25%  50%  75%  90%  95%  99%  max
0  3182175.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

ibo_Latn: bad

Original length:  5608630
Deduplicated length:  1526812
       count      mean       std  min  10%  25%  50%  75%  90%  95%  99%   max
0  1526812.0  3.673425  0.739383  2.0  2.0  4.0  4.0  4.0  4.0  4.0  4.0  20.0

wol_Latn: nearly correct

Original length:  92358
Deduplicated length:  92357
     count      mean       std  min  10%  25%  50%  75%  90%  95%  99%  max
0  92357.0  1.000011  0.003291  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0
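
For completeness, here is a sketch of the per-language check behind the numbers above (the helper name is mine, not from the repository):

from collections import Counter

from datasets import load_dataset


def duplication_stats(lang, split="train"):
    # Compare the raw line count with the number of unique lines for one config.
    dataset = load_dataset("cis-lmu/Glot500", lang, split=split)
    counter = Counter(dataset["text"])
    print("Original length: ", len(dataset))
    print("Deduplicated length: ", len(counter))


for lang in ["kin_Latn", "uzb_Latn", "ibo_Latn", "wol_Latn"]:
    duplication_stats(lang)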
