masakhane-io / masakhane-ner

101.0 101.0 51.0 73.94 MB

License: Other

Python 41.81% Jupyter Notebook 54.08% Shell 4.11%

masakhane-ner's Issues

Repository license

The arXiv paper states that the research is released under the open CC-BY license.

Can the authors confirm that this code is open source and openly licensed?

I can then submit a PR that adds a GNU + CC-BY license.

Faulty full stop character in the Amharic dataset

Thank you for providing such a nice dataset. We are currently working on integrating it into GERBIL so that other researchers can use it more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses punctuation characters that are not common in other languages. The two important characters for this issue are the word separator ፡ and the full stop ።.

This is an excerpt of the dataset (dev.txt):

አምቦ B-LOC
ከዚህ O
በኋላ O
የቱሪዝም O
የባህል O
እና O
የፖለቲካ O
ማዕከል O
ትሆናለች O
፡፡ O

The last character should be a full stop, i.e., ።. However, in this example and in other sentences in the dataset, the last line comprises two word separators (2x፡). I think that this is a mistake and should be fixed within the dataset.

Proposed fix

Replace ፡፡ with ። in all three files of the Amharic dataset.
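The proposed replacement could be scripted roughly as follows. This is a minimal sketch, not the maintainers' tooling: it assumes the three files are `train.txt`, `dev.txt` and `test.txt` under `data/amh/` and that they are UTF-8 encoded. The function names are hypothetical.

```python
from pathlib import Path

WORD_SEPARATOR = "\u1361"  # ፡ ETHIOPIC WORDSPACE
FULL_STOP = "\u1362"       # ። ETHIOPIC FULL STOP


def fix_full_stops(text: str) -> str:
    """Replace every doubled word separator with a single full stop."""
    return text.replace(WORD_SEPARATOR * 2, FULL_STOP)


def fix_file(path: Path) -> int:
    """Rewrite one file in place and return the number of replacements made."""
    original = path.read_text(encoding="utf-8")
    count = original.count(WORD_SEPARATOR * 2)
    if count:
        path.write_text(fix_full_stops(original), encoding="utf-8")
    return count


if __name__ == "__main__":
    # Assumed layout: data/amh/{train,dev,test}.txt
    for split in ("train", "dev", "test"):
        path = Path("data/amh") / f"{split}.txt"
        if path.exists():
            print(path, fix_file(path))
```

A single word separator is left untouched, so this change does not interfere with any decision about standalone ፡ tokens.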

No dataset for Luo

Hi,

I found that there is no data for the Luo language in this repository, and it is not included on the Hugging Face page either. Could you also publish the data to make the dataset complete?

Many thanks!

Word separator character in the Amharic dataset

Problem description

The Amharic language uses the ፡ character to separate words from each other. This character is not used in the dataset, which looks like a reasonable decision, since it should be possible to add it automatically when reading a document from file. However, in some places a single word separator character does appear in the dataset, e.g., at https://github.com/masakhane-io/masakhane-ner/blob/main/data/amh/test.txt#L83. This seems to be wrong, since it makes the dataset harder to process. Either the word separator should be present between all words, or it should be omitted completely and left to the consumer of the dataset to add in the correct places.

Proposed fix

Either add the word separator everywhere the Amharic language would place it, OR remove all word separators and expect the consumer of the dataset to add them while loading the dataset.
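To see how often the separator occurs before choosing between the two options, a small scan may help. The sketch below assumes one token per line with the token in the first whitespace-separated column; the function names are hypothetical. It deliberately ignores the doubled form ፡፡, which is the separate full-stop issue.

```python
WORD_SEPARATOR = "\u1361"  # ፡ ETHIOPIC WORDSPACE


def has_stray_separator(token):
    """True if the token holds a separator that is not part of a doubled pair."""
    return WORD_SEPARATOR in token.replace(WORD_SEPARATOR * 2, "")


def find_stray_separators(lines):
    """Yield (line_number, token) for each token containing a lone separator."""
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if parts and has_stray_separator(parts[0]):
            yield number, parts[0]


def drop_standalone_separators(lines):
    """Return the lines without those whose token is exactly the separator."""
    return [line for line in lines if line.split()[:1] != [WORD_SEPARATOR]]
```

`drop_standalone_separators` corresponds to the second option (remove all separators). It only drops lines whose whole token is ፡; a separator attached to a word would still need a manual decision.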

Truncated results for XLM-R and mBERT

Hi,

It seems that some of the prediction files contain truncated sentences. For example, here is a long sentence in Hausa in the test file:

https://github.com/masakhane-io/masakhane-ner/blob/main/data/hau/test.txt#L6216

but this sentence is truncated in the XLM-R results:

https://github.com/masakhane-io/masakhane-ner/blob/main/entity_analysis/XLM-R/hau_xlmr_test_predictions.txt#L6216

Here's a similar result for mBERT:

Maybe you need to increase the maximum sequence length in whatever software you're using to be able to handle the whole sentences?
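One way to find every affected sentence, rather than spotting them by eye, is to compare sentence lengths between the gold file and the prediction file. The sketch below assumes both files hold one token per line with blank lines between sentences, and that sentences appear in the same order in both. The function names are hypothetical.

```python
def read_sentences(lines):
    """Group CoNLL-style lines into sentences, keeping only the token column."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if parts:
            current.append(parts[0])
        elif current:
            # A blank line closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def find_truncated(gold_lines, pred_lines):
    """Return (sentence_index, gold_length, predicted_length) for shortened sentences."""
    gold = read_sentences(gold_lines)
    pred = read_sentences(pred_lines)
    return [
        (i, len(g), len(p))
        for i, (g, p) in enumerate(zip(gold, pred))
        if len(p) < len(g)
    ]
```

Passing the open file objects for `data/hau/test.txt` and `entity_analysis/XLM-R/hau_xlmr_test_predictions.txt` would list the sentences whose predictions are shorter than the gold annotation. If the models were run with a fixed maximum sequence length counted in subword pieces, the longest sentences would be the ones expected to show up here.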

Right now the names of some files/directories are full language names (e.g. https://github.com/masakhane-io/masakhane-ner/tree/main/ner_data) and some are three-letter language codes (e.g. https://github.com/masakhane-io/masakhane-ner/tree/main/entity_analysis/XLM-R). Maybe they could be standardized to consistently use one or the other? I prefer language codes personally, but either would be fine.

Poorly formatted line(s?) in masakhaner datasets

Hello!

It seems that there is at least one poorly formatted line in the masakhaner-sna train split (the middle one below):

...
Hospital I-ORG
1487 Doctors'I-ORG
Association I-ORG
...

And one in the swa dataset:

. O
248 '
248 "

@dadelani seems like this should be fixed?
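A quick format check can surface lines like these across all splits. The sketch below assumes each non-blank line should consist of exactly one token and one label separated by whitespace, and that labels are either `O` or `B-`/`I-` followed by an uppercase type such as `ORG` or `LOC`. The pattern and function name are hypothetical.

```python
import re

# Assumed label shape: "O", or "B-"/"I-" followed by an uppercase entity type.
LABEL_PATTERN = re.compile(r"^(O|[BI]-[A-Z]+)$")


def find_malformed_lines(lines):
    """Return (line_number, line) for non-blank lines that are not 'token label'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # blank lines separate sentences
        if len(fields) != 2 or not LABEL_PATTERN.match(fields[1]):
            problems.append((number, line.rstrip("\n")))
    return problems
```

Both kinds of line reported above would be flagged: a token fused with its label (`Doctors'I-ORG`) and a quote character with no label at all.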
