masakhane-io / masakhane-ner
License: Other
The arXiv paper states that the research is released under the open CC-BY license.
Can the authors confirm that this code is also open source and under an open license?
I can then submit a PR adding a GNU + CC-BY license.
Thank you for providing such a nice dataset. We are currently working on integrating it into GERBIL to enable other researchers to use it more easily. However, while working with the Amharic dataset, we encountered a severe issue.
The Amharic language uses punctuation characters that are not common in other languages. The two characters relevant to this issue are the word separator ፡ and the full stop ።.
This is an excerpt of the dataset (dev.txt):
አምቦ B-LOC
ከዚህ O
በኋላ O
የቱሪዝም O
የባህል O
እና O
የፖለቲካ O
ማዕከል O
ትሆናለች O
፡፡ O
The last character should be a full stop, i.e., ።. However, in this example and in other sentences in the dataset, the last line consists of two word separators (2x ፡). I think that this is a mistake and should be fixed within the dataset.
Replace ፡፡ with ። in all three files of the Amharic dataset.
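The proposed replacement could be sketched in Python as follows. This is only a minimal sketch, not the maintainers' tooling: the helper names `fix_full_stops` and `fix_file` are hypothetical, and it assumes the files are UTF-8 encoded and that the characters involved are U+1361 (ETHIOPIC WORDSPACE) and U+1362 (ETHIOPIC FULL STOP).

```python
from pathlib import Path

WORD_SEPARATOR = "\u1361"  # ፡ ETHIOPIC WORDSPACE
FULL_STOP = "\u1362"       # ። ETHIOPIC FULL STOP


def fix_full_stops(text: str) -> str:
    """Replace every doubled word separator with a single full stop."""
    return text.replace(WORD_SEPARATOR * 2, FULL_STOP)


def fix_file(path: Path) -> int:
    """Rewrite one file in place; return how many replacements were made.

    Assumption: the file is UTF-8 encoded.
    """
    original = path.read_text(encoding="utf-8")
    count = original.count(WORD_SEPARATOR * 2)
    if count:
        path.write_text(fix_full_stops(original), encoding="utf-8")
    return count
```

It could then be applied with something like `fix_file(Path("data/amh/dev.txt"))`, once per file; the exact file paths are an assumption based on the links quoted in these issues.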
Hi,
I found that there is no data for the Luo language in this repository, and it is not included on the Hugging Face page either. Could you also publish that data to make the dataset complete?
Many thanks!
Thank you for providing such a nice dataset. We are currently working on integrating it into GERBIL to enable other researchers to use it more easily. However, while working with the Amharic dataset, we encountered a severe issue.
The Amharic language uses the ፡ character to separate words from each other. This character is not used in the dataset, which looks like a reasonable decision, since it can be added automatically when reading a document from file. However, in some places a single word separator character does appear within the dataset, e.g., at https://github.com/masakhane-io/masakhane-ner/blob/main/data/amh/test.txt#L83. This seems to be wrong, since it makes the dataset harder to process. Either the word separator should be present between all words, or it should be omitted completely and left to the consumer of the dataset to add in the correct places.
Either add the word separator in all places where the Amharic language would put it, OR remove all word separators and expect the consumer of the dataset to add them while loading the dataset.
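For the second option, a rough sketch of how stray separator tokens could be located and dropped is shown below. It assumes the whitespace-separated "token tag" layout seen in the excerpts of these issues and that the separator is U+1361; the function names are hypothetical and not part of the repository.

```python
WORD_SEPARATOR = "\u1361"  # ፡ ETHIOPIC WORDSPACE


def is_separator_line(line: str) -> bool:
    """True if the token column of this line is a lone word separator."""
    parts = line.split()
    return bool(parts) and parts[0] == WORD_SEPARATOR


def find_separator_tokens(lines):
    """Return (line_number, line) pairs for lone word-separator tokens."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if is_separator_line(line)
    ]


def drop_separator_tokens(lines):
    """Return the lines with lone word-separator tokens removed."""
    return [line for line in lines if not is_separator_line(line)]
```

Running `find_separator_tokens` first gives a list that can be reviewed by hand before anything is removed.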
Hi,
It seems that some of the prediction files truncate long sentences. For example, here is a long sentence in Hausa in the test file:
but this sentence is truncated in the XLM-R results:
Here's a similar result for mBERT:
Maybe you need to increase the maximum sequence length in whatever software you're using to be able to handle the whole sentences?
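One way to find every affected sentence is to compare token counts between the test file and a prediction file. The sketch below is an assumption-heavy illustration: it supposes both files use one whitespace-separated token per line with blank lines between sentences, and that sentences appear in the same order in both files. The function names are hypothetical.

```python
def read_sentences(lines):
    """Group token lines into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(parts[0])  # keep only the token column
    if current:
        sentences.append(current)
    return sentences


def find_truncated(gold_lines, pred_lines):
    """Return (sentence_index, gold_length, predicted_length) for each
    sentence whose prediction has fewer tokens than the gold sentence."""
    gold = read_sentences(gold_lines)
    pred = read_sentences(pred_lines)
    return [
        (index, len(g), len(p))
        for index, (g, p) in enumerate(zip(gold, pred))
        if len(p) < len(g)
    ]
```

The reported gold lengths would also give a lower bound on the number of words the maximum sequence length has to accommodate, although subword tokenization usually yields more pieces than words.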
Definition: what does this code do?
Install: required dependencies and the commands to install them.
Run: a demo run command.
Contribute: how to add a language.
License.
Thanks a lot for this project. 🙏🏼 African languages need more of those.
Are any of these languages related to Pular? My neighbors only speak Pular so this would be a game-changer to be able to converse with them.
Thanks for your consideration!
Right now the names of some files/directories are full language names (e.g. https://github.com/masakhane-io/masakhane-ner/tree/main/ner_data) and some are three-letter language codes (e.g. https://github.com/masakhane-io/masakhane-ner/tree/main/entity_analysis/XLM-R). Maybe they could be standardized to consistently use one or the other? I prefer language codes personally, but either would be fine.
Hello!
It seems that there is at least one poorly formatted line in the masakhaner-sna train split (the middle one below):
...
Hospital I-ORG
1487 Doctors'I-ORG
Association I-ORG
...
And one in the swa dataset:
. O
248 '
248 "
@dadelani seems like this should be fixed?
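Lines like these can be caught automatically with a small format check. The following is a minimal sketch, assuming each non-blank line should hold exactly two whitespace-separated columns (token and tag) and that tags follow the BIO pattern seen in the excerpts above (`O`, `B-LOC`, `I-ORG`); the function name and the tag pattern are assumptions, not part of the repository.

```python
import re

# Assumed tag shape: "O", or "B-"/"I-" followed by an upper-case entity type.
TAG_PATTERN = re.compile(r"O|[BI]-[A-Z]+")


def find_malformed_lines(lines):
    """Return (line_number, line) pairs that are not 'token tag' lines."""
    bad = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # blank lines separate sentences
        if len(parts) != 2 or not TAG_PATTERN.fullmatch(parts[1]):
            bad.append((number, line.strip()))
    return bad
```

Running such a check over every split of every language before a release would flag both the missing space in the sna example and the tagless quote characters in the swa example.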