techiaith / brawddegau-tagiedig

Corpws o frawddegau CC0 mewn fformat jsonl, gyda rhannau ymadrodd y tocynnau (geiriau etc.) wedi'u tagio â thagiau Universal Dependencies. | A Corpus of CC0 sentences in the jsonl format, tagged with Universal Dependency part-of-speech tags.

License: Creative Commons Zero v1.0 Universal

data annotated welsh nlp cc0 commonvoice

Corpws-Brawddegau-Tagiedig

(see Corpus of Tagged Sentences below for English)

Dyma gorpws cychwynnol o frawddegau CC0 Cymraeg mewn fformat jsonl, gyda rhannau ymadrodd y tocynnau (geiriau etc.) wedi'u tagio â thagiau Universal Dependencies.

Fersiwn alffa o'r corpws yw hwn - bydd yn cynyddu o ran ei faint ac yn gwella o ran ei gywirdeb wrth i broject Iriaith fynd yn ei flaen.

Pwrpas y corpws hwn yw galluogi hyfforddi tagiwr ystadegol Cymraeg ar sail dulliau dysgu peirianyddol.

Ar hyn o bryd mae'n cynnwys 79,442 o 'frawddegau' hyfforddi tagiedig, sef 3,345 brawddeg unigryw gyflawn ac 76,097 o eiriau unigol er mwyn atgyfnerthu geirfa'r model (1% yn fwy cywir). Mae metadata pob brawddeg yn glwm wrth ei gofnod jsonl.

Mae'r data yn cynnwys brawddegau cyffredinol a awdurwyd gan aelodau'r Uned, yn ogystal â sgyrsiau ar-lein, cyfieithiadau o straeon ac erthyglau newyddiadurol CC0, trydariadau ac erthyglau gwyddoniadurol a gyfranwyd gan eu hawduron o dan drwydded CC0.

Ym mis Gorffennaf 2022 ychwanegwyd 9,109 brawddeg tagiedig ychwanegol yn y ffeil brawddegau_tagiedig_cc0_covost_ltu.jsonl. Mae'r brawddegau hyn yn gyfieithiadau i'r Gymraeg gan Facebook o frawddegau Saesneg Common Voice fel rhan o CoVOST2 (https://ai.facebook.com/blog/covost-v2-expanding-the-largest-most-diverse-multilingual-speech-to-text-translation-data-set/). Yn dilyn archwiliad ieithyddol, barnwyd gan ein ieithyddion eu bod o safon digonol i'w cynnwys yma.

Ym Mawrth 2024 ychwanegwyd detholiad o'r brawddegau byr mwyaf cyffredin yn y Gymraeg a brawddegau adnabod endidau.

Gyda'r data 79k brawddeg cynnar, rydym eisoes wedi llwyddo i hyfforddi tagiwr cychwynnol gyda chywirdeb o dros 91% ar destun a gasglwyd ar hap ac nad oedd yn y data hyfforddi.

Mae model parod y tagiwr hwnnw ar gael yma:

https://github.com/techiaith/model-tagiwr-spacy-cy

Bwriadwyd y ffeil o frawddegau tagiedig hon ar gyfer ei ddefnyddio gyda spaCy, un o'r llyfrgelloedd NLP cyfoes amlycaf.

I ddefnyddio'r data i hyfforddi model spaCy, yn gyntaf defnyddiwch y gorchymyn convert i drosi'r ffeil:

https://spacy.io/api/cli#convert

Yna defnyddiwch y gorchymyn train i hyfforddi'r model:

https://spacy.io/api/cli#train

Ar hyn o bryd, bydd angen i chi osod ffolder iaith lang Cymraeg yn spaCy i hyn weithio. Mae ffolder ddrafft a chyfarwyddiadau ar sut i'w osod yn spaCy ar gael gennym yma:

https://github.com/techiaith/spacy-cy-lang

Byddwn yn cyfrannu'r ffeiliau hyn i spaCy yn y dyfodol agos er mwyn gwneud i ffwrdd â'r angen am y cam hwnnw.

Ariannwyd y gwaith hwn gan Lywodraeth Cymru.

Corpus of Tagged Sentences

This is an initial release of a corpus of Welsh CC0 sentences in jsonl format, in which the tokens (words etc.) are tagged with Universal Dependencies part-of-speech tags.

As this is an alpha version of the corpus, it will increase in size and accuracy as the Iriaith project progresses.

The purpose of this corpus is to enable the creation of statistical Welsh part-of-speech taggers based on machine learning methods.

The corpus currently contains 79,442 tagged training 'sentences': 3,345 complete unique sentences and 76,097 single-word entries included to reinforce the trained model's vocabulary (this improves accuracy by 1%). The metadata for each sentence is attached to its jsonl record.
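
The split between complete sentences and single-word entries can be checked with a short script. The following is a minimal sketch, not the project's own tooling. It assumes each jsonl line is a JSON object holding a list of tokens under a `tokens` key; that field name is hypothetical, so inspect a real record and adjust `TOKENS_KEY` before relying on the counts.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical field name: the corpus schema is not documented in this README,
# so check a real record and change this if needed.
TOKENS_KEY = "tokens"


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_entries(records: Iterable[dict]) -> Dict[str, int]:
    """Count single-token entries, multi-token sentences, and anything else."""
    counts = {"single_word": 0, "sentence": 0, "other": 0}
    for record in records:
        n_tokens = len(record.get(TOKENS_KEY, []))
        if n_tokens == 1:
            counts["single_word"] += 1
        elif n_tokens > 1:
            counts["sentence"] += 1
        else:
            counts["other"] += 1  # missing or empty token list
    return counts


# Tiny in-memory sample standing in for a corpus file (invented records).
sample = [
    '{"text": "Bore da", "tokens": [{"orth": "Bore"}, {"orth": "da"}]}',
    '{"text": "ci", "tokens": [{"orth": "ci"}]}',
]
print(count_entries(read_jsonl(sample)))
```

For a real file, pass an open file handle instead of the sample list, for example `count_entries(read_jsonl(open(path, encoding="utf-8")))`.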

The data includes general sentences authored by the Unit's staff, as well as online chats, translations of CC0 stories and newspaper articles, tweets, and encyclopaedic articles contributed by their authors under a CC0 licence.

In July 2022, an additional 9,109 tagged sentences were added in the file brawddegau_tagiedig_cc0_covost_ltu.jsonl. These sentences are Welsh translations, produced by Facebook, of English Common Voice sentences that form part of CoVOST2 (https://ai.facebook.com/blog/covost-v2-expanding-the-largest-most-diverse-multilingual-speech-to-text-translation-data-set/). Following a linguistic review, our linguists judged them to be of sufficient quality to include here.

In March 2024, a selection of the most common short sentences in Welsh was added, along with sentences for named entity recognition.

Using the early 79k-sentence data, we have already succeeded in training an initial tagger with over 91% accuracy on randomly collected text that was not part of the training data.

That pretrained model is available here:

https://github.com/techiaith/model-tagiwr-spacy-cy
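
The README does not say how the accuracy figure quoted above was computed. A common choice for part-of-speech taggers is token-level accuracy: the share of tokens whose predicted tag matches the gold tag. The sketch below illustrates that measure under this assumption; the function name and the sample tags are invented for the example.

```python
from typing import List, Sequence


def tag_accuracy(gold: Sequence[Sequence[str]], predicted: Sequence[Sequence[str]]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Each argument is a list of sentences, and each sentence is a list of tags.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("tokenisation mismatch between gold and predicted")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0


# Invented example: five tokens, one of them tagged wrongly.
gold: List[List[str]] = [["NOUN", "ADJ"], ["VERB", "PRON", "NOUN"]]
pred: List[List[str]] = [["NOUN", "ADJ"], ["VERB", "DET", "NOUN"]]
print(f"{tag_accuracy(gold, pred):.2f}")  # 4 of 5 tokens match
```

The evaluation text must be held apart from the training data, as the description above specifies, or the score will overstate how well the tagger handles unseen text.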

This file of training sentences was intended for use with spaCy, one of the foremost modern NLP libraries.

To use the data to train a spaCy model, first use the convert command to convert the file:

https://spacy.io/api/cli#convert

Then use the train command to train the model:

https://spacy.io/api/cli#train

Currently, you will need to install a Welsh lang folder in spaCy for this to work. We've provided a draft folder and installation instructions here:

https://github.com/techiaith/spacy-cy-lang

We will be contributing these files to spaCy in the near future so that this step will no longer be required.
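
The two steps above can be sketched as command lines built in Python. This assumes the spaCy v2-style CLI, where `convert` takes an input file and an output directory, and `train` takes a language code, an output directory, and training and development paths as positional arguments. spaCy v3 replaced that `train` interface with a config file, so check the linked CLI documentation for the installed version. The directory names and the development file are hypothetical, and the name of the converted file is an assumption about what `convert` writes.

```python
import subprocess
import sys
from typing import List


def convert_command(input_file: str, output_dir: str) -> List[str]:
    """Build the `spacy convert` call (positional arguments only)."""
    return [sys.executable, "-m", "spacy", "convert", input_file, output_dir]


def train_command(lang: str, output_dir: str, train_path: str, dev_path: str) -> List[str]:
    """Build a spaCy v2-style `spacy train` call that trains only the tagger."""
    return [
        sys.executable, "-m", "spacy", "train",
        lang, output_dir, train_path, dev_path,
        "--pipeline", "tagger",
    ]


def run(command: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


# The input file name comes from this README; every other path is hypothetical.
convert_cmd = convert_command("brawddegau_tagiedig_cc0_covost_ltu.jsonl", "converted")
train_cmd = train_command(
    "cy",                                               # Welsh; needs the lang folder installed
    "model",                                            # hypothetical output directory
    "converted/brawddegau_tagiedig_cc0_covost_ltu.json",  # assumed name of the converted file
    "converted/dev.json",                               # hypothetical held-out development set
)
print(" ".join(convert_cmd))
print(" ".join(train_cmd))
```

Nothing is executed until `run(convert_cmd)` and `run(train_cmd)` are called, so the printed commands can be reviewed and adjusted first. The corpus does not ship a development split, so one has to be prepared separately.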

This work was funded by the Welsh Government.

Contributors

gruffprys, grw20blt, stefanoghazzali
