
Comments (8)

lastrucci01 commented on August 15, 2024

Update

I ran the code on a larger subset. In some rare cases where the tokenised files to be aligned are large, the index realignment fails, which results in an infinite loop.
I added a conditional that checks whether the updated indices are the same as the previous ones; if they are, the loop breaks and moves on. I guess that's a place for future optimisations.
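A minimal sketch of the guard described above. The function name `align_file`, its signature, and the `score` / `update_indices` callables are hypothetical stand-ins, not the repository's actual code; only the 0.6 threshold and the "break if the indices did not change" idea come from this thread.

```python
from typing import Callable, List, Tuple


def align_file(
    score: Callable[[int, int], float],
    n_src: int,
    n_tgt: int,
    update_indices: Callable[[int, int], Tuple[int, int]],
    threshold: float = 0.6,
) -> List[Tuple[int, int]]:
    """Walk both files in step, realigning when the score drops.

    `score(i, j)` and `update_indices(i, j)` are assumed helpers
    supplied by the caller (hypothetical signatures).
    """
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n_src and j < n_tgt:
        if score(i, j) >= threshold:
            pairs.append((i, j))
            i, j = i + 1, j + 1
            continue
        new_i, new_j = update_indices(i, j)
        if (new_i, new_j) == (i, j):
            # Realignment made no progress: give up on this file
            # instead of looping forever.
            break
        i, j = new_i, new_j
    return pairs
```

Note that this only catches the case where realignment returns the same indices; it would not catch indices that oscillate between two positions, which is one candidate for the future optimisations mentioned above.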

I am running the code on the entire dataset. When that is finished I will look to automate it back into the action, make a PR, and we can close this issue 🥳

from gov-za-multilingual.

lastrucci01 commented on August 15, 2024

Update

Possible Problem

The action is timing out. I think the code is either doing redundant embeddings or trying to align more than just the most recent statements.

Attempted solution

I attempted to optimise the code, but I didn't write it and it is a bit confusing to decipher, so I opted to refactor instead. I am running some refactored code on a subset of the data and hoping to see similar results.
The only setback is that I have to run it on a lab computer, so I can't do it from home.

Pending the results, I will then code it to run on the most recent cab statements, test the GitHub action and hopefully close this issue :)


lastrucci01 commented on August 15, 2024

Update

Poor results

I have modularised the code for gov-za but I am getting shocking results. I tried fixing the preprocessing, then re-reviewed the original code and found some preprocessing steps outside the preprocess function. I attempted to incorporate them, but the results are still bad...

Plan

I think I am trying to fix the symptoms and just making things worse. A major setback is that I can only do testing on the lab computers due to the fairseq module requirements.
On Wednesday morning when I am on campus I will try to recode the sentence alignment function from the ground up, which I feel is better and will be more efficient than trying to debug and change the existing code.


lastrucci01 commented on August 15, 2024

Update

Getting somewhere!

I redid the preprocessing yesterday, doing manual find & replace with regex until I got some really clean data. However, it seems the nltk tokeniser treats titles like Mr. & Mrs. as the end of a sentence.
The same is true for email addresses and web links.

The sentence alignment performs really well until it comes across a sentence cut short by a title, web link or email, as shown by line 19 of the following image.
IMG_9919

I think I should remove the email addresses and web links from the data entirely; I don't see much purpose in machine translating an email xD
For the person titles, I will likely have to create a function for each language that searches for 'Mr.' and replaces it with 'Mr' to stop the sentence from being cut short. The same techniques should be applied to Vukuzenzele as well.
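A rough sketch of what such a cleaning step could look like, assuming a regex-based approach. The function name `strip_sentence_breakers`, the title list, and the email/URL patterns are all illustrative assumptions rather than the code used in the repo; a real title list would be needed per language.

```python
import re

# Hypothetical, English-only list of titles; each language would need its own.
ENGLISH_TITLES = ("Mr", "Mrs", "Ms", "Dr", "Prof")

# Deliberately simple patterns (assumptions); real data may need stricter ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def strip_sentence_breakers(text, titles=ENGLISH_TITLES):
    """Drop emails and URLs, and remove the full stop after known titles."""
    title_re = re.compile(r"\b(" + "|".join(map(re.escape, titles)) + r")\.")
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = title_re.sub(r"\1", text)  # 'Mr.' -> 'Mr'
    # Collapse the double spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


print(strip_sentence_breakers("Mr. Smith met Mrs. Jones."))
```

The URL pattern here swallows any punctuation attached to the end of a link, including a sentence-final full stop, so it may need tightening before being run over the full dataset.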


vukosim commented on August 15, 2024

I agree with removing the emails and URLs.

Might it be a good idea to instead replace them with a token, like EMAILTOKEN and URLTOKEN?
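For illustration, the token idea could be sketched like this. The regex patterns and the function name `mask_contacts` are assumptions; only the token names EMAILTOKEN and URLTOKEN come from the suggestion above.

```python
import re

# Simple illustrative patterns (assumptions), not exhaustive matchers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def mask_contacts(text):
    """Replace emails and URLs with placeholder tokens."""
    # Emails first, so a domain inside an address is not seen as a URL.
    text = EMAIL_RE.sub("EMAILTOKEN", text)
    return URL_RE.sub("URLTOKEN", text)


print(mask_contacts("Email info@gov.za or see https://www.gov.za/news"))
```

Because the placeholder tokens contain no full stops, they should not give the sentence tokeniser a reason to split mid-sentence, while still keeping the sentence structure intact on both sides of the alignment.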


lastrucci01 commented on August 15, 2024

Update

I have nearly perfected the preprocessing, but the alignment still drops off significantly 10 or so sentences in.

I am working on redoing the alignment function:
I have an idea where you compare a source sentence to a small list of possible target sentences and align the sentences with a cosine score above a certain threshold. I have coded it and I can get really good results, but upon further inspection it is aligning sentences that have a high cosine score but are not semantically equivalent.

I need to work on a way to effectively iterate over the same section in both texts so that the possible target sentences contain the exact match.
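A minimal sketch of the windowed comparison idea, using plain Python lists as stand-in embeddings. The function names, the window size of 5, and the 0.6 default are assumptions for illustration; the real sentence vectors would come from whatever encoder the pipeline uses.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_in_window(
    src_vec: Sequence[float],
    tgt_vecs: List[Sequence[float]],
    start: int,
    window: int = 5,        # hypothetical candidate-list size
    threshold: float = 0.6,  # hypothetical cut-off
) -> Optional[Tuple[int, float]]:
    """Return (target index, score) of the best candidate in
    tgt_vecs[start:start + window], or None if none reaches the threshold."""
    best: Optional[Tuple[int, float]] = None
    for j in range(start, min(start + window, len(tgt_vecs))):
        s = cosine(src_vec, tgt_vecs[j])
        if s >= threshold and (best is None or s > best[1]):
            best = (j, s)
    return best
```

This only works if the window actually contains the true match, which is the problem described above: `start` has to track the same section in both texts, otherwise the best-scoring candidate is just the least bad wrong sentence.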


lastrucci01 commented on August 15, 2024

Update

The following screenshot brings me joy.
image

So the alignment function tries to align src_sentences[i] with tgt_sentences[j]. Most of the files start out really strong, but sometimes the matching target sentence for i is at some j+y. So if the score drops below 0.6, I run an update_indices function that explores the scores around i+x and j+y and returns the indices with the best score. Alignment then continues from there, which is kind of like starting from the start of a file again, hooray!
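One way the realignment search could look, as a sketch under assumptions: the name `update_indices` is taken from the comment above, but the signature, the `score` callable, and the `max_offset` search radius are hypothetical.

```python
from typing import Callable, Tuple


def update_indices(
    score: Callable[[int, int], float],
    i: int,
    j: int,
    n_src: int,
    n_tgt: int,
    max_offset: int = 3,  # hypothetical search radius
) -> Tuple[int, int, float]:
    """Search offsets (x, y) ahead of (i, j) and return the best-scoring pair.

    `score(i, j)` is an assumed helper returning the similarity of
    src_sentences[i] and tgt_sentences[j].
    """
    best_i, best_j, best_score = i, j, score(i, j)
    for x in range(max_offset + 1):
        for y in range(max_offset + 1):
            si, tj = i + x, j + y
            if si >= n_src or tj >= n_tgt:
                continue  # stay inside both files
            s = score(si, tj)
            if s > best_score:  # strict: ties keep the earliest pair
                best_i, best_j, best_score = si, tj, s
    return best_i, best_j, best_score
```

If nothing nearby scores higher, the function returns the original indices, which is exactly the situation where the calling loop needs its own guard against spinning forever.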

There are potentially more optimisations, but I am happy for now.

Will run the new code on a larger subset of the data and report back :)


lastrucci01 commented on August 15, 2024

Update

So after running on the whole dataset we did lose some observations, but I think the quality might be better because the algorithm doesn't write observations with a similarity lower than 0.6. I will keep both old & new datasets in the repo and label them accordingly.

I tested out the GitHub action today and it ran successfully :)

I am rerunning the script one last time as I think I might've created duplicates when I was testing.
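As a sketch of the two filters mentioned in this comment (the 0.6 cut-off and the duplicate clean-up), assuming each observation is a `(source, target, score)` tuple. That row layout and the function name `filter_pairs` are hypothetical, not the dataset's actual schema.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float]  # (source sentence, target sentence, score)


def filter_pairs(rows: Iterable[Row], threshold: float = 0.6) -> List[Row]:
    """Keep rows at or above the threshold, dropping repeated sentence pairs."""
    seen = set()
    kept: List[Row] = []
    for src, tgt, score in rows:
        if score < threshold:
            continue  # below the similarity cut-off: not written
        key = (src, tgt)
        if key in seen:
            continue  # duplicate pair, e.g. from an earlier test run
        seen.add(key)
        kept.append((src, tgt, score))
    return kept
```

Deduplicating on the sentence pair rather than on the whole row means a pair that was written twice with slightly different scores is still only kept once.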

When it's done I will make a PR and move on to updating Vukuzenzele.

