We are still having problems with sentence alignment failures. Are we timing out?


Sentence Alignment Failures about gov-za-multilingual HOT 8 CLOSED

vukosim commented on August 15, 2024

Sentence Alignment Failures

from gov-za-multilingual.

Comments (8)

lastrucci01 commented on August 15, 2024

Update

I ran the code on a larger subset. In some rare cases where the tokenised files to be aligned are large, the index realignment fails, which results in an infinite loop.
I added a conditional that checks whether the updated indices are the same as the previous ones; if they are, the loop breaks and moves on. I guess that's a place for future optimisations.
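A minimal sketch of that kind of guard, not the actual repository code. The names `realign_until_stable` and `max_steps` are hypothetical, and `update_indices` is passed in as a plain callable so the example stays self-contained:

```python
from typing import Callable, Tuple

Indices = Tuple[int, int]  # (source index, target index)


def realign_until_stable(
    start: Indices,
    update_indices: Callable[[Indices], Indices],
    max_steps: int = 1000,  # hypothetical hard cap as a second safety net
) -> Indices:
    """Apply update_indices repeatedly, stopping once it makes no progress."""
    current = start
    for _ in range(max_steps):
        updated = update_indices(current)
        if updated == current:
            # Same indices as last time: break instead of looping forever.
            break
        current = updated
    return current
```

The equality check is what prevents the infinite loop; the step cap only covers the case where the indices keep changing without ever settling.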

I am running the code on the entire dataset. When that is finished, I will look to automate it back into the action and make a PR, and then we can close this issue 🥳


lastrucci01 commented on August 15, 2024

Update

Possible Problem

The action is timing out. I think the code is either computing redundant embeddings or trying to align more than just the most recent statements.

Attempted solution

I attempted to optimise the code, but I didn't write it and it is a bit confusing to decipher, so I opted to refactor instead. I am running some refactored code on a subset of the data and hoping to see similar results.
The only setback is that I have to run it on a lab computer, so I can't do it from home.

Pending the results, I will then code it to run on the most recent cabinet statements, test the GitHub action and hopefully close this issue :)


lastrucci01 commented on August 15, 2024

Update

Poor results

I have modularised the code for gov-za but I am getting shocking results. I tried fixing the preprocessing, then re-reviewed the original code and found some preprocessing steps outside the preprocess function, which I attempted to incorporate, but the results are still bad...

Plan

I think I am trying to fix the symptoms and just making things worse. A major setback is that I can only do testing on the lab computers due to the fairseq module requirements.
On Wednesday morning when I am on campus I will try to recode the sentence alignment function from the ground up, which I feel is better and will be more efficient than trying to debug and change existing code.


lastrucci01 commented on August 15, 2024

Update

Getting somewhere!

I redid the preprocessing yesterday, doing manual find & replace with regex until I got some really clean data. However, it seems the NLTK tokeniser treats titles like Mr. & Mrs. as the end of a sentence.
The same is true for email addresses and web links.

The sentence alignment performs really well until it comes across a sentence cut short by a title, weblink or email, as shown by line 19 of the following image.

I think I should remove the email addresses and weblinks from the data entirely; I don't see much purpose in machine translating an email xD
For the person titles, I will likely have to create a function for each language that searches for 'Mr.' and replaces it with 'Mr' to stop the sentence from being cut short. The same techniques should be applied to Vukuzenzele as well.
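A rough sketch of the title fix for English only, assuming a hand-written title list. `TITLES` and `strip_title_periods` are hypothetical names, and each language would need its own list:

```python
import re

# Hypothetical, English-only list; other languages need their own titles.
TITLES = ("Mr", "Mrs", "Ms", "Dr", "Prof")

# Match a listed title at a word boundary, followed by a literal period.
_TITLE_RE = re.compile(r"\b(" + "|".join(TITLES) + r")\.")


def strip_title_periods(text: str) -> str:
    """Drop the period after known titles so it is not read as a sentence end."""
    return _TITLE_RE.sub(r"\1", text)


print(strip_title_periods("Mr. Dlamini met Mrs. Nkosi."))
```

Running this before sentence tokenisation leaves the real sentence-final period in place while removing the ones attached to titles.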


vukosim commented on August 15, 2024

I agree with removing the emails and URLs.

Might it be a good idea to instead replace them with a token, like EMAILTOKEN and URLTOKEN?
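One way the token replacement could look, as a sketch only. The two regular expressions are deliberately simple assumptions rather than full email or URL grammars, and `mask_contacts` is a hypothetical name:

```python
import re

# Simplified patterns (assumptions): good enough for typical addresses and links.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# The final character class keeps trailing punctuation out of the match,
# so a sentence-ending period after a link survives.
URL_RE = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?)]", re.IGNORECASE)


def mask_contacts(text: str) -> str:
    """Replace email addresses and web links with placeholder tokens."""
    text = EMAIL_RE.sub("EMAILTOKEN", text)
    text = URL_RE.sub("URLTOKEN", text)
    return text


print(mask_contacts("Email info@gcis.gov.za or visit https://www.gov.za/news."))
```

Because the tokens contain no periods, they should no longer split a sentence in two, and both sides of a parallel pair keep a matching placeholder.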


lastrucci01 commented on August 15, 2024

Update

I have nearly perfected the preprocessing, but the alignment still drops off significantly 10 or so sentences in.

I am working on redoing the alignment function.
I have an idea where you compare a source sentence to a small list of possible target sentences and align the pairs with a cosine score above a certain threshold. I have coded it and at first glance the results look really good, but upon further inspection it is aligning sentences that have a high cosine score but are not semantic equivalents.

I need to work on a way to effectively iterate over the same section in both texts so that the possible target sentences contain the exact match.
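The windowed comparison described above can be sketched roughly as follows. This assumes sentence embeddings have already been computed elsewhere and are passed in as plain lists of floats; `best_in_window`, `window` and the default `threshold` value are hypothetical:

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_in_window(
    src_vec: Vector,
    tgt_vecs: Sequence[Vector],
    start: int,
    window: int = 5,         # assumed number of candidate target sentences
    threshold: float = 0.6,  # placeholder cut-off
) -> Optional[Tuple[int, float]]:
    """Return (target index, score) of the best candidate above threshold, else None."""
    best = None
    for j in range(start, min(start + window, len(tgt_vecs))):
        score = cosine(src_vec, tgt_vecs[j])
        if score >= threshold and (best is None or score > best[1]):
            best = (j, score)
    return best
```

As the comment notes, a high score only helps if the true match is actually inside the window, so the choice of `start` matters as much as the threshold.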


lastrucci01 commented on August 15, 2024

Update

The following screenshot brings me joy.

So the alignment function tries to align src_sentences[i] with tgt_sentences[j]. Most of the files start out really strong, but sometimes the target sentence for i is at some j+y. So if the score drops below 0.6, I run an update_indices function that explores the scores around i+x & j+y, returns the indices with the best score, and continues aligning from there, which is kind of like starting from the start of a file again, hooray!
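A simplified sketch of that recovery step, not the repository's implementation. The function name `update_indices` and the 0.6 cut-off come from the comment above; the `score` callable, the `radius` parameter and the `align` wrapper are assumptions made so the example runs on its own:

```python
from typing import Callable, List, Tuple

Score = Callable[[int, int], float]  # similarity of src_sentences[i] and tgt_sentences[j]


def update_indices(
    i: int, j: int, n_src: int, n_tgt: int, score: Score, radius: int = 3
) -> Tuple[int, int, float]:
    """Search offsets (i+x, j+y) with 0 <= x, y <= radius and return the best-scoring pair."""
    best = (i, j, score(i, j))
    for x in range(radius + 1):
        for y in range(radius + 1):
            ni, nj = i + x, j + y
            if ni >= n_src or nj >= n_tgt:
                continue  # offset runs past the end of a file
            s = score(ni, nj)
            if s > best[2]:  # strict comparison keeps the nearest pair on ties
                best = (ni, nj, s)
    return best


def align(
    n_src: int, n_tgt: int, score: Score, threshold: float = 0.6, radius: int = 3
) -> List[Tuple[int, int]]:
    """Walk both files in step, re-synchronising whenever the score drops below threshold."""
    pairs = []
    i = j = 0
    while i < n_src and j < n_tgt:
        s = score(i, j)
        if s < threshold:
            i, j, s = update_indices(i, j, n_src, n_tgt, score, radius)
        if s >= threshold:
            pairs.append((i, j))  # only pairs at or above the threshold are kept
        i += 1
        j += 1  # always advance, so the loop terminates
    return pairs
```

In this version both indices advance on every pass, so a failed search skips one pair rather than stalling.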

There are potentially more optimisations, but I am happy for now.

Will run the new code on a larger subset of the data and report back :)


lastrucci01 commented on August 15, 2024

Update

So after running on the whole dataset we did lose some observations, but I think the quality might be better because the algorithm doesn't write observations with a similarity below 0.6. I will keep both old & new datasets in the repo and label them accordingly.

I tested out the GitHub action today and it ran successfully :)

I am rerunning the script one last time as I think I might've created duplicates when I was testing.

When it's done I will make a PR and move on to updating Vukuzenzele.
