Comments (8)
Update
Ran the code on a larger subset. In some rare cases where the tokenised files to be aligned are large, the index realignment fails, which results in an infinite loop.
I added a conditional that checks whether the updated indices are the same as the previous ones; if they are, the loop breaks and moves on. I guess that's a place for future optimisations.
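A minimal sketch of that guard, not the actual repo code. The names `walk_alignment`, `score` and the callable signature of `update_indices` are assumptions made for illustration; only the 0.6 threshold and the "same indices, so break" check come from the updates in this thread.

```python
from typing import Callable, List, Tuple


def walk_alignment(
    n_src: int,
    n_tgt: int,
    score: Callable[[int, int], float],
    update_indices: Callable[[int, int], Tuple[int, int]],
    threshold: float = 0.6,
) -> List[Tuple[int, int]]:
    """Walk both files in step, realigning when the score drops.

    `score(i, j)` and `update_indices(i, j)` are hypothetical callables
    standing in for the real similarity and realignment functions.
    """
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n_src and j < n_tgt:
        if score(i, j) >= threshold:
            pairs.append((i, j))
            i, j = i + 1, j + 1
            continue
        new_i, new_j = update_indices(i, j)
        if (new_i, new_j) == (i, j):
            # Realignment made no progress: stop instead of looping forever.
            break
        i, j = new_i, new_j
    return pairs
```

Note that this only catches the case where realignment returns exactly the same indices. If a realignment function could ever move backwards, a cycle between two positions would still need a separate check, such as a set of visited index pairs.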
I am running the code on the entire dataset. When that is finished I will look to automate it back into the action, make a PR and we can close this issue 🥳
from gov-za-multilingual.
Update
Possible Problem
The action is timing out. I think the code is either computing redundant embeddings or trying to align more than just the most recent statements.
Attempted solution
I attempted to optimise the code, but I didn't write it and it is a bit confusing to decipher, so I opted to refactor instead. I am running some refactored code on a subset of the data and hoping to see similar results.
The only setback is that I have to run it on a lab computer, so I can't do it from home.
Pending the results, I will then code it to run on the most recent cabinet statements, test the GitHub action and hopefully close this issue :)
from gov-za-multilingual.
Update
Poor results
I have modularised the code for gov-za but I am getting shocking results. I tried fixing the preprocessing, then re-reviewed the original code and found some preprocessing steps outside the preprocess function. I attempted to incorporate them, but the results are still bad...
Plan
I think I am trying to fix the symptoms but I am just making things worse. A major setback is that I can only do testing on the lab computers due to the fairseq module requirements.
On Wednesday morning when I am on campus I will try to recode the sentence alignment function from the ground up, which I feel is better and will be more efficient than trying to debug and change existing code.
from gov-za-multilingual.
Update
Getting somewhere!
I redid the preprocessing yesterday, doing manual find & replace with regex until I got some really clean data. However, it seems the nltk tokeniser thinks titles like Mr. & Mrs. are the end of sentences.
The same is true for email addresses and web links.
The sentence alignment performs really well until it comes across a sentence cut short by a title, weblink or email, as shown by line 19 of the following image.
I think I should remove the email addresses and weblinks from the data entirely; I don't see much purpose in machine translating an email xD
For the person titles, I will likely have to create a function for each language that searches for 'Mr.' and replaces it with 'Mr' to stop the sentence from being cut short. The same techniques should be applied to Vukuzenzele as well.
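The title replacement described above could look roughly like this. It is only a sketch: the `TITLES` dictionary and the function name `strip_title_periods` are made up for illustration, and the English list is an assumption; each language would need its own list.

```python
import re
from typing import Iterable

# Hypothetical per-language title lists; only "Mr" and "Mrs" come from
# the discussion, the rest are illustrative.
TITLES = {
    "en": ["Mr", "Mrs", "Ms", "Dr", "Prof"],
}


def strip_title_periods(text: str, titles: Iterable[str]) -> str:
    """Replace 'Mr.' with 'Mr' (and so on) so the sentence tokeniser
    does not treat the title's period as a sentence boundary."""
    alternatives = "|".join(re.escape(t) for t in titles)
    pattern = re.compile(r"\b(" + alternatives + r")\.")
    return pattern.sub(r"\1", text)


cleaned = strip_title_periods("Mr. Smith met Mrs. Jones.", TITLES["en"])
```

Running this before sentence tokenisation leaves the real sentence-final period in place and only drops the one attached to a title.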
from gov-za-multilingual.
I agree with removing the emails and URLs.
Might it be a good idea to instead replace them with a token, like EMAILTOKEN and URLTOKEN?
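As a rough illustration of the token idea, something like the following could work. The regular expressions are deliberately simple assumptions, not a complete email or URL grammar, and `mask_contacts` is a hypothetical name.

```python
import re

# Simplified patterns (assumptions): good enough for typical addresses
# and links in running text, not a full RFC-compliant matcher.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# The final character class keeps trailing punctuation (e.g. a
# sentence-ending period) out of the match.
URL_RE = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?)]")


def mask_contacts(text: str) -> str:
    """Replace emails and URLs with placeholder tokens."""
    text = EMAIL_RE.sub("EMAILTOKEN", text)
    text = URL_RE.sub("URLTOKEN", text)
    return text
```

Keeping a placeholder means the sentence keeps its shape in every language, which may help the alignment scores, and the sentence-final period survives for the tokeniser.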
from gov-za-multilingual.
Update
I have nearly perfected the preprocessing but the alignment still drops off significantly 10 or so sentences in.
I am working on redoing the alignment function:
I have an idea where you compare a source sentence to a small list of possible target sentences and align the sentences with a cosine score above a certain threshold. I have coded it and I can get really good results, but upon further inspection it is aligning sentences that have a high cosine score but are not semantic equivalents.
I need to work on a way to effectively iterate over the same section in both texts so that the possible target sentences contain the exact match.
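A minimal sketch of the candidate-window idea, assuming the sentence embeddings are already available as plain lists of floats. The function names, the window size of 5 and the default threshold are illustrative assumptions, and this does not solve the semantic-equivalence problem described above; it only narrows the candidates and keeps the single best one.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_in_window(
    src_vec: Sequence[float],
    tgt_vecs: List[Sequence[float]],
    start: int,
    window: int = 5,
    threshold: float = 0.6,
) -> Optional[Tuple[int, float]]:
    """Return (target index, score) of the best candidate in
    tgt_vecs[start:start + window] scoring at least `threshold`,
    or None if no candidate qualifies."""
    best: Optional[Tuple[int, float]] = None
    for j in range(start, min(start + window, len(tgt_vecs))):
        s = cosine(src_vec, tgt_vecs[j])
        if s >= threshold and (best is None or s > best[1]):
            best = (j, s)
    return best
```

Taking only the best candidate in the window, instead of every candidate above the threshold, avoids writing several targets for one source sentence, but it still depends on the window actually containing the true match.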
from gov-za-multilingual.
Update
The following screenshot brings me joy.
So the alignment function tries to align src_sentences[i] with tgt_sentences[j]. Most of the files start out really strong, but sometimes the tgt_sentence for i is at some j+y, so if the score drops below 0.6 I run an update_indices function that explores the scores around i+x & j+y and returns the indices with the best score. Alignment then continues from there, which is kind of like starting from the start of a file again, hooray!
There are potentially more optimisations, but I am happy for now.
Will run the new code on a larger subset of the data and report back :)
from gov-za-multilingual.
Update
So after running on the whole dataset we did lose some observations, but I think the quality might be better because the algorithm doesn't write observations with a similarity lower than 0.6. I will keep both old & new datasets in the repo and label them accordingly.
I tested out the GitHub action today and it ran successfully :)
I am rerunning the script one last time as I think I might've created duplicates when I was testing.
When it's done I will make a PR and move on to updating Vukuzenzele.
from gov-za-multilingual.