ishefi / semantle-he

A Hebrew version of Semantle.

License: Other

Python 50.55% HTML 23.66% JavaScript 17.99% CSS 6.68% Procfile 0.06% Dockerfile 0.61% Mako 0.44%

hebrew nlp redis semantle word2vec

semantle-he's Issues

Show average guesses for yesterday's word

Bug: `NaN` solver count

Steps to reproduce:

Guess the right word.
Guess another word.
Reload.

Spelling mistake in faq.html

faq.html
line 77
<p> ת: קוד מקור?</p>
The 'ת' (answer) needs to be changed to 'ש' (question), because this line is a question.

The game used to refresh (new secret word) at 2:00am (Israel time, GMT+3).
Since the switch to DST (Daylight Saving Time / שעון קיץ), it now refreshes at 3:00am.
If possible (and accepted), I suggest changing the refresh time to midnight, 12:00am (in line with all the Wordle variations), or at least back to 2:00am.

Retrain W2V, use YAP before training

https://nlp.biu.ac.il/~rtsarfaty/onlp/hebrew/about?fbclid=IwAR2tbepIcHh8M5rcQNgu8lx_vfuGIBWQrYrdeCaVqoRayCe4WolPPx-rFwc

Suggestion for improvement

If someone solved Semantle in fewer than 20 guesses, it was probably a lucky guess, so players who succeed too quickly could be given a new puzzle with a random word.

marking progression milestones (WARNING: #31 spoiler)

Hi,
I suggest marking breakthrough words, that is, guesses that were the closest to the target at the time they were first tried.
This might help in advancing toward the goal, and once the goal is reached, it will provide a nice view of the road to victory.
I'm attaching an example; be aware that it might spoil today's riddle (#31, although so far I haven't solved the damn thing).

Stops working when using special characters

Steps to reproduce:

Guess a word with special characters (for example ^).
Try to guess other words.

Collaborate across the web (feature request)

It would be nice to be able to share guesses with one or more collaborators on the web, so that each one can guess words and see the others' results.
I have no experience with such interfaces, so I only have a vague idea of how this could be implemented (which may not be realistic), and I realize that it will require numerous additions. I think it would be better not to keep any data on the server that scales with the number of players. If each player who connects to the server has a unique ID, the web page in the player's browser can keep a list of the player IDs with which the guesses will be shared.

Share incomplete guess (feature request)

Great application, thank you!
Would be nice to be able to share the results of incomplete guesses in an analogous format to that of a complete guess.
Often the secret word is difficult, and I would have liked to share how close I got. For example, when the word was גלגלת (pulley), my top guess was ידית (handle), at 999/1000. A possible text could be:
לא פתרתי היום את סמנטעל #99. לאחר 731 ניחושים הגעתי ל-999/1000: ("I did not solve Semantle #99 today. After 731 guesses I reached 999/1000:")
https://semantle-he.herokuapp.com

Yesterday's word + list of 1000 closest words stopped appearing

It disappeared a few days ago, the day after the horrible "איש" list that made no sense.

Personalized Word Embeddings from Game Statistics

Hi Itamar,

My name is Itay Nakash, and I'm an MSc student studying natural language processing at Technion. I find the game you developed very intriguing, and I believe its statistics could offer valuable insights for creating personalized word embeddings.

If you're interested in utilizing this platform and people's responses from the game, I'd love to collaborate with you on this project, at any scale you prefer.

It will require some changes in the code to collect the statistics, and some NLP work to try to match the new word embeddings.
In addition, I believe that framing this game as an NLP task, with a significant dataset and a new task/goal that utilizes this data, could be a great contribution to the community.

Before I begin implementing and developing the idea, I would like to check with you whether you are open to integrating it into the platform, given your background as an NLP researcher.

Thank you,
Itay

Migrate data to MongoDB

Use free mongodb server: https://www.mongodb.com/database/free
We can still use SQLAlchemy: https://www.cdata.com/kb/tech/mongodb-python-sqlalchemy.rst

Migrate the data to mongo
Remove word2vec.db from git: https://www.deployhq.com/git/faqs/removing-large-files-from-git-history
Remove git LFS buildpack from heroku

Refresh page when UTC date changes

To prevent mixup of yesterday's and today's guesses

Timing side channel attack

In handlers.py:133 there are the following lines:

if api_key != request.app.state.api_key:
    raise HTTPException(status_code=status.HTTP_403_FORBIDDEN)

This piece of code is vulnerable to a timing side-channel attack and should be replaced with the constant-time comparison function hmac.compare_digest.
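A minimal sketch of the suggested replacement, using the standard library's hmac.compare_digest. The helper name api_key_matches is hypothetical and not part of the repository; both keys are encoded to bytes because compare_digest only accepts ASCII-only str values.

```python
import hmac


def api_key_matches(provided: str, expected: str) -> bool:
    """Compare two API keys without short-circuiting on the first mismatch.

    Hypothetical helper: the name and signature are assumptions,
    not code taken from handlers.py.
    """
    # Encode to bytes so that non-ASCII input does not raise TypeError.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


# Intended use inside the handler quoted above (sketch only):
#     if not api_key_matches(api_key, request.app.state.api_key):
#         raise HTTPException(status_code=status.HTTP_403_FORBIDDEN)
```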

Add positional encodings for better consistency?

Something like what was done in BERT. The standard gensim Word2Vec does not take the order of words in a sentence into consideration.
Adding positional encodings might improve consistency by giving some weight to word order.

Missing word2vec.db

The readme says the db (word2vec.db) is part of the repo but it's not there

Apostrophes in words make them distinct when they aren't

Adding an apostrophe (or apostrophes) anywhere in a recognizable word will be treated as a distinct word, but will have the same closeness value as the word without the apostrophes.

For example, all of the following words were accepted as distinct words, and they all had the exact same closeness value:
צבע
צבע'
'צבע
צ'בע
צב'ע
צב'ע'
צ''''בע

More correct behavior would probably be to either reject those words or not count them as distinct from the original.

Augment data using english language

Just a thought about how you could improve the precision and help your model better understand semantic relationships between words. Why not just use English? You could add a translation layer before generating the embedding of each guess. Assuming that Hebrew-to-English translation is reliable, you would benefit from the abundance of work that has been done on English word2vec or any other word-embedding technique. :)

Dealing with plene/deficient spelling

A couple of days ago the solution was "דעה". I guessed "דיעה" and it got only 996/1000 (66.54).

The same word in plene spelling (כתיב מלא) and in deficient spelling (כתיב חסר) should generate the same similarity ranking.
I thought of 2 possible solutions for this:
1 - Standardize the words (guesses) - turn all plene spelled words to deficient spelling or the other way around (just like the English version of the game automatically turns all the words to lower case and British to American spelling).
2 - Reject one form of spelling (plene / deficient).

Deal with apostrophe

Right now, words such as ג'קט are not accepted by the algorithm, while גקט is.

We should either:

Migrate the data properly to include apostrophes
Sanitize user input and remove apostrophes before querying the db
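The second option (sanitizing input) could look roughly like the sketch below. The function name strip_apostrophes and the exact set of characters treated as apostrophes are assumptions for illustration; the real database lookup is not shown.

```python
# Characters treated as apostrophes here (an assumption, extend as needed):
# ASCII apostrophe, right single quotation mark, and Hebrew geresh.
APOSTROPHES = "'\u2019\u05f3"

# Translation table that deletes every apostrophe-like character.
_STRIP_TABLE = str.maketrans("", "", APOSTROPHES)


def strip_apostrophes(word: str) -> str:
    """Return the word with all apostrophe-like characters removed."""
    return word.translate(_STRIP_TABLE)


# Several spellings of the same guess collapse to a single key, so they
# would no longer be counted as distinct words.
variants = ["צבע", "צבע'", "'צבע", "צ'בע", "צ''''בע"]
normalized = {strip_apostrophes(v) for v in variants}
```

With this normalization, ג'קט would be looked up as גקט, which matches how the issue describes the current data.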

Share does not represent the course of the game (WARNING: #48 spoiler)

After finishing the #48 semantle, I got the following share text:
פתרתי את סמנטעל #48 ב־70 ניחושים!
https://semantle-he.herokuapp.com
🟩🟩🟩🟩🟩 70 (1000/1000)
🟩🟩🟩🟩⬜ 67 (991/1000)
🟩🟩🟩🟩⬜ 50 (894/1000)
🟩🟩⬜⬜⬜ 52 (570/1000)
⬜⬜⬜⬜⬜ 49
⬜⬜⬜⬜⬜ 40

However, the game I played looked like this:

Is this maybe a bug?

Use a different data set, other than the Israeli Wikipedia?

Israeli Wikipedia is very uneven in its coverage of words. Is it the best freely-available Hebrew data set out there? What about ynet news archive, for example?

Phrases with more than one word are never accepted

The How To Play page says that a guess can be "מילה או ביטוי קצר" ("a word or a short phrase"). To date, I have not found a guess of more than one word that was accepted. Examples: ראש ממשלה (prime minister), עמוד שדרה (spine), קרוב משפחה (relative).

I haven't looked at the code but I suspect there is no mechanism to ever add these kinds of words to the database. In that case, the How To Play text should be updated.

Generating the word2vec db

Hi,
Could you provide more detailed instructions on how to generate the word2vec db? E.g., at which step, and how, to use the HebPipe you mentioned in the FAQ.
Thanks

let users see past secrets

we can use /secrets page, and change current behavior to require both API key and future=true

solved

Rules page appears below the videos in the videos page

When clicking the question mark on the videos page, the rules <div> appears below the video <div>. One possible solution is to add position and z-index to both elements, the video <div> and the rules <div>.

semantle.py command line does not work

python scripts/semantle.py
  File "/Users/[email protected]/IdeaProjects/semantle-he/scripts/semantle.py", line 21
    secret = await logic.secret_logic.get_secret()
             ^
SyntaxError: 'await' outside async function
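The traceback points at an await used at module level. A minimal sketch of the usual fix is to move the awaiting code into a coroutine and start it with asyncio.run. The get_secret function below is a hypothetical stand-in for logic.secret_logic.get_secret, whose real implementation is not shown in this issue.

```python
import asyncio


async def get_secret() -> str:
    """Hypothetical stand-in for logic.secret_logic.get_secret()."""
    await asyncio.sleep(0)  # placeholder for the real asynchronous lookup
    return "secret"


async def main() -> str:
    # 'await' is legal here because main() is a coroutine function.
    secret = await get_secret()
    return secret


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```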

Negative similarity score

🤷‍♂️

Statistics

Any chance of getting some statistics about the game?
How often do people succeed, and in how many guesses?
How does the score change over the course of guessing?

Unable to reproduce model

Hi,

I've been playing around with Word2Vec and the model linked here, and I can't seem to reproduce the same distances.

For example:

Python 3.11.2 (main, Feb 12 2023, 00:48:52) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> model = gensim.models.Word2Vec.load('./wiki_tokenized_model/model.mdl')
>>> model.wv.similar_by_word('אשליה')
[('אשליית', 0.7949888110160828), ('אשלייתי', 0.7358855605125427), ('תחושה', 0.7196317911148071), ('סימולקרה', 0.7147767543792725), ('מתעתעת', 0.7013854384422302), ('השתקפות', 0.6864952445030212), ('אסטרלית', 0.6836147308349609), ('אשלייתית', 0.6831943392753601), ('אילוזיה', 0.6829365491867065), ('סיראנית', 0.6813762784004211)]

Note the distances.

However, the distance Semantle gives is different:

Am I doing anything wrong? I'd love some feedback!

remove numpy from production requirements

It is a large package, so removing it will reduce RAM usage. It shouldn't be too hard.

Missing word

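The game said it does not recognize the word וניל (vanilla); my guess is that it may have been split into ו + ניל.
On Wikipedia, words are written with niqqud (vowel points), if I'm not mistaken, so this could be taken into account either in the word-splitting process or in general.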

On mobile devices, textbox should be wider

It's too narrow, which is inconvenient for long words. For example, the word I wrote there was אתנומוזיקולוגיה:

"Give Up" button

The text on the button should be ״נכנעתי״ ("I gave up").

It should appear after GIVEUP_THRESH (env var) good guesses (i.e., guesses that are not nonexistent words).

Share to telegram with spoiler

It would be nice to share my solve story with the actual guesses I tried. To avoid spoilers, the "spoiler" markdown in Telegram can be used. Since this feature is available only on Telegram, a separate share button should be added.

The message should look like this:

and after clicking a spoiler:

The Telegram markdown for a spoiler is a leading and trailing '||'.
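A rough sketch of how the spoiler-wrapped share text might be built, assuming Telegram's MarkdownV2 syntax, where reserved characters inside the text must be escaped with a backslash. The function names are hypothetical, and the layout of the final message is only an illustration, not the format shown in the missing screenshots.

```python
import re

# Characters that MarkdownV2 requires to be escaped, plus the backslash itself.
_MDV2_SPECIAL = re.compile(r"([_*\[\]()~`>#+\-=|{}.!\\])")


def escape_markdown_v2(text: str) -> str:
    """Prefix every reserved MarkdownV2 character with a backslash."""
    return _MDV2_SPECIAL.sub(r"\\\1", text)


def spoiler(text: str) -> str:
    """Wrap text in leading and trailing '||' so Telegram hides it until tapped."""
    return f"||{escape_markdown_v2(text)}||"


def share_lines(guesses: list[str]) -> str:
    """One spoiler-wrapped guess per line (illustrative layout only)."""
    return "\n".join(spoiler(g) for g in guesses)
```

For example, share_lines(["בננה", "חולי"]) yields two lines, each wrapped in '||'.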

proper names overrunning top word list

For the word of 20220320, which was "קרקס", what seems to be the majority of the close words were proper names of people and fictional characters. Such words should generally not appear in the word list in the first place. As removing them may be an annoying task (I suspect there should be an easy way to filter them reasonably well with the pipeline), it is at least worth verifying, when selecting the daily words, that the list is not overrun by such words, as it can be very frustrating to guess them.

Negative mark?

Yesterday (the word was "Joke"), the word "GILUY" got a negative mark.
Is it a bug or a feature?
Thanks,
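One possible explanation, assuming the displayed score is derived from the cosine similarity between word vectors (this page does not confirm how the score is computed): cosine similarity ranges from -1 to 1, so a negative value can occur whenever two vectors point in roughly opposite directions. The pure-Python sketch below only illustrates that range; the vectors are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-D vectors, not real word embeddings.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```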

XSS Vulnerability

There is an unlikely but possible XSS vulnerability. If someone is convinced to paste a guess, an attacker can execute arbitrary JS in the victim's browser.

How to reproduce:

Go to Semantle
Paste the following in the text input (with the quotes in the end): היי&"><iframe src="/" onload="alert('PWND')" width="0px" height="0px" />"
Press the button and observe the result:

There are two parts to exploiting the vulnerability:

First, we have to make the server return a response that recognizes the guessed word, since if the server returns an empty response, the guess row element is not generated and we get an error.

In order to do that, we can just write an actual word, e.g. היי, add the & character to make the server think that what follows is a different parameter, and add arbitrary text afterwards.
When we send היי&Malicious code here, we get a response from the server as if we had sent only היי.
Sanitizing the input before executing the following lines of code should solve this problem.

const url = "/api/distance" + '?word=' + word;
const response = await fetch(url);
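The parameter-splitting behaviour described above can be reproduced with Python's standard urllib.parse, as a sketch of why the raw concatenation is a problem and how percent-encoding avoids it. The helper build_distance_url is hypothetical; on the client side, the analogous step would be calling encodeURIComponent on the word before building the URL.

```python
from urllib.parse import parse_qs, quote

payload = "היי&Malicious code here"

# Raw concatenation, as in the snippet above: the '&' starts a new parameter,
# so the server only sees word=היי.
raw_query = "word=" + payload
raw_word = parse_qs(raw_query)["word"]


def build_distance_url(word: str) -> str:
    """Hypothetical helper: percent-encode the guess before building the URL."""
    return "/api/distance?word=" + quote(word, safe="")


# With encoding, the whole payload arrives as a single (unrecognized) word.
encoded_query = build_distance_url(payload).split("?", 1)[1]
encoded_word = parse_qs(encoded_query)["word"]
```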

Second, the use of innerHTML in the function guessRow, specifically here:

return `<tr><td>${guessNumber}</td>
<td style="color:${color}" onclick="select('${oldGuess}', secretVec);">${oldGuess}</td>
<td align="right" dir="ltr">${similarity.toFixed(2)}</td>
<td class="${cls}">${percentileText}${progress}
</td></tr>`;

Using an alternative to innerHTML or escaping the input should also help prevent this attack.

Combining these two vulnerabilities, when we enter the malicious input, the following dangerous HTML is generated:

<tr>
   <td>1</td>
   <td style="color:#c0c" onclick="select('היי&amp;">
      <iframe src="/" onload="alert('PWND')" width="0px" height="0px">"', secretVec);">היי&">
      <iframe src="/" onload="alert('PWND')" width="0px" height="0px" />
         "
   </td>
   <td align="right" dir="ltr">24.84</td>
   <td class="">(רחוק)
   </td>
</tr>
<tr><td colspan=4><hr></td></tr></iframe></td></tr>

For example, say the secret word is "צרידות" and I guessed:

בננה 26.64
חולי 55.6
תפוח 23.6
סמוראי 16.81
צרידות 100
The game will then present the following words, in the following order:
בננה 26.64
חולי 55.6
צרידות 100
That way, players can see their progress and breakthroughs.
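The selection rule in this example amounts to keeping each guess that beats the best score seen so far. A minimal sketch, with a hypothetical function name and guesses represented as (word, score) pairs in the order they were made:

```python
def breakthroughs(guesses: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep only the guesses that were the closest so far when first tried."""
    best = float("-inf")
    result = []
    for word, score in guesses:
        if score > best:  # strictly better than every earlier guess
            best = score
            result.append((word, score))
    return result


history = [
    ("בננה", 26.64),
    ("חולי", 55.6),
    ("תפוח", 23.6),
    ("סמוראי", 16.81),
    ("צרידות", 100),
]
milestones = breakthroughs(history)
```

Applied to the guesses above, this keeps בננה, חולי and צרידות, in that order.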

Strange result

In Semantle #74, the score of the word "כרבול" is 99.99, and an icon of two people hugging appears.
