
Comments (10)

anoopkunchukuttan commented on September 12, 2024

Until I understand this issue further, I will stick to the current approach in indic_nlp_library: convert chillus to their atomic codepoints and remove the remaining ZWJ/ZWNJ.

from indic_nlp_library.

anoopkunchukuttan commented on September 12, 2024

Do you mean Malayalam text in Telugu text is removed when the Telugu text is normalized?


patelrajnath commented on September 12, 2024

No, it's in Malayalam text only. My mistake: the heading should be "Code Normalization error for Malayalam". I will change it.
On 16-Jan-2016 11:46 PM, "Anoop Kunchukuttan" wrote:

Do you mean Malayalam text in Telugu text is removed when the Telugu text
is normalized?




stultus commented on September 12, 2024

I randomly stumbled upon this repo and was skimming through the Normalizer code.
I see that you are stripping the ZWJ and ZWNJ characters. This is a bug. ZWJ and ZWNJ are an inherent part of the text and shouldn't be removed. Removing these characters can change the text and has serious implications in areas like search and sort.


anoopkunchukuttan commented on September 12, 2024

For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only. However, the chillu characters also have their own codepoints. So, I first convert chillu representations to these codepoints before deleting the remaining ZWJ and ZWNJ. This does not semantically alter the text. I don't know the impact on search/sort, but I guess it is better to have one consistent representation than multiple representations. I hope that addresses your concern. Do point out if there is anything I am missing.
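
As a rough illustration of the two-step approach described in this comment, here is a minimal Python sketch. It is not the indic_nlp_library code: the function name `normalize_chillus_then_strip` and the mapping table are assumptions made for illustration, and the table covers only the five common legacy chillu sequences (consonant + virama U+0D4D + ZWJ U+200D) and their atomic code points U+0D7A to U+0D7E.

```python
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

# Assumed mapping: base consonant of the legacy sequence -> atomic chillu.
LEGACY_BASE_TO_CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
}


def normalize_chillus_then_strip(text: str) -> str:
    """Hypothetical sketch: unify chillus, then drop all remaining joiners."""
    # Step 1: rewrite consonant + virama + ZWJ as the atomic chillu.
    for base, chillu in LEGACY_BASE_TO_CHILLU.items():
        text = text.replace(base + VIRAMA + ZWJ, chillu)
    # Step 2: delete every ZWJ and ZWNJ that is still present.
    return text.replace(ZWJ, "").replace(ZWNJ, "")


# Legacy spelling of a word ending in chillu N (A, VA, NA, virama, ZWJ).
legacy_word = "\u0D05\u0D35\u0D28\u0D4D\u200D"
print(normalize_chillus_then_strip(legacy_word) == "\u0D05\u0D35\u0D7B")
```

Note that step 2 is unconditional, so any ZWNJ that is not part of a chillu sequence is removed as well.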


stultus commented on September 12, 2024

> For Malayalam, ZWJ and ZWNJ are needed for the chillu characters only

This is not correct. ZWJ is used to form chillus and to force C2-conjoining forms (this applies to all Indian languages); ZWNJ is used to indicate the explicit halant.
Removing ZWJ and ZWNJ can produce erroneous text (for example, താഴ്വാരം instead of താഴ്‌‌വാരം).

> I don't know the impact on search/sort

Okay, you might get the results if you strip the characters from both the target text and the query string, but the result might not be accurate because of the above-mentioned behaviour. (I don't have any examples ready with me to cite here though)

> but I guess it is better to have a consistent representation than having multiple representations.

I totally agree with this. So just converting the Chillus to one form should do the trick.
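
The alternative suggested here, unifying the chillu forms while leaving every other ZWJ and ZWNJ untouched, might look roughly like the sketch below. The function name `unify_chillus_only` is hypothetical, and the mapping again assumes only the five common legacy sequences (consonant + virama + ZWJ) and their atomic code points.

```python
import re

# Assumed mapping: base consonant of the legacy sequence -> atomic chillu.
_BASE_TO_CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
}

# Matches one of the base consonants followed by virama (U+0D4D) and ZWJ (U+200D).
_LEGACY_CHILLU = re.compile("([" + "".join(_BASE_TO_CHILLU) + "])\u0D4D\u200D")


def unify_chillus_only(text: str) -> str:
    """Hypothetical sketch: rewrite legacy chillus, keep all other joiners."""
    return _LEGACY_CHILLU.sub(lambda m: _BASE_TO_CHILLU[m.group(1)], text)


# A virama + ZWNJ sequence (explicit halant) passes through unchanged.
with_zwnj = "\u0D38\u0D26\u0D4D\u200C\u0D35\u0D3E\u0D30\u0D02"
print(unify_chillus_only(with_zwnj) == with_zwnj)
```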


anoopkunchukuttan commented on September 12, 2024

OK, thanks. So, താഴ്വാരം and താഴ്‌‌വാരം are semantically the same. The ZWNJ only controls formatting in this case. The chillu is the only case, as far as I know, where ZWJ actually alters the meaning. That has been handled as mentioned above. The goal of this normalization is to ensure that similar words get a similar representation for NLP applications. We don't seek to retain formatting characters.


stultus commented on September 12, 2024

താഴ്വാരം is non-existent in the dictionary.
To cite a clearer example, consider the following case.

  • സദ്‌വാരം - (good week) - 0D38 0D26 0D4D 200C 0D35 0D3E 0D30 0D02
  • സദ്വാരം - (with hole) - 0D38 0D26 0D4D 0D35 0D3E 0D30 0D02

The two words differ in meaning, and the only difference in their encoding is the ZWNJ.
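
The collision can be checked directly from the code points listed above. The sketch below builds both words from those code points and strips the joiners; the helper names `strip_joiners` and `codepoints` are illustrative and not part of any library.

```python
# Code points copied from the two list items above.
GOOD_WEEK = "\u0D38\u0D26\u0D4D\u200C\u0D35\u0D3E\u0D30\u0D02"  # with ZWNJ
WITH_HOLE = "\u0D38\u0D26\u0D4D\u0D35\u0D3E\u0D30\u0D02"        # without ZWNJ

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def strip_joiners(text: str) -> str:
    """Remove every ZWJ and ZWNJ from the text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


def codepoints(text: str) -> str:
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(codepoints(GOOD_WEEK))
print(codepoints(WITH_HOLE))
# The strings are distinct before stripping and identical afterwards.
print(GOOD_WEEK != WITH_HOLE)
print(strip_joiners(GOOD_WEEK) == WITH_HOLE)
```

After stripping, a search index or dictionary lookup keyed on the normalized string can no longer tell the two words apart.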


anoopkunchukuttan commented on September 12, 2024

For each of these words, the only results Google gives are discussions of this issue :)
This looks like a case of schwa deletion to me. I don't know how prevalent the use of ZWNJ is for addressing schwa deletion issues in Malayalam, although it should not have been done in the first place. Other languages have accepted schwa deletion problems in the script, so I wonder why these hacky solutions are being proposed for Malayalam.


anoopkunchukuttan commented on September 12, 2024

Closing, since the issue pointed out by Rajnath is not clear.

