
indic_nlp_library's People

Contributors

ankunchu, anoopkunchukuttan, jaygala24, neerajchhimwal, neerajvashistha, pranjalchitale


indic_nlp_library's Issues

Transliteration not proper for few characters in Tamil

Please find below the code for transliterating from Tamil to English.

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text = u'ஒன்றுமட்டுமல்லாது'
lang = 'ta'
input_text = ItransTransliterator.to_itrans(input_text, lang)
print(input_text)
# OUTPUT: .oऩRumaTTumallAtu

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
lang = 'ta'
x = ItransTransliterator.from_itrans(input_text, lang)
print(x)
# OUTPUT: ஒனறுமட்டுமல்லாது

unable to use indic_nlp_library

Traceback (most recent call last):
File "indic_nlp_library-master/src/indicnlp/tokenize/indic_tokenize.py", line 27, in
from indicnlp.common import IndicNlpException
ModuleNotFoundError: No module named 'indicnlp'

Even after exporting the path, I am still getting this error.
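For context, a minimal sketch of the usual fix, assuming the library is being used from a source checkout rather than a pip install (the checkout path below is illustrative):

import sys

# Location of the source checkout; adjust to where the repo was cloned.
INDIC_NLP_LIB_HOME = "/path/to/indic_nlp_library-master"

# In this checkout the package lives under src/, so put that directory on
# sys.path before importing anything from indicnlp.
sys.path.append(INDIC_NLP_LIB_HOME + "/src")

from indicnlp.common import IndicNlpException  # should now resolve

# Alternatively, installing the packaged release avoids the path issue:
#   pip install indic-nlp-library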

Wrong sentence tokenization of sentences with quotes

Example:

>>> sentence_tokenize.sentence_split('He said "Will you bring me some water?". She said "Sure!", and went away.', lang='en')
['He said "Will you bring me some water? ". She said "Sure!',
'", and went away.']

The correct output should have been:

['He said "Will you bring me some water?".',
'She said "Sure!", and went away.']

Undo wrong Moses tokenization

Some datasets have been pre-processed with Moses tokenizer (or some other tokenizer), which incorrectly handles halant, considering it to be punctuation and adding spaces around it. Add functionality in the normalizer to undo this behaviour.
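As a sketch of what such an undo step could look like, assuming the spurious spaces appear around the Devanagari halant/virama (U+094D); the helper below is illustrative, not an existing library function:

import re

HALANT = '\u094d'  # Devanagari virama (halant)

def undo_halant_spacing(text):
    # Remove spaces a tokenizer inserted around the halant,
    # e.g. 'क ् ष' -> 'क्ष'.
    return re.sub(r'\s*' + HALANT + r'\s*', HALANT, text)

print(undo_halant_spacing('क ् ष'))  # -> क्ष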

Change Romanizer/Indicizer implementation

The current romanizer/indicizer implementation is based on Alan Little's code. It worked only for Devanagari, and some retrofitting was done to make it work for other languages. The plan now is to do a from-scratch implementation for ITRANS.

Normalizer Not working with other Options

This is the error it throws when I try any option other than "do_nothing". Can you please check why this is happening?
Traceback (most recent call last):
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 720, in <module>
normalizer=factory.get_normalizer(language,remove_nuktas,normalize_nasals)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 680, in get_normalizer
normalizer=TeluguNormalizer(lang=language, remove_nuktas=remove_nuktas, nasals_mode=nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 542, in __init__
super(TeluguNormalizer,self).__init__(lang,remove_nuktas,nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 69, in __init__
self._init_normalize_nasals()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 183, in _init_normalize_nasals
self._init_to_anusvaara_strict()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 93, in _init_to_anusvaara_strict
nasal=langinfo.offset_to_char(pat_signature[0],self.lang),
File "indic_nlp_library/src/indicnlp/langinfo.py", line 91, in offset_to_char
return chr(c+SCRIPT_RANGES[lang][0])
ValueError: chr() arg not in range(256)

BrahmiNet is down

I understand this is not a library issue, but the BrahmiNet API referenced in the Transliteration example in this tutorial notebook isn't working (404 error).

The website links and the web interface are also giving 404.

Thanks,
Sourya

loader.load() fails with the latest pandas

Hi, I have pandas 1.0.4 installed, and if I try to execute the following code, it shows me the error AttributeError: 'DataFrame' object has no attribute 'ix', which is because .ix was removed in pandas 1.0.

from indicnlp import loader
loader.load()

Full stacktrace of the error.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-5f49d51c6132> in <module>
      1 from indicnlp import loader
----> 2 loader.load()

~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/loader.py in load()
     25 
     26     ## Initialization of Indic scripts module
---> 27     indic_scripts.init()
     28 
     29     ## Initialization of English scripts module

~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/script/indic_scripts.py in init()
    104     TAMIL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','tamil_script_phonetic_data.csv'),encoding='utf-8')
    105 
--> 106     ALL_PHONETIC_VECTORS= ALL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
    107     TAMIL_PHONETIC_VECTORS=TAMIL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
    108 

~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5272             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5273                 return self[name]
-> 5274             return object.__getattribute__(self, name)
   5275 
   5276     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'ix'

Any chance of making it compatible with pandas >= 1.0 in the near future?
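For reference, the standard replacement for the removed APIs (a sketch, not the library's actual patch): .ix indexing becomes .iloc, and .as_matrix() becomes .to_numpy(). For example:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# pandas < 1.0 style (now removed):   df.ix[:, 1:].as_matrix()
# pandas >= 1.0 equivalent, which is what indic_scripts.py would need:
vectors = df.iloc[:, 1:].to_numpy()
print(vectors)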

One more issue: the sample notebook uses the line sys.path.append(r'{}\src'.format(INDIC_NLP_LIB_HOME)), but if I look into the repo there is no src folder in the top-level directory. It probably needs to be updated.

Text Normalisation

I am trying to run the script but I am getting an error message in the Text Normalisation part.

The error message is "TypeError: get_normalizer() takes 2 positional arguments but 3 were given", raised on the line normalizer=factory.get_normalizer("hi",remove_nuktas).

For reference, I have attached the snapshot of the error message.


Introduction to the CLTK

Hi @anoopkunchukuttan,

I just learned of your fantastic library from @Akirato, a contributor to my project, the CLTK. We share many of the same goals, including offering good NLP functionality for students of Indian languages.

I'm writing to introduce myself and let you know that we may have some questions for you, if we should port parts of your code to the CLTK. Of course, all of your work will be fully credited by us.

Thank you for your great work!

Kyle

Wrong/incomplete mapping for script conversion to Tamil

Wrong mappings

  1. Nasals are incorrectly mapped to the unvoiced, unaspirated plosive
  2. The mapping in the 5th consonant row of the varnamala is wrong due to an extra consonant in the preceding row

Incomplete mappings

  1. Tamil has a character for 'ja', to which an explicit mapping has to be added
  2. 'sa' becomes 'Sa'

Script Conversion

  • Does not handle velar nasal plosive
  • For Tamil the additional labial nasal plosive is not handled.

Inappropriate Hindi English Transliteration

Code:

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp import loader
from indicnlp import common
common.set_resources_path(INDIC_RESOURCES_PATH)
loader.load()
ItransTransliterator.to_itrans('मैं आज आपकी किस प्रकार सहायता कर सकता हूँ?', 'hi')

Output:

mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?

Output using google translator:

Main aaj aapakee kis prakaar sahaayata kar sakata hoon?

There is unnecessary use of '.' and 'uँ' in the romanization. What would be the best solution to get an appropriate and presentable transliterated output?

Preserve abbreviation punctuation for Tokenization & adding more abbreviations for Sentence Splitting

The Marathi corpus has ~1M sentences and the Hindi corpus has ~7M sentences which are incorrectly split due to lack of a few language-specific abbreviations. Unfortunately, as the sentences are shuffled there is no way to get the original sentence back.
A few abbreviations I noticed are missing from sentence_tokenize.py : प्रा. (private), जि. (district).
Abbreviations can be changed to preserve the ending '.' while tokenizing to avoid incorrect sentence splits.

A quick fix for this is limiting sentence lengths to 5-50 words; most of the sentences lying outside this range are affected. I have attached a sample errors.txt file which contains a few of the incorrectly split sentences.
mr_errors.txt
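For what it's worth, a rough post-processing sketch that re-joins sentences wrongly split after known abbreviations; the abbreviation set below is illustrative, not the library's list:

ABBREVIATIONS = {'प्रा.', 'जि.'}  # illustrative; extend per language

def merge_abbrev_splits(sentences):
    # If a sentence ends with a known abbreviation, glue the next sentence
    # back onto it, since the split there was spurious.
    merged = []
    for sent in sentences:
        if merged and any(merged[-1].endswith(abbr) for abbr in ABBREVIATIONS):
            merged[-1] = merged[-1] + ' ' + sent
        else:
            merged.append(sent)
    return merged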

Unit-testing

Could you please add unit tests to this nice work?
Python 3 porting requests will become easier if we have reference unit tests, and besides, there are the '2to3' script and the 'six' module to help with it.

AttributeError: 'NoneType' object has no attribute 'iloc'

Similarity between क and ख
Traceback (most recent call last):
File "test.py", line 224, in
isc.get_phonetic_feature_vector(c1, lang),
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 186, in get_phonetic_feature_vector
if phonetic_data.iloc[offset]['Valid Vector Representation'] == 0:
AttributeError: 'NoneType' object has no attribute 'iloc'

ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')

It looks like this file is not getting loaded properly.

from indicnlp.script import indic_scripts as isc  # nopep8
from indicnlp.script import phonetic_sim as psim  # nopep8

c1 = 'क'
c2 = 'ख'
lang = 'hi'

print('Similarity between {} and {}'.format(c1, c2))
print(psim.cosine(
    isc.get_phonetic_feature_vector(c1, lang),
    isc.get_phonetic_feature_vector(c2, lang)
))

I have exported:
global INDIC_RESOURCES_PATH
INDIC_RESOURCES_PATH = "/Users/arunbaby/indic_nlp_resources"
global PYTHONPATH
PYTHONPATH = "$PYTHONPATH:/Users/arunbaby/src"
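A likely cause, given the traceback, is that the script resources were never loaded before indic_scripts was used (note that assigning PYTHONPATH inside a Python script does not change the import path). A minimal setup sketch, reusing the resource path from the report:

from indicnlp import common, loader
from indicnlp.script import indic_scripts as isc
from indicnlp.script import phonetic_sim as psim

# Point the library at a local clone of indic_nlp_resources and load it
# before any phonetic lookups are made.
common.set_resources_path("/Users/arunbaby/indic_nlp_resources")
loader.load()

v1 = isc.get_phonetic_feature_vector('क', 'hi')
v2 = isc.get_phonetic_feature_vector('ख', 'hi')
print(psim.cosine(v1, v2))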

Placement of Anuswara

While using syllabifier class, the anuswara is carried over to the next character.

'जगदीशचंद्र' becomes ज ग दी श च ंद्र

This is technically correct, but there are times when someone may need a different representation, like 'ज ग दी श चं द्र'.
There should be an option for this as well.
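A small post-processing sketch of that alternative representation, assuming the syllabifier returns a list of strings and the anusvara in question is U+0902; this is a workaround, not an existing library option:

ANUSVARA = '\u0902'

def attach_anusvara_to_previous(syllables):
    # Move a leading anusvara onto the preceding syllable:
    # ['च', 'ंद्र'] -> ['चं', 'द्र']
    out = []
    for syl in syllables:
        if out and syl.startswith(ANUSVARA):
            out[-1] += ANUSVARA
            syl = syl[len(ANUSVARA):]
        if syl:
            out.append(syl)
    return out

print(' '.join(attach_anusvara_to_previous(['ज', 'ग', 'दी', 'श', 'च', 'ंद्र'])))
# -> ज ग दी श चं द्र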

Unified Command line tool

Currently, each of the tools (tokenizer, normalizer) has its own CLI interface. It would be good to have a single unified CLI interface to access all the tools.
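A minimal sketch of what such a unified entry point could look like, using argparse subcommands; the subcommand names are illustrative (the library's later cliparser.py, referenced in another issue below, takes a similar shape):

import argparse

def main():
    parser = argparse.ArgumentParser(prog='indicnlp')
    sub = parser.add_subparsers(dest='command', required=True)

    tok = sub.add_parser('tokenize', help='tokenize input text')
    tok.add_argument('-l', '--lang', required=True)

    norm = sub.add_parser('normalize', help='normalize input text')
    norm.add_argument('-l', '--lang', required=True)

    args = parser.parse_args()
    # Dispatch to the corresponding tool based on args.command here.
    print(args)

if __name__ == '__main__':
    main()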

installation of latest version not working correctly

Hi, I've been working with your library and noticed today that the latest version (0.80) is not installing properly:

>>> import indicnlp
Traceback (most recent call last):
  File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 5, in <module>
    from .version import __version__  # noqa
ModuleNotFoundError: No module named 'indicnlp.version'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 8, in <module>
    with open(version_txt) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/version.txt'

I didn't get this error with the previous version (0.71).

Long R^I vowel in transliterator.py

Dear Anoop,

I have been using this transliterator for a while too. Have you figured out a way to get it to transliterate the long R^I vowel, as in pitR^In (पितॄन्)?

Text Normalisation using Indic NLP library not working

from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

input_text="சில உன்னத வேலைகளைச் செய்ய மனிதன் இந்த உலகில் பிறக்கிறான். அவர் வாழ்க்கையில் ஒரு உன்னத இலக்கு இருக்க வேண்டும். அவர் எட்டாம் வகுப்பு மாணவனாக இருக்கும்போது இந்த இலக்கை நிர்ணயிக்க வேண்டும். அதற்கு அவர் உண்மையான முயற்சிகளை மேற்கொள்ள வேண்டும். இது அவருக்கு வெற்றியைத் தரும், மேலும் அவர் தனது இலக்கை அடைய முடியும்"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("ta",remove_nuktas=False)
output_text=normalizer.normalize(input_text)

print(input_text)
print(output_text)

The text normalisation is not working with this code; it gives back the same string regardless of whether remove_nuktas is true or false. Can you tell me what I am doing wrong?

indic_tokenize

Hi, I'm trying to use indic_tokenize and I got the following message. I'm using Python 3.6.

File "C:/Users/CS-14/Anaconda3/lib/site-packages/indicnlp/tokenize/indic_tokenize.py", line 27
triv_tokenizer_indic_pat=re.compile(ur'(['+string.punctuation+ur'\u0964\u0965'+ur'])')
^
SyntaxError: invalid syntax.
Can you help me get rid of this error?
Thanks
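The ur'' prefix is Python 2 only syntax, which is why Python 3.6 rejects this line. A sketch of the Python 3 equivalent of that pattern (the change a port would need, not necessarily the library's released code):

import re
import string

# Python 3 rewrite of the failing line: drop the u prefix, keep the raw string.
triv_tokenizer_indic_pat = re.compile(
    r'([' + string.punctuation + r'\u0964\u0965' + r'])'
)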

Morphological analyser

self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))
^
SyntaxError: invalid syntax

Issue with Urdu word segmenter

Hi,

I am trying to use word segmentation for Urdu, but I am getting the following error:

Traceback (most recent call last):
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 136, in
analyzer=UnsupervisedMorphAnalyzer(language,add_marker)
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 53, in init
self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))

KeyError: 'ur'

Kindly check for the same.
Thank you.
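The KeyError indicates that this version's langinfo.SCRIPT_RANGES has no 'ur' entry (Urdu is written in the Perso-Arabic script rather than a Brahmi-derived one). A small guard, as a sketch of how one might fail more gracefully before constructing the analyzer; this is illustrative, not the library's behaviour:

from indicnlp import langinfo

lang = 'ur'
if lang not in langinfo.SCRIPT_RANGES:
    raise ValueError(
        'Language {!r} is not supported by the unsupervised morph analyzer '
        'in this version (no script range defined).'.format(lang)
    )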

Tokenization failing for IITB Monolingual corpus

Getting the below error while trying to tokenize the IITB monolingual corpus, while the same works fine for the parallel corpus (target language: Hindi).

Traceback (most recent call last):
File "indic_tokenize.py", line 67, in
for line in ifile.readlines():
File "/usr/lib/python2.7/codecs.py", line 676, in readlines
return self.reader.readlines(sizehint)
File "/usr/lib/python2.7/codecs.py", line 585, in readlines
data = self.read()
File "/usr/lib/python2.7/codecs.py", line 474, in read
newchars, decodedbytes = self.decode(data, self.errors)
MemoryError
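The MemoryError comes from ifile.readlines() pulling the entire monolingual corpus into memory at once. A sketch of the streaming alternative under the same Python 2 codecs setup (filenames below are placeholders):

import codecs
from indicnlp.tokenize import indic_tokenize

# Stream the corpus instead of readlines(), which loads the whole file.
with codecs.open('monolingual.hi', 'r', 'utf-8') as ifile, \
        codecs.open('monolingual.tok.hi', 'w', 'utf-8') as ofile:
    for line in ifile:  # one line at a time, constant memory
        ofile.write(' '.join(indic_tokenize.trivial_tokenize(line, 'hi')) + '\n')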

Script conversion of danda and double danda

For Hindi to other Indic scripts, the danda character is mapped to an invalid character.

For the danda and double danda, the Unicode characters are U+0964 and U+0965 respectively, irrespective of the script. Hence, script conversion must not happen for these characters when going from Devanagari to other scripts.
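A sketch of how one could protect these characters around a script-conversion call, assuming UnicodeIndicTransliterator.transliterate(text, src, tgt) as exposed by the library; the wrapper itself is illustrative:

from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

DANDA, DOUBLE_DANDA = '\u0964', '\u0965'

def transliterate_keep_danda(text, src, tgt):
    # Convert only the segments between danda/double danda, and stitch the
    # result back together with the danda characters untouched.
    out = []
    segment = ''
    for ch in text:
        if ch in (DANDA, DOUBLE_DANDA):
            if segment:
                out.append(UnicodeIndicTransliterator.transliterate(segment, src, tgt))
                segment = ''
            out.append(ch)  # pass the danda through unchanged
        else:
            segment += ch
    if segment:
        out.append(UnicodeIndicTransliterator.transliterate(segment, src, tgt))
    return ''.join(out)

# Example: transliterate_keep_danda('राम घर गया।', 'hi', 'ta')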

Transliteration not working

Python version: 3.8.9

pip install indic-nlp-library
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
import sys
from indicnlp import common
INDIC_NLP_RESOURCES=r"/home/user/indic_nlp_resources"

common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='അടിക്ക് മോനെ'
print(ItransTransliterator.to_itrans(input_text, 'mal'))

Output

അടിക്ക് മോനെ

I tried it both on my local PC and in Colab, but the API is not transliterating.
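Two likely culprits, judging from the snippet: loader.load() is never called after setting the resources path, and the language code passed is 'mal' where the library's examples use two-letter codes such as 'ml'. A sketch of the adjusted call:

from indicnlp import common, loader
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

common.set_resources_path('/home/user/indic_nlp_resources')
loader.load()  # load resources before transliterating

input_text = 'അടിക്ക് മോനെ'
print(ItransTransliterator.to_itrans(input_text, 'ml'))  # 'ml', not 'mal'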


Code normalization error for Malayalam

The following Malayalam text is being removed when normalized.

ദക്ഷിണാഫ്രിക്കയിലെ സെന്‍റര്‍ മൗണ്‍റ്റേന്‍സിലെ ബുഷ്മ്യാന്‍സ് ക്ല്യൂഫിനെ ഏറ്റവും നല്ല ഹോട്ടല്‍ , സിംഗപൂര്‍ എയര്‍ലൈന്‍സിനെ ഏറ്റവും നല്ല അന്താരാഷ്ട്റ വിമാനം , വെര്ജിന്‍ അമേരിക്കയെ ഏറ്റവും ശ്രേഷ്ഠമായ സ്വകാര്യ വിമാനം, ക്രിസ്റ്റല്‍ ക്രൂസിനെ ഏറ്റവും നല്ല ക്രൂസ് ലൈന്‍ ( വലിയ കപ്പല്‍ ) യആട്ട് ഓഫ് സീബോണിനെ ഏറ്റവും ശ്രേഷ്ഠമായ ക്രൂന്‍ ലൈന്‍ ( ചെറിയ കപ്പല്‍ ) എന്നിവയായി പ്രഖ്യാപിച്ചു .

kindly check for the same.
Thank you.

Issue in Romanization

Sir, I tried the code pasted below for romanizing Hindi script, but when I run the code the script is not getting romanized. The output that I get is ाजान. Please let me know how I can get the properly romanized script.

from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text='राजस्थान'
#input_text='ஆசிரியர்கள்'
lang='hi'

print(ItransTransliterator.to_itrans(input_text,lang))

sentence_split missing all_script_phonetic_data.csv

Invoking sentence_split raises an error:

$ python ~/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py sentence_split -l ta ../test-blind/ta.txt ../test-blind/ta.sent
Traceback (most recent call last):
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py", line 264, in
loader.load()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/loader.py", line 27, in load
indic_scripts.init()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 103, in init
ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 452, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 936, in init
self._make_engine(self.engine)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1168, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1981, in init
src = open(src, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/all_script_phonetic_data.csv'

AttributeError: 'NoneType' object has no attribute 'iloc'

I am trying to perform Orthographic Syllabification, however, I have run into an error:

AttributeError                            Traceback (most recent call last)
[<ipython-input-9-9f5f7217ed93>](https://localhost:8080/#) in <module>
      3 lang='hi'
      4 
----> 5 print(' '.join(syllabifier.orthographic_syllabify(text,lang)))

2 frames
[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in orthographic_syllabify(word, lang, vocab)
    213 def orthographic_syllabify(word,lang,vocab=None):
    214 
--> 215     p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
    216 
    217     syllables=[]

[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in <listcomp>(.0)
    213 def orthographic_syllabify(word,lang,vocab=None):
    214 
--> 215     p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
    216 
    217     syllables=[]

[/usr/local/lib/python3.9/dist-packages/indicnlp/script/indic_scripts.py](https://localhost:8080/#) in get_phonetic_feature_vector(c, lang)
    168     phonetic_data, phonetic_vectors= get_phonetic_info(lang)
    169 
--> 170     if phonetic_data.iloc[offset]['Valid Vector Representation']==0:
    171         return invalid_vector()
    172 

AttributeError: 'NoneType' object has no attribute 'iloc'

I am using indic-nlp-library version 0.91

Is translate function available?

I have tried to find the translate function, but to no avail. Neither the zip file nor the GitHub repo shows any translate option. The documentation states that translation is one of the available options. Can anyone please help?

vectors for SOS and EOS

I wanted to know if there is any vector representation for SOS and EOS in the Hindi embeddings.

Schwa deletion in romanization for Hindi

Hi,

Looks like schwa deletion is not handled for Hindi, Punjabi, etc.

>>> input_text='जिसका'
>>> print(ItransTransliterator.to_itrans(input_text,'hi'))
jisakaa

get_normalizer() takes 2 positional arguments but 3 were given

Traceback (most recent call last):
File "test1.py", line 6, in
normalizer=factory.get_normalizer("hi",remove_nuktas)
TypeError: get_normalizer() takes 2 positional arguments but 3 were given
This occurs while executing the example in the Jupyter notebook provided.
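A likely fix, assuming the current factory API accepts the normalization options only as keyword arguments, is to pass remove_nuktas by name rather than positionally:

from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

factory = IndicNormalizerFactory()
# Pass the option as a keyword argument instead of a positional one.
normalizer = factory.get_normalizer("hi", remove_nuktas=False)
print(normalizer.normalize("कागज़"))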
