anoopkunchukuttan / indic_nlp_library
Resources and tools for Indian language Natural Language Processing
Home Page: http://anoopkunchukuttan.github.io/indic_nlp_library/
License: MIT License
Please find below the code for transliterating from Tamil to English (ITRANS romanization).
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text = u'ஒன்றுமட்டுமல்லாது'
lang = 'ta'
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
# OUTPUT: .oऩRumaTTumallAtu
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
# OUTPUT: ஒனறுமட்டுமல்லாது
Can you please tell me how I can use the existing morphological analyser to get the stems of the words provided as input to the Indic NLP library?
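A minimal sketch of how the analyser can be invoked, assuming indic-nlp-library, morfessor, and an indic_nlp_resources checkout are available (the resource path below is a placeholder, not a real location):

```python
# Sketch: stemming via the library's unsupervised morph analyzer.
def stem_words(words, lang='hi', resources='/path/to/indic_nlp_resources'):
    try:
        from indicnlp import common, loader
        from indicnlp.morph.unsupervised_morph import UnsupervisedMorphAnalyzer
        common.set_resources_path(resources)
        loader.load()
        analyzer = UnsupervisedMorphAnalyzer(lang, add_marker=False)
        # morph_analyze_document returns the morpheme segmentation of each
        # token; the first morpheme is usually the stem.
        return analyzer.morph_analyze_document(words)
    except Exception:
        # Library or resource data missing in this environment.
        return None

print(stem_words(['लड़कों', 'किताबें']))
```

On a working install, the returned morpheme list contains the stem followed by the suffix morphemes for each input word.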
Traceback (most recent call last):
File "indic_nlp_library-master/src/indicnlp/tokenize/indic_tokenize.py", line 27, in
from indicnlp.common import IndicNlpException
ModuleNotFoundError: No module named 'indicnlp'
Even after exporting the path I am getting this error.
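If the error persists after exporting the path, the usual cause is that the package directory is not on sys.path at runtime; a sketch (the clone location is an assumption, and older checkouts keep the package under src/):

```python
import sys

# Assumed clone location -- adjust to your checkout.
INDIC_NLP_LIB_HOME = '/path/to/indic_nlp_library'

# Older checkouts keep the package under src/, newer ones at the top level;
# appending both is harmless.
sys.path.append(INDIC_NLP_LIB_HOME)
sys.path.append(INDIC_NLP_LIB_HOME + '/src')
```

Alternatively, installing via pip (pip install indic-nlp-library) avoids the path manipulation entirely.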
Example:
>>> sentence_tokenize.sentence_split('He said "Will you bring me some water?". She said "Sure!", and went away.', lang='en')
['He said "Will you bring me some water? ". She said "Sure!',
'", and went away.']
The correct output should have been:
['He said "Will you bring me some water?".',
'She said "Sure!", and went away.']
Some datasets have been pre-processed with the Moses tokenizer (or another tokenizer), which handles the halant incorrectly: it treats it as punctuation and adds spaces around it. Add functionality to the normalizer to undo this behaviour.
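A possible shape for that undo pass, sketched as a standalone regex; the virama character class is an assumption covering the major Indic scripts, and this is not the library's implementation:

```python
import re

# Viramas (halants) of the major Indic scripts:
# Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
# Kannada, Malayalam.
VIRAMAS = '\u094d\u09cd\u0a4d\u0acd\u0b4d\u0bcd\u0c4d\u0ccd\u0d4d'

def undo_halant_spacing(text):
    # Rejoin conjuncts split as "C <space> virama <space> C".
    # Requiring whitespace on BOTH sides avoids joining across a word
    # that legitimately ends in a halant (common in Tamil/Malayalam).
    return re.sub(r'\s+([' + VIRAMAS + r'])\s+', r'\1', text)

print(undo_halant_spacing('क ् या'))  # क्या
```

A word-final halant followed by a normal space is left alone, since the pattern demands whitespace before the virama as well.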
The current romanizer/indicizer implementation is based on Alan Little's code. It worked for Devanagari alone, and some retrofitting had been done to make it work for other languages. We are now doing an implementation from scratch for ITRANS.
This is the error it throws when I try any option other than "do_nothing". Can you please check why this is happening?
Traceback (most recent call last):
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 720, in
normalizer=factory.get_normalizer(language,remove_nuktas,normalize_nasals)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 680, in get_normalizer
normalizer=TeluguNormalizer(lang=language, remove_nuktas=remove_nuktas, nasals_mode=nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 542, in __init__
super(TeluguNormalizer,self).__init__(lang,remove_nuktas,nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 69, in __init__
self._init_normalize_nasals()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 183, in _init_normalize_nasals
self._init_to_anusvaara_strict()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 93, in _init_to_anusvaara_strict
nasal=langinfo.offset_to_char(pat_signature[0],self.lang),
File "indic_nlp_library/src/indicnlp/langinfo.py", line 91, in offset_to_char
return chr(c+SCRIPT_RANGES[lang][0])
ValueError: chr() arg not in range(256)
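For reference, this error means chr() is being called with a code point above 255, which only fails on Python 2 (where unichr is needed); under Python 3 the same logic works, as this standalone sketch shows (the SCRIPT_RANGES subset mirrors the Unicode block starts and is illustrative, not the library's full table):

```python
# Python 3 equivalent of langinfo.offset_to_char.
# Block starts: Devanagari U+0900, Telugu U+0C00.
SCRIPT_RANGES = {'hi': (0x0900, 0x097F), 'te': (0x0C00, 0x0C7F)}

def offset_to_char(offset, lang):
    # On Python 3, chr() accepts the full Unicode range;
    # on Python 2 this line would need unichr() instead.
    return chr(offset + SCRIPT_RANGES[lang][0])

print(offset_to_char(0x15, 'hi'))  # क (U+0915)
print(offset_to_char(0x15, 'te'))  # క (U+0C15)
```

So the simplest fix is to run the library under Python 3, or patch langinfo.offset_to_char to use unichr on Python 2.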
I understand this is not a library issue, but the BrahmiNet API referenced in the transliteration example in this tutorial notebook isn't working (404 error).
The website links and the web interface are also giving 404.
Thanks,
Sourya
File "indic_tokenize.py", line 20, in
from indicnlp.common import IndicNlpException
ImportError: No module named indicnlp.common
Hi, I have Pandas 1.0.4 installed, and if I try to execute the following code it shows me the error AttributeError: 'DataFrame' object has no attribute 'ix', which is because ix is deprecated.
from indicnlp import loader
loader.load()
Full stacktrace of the error.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-5f49d51c6132> in <module>
1 from indicnlp import loader
----> 2 loader.load()
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/loader.py in load()
25
26 ## Initialization of Indic scripts module
---> 27 indic_scripts.init()
28
29 ## Initialization of English scripts module
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/script/indic_scripts.py in init()
104 TAMIL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','tamil_script_phonetic_data.csv'),encoding='utf-8')
105
--> 106 ALL_PHONETIC_VECTORS= ALL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
107 TAMIL_PHONETIC_VECTORS=TAMIL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
108
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
Any chance of making it compatible with pandas >= 1.0 in the near future?
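Until then, a local patch is straightforward: .ix positional slicing becomes .iloc, and .as_matrix() becomes .to_numpy(). A sketch on toy data (the column layout is an assumption; the real CSVs have many more feature columns):

```python
import pandas as pd

# Assumption: first column holds the character, the rest are features.
PHONETIC_VECTOR_START_OFFSET = 1
df = pd.DataFrame({'char': ['क', 'ख'], 'f0': [1, 0], 'f1': [0, 1]})

# Old (pandas < 1.0): df.ix[:, PHONETIC_VECTOR_START_OFFSET:].as_matrix()
vectors = df.iloc[:, PHONETIC_VECTOR_START_OFFSET:].to_numpy()
print(vectors.shape)  # (2, 2)
```

Applying the same two substitutions in indic_scripts.py (lines 106-107 of the traceback above) restores compatibility with modern pandas.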
One more issue: I saw this line in the sample notebook that you're using: sys.path.append(r'{}\src'.format(INDIC_NLP_LIB_HOME)). But if I look into the repo, there is no src folder in the top-level directory. It probably needs to be updated.
I am trying to run the script but getting an error message on the Text Normalisation part.
The error message is: "TypeError: get_normalizer() takes 2 positional arguments but 3 were given" on line "normalizer=factory.get_normalizer("hi",remove_nuktas)".
For reference, I have attached the snapshot of the error message.
I just learned of your fantastic library from @Akirato, a contributor to my project, the CLTK. We share many of the same goals, including offering good NLP functionality for students of Indian languages.
I'm writing to introduce myself and let you know that we may have some questions for you, if we should port parts of your code to the CLTK. Of course, all of your work will be fully credited by us.
Thank you for your great work!
Kyle
Wrong mappings
Incomplete mappings
Code:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp import loader
from indicnlp import common
common.set_resources_path(INDIC_RESOURCES_PATH)
loader.load()
ItransTransliterator.to_itrans('मैं आज आपकी किस प्रकार सहायता कर सकता हूँ?', 'hi')
Output:
mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?
Output using google translator:
Main aaj aapakee kis prakaar sahaayata kar sakata hoon?
There is unnecessary use of '.' and 'ँ' in the romanization. What would be the best solution for producing appropriate, presentable transliterated output?
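The dots are ITRANS disambiguation markers and the ँ is a candrabindu the mapping leaves untouched. One lossy, display-only option is a small post-processing pass; this is a cosmetic sketch, not a phonetically faithful romanization:

```python
import re

def prettify_itrans(s):
    s = s.replace('\u0901', 'n')    # leftover candrabindu -> plain n
    s = re.sub(r'\.(?=\w)', '', s)  # drop ITRANS disambiguation dots
    return s

print(prettify_itrans(
    'mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?'))
```

Note this discards information the ITRANS notation deliberately encodes, so it should only be applied to output meant for human readers.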
The Marathi corpus has ~1M sentences and the Hindi corpus has ~7M sentences which are incorrectly split due to lack of a few language-specific abbreviations. Unfortunately, as the sentences are shuffled there is no way to get the original sentence back.
A few abbreviations I noticed are missing from sentence_tokenize.py : प्रा. (private), जि. (district).
Abbreviations can be changed to preserve the ending '.' while tokenizing to avoid incorrect sentence splits.
A quick fix for this is limiting sentence lengths to 5-50 words; most of the sentences outside this range are affected. I have attached a sample errors.txt file containing a few of the incorrectly split sentences.
mr_errors.txt
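Until the built-in abbreviation list is extended, a workaround is to mask known abbreviations before splitting and restore them afterwards. A sketch with a naive splitter standing in for sentence_tokenize.sentence_split (the abbreviation list is illustrative):

```python
import re

ABBREVS = ['प्रा.', 'जि.', 'डॉ.']  # illustrative; extend as needed

def split_sentences(text):
    # Mask the trailing dot of known abbreviations so it cannot end a sentence.
    for ab in ABBREVS:
        text = text.replace(ab, ab[:-1] + '<ABBR>')
    parts = [p.strip() for p in re.split(r'(?<=[।.?!])\s+', text) if p.strip()]
    # Restore the masked dots.
    return [p.replace('<ABBR>', '.') for p in parts]

print(split_sentences('प्रा. शर्मा आए। वे जि. कार्यालय गए।'))
```

The same masking idea could be applied upstream of the library's splitter by pre-processing the input and post-processing its output.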
I am getting an error after running the code, at the line: if phonetic_data.ix[offset,'Valid Vector Representation']==0:
Thank you
Could you please add unit tests to this nice work? Python 3 porting requests will become easier if we have reference unit tests; besides, the script '2to3' and the module 'six' can help with it.
Similarity between क and ख
Traceback (most recent call last):
File "test.py", line 224, in
isc.get_phonetic_feature_vector(c1, lang),
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 186, in get_phonetic_feature_vector
if phonetic_data.iloc[offset]['Valid Vector Representation'] == 0:
AttributeError: 'NoneType' object has no attribute 'iloc'
ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')
looks like this file is not getting loaded properly.
from indicnlp.script import indic_scripts as isc
from indicnlp.script import phonetic_sim as psim
c1 = 'क'
c2 = 'ख'
lang = 'hi'
print('Similarity between {} and {}'.format(c1, c2))
print(psim.cosine(
    isc.get_phonetic_feature_vector(c1, lang),
    isc.get_phonetic_feature_vector(c2, lang)
))
I have exported:
global INDIC_RESOURCES_PATH
INDIC_RESOURCES_PATH = "/Users/arunbaby/indic_nlp_resources"
global PYTHONPATH
PYTHONPATH = "$PYTHONPATH:/Users/arunbaby/src"
While using syllabifier class, the anuswara is carried over to the next character.
'जगदीशचंद्र' becomes ज ग दी श च ंद्र
This is technically correct, but there are times when someone may need a different representation, like 'ज ग दी श चं द्र'.
There should be an option for this as well.
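Until such an option exists, the syllabifier's output can be post-processed to move a leading anusvara back onto the preceding syllable; a sketch (assumes the Devanagari anusvara, U+0902):

```python
ANUSVARA = '\u0902'  # Devanagari anusvara sign

def attach_anusvara(syllables):
    out = []
    for syl in syllables:
        if syl.startswith(ANUSVARA) and out:
            out[-1] += ANUSVARA  # attach to the preceding syllable
            syl = syl[1:]
        if syl:
            out.append(syl)
    return out

print(attach_anusvara(['ज', 'ग', 'दी', 'श', 'च', 'ंद्र']))
# ['ज', 'ग', 'दी', 'श', 'चं', 'द्र']
```

The equivalent signs of other scripts could be added to the check the same way.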
In some Kaggle competitions you need to keep the internet off, so you can't use the !pip install command; for that you need a Kaggle dataset so that you can import the library via sys.path.append while the internet is off.
Currently, each of the tools (tokenizer, normalizer) has its own CLI interface. It would be good to have a single unified CLI interface to access all the tools.
Hi, I've been working with your library and noticed today that the latest version (0.80) is not installing properly:
>>> import indicnlp
Traceback (most recent call last):
File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 5, in <module>
from .version import __version__ # noqa
ModuleNotFoundError: No module named 'indicnlp.version'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 8, in <module>
with open(version_txt) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/version.txt'
I didn't get this error with the previous version (0.71).
Is there any functionality to detect the language of a transliterated text?
Dear Anoop,
I have been using this transliterator too, for a while. Have you figured out a way to get it to transliterate the long R^I vowel? Like in pitR^In? (पितॄन्)
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
input_text="சில உன்னத வேலைகளைச் செய்ய மனிதன் இந்த உலகில் பிறக்கிறான். அவர் வாழ்க்கையில் ஒரு உன்னத இலக்கு இருக்க வேண்டும். அவர் எட்டாம் வகுப்பு மாணவனாக இருக்கும்போது இந்த இலக்கை நிர்ணயிக்க வேண்டும். அதற்கு அவர் உண்மையான முயற்சிகளை மேற்கொள்ள வேண்டும். இது அவருக்கு வெற்றியைத் தரும், மேலும் அவர் தனது இலக்கை அடைய முடியும்"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("ta",remove_nuktas=False)
output_text=normalizer.normalize(input_text)
print(input_text)
print(output_text)
The text normalisation is not working with this code; it gives back the same string regardless of whether remove_nuktas is true or false. Can you tell me what I am doing wrong?
Hi, I'm trying to use indic_tokenize. I got the following message. I'm using Python 3.6.
File "C:/Users/CS-14/Anaconda3/lib/site-packages/indicnlp/tokenize/indic_tokenize.py", line 27
triv_tokenizer_indic_pat=re.compile(ur'(['+string.punctuation+ur'\u0964\u0965'+ur'])')
^
SyntaxError: invalid syntax.
Can you help to get rid of this error.
Thanks
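The ur'' literal prefix is Python 2 only. Under Python 3, where all str literals are Unicode, the failing line works without it; a runnable rewrite of just that line:

```python
import re
import string

# Python 3 rewrite of the failing line in indic_tokenize.py:
# drop the Py2-only ur'' prefix; str literals are already Unicode.
# \u0964 and \u0965 are the danda and double danda.
triv_tokenizer_indic_pat = re.compile(
    '([' + string.punctuation + '\u0964\u0965' + '])')

print(triv_tokenizer_indic_pat.split('राम।श्याम'))  # splits on the danda
```

Recent releases of the library are Python 3 only, so upgrading the library (or Python) also resolves this.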
Visarga should lead to the start of a new orthographic syllable.
self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))
^
SyntaxError: invalid syntax
Hi,
Traceback (most recent call last):
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 136, in
analyzer=UnsupervisedMorphAnalyzer(language,add_marker)
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 53, in init
self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))
Kindly check for the same.
Thank you.
Getting the below error while trying to do tokenization for the IITB monolingual corpus, while the same works fine for the parallel corpus (target language: Hindi).
Traceback (most recent call last):
File "indic_tokenize.py", line 67, in
for line in ifile.readlines():
File "/usr/lib/python2.7/codecs.py", line 676, in readlines
return self.reader.readlines(sizehint)
File "/usr/lib/python2.7/codecs.py", line 585, in readlines
data = self.read()
File "/usr/lib/python2.7/codecs.py", line 474, in read
newchars, decodedbytes = self.decode(data, self.errors)
MemoryError
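The MemoryError comes from ifile.readlines(), which loads the entire corpus into memory at once. Iterating over the file object instead processes one line at a time in constant memory; a sketch of the pattern (io.StringIO stands in for the corpus file):

```python
import io

def count_tokens(stream):
    total = 0
    for line in stream:  # lazy iteration: one line in memory at a time
        total += len(line.split())
    return total

corpus = io.StringIO('यह एक वाक्य है\nदूसरा वाक्य\n')
print(count_tokens(corpus))  # 6
```

The same `for line in ifile:` loop is a drop-in replacement for `for line in ifile.readlines():` in the tokenizer script.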
Is there documentation support for finding the similarity between two languages? If so, can you include an example here?
When converting from Hindi to other Indic scripts, the danda character is mapped to an invalid character.
For the danda and double danda, the Unicode characters are U+0964 and U+0965 respectively, irrespective of the script. Hence, script conversion must not happen for these characters when converting from Devanagari to other scripts.
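The fix can be sketched as a guard in the offset-mapping step (convert_offset is a hypothetical helper illustrating the idea, not the library's function; the bases are the Unicode block starts):

```python
DANDA, DOUBLE_DANDA = '\u0964', '\u0965'

def convert_offset(ch, src_base, tgt_base):
    # Danda and double danda live in the Devanagari block but are shared
    # by all Indic scripts, so they must pass through unchanged.
    if ch in (DANDA, DOUBLE_DANDA):
        return ch
    return chr(ord(ch) - src_base + tgt_base)

# Devanagari (base U+0900) to Telugu (base U+0C00):
print(convert_offset('क', 0x0900, 0x0C00))  # క
print(convert_offset('।', 0x0900, 0x0C00))  # । (unchanged)
```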
Or if you have it already, what package is it?
Python version: 3.8.9
pip install indic-nlp-library
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
import sys
from indicnlp import common
INDIC_NLP_RESOURCES=r"/home/user/indic_nlp_resources"
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='അടിക്ക് മോനെ'
print(ItransTransliterator.to_itrans(input_text, 'mal'))
Output
അടിക്ക് മോനെ
I tried both on my local PC and in Colab, but the API is not transliterating.
Instead of extend(), could you use append() to make a more easy-to-use list of lists?
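For context, the difference between the two: extend flattens all tokens into one list, while append keeps one sub-list per sentence:

```python
sentences = [['यह', 'वाक्य', 'है'], ['दूसरा', 'वाक्य']]

flat, nested = [], []
for sent in sentences:
    flat.extend(sent)    # one flat token list
    nested.append(sent)  # list of lists, one entry per sentence

print(flat)    # ['यह', 'वाक्य', 'है', 'दूसरा', 'वाक्य']
print(nested)  # [['यह', 'वाक्य', 'है'], ['दूसरा', 'वाक्य']]
```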
When using cliparser to normalize and then tokenize from the command line by chaining commands with a pipe, an error is encountered: BrokenPipeError: [Errno 32] Broken pipe
The following Malayalam text is being removed when normalized.
ദക്ഷിണാഫ്രിക്കയിലെ സെന്റര് മൗണ്റ്റേന്സിലെ ബുഷ്മ്യാന്സ് ക്ല്യൂഫിനെ ഏറ്റവും നല്ല ഹോട്ടല് , സിംഗപൂര് എയര്ലൈന്സിനെ ഏറ്റവും നല്ല അന്താരാഷ്ട്റ വിമാനം , വെര്ജിന് അമേരിക്കയെ ഏറ്റവും ശ്രേഷ്ഠമായ സ്വകാര്യ വിമാനം, ക്രിസ്റ്റല് ക്രൂസിനെ ഏറ്റവും നല്ല ക്രൂസ് ലൈന് ( വലിയ കപ്പല് ) യആട്ട് ഓഫ് സീബോണിനെ ഏറ്റവും ശ്രേഷ്ഠമായ ക്രൂന് ലൈന് ( ചെറിയ കപ്പല് ) എന്നിവയായി പ്രഖ്യാപിച്ചു .
kindly check for the same.
Thank you.
Sir, I tried the code pasted below for romanising Hindi script, but when I run it the script is not getting romanised. The output that I get is ाजान. Please let me know how I can get a properly romanised script.
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='राजस्थान'
#input_text='ஆசிரியர்கள்'
lang='hi'
print(ItransTransliterator.to_itrans(input_text,lang))
Invoking sentence_split raises an error:
$ python ~/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py sentence_split -l ta ../test-blind/ta.txt ../test-blind/ta.sent
Traceback (most recent call last):
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py", line 264, in
loader.load()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/loader.py", line 27, in load
indic_scripts.init()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 103, in init
ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 452, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 936, in init
self._make_engine(self.engine)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1168, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1981, in init
src = open(src, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/all_script_phonetic_data.csv'
I am trying to perform Orthographic Syllabification, however, I have run into an error:
AttributeError Traceback (most recent call last)
[<ipython-input-9-9f5f7217ed93>](https://localhost:8080/#) in <module>
3 lang='hi'
4
----> 5 print(' '.join(syllabifier.orthographic_syllabify(text,lang)))
2 frames
[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in orthographic_syllabify(word, lang, vocab)
213 def orthographic_syllabify(word,lang,vocab=None):
214
--> 215 p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
216
217 syllables=[]
[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in <listcomp>(.0)
213 def orthographic_syllabify(word,lang,vocab=None):
214
--> 215 p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
216
217 syllables=[]
[/usr/local/lib/python3.9/dist-packages/indicnlp/script/indic_scripts.py](https://localhost:8080/#) in get_phonetic_feature_vector(c, lang)
168 phonetic_data, phonetic_vectors= get_phonetic_info(lang)
169
--> 170 if phonetic_data.iloc[offset]['Valid Vector Representation']==0:
171 return invalid_vector()
172
AttributeError: 'NoneType' object has no attribute 'iloc'
I am using indic-nlp-library version 0.91
I have tried to find the translate function, but to no avail. The zip file and the GitHub repo don't show any translate option. The documentation states that translate is one of the available options. Can anyone please help?
ModuleNotFoundError: No module named 'indicnlp.script'
when trying to run the following code:
from indicnlp import loader
loader.load()
I wanted to know if there is any vector representation for SOS and EOS in the Hindi embeddings.
४,३२,००० gets tokenized as ४ , ३२ , ०००. This should not happen.
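A possible fix is to match comma-separated digit groups (ASCII or Devanagari digits) as single tokens before the general punctuation split; a sketch, not the library's implementation:

```python
import re

# Digit runs with grouping commas; \u0966-\u096f are the Devanagari digits.
NUM = r'[0-9\u0966-\u096f]+(?:,[0-9\u0966-\u096f]+)*'

def tokenize_keep_numbers(text):
    # Numbers with grouping commas stay whole; other runs split on whitespace.
    return re.findall(NUM + r'|\S+', text)

print(tokenize_keep_numbers('कीमत ४,३२,००० रुपये है'))
```

Because the numeric alternative is tried first at each position, the grouped numeral survives as one token.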
Hi,
Looks like schwa deletion is not handled for Hindi, Punjabi, etc.
>>> input_text='जिसका'
>>> print(ItransTransliterator.to_itrans(input_text,'hi'))
jisakaa
Traceback (most recent call last):
File "test1.py", line 6, in
normalizer=factory.get_normalizer("hi",remove_nuktas)
TypeError: get_normalizer() takes 2 positional arguments but 3 were given
This occurs while executing the example provided in the Jupyter notebook.