anoopkunchukuttan / indic_nlp_library
Resources and tools for Indian language Natural Language Processing
Home Page: http://anoopkunchukuttan.github.io/indic_nlp_library/
License: MIT License
Please find below the code for transliterating from Tamil to English (ITRANS romanization).
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text = u'ஒன்றுமட்டுமல்லாது'
lang = 'ta'
itrans_text = ItransTransliterator.to_itrans(input_text, lang)
print(itrans_text)
# OUTPUT: .oऩRumaTTumallAtu
x = ItransTransliterator.from_itrans(itrans_text, lang)
print(x)
# OUTPUT: ஒனறுமட்டுமல்லாது
Can you please tell me how I can use the existing morphological analyser to get the stems of the words provided as input to the Indic NLP library?
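A minimal sketch of how the analyser can be invoked, assuming indic-nlp-library, morfessor, and an indic_nlp_resources checkout are available (the resource path below is a placeholder, not a real location):

```python
# Sketch: stemming via the library's unsupervised morph analyzer.
def stem_words(words, lang='hi', resources='/path/to/indic_nlp_resources'):
    try:
        from indicnlp import common, loader
        from indicnlp.morph.unsupervised_morph import UnsupervisedMorphAnalyzer
        common.set_resources_path(resources)
        loader.load()
        analyzer = UnsupervisedMorphAnalyzer(lang, add_marker=False)
        # morph_analyze_document returns the morpheme segmentation of each
        # token; the first morpheme is usually the stem.
        return analyzer.morph_analyze_document(words)
    except Exception:
        # Library or resource data missing in this environment.
        return None

print(stem_words(['लड़कों', 'किताबें']))
```

On a working install, the returned morpheme list contains the stem followed by the suffix morphemes for each input word.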
Traceback (most recent call last):
File "indic_nlp_library-master/src/indicnlp/tokenize/indic_tokenize.py", line 27, in
from indicnlp.common import IndicNlpException
ModuleNotFoundError: No module named 'indicnlp'
Even after exporting the path I am getting this error.
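If the error persists after exporting the path, the usual cause is that the package directory is not on sys.path at runtime; a sketch (the clone location is an assumption, and older checkouts keep the package under src/):

```python
import sys

# Assumed clone location -- adjust to your checkout.
INDIC_NLP_LIB_HOME = '/path/to/indic_nlp_library'

# Older checkouts keep the package under src/, newer ones at the top level;
# appending both is harmless.
sys.path.append(INDIC_NLP_LIB_HOME)
sys.path.append(INDIC_NLP_LIB_HOME + '/src')
```

Alternatively, installing via pip (pip install indic-nlp-library) avoids the path manipulation entirely.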
Example:
>>> sentence_tokenize.sentence_split('He said "Will you bring me some water?". She said "Sure!", and went away.', lang='en')
['He said "Will you bring me some water? ". She said "Sure!',
'", and went away.']
The correct output should have been:
['He said "Will you bring me some water?".',
'She said "Sure!", and went away.']
Some datasets have been pre-processed with the Moses tokenizer (or another tokenizer), which handles the halant incorrectly: it treats it as punctuation and adds spaces around it. Add functionality to the normalizer to undo this behaviour.
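A possible shape for that undo pass, sketched as a standalone regex; the virama character class is an assumption covering the major Indic scripts, and this is not the library's implementation:

```python
import re

# Viramas (halants) of the major Indic scripts:
# Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
# Kannada, Malayalam.
VIRAMAS = '\u094d\u09cd\u0a4d\u0acd\u0b4d\u0bcd\u0c4d\u0ccd\u0d4d'

def undo_halant_spacing(text):
    # Rejoin conjuncts split as "C <space> virama <space> C".
    # Requiring whitespace on BOTH sides avoids joining across a word
    # that legitimately ends in a halant (common in Tamil/Malayalam).
    return re.sub(r'\s+([' + VIRAMAS + r'])\s+', r'\1', text)

print(undo_halant_spacing('क ् या'))  # क्या
```

A word-final halant followed by a normal space is left alone, since the pattern demands whitespace before the virama as well.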
The current romanizer/indicizer implementation is based on Alan Little's code. It worked for Devanagari alone, and some retrofitting had been done to make it work for other languages. We are now doing an implementation from scratch for ITRANS.
This is the error it throws when I try any option other than "do_nothing". Can you please check why this is happening?
Traceback (most recent call last):
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 720, in
normalizer=factory.get_normalizer(language,remove_nuktas,normalize_nasals)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 680, in get_normalizer
normalizer=TeluguNormalizer(lang=language, remove_nuktas=remove_nuktas, nasals_mode=nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 542, in __init__
super(TeluguNormalizer,self).__init__(lang,remove_nuktas,nasals_mode)
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 69, in __init__
self._init_normalize_nasals()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 183, in _init_normalize_nasals
self._init_to_anusvaara_strict()
File "indic_nlp_library/src/indicnlp/normalize/indic_normalize.py", line 93, in _init_to_anusvaara_strict
nasal=langinfo.offset_to_char(pat_signature[0],self.lang),
File "indic_nlp_library/src/indicnlp/langinfo.py", line 91, in offset_to_char
return chr(c+SCRIPT_RANGES[lang][0])
ValueError: chr() arg not in range(256)
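For reference, this error means chr() is being called with a code point above 255, which only fails on Python 2 (where unichr is needed); under Python 3 the same logic works, as this standalone sketch shows (the SCRIPT_RANGES subset mirrors the Unicode block starts and is illustrative, not the library's full table):

```python
# Python 3 equivalent of langinfo.offset_to_char.
# Block starts: Devanagari U+0900, Telugu U+0C00.
SCRIPT_RANGES = {'hi': (0x0900, 0x097F), 'te': (0x0C00, 0x0C7F)}

def offset_to_char(offset, lang):
    # On Python 3, chr() accepts the full Unicode range;
    # on Python 2 this line would need unichr() instead.
    return chr(offset + SCRIPT_RANGES[lang][0])

print(offset_to_char(0x15, 'hi'))  # क (U+0915)
print(offset_to_char(0x15, 'te'))  # క (U+0C15)
```

So the simplest fix is to run the library under Python 3, or patch langinfo.offset_to_char to use unichr on Python 2.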
I understand this is not a library issue, but the BrahmiNet API referenced in the transliteration example in this tutorial notebook isn't working (404 error).
The website links and the web interface are also giving 404.
Thanks,
Sourya
File "indic_tokenize.py", line 20, in
from indicnlp.common import IndicNlpException
ImportError: No module named indicnlp.common
Hi, I have Pandas 1.0.4 installed, and if I try to execute the following code it shows me the error AttributeError: 'DataFrame' object has no attribute 'ix', which is because ix is deprecated.
from indicnlp import loader
loader.load()
Full stacktrace of the error.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-5f49d51c6132> in <module>
1 from indicnlp import loader
----> 2 loader.load()
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/loader.py in load()
25
26 ## Initialization of Indic scripts module
---> 27 indic_scripts.init()
28
29 ## Initialization of English scripts module
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/indicnlp/script/indic_scripts.py in init()
104 TAMIL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','tamil_script_phonetic_data.csv'),encoding='utf-8')
105
--> 106 ALL_PHONETIC_VECTORS= ALL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
107 TAMIL_PHONETIC_VECTORS=TAMIL_PHONETIC_DATA.ix[:,PHONETIC_VECTOR_START_OFFSET:].as_matrix()
108
~/anaconda3/envs/nlp_bert/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'ix'
Any chance of making it compatible with pandas >= 1.0 in the near future?
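Until then, a local patch is straightforward: .ix positional slicing becomes .iloc, and .as_matrix() becomes .to_numpy(). A sketch on toy data (the column layout is an assumption; the real CSVs have many more feature columns):

```python
import pandas as pd

# Assumption: first column holds the character, the rest are features.
PHONETIC_VECTOR_START_OFFSET = 1
df = pd.DataFrame({'char': ['क', 'ख'], 'f0': [1, 0], 'f1': [0, 1]})

# Old (pandas < 1.0): df.ix[:, PHONETIC_VECTOR_START_OFFSET:].as_matrix()
vectors = df.iloc[:, PHONETIC_VECTOR_START_OFFSET:].to_numpy()
print(vectors.shape)  # (2, 2)
```

Applying the same two substitutions in indic_scripts.py (lines 106-107 of the traceback above) restores compatibility with modern pandas.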
One more issue: I saw this line in the sample notebook that you're using: sys.path.append(r'{}\src'.format(INDIC_NLP_LIB_HOME)). But if I look into the repo, there is no src folder in the top-level directory. It probably needs to be updated.
I am trying to run the script but getting an error message on the Text Normalisation part.
The error message is: "TypeError: get_normalizer() takes 2 positional arguments but 3 were given" on line "normalizer=factory.get_normalizer("hi",remove_nuktas)".
For reference, I have attached the snapshot of the error message.
I just learned of your fantastic library from @Akirato, a contributor to my project, the CLTK. We share many of the same goals, including offering good NLP functionality for students of Indian languages.
I'm writing to introduce myself and let you know that we may have some questions for you, if we should port parts of your code to the CLTK. Of course, all of your work will be fully credited by us.
Thank you for your great work!
Kyle
Wrong mappings
Incomplete mappings
Code:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
from indicnlp import loader
from indicnlp import common
common.set_resources_path(INDIC_RESOURCES_PATH)
loader.load()
ItransTransliterator.to_itrans('मैं आज आपकी किस प्रकार सहायता कर सकता हूँ?', 'hi')
Output:
mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?
Output using google translator:
Main aaj aapakee kis prakaar sahaayata kar sakata hoon?
There is unnecessary use of '.' and 'ँ' in the romanization. What would be the best solution for producing appropriate, presentable transliterated output?
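The dots are ITRANS disambiguation markers and the ँ is a candrabindu the mapping leaves untouched. One lossy, display-only option is a small post-processing pass; this is a cosmetic sketch, not a phonetically faithful romanization:

```python
import re

def prettify_itrans(s):
    s = s.replace('\u0901', 'n')    # leftover candrabindu -> plain n
    s = re.sub(r'\.(?=\w)', '', s)  # drop ITRANS disambiguation dots
    return s

print(prettify_itrans(
    'mai.m aaja aapakii kisa prakaara sahaayataa kara sakataa huuँ?'))
```

Note this discards information the ITRANS notation deliberately encodes, so it should only be applied to output meant for human readers.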
The Marathi corpus has ~1M sentences and the Hindi corpus has ~7M sentences which are incorrectly split due to lack of a few language-specific abbreviations. Unfortunately, as the sentences are shuffled there is no way to get the original sentence back.
A few abbreviations I noticed are missing from sentence_tokenize.py : प्रा. (private), जि. (district).
Abbreviations can be changed to preserve the ending '.' while tokenizing to avoid incorrect sentence splits.
A quick fix for this is limiting sentence lengths to 5-50 words; most of the sentences outside this range are affected. I have attached a sample errors.txt file containing a few of the incorrectly split sentences.
mr_errors.txt
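Until the built-in abbreviation list is extended, a workaround is to mask known abbreviations before splitting and restore them afterwards. A sketch with a naive splitter standing in for sentence_tokenize.sentence_split (the abbreviation list is illustrative):

```python
import re

ABBREVS = ['प्रा.', 'जि.', 'डॉ.']  # illustrative; extend as needed

def split_sentences(text):
    # Mask the trailing dot of known abbreviations so it cannot end a sentence.
    for ab in ABBREVS:
        text = text.replace(ab, ab[:-1] + '<ABBR>')
    parts = [p.strip() for p in re.split(r'(?<=[।.?!])\s+', text) if p.strip()]
    # Restore the masked dots.
    return [p.replace('<ABBR>', '.') for p in parts]

print(split_sentences('प्रा. शर्मा आए। वे जि. कार्यालय गए।'))
```

The same masking idea could be applied upstream of the library's splitter by pre-processing the input and post-processing its output.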
I am getting an error after running the code, at the line: if phonetic_data.ix[offset,'Valid Vector Representation']==0:
Thank you
Could you please add unit tests to this nice work? Python 3 porting requests will become easier if we have reference unit tests; besides, the script '2to3' and the module 'six' can help with it.
Similarity between क and ख
Traceback (most recent call last):
File "test.py", line 224, in
isc.get_phonetic_feature_vector(c1, lang),
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 186, in get_phonetic_feature_vector
if phonetic_data.iloc[offset]['Valid Vector Representation'] == 0:
AttributeError: 'NoneType' object has no attribute 'iloc'
ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')
looks like this file is not getting loaded properly.
from indicnlp.script import indic_scripts as isc
from indicnlp.script import phonetic_sim as psim
c1 = 'क'
c2 = 'ख'
lang = 'hi'
print('Similarity between {} and {}'.format(c1, c2))
print(psim.cosine(
    isc.get_phonetic_feature_vector(c1, lang),
    isc.get_phonetic_feature_vector(c2, lang)
))
I have exported:
global INDIC_RESOURCES_PATH
INDIC_RESOURCES_PATH = "/Users/arunbaby/indic_nlp_resources"
global PYTHONPATH
PYTHONPATH = "$PYTHONPATH:/Users/arunbaby/src"
While using syllabifier class, the anuswara is carried over to the next character.
'जगदीशचंद्र' becomes ज ग दी श च ंद्र
This is technically correct, but there are times when someone may need a different representation, like 'ज ग दी श चं द्र'.
There should be an option for this as well.
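Until such an option exists, the syllabifier's output can be post-processed to move a leading anusvara back onto the preceding syllable; a sketch (assumes the Devanagari anusvara, U+0902):

```python
ANUSVARA = '\u0902'  # Devanagari anusvara sign

def attach_anusvara(syllables):
    out = []
    for syl in syllables:
        if syl.startswith(ANUSVARA) and out:
            out[-1] += ANUSVARA  # attach to the preceding syllable
            syl = syl[1:]
        if syl:
            out.append(syl)
    return out

print(attach_anusvara(['ज', 'ग', 'दी', 'श', 'च', 'ंद्र']))
# ['ज', 'ग', 'दी', 'श', 'चं', 'द्र']
```

The equivalent signs of other scripts could be added to the check the same way.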
In some Kaggle competitions you need to keep the internet off, so you can't use the !pip install command; for that you need a Kaggle dataset so that you can import the library via sys.path.append while the internet is off.
Currently, each of the tools (tokenizer, normalizer) has its own CLI interface. It would be good to have a single unified CLI interface to access all the tools.
Hi, I've been working with your library and noticed today that the latest version (0.80) is not installing properly:
>>> import indicnlp
Traceback (most recent call last):
File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 5, in <module>
from .version import __version__ # noqa
ModuleNotFoundError: No module named 'indicnlp.version'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/__init__.py", line 8, in <module>
with open(version_txt) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/seanmiller/pyenv/lib/python3.8/site-packages/indicnlp/version.txt'
I didn't get this error with the previous version (0.71).
Is there any functionality to detect the language of a transliterated text?
Dear Anoop,
I have been using this transliterator too, for a while. Have you figured out a way to get it to transliterate the long R^I vowel? Like in pitR^In? (पितॄन्)
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
input_text="சில உன்னத வேலைகளைச் செய்ய மனிதன் இந்த உலகில் பிறக்கிறான். அவர் வாழ்க்கையில் ஒரு உன்னத இலக்கு இருக்க வேண்டும். அவர் எட்டாம் வகுப்பு மாணவனாக இருக்கும்போது இந்த இலக்கை நிர்ணயிக்க வேண்டும். அதற்கு அவர் உண்மையான முயற்சிகளை மேற்கொள்ள வேண்டும். இது அவருக்கு வெற்றியைத் தரும், மேலும் அவர் தனது இலக்கை அடைய முடியும்"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("ta",remove_nuktas=False)
output_text=normalizer.normalize(input_text)
print(input_text)
print(output_text)
The text normalisation is not working with this code; it gives back the same string regardless of whether remove_nuktas is true or false. Can you tell me what I am doing wrong?
Hi, I'm trying to use indic_tokenize. I got the following message. I'm using Python 3.6.
File "C:/Users/CS-14/Anaconda3/lib/site-packages/indicnlp/tokenize/indic_tokenize.py", line 27
triv_tokenizer_indic_pat=re.compile(ur'(['+string.punctuation+ur'\u0964\u0965'+ur'])')
^
SyntaxError: invalid syntax.
Can you help to get rid of this error.
Thanks
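The ur'' literal prefix is Python 2 only. Under Python 3, where all str literals are Unicode, the failing line works without it; a runnable rewrite of just that line:

```python
import re
import string

# Python 3 rewrite of the failing line in indic_tokenize.py:
# drop the Py2-only ur'' prefix; str literals are already Unicode.
# \u0964 and \u0965 are the danda and double danda.
triv_tokenizer_indic_pat = re.compile(
    '([' + string.punctuation + '\u0964\u0965' + '])')

print(triv_tokenizer_indic_pat.split('राम।श्याम'))  # splits on the danda
```

Recent releases of the library are Python 3 only, so upgrading the library (or Python) also resolves this.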
Visarga should lead to the start of a new orthographic syllable.
self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))
^
SyntaxError: invalid syntax
Hi,
Traceback (most recent call last):
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 136, in
analyzer=UnsupervisedMorphAnalyzer(language,add_marker)
File "/home/raj/smt/decoder/indic_nlp_library/src/indicnlp/morph/unsupervised_morph.py", line 53, in init
self._script_range_pat=ur'^[{}-{}]+$'.format(unichr(langinfo.SCRIPT_RANGES[lang][0]),unichr(langinfo.SCRIPT_RANGES[lang][1]))
Kindly check for the same.
Thank you.
Getting the below error while trying to do tokenization for the IITB monolingual corpus, while the same works fine for the parallel corpus (target language: Hindi).
Traceback (most recent call last):
File "indic_tokenize.py", line 67, in
for line in ifile.readlines():
File "/usr/lib/python2.7/codecs.py", line 676, in readlines
return self.reader.readlines(sizehint)
File "/usr/lib/python2.7/codecs.py", line 585, in readlines
data = self.read()
File "/usr/lib/python2.7/codecs.py", line 474, in read
newchars, decodedbytes = self.decode(data, self.errors)
MemoryError
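The MemoryError comes from ifile.readlines(), which loads the entire corpus into memory at once. Iterating over the file object instead processes one line at a time in constant memory; a sketch of the pattern (io.StringIO stands in for the corpus file):

```python
import io

def count_tokens(stream):
    total = 0
    for line in stream:  # lazy iteration: one line in memory at a time
        total += len(line.split())
    return total

corpus = io.StringIO('यह एक वाक्य है\nदूसरा वाक्य\n')
print(count_tokens(corpus))  # 6
```

The same `for line in ifile:` loop is a drop-in replacement for `for line in ifile.readlines():` in the tokenizer script.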
Is there documentation support for finding the similarity between two languages? If so, can you include an example here?
When converting from Hindi to other Indic scripts, the danda character is mapped to an invalid character.
For the danda and double danda, the Unicode characters are U+0964 and U+0965 respectively, irrespective of the script. Hence, script conversion must not happen for these characters when converting from Devanagari to other scripts.
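The fix can be sketched as a guard in the offset-mapping step (convert_offset is a hypothetical helper illustrating the idea, not the library's function; the bases are the Unicode block starts):

```python
DANDA, DOUBLE_DANDA = '\u0964', '\u0965'

def convert_offset(ch, src_base, tgt_base):
    # Danda and double danda live in the Devanagari block but are shared
    # by all Indic scripts, so they must pass through unchanged.
    if ch in (DANDA, DOUBLE_DANDA):
        return ch
    return chr(ord(ch) - src_base + tgt_base)

# Devanagari (base U+0900) to Telugu (base U+0C00):
print(convert_offset('क', 0x0900, 0x0C00))  # క
print(convert_offset('।', 0x0900, 0x0C00))  # । (unchanged)
```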
Or if you have it already, what package is it?
Python version: 3.8.9
pip install indic-nlp-library
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
import sys
from indicnlp import common
INDIC_NLP_RESOURCES=r"/home/user/indic_nlp_resources"
common.set_resources_path(INDIC_NLP_RESOURCES)
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='അടിക്ക് മോനെ'
print(ItransTransliterator.to_itrans(input_text, 'mal'))
Output
അടിക്ക് മോനെ
I tried both on my local PC and in Colab, but the API is not transliterating.
Instead of extend(), could you use append() to make a more easy-to-use list of lists?
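For context, the difference between the two: extend flattens all tokens into one list, while append keeps one sub-list per sentence:

```python
sentences = [['यह', 'वाक्य', 'है'], ['दूसरा', 'वाक्य']]

flat, nested = [], []
for sent in sentences:
    flat.extend(sent)    # one flat token list
    nested.append(sent)  # list of lists, one entry per sentence

print(flat)    # ['यह', 'वाक्य', 'है', 'दूसरा', 'वाक्य']
print(nested)  # [['यह', 'वाक्य', 'है'], ['दूसरा', 'वाक्य']]
```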
When using cliparser to normalize and then tokenize from the command line by chaining commands with a pipe, an error is encountered: BrokenPipeError: [Errno 32] Broken pipe
The following Malayalam text is being removed when normalized.
ദക്ഷിണാഫ്രിക്കയിലെ സെന്റര് മൗണ്റ്റേന്സിലെ ബുഷ്മ്യാന്സ് ക്ല്യൂഫിനെ ഏറ്റവും നല്ല ഹോട്ടല് , സിംഗപൂര് എയര്ലൈന്സിനെ ഏറ്റവും നല്ല അന്താരാഷ്ട്റ വിമാനം , വെര്ജിന് അമേരിക്കയെ ഏറ്റവും ശ്രേഷ്ഠമായ സ്വകാര്യ വിമാനം, ക്രിസ്റ്റല് ക്രൂസിനെ ഏറ്റവും നല്ല ക്രൂസ് ലൈന് ( വലിയ കപ്പല് ) യആട്ട് ഓഫ് സീബോണിനെ ഏറ്റവും ശ്രേഷ്ഠമായ ക്രൂന് ലൈന് ( ചെറിയ കപ്പല് ) എന്നിവയായി പ്രഖ്യാപിച്ചു .
kindly check for the same.
Thank you.
Sir, I tried the code pasted below for romanising Hindi script, but when I run it the script is not getting romanised. The output that I get is ाजान. Please let me know how I can get a properly romanised script.
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='राजस्थान'
#input_text='ஆசிரியர்கள்'
lang='hi'
print(ItransTransliterator.to_itrans(input_text,lang))
Invoking sentence_split raises an error:
$ python ~/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py sentence_split -l ta ../test-blind/ta.txt ../test-blind/ta.sent
Traceback (most recent call last):
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/cli/cliparser.py", line 264, in
loader.load()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/loader.py", line 27, in load
indic_scripts.init()
File "/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/indic_scripts.py", line 103, in init
ALL_PHONETIC_DATA=pd.read_csv(os.path.join(common.get_resources_path(),'script','all_script_phonetic_data.csv'),encoding='utf-8')
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 452, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 936, in init
self._make_engine(self.engine)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1168, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/attardi/venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1981, in init
src = open(src, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/attardi/venv/lib/python3.8/site-packages/indicnlp/script/all_script_phonetic_data.csv'
I am trying to perform Orthographic Syllabification, however, I have run into an error:
AttributeError Traceback (most recent call last)
[<ipython-input-9-9f5f7217ed93>](https://localhost:8080/#) in <module>
3 lang='hi'
4
----> 5 print(' '.join(syllabifier.orthographic_syllabify(text,lang)))
2 frames
[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in orthographic_syllabify(word, lang, vocab)
213 def orthographic_syllabify(word,lang,vocab=None):
214
--> 215 p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
216
217 syllables=[]
[/usr/local/lib/python3.9/dist-packages/indicnlp/syllable/syllabifier.py](https://localhost:8080/#) in <listcomp>(.0)
213 def orthographic_syllabify(word,lang,vocab=None):
214
--> 215 p_vectors=[si.get_phonetic_feature_vector(c,lang) for c in word]
216
217 syllables=[]
[/usr/local/lib/python3.9/dist-packages/indicnlp/script/indic_scripts.py](https://localhost:8080/#) in get_phonetic_feature_vector(c, lang)
168 phonetic_data, phonetic_vectors= get_phonetic_info(lang)
169
--> 170 if phonetic_data.iloc[offset]['Valid Vector Representation']==0:
171 return invalid_vector()
172
AttributeError: 'NoneType' object has no attribute 'iloc'
I am using indic-nlp-library version 0.91
I have tried to find the translate function, but to no avail. The zip file and the GitHub repo don't show any translate option. The documentation states that translate is one of the available options. Can anyone please help?
ModuleNotFoundError: No module named 'indicnlp.script'
when trying to run the following code:
from indicnlp import loader
loader.load()
I wanted to know if there is any vector representation for SOS and EOS in the Hindi embeddings.
४,३२,००० gets tokenized as ४ , ३२ , ०००. This should not happen.
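A possible fix is to match comma-separated digit groups (ASCII or Devanagari digits) as single tokens before the general punctuation split; a sketch, not the library's implementation:

```python
import re

# Digit runs with grouping commas; \u0966-\u096f are the Devanagari digits.
NUM = r'[0-9\u0966-\u096f]+(?:,[0-9\u0966-\u096f]+)*'

def tokenize_keep_numbers(text):
    # Numbers with grouping commas stay whole; other runs split on whitespace.
    return re.findall(NUM + r'|\S+', text)

print(tokenize_keep_numbers('कीमत ४,३२,००० रुपये है'))
```

Because the numeric alternative is tried first at each position, the grouped numeral survives as one token.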
Hi,
Looks like schwa deletion is not handled for Hindi, Punjabi, etc.
>>> input_text='जिसका'
>>> print(ItransTransliterator.to_itrans(input_text,'hi'))
jisakaa
Traceback (most recent call last):
File "test1.py", line 6, in
normalizer=factory.get_normalizer("hi",remove_nuktas)
TypeError: get_normalizer() takes 2 positional arguments but 3 were given
This occurs while executing the example provided in the Jupyter notebook.