goru001 / inltk Goto Github PK

Natural Language Toolkit for Indic Languages aims to provide out of the box support for various NLP tasks that an application developer might need

Home Page: https://inltk.readthedocs.io

License: MIT License

Python 100.00%

nlp deep-learning indic-languages pytorch data-augmentation sentence-similarity sentence-encoding word-embeddings sentence-embeddings

inltk's People

Contributors

Stargazers

Watchers

Forkers

loretoparisi shankar0206 shyamalschandra urvishp80 aabhash rsingh2083 icemoon1987 rishisinha kshitizrimal avinashtgoje kkadakia aryancodify praggie patropavan vaasanthk shubhamrock428 cahuja1992 abhinavdwivedi1 chaumayu 100rabh1401 ravithejaburugu abiraja2004 swadhin ritika26 anujsrc dearjagdish zeeshan75 kashyapkishore amoghm masihullah17 gitter-badger ravish0007 amirunpri2018 stenpiren manikant92 ersawant eruditepanda dominiqueloyer prmohanty ameyinvent praneet9 vaisakh-kp praveenjoshi01 umaimat anuragshas mayankchourasia adamshamsudeen dummy9 arifsuhan moizrauf upendrasingh18 abhishekyana rbs-pli rajeshwarn vmmlog aliendeep laksh9950 sampathweb surya-datascientist santhu45482 mansiganatra tengben0905 quxiaofeng slbinilkumar ssgaur vyaslkv sujitjean ashishpatel26 roininja nbrrawal techiev2 rsanimesh ravi-code-ranjan abhinavm24 ofentswe1 shyam1234 ssthurai aizazsharif shiva16 aakash1822 omrajkumar ankitsingh077 sougata09 yuvarajtana skmalviya sumepr asrst fuadomar ashiks-qb atul-anand-jha avanish ascendency shubhamjain27 himanshuverma09 satishkrr dashayushman abhavk vinayaga-pillai lalitghongade tuxnani

inltk's Issues

How to do emotion classification? #helprequest

Hi
Could you give me some advise on doing emotion classification and sentiment analysis for Tamil ?

Gujarati Support

I need Support for Gujarati NLP

Not working with latest torch version 1.7

Getting below error when identify_language function call for sanskrit words.

File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\inltk\inltk.py", line 77, in identify_language
learn = load_learner(path / 'models' / 'all')
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\basic_train.py", line 626, in load_learner
res = clas_func(data, model, **state)
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\text\learner.py", line 52, in init
super().init(data, model, metrics=metrics, **learn_kwargs)
File "", line 19, in init
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\fastai\basic_train.py", line 166, in post_init
self.model = self.model.to(self.data.device)
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 612, in to
return self._apply(convert)
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
module._apply(fn)
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
module._apply(fn)
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
module._apply(fn)
[Previous line repeated 1 more times]
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\rnn.py", line 160, in _apply
self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
File "C:\Users\e01526\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 779, in getattr
type(self).name, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

from inltk.inltk import setup setup('hi') raises error

How to deal with aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.dropbox.com:443 ssl:default [Connect call failed ('192.133.77.191', 443)]

plaigarism checker in malayalam

Pytorch 1.3 not available to download

Hi.
PyTorch 1.3 is no longer available to download.
Can you please upgrade the models or suggest how should we use them now?

tamil support

Thank you for this project. It would be worth to add Tamil support!

Methods other than tokenize like predict n words, similar sentences are not working in inltk

It gives
ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

Torch 1.3 installation Issue

Hi, Thanks for the Amazing Library to work in Indic languages. I got an error while Installing torch 1.3. I followed the Installation instructions from here. Error below
pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
ERROR: Could not find a version that satisfies the requirement torch==1.3.1+cpu (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.4.0, 1.4.0+cpu, 1.4.0+cu100, 1.4.0+cu92, 1.5.0, 1.5.0+cpu, 1.5.0+cu101, 1.5.0+cu92, 1.5.1, 1.5.1+cpu, 1.5.1+cu101, 1.5.1+cu92, 1.6.0, 1.6.0+cpu, 1.6.0+cu101, 1.6.0+cu92, 1.7.0, 1.7.0+cpu, 1.7.0+cu101, 1.7.0+cu110, 1.7.0+cu92, 1.7.1, 1.7.1+cpu, 1.7.1+cu101, 1.7.1+cu110, 1.7.1+cu92)
ERROR: No matching distribution found for torch==1.3.1+cpu

error on calling setup(<'code-of-language'>)

RuntimeError Traceback (most recent call last)
in
----> 1 setup('ml')

~/anaconda3/lib/python3.7/site-packages/inltk/inltk.py in setup(language_code)
21 loop = asyncio.get_event_loop()
22 tasks = [asyncio.ensure_future(download(language_code))]
---> 23 learn = loop.run_until_complete(asyncio.gather(*tasks))[0]
24 loop.close()
25

~/anaconda3/lib/python3.7/asyncio/base_events.py in run_until_complete(self, future)
569 future.add_done_callback(_run_until_complete_cb)
570 try:
--> 571 self.run_forever()
572 except:
573 if new_task and future.done() and not future.cancelled():

~/anaconda3/lib/python3.7/asyncio/base_events.py in run_forever(self)
524 self._check_closed()
525 if self.is_running():
--> 526 raise RuntimeError('This event loop is already running')
527 if events._get_running_loop() is not None:
528 raise RuntimeError(

RuntimeError: This event loop is already running

Done!

identify languages doesn't work with Telugu in v0.9

identify languages function which uses separate model for identifying the languages hasn't been retrained on Telugu in v0.9.
Need to retrain it to support Telugu.

Improving English model

Thanks for sharing. It's really helpful and looking forward to make it to next level with help of Bert and SQUAD dataset. Looking only for get similar utterances function.

Reason: Suppose query was "What is machine learning?"

Results with default degree_of_aug =0.1 was ['what is machine learn','what is machine thinking','what is mortar learning','what is tank learning','what is rifle learning','what is typewriter learning','what is machine discovering','what is engine learning','what is engine learning','what is factory learning']

Results with degree_of_aug =0.8 was ['what lies typewriter thinking', 'what can mortar learn', 'whatever has machine learned', 'what consists engine noticing', 'what consists engine noticing', 'what was machines hearing', 'what has rifle learned', 'how is typewriter thinking', 'what becomes rocket learns', 'what appears canister realising']

Expected examples should have been ['Can you tell me about machine learning, Explain machine learning, how do you define machine learning, Give definition of machine learning]

I am not sure if this can be achieved but we should give it a try. Bert model uses a large corpus of wiki and there are some dataset like SQUAD1, SQUAD2 which is QnA based. Can we use there features to improve the results. I have worked on Bert model and on SQUAD2 dataset and the results are great. Would you try to implement on this package or let me know about the training process in it. So that I can give it a try.

Thanks for sharing..

Dropbox location for Models are not publically available.

I am trying to download the model for tamil Language.

inltk/inltk/config.py

Line 36 in 3e6bf64

 all_language_codes.tamil: 'https://www.dropbox.com/s/88klv70zl82u39b/export.pkl?raw=1', 

.
the dropbox location is not publically available.

Urdu, Kashmiri and Maithili Support

I would like to contribute for Urdu and Kashmiri language which are also one of the official languages in India and has Indian origins.
I have started working on Urdu Language using repo NLP for Marathi. I have gathered around 350K wikipedia articles link and in process of scraping those articles. I have also added multiprocessing support for gathering articles

Integrating with HuggingFace Transformer

Hi,
Could you give me some insights whether is possible to plug in inltk with huggingface transformer library

Getting error while using get_sentence_similarity

I'm trying to get similar bengali sentences for a given sentence,but when use the code
from inltk.inltk import get_similar_sentences
get_similar_sentences(sentence, no_of_variants, '', degree_of_aug = 0.1)

getting error: "torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'"

[Error in Sentence & word encoding for Hindi]

Python: 3.8
torch==1.6.0+cpu
torchvision==0.7.0+cpu
inltk=0.9
OS: Ubuntu 20

Steps followed: As given in the documentation, https://inltk.readthedocs.io/en/latest/api_docs.html

from inltk.inltk import setup
setup('hi')
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Done!

>>> from inltk.inltk import get_embedding_vectors
>>> vectors = get_embedding_vectors('भारत', 'hi')
Traceback (most recent call last):                                                                             
  File "<stdin>", line 1, in <module>
  File "env3/lib/python3.8/site-packages/inltk/inltk.py", line 100, in get_embedding_vectors
    learn = load_learner(path / 'models' / f'{language_code}')
  File "env3/lib/python3.8/site-packages/fastai/basic_train.py", line 626, in load_learner
    res = clas_func(data, model, **state)
  File "env3/lib/python3.8/site-packages/fastai/text/learner.py", line 52, in __init__
    super().__init__(data, model, metrics=metrics, **learn_kwargs)
  File "<string>", line 20, in __init__
  File "env3/lib/python3.8/site-packages/fastai/basic_train.py", line 166, in __post_init__
    self.model = self.model.to(self.data.device)
  File "env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "env3/home/sarah/github/Offensive_Hindi/env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "env3/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 159, in _apply
    self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
  File "env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 771, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

Same error as above occurs when using

>>> from inltk.inltk import get_sentence_encoding
>>> encoding = get_sentence_encoding('मुझे अपने देश से', 'hi')
......
env3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 771, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

Machine Translation

Any module for English to Indian lang translation will be added to inltk in upcoming release?

Question?

First of all, very great and easy code, written in a extremely simple way.
Just a question, as you are using Fastai I presume you are using language models trained with fastai on the wikitext or a similar corpus, right?

Would the embeddings change after having trained a model for a classification task?

Last thing, do you have any idea how BERT embeddings could be different?

Torch throwing an error for all languages other than mixed language

torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names' error occuring while using Native languages but working fine for mixed languages

Issue with identify_language function

identify_language() works properly when an English sentence is passed.
When a Hindi/other language sentences are passed, it produces a Runtime Error.
Here is the basic code: -

from inltk.inltk import identify_language
identify_language("हैलो")

Note:- I did try adding reset_language_identifying_models() function, but the error persist.

Traceback: -

RuntimeError Traceback (most recent call last)
in
1 from inltk.inltk import identify_language
2
----> 3 identify_language("हैलो")

C:\ProgramData\Anaconda3\lib\site-packages\inltk\inltk.py in identify_language(input)
66 loop = asyncio.get_event_loop()
67 tasks = [asyncio.ensure_future(check_all_languages_identifying_model())]
---> 68 done = loop.run_until_complete(asyncio.gather(*tasks))[0]
69 loop.close()
70 defaults.device = torch.device('cpu')

C:\ProgramData\Anaconda3\lib\asyncio\base_events.py in run_until_complete(self, future)
568 future.add_done_callback(_run_until_complete_cb)
569 try:
--> 570 self.run_forever()
571 except:
572 if new_task and future.done() and not future.cancelled():

C:\ProgramData\Anaconda3\lib\asyncio\base_events.py in run_forever(self)
523 self._check_closed()
524 if self.is_running():
--> 525 raise RuntimeError('This event loop is already running')
526 if events._get_running_loop() is not None:
527 raise RuntimeError(

RuntimeError: This event loop is already running

Suppress warnings in console :SourceChangeWarnings

Environment :- Google Colaboratory
Runtime :- CPU
Installed inltk as directed in the documentation for iNLTK

!pip3 install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

!pip3 install inltk

We are using the function get_similar_sentences() for our case study. We are running this function for our dataset .Dataset size = 9k split in chunks of 1000 records.

The code logs warning to the console for each records slowing down the similar sentence generation and taking more than 4 hours for just 1k records.

We have tried disabling warning using python constructs , but it does not work
Also tried to install latest version of torch and run the code , still facing same issue.
Could you please help disable these warnings or provide a fix for the same.
Thank you
Jayabalambika.R

getting error when first time setup language

Following exception is raised when I try to setup Gujarati language.

from inltk.inltk import setup
setup('gu')
Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.
Traceback (most recent call last):
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\connector.py", line 1001, in _create_direct_connection
hosts = await asyncio.shield(host_resolved)
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\connector.py", line 867, in _resolve_host
addrs = await self._resolver.resolve(host, port, family=self._family)
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\resolver.py", line 32, in resolve
hostname, port, type=socket.SOCK_STREAM, family=family
File "C:\Users\LENOVO\Anaconda3\lib\asyncio\base_events.py", line 784, in getaddrinfo
None, getaddr_func, host, port, family, type, proto, flags)
File "C:\Users\LENOVO\Anaconda3\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "C:\Users\LENOVO\Anaconda3\lib\socket.py", line 748, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\inltk\inltk.py", line 33, in setup
learn = loop.run_until_complete(asyncio.gather(*tasks))[0]
File "C:\Users\LENOVO\Anaconda3\lib\asyncio\base_events.py", line 579, in run_until_complete
return future.result()
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\inltk\inltk.py", line 25, in download
learn = await setup_language(language_code)
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\inltk\download_assets.py", line 29, in setup_language
await download_file(config['lm_model_url'], path/'models'/f'{language_code}', config["lm_model_file_name"])
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\inltk\download_assets.py", line 19, in download_file
async with session.get(url) as response:
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\client.py", line 1124, in aenter
self._resp = await self._coro
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\client.py", line 528, in _request
req, traces=traces, timeout=real_timeout
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\connector.py", line 537, in connect
proto = await self._create_connection(req, traces, timeout)
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\connector.py", line 894, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
File "C:\Users\LENOVO\Anaconda3\lib\site-packages\aiohttp\connector.py", line 1013, in _create_direct_connection
raise ClientConnectorError(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.dropbox.com:443 ssl:default [getaddrinfo failed]

please help...

Stop word functionality

stop word functionality can be added for indian languages.

Identify languages doesn't work with code-mixed languages in latin script

Is there a way to detect the language of the transliterated text of an Indian language?

how to access inltk from IDE like pycharm

Unified Language Model

The work I've done for unification of all models is up to the extent you can find in this repo.
Before training the classifier, I train a LM for all the languages, which needs to be improved, so that it can be used standalone for all the languages. I plan to come back to this after NER, POS, Textual Entailment. If you're interested, you can get started and share your progress here.
Shout out if there's anything, I'll be happy to help.

How to identify whether the text is Hinglish?

Can you please tell us is there any way to identify if the text is Hinglish ? I went through the code of identify_language() and u are outputting en if we type something in English? Can you please reply asap I am in need of a model which identifies whether it's Hinglish or not.

Add support for Classical Languages (Prakrit, Pali)

Data Source is transliterated in Roman alphabet.

Steps:

Convert the data from the source into Early Brahmic script
Train LM Models

Segmentation Fault

Hi, I just installed this and tried the following

from inltk.inltk import setup
setup('hi')

from inltk.inltk import get_embedding_vectors
vectors = get_embedding_vectors(example_sent, 'hi')

This results in a segmentation fault. Can I get an idea on how to debug this?

Assamese Support

Please add Assamese language support. I'm ready to volunteer. Thank you.

add gpu support

Received feedback to add GPU support to cater the case when inference should be done on huge amounts of data! Creating the issue to track it!

telugu support

Hey, great repository.
I'd like to add Telugu support. If you have a framework I should follow to download Telugu wikipedia and train it, I'd love some instructions and get going

POS tagging

https://universaldependencies.org/ has labelled data for parts of speech, dependencies and information about morphology for Hindi, Sanskrit, Marathi, Tamil and Telugu.
I plan on using a LM-LSTM-CRF architecture for sequence tagging. However the language models in iNLTK use sentencepiece tokens. Could anyone guide me through using the existing lm for word tokens or do I need to retrain the word embeddings for word tokens?

Hindi NER Support for Inltk

Currently we are working on research project for NER in Hindi. We would like to extend our code and work to add Support for Hindi-NER in NLTK. Our current model(Embeddings->LSTM->CRF) is trained on this dataset http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=2 with 14 tags and has an accuracy around 70%. We are trying to increase the accuracy of model currently. Do you have any contribution guidelines to the project or any specifics which u would like in the NER model? Otherwise, we are really interested to contribute to the project.

Separate Test sets for classification and Possible cleaning

All the Classification datasets provided lack a separate test set. The reported accuracy is based on the validation set which was used to choose the best baseline model. This may cause issues in model generalization. Hence, there is a need for a separate test set to properly compare accuracy.
Another issue worth checking - The datasets have a few out-of-language characters (for example, Marathi Headlines have a few English characters and Arabic Numerals). These can be transliterated to the language specific characters and numerals. The datasets have ~50-100 duplicates which can be cleaned.

Following the above steps, I have created new splits for the Marathi Headlines Dataset (link here). I report the baseline accuracy on the test set using ALBERT-base pre-trained on the Marathi corpus here (Acc:0.94, Kappa:0.89).

How do I split paragraphs into sentences?

is there a function that helps splitting paragraphs into sentences?

nltk has sent_text which is reasonably accurate in most of the real-life scenarios. Is there a similar version in iNLTK?

Equivalent code snippet from nltk:

import nltk
sent_text = nltk.sent_tokenize(text)

error while install inltk

unable to install inltk as it was not able to find fastai==1.0.57 even though fastai 1.0.60 is installed

sudo pip install inltk

Collecting inltk
  Using cached inltk-0.8.1-py3-none-any.whl (11 kB)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.5/dist-packages (from inltk) (0.1.85)
Collecting numexpr
  Using cached numexpr-2.7.1-cp35-cp35m-manylinux1_x86_64.whl (162 kB)
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.9.0-py3-none-any.whl (109 kB)
Collecting aiohttp>=3.5.4
  Using cached aiohttp-3.6.2-cp35-cp35m-manylinux1_x86_64.whl (1.1 MB)
Requirement already satisfied: numpy>=1.15 in /usr/local/lib/python3.5/dist-packages (from inltk) (1.18.3)
Collecting typing
  Using cached typing-3.7.4.1-py3-none-any.whl (25 kB)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.5/dist-packages (from inltk) (5.3.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.5/dist-packages (from inltk) (3.0.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.5/dist-packages (from inltk) (20.1)
Requirement already satisfied: Pillow in /usr/local/lib/python3.5/dist-packages (from inltk) (7.1.1)
Collecting dataclasses; python_version < "3.7"
  Using cached dataclasses-0.6-py3-none-any.whl (14 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.5/dist-packages (from inltk) (2.23.0)
ERROR: Could not find a version that satisfies the requirement fastai==1.0.57 (from inltk) (from versions: 0.6, 0.7.0)
ERROR: No matching distribution found for fastai==1.0.57 (from inltk)

Breaks with PyTorch==1.8.0

When I try to use the function get_embedding_vectors, I get the following error when I use PyTorch v1.8.0:
'LSTM' object has no attribute '_flat_weights_names'.

Somewhere along the release history, PyTorch might have removed this attribute. I've solved the issue temporarily by installing PyTorch v1.3.0 on conda and it works now, but a permanent fix would be nice.

Thanks!

Windows 10 support

Please add support for Windows platform.

Dimension of get_sentence_encoding for Telugu

The documentation mentions that "get_sentence_encoding returns 400 dimensional encoding of the sentence"
But, in case of Telugu language, get_sentence_encoding return 410 dimensional encoding instead of 400.

Rectify the documentation to avoid confusion!

needs inltk support for Gujarati NER in Windows

I need iNLTK support for Gujarati NER in Windows 10.

Progress bar on async downloads

Checking out this amazing work, however downloading models is not transparent. Not sure, if download is broken, or placed where, or the progress. Figured out that its stored in the models folder in the site packages dir.

Would be good to have something like nltk.download or atleast, steps to setup models offline mode where we can download and unzip it to the respective dir in case of proxy blockage in several orgs.

Kernel keeps dying while calling identify_language from inltk.inltk

This is the result from terminal. I have also tried reset_language_identifying_model but the error still persist. Any ideas?

[I 07:51:21.811 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel 78f77a3b-e8c3-4590-9e4e-58f637292b46 restarted
terminate called after throwing an instance of 'std::runtime_error'
what(): generic_type: cannot initialize type "WorkerId": an object with that name is already defined

Typo fix: 'malyalam' to 'malayalam'

Some amount of work for this was done here. Checkout this comment to see what it lacked.

English language model throws error. Is there a dependency on a specific version of spacy?

The predict_next_words function throws the following error with English:

self.tok = spacy.blank(lang, disable=["parser","tagger","ner"])
TypeError: blank() got an unexpected keyword argument 'disable'

I also tried using the model directly by running predict on the pretrained LanguageLearner that iNLTK downloads at en/export.pkl, but that throws the same error. Looking at the spacy docs, spacy.blank currently does not have a parameter called disable.

The Hindi language model works fine. I'm guessing the error with English is because, as the docs mention, the model has been taken directly from fastai which probably uses spacy somewhere.

Would appreciate any help. Thanks!

Word Embeddings

Hi,
1. Have you extracted the word embeddings for Hindi from the ULMFiT pre-trained model?
2. Is that code public?
3. How can we fine-tune the model with our own dataset and extract embeddings from the new model?

Thanks.

Encoding Devanagri

Hi,

I don't know if this is the right place for my question, but still posting since this repo is related to indic languages.

I am working on a Speech-to-text system for Maithili and since Maithili is nowadays written in Devnagiri( same as Hindi). I am wondering if this library or any other has the provision for Encoding Maithili sentences. I have gone through the docs and noticed that we can get sentence encodings and vector embeddings for Hindi, but I am not sure if we can use it for Maithili sentences. The objective is to get vector embeddings and feed it to our speech recognition model.

Again, I am pretty new to this so any help will be greatly appreciated.

get_similar_vectors, get_similar_sentences fails with AttributeError: 'LSTM' object has no attribute '_flat_weights_names'

I have installed inltk using the following on collaborator run on normal CPU as hardware accelerator. These work fine

But while using the function calls get_similar_vectors or get_similar_sentences it fails

I have used these functions around 4 months back and it was working fine. Not sure if some code got updated.Could you please look into this and provide a resolution
Thank you
Jayabalambika.R

Why the Get Embedding Vectors method gives us 400 dimensions?

I search lot of about this question but didn't find any solution that satisfied my question so i want to know about the hole architecture that why this give us a 400 dimensions and what is the reason behind it. Also which weights are important so that solve my further query.
Kindly help me for this.
Thank you.
Rajan Thakkar