
hindi-tokenizer's Introduction

Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package,

from HindiTokenizer import Tokenizer

This package implements various functions, which are listed below.

The Tokenizer can be created in two ways

t=Tokenizer("यह वाक्य हिन्दी में है।")

Or

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of all the functions

read_from_file

This function takes the name of a file in the current directory and reads it.

t.read_from_file('hindi_file.txt')

generate_sentences

Given a text, this will generate a list of sentences.

t.generate_sentences()

print_sentences

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

tokenize

This will generate a list of tokens from the given text.

t.tokenize()

print_tokens

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

generate_freq_dict

This will generate and return a dictionary of word frequencies.

freq_dict=t.generate_freq_dict()

print_freq_dict

This will print the word-frequency dictionary generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)
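
Since generate_freq_dict returns a plain Python dictionary, it can be post-processed with standard tools. A minimal sketch that prints the five most frequent words, assuming the dictionary maps each word to its count:

freq_dict=t.generate_freq_dict()
# Assumption: freq_dict maps word -> count.
top_five=sorted(freq_dict.items(), key=lambda kv: kv[1], reverse=True)[:5]
for word, count in top_five:
    print(word, count)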

generate_stem_word

Given a word, this will return its stem.

word=t.generate_stem_word("भारतीय")
print(word)
भारत
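
For intuition, stemmers of this kind typically strip the longest matching suffix from a fixed list. The sketch below illustrates that general technique with a tiny, hypothetical suffix list; it is not the package's actual rule set:

# Illustrative longest-suffix-stripping stemmer (hypothetical suffix list).
suffixes=["ीय", "ों", "ें", "ी", "े", "ा"]

def stem(word):
    for suf in sorted(suffixes, key=len, reverse=True):
        # Strip the suffix only if a reasonable stem remains.
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[:-len(suf)]
    return word

print(stem("भारतीय"))  # भारत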

generate_stem_dict

This will return the dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

print_stem_dict

This will print the dictionary of stemmed words generated by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

remove_stopwords

This will remove all the stopwords occurring in the given text.

t.remove_stopwords()
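
Under the hood, stopword removal generally amounts to filtering the token list against a stopword set. A minimal sketch of that general technique; the stopword set here is a small hypothetical sample, not the package's actual list:

# Hypothetical sample of a Hindi stopword list.
stopwords={"यह", "में", "है"}
tokens=["यह", "वाक्य", "हिन्दी", "में", "है"]
filtered=[tok for tok in tokens if tok not in stopwords]
print(filtered)  # ['वाक्य', 'हिन्दी']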

clean_text

This will remove all the punctuation symbols occurring in the given text.

t.clean_text()

len_text

Given a text, this will return its length.

print(t.len_text())

sentence_count

Given a text, this will return the number of sentences in it.

print(t.sentence_count())

tokens_count

Given a text, this will return the number of tokens in it.

print(t.tokens_count())

concordance

Given a text and a word, this will return all the sentences in which that word occurs.

sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
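
Putting the pieces together, a typical end-to-end run might look like the sketch below. It uses only the calls documented above; 'hindi_file.txt' is a placeholder filename, and the ordering of the cleaning steps is one plausible choice, not a requirement:

from HindiTokenizer import Tokenizer

t=Tokenizer()
t.read_from_file('hindi_file.txt')  # load the raw text
t.clean_text()                      # strip punctuation
t.remove_stopwords()                # drop stopwords
t.tokenize()                        # build the token list
t.print_tokens()

t.generate_sentences()
print(t.sentence_count(), t.tokens_count())

freq_dict=t.generate_freq_dict()    # word -> frequency
t.print_freq_dict(freq_dict)

stem_dict=t.generate_stem_dict()    # word -> stem
t.print_stem_dict(stem_dict)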

hindi-tokenizer's People

Contributors

taranjeet


hindi-tokenizer's Issues

invalid syntax error

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    from HindiTokenizer import Tokenizer
  File "/Users/x/Desktop/PROJECT/Data PreProcessing/HindiTokenizer.py", line 35
    print i.encode('utf-8')
          ^
SyntaxError: invalid syntax
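
The error is a Python 2 print statement failing under Python 3: once print becomes a function, print i.encode('utf-8') is a syntax error. A hedged workaround, assuming line 35 sits inside a loop over text strings, is to rewrite each print statement in HindiTokenizer.py in function form (the standard 2to3 tool automates this conversion); the explicit UTF-8 encode is unnecessary in Python 3:

# Before (Python 2 only):
#     print i.encode('utf-8')
# After (valid Python 3):
#     print(i)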

Tokenizer package not found

Dear taranjeet,

I was trying to run an experiment on a Hindi dataset using the tokenizer you published, but I am getting an error.

Looking forward to your views on the same. Please find the attached screenshots.

[Screenshots attached: 2024-04-22]

Error at hindi-tokenizer

An excellent initiative in the area of WSD (Word Sense Disambiguation).
I was excited to run the code, but I am unable to do so; it gives some errors.

I have a Devanagari-script file as the dataset for the WSD task. I want to read the file through your Hindi-Tokenizer, but I get an error at line 36, which is shown in the screenshot.

from HindiTokenizer import Tokenizer
t=Tokenizer()
t.read_from_file('ContextSenses001.txt')
#t=Tokenizer("यह वाक्य हिन्दी में है।")

The screenshot is attached herewith.

Kindly resolve the same at the earliest.
Thanks
