
hindi-tokenizer's Introduction

Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package,

from HindiTokenizer import Tokenizer

This package implements various functions, which are listed below.

The Tokenizer can be created in two ways

t=Tokenizer("यह वाक्य हिन्दी में है।")

Or

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of all the functions

read_from_file

This function takes the name of a file in the current directory and reads it.

t.read_from_file('hindi_file.txt')

generate_sentences

Given a text, this will generate a list of sentences.

t.generate_sentences()

print_sentences

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

tokenize

This will generate a list of tokens from the given text.

t.tokenize()

print_tokens

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

generate_freq_dict

This will generate and return a dictionary of word frequencies.

freq_dict=t.generate_freq_dict()

print_freq_dict

This will print the word-frequency dictionary generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)
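
Since generate_freq_dict returns a plain Python dictionary, it can be post-processed with standard tools. A minimal sketch that prints the five most frequent words, assuming the dictionary maps each word to its count:

freq_dict=t.generate_freq_dict()
# Assumption: freq_dict maps word -> count.
top_five=sorted(freq_dict.items(), key=lambda kv: kv[1], reverse=True)[:5]
for word, count in top_five:
    print(word, count)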

generate_stem_word

Given a word, this will return its stem.

word=t.generate_stem_word("भारतीय")
print(word)
भारत
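
For intuition, stemmers of this kind typically strip the longest matching suffix from a fixed list. The sketch below illustrates that general technique with a tiny, hypothetical suffix list; it is not the package's actual rule set:

# Illustrative longest-suffix-stripping stemmer (hypothetical suffix list).
suffixes=["ीय", "ों", "ें", "ी", "े", "ा"]

def stem(word):
    for suf in sorted(suffixes, key=len, reverse=True):
        # Strip the suffix only if a reasonable stem remains.
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[:-len(suf)]
    return word

print(stem("भारतीय"))  # भारत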

generate_stem_dict

This will return the dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

print_stem_dict

This will print the dictionary of stemmed words generated by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

remove_stopwords

This will remove all the stopwords occurring in the given text.

t.remove_stopwords()
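
Under the hood, stopword removal generally amounts to filtering the token list against a stopword set. A minimal sketch of that general technique; the stopword set here is a small hypothetical sample, not the package's actual list:

# Hypothetical sample of a Hindi stopword list.
stopwords={"यह", "में", "है"}
tokens=["यह", "वाक्य", "हिन्दी", "में", "है"]
filtered=[tok for tok in tokens if tok not in stopwords]
print(filtered)  # ['वाक्य', 'हिन्दी']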

clean_text

This will remove all the punctuation symbols occurring in the given text.

t.clean_text()

len_text

Given a text, this will return its length.

print(t.len_text())

sentence_count

Given a text, this will return the number of sentences in it.

print(t.sentence_count())

tokens_count

Given a text, this will return the number of tokens in it.

print(t.tokens_count())

concordance

Given a text and a word, this will return all the sentences in which that word occurs.

sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
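
Putting the pieces together, a typical end-to-end run might look like the sketch below. It uses only the calls documented above; 'hindi_file.txt' is a placeholder filename, and the ordering of the cleaning steps is one plausible choice, not a requirement:

from HindiTokenizer import Tokenizer

t=Tokenizer()
t.read_from_file('hindi_file.txt')  # load the raw text
t.clean_text()                      # strip punctuation
t.remove_stopwords()                # drop stopwords
t.tokenize()                        # build the token list
t.print_tokens()

t.generate_sentences()
print(t.sentence_count(), t.tokens_count())

freq_dict=t.generate_freq_dict()    # word -> frequency
t.print_freq_dict(freq_dict)

stem_dict=t.generate_stem_dict()    # word -> stem
t.print_stem_dict(stem_dict)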

hindi-tokenizer's People

Contributors

taranjeet


hindi-tokenizer's Issues

invalid syntax error

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    from HindiTokenizer import Tokenizer
  File "/Users/x/Desktop/PROJECT/Data PreProcessing/HindiTokenizer.py", line 35
    print i.encode('utf-8')
          ^
SyntaxError: invalid syntax
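
The error is a Python 2 print statement failing under Python 3: once print becomes a function, print i.encode('utf-8') is a syntax error. A hedged workaround, assuming line 35 sits inside a loop over text strings, is to rewrite each print statement in HindiTokenizer.py in function form (the standard 2to3 tool automates this conversion); the explicit UTF-8 encode is unnecessary in Python 3:

# Before (Python 2 only):
#     print i.encode('utf-8')
# After (valid Python 3):
#     print(i)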

Tokenizer package not found

Dear taranjeet,

I was trying to run an experiment on a Hindi dataset using the tokenizer you published, but I am getting an error.

Looking forward to your views on the same. Please find the attached screenshots.

[Screenshots attached: 2024-04-22]

Error at hindi-tokenizer

An excellent initiative in the area of WSD (Word Sense Disambiguation).
I was excited to run the code, but I am unable to do so; it gives some errors.

I have a Devanagari-script file as the dataset for the WSD task. I want to read the file through your Hindi-Tokenizer, but I get an error at line 36, which is shown in the screenshot.

from HindiTokenizer import Tokenizer
t=Tokenizer()
t.read_from_file('ContextSenses001.txt')
#t=Tokenizer("यह वाक्य हिन्दी में है।")

The screenshot is attached herewith.

Kindly resolve the same at the earliest.
Thanks
