nlp-word2vector's Introduction

NLP-Word2Vector

This project implements a Word2Vec model for the Persian language, trained on "The Divān of Hafez" dataset.

Pre-processing

In the pre-processing stage, we clean the dataset and remove unusable or superfluous items from it. This stage consists of several steps, some of which differ from language to language. The steps taken in this project are reviewed below:

Read Dataset and Stop Words:
- Read the main dataset file
- Read the stop words file

Stop Words: words that are so commonly used that they carry very little useful information.
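
The reading step could look roughly like the sketch below. The one-item-per-line file layout and the function names `read_lines` and `load_corpus` are assumptions made for illustration, not taken from the project.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty, stripped lines of a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_corpus(dataset_path, stop_words_path):
    """Load the poem lines and the stop words (assumed: one stop word per line)."""
    lines = read_lines(dataset_path)
    stop_words = set(read_lines(stop_words_path))  # a set gives fast membership tests
    return lines, stop_words
```

Keeping the stop words in a set makes the later filtering step a constant-time lookup per token.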

Normalize:
- Reducing variation in the text by bringing it closer to a predefined "standard" form.

Example: نیم فاصله is normalized to نیم‌فاصله.
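
As a rough illustration of what normalization involves, the sketch below maps two Arabic letters to their Persian counterparts and collapses repeated whitespace. This is a deliberately small stand-in with a hypothetical `normalize` function; it does not reproduce the half-space correction from the example above, which needs a dedicated Persian normalizer.

```python
import re

# Arabic code points commonly replaced by their Persian equivalents:
# U+064A (Arabic yeh) -> U+06CC (Persian yeh), U+0643 (Arabic kaf) -> U+06A9 (Persian kaf).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text):
    """Unify letter variants and whitespace (a minimal, assumed rule set)."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace into one space
    return text.strip()
```

The zero-width non-joiner (U+200C) is not whitespace, so existing half-spaces pass through this function unchanged.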

Tokenize:
- Breaking the raw text into small chunks (tokens), which are then used to capture context and to build the model.

Before: الا یا ایها الساقی ادر کاسا و ناولها
After: ['الا', 'یا', 'ایها', 'الساقی', 'ادر', 'کاسا', 'و', 'ناولها']
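
For a line that has already been normalized, the result above can be reproduced with plain whitespace splitting. This is a minimal sketch; the project may well use a Persian-aware tokenizer, and the `tokenize` name is an assumption.

```python
def tokenize(line):
    """Split a normalized line into word tokens on whitespace."""
    return line.split()


tokens = tokenize("الا یا ایها الساقی ادر کاسا و ناولها")
```

Whitespace splitting is enough here because the verse contains no punctuation; text with punctuation attached to words would need extra handling.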

Remove Stop Words:
- Removing stop words such as از or به.

Stem:
- Reducing inflected words to their root forms.

Example: دل‌ها is stemmed to دل.
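
The two steps above can be sketched as follows. The suffix list is a hypothetical, deliberately tiny rule set that only handles plural endings attached with a zero-width non-joiner; a real Persian stemmer covers far more cases.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, written before suffixes such as the plural "ها"

# Assumed toy suffix list, for illustration only.
SUFFIXES = (ZWNJ + "های", ZWNJ + "ها")


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word collection."""
    return [token for token in tokens if token not in stop_words]


def stem(token):
    """Strip the first matching suffix, leaving the token unchanged otherwise."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token
```

Restricting the rules to suffixes that follow a zero-width non-joiner avoids damaging words that merely end in the same letters.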

Create Bag of Words:
- Creating an array of all words in the dataset, as pre-processed so far.

Convert words to sentences:
- Joining the word array of each line back into a sentence, mirroring the line structure of the original dataset.

Remove duplicate words:
- Removing duplicate words, which leaves 5773 unique words in the pre-processed dataset.
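
The three views described above can be derived from the tokenized lines in a few lines of Python. This is a sketch under the assumption that each line is already a list of pre-processed tokens; `build_views` is a hypothetical helper name.

```python
def build_views(tokenized_lines):
    """Return (bag_of_words, sentences, unique_words) for a list of token lists."""
    # Every token in corpus order, duplicates included.
    bag_of_words = [word for line in tokenized_lines for word in line]
    # One space-joined sentence per original line.
    sentences = [" ".join(line) for line in tokenized_lines]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_words = list(dict.fromkeys(bag_of_words))
    return bag_of_words, sentences, unique_words
```

Using `dict.fromkeys` instead of `set` keeps the vocabulary order deterministic between runs.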

Processing

After pre-processing, the data is available in cleaned form and in several formats: sentences, a word array for each sentence, and a bag of words. Word embeddings can now be trained with any of the many available methods. In this project, I used Keras to implement Skip-gram with Negative Sampling to build and train the model, as described in detail below.

Tokenize and build sequences:
- Tokenize: Giving an id to each word using the Keras tokenizer.
- Word2Id and Id2Word: Creating two lookups, word to id and id to word, based on the tokenized data.
- Text to Sequence: Converting each sentence to a sequence of word ids, so that the text is represented as numeric vectors.
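
To make the id assignment concrete, here is a framework-free sketch of the same idea. It follows the Keras tokenizer's convention of numbering words from 1 in order of decreasing frequency, leaving 0 free for padding, but it is a pure-Python stand-in with assumed function names, not the project's code.

```python
from collections import Counter


def build_vocab(sentences):
    """Map words to ids (1 = most frequent) and ids back to words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # most_common() keeps first-seen order among words with equal counts.
    word2id = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
    id2word = {i: word for word, i in word2id.items()}
    return word2id, id2word


def texts_to_sequences(sentences, word2id):
    """Replace every known word in each sentence with its integer id."""
    return [[word2id[w] for w in s.split() if w in word2id] for s in sentences]
```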

Generate training data:
- Now that we have the sequences, a list of integer-encoded sentences, we can write a function that iterates over each word to create the targets, contexts and labels.
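
A minimal sketch of such a function is shown below. The window size, the number of negative samples, and the uniform sampling of negatives are all assumptions chosen for brevity; practical implementations usually draw negatives from a frequency-based distribution.

```python
import random


def generate_training_data(sequences, vocab_size, window_size=2, num_ns=4, seed=42):
    """Build (target, [positive + negatives], labels) examples for skip-gram."""
    if vocab_size < 2:
        raise ValueError("need at least two words to draw negative samples")
    rng = random.Random(seed)
    targets, contexts, labels = [], [], []
    for seq in sequences:
        for i, target in enumerate(seq):
            lo = max(0, i - window_size)
            hi = min(len(seq), i + window_size + 1)
            # Every word inside the window, except the target position itself.
            for positive in (seq[j] for j in range(lo, hi) if j != i):
                negatives = []
                while len(negatives) < num_ns:
                    candidate = rng.randint(1, vocab_size)  # ids start at 1
                    if candidate != positive:
                        negatives.append(candidate)
                targets.append(target)
                contexts.append([positive] + negatives)
                labels.append([1] + [0] * num_ns)  # first slot is the true context
    return targets, contexts, labels
```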

Configure the dataset for performance:
- To batch the large number of training examples from our poem dataset efficiently, we use TensorFlow's dataset tooling to optimize the input pipeline.
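
The project relies on TensorFlow for this step. Purely to illustrate what shuffling and batching do to the training examples, here is a framework-free stand-in; the `batches` generator and its parameters are hypothetical and not part of the project.

```python
import random


def batches(targets, contexts, labels, batch_size, seed=0):
    """Yield shuffled, aligned batches, dropping the last partial batch."""
    order = list(range(len(targets)))
    random.Random(seed).shuffle(order)  # one shuffled order shared by all three lists
    for start in range(0, len(order) - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield (
            [targets[i] for i in idx],
            [contexts[i] for i in idx],
            [labels[i] for i in idx],
        )
```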

Building the model:
- Defining the model layers and implementing the model as a subclass.
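
Conceptually, a skip-gram model with negative sampling holds two embedding tables and scores a target against each candidate context word with a dot product. The plain-Python class below sketches that forward pass only; it is an illustrative stand-in for the Keras subclass, and its name, initialization range and layout are assumptions.

```python
import random


class SkipGramModel:
    """Two embedding tables; a pair's score is the dot product of its vectors."""

    def __init__(self, vocab_size, embedding_dim, seed=0):
        rng = random.Random(seed)

        def table():
            # Row 0 is unused because word ids are assumed to start at 1.
            return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size + 1)]

        self.target_embedding = table()
        self.context_embedding = table()

    def __call__(self, target, context_ids):
        """Return one logit per candidate context word."""
        t = self.target_embedding[target]
        return [sum(a * b for a, b in zip(t, self.context_embedding[c]))
                for c in context_ids]
```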

Training the model:
- Defining the loss function for ease of use
- Training the model and computing loss and accuracy
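
The source does not state which loss the project uses. One common choice for negative sampling is binary cross-entropy on the logits, where the true context word has label 1 and each negative sample has label 0. The sketch below implements that assumed loss in a numerically stable form.

```python
import math


def sigmoid_cross_entropy(logits, labels):
    """Mean binary cross-entropy between sigmoid(logits) and 0/1 labels."""
    total = 0.0
    for x, y in zip(logits, labels):
        # Stable form of -y*log(s(x)) - (1-y)*log(1-s(x)):
        # max(x, 0) - x*y + log(1 + exp(-|x|))
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)
```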

Testing and saving:
- Checking the similarity for some words
- Saving the model output
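
A similarity check of this kind is often done with cosine similarity between embedding vectors. The sketch below assumes the trained embeddings are available as a dictionary from word to vector; both that layout and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, top_n=5):
    """Return the top_n words whose vectors are closest to the given word."""
    query = embeddings[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]
```

With `top_n=5`, this returns one row of a table like the one in the results: a query word followed by its five nearest neighbours.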

The full test output is available in the notebook file. The final loss is 0.0228 with 99.9% accuracy. The table below lists the five most similar words found for a few sample words.

Word	Similar 1	Similar 2	Similar 3	Similar 4	Similar 5
جام	حضوری	شکس	دوخته‌زمن	مناز	گر
حافظ	می	بیارا	العین	ننگرد	سوز
دیوانه	لاابال	آشناس	گزارند	ننگرد	توشه
عشق	پاکباز	رمیدن	بنمود	معماییس	فرهادک
می	دیرگاه	گهربار	شرمسار	جاندار	خاتم

Web app:
- Build the backend and load the trained model.
- Build a frontend app that shows the most similar words for a given word.

Repository: milad-mohammadi / nlp-word2vector