NLP-Word2Vector

This project implements a Word2Vec model for the Persian language using the "The Divān of Hafez" dataset.

Pre-processing

In the pre-processing section, we clean the dataset and remove unusable and extra items from it. Some of the steps differ from language to language. The steps taken in the pre-processing of this project are:

  1. Read Dataset and Stop Words:
    • Read the main dataset file
    • Read the stop words file

Stop Words: words that are so commonly used that they carry very little useful information.

  2. Normalize:
    • Transforming the text to reduce its randomness and bring it closer to a predefined “standard”.

Example: Normalizes نیم فاصله to نیم‌فاصله.
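
The normalization step can be illustrated with a minimal sketch. The README does not say which normalizer the project uses, so the character mapping and the `normalize` function below are assumptions made for illustration. The sketch only unifies look-alike Arabic and Persian characters and collapses whitespace; it does not reproduce the half-space correction shown in the example above, which usually needs a dedicated Persian NLP library.

```python
import re

# Hypothetical mapping: Arabic code points that are commonly unified
# with their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian keheh
})


def normalize(text):
    """Unify look-alike characters and collapse repeated whitespace."""
    text = text.translate(CHAR_MAP)
    # \s does not match U+200C (the half-space), so existing half-spaces survive.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(normalize("\u0639\u0644\u064a   \u0643\u062a\u0627\u0628"))
```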

  3. Tokenize:
    • Breaking the raw text into small chunks (tokens). These tokens are the units used to understand the context and to build the model.

Before: الا یا ایها الساقی ادر کاسا و ناولها
After: ['الا', 'یا', 'ایها', 'الساقی', 'ادر', 'کاسا', 'و', 'ناولها']
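
A tokenizer that produces the split shown above can be sketched with a regular expression. This is an assumption about one possible approach, not the project's own tokenizer; `tokenize` is a hypothetical helper name.

```python
import re


def tokenize(line):
    """Split a line into word tokens, dropping punctuation."""
    # \w matches Unicode letters and digits; U+200C (the Persian half-space)
    # is added so that words written with a half-space stay in one piece.
    return re.findall(r"[\w\u200c]+", line)


tokens = tokenize("الا یا ایها الساقی ادر کاسا و ناولها")
print(tokens)  # the eight tokens listed in the "After" line above
```

Note that `\w` does not match combining diacritics, so text with vowel marks would need them stripped first or added to the pattern.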

  4. Remove Stop Words:
    • Removing stop words such as از or به.
  5. Stem:
    • Reducing inflected words to their root forms.

Example: Stems دل‌ها to دل.
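
The two steps above can be sketched as follows. The stop-word set holds only the two words named in this README (the project reads its full list from a file), and the suffix list is a hypothetical, deliberately tiny rule set that covers plural endings only. A real Persian stemmer handles far more cases and avoids false matches on words that merely end in the same letters.

```python
# Only the two stop words mentioned above; the real list comes from a file.
STOP_WORDS = {"از", "به"}

# Hypothetical plural endings, with and without the half-space (U+200C).
SUFFIXES = ("\u200cهای", "\u200cها", "ها")


def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


print(remove_stop_words(["دل", "از", "می"]))
print(stem("دل\u200cها"))
```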

  6. Create Bag of words:
    • Creating an array that includes all words in the dataset (as pre-processed so far).
  7. Convert words to sentences:
    • Converting the word array of each line back to a sentence (like the original dataset).
  8. Remove duplicate words:
    • Removing duplicate words, which leaves 5773 unique words in the pre-processed dataset.
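
The last three steps produce the three formats used later: the bag of words, the sentences, and the unique words. A minimal sketch, assuming the earlier steps yield one token list per poem line (`build_outputs` is a hypothetical helper name):

```python
def build_outputs(token_lines):
    """token_lines: a list of token lists, one per line of the dataset."""
    # Bag of words: every token of every line, in order, duplicates included.
    bag_of_words = [tok for line in token_lines for tok in line]
    # Sentences: the tokens of each line joined back into one string.
    sentences = [" ".join(line) for line in token_lines]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique_words = list(dict.fromkeys(bag_of_words))
    return bag_of_words, sentences, unique_words


bag, sents, vocab = build_outputs([["دل", "می"], ["می", "جام"]])
print(len(bag), sents, vocab)
```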

Processing

After pre-processing, the data is cleaned and available in several formats (sentences, the word array of each sentence, and the bag of words). The next step is word embedding, for which many methods are available. In this project, I used Keras to implement Skip-gram with Negative Sampling to build and train the model, as described in detail below.

  1. Tokenize and build sequences:

    • Tokenize: Giving an id to each word using the Keras tokenizer.
    • Word2Id and Id2Word: Creating two arrays that map words to ids and ids to words, based on the tokenized data.
    • Text to Sequence: Converting the sentences to sequences of word ids, so that each sentence becomes a numeric vector that the inverse vocabulary (Id2Word) can decode.
  2. Generate training data:

    • Now that the sentences are encoded as lists of integers, we can write a function that iterates over each word to create the targets, contexts and labels.
  3. Configure the dataset for performance:

    • To batch the large number of training examples from our poem dataset efficiently, we use TensorFlow to optimize the dataset pipeline.
  4. Building the model:

    • Defining the model layers and implementing the model as a subclass.
  5. Training the model:

    • Defining the loss function for ease of use
    • Training the model and computing the loss and accuracy.
  6. Testing and saving:

    • Checking the similarity for some words
    • Saving the model output
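
To make steps 1, 2 and 5 more concrete, here is a plain-Python sketch of the data flow. It is not the project's Keras code. All function names are hypothetical, ids start at 1 as they do with the Keras tokenizer, negatives are drawn uniformly (a simplification; other sampling distributions are common), and the loss shown is a sigmoid cross-entropy over dot-product scores, which is one usual choice for Skip-gram with Negative Sampling and may differ from the loss defined in the notebook.

```python
import math
import random
from collections import Counter


def build_vocab(sentences):
    """Assign ids by descending frequency, starting at 1 (0 is left for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def texts_to_sequences(sentences, word2id):
    """Replace every word of every sentence with its integer id."""
    return [[word2id[w] for w in s.split()] for s in sentences]


def generate_training_data(sequences, vocab_size, window_size=2, num_ns=4, seed=42):
    """Return (targets, contexts, labels) for skip-gram with negative sampling."""
    if vocab_size < 3:
        raise ValueError("need at least 3 words to draw negative samples")
    rng = random.Random(seed)
    targets, contexts, labels = [], [], []
    for seq in sequences:
        for i, target in enumerate(seq):
            lo, hi = max(0, i - window_size), min(len(seq), i + window_size + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                positive = seq[j]
                negatives = []
                while len(negatives) < num_ns:
                    cand = rng.randint(1, vocab_size)  # uniform, inclusive bounds
                    if cand != positive and cand != target:
                        negatives.append(cand)
                targets.append(target)
                contexts.append([positive] + negatives)  # positive comes first
                labels.append([1] + [0] * num_ns)
    return targets, contexts, labels


def sgns_loss(target_vec, context_vecs, labels):
    """Mean sigmoid cross-entropy between dot-product scores and 0/1 labels."""
    total = 0.0
    for vec, label in zip(context_vecs, labels):
        score = sum(t * c for t, c in zip(target_vec, vec))
        prob = 1.0 / (1.0 + math.exp(-score))
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(labels)


sentences = ["a b c d", "a b"]
word2id, id2word = build_vocab(sentences)
sequences = texts_to_sequences(sentences, word2id)
targets, contexts, labels = generate_training_data(
    sequences, vocab_size=len(word2id), window_size=1, num_ns=2
)
print(len(targets), contexts[0], labels[0])
```

Each training example pairs one target id with one true context id followed by `num_ns` sampled ids, and the labels mark which position holds the true context. The model then learns embeddings whose dot product is high for the true pair and low for the sampled ones.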

The full test output is available in the notebook file. The final loss is 0.0228 with 99.9% accuracy. The table below shows the five most similar words found for some sample words.

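| Word | Similar 1 | Similar 2 | Similar 3 | Similar 4 | Similar 5 |
|---|---|---|---|---|---|
| جام | حضوری | شکس | دوخته‌زمن | مناز | گر |
| حافظ | می | بیارا | العین | ننگرد | سوز |
| دیوانه | لاابال | آشناس | گزارند | ننگرد | توشه |
| عشق | پاکباز | رمیدن | بنمود | معماییس | فرهادک |
| می | دیرگاه | گهربار | شرمسار | جاندار | خاتم |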
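
A table of nearest words like the one above can be computed from the trained embeddings. The README does not state which similarity measure the project uses, so the sketch below assumes cosine similarity. The `most_similar` helper and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, top_k=5):
    """embeddings: dict mapping each word to its vector (a list of floats)."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_k]]


# Toy vectors for illustration only; real vectors come from the trained model.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
print(most_similar("a", toy, top_k=2))
```
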
  7. Web app:
    • Building the backend and loading the model.
    • Building a frontend app to show the words and the similar words for a given word.

Contributors

milad-mohammadi, liam-realtyna
