NLP-final

This exercise is the final part of a natural language processing course. Two projects have been implemented, and each is examined in detail below.

Part A, Text classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.

Convert text to CSV

The work in this section starts from a text dataset that contains the required information along with additional information that we will not use. So, the first step is to write code that converts this file to CSV.

  • Read the main dataset file
  • Extract the useful fields, Cat and Text, using regular expressions
  • Create a CSV file containing the useful data

Final csv sample:

Cat Text
adabh جاودانگي در زندگي گروهي از طريق هنر نگاهي به نمايشگاه آثار هنري احمد...
elmfa بازي را جدي بگيريم مطالعه اي مقدماتي پيرامون نقش بازي در زندگي اجتماعي و ساماندهي...
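The conversion steps above can be sketched roughly as follows. This is only a minimal illustration: the raw record layout (lines starting with `.DID`, `.Date` and `.Cat`, followed by the text) is an assumption made for the example, and the names `RECORD_RE`, `extract_rows` and `rows_to_csv` are hypothetical, not taken from the project.

```python
import csv
import io
import re

# ASSUMED layout: a ".Cat<TAB>label" line, then the text, until the next
# ".DID" line or the end of the input. The real file may differ.
RECORD_RE = re.compile(
    r"^\.Cat\t(?P<cat>\S+)\n(?P<text>.*?)(?=^\.DID\t|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_rows(raw):
    """Return (category, text) pairs found in the raw dataset string."""
    rows = []
    for match in RECORD_RE.finditer(raw):
        # Collapse line breaks and repeated spaces inside the text body.
        text = " ".join(match.group("text").split())
        rows.append((match.group("cat"), text))
    return rows


def rows_to_csv(rows):
    """Serialize the pairs as CSV text with a 'Cat,Text' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Cat", "Text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample input in the assumed layout.
SAMPLE_RAW = (
    ".DID\t1S1\n.Date\t75\\04\\02\n.Cat\tadabh\nline one\nline two\n"
    ".DID\t2S1\n.Cat\telmfa\nsecond text\n"
)

if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_RAW)))
```

To write the result to disk instead, the same `csv.writer` can be given a file opened with `newline=""` and `encoding="utf-8"`.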

Classification

This is the main part, in which I use the prepared dataset to reach the goal: classifying the Hamshahri news corpus.

Stop Words: words that are so commonly used that they carry very little useful information.

  1. Read CSV data

  2. Normalize:

    • Transform the text to reduce its randomness, bringing it closer to a predefined “standard”.

    Example: normalizes نیم فاصله to نیم‌فاصله.

  3. Tokenize:

    • Break the raw text into small chunks (tokens). These tokens help in understanding the context and in developing the model.

  4. Remove Stop Words:

    • Remove stop words such as از or به.

  5. Find unique categories:

    • Create an array containing all unique categories.

  6. Train the model:

    • A model built from Embedding, SpatialDropout1D, LSTM and Dense layers reached loss: 0.0247 and accuracy: 0.9959.

  7. Prediction:

    • Define a prediction function that returns the predicted category for an input text.
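The text-preparation steps above (normalize, tokenize, remove stop words, collect categories) and the final label lookup can be sketched in plain Python. This is a simplified illustration, not the project's code: the character map, the tiny stop-word set and every function name are assumptions, and a real pipeline would normally use a Persian NLP library and a trained model to produce the probabilities.

```python
import re

# ASSUMPTION: unify Arabic Yeh/Kaf with their Persian counterparts.
CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Tiny illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"از", "به", "در", "و", "را"}


def normalize(text):
    """Map character variants to one form and collapse whitespace."""
    text = text.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into whitespace-separated tokens."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Run normalize, tokenize and stop-word removal in order."""
    return remove_stop_words(tokenize(normalize(text)))


def unique_categories(labels):
    """Return the distinct category labels in a stable, sorted order."""
    return sorted(set(labels))


def predict_label(probabilities, categories):
    """Pick the category with the highest score (hypothetical helper).

    `probabilities` stands in for one row of model output, aligned
    with `categories`.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[best]


if __name__ == "__main__":
    cats = unique_categories(["elmfa", "adabh", "elmfa"])
    print(cats)
    print(predict_label([0.1, 0.9], cats))
```

Sorting the categories keeps the index of each label stable between training and prediction, which matters when the model outputs one score per category.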

Part B, Machine Translation

Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

Merger

In this section, two separate datasets were made available to us, one English and one Farsi; the translation of each sentence is found in the other file. As a first step, I wrote a program that merges these two files into a single dataset for simplicity. This program can create two datasets, fa2en and en2fa.

Sample en2fa dataset:

raspy breathing -> صداي خر خر
dad -> پدر
maybe its the wind -> شايد صداي باد باشه
no -> نه

Sample fa2en dataset:

صداي خر خر -> raspy breathing
پدر -> dad
شايد صداي باد باشه -> maybe its the wind
نه -> no
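A merger of this kind could look like the sketch below. It assumes the two files are line-aligned (line i of one file translates line i of the other), which the source does not state explicitly; the function name `merge_parallel` is hypothetical.

```python
def merge_parallel(source_lines, target_lines, separator=" -> "):
    """Pair each source line with the target line at the same position.

    ASSUMPTION: both inputs are line-aligned and have the same length.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    return [
        f"{source.strip()}{separator}{target.strip()}"
        for source, target in zip(source_lines, target_lines)
    ]


if __name__ == "__main__":
    english = ["raspy breathing\n", "dad\n"]
    farsi = ["صداي خر خر\n", "پدر\n"]
    en2fa = merge_parallel(english, farsi)  # English on the left
    fa2en = merge_parallel(farsi, english)  # Farsi on the left
    print("\n".join(en2fa))
    print("\n".join(fa2en))
```

Swapping the argument order is enough to produce the opposite direction, so one function covers both the en2fa and fa2en datasets.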

Farsi to English machine translation

  1. Import the Dataset

  2. Preprocess the Dataset:

    • Normalize and clean the text, then append <start> and <end> tokens to mark the start and end of each sequence. The Unicode conversion is encapsulated in a function unicode_to_ascii() and the sequence preprocessing in a function preprocess_sentence().

  3. Prepare the Dataset:

    • Create word pairs combining the English sequences and their related Farsi sequences.
    • Tokenize and pad the sequences so all sentence arrays have the same length.

  4. Create the Dataset:

    • Split the data into train and validation datasets.
    • Validate the mapping that has been created between the tokens of the sequences and the indices.

  5. Initialize the Model Parameters

  6. Create the Encoder Class

  7. Create the Attention Mechanism Class

  8. Create the Decoder Class

  9. Prepare the Optimizer and Loss Functions

  10. Train the Model:

    • Define the training procedure
    • Initialize the actual training loop
    • Train the model, reaching a final loss of 0.4750.

  11. Test the Model:

    • Define the model evaluation procedure

  12. Plot and Predict

  13. Translate:

    • Sample translation result: [image: sample translation result]

  14. English to Farsi translation:

    • Retraining the model on the en2fa dataset would have taken 4 hours, so it was not done due to lack of time.
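The preprocessing and preparation steps of the pipeline above can be sketched with the standard library alone. The function names unicode_to_ascii() and preprocess_sentence() come from the text, but their bodies here are assumptions; `build_vocab` and `encode_and_pad` are hypothetical helpers, and padding at the end with index 0 is one common choice rather than the project's confirmed setting.

```python
import re
import unicodedata


def unicode_to_ascii(text):
    """Strip combining marks (category Mn) after NFD decomposition.

    Despite the name, base letters of any script are kept as they are.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess_sentence(sentence):
    """Clean one sentence and wrap it in <start> and <end> tokens."""
    sentence = unicode_to_ascii(sentence.lower().strip())
    # Put spaces around basic punctuation so it becomes its own token.
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return "<start> " + sentence + " <end>"


def build_vocab(sentences):
    """Map each token to an integer index; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(sentences, vocab):
    """Convert sentences to index lists, right-padded to equal length."""
    encoded = [[vocab[token] for token in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]


if __name__ == "__main__":
    prepared = [preprocess_sentence(s) for s in ["dad", "maybe its the wind"]]
    vocab = build_vocab(prepared)
    print(encode_and_pad(prepared, vocab))
```

In a full pipeline, one vocabulary would be built per language, and the padded index lists would then be split into training and validation sets.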

nlp-final's People

Contributors

liam-realtyna, milad-mohammadi
