In this exercise, the final part of the natural language processing lesson, two projects were implemented; each is examined in detail below.
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.
This section starts with a raw text dataset that contains the required information along with additional data that will not be used. The first step is therefore to write code that converts this file to CSV:
- Read the main dataset file
- Pull out the important fields, `Cat` and `Text`, using regex
- Create a CSV file containing only the useful data
Final CSV sample:

| Cat | Text |
|---|---|
| adabh | جاودانگي در زندگي گروهي از طريق هنر نگاهي به نمايشگاه آثار هنري احمد... |
| elmfa | بازي را جدي بگيريم مطالعه اي مقدماتي پيرامون نقش بازي در زندگي اجتماعي و ساماندهي... |
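The conversion step could be sketched as follows. The raw record format shown in the comments is an assumption for illustration; the actual markers in the corpus dump may differ.

```python
import csv
import re

# Hypothetical raw format -- each record is assumed to look like:
#   .Cat
#    adabh
#   .Text
#    <body of the article>
# (the real markers in the corpus dump may differ).
RECORD_RE = re.compile(
    r"\.Cat\s+(?P<cat>\S+)\s+\.Text\s+(?P<text>.*?)(?=\.Cat|\Z)",
    re.DOTALL,
)

def convert_to_csv(raw: str, out_path: str) -> int:
    """Extract (Cat, Text) pairs from the raw dump and write them as CSV."""
    rows = [(m["cat"], " ".join(m["text"].split()))
            for m in RECORD_RE.finditer(raw)]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Cat", "Text"])
        writer.writerows(rows)
    return len(rows)
```

The lookahead `(?=\.Cat|\Z)` lets each `Text` body run until the next record starts, without consuming it.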
The main part follows: using the prepared dataset to classify the Hamshahri news corpus.
**Stop words**: words that are so commonly used that they carry very little useful information.
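Stop-word removal can be sketched in a few lines. The stop-word set below is a tiny illustrative subset; a real Persian list (for example, the one shipped with the hazm library) contains many more entries.

```python
# A tiny illustrative Persian stop-word set -- a real list
# (e.g. hazm's) contains hundreds of entries.
STOP_WORDS = {"از", "به", "در", "که", "و", "را"}

def remove_stop_words(tokens):
    """Drop tokens that carry little information for classification."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "او از تهران به شیراز رفت".split()
print(remove_stop_words(tokens))  # → ['او', 'تهران', 'شیراز', 'رفت']
```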
- **Read CSV data**
- **Normalize**: reduce the text's randomness, bringing it closer to a predefined "standard". For example, normalization turns نیم فاصله into نیم‌فاصله.
- **Tokenize**: break the raw text into small chunks (tokens); these tokens help in understanding the context and developing the model.
- **Remove stop words**: remove stop words such as از or به.
- **Find unique categories**: create an array of all unique categories.
- **Train the model**: using Embedding, Dense, SpatialDropout1D and LSTM layers, the model reached loss: 0.0247 and accuracy: 0.9959.
- **Prediction**: define a prediction function that returns the predicted category for an input text.
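A model combining the layers named above could be sketched as follows. All hyperparameters here (vocabulary size, sequence length, layer widths, number of categories) are illustrative assumptions, not the values used in the project.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

# Illustrative hyperparameters -- the actual values used in the
# project (vocab size, layer widths, class count) are assumptions here.
VOCAB_SIZE = 50_000
MAX_LEN = 250
EMBED_DIM = 100
NUM_CLASSES = 12  # number of unique Hamshahri categories (assumed)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    SpatialDropout1D(0.2),  # drops whole embedding channels, not single units
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Sanity check on random token ids: one probability row per input sequence.
x = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN))
preds = model.predict(x, verbose=0)
print(preds.shape)  # (2, 12)
```

`SpatialDropout1D` is placed right after the embedding so that correlated embedding dimensions are dropped together, which regularizes better than plain dropout on sequence data.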
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
In this section, two separate English and Farsi datasets were made available to us; the translation of each sentence is stored in the other file. As a first step, I wrote a program to merge these two files into a single dataset for simplicity.
This program can create two datasets: `fa2en` and `en2fa`.
Sample `en2fa` dataset:

```
raspy breathing -> صداي خر خر
dad -> پدر
maybe its the wind -> شايد صداي باد باشه
no -> نه
```

Sample `fa2en` dataset:

```
صداي خر خر -> raspy breathing
پدر -> dad
شايد صداي باد باشه -> maybe its the wind
نه -> no
```
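The merge program could be sketched as below, assuming the two files are line-aligned (line *i* of one file is the translation of line *i* of the other); the file names in the usage comment are hypothetical.

```python
def merge_parallel_files(src_path, tgt_path, out_path, sep=" -> "):
    """Zip two line-aligned files into a single 'source -> target' dataset."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip(src, tgt):
            out.write(f"{s.strip()}{sep}{t.strip()}\n")

# Swapping the arguments produces the reverse-direction dataset
# (file names below are hypothetical):
# merge_parallel_files("english.txt", "farsi.txt", "en2fa.txt")
# merge_parallel_files("farsi.txt", "english.txt", "fa2en.txt")
```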
- **Import the Dataset**
- **Preprocess the Dataset**: normalize, clean, and append `<start>` and `<end>` tokens to mark the start and end of each sequence. Encapsulate the Unicode conversion in a function `unicode_to_ascii()` and the sequence preprocessing in a function `preprocess_sentence()`.
- **Prepare the Dataset**:
  - Create word pairs combining the English sequences and their related Farsi sequences.
  - Tokenize and pad the sequences so all sentence arrays have the same length.
- **Create the Dataset**:
  - Segregate the train and validation datasets.
  - Validate the mapping that's been created between the tokens of the sequences and the indices.
- **Initialize the Model Parameters**
- **Create the Encoder Class**
- **Create the Attention Mechanism Class**
- **Create the Decoder Class**
- **Prepare the Optimizer and Loss Functions**
- **Train the Model**:
  - Define the training procedure.
  - Initialize the actual training loop.
  - Train the model, reaching a final loss of 0.4750.
- **Test the Model**:
  - Define the model evaluation procedure.
- **Plot and Predict**
- **Translate**:
  - Sample translation result: English-to-Farsi translation.
  - Retraining the model with the En2Fa dataset would have taken 4 hours; due to lack of time, it could not be done.
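The attention mechanism class mentioned above is the centerpiece of this architecture. A minimal sketch of an additive (Bahdanau-style) attention layer, following the common TensorFlow seq2seq tutorial pattern, could look like this; layer sizes and names are illustrative:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores each encoder position against the
    decoder state, then builds a weighted-sum context vector."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # collapses to a scalar score

    def call(self, query, values):
        # query:  decoder hidden state, shape (batch, hidden)
        # values: encoder outputs,      shape (batch, src_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # Softmax over source positions (axis 1), so weights sum to 1.
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```

At each decoding step the decoder concatenates `context_vector` with its own input, which is what lets the model focus on different source words while generating each target word.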