In this exercise, the final part of the natural language processing lesson, two projects were implemented; each is examined in detail below.
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.
This section starts with a raw text dataset that contains the required information along with additional data that will not be used. The first step is therefore to write code that converts this file to CSV:
- Read the main dataset file
- Pull out the important fields, `Cat` and `Text`, using regex
- Create a CSV file containing only the useful data
Final CSV sample:

| Cat | Text |
|---|---|
| adabh | جاودانگي در زندگي گروهي از طريق هنر نگاهي به نمايشگاه آثار هنري احمد... |
| elmfa | بازي را جدي بگيريم مطالعه اي مقدماتي پيرامون نقش بازي در زندگي اجتماعي و ساماندهي... |
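The conversion step could be sketched as follows. The raw record format shown in the comments is an assumption for illustration; the actual markers in the corpus dump may differ.

```python
import csv
import re

# Hypothetical raw format -- each record is assumed to look like:
#   .Cat
#    adabh
#   .Text
#    <body of the article>
# (the real markers in the corpus dump may differ).
RECORD_RE = re.compile(
    r"\.Cat\s+(?P<cat>\S+)\s+\.Text\s+(?P<text>.*?)(?=\.Cat|\Z)",
    re.DOTALL,
)

def convert_to_csv(raw: str, out_path: str) -> int:
    """Extract (Cat, Text) pairs from the raw dump and write them as CSV."""
    rows = [(m["cat"], " ".join(m["text"].split()))
            for m in RECORD_RE.finditer(raw)]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Cat", "Text"])
        writer.writerows(rows)
    return len(rows)
```

The lookahead `(?=\.Cat|\Z)` lets each `Text` body run until the next record starts, without consuming it.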
The main part follows: using the prepared dataset to classify the Hamshahri news corpus.
**Stop words**: words that are so commonly used that they carry very little useful information.
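Stop-word removal can be sketched in a few lines. The stop-word set below is a tiny illustrative subset; a real Persian list (for example, the one shipped with the hazm library) contains many more entries.

```python
# A tiny illustrative Persian stop-word set -- a real list
# (e.g. hazm's) contains hundreds of entries.
STOP_WORDS = {"از", "به", "در", "که", "و", "را"}

def remove_stop_words(tokens):
    """Drop tokens that carry little information for classification."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "او از تهران به شیراز رفت".split()
print(remove_stop_words(tokens))  # → ['او', 'تهران', 'شیراز', 'رفت']
```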
- **Read CSV data**
- **Normalize**: reduce the text's randomness, bringing it closer to a predefined "standard". For example, normalization turns نیم فاصله into نیم‌فاصله.
- **Tokenize**: break the raw text into small chunks (tokens); these tokens help in understanding the context and developing the model.
- **Remove stop words**: remove stop words such as از or به.
- **Find unique categories**: create an array of all unique categories.
- **Train the model**: using Embedding, Dense, SpatialDropout1D and LSTM layers, the model reached loss: 0.0247 and accuracy: 0.9959.
- **Prediction**: define a prediction function that returns the predicted category for an input text.
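A model combining the layers named above could be sketched as follows. All hyperparameters here (vocabulary size, sequence length, layer widths, number of categories) are illustrative assumptions, not the values used in the project.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

# Illustrative hyperparameters -- the actual values used in the
# project (vocab size, layer widths, class count) are assumptions here.
VOCAB_SIZE = 50_000
MAX_LEN = 250
EMBED_DIM = 100
NUM_CLASSES = 12  # number of unique Hamshahri categories (assumed)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),
    SpatialDropout1D(0.2),  # drops whole embedding channels, not single units
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Sanity check on random token ids: one probability row per input sequence.
x = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN))
preds = model.predict(x, verbose=0)
print(preds.shape)  # (2, 12)
```

`SpatialDropout1D` is placed right after the embedding so that correlated embedding dimensions are dropped together, which regularizes better than plain dropout on sequence data.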
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.
In this section, two separate English and Farsi datasets were made available to us; the translation of each sentence is stored in the other file. As a first step, I wrote a program to merge these two files into a single dataset for simplicity.
This program can create two datasets: `fa2en` and `en2fa`.
Sample `en2fa` dataset:

```
raspy breathing -> صداي خر خر
dad -> پدر
maybe its the wind -> شايد صداي باد باشه
no -> نه
```

Sample `fa2en` dataset:

```
صداي خر خر -> raspy breathing
پدر -> dad
شايد صداي باد باشه -> maybe its the wind
نه -> no
```
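The merge program could be sketched as below, assuming the two files are line-aligned (line *i* of one file is the translation of line *i* of the other); the file names in the usage comment are hypothetical.

```python
def merge_parallel_files(src_path, tgt_path, out_path, sep=" -> "):
    """Zip two line-aligned files into a single 'source -> target' dataset."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip(src, tgt):
            out.write(f"{s.strip()}{sep}{t.strip()}\n")

# Swapping the arguments produces the reverse-direction dataset
# (file names below are hypothetical):
# merge_parallel_files("english.txt", "farsi.txt", "en2fa.txt")
# merge_parallel_files("farsi.txt", "english.txt", "fa2en.txt")
```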
- **Import the Dataset**
- **Preprocess the Dataset**: normalize, clean, and append `<start>` and `<end>` tokens to mark the start and end of each sequence. Encapsulate the Unicode conversion in a function `unicode_to_ascii()` and the sequence preprocessing in a function `preprocess_sentence()`.
- **Prepare the Dataset**:
  - Create word pairs combining the English sequences and their related Farsi sequences.
  - Tokenize and pad the sequences so all sentence arrays have the same length.
- **Create the Dataset**:
  - Segregate the train and validation datasets.
  - Validate the mapping that's been created between the tokens of the sequences and the indices.
- **Initialize the Model Parameters**
- **Create the Encoder Class**
- **Create the Attention Mechanism Class**
- **Create the Decoder Class**
- **Prepare the Optimizer and Loss Functions**
- **Train the Model**:
  - Define the training procedure.
  - Initialize the actual training loop.
  - Train the model, reaching a final loss of 0.4750.
- **Test the Model**:
  - Define the model evaluation procedure.
- **Plot and Predict**
- **Translate**:
  - Sample translation result: English-to-Farsi translation.
  - Retraining the model with the En2Fa dataset would have taken 4 hours; due to lack of time, it could not be done.
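The attention mechanism class mentioned above is the centerpiece of this architecture. A minimal sketch of an additive (Bahdanau-style) attention layer, following the common TensorFlow seq2seq tutorial pattern, could look like this; layer sizes and names are illustrative:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores each encoder position against the
    decoder state, then builds a weighted-sum context vector."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # collapses to a scalar score

    def call(self, query, values):
        # query:  decoder hidden state, shape (batch, hidden)
        # values: encoder outputs,      shape (batch, src_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # Softmax over source positions (axis 1), so weights sum to 1.
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```

At each decoding step the decoder concatenates `context_vector` with its own input, which is what lets the model focus on different source words while generating each target word.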