Text Analytics

This program is meant to search for a word in a news dataset.

Theory behind code

OpenMP
PCAM

Requirements

Python 3+
- Python modules:
  - pandas
```
pip install pandas
```
  - nltk
```
pip install nltk
```
Various nodes:
- This project is meant to be a work for TET subject in EAFIT University, for Systems Engineering students. We are given access to a cluster and we have two slave nodes and a master node which we are using in the code. If you have you're own nodes, you can set their ip adresses in file mpi/hosts_mpi.

Usage

First, you have to know for solution there is a batch part, whose idea is to create a table with each word's occurrency in the content of a new. To do so, a table is created where each word is paired to the id of the new it was in, how many times it was found (frequency) and the title of the new.

After the table is completely created, you have to enter the word you are looking for and the algorithm accesses to the mentioned table to search for that word. Saves registers where it was found and organizes those registers descending to finally show top 10 of registers (biggest frequency). It shows frequency, id of the new and its title. This is the online part of code.

Conventions: nThreads: number of threads you want the code runs with.

To use this program, you have to follow the next steps:

To clean files

Note: Make sure you have files in /opt/datasets so code can access to clean them. Otherwise, go to code and change path in cleaning.py, line 42.
```
$ cd serial/
$ python cleaning.py
```
To execute serial algorithm

Note: After cleaning, files are saved automatically in this folder. If you want to save it somewhere else, you'll have to change all .cpp files to indicate where they are stored.
```
$ cd serial/
$ g++ serial.cpp -std=c++11 -o serial
$ time ./serial
```
To execute OpenMP algorithm
```
$ cd openmp/
$ g++ wordCounterOMP.cpp -std=c++11 -o omp -fopenmp
$ export OMP_NUM_THREADS=<nThreads>
$ export OMP_DISPLAY_ENV='true'
$ time ./omp
```
Note: an inverted time index is going to be displayed. Divide that time on the number of threads you set and you'll get the real time it took to create batch part (table with the occurrency of each word).
To execute MPI with OpenMP algorithm

Note: np refers to number of nodes the algorithm is going to run in. It's specifically 3 because we count on 2 slaves nodes and 1 master node, as explained before.
```
$ cd mpi
$ mpic++ wordCounterMPI.cpp -o mpi -std=c++11 -fopenmp
$ export OMP_NUM_THREADS=<nThreads>
$ export OMP_DISPLAY_ENV='true'
$ mpirun -f hosts_mpi -np 3 ./mpi
```

Performance

Serial

It took 203.18s (3:23:18m) to create inverted index in serial algorithm.

This algorithm is obviously slow, given the fact creating the initial table is O(n), with n equals to number of words in all culumn content of the whole dataset.

Accessing to the created table in the online part is log(n), which gives a little efficiency to it.

OpenMP

Implementing threads helped in execution. However, as explained in OpenMP design, there were two problems with threads: they couldn't either access to the same new nor access to the same table entry to set its value. What threads could do was, given a line (new), they could start counting each word, and different lines could be given to various threads at the same time.

As shown in the following charts, we can conclude that using 3 threads is the most optimum. The more threads, the worst is going to become performance.

MPI with OpenMP

Doing a distributed serial algorithm improves efficiency in a very palpable way. Each node took afile, created the table with all words existing in each new and after master sends the word to search, they made their table with top 10 frequencies. After that, master collects those tables and organizes them to show the final table that evidences top 10 in the whole dataset.

The following table shows that, even though difference with 3 threas is really small, it keeps being more omptimum to use two threads per node.

Team members

Wendy María Arango Chavarría [email protected] -- Explanation
Manuela Carrasco Pinzón [email protected]

mnl359 / text-analytics Goto Github PK

text-analytics's Introduction

Text Analytics

Theory behind code

Requirements

Usage

Performance

Team members

text-analytics's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent