Multilingual-NLP-for-Islamic-Theology

This is a data-driven theological app that uses cross-lingual sentence embeddings to build search engines for the Holy Quran and Sahih Hadiths.

Understanding the problem

The aim of this project is to develop a search engine that takes a query about the Holy Quran and Sahih Hadiths in any supported language, such as Arabic, Hindi, Bangla or English, and returns results in English and in Bangla separately. It also supports code-mixed input, for example a query that mixes English and Arabic. This is an asymmetric semantic search problem: you usually have a short query (a question or a few keywords) and you want to find a longer paragraph that answers it.

Motivation

Subject-based search in the Quran or Hadith is a big problem even today. This project was created as a very basic step toward finding which verses of the Quran and which Sahih Hadiths match a user's query. Suppose we want to find all the verses and hadiths related to reward in the Holy Quran or Sahih Hadith. If you write the query in Bengali and search on Google, you will see many results, none of which actually cover all the verses or hadiths related to reward. To our knowledge, there is no such search engine in Bengali or any other language. Because of this absence, even today we hear many strange claims about Islamic law made without proper reference, claims that have no basis in the Quran or Hadith. Or, even if a basis exists, the place where it is mentioned cannot be found. This is why this project was undertaken.

The aim of this project is to create a search engine that takes input queries about the Holy Quran and Sahih Hadith in any language, such as Arabic, Hindi, Bengali or English (it also supports code-mixed input, i.e. a mix of English and Arabic), and provides output in Bangla and English separately. This is an "asymmetric semantic search" problem: you usually have a short query (a question or some keywords) and you want to find a long paragraph answering it.

There are a few things to keep in mind when using the project in its current state. This is not a question-answering system; building such a system requires a labeled dataset, which we do not have at the moment. The search is a mathematical/statistical process for finding the closest verses or hadiths, so the results are often inaccurate. Code-mixed text can be used as an input query, but that does not mean local languages can be mixed freely: for Bangla - Takla (Bangla written in English letters), mixed or misspelled input, the result will be wrong most of the time.

If something is searched that has no related verse or hadith, the system will show garbage results. For example, the hadith mentions the game of dice; if one searches for chess-related verses or hadiths, thinking of chess as dice, one will see garbage results.

Two metrics are currently available in the project: Euclidean distance and dot product. Most of the time the Euclidean distance results have been found to be better than the dot product results. The dot product works well in some cases, but the two metrics rarely both do well on the same query. A higher prediction score means higher similarity. If the score is low, a different keyword or query should be used instead.

The current system also does not capture the domain of a query. For example, if you search whether playing cricket is haram, you may get results related to "haram" instead of results related to "cricket". The Bengali hadiths in the system were mainly translated using an AI model called NLLB (No Language Left Behind), so there may be translation mistakes in some cases. The system may often fail to return suitable results for the input query, but most of the time it will try to output something that includes words or keywords very close to the query.

Currently the work uses LASER (Language-Agnostic Sentence Representations) embeddings as a zero-shot learning approach. These embeddings need to be fine-tuned on related labeled datasets to get better results in the future.

Supported Input Languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

Supported Output Languages:

English and Bangla only

Solution/Pipeline

We process the Bangla and English Quran with tafsir, together with the Sahih Bukhari and Sahih Muslim datasets. The code used for scraping the Bangla-translated Quran and tafsir can be found here in Hadith Scrapper.
The number of Bukhari and Muslim hadiths and chapters compiled in English differs from the number in the Bangla compilation, and the hadith order differs as well. Because the rows do not line up, we could not use the human-translated English and Bangla hadiths together for our task. That is why we translate the English hadiths into Bangla using Meta's nllb-200-1.3B model. The code used for the English-to-Bangla hadith translation can be found here.
We create multilingual LASER embeddings of the corpus using Facebook's LASER (let's call this the corpus embedding) in this directory.
Our Flask app takes the user's multilingual input, then converts the user query into a LASER embedding (let's call this the query embedding).
Using either the dot product or the pairwise Euclidean distance, we retrieve the top n rows (n is a user input, ranging from 1 to 10) of the corpus embedding that are closest to the query embedding.
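
The retrieval step above can be sketched with plain NumPy. This is a minimal illustration and not the project's actual code: the function name `top_n`, the tiny 2-dimensional vectors, and the row indices are all made up for the example, and real LASER embeddings would be used in place of the toy arrays.

```python
import numpy as np

def top_n(query_emb, corpus_emb, n=5, metric="euclidean"):
    """Return (row indices, scores) of the n corpus rows closest to the query.

    Hypothetical helper for illustration only.
    """
    query_emb = np.asarray(query_emb, dtype=np.float64)
    corpus_emb = np.asarray(corpus_emb, dtype=np.float64)
    n = max(1, min(n, len(corpus_emb)))  # keep n within the corpus size

    if metric == "euclidean":
        # Smaller distance means a closer match, so sort ascending.
        dists = np.linalg.norm(corpus_emb - query_emb, axis=1)
        order = np.argsort(dists)[:n]
        return order, dists[order]
    if metric == "dot":
        # Larger dot product means a closer match, so sort descending.
        sims = corpus_emb @ query_emb
        order = np.argsort(-sims)[:n]
        return order, sims[order]
    raise ValueError(f"unknown metric: {metric!r}")

# Toy stand-ins for the corpus embedding and the query embedding.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [3.0, 3.0]])
query = np.array([1.0, 0.2])

euclid_idx, euclid_scores = top_n(query, corpus, n=2, metric="euclidean")
dot_idx, dot_scores = top_n(query, corpus, n=2, metric="dot")
print("euclidean:", euclid_idx.tolist(), euclid_scores)
print("dot:", dot_idx.tolist(), dot_scores)
```

On this toy data the two metrics pick different top rows: the dot product favors the long vector `[3.0, 3.0]`, while the Euclidean distance favors rows that sit near the query. Note that the raw Euclidean value is a distance (lower is closer), so how the app turns it into a "higher is better" prediction score is not shown here.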

COLAB DEMO

* UPDATE *

4/4/2023: added an efficient sentence-transformers and FAISS based pipeline demo -> sentence-transformers-mlt-quran-hadith-search.ipynb

How to install the app?

Open a terminal and cd to the root directory of this project, then run:
pip install -r requirements.txt (make sure no error occurred; use !pip only inside a notebook such as Colab)
From this Google Drive link -> https://drive.google.com/drive/folders/1Zw64MRFvQxxwDLYFTdNki7HwNQOM30gy?usp=sharing download the 7 files and store them in the assets folder.
python app.py
Now go to your browser and open -> http://127.0.0.1:33507/

A demo video tutorial is available here -> https://www.youtube.com/watch?v=OWfbEfw0YO0

Limitations / Cautions

This is not a question-answering system, so it will not give an explicit answer to a question. (A question-answering system needs a labeled dataset, which is currently missing for the Holy Quran and Hadith.)
The system tries to predict the closest verse or hadith for your query using a mathematical/statistical process, so the predicted answer will not always be right.
If the user query contains spelling mistakes or incorrect words, the predictions are expected to be wrong most of the time.
If the user query contains irrelevant words and is not semantically close to any ayat or hadith, the system will return wrong results.
Most of the time the pairwise Euclidean metric gives better results than the dot product metric.
The system cannot take the domain of the query into account.
A higher prediction score indicates higher similarity. If the prediction score for your query is small, try different keywords or queries instead.
The Bangla hadiths were translated using a powerful language model, but some translations can still contain spelling mistakes.
The predicted results may not always be appropriate for your input query, but most of the time the system will try to output something that contains words or keywords very close to it.
This is a zero-shot learning approach because we lack a labeled dataset for this task. To improve the performance of the system, LASER needs to be fine-tuned further.
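
One reason the two metrics can disagree is that the dot product grows with vector length, while the Euclidean distance does not reward it. The sketch below is only an illustration of that effect on made-up 2-dimensional vectors; `l2_normalize` is a hypothetical helper, and nothing here implies the project normalizes its embeddings. After scaling every vector to unit length, the dot product equals the cosine similarity and both metrics produce the same ranking, since for unit vectors the squared distance is 2 minus twice the dot product.

```python
import numpy as np

def l2_normalize(x):
    """Scale each vector to unit length (hypothetical helper for this example)."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

# Toy stand-ins for corpus and query embeddings.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [3.0, 3.0]])
query = np.array([1.0, 0.2])

# Rankings on the raw vectors: the long row [3.0, 3.0] wins under dot product.
raw_dot_rank = np.argsort(-(corpus @ query))
raw_euc_rank = np.argsort(np.linalg.norm(corpus - query, axis=1))

# Rankings after normalization: both metrics now agree.
corpus_u = l2_normalize(corpus)
query_u = l2_normalize(query)
dot_rank = np.argsort(-(corpus_u @ query_u))
euc_rank = np.argsort(np.linalg.norm(corpus_u - query_u, axis=1))

print("raw:       ", raw_dot_rank.tolist(), raw_euc_rank.tolist())
print("normalized:", dot_rank.tolist(), euc_rank.tolist())
```

Whether normalization actually improves results on the real corpus would have to be checked against the data; this only shows why the raw metrics can rank the same rows differently.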

Disclaimer

The research project we are sharing here is a baby step toward data-driven theological research. The original problem that we tried to solve, and have not yet solved, is very complex. At this stage, do not expect the system to run perfectly: it still makes many mistakes, it will take time to improve, and it needs mass collaboration to improve its performance further.

Future Goals

Replace laser2 with laser3
Improve bangla translation quality with a bangla spell-checker
Data Driven Similarity checking between Quran and Bible
Replace dot product with xSIM metric

References

"Everything is easy until you work for it" ☺

Acknowledgements

Apsis Solutions Ltd.
bengali.ai

mobassir94 / multilingual-nlp-for-islamic-theology