DarijaDistance Library

License: MIT

DarijaDistance is a specialized Python library built to handle the linguistic nuances of Moroccan Darija. It offers tools for word comparison, distance measurement, translation lookup, and more, which makes it useful for natural language processing (NLP) tasks involving Darija.

Features

  • Word Distance Calculation: Calculate the distance between two words based on various factors, including letter differences, vowel swaps, and more.
  • Closest Word Finder: Identify the closest words to a given word using positional encoding and other techniques.
  • Translation Lookup: Retrieve potential translations for Darija words, including confidence scores.
  • Customizable: Includes various methods to handle different types of word comparisons, including specific cases like vowel swaps and character replacements.

Installation

You can install DarijaDistance via pip:

pip install DarijaDistance

Alternatively, you can clone the repository and install it locally:

git clone https://github.com/aissam-out/DarijaDistance.git
cd DarijaDistance
pip install .

Usage

Here are some basic usage examples to get you started:

Calculating Word Distance

from DarijaDistance.word_distance import WordDistance

wd = WordDistance()

vowel_dist = wd.distance_between("kelb", "kalb")
consonant_dist = wd.distance_between("kelb", "kedb")
print(f"Vowel: {vowel_dist} - Consonant: {consonant_dist}")
# Vowel: 2.2 - Consonant: 4

repeated_letter = wd.distance_between("abc", "abbc")
different_letter = wd.distance_between("abc", "abdc")
print(f"Repeated: {repeated_letter} - Different: {different_letter}")
# Repeated: 2 - Different: 3

dist_1 = wd.distance_between("9alam", "qalam")
dist_2 = wd.distance_between("9alam", "3alam")
print(f"Distance A: {dist_1} - Distance B: {dist_2}")
# Distance A: 2.2 - Distance B: 4

distance = wd.distance_between("so9", "sou9")
print(f"Distance: {distance}")
# Distance: 0

Traditional distance measures like Levenshtein focus on the number of insertions, deletions, and substitutions required to transform one word into another. While useful, these methods treat all characters equally, ignoring the phonetic and linguistic nuances present in languages like Darija.
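To make this limitation concrete, here is a minimal sketch of plain Levenshtein distance next to a variant with a cheaper vowel-for-vowel substitution. The `levenshtein` and `vowel_aware_cost` helpers and the 0.5 weight are illustrative assumptions. They are not part of DarijaDistance and do not reproduce its scores.

```python
def levenshtein(a, b, sub_cost=None):
    """Edit distance via the classic dynamic-programming recurrence.

    sub_cost(x, y) returns the cost of replacing x with y. The default
    charges 1 for every substitution, which is plain Levenshtein.
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 1
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost(ca, cb)
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


VOWELS = set("aeiou")


def vowel_aware_cost(x, y):
    # Hypothetical weight: swapping one vowel for another is cheap.
    return 0.5 if x in VOWELS and y in VOWELS else 1


# Plain Levenshtein cannot tell a vowel swap from a consonant swap.
print(levenshtein("kelb", "kalb"))  # 1
print(levenshtein("kelb", "kedb"))  # 1

# With a vowel-aware substitution cost, the two cases separate.
print(levenshtein("kelb", "kalb", vowel_aware_cost))  # 0.5
print(levenshtein("kelb", "kedb", vowel_aware_cost))  # 1
```

The only change between the two calls is the substitution cost function, which is the kind of character-level weighting a Darija-aware metric needs.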

Finding Closest Words

closest_words, min_distance = wd.get_closests("kulb")
print(f"Closest words to 'kulb': {closest_words} - min distance = {min_distance}")
# output: ['klb', 'kelb', 'kalb'] - 2.1

The WordDistance class encodes each character in a conceptual 3-dimensional space, assigning numeric values to vowels, consonants, and digits based on their relative importance and proximity within Darija. Summing these values creates a "sum image" for each word, which simplifies comparisons and improves performance. The sum abstracts away some details, such as the exact order of letters, but it reduces search complexity. Combining the 3-dimensional representation with summation gives a distance metric that reflects similarities between Darija words while keeping lookups efficient.
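The idea of a sum-based index can be sketched as follows. This is a simplified illustration, not the library's implementation: the weights (1 for vowels, 10 for everything else), the helper names, and the tolerance parameter are all assumptions, and the real encoding is 3-dimensional rather than a single integer per character.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def char_value(ch):
    """Hypothetical weights: vowels count little, other characters more."""
    return 1 if ch in VOWELS else 10


def word_sum(word):
    """Collapse a word into a single number (its "sum image")."""
    return sum(char_value(ch) for ch in word)


def build_sum_index(vocabulary):
    """Group words by their sum so similar words share nearby buckets."""
    index = defaultdict(list)
    for word in vocabulary:
        index[word_sum(word)].append(word)
    return index


def candidates(index, query, tolerance=1):
    """Return words whose sum is within `tolerance` of the query's sum."""
    target = word_sum(query)
    found = []
    for s in range(target - tolerance, target + tolerance + 1):
        found.extend(index.get(s, []))
    return found


vocabulary = ["klb", "kelb", "kalb", "bab", "tomobil"]
index = build_sum_index(vocabulary)

print(candidates(index, "kulb"))  # ['klb', 'kelb', 'kalb']

# Letter order is lost: an anagram lands in the same bucket.
print(word_sum("kalb") == word_sum("blak"))  # True
```

Only a few buckets are inspected instead of the whole vocabulary. Because order is discarded, a scheme like this would still need a finer distance to rank the candidates it returns.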

Looking Up Translations

translation = wd.lookup_translation_word("klb")
print(f"Potential translations for 'klb': {translation}")

For tasks such as finding the closest words and looking up translations, WordDistance relies on the Darija Open Dataset (DODa), which was used to create the embedded pickle files that power these features: hash_table_word.pickle and hash_table_sum.pickle. This structure enables efficient word lookup and comparison within Darija. Functions such as distance_between(), which do not require an underlying dataset, remain dataset-agnostic and work with any words, regardless of their language or context.
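As a rough picture of how a pickled hash table can back a translation lookup, consider the sketch below. The table layout (a dict mapping each Darija word to a list of translations) is an assumption made for illustration. The actual contents of hash_table_word.pickle are not documented here, and the sketch serializes to memory instead of touching the packaged files.

```python
import pickle

# Hypothetical (darija, english) pairs standing in for dataset rows.
pairs = [("klb", "dog"), ("kelb", "dog"), ("la", "no")]

# Assumed layout: word -> list of translations.
hash_table_word = {}
for darija, english in pairs:
    hash_table_word.setdefault(darija, []).append(english)

# Serialize once, then restore at load time.
blob = pickle.dumps(hash_table_word)
restored = pickle.loads(blob)

print(restored.get("klb", []))      # ['dog']
print(restored.get("tomobil", []))  # []
```

A dict lookup takes constant time on average, which is consistent with the efficient word lookup described above.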

Checking for Names

wd.check_name("aissam")
# output: (True, {'potential translations': ['Aissam'], 'confidence': '100%'})
wd.check_name("tomobil")
# output: (False, {})

The DarijaDistance library also includes a check_name method, which verifies whether a given word is recognized as a name. It scans the list of known names and returns a boolean indicating whether the word matches. If a match is found, it also returns potential translations along with a confidence score.

Managing Names and Translations with DarijaDataManager

The DarijaDataManager class provides methods for adding names and translations to your local datasets, so you can keep your data up to date.

Adding Names

You can add a new name to the list using the add_name method. If the name already exists, it won't be added again.

from DarijaDistance.preprocess import DarijaDataManager

data_manager = DarijaDataManager()

data_manager.add_name("aissam")

Adding Translations

Similarly, you can add new word translations using the add_translations method. The method ensures that only unique translations are added.

from DarijaDistance.preprocess import DarijaDataManager

data_manager = DarijaDataManager()

translations = [("la", "no"), ("klb", "dog")]
data_manager.add_translations(translations)

Contributing

Contributions are welcome! If you have ideas or suggestions, or if you find a bug, please open an issue or submit a pull request on the GitHub repo.

Running Tests

python -m unittest discover -s tests

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact

If you have any questions or feedback, you can find me on LinkedIn: Aissam Outchakoucht or on X: @aissam_out.
