findsimilar / find-similar

User-friendly library to find similar objects

Home Page: https://findsimilar.craftsman.lol/

License: MIT License

Python 98.67% Makefile 1.33%

find python search similar texts words machine-learning natural-language-processing

find-similar's Introduction

FindSimilar

User-friendly library to find similar objects

You can find the full project documentation at https://findsimilar.craftsman.lol/

Mission

The mission of the FindSimilar project is to provide a powerful and versatile open source library that empowers developers to efficiently find similar objects and perform comparisons across a variety of data types. Whether dealing with texts, images, audio, or more, our project aims to simplify the process of identifying similarities and enhancing decision-making.

Open Source Project

This is an open source project under the MIT license. Feel free to use, fork, clone, and contribute.

Features

Find similar texts

in different languages
with or without stopwords
using dictionary (or not)
using keywords (or not)

Requirements

nltk, pymorphy3
See more in Full Documentation

Development Status

The package is already available on PyPI
See more in Full Documentation

Install

with pip

pip install find-similar

See more in Full Documentation

Quickstart

from find_similar import find_similar

texts = ['one two', 'two three', 'three four']

text_to_compare = 'one four'
find_similar(text_to_compare, texts, count=10)

[TokenText(text="one two", len(tokens)=2, cos=0.5), TokenText(text="three four", len(tokens)=2, cos=0.5), TokenText(text="two three", len(tokens)=2, cos=0)]

The result is a list of TokenText instances ordered by cos.
cos is the cosine similarity score between the texts.
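
To make the cos values above easier to read, here is a minimal sketch of cosine similarity over whitespace-token counts. It is not the library's implementation (which also uses nltk and pymorphy3 for tokenizing); the cosine_similarity helper is a hypothetical name introduced only for this illustration. On the Quickstart texts it happens to give the same scores.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between whitespace-token count vectors (illustrative only)."""
    a = Counter(text_a.split())
    b = Counter(text_b.split())
    # Dot product over the tokens of the first text; Counter returns 0 for missing tokens.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


texts = ['one two', 'two three', 'three four']
scores = {text: round(cosine_similarity('one four', text), 2) for text in texts}
print(scores)
```

Each two-word text that shares exactly one word with 'one four' scores 0.5, and a text that shares none scores 0.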

See more examples in Full Documentation

Contributing

You are welcome! For an easy start, please check:

find-similar's People

Contributors

Stargazers

Watchers

Forkers

jemeljanov s1wh

find-similar's Issues

Publish the package to Anaconda

Is your feature request related to a problem? Please describe.
Anaconda hosts packages for data scientists. I propose publishing the find-similar package there.
nltk, scipy, pymorphy3, ... are already available in Anaconda

Describe the solution you'd like

find a way to upload the package to Anaconda:
- https://enterprise-docs.anaconda.com/en/latest/data-science-workflows/packages/upload.html
- https://docs.conda.io/projects/conda-build/en/stable/install-conda-build.html
- https://enterprise-docs.anaconda.com/en/latest/data-science-workflows/packages/build.html
- others
upload the package
find a way to upload the package automatically

The find_similar function doesn't work properly with a string dictionary

If we pass a dictionary to the find_similar function, it doesn't work.
We first need to call prepare_dictionary and then pass the result to find_similar.
A better way is to check the dictionary type first and, if it is a dictionary, call prepare_dictionary silently.

Make 100% test coverage

run make coverage and see the report
add tests to all blocks
add 100% minimum coverage to GitHub action run-tests.yml

Design and develop user-friendly laboratory to improve the algorithm

Now

Console scripts are being used.

Task

Design and develop something with a user-friendly interface, for example a web interface.

To do what?

analyze algorithm proximity
load test data
analyze separate parts of algorithm (tokenise for example)

Add pylint command to Makefile

The GitHub action file lint.yml contains the command pylint $(git ls-files '*.py'). This command takes the files tracked by git and runs pylint on them.
When I tried to add this command to the Makefile, it didn't work.
I think there is a way to add this command to the Makefile.

Add an option to keep or remove stop words. Currently they are always removed.

The user may want to search with stop words such as I, we, and others.
Currently we always remove stop words.
Add an option to remove them or not (default: remove, I think).
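
A minimal sketch of what such an option could look like. The tokenize function and the STOP_WORDS set below are hypothetical stand-ins, not the library's code; only the parameter name remove_stopwords comes from this project's issues.

```python
# Hypothetical stop word list for illustration only; the library's real list may differ.
STOP_WORDS = {"i", "we", "the", "a"}


def tokenize(text: str, remove_stopwords: bool = True) -> list[str]:
    """Split text into lowercase tokens, optionally dropping stop words."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


with_removal = tokenize("We like the tea")
without_removal = tokenize("We like the tea", remove_stopwords=False)
print(with_removal, without_removal)
```

Defaulting the flag to True keeps the current behavior for existing callers.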

Add test coverage report to run-test workflow

Add coverage report using pytest-cov or something else
Save report in html format
Save report as artifact on GitHub

Add one or two examples with different languages

Reduce number of arguments in core functions to get pylint test passed

According to the pylint report, the core functions take too many arguments. We need to reduce them so that the pylint check passes.

(Laboratory) For each example make management command to run in terminal

For people who prefer the terminal, we can mirror the web interface commands as terminal commands using Django management commands.

Put analytic logic separately
use it in web interface
duplicate it in management commands

Change code structure to work with different languages, dictionaries, stop words, ...

Hard-coded values are used in many parts of the project.
We need to change some interfaces to work with any dictionaries, languages, ...

For example: https://github.com/findsimilar/find-similar/blob/main/find_similar/tokenize.py#L14 - the "usefulness" words defined there work only with specific data.

Strange results

Some strange results:
"люблю есть" / "есть люблю" ("I love to eat") - 0%
"люблю еду" / "еду люблю" ("I love food") - 100%
"люблю спать" ("I love to sleep") - 0%, "люблю дрова" ("I love firewood") - 100%
"да е" / "е да" - 0%
"д е" / "е д" - 100%

Add weight to the algorithm

Use weight system in algorithm.
This system will not be used by default.

Change pymorphy2 to pymorphy3

In this discussion: #21 (comment) we decided to use pymorphy3 instead of pymorphy2.

make the changes described here: #21 (reply in thread)
run tests and coverage on Python 3.11 (make coverage)
add Python 3.11 to the GitHub action. Here is an example action: https://github.com/quillcraftsman/lavacactus/blob/master/.github/workflows/testing.yml#L9

Move simple functions that don't need a database from lab (console) to laboratory (web)

compare two
example frequency analysis
tokenize one

Build demo example for algorithm usage

To demonstrate main features
To load some simple data

Make the folder with different open source examples

Add an examples folder to the project and to the PyPI package.
Add open source examples to this folder, for example films, music, medicines, ...
Choose the examples format (yml, json, ...).
Add find_similar search parameters to the examples.
Check that the data is really open source (not secret and free to use).

Add pylint to GitHub actions. Add yaml files linter to GitHub actions.

Add pylint to GitHub actions.
Add yaml files linter to GitHub actions.

Don't display results with zero cosine (default=True)

Currently find_similar has a count parameter, which sets the number of results displayed. I propose adding another parameter, display_zero_results (a better name may exist).

If this parameter is True, all results are displayed
If this parameter is False, results with cos = 0.0 are not displayed
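
The proposed behavior can be sketched as follows. Result and filter_results are hypothetical names used only for this illustration, and display_zero_results is the parameter name proposed in this issue, not an existing library argument.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for a search result with a similarity score."""
    text: str
    cos: float


def filter_results(results, count=10, display_zero_results=True):
    """Optionally drop zero-similarity results, then keep at most `count` items."""
    if not display_zero_results:
        results = [result for result in results if result.cos != 0.0]
    return results[:count]


results = [Result("one two", 0.5), Result("three four", 0.5), Result("two three", 0.0)]
all_results = filter_results(results)
non_zero = filter_results(results, display_zero_results=False)
print([result.text for result in non_zero])
```

Filtering before slicing means count limits the non-zero results rather than being used up by zero-score entries.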

Migrate documentation to a new domain

Migrate the documentation to https://findsimilar.craftsman.lol
Update Readme

Add ability to define parts of speech (adjectives, verbs etc) to remove in tokenize function

The parts of speech that are removed when a string is split into tokens are hardcoded. Verbs are removed by default.
This can be a problem in cases where verbs are significant.

Add a new example. Show the use of the parameter "remove_stopwords"

It may use any texts (50 at most).
The main point is to show different results with and without remove_stopwords.
Put this example in the examples folder in YAML format.

Use a setter method to save cos in find_similar instead of saving cos directly

Is your feature request related to a problem? Please describe.
In the Django package django-find-similar, I use an adapter from a Django model to TokenText in find-similar.
The __init__ method works well, but I need to add new behavior when cos is saved.

Describe the solution you'd like
Currently cos is saved directly:
text.cos = cos
Create a special method or setter in the TokenText class to do this:

def set_cos(self, cos):
    self.cos = cos

Describe alternatives you've considered
We could use a Python property setter. But then we would have to make the cos attribute protected and add a getter as well. That's not a problem; it just needs more care and more replacements.
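
A minimal sketch of why a set_cos method helps an adapter. TokenTextSketch is a simplified stand-in for the library's TokenText class, and PersistingTokenText is a hypothetical adapter; neither is the actual code of find-similar or django-find-similar.

```python
class TokenTextSketch:
    """Simplified stand-in for the library's TokenText class."""

    def __init__(self, text):
        self.text = text
        self.cos = 0.0

    def set_cos(self, cos):
        """Single place where the similarity score is stored."""
        self.cos = cos


class PersistingTokenText(TokenTextSketch):
    """Hypothetical adapter that adds behavior whenever cos is saved."""

    def __init__(self, text):
        super().__init__(text)
        self.saved = []

    def set_cos(self, cos):
        super().set_cos(cos)
        self.saved.append(cos)  # stand-in for, e.g., a database write


item = PersistingTokenText("one two")
item.set_cos(0.5)
print(item.cos, item.saved)
```

If the search code calls text.set_cos(cos) instead of assigning text.cos directly, a subclass can hook into the save without touching the search code.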

Additional context

Build documentation inside GitHub action

Currently the documentation is built locally, and the action only deploys it to GitHub Pages.
Change this behavior as follows:

add command to Makefile to build docs locally
add local build to gitignore
build docs in GitHub action

Multi-languages support doesn't work

There is a language parameter in the main function find_similar: https://github.com/findsimilar/find-similar/blob/main/find_similar/core.py#L11

This parameter is 'russian' by default and is meant to provide multi-language text support.
But it doesn't work; it's just a mock. The code is hardcoded for the Russian language.

Find where the language parameter needs to be used
Use this parameter to select stop words or something else
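
One way the language parameter could drive stop word selection, as a rough sketch. The STOP_WORDS_BY_LANGUAGE mapping, its tiny word lists, and both functions are hypothetical; the real project would more likely take its stop words from nltk.

```python
# Hypothetical per-language stop word sets, for illustration only.
STOP_WORDS_BY_LANGUAGE = {
    "english": {"i", "we", "the"},
    "russian": {"я", "мы", "для"},
}


def get_stop_words(language="russian"):
    """Return the stop words for a language, failing loudly on unknown ones."""
    try:
        return STOP_WORDS_BY_LANGUAGE[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None


def tokenize(text, language="russian"):
    """Lowercase, split on whitespace, and drop the stop words of the given language."""
    stop_words = get_stop_words(language)
    return [token for token in text.lower().split() if token not in stop_words]


english_tokens = tokenize("We like the tea", language="english")
russian_tokens = tokenize("я для тебя машину")
print(english_tokens, russian_tokens)
```

Raising an error for an unsupported language avoids silently falling back to Russian, which is the behavior this issue describes.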

Add simple design to laboratory pages

bootstrap
add staticfiles
add base template

Compare TokenTexts by text instead of id

Is your feature request related to a problem? Please describe.
In the __eq__ method we compare two TokenTexts by id. This is the old approach from when we had base items and analogues.
For a universal approach, we should compare TokenTexts by text.

Describe the solution you'd like

change __eq__ method in the TokenText class

Describe alternatives you've considered

Maybe in the future we will need some additional parameters for comparison.
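
A minimal sketch of text-based equality. TokenTextSketch is a simplified stand-in, not the library's TokenText class; it also defines __hash__, because Python disables the default hash on a class that defines __eq__ unless __hash__ is defined as well.

```python
class TokenTextSketch:
    """Simplified stand-in for TokenText that compares by text, not by id."""

    def __init__(self, text, id=None):
        self.text = text
        self.id = id

    def __eq__(self, other):
        if not isinstance(other, TokenTextSketch):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        # Keep hashing consistent with equality so instances work in sets and dicts.
        return hash(self.text)


first = TokenTextSketch("one two", id=1)
second = TokenTextSketch("one two", id=2)
third = TokenTextSketch("two three", id=1)
print(first == second, first == third)
```

With this definition, two instances with the same text are equal even when their ids differ.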

Additional context

Make the lint command work for new files in git

The make lint command works for untracked files only (red).
Update this command to work with all new files (red and green).

Different descriptions in the repository and in the python package

"User-friendly library to find similar objects" in the repository description
"User-friendly library to find similar objects" in the README
"User-friendly library to find similar objects" on the website
"Algorithm to define similarity rating between objects" in the setup.py

Add str and repr methods to TokenText class

Prepare laboratory skeleton

create folder
add coverage 100%
add GitHub action with coverage
check main tests, lints and actions
add local find_similar package to easy change

There is no examples package in the find_similar package

In find-similar==1.3.0, the find_similar package does not include the examples package.

missing requirement yaml

When we try to use the examples module:
from find_similar.examples.analyze import frequency_analysis
we get this error:
ModuleNotFoundError: No module named 'yaml'

Solution:
add the yaml module (provided by the PyYAML package) to the requirements in setup.py

Remove C0103 from .pylintrc

C0103 is the pylint naming error. It was ignored because of the id attribute.
There is one problem here: if we remove C0103 from .pylintrc and run the linter with pylint $(git ls-files '*.py'), it fails on one line.
But disabling the check for that line doesn't work.
We need to solve this problem and remove C0103 from .pylintrc

Handle FileNotFound error when we request a non-existent example

Strange behavior with remove_stop_words parameter

Example from API:
Input:
{ "text_to_check": "я для тебя машину", "texts": [ "я для тебя машину", "ты для меня тост" ], "remove_stopwords": false }
Output:
[ { "text": "я для тебя машину", "cos": "0.50" }, { "text": "ты для меня тост", "cos": "0.00" } ]
With true:
Input:
{ "text_to_check": "я для тебя машину", "texts": [ "я для тебя машину", "ты для меня тост" ], "remove_stopwords": true }
Output:
[ { "text": "я для тебя машину", "cos": "1.00" }, { "text": "ты для меня тост", "cos": "0.00" } ]

Is this a bug or a feature?
