find-similar's Introduction

FindSimilar

User-friendly library to find similar objects

You can find Full Project Documentation here


Mission

The mission of the FindSimilar project is to provide a powerful and versatile open source library that empowers developers to efficiently find similar objects and perform comparisons across a variety of data types. Whether dealing with texts, images, audio, or more, our project aims to simplify the process of identifying similarities and enhancing decision-making.

Open Source Project

This is an open source project under the MIT license. Feel free to use, fork, clone, and contribute.

Features

Find similar texts

  • in different languages
  • with or without stopwords
  • using dictionary (or not)
  • using keywords (or not)

Requirements

Development Status

Install

with pip

pip install find-similar

See more in Full Documentation

Quickstart

from find_similar import find_similar

texts = ['one two', 'two three', 'three four']

text_to_compare = 'one four'
find_similar(text_to_compare, texts, count=10)
[TokenText(text="one two", len(tokens)=2, cos=0.5), TokenText(text="three four", len(tokens)=2, cos=0.5), TokenText(text="two three", len(tokens)=2, cos=0)]
  • The result is a list of TokenText instances ordered by cos, highest first
  • cos is the similarity score (cosine) between the compared texts
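
The scores in the output above are consistent with a cosine similarity computed over sets of tokens. The sketch below works under that assumption and is not the library's actual implementation: cosine_by_tokens and rank are hypothetical names, and tokenization is plain whitespace splitting.

```python
import math


def cosine_by_tokens(left: str, right: str) -> float:
    """Cosine similarity between two texts treated as sets of tokens."""
    a, b = set(left.split()), set(right.split())
    if not a or not b:
        return 0.0
    # For binary token vectors the dot product is the number of shared tokens
    # and each norm is the square root of the number of distinct tokens.
    return len(a & b) / math.sqrt(len(a) * len(b))


def rank(text_to_compare: str, texts: list[str], count: int = 10) -> list[tuple[str, float]]:
    """Score every text against text_to_compare and keep the best `count`."""
    scored = [(text, cosine_by_tokens(text_to_compare, text)) for text in texts]
    # sorted() is stable, so texts with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:count]


texts = ['one two', 'two three', 'three four']
print(rank('one four', texts))
# [('one two', 0.5), ('three four', 0.5), ('two three', 0.0)]
```

'one four' shares one of two tokens with 'one two' and with 'three four', giving 1 / sqrt(2 * 2) = 0.5, and shares nothing with 'two three'.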

See more examples in Full Documentation

Contributing

Contributions are welcome! To get started easily, please check:

find-similar's People

Contributors

danteonline, jemeljanov, quillcraftsman, s1wh, vtyushkevich

Forkers

jemeljanov s1wh

find-similar's Issues

Publish package to the anaconda

Is your feature request related to a problem? Please describe.
Anaconda hosts packages for data scientists. I propose publishing the find-similar package there.
nltk, scipy, pymorphy3, ... are already available in Anaconda.

Describe the solution you'd like

Make 100% test coverage

  • run make coverage and see the report
  • add tests to all blocks
  • add 100% minimum coverage to GitHub action run-tests.yml

Add pylint command to Makefile

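The GitHub action file lint.yml contains the command pylint $(git ls-files '*.py'). This command takes the Python files tracked by git and runs pylint on them.
I tried to add this command to the Makefile, but it did not work. A likely cause is that make expands $(...) itself, so the dollar sign would need to be escaped as $$(git ls-files '*.py').
I think there is a way to add this command to the Makefile.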

Strange results

Some strange results:
"люблю есть" / "есть люблю" - 0%
"люблю еду" / "еду люблю" - 100%
"люблю спать - 0, люблю дрова - 100%
"да е" / "е да" - 0%
"д е" / "е д" - 100%

Make the folder with different open source examples

  • Add examples folder to the project and to the pypi package.
  • Add open source examples to this folder. For example films, music, medicines, ...
  • Choose examples format (yml, json, ...).
  • Add parameters to find_similar search to the examples.
  • Check that the data is really open source (not confidential and free to use).

Don't display results with zero cosine (default=True)

find_similar currently has a count parameter, which sets how many results are displayed. I propose adding another parameter, display_zero_results (a better name may exist).

  • If this parameter is True then we will display all results
  • If this parameter is False then we will not display results with cos = 0.0
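
A minimal sketch of the proposed behavior, assuming results are already sorted by cos in descending order. The function filter_results and the (text, cos) tuple shape are hypothetical and only illustrate the idea; they are not part of find_similar.

```python
def filter_results(results, count=10, display_zero_results=True):
    """Optionally drop zero-cosine results, then keep at most `count` items.

    `results` is a list of (text, cos) pairs sorted by cos, highest first.
    """
    if not display_zero_results:
        results = [pair for pair in results if pair[1] != 0.0]
    return list(results)[:count]


ranked = [('one two', 0.5), ('three four', 0.5), ('two three', 0.0)]
print(filter_results(ranked))                              # all three results
print(filter_results(ranked, display_zero_results=False))  # zero-cosine result dropped
```

Filtering before applying count means a request for count results is not padded with zero-cosine entries when display_zero_results is False.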

Use a set method to save cos in find_similar instead of saving cos directly

Is your feature request related to a problem? Please describe.
In the Django package django-find-similar I use an adapter from a Django model to TokenText in find-similar.
The __init__ method works well, but I need to add new behavior at the moment cos is saved.

Describe the solution you'd like
Currently cos is saved directly:
text.cos = cos
Instead, create a dedicated method or setter in the TokenText class:

def set_cos(self, cos):
    self.cos = cos

Describe alternatives you've considered
We could use a Python property with a setter (@property and @cos.setter). In that case the cos attribute has to become protected and a getter has to be added. It's not a problem, it just needs more attention and replacements.
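
The property-based alternative could look like the sketch below. TokenTextSketch and RecordingTokenText are hypothetical stand-ins written for illustration, not the real find-similar or django-find-similar classes.

```python
class TokenTextSketch:
    """Hypothetical stand-in for TokenText with cos exposed as a property."""

    def __init__(self, text, cos=None):
        self.text = text
        self._cos = cos  # protected storage behind the property

    @property
    def cos(self):
        return self._cos

    @cos.setter
    def cos(self, value):
        self._cos = value


class RecordingTokenText(TokenTextSketch):
    """Hypothetical adapter that adds behavior whenever cos is saved."""

    def __init__(self, text):
        super().__init__(text)
        self.history = []

    @TokenTextSketch.cos.setter
    def cos(self, value):
        # Extra behavior runs on every assignment; here it records the value.
        self.history.append(value)
        self._cos = value


item = RecordingTokenText('one two')
item.cos = 0.5  # the existing `text.cos = cos` call style keeps working
print(item.cos, item.history)
```

With a property, callers that assign text.cos = cos need no changes, while an explicit set_cos method would require replacing each assignment.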

Additional context

Build documentation inside GitHub action

Currently the documentation is built locally, and the action only deploys it to GitHub Pages.
Change this behavior as follows:

  1. add command to Makefile to build docs locally
  2. add local build to gitignore
  3. build docs in GitHub action

compare TokenTexts by text instead of id

Is your feature request related to a problem? Please describe.
In the __eq__ method we compare two TokenTexts by id. This is a legacy approach from when we had base items and analogues.
For a universal approach, we should compare TokenTexts by text.

Describe the solution you'd like

  • change __eq__ method in the TokenText class

Describe alternatives you've considered

  • In the future we may need some additional parameters for the comparison
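
A minimal sketch of text-based equality, using a hypothetical class TokenTextByText rather than the real TokenText. Defining __eq__ sets __hash__ to None unless it is also defined, so the sketch hashes by text as well to keep instances usable in sets and as dict keys.

```python
class TokenTextByText:
    """Hypothetical stand-in for TokenText that compares by text, not id."""

    def __init__(self, text, id=None):
        self.text = text
        self.id = id

    def __eq__(self, other):
        if not isinstance(other, TokenTextByText):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash(self.text)


first = TokenTextByText('one two', id=1)
second = TokenTextByText('one two', id=2)
third = TokenTextByText('two three', id=1)
print(first == second, first == third)  # True False
```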

Additional context

Different descriptions in the repository and in the python package

  • "User-friendly library to find similar objects" in the repository description
  • "User-friendly library to find similar objects" in the README
  • "User-friendly library to find similar objects" on the website
  • "Algorithm to define similarity rating between objects" in the setup.py

Prepare laboratory skeleton

  • create folder
  • add coverage 100%
  • add GitHub action with coverage
  • check main tests, lints and actions
  • add the local find_similar package so it is easy to change

missing requirement yaml

When we try to use the examples module:
from find_similar.examples.analyze import frequency_analysis
we have this error:
ModuleNotFoundError: No module named 'yaml'

Solution:
add the yaml module (distributed on PyPI as PyYAML) to the requirements in setup.py

Remove C0103 from .pylintrc

C0103 (invalid-name) is currently disabled in .pylintrc. It was disabled because of the id attribute.
There is one problem here: if we remove C0103 from .pylintrc and run the linter with pylint $(git ls-files '*.py'), it fails on one line.
But if we disable that line, it won't work.
We need to solve this problem and remove C0103 from .pylintrc.

Strange behavior with remove_stop_words parameter

Example from API:
Input:
{ "text_to_check": "я для тебя машину", "texts": [ "я для тебя машину", "ты для меня тост" ], "remove_stopwords": false }
Output:
[ { "text": "я для тебя машину", "cos": "0.50" }, { "text": "ты для меня тост", "cos": "0.00" } ]
With true:
Input:
{ "text_to_check": "я для тебя машину", "texts": [ "я для тебя машину", "ты для меня тост" ], "remove_stopwords": true }
Output:
[ { "text": "я для тебя машину", "cos": "1.00" }, { "text": "ты для меня тост", "cos": "0.00" } ]

Is this a bug or a feature?
