confusable_homoglyphs's Introduction

⚠️ As of January 2024 this project is now maintained by Elena “of Valhalla” at sr.ht/~valhalla/confusable_homoglyphs/.

confusable_homoglyphs [doc]

"A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (Wikipedia: Homoglyph)

Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.

  • AlaskaJazz is single script: only Latin characters.
  • ΑlaskaJazz is mixed-script: the first character is a Greek letter.

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a utility to play with these confusable homoglyphs.

Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from Unicode blocks of your choosing.

  • Allo and ρττ are fine: single script.
  • AlloΓ is fine when our preferred script alias is 'latin': mixed script, but Γ is not confusable.
  • Alloρ is dangerous: mixed script and ρ could be confused with p.
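
The single-script versus mixed-script distinction can be approximated with the standard library alone. The sketch below is not how this library works: it assumes that the first word of each character's Unicode name ('LATIN', 'GREEK', and so on) is a good enough stand-in for its script, and `rough_scripts` and `looks_mixed_script` are names made up for illustration. It says nothing about whether a character is confusable, which needs the Unicode confusables data.

```python
import unicodedata


def rough_scripts(text):
    """Return the set of script-like labels found in text.

    Assumption: the first word of a character's Unicode name (for example
    'LATIN' or 'GREEK') stands in for its script. This only approximates
    the aliases that the library derives from the Unicode data files.
    """
    labels = set()
    for char in text:
        if char.isalpha():
            labels.add(unicodedata.name(char, "UNKNOWN").split(" ")[0])
    return labels


def looks_mixed_script(text):
    """True when letters carrying more than one script-like label appear."""
    return len(rough_scripts(text)) > 1


print(looks_mixed_script("AlaskaJazz"))        # False: Latin letters only
print(looks_mixed_script("\u0391laskaJazz"))   # True: starts with GREEK CAPITAL LETTER ALPHA
```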

This library is compatible with Python 2 and Python 3.

Is the data up to date?

Yep.

The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.

The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.

This data is stored in two JSON files: categories.json and confusables.json. If you delete them, they will both be recreated by downloading and parsing the two above-mentioned files, and stored as JSON files again.
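
The "recreate the JSON file when it is missing" behaviour described above follows a common load-or-build pattern. The following is a minimal sketch of that pattern, not the library's actual code: `load_or_build` is a hypothetical helper, and the `build` callable stands in for the download-and-parse step.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Load JSON from path, or call build() and store its result there.

    `build` is a hypothetical stand-in for the download-and-parse step.
    """
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    data = build()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data


# Usage with a throwaway directory and fake builders.
with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "confusables.json")
    first = load_or_build(cache, lambda: {"p": ["\u03c1"]})  # file is created
    second = load_or_build(cache, lambda: {})                # file is reused
    print(first == second)
```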

confusable_homoglyphs's People

Contributors

muusik, vhf

confusable_homoglyphs's Issues

add check command to the `confusable_homoglyphs` CLI

It would be useful to be able to check files for confusable homoglyphs on the command-line. While it is possible to write a python3 -c command, it would be much more ergonomic for the confusable_homoglyphs wrapper to have a check command that would check filenames in the arguments for confusable homoglyphs. Perhaps an argument of - could cause the command to check stdin, or maybe it should check stdin when there are no filename arguments.
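
A rough sketch of what such a command could look like follows. Everything in it is hypothetical: the `check` program name, the `check_stream` helper, and in particular `is_suspicious`, which only flags non-ASCII characters as a placeholder for a real call into the library's confusable checks.

```python
import argparse
import sys


def is_suspicious(line):
    """Placeholder check: flags any non-ASCII character.

    A real command would call the library's confusable checks here.
    """
    return any(ord(char) > 127 for char in line)


def check_stream(name, lines):
    """Yield 'name:lineno' for each suspicious line in an iterable of lines."""
    for lineno, line in enumerate(lines, start=1):
        if is_suspicious(line):
            yield "{}:{}".format(name, lineno)


def main(argv=None):
    """Check the named files; '-' or no arguments means standard input."""
    parser = argparse.ArgumentParser(prog="check")
    parser.add_argument("files", nargs="*", help="files to check ('-' for stdin)")
    args = parser.parse_args(argv)
    hits = []
    for name in args.files or ["-"]:
        if name == "-":
            hits.extend(check_stream("<stdin>", sys.stdin))
        else:
            with open(name, encoding="utf-8") as handle:
                hits.extend(check_stream(name, handle))
    for hit in hits:
        print(hit)
    return 1 if hits else 0
```

A script entry point would call `sys.exit(main())`, so that a non-zero exit status signals that something was found.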

TypeError: 'NoneType' object is not iterable

With the latest version:

In [1]: import confusable_homoglyphs.confusables

In [2]: confusable_homoglyphs.confusables.is_dangerous(string='Їt', preferred_aliases=['latin'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a295e9ba12b9> in <module>()
----> 1 confusable_homoglyphs.confusables.is_dangerous(string='Їt', preferred_aliases=['latin'])

~/.pyenv/versions/3.7.0/envs/kamabot/lib/python3.7/site-packages/confusable_homoglyphs/confusables.py in is_dangerous(string, preferred_aliases)
    158     :rtype: bool
    159     """
--> 160     return is_mixed_script(string) and is_confusable(string, preferred_aliases=preferred_aliases)

~/.pyenv/versions/3.7.0/envs/kamabot/lib/python3.7/site-packages/confusable_homoglyphs/confusables.py in is_confusable(string, greedy, preferred_aliases)
    109             potentially_confusable = []
    110             try:
--> 111                 for d in found:
    112                     aliases = [alias(glyph) for glyph in d['c']]
    113                     for a in aliases:

TypeError: 'NoneType' object is not iterable

python2: is_confusable cannot handle unicode preferred_aliases

(Thanks for creating such a useful tool as confusable_homoglyphs.)
Since the return type of categories.alias() is unicode when running in Python 2, it would make sense for the preferred_aliases argument of confusables.is_confusable() to accept a list of unicode values as well. Instead, it requires a list of str and raises a TypeError otherwise. An example follows:

>>> from confusable_homoglyphs import confusables
>>> confusables.is_confusable('', preferred_aliases=[u'LATIN'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/confusable_homoglyphs/confusables.py", line 92, in is_confusable
    preferred_aliases = list(map(str.upper, preferred_aliases))
TypeError: descriptor 'upper' requires a 'str' object but received a 'unicode'
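
One possible way around this, sketched here rather than taken from the library, is to call upper() on each value instead of mapping the unbound str.upper over the list. `normalise_aliases` is a hypothetical helper name.

```python
def normalise_aliases(preferred_aliases):
    """Upper-case each alias by calling the method on the value itself.

    Unlike map(str.upper, ...), this does not depend on the exact string
    type, so it accepts both `str` and Python 2 `unicode` values.
    """
    return [alias.upper() for alias in preferred_aliases]


print(normalise_aliases([u"latin", "Common"]))  # ['LATIN', 'COMMON']
```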

Confusables for ㅋ vs. ᄏ

I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:

ㅋ ᄏ ᆿ

But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:

from confusable_homoglyphs import confusables
khieukh1s = confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh1s[0]['homoglyphs']))
# >> {'ᄏ'}
khieukh2s = confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh2s[0]['homoglyphs']))
# >> {'ㅋ', 'ᆿ'}

Is this expected behavior?

(Somewhat related to this issue.)
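
As a workaround sketch, the one-step results can be followed repeatedly until no new character turns up. The `ONE_STEP` table below simply hard-codes the two results quoted above, and `all_lookalikes` is a hypothetical helper, not part of the package.

```python
from collections import deque

# Hypothetical one-step lookups mirroring the results quoted above.
ONE_STEP = {
    "ㅋ": {"ᄏ"},
    "ᄏ": {"ㅋ", "ᆿ"},
}


def all_lookalikes(char, one_step):
    """Follow one-step lookups breadth-first until no new character appears."""
    seen = {char}
    queue = deque([char])
    while queue:
        current = queue.popleft()
        for other in one_step.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen - {char}


print(all_lookalikes("ㅋ", ONE_STEP))  # both ᄏ and ᆿ are reached
```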

raise Exception('Datafile not found, datafile generation failed!') gives no debugging path

Currently, hitting the Absolutely EVIL 'Datafile not found, datafile generation failed!' error in version 3.0.0 of the library.

That's it, that's all the debugging data I get. And that's the problem: I should also be given the path it's looking for and failing to find. But I'm not.

The worst part about being 100% blocked on a Django deployment is that, based on the error message alone, I have no idea where it's looking for the file in question, whether it's a permissions issue, or whatever. I'm just getting the same worthless Django exception over and over with no debug data that's relevant, no CLI docs for the package, and no way out. Double-checked the docs.

The files are as they should be, on the correct path, or at least they work in that path on one system. Since I have no way to know what the second system is expecting (due to the lack of debugging path data in the exception), I'm just trying to make sure everything else matches, and it does according to sha1/md5 hashes. Yet I still get the same exception, and that exception gives me no data to go on to improve my situation.

And really, these data files should be in the lib and something I shouldn't have to mess with.

I also checked Stack Overflow, and only found this https://stackoverflow.com/questions/46512148/datafile-not-found-datafile-generation-failed-in-confusable-homoglyphs-categori but I have the files on disk.

Desired fix:

  • Include paths used and not found in the exception message.
  • Include the files in question in the library, so that they don't violate DRY and SOLID and keep a good separation of concerns, without cluttering up the thousands if not millions of projects that use Django and thus use this library.
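
To illustrate the first point, here is a minimal sketch of an error message that names the path. `load_datafile` is a hypothetical helper, not the library's loader, and the wording of the message is only an example.

```python
import json
import os


def load_datafile(path):
    """Load a JSON data file, naming the path in the error when it fails."""
    absolute = os.path.abspath(path)
    try:
        with open(absolute, encoding="utf-8") as handle:
            return json.load(handle)
    except OSError as exc:
        # OSError covers both a missing file and a permissions problem.
        raise RuntimeError(
            "Datafile not found or unreadable: {} ({})".format(absolute, exc)
        ) from exc


try:
    load_datafile(os.path.join("no-such-dir", "categories.json"))
except RuntimeError as err:
    print(err)
```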

Library gives fatal error when unable to contact unicode.org

If utils.get() is run too many times, unicode.org may throttle or even blacklist the server. In this case, categories.py fails with the error Datafile not found, datafile generation failed! The timeout on unicode.org is reported when the file is run from the command line, but not when it is called from django-registration.

I'm open to various ways of avoiding the timeout issue in the first place. Two ideas: allow users to specify a path to cached copies of categories.json and confusables.json (the release on PyPI does not include these files), or allow them to point to a mirror copy of those files on a different server.

Regardless, I suggest that utils.get() raise an error if it is unable to contact unicode.org, so that the errors are more informative.
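
Something along these lines is what I have in mind. This is only a sketch under assumptions: `fetch` and `DownloadError` are hypothetical names, not the current utils.get(), and the `opener` parameter exists purely so the failure path can be exercised without a network.

```python
import urllib.error
import urllib.request


class DownloadError(RuntimeError):
    """Hypothetical error type for a failed data download."""


def fetch(url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body, or raise DownloadError naming the URL."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        raise DownloadError("Could not download {}: {}".format(url, exc)) from exc


def refuse(url, timeout):
    """Fake opener that simulates an unreachable server."""
    raise urllib.error.URLError("connection refused")


try:
    fetch("https://example.invalid/confusables.txt", opener=refuse)
except DownloadError as err:
    print(err)
```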

I vs. l vs. 1 vs. |

Case 1

I (I capital "eye") vs. l (l lowercase "ell") with just LATIN

Code:

from confusable_homoglyphs import confusables
confusables.is_confusable('I', preferred_aliases=['LATIN'], greedy=True)

Expected: at least l

Actual: False

Case 2

I (I capital "eye") vs. l (l lowercase "ell") vs. number 1 vs. pipe | with LATIN and COMMON

Code:

from confusable_homoglyphs import confusables
confusables.is_confusable('I', preferred_aliases=['LATIN', 'COMMON'], greedy=True)

Expected: at least l, |, 1

Actual: False

AttributeError: module 'configparser' has no attribute 'SafeConfigParser' for file versioneer.py

The file versioneer.py still uses SafeConfigParser and readfp, which were required before Python 3.2, when read_file was introduced and the default behaviour of ConfigParser was changed to be strict.

https://docs.python.org/3/library/configparser.html#customizing-parser-behaviour
https://docs.python.org/3/library/configparser.html#configparser.ConfigParser.read_file

With Python 3.12, SafeConfigParser and readfp are no longer available.

I found this by running python3.12 setup.py clean (it was being run in a CI system on the version from PyPI, and I could reproduce it on master). I don't know what else would be affected.

Simply replacing SafeConfigParser with ConfigParser and readfp with read_file works and, from what I've read in the Python documentation, should not change the behaviour.
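
For reference, the replacement looks roughly like this in isolation. The setup.cfg contents below are made up for the example and are not copied from this repository.

```python
import configparser
import io

# Hypothetical setup.cfg contents, for illustration only.
SETUP_CFG = """\
[versioneer]
VCS = git
style = pep440
"""

parser = configparser.ConfigParser()        # instead of SafeConfigParser()
parser.read_file(io.StringIO(SETUP_CFG))    # instead of the removed readfp()
print(parser.get("versioneer", "VCS"))      # git
```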

Not all the function is showing

Hi, I recently installed the package and am trying to use it in my project, but not all the functions are showing from the class when I try to use confusable_homoglyphs.