vhf / confusable_homoglyphs

ϲοｎｆｕѕаｂｌе＿һοｍоɡｌｙｐｈｓ

Home Page: https://pypi.python.org/pypi/confusable_homoglyphs/

License: MIT License

Python 98.26% Makefile 1.74%

python unicode confusable homoglyphs attack

confusable_homoglyphs's Introduction

⚠️ As of January 2024 this project is now maintained by Elena “of Valhalla” at sr.ht/~valhalla/confusable_homoglyphs/.

confusable_homoglyphs [doc]

A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar (Wikipedia: Homoglyph).

Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.

AlaskaJazz is single script: only Latin characters.
ΑlaskaJazz is mixed-script: the first character is a Greek letter.
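To make the single-script versus mixed-script distinction concrete, here is a minimal standard-library sketch. It does not use this library's API: `crude_script` and `is_mixed_script_sketch` are hypothetical helpers, and taking the first word of a character's Unicode name is only a rough stand-in for real script data.

```python
import unicodedata


def crude_script(char):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def is_mixed_script_sketch(string):
    """Return True when the letters of `string` come from more than one script."""
    scripts = {crude_script(c) for c in string if c.isalpha()}
    return len(scripts) > 1


latin_name = "AlaskaJazz"
spoofed_name = "\u0391laskaJazz"  # starts with GREEK CAPITAL LETTER ALPHA

print(is_mixed_script_sketch(latin_name))    # False
print(is_mixed_script_sketch(spoofed_name))  # True
```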

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a utility to play with these confusable homoglyphs.

Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from some Unicode blocks of your choosing.

Allo and ρττ are fine: single script.
AlloΓ is fine when our preferred script alias is 'latin': mixed script, but Γ is not confusable.
Alloρ is dangerous: mixed script and ρ could be confused with p.
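The three cases above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `TOY_CONFUSABLES` is a one-entry hand-written table, and `crude_script` and `is_dangerous_sketch` are hypothetical names.

```python
import unicodedata

# Toy confusables table for illustration only; the real data comes from
# the Unicode Consortium's confusables file.
TOY_CONFUSABLES = {
    "\u03c1": ["p"],  # GREEK SMALL LETTER RHO resembles LATIN SMALL LETTER P
}


def crude_script(char):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(char, "UNKNOWN").split()[0]


def is_dangerous_sketch(string, preferred_script="LATIN"):
    """Mixed script AND at least one character that mimics the preferred script."""
    scripts = {crude_script(c) for c in string if c.isalpha()}
    if len(scripts) <= 1:
        return False  # single script: nothing to flag
    for char in string:
        if crude_script(char) == preferred_script:
            continue  # characters already in the preferred script are fine
        for lookalike in TOY_CONFUSABLES.get(char, []):
            if crude_script(lookalike) == preferred_script:
                return True
    return False


for candidate in ["Allo", "\u03c1\u03c4\u03c4", "Allo\u0393", "Allo\u03c1"]:
    print(candidate, is_dangerous_sketch(candidate))
```

Only the last string is flagged, because it is the only one that is both mixed-script and contains a character listed as a lookalike of a Latin letter.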

This library is compatible with Python 2 and Python 3.

API documentation

Is the data up to date?

Yep.

The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.

The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.
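As a rough idea of what parsing such a file involves, the sketch below reads one data line. It assumes the layout `source ; target ; type # comment` with hexadecimal code points; the sample line is written by hand for illustration, and `parse_confusable_line` is a hypothetical helper, not part of this library.

```python
def parse_confusable_line(line):
    """Parse one data line into (source, target), or None for blanks and comments."""
    data = line.lstrip("\ufeff").split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # Each field holds one or more space-separated hexadecimal code points.
    source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
    target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
    return source, target


# Hand-written sample in the assumed layout.
sample = "03C1 ;\t0070 ;\tMA\t# GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P"

print(parse_confusable_line(sample))
print(parse_confusable_line("# a comment line"))
```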

This data is stored in two JSON files: categories.json and confusables.json. If you delete them, both are recreated by downloading and parsing the two files mentioned above, and the results are stored as JSON again.
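The "recreate when missing" behaviour follows a common load-or-build pattern, sketched here. `load_or_build` and `fake_build` are hypothetical names, and the build step is a stand-in for downloading and parsing the Unicode data files.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Load JSON from `path`; if the file is missing, build it, store it, return it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    data = build()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data


def fake_build():
    # Stand-in for downloading and parsing the Unicode data files.
    return {"\u03c1": ["p"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "confusables.json")
    first = load_or_build(cache, fake_build)   # file missing: builds and writes it
    second = load_or_build(cache, fake_build)  # file present: reads it back
    print(first == second)
```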

confusable_homoglyphs's Issues

Homoglyph for \u15e9

The symbol "ᗩ" looks a lot like the Latin "A", but it isn't recognized.

add check command to the `confusable_homoglyphs` CLI

It would be useful to be able to check files for confusable homoglyphs on the command-line. While it is possible to write a python3 -c command, it would be much more ergonomic for the confusable_homoglyphs wrapper to have a check command that would check filenames in the arguments for confusable homoglyphs. Perhaps an argument of - could cause the command to check stdin, or maybe it should check stdin when there are no filename arguments.

TypeError: 'NoneType' object is not iterable

With the latest version:

In [1]: import confusable_homoglyphs.confusables

In [2]: confusable_homoglyphs.confusables.is_dangerous(string='Їt', preferred_aliases=['latin'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a295e9ba12b9> in <module>()
----> 1 confusable_homoglyphs.confusables.is_dangerous(string='Їt', preferred_aliases=['latin'])

~/.pyenv/versions/3.7.0/envs/kamabot/lib/python3.7/site-packages/confusable_homoglyphs/confusables.py in is_dangerous(string, preferred_aliases)
    158     :rtype: bool
    159     """
--> 160     return is_mixed_script(string) and is_confusable(string, preferred_aliases=preferred_aliases)

~/.pyenv/versions/3.7.0/envs/kamabot/lib/python3.7/site-packages/confusable_homoglyphs/confusables.py in is_confusable(string, greedy, preferred_aliases)
    109             potentially_confusable = []
    110             try:
--> 111                 for d in found:
    112                     aliases = [alias(glyph) for glyph in d['c']]
    113                     for a in aliases:

TypeError: 'NoneType' object is not iterable

python2: is_confusable cannot handle unicode preferred_aliases

(Thanks for creating such a useful tool as confusable_homoglyphs.)
Since the return type of categories.alias() is unicode when running in Python 2, it would make sense for the preferred_aliases argument of confusables.is_confusable() to accept a list of the same type. Instead, it requires a list of str and raises a TypeError otherwise. An example follows:

>>> from confusable_homoglyphs import confusables
>>> confusables.is_confusable('', preferred_aliases=[u'LATIN'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/confusable_homoglyphs/confusables.py", line 92, in is_confusable
    preferred_aliases = list(map(str.upper, preferred_aliases))
TypeError: descriptor 'upper' requires a 'str' object but received a 'unicode'
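The failing line in the traceback maps the unbound `str.upper` over the list, which in Python 2 rejects `unicode` objects. One possible fix, sketched here with a hypothetical helper name, is to call `.upper()` on each element so that any string type is accepted:

```python
def normalise_aliases(preferred_aliases):
    """Upper-case each alias through its own method instead of str.upper."""
    return [alias.upper() for alias in preferred_aliases]


print(normalise_aliases([u"latin", "Common"]))  # ['LATIN', 'COMMON']
```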

Confusables for ㅋ vs. ᄏ

I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:

ㅋ ᄏ ᆿ

But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:

from confusable_homoglyphs import confusables
khieukh1s = confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh1s[0]['homoglyphs']))
# >> {'ᄏ'}
khieukh2s = confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)
set(map(lambda x: x['c'], khieukh2s[0]['homoglyphs']))
# >> {'ㅋ', 'ᆿ'}

Is this expected behavior?

(Somewhat related to this issue.)
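One way to read the result above is that the lookup returns only direct links, while the three characters form a single group when links are followed transitively. The sketch below uses toy data that reproduces the two reported results; the entry for ᆿ is an assumption, and `confusable_group` is a hypothetical helper rather than part of the library.

```python
# Direct links as reported above; the entry for "ᆿ" is assumed.
DIRECT_LINKS = {
    "ㅋ": {"ᄏ"},
    "ᄏ": {"ㅋ", "ᆿ"},
    "ᆿ": {"ᄏ"},
}


def confusable_group(char, direct):
    """Collect every character reachable from `char` by following direct links."""
    seen = {char}
    stack = [char]
    while stack:
        current = stack.pop()
        for neighbour in direct.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen - {char}


print(confusable_group("ㅋ", DIRECT_LINKS))
```

With the links followed transitively, looking up ㅋ yields both ᄏ and ᆿ, matching the list on the Unicode site.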

raise Exception('Datafile not found, datafile generation failed!') gives no debugging path

Currently, hitting the Absolutely EVIL 'Datafile not found, datafile generation failed!' error in version 3.0.0 of the library.

That's it, that's all the debugging data I get. And that's the problem: I should also be given the path it's looking for and failing to find. But I'm not.

The worst part about being 100% blocked on a Django deployment is that, based on the error message alone, I have no idea where it's looking for the file in question, whether it's a permissions issue, or whatever. I'm just getting the same worthless Django exception over and over with no relevant debug data, no CLI docs for the package, and no way out. I double-checked the docs.

Files are as they should be, on the correct path, or at least they work in that path on one system. Since I have no way to know what the second system is expecting (due to the lack of debugging path data in the exception), I'm just trying to make sure everything else matches, and it does according to sha1/md5 hashes. Yet I still get the same exception, and that same exception gives me no data to go off of to improve my situation.

And really, these data files should be in the lib and something I shouldn't have to mess with.

I also checked Stack Overflow and only found this: https://stackoverflow.com/questions/46512148/datafile-not-found-datafile-generation-failed-in-confusable-homoglyphs-categori but I have the files on disk.

Desired fix:

Include paths used and not found in the exception message.
Include the files in question in the library so they don't violate DRY and SOLID and can keep a good separation of concerns without having to clutter up thousands if not millions of projects that use django and thus use this library.
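The first request could look roughly like the sketch below, which names the absolute path and the underlying OS error in the exception message. `load_data_file` is a hypothetical helper, not the library's code.

```python
import json
import os


def load_data_file(path):
    """Load a JSON data file, naming the exact path in any failure."""
    absolute = os.path.abspath(path)
    try:
        with open(absolute, encoding="utf-8") as handle:
            return json.load(handle)
    except OSError as error:
        # Covers missing files and permission problems alike.
        raise RuntimeError(
            "Datafile not found or unreadable: {} ({})".format(absolute, error)
        ) from error


try:
    load_data_file("no_such_dir/categories.json")
except RuntimeError as error:
    print(error)
```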

Library gives fatal error when unable to contact unicode.org

If utils.get() is run too many times, unicode.org may throttle or even blacklist the server. In this case, categories.py fails with the error Datafile not found, datafile generation failed! The timeout on unicode.org is reported when the file is run from the command line, but not when it is called from django-registration.

I'm open to various ways of avoiding the timeout issue in the first place. Two ideas are allowing users to specify a path to cached copies of categories.json and confusables.json (the release on PyPI does not include these files), or to a mirror copy of those files on a different server.

Regardless, I suggest that utils.get() raise an error if it is unable to contact unicode.org, so that the errors are more informative.
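A download helper that fails loudly might look like this sketch. `fetch` and `DownloadError` are hypothetical names, not the library's utils.get(); the `opener` parameter exists only so that the failure path can be demonstrated without touching the network.

```python
import urllib.error
import urllib.request


class DownloadError(Exception):
    """Raised when a data file cannot be downloaded."""


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body, or raise DownloadError naming the URL."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as error:  # URLError and socket timeouts are OSError subclasses
        raise DownloadError("Could not download {}: {}".format(url, error)) from error


def unreachable(url, timeout):
    # Simulates a server that throttles or blocks the request.
    raise urllib.error.URLError("timed out")


try:
    fetch("https://example.invalid/confusables.txt", opener=unreachable)
except DownloadError as error:
    print(error)
```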

Translate from confusable string to simple string?

Thank you for this package.

I have this use case:

People accidentally insert Unicode characters and are not aware of this.

I want to normalize the user input. See this StackOverflow question: http://stackoverflow.com/questions/43367355/translate-unicode-to-ascii-if-possible

I have not found a simple way to do this with your library.

... or I am on the wrong track here, and your library can't help me.

Thank you for sharing your code.
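For the compatibility-character part of this use case, the standard library may already be enough. The sketch below uses NFKD normalization followed by an ASCII encode that drops whatever is left; `fold_to_ascii` is a hypothetical helper and does not use this library.

```python
import unicodedata


def fold_to_ascii(text):
    """Decompose compatibility characters, then drop anything still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Fullwidth "caf" followed by "e with acute accent"
print(fold_to_ascii("\uff43\uff41\uff46\u00e9"))  # cafe
```

Cross-script lookalikes are not handled this way: a Cyrillic "о" has no ASCII decomposition and is simply dropped, so mapping it to a Latin "o" would need a confusables table.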

Golang port of confusable_homoglyphs

@vhf

This library is great and I ported the library to golang here: https://github.com/skygeario/go-confusable-homoglyphs.

I have included you and the library in our readme and license. Please feel free to let me know if it looks good, or if there are any changes needed to be made.

Thanks!!

Possibly outdated confusables.json?

So I was looking at the confusables.json file, and I'm not sure if it's outdated. It was last committed on Sep 13, 2016, but the official source (https://unicode.org/Public/security/latest/confusables.txt) has a last updated date of: Date: 2017-04-08, 16:13:41 GMT

I vs. l vs. 1 vs. |

Case 1

I (I capital "eye") vs. l (l lowercase "ell") with just LATIN

Code:

from confusable_homoglyphs import confusables
confusables.is_confusable('I', preferred_aliases=['LATIN'], greedy=True)

Expected: at least l

Actual: False

Case 2

I (I capital "eye") vs. l (l lowercase "ell") vs. number 1 vs. pipe | with LATIN and COMMON

Code:

from confusable_homoglyphs import confusables
confusables.is_confusable('I', preferred_aliases=['LATIN', 'COMMON'], greedy=True)

Expected: at least l, |, 1

Actual: False

AttributeError: module 'configparser' has no attribute 'SafeConfigParser' for file versioneer.py

The file versioneer.py still uses SafeConfigParser and read_fp, which were required before Python 3.2, when read_file was introduced and the default behaviour of ConfigParser was changed to be strict.

https://docs.python.org/3/library/configparser.html#customizing-parser-behaviour
https://docs.python.org/3/library/configparser.html#configparser.ConfigParser.read_file

With Python 3.12, SafeConfigParser and read_fp are no longer available.

I found this by running python3.12 setup.py clean (it was being run in a CI system on the version from PyPI, and I could reproduce it on master). I don't know what else would be affected.

Simply replacing SafeConfigParser with ConfigParser and read_fp with read_file works and, from what I've read in the Python documentation, should not change the behaviour.
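The replacement calls can be tried in isolation. The configuration text below is a made-up sample, not the project's actual setup.cfg:

```python
import configparser
import io

# Made-up sample configuration for illustration.
SAMPLE_CFG = """\
[versioneer]
VCS = git
style = pep440
"""

parser = configparser.ConfigParser()       # instead of SafeConfigParser()
parser.read_file(io.StringIO(SAMPLE_CFG))  # instead of read_fp(...)
print(parser.get("versioneer", "VCS"))     # git
```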

Not all the functions are showing

Hi, I recently installed the package and am trying to use it in my project, but not all of the functions show up from the class when I try to use confusable_homoglyphs.