scottwernervt / favicon

Find a website's favicon.

License: MIT License

Python 100.00%

favicon favicons python

favicon's People

marzique alexjacobson95 monty5811 mayple remcovrinzen brmmm3 parth-gm xangis mortimer lencodigitexer awwitecki podboy

favicon's Issues

https://twitter.com returns a 400 Client Error: Bad Request for url: https://mobile.twitter.com/

Add request timeout

I'm scraping a large list of URLs, and for each website I use favicon to extract icons, which I then store in a database. A GET request is made only when a new website is added to the list:

icons = favicon.get("http://" + value)
return icons[0].url

The problem is that some slow websites can delay the page load for minutes, and all I want is a way to limit the time spent on each request.
For example:

if images not downloaded after 5 seconds:
    return None

Thanks.
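
One way to get the "give up after 5 seconds" behaviour without changing the library is to run the lookup in a worker thread and stop waiting once a deadline passes. The sketch below is only an illustration: first_icon_url and its fetch parameter are hypothetical names, and fetch stands in for favicon.get. If favicon.get forwards keyword arguments to requests (worth checking against the installed version), passing timeout=5 is simpler, but note that the requests timeout bounds the connect and read waits, not the total time.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def first_icon_url(fetch, url, seconds=5.0):
    """Return the URL of the first icon, or None if fetch(url) takes too long.

    fetch is any callable returning a list of objects with a .url attribute
    (for example favicon.get); this helper is hypothetical, not part of favicon.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fetch, url)
        try:
            icons = future.result(timeout=seconds)
        except FutureTimeout:
            # The worker thread keeps running in the background;
            # we simply stop waiting for it.
            return None
        return icons[0].url if icons else None
    finally:
        # Do not block here waiting for a slow request to finish.
        pool.shutdown(wait=False)
```

Usage would look like first_icon_url(favicon.get, "http://" + value). Exceptions raised by the lookup itself still propagate to the caller.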

Set Icon's format using Headers if there's no file extension

Occasionally I encounter favicons that don't have a file extension. e.g. https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32

Getting this results in a list of Icon objects like this, with an empty format:

Icon(
    url='https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32',
    width=16,
    height=16,
    format=''
)

In a situation like this could/should favicon use the response headers from requests to determine the format instead? For example, doing:

response = requests.get("https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32")

then response.headers includes:

'Content-Type': 'image/jpeg',
'Content-Disposition': 'inline; filename="bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49.jpeg"'

Perhaps fall back to using one of those to determine the likely file extension? At the moment, from outside favicon, it's impossible to get this data without manually using requests again myself.
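
A minimal sketch of that fallback, assuming the headers are available as a mapping: it reads Content-Type first and falls back to the filename in Content-Disposition. format_from_headers and the alias table are hypothetical and not part of favicon; with a plain dict the header names are case-sensitive, whereas response.headers from requests is case-insensitive.

```python
import os
import re

# Hypothetical normalisation table; extend as needed.
_ALIASES = {
    "jpeg": "jpg",
    "x-icon": "ico",
    "vnd.microsoft.icon": "ico",
    "svg+xml": "svg",
}


def format_from_headers(headers):
    """Guess an image format such as 'jpg' or 'ico' from response headers."""
    # 1. Content-Type, ignoring parameters such as "; charset=...".
    mime = headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
    if mime.startswith("image/"):
        subtype = mime[len("image/"):]
        return _ALIASES.get(subtype, subtype)

    # 2. File extension of the Content-Disposition filename, if any.
    match = re.search(r'filename="?([^";]+)"?', headers.get("Content-Disposition", ""))
    if match:
        ext = os.path.splitext(match.group(1))[1].lstrip(".").lower()
        return _ALIASES.get(ext, ext)

    return ""
```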

(Is this project still maintained?)

Error with URL https://www.commercecentric.com causes no valid icons to be returned

Though there are some valid icons, none are returned. Instead, this error is thrown:

Traceback (most recent call last):
  File "/src/logic/website_scraping.py", line 312, in getWebsiteScrapedDataForURL
    potentialIcons = favicon.get(
  File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 66, in get
    link_icons = tags(response.url, response.text)
  File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 142, in tags
    width, height = dimensions(tag)
  File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 176, in dimensions
    width, height = re.split(r'[x\xd7]', size[0])
ValueError: not enough values to unpack (expected 2, got 1)

size is an array with contents: ['32/32'].
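
A more tolerant parser would avoid the unpacking error. The sketch below is not the library's code; parse_size is a hypothetical helper that accepts any non-digit separator between two numbers and falls back to (0, 0) when no width and height pair can be found.

```python
import re


def parse_size(value):
    """Parse '32x32', '32/32' or '144×144' into (width, height).

    Returns (0, 0) for values such as 'x', 'any', '' or None.
    """
    match = re.search(r"(\d+)\D+(\d+)", value or "")
    if not match:
        return 0, 0
    return int(match.group(1)), int(match.group(2))
```

Because the pattern requires digits on both sides of the separator, a malformed attribute degrades to "size unknown" instead of raising.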

Get icon dimensions if size not provided in filename or size attribute

ico and apple-touch-icon-precomposed files have a width and height of 0. Investigate methods to partially download the image and parse the width and height from the data stream.
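
Both PNG and ICO store their dimensions near the start of the file, so the first few dozen bytes are enough. The sketch below only covers the parsing step and assumes the leading bytes have already been fetched (for example with a streamed or ranged request, which is left out here); sniff_size is a hypothetical name.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_size(data):
    """Return (width, height) from the leading bytes of a PNG or ICO file.

    Returns (0, 0) when the data is too short or the format is not recognised.
    """
    # PNG: 8-byte signature, 4-byte chunk length, b"IHDR", then
    # big-endian 32-bit width and height.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])

    # ICO: 6-byte header (reserved=0, type=1, image count), followed by
    # 16-byte directory entries whose first two bytes are width and height.
    if len(data) >= 6:
        reserved, kind, count = struct.unpack("<HHH", data[:6])
        if reserved == 0 and kind == 1 and count > 0:
            best = (0, 0)
            for i in range(count):
                entry = data[6 + 16 * i: 8 + 16 * i]
                if len(entry) < 2:
                    break  # directory truncated; use what we have
                # A stored value of 0 means 256 pixels.
                width, height = (b or 256 for b in entry)
                # Tuple comparison: widest entry wins, height breaks ties.
                best = max(best, (width, height))
            return best

    return (0, 0)
```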

References

Fails to find fav icon for microsoft.com

Fails due to case sensitivity when finding all links with rel and href attributes.

https://www.microsoft.com/en-us/

<link rel="SHORTCUT ICON" href="https://c.s-microsoft.com/favicon.ico?v2" type="image/x-icon">
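
To illustrate the case-insensitive match, here is a standard-library sketch rather than the library's own BeautifulSoup-based code. IconLinkParser is a hypothetical class; it lowercases the rel value before checking for the icon keyword.

```python
from html.parser import HTMLParser


class IconLinkParser(HTMLParser):
    """Collect href values of <link> tags whose rel contains 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "icon" in rel_tokens and href:
            self.hrefs.append(href)


parser = IconLinkParser()
parser.feed('<link rel="SHORTCUT ICON" href="https://c.s-microsoft.com/favicon.ico?v2" type="image/x-icon">')
```

After feeding the Microsoft tag above, parser.hrefs contains the favicon URL even though rel is upper case.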

max-redirect limit should be passed as parameter

To set max_redirects we have to create a requests.Session() object.

import requests

session = requests.Session()
session.max_redirects = 3  # requests defaults to 30
response = session.get(url)

Reference: https://stackoverflow.com/questions/31552627/python-requests-limit-number-of-redirects-followed/31553146

Use given website for icon url scheme

favicon assumes the https scheme when building an icon URL from a scheme-relative link such as <link href="//static.openr.co/main/favicons/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon">. Downloading that icon with requests then fails with requests.exceptions.SSLError. Instead of defaulting to https, we should take the scheme from the website URL that was passed in.
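
The standard library already resolves scheme-relative references this way. A small sketch, where resolve_icon_url is a hypothetical helper and http://example.com/page is a made-up page URL:

```python
from urllib.parse import urljoin


def resolve_icon_url(page_url, href):
    """Resolve an icon href against the page URL, inheriting its scheme."""
    return urljoin(page_url, href)


href = "//static.openr.co/main/favicons/favicon.ico"
over_http = resolve_icon_url("http://example.com/page", href)
over_https = resolve_icon_url("https://example.com/page", href)
```

Here over_http starts with http:// and over_https with https://, because a reference beginning with // keeps the scheme of the base URL.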

Async support

https://compiletoi.net/fast-scraping-in-python-with-asyncio/

Add extra icon discovery methods

Add new icon discovery methods:

manifest.json and browserconfig.xml
Microsoft <meta name='msapplication-TileImage' content='icon.png'> (#15)
base64 icons data:image/png;base64,...
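
As an illustration of the manifest.json case, the sketch below reads the icons member of a web app manifest and resolves each src against the manifest URL. manifest_icons is a hypothetical function, and the example manifest and URL are made up; fetching the manifest is left out.

```python
import json
from urllib.parse import urljoin


def manifest_icons(manifest_text, manifest_url):
    """Return icon entries from a web app manifest as plain dicts."""
    manifest = json.loads(manifest_text)
    found = []
    for entry in manifest.get("icons", []):
        src = entry.get("src")
        if not src:
            continue  # skip entries without a usable source
        found.append({
            # src is resolved relative to the manifest, not the page.
            "url": urljoin(manifest_url, src),
            "sizes": entry.get("sizes", ""),
            "type": entry.get("type", ""),
        })
    return found


example = '{"icons": [{"src": "/img/icon-192.png", "sizes": "192x192", "type": "image/png"}]}'
icons = manifest_icons(example, "https://example.com/static/manifest.json")
```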

Pass html string as param

Hi. Thanks for your module.
It would be great if the get method could accept not only a URL but also a preloaded HTML string as an optional parameter.

requests.exceptions.SSLError

requests.exceptions.SSLError: HTTPSConnectionPool(host='monksoftware.it', port=443): Max retries exceeded with url: / (Caused by SSLError(CertificateError("hostname 'monksoftware.it' doesn't match 'www.monksoftware.it'",),))

Add option to ignore SSL certificate mismatch, see http://docs.python-requests.org/en/master/user/advanced/?highlight=verify#ssl-cert-verification

ValueError: not enough values to unpack (expected 2, got 1) for nytimes.com

https://www.nytimes.com/ uses the multiplication sign (×) in its sizes attribute (144×144), which raises a ValueError:

    width, height = size[0].split('x')
ValueError: not enough values to unpack (expected 2, got 1)

<link rel="apple-touch-icon-precomposed" sizes="144×144" href="/vi-assets/static-assets/ios-ipad-144x144-319373aaf4524d94d38aa599c56b8655.png">

max-redirect limit cannot be passed as a parameter - exceeding the default redirect limit (30) takes too long

To set max_redirects we have to create a requests.Session() object.

import requests

session = requests.Session()
session.max_redirects = 3  # requests defaults to 30
response = session.get(url)

Reference: https://stackoverflow.com/questions/31552627/python-requests-limit-number-of-redirects-followed/31553146

Use src/ directory

Put the package in src/favicon, as recommended in these articles:

Incorrect location of favicon.ico

The library currently tries to find the default favicon.ico file in the wrong place.

According to https://html.spec.whatwg.org/multipage/links.html#rel-icon:

"In the absence of a link with the icon keyword . . . Let request be a new request whose url is the URL record obtained by resolving the URL "/favicon.ico" against the Document object's URL".

In other words, favicon.ico is expected at the site root.

Example:
Original URL: https://github.com/scottwernervt/favicon/
Correct Favicon URL: github.com/favicon.ico
Incorrect Favicon URL: github.com/scottwernervt/favicon/favicon.ico

Right now the library requests the incorrect URL.
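
Resolving "/favicon.ico" against the document URL, as the spec describes, can be sketched with urljoin. default_favicon_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def default_favicon_url(page_url):
    """Resolve "/favicon.ico" against the page URL (lands at the site root)."""
    return urljoin(page_url, "/favicon.ico")


root_icon = default_favicon_url("https://github.com/scottwernervt/favicon/")
```

The leading slash makes the reference root-relative, so any path in the page URL is discarded and root_icon is https://github.com/favicon.ico.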

Handle poor html values in links

The favicon for http://www.iposcoop.com/ has a tab (\t) at the end of the filename. We should handle poor HTML formatting by performing strip() on the filename.

Icon(url='https://www.iposcoop.com/wp-content/uploads/2014/02/favicon.ico\t', width=0, height=0, format='ico\t'),
Icon(url='https://www.iposcoop.com/favicon.ico', width=0, height=0, format='ico'),
Icon(url='https://www.iposcoop.com/wp-content/themes/flatsome/apple-touch-icon-precomposed.png', width=0, height=0, format='png')

ValueError when making an int from empty width/height

I don't know which site/favicon my code was trying to fetch when the final line in favicon.py generated:

File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 66, in get
link_icons = tags(response.url, response.text)
File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 142, in tags
width, height = dimensions(tag)
File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 188, in dimensions
return int(width), int(height)
ValueError: invalid literal for int() with base 10: ''

But I've replicated the error for my tests with an HTML page that has an element like:

<link rel="icon" type="image/jpeg" sizes="x" href="/favicon.jpg" />

That sizes attribute results in the code trying to make a width/height from "" and generating the ValueError.

'NoneType' object has no attribute 'strip'

Hi, thanks for this awesome egg :)

I think I found a bug that occurs when a website contains this tag: <link href="" rel="shortcut icon"/>.

This causes tag.get('href') or tag.get('content') in favicon.py to return None instead of a string.

You may want to exclude tags with an empty href/content from the ones that get parsed 😊

Thank you in advance for your time and I wish you a great week :)
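
A defensive version of that lookup could look like the sketch below. It is an assumption about how such a guard might be written, not the library's code: icon_href is a hypothetical helper that takes the tag's attributes as a mapping, treats empty or missing values as "no link", and strips stray whitespace such as a trailing tab.

```python
def icon_href(attrs):
    """Return a cleaned href/content value, or None when nothing usable exists."""
    raw = attrs.get("href") or attrs.get("content") or ""
    cleaned = raw.strip()  # drops tabs, newlines and spaces around the value
    return cleaned or None
```

A caller can then skip the tag whenever icon_href returns None instead of calling string methods on a missing value.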

AttributeError: 'NoneType' object has no attribute 'lower'

The following tag <meta content="en-US" data-rh="true" itemprop="inLanguage"/> causes an exception because it does not have a name or property attribute.

Traceback (most recent call last):
  File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/opt/pycharm-professional/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/swerner/development/projects/favicon/debug.py", line 3, in <module>
    fav_icons = favicon.get('https://www.nytimes.com/')
  File "/home/swerner/development/projects/favicon/src/favicon/favicon.py", line 69, in get
    link_icons = tags(response.url, response.text)
  File "/home/swerner/development/projects/favicon/src/favicon/favicon.py", line 125, in tags
    if meta_type.lower() == name.lower():
AttributeError: 'NoneType' object has no attribute 'lower'
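
One way to guard the comparison is to substitute an empty string when neither attribute is present. In the sketch below, meta_matches is a hypothetical helper and the attribute lookup is an assumption about how meta_type is derived; only the .lower() comparison comes from the traceback.

```python
def meta_matches(attrs, wanted):
    """Case-insensitively compare a meta tag's name/property with wanted."""
    # Fall back to "" so .lower() is never called on None.
    meta_type = attrs.get("name") or attrs.get("property") or ""
    return meta_type.lower() == wanted.lower()
```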

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
  
  The code that caused this warning is on line 8 of the file /home/swerner/development/projects/favicon/tests/test_favicon.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
  
    s = BeautifulSoup('')

-- Docs: https://docs.pytest.org/en/latest/warnings.html

Add option to download largest sized icon

Provide an argument to download the largest sized icon to a user supplied path and filename.
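
The selection step could be sketched as follows. This is a hypothetical illustration, not a proposed API: the Icon tuple mirrors the fields shown in the issues above, and pick_largest and target_filename are made-up names. The download itself (for example a streamed requests.get written to the chosen path) is left out.

```python
from collections import namedtuple

# Stand-in for favicon's Icon objects, with the same four fields.
Icon = namedtuple("Icon", ["url", "width", "height", "format"])


def pick_largest(icons):
    """Return the icon with the largest pixel area, or None for an empty list."""
    if not icons:
        return None
    return max(icons, key=lambda icon: icon.width * icon.height)


def target_filename(icon, stem="favicon"):
    """Build a filename from a user-supplied stem and the icon's format."""
    return f"{stem}.{icon.format}" if icon.format else stem
```

Icons whose size is unknown have a width and height of 0, so they only win when nothing larger is available.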