scottwernervt / favicon
Find a website's favicon.
License: MIT License
I'm scraping a big list of URLs, and for each website I use favicon to extract images that I then store in a database. A GET request is performed only when a new website is added to the list:
icons = favicon.get("http://" + value)
return icons[0].url
The problem is that some slow websites may delay the page load for minutes, and all I want is to limit the time of each request.
For example:
if images not downloaded after 5 seconds:
return None
Thanks.
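One way to get the behaviour asked for above is to run the lookup in a worker thread and stop waiting after a deadline. This is only a sketch: `first_icon_url` and its `fetch_icons` parameter are hypothetical names, and `fetch_icons` stands for any callable that takes a URL and returns objects with a `.url` attribute (such as `favicon.get`). If your version of favicon forwards keyword arguments to requests, passing `timeout=` may be simpler, but note that the requests timeout limits each socket wait, not the total time.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def first_icon_url(fetch_icons, url, deadline=5.0):
    """Return the first icon URL, or None if nothing arrives within `deadline` seconds.

    `fetch_icons` is a hypothetical stand-in for a function like favicon.get.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fetch_icons, url)
    try:
        icons = future.result(timeout=deadline)
    except FutureTimeout:
        return None  # stopped waiting; the worker thread finishes on its own
    except Exception:
        return None  # network or parsing error raised by fetch_icons
    finally:
        executor.shutdown(wait=False)  # do not block on a slow worker
    return icons[0].url if icons else None
```

The worker thread is abandoned rather than killed, so a very slow site still holds a connection open until it finishes; the caller just no longer waits for it.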
Occasionally I encounter favicons that don't have a file extension. e.g. https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32
Getting this results in a list of Icon objects like this, with an empty format:
Icon(
url='https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32',
width=16,
height=16,
format=''
)
In a situation like this could/should favicon use the response headers from requests to determine the format instead? For example, doing:
response = requests.get("https://secure.gravatar.com/blavatar/bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49?s=32")
then response.headers includes:
'Content-Type': 'image/jpeg',
'Content-Disposition': 'inline; filename="bd4bda4207561b6998f10dec44b570f04ff4072b20f89162d525b186dfca3e49.jpeg"'
Perhaps fall back to using one of those to determine the likely file extension? At the moment, from outside favicon, it's impossible to get this data without manually using requests again myself.
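A minimal sketch of the fallback suggested above, assuming the caller already has the response headers as a dict-like object. `format_from_headers` and the `CONTENT_TYPE_FORMATS` table are hypothetical names, not part of favicon, and the table is deliberately small.

```python
import os
import re

# Small explicit map (an assumption for this sketch); extend as needed.
CONTENT_TYPE_FORMATS = {
    'image/jpeg': 'jpg',
    'image/png': 'png',
    'image/gif': 'gif',
    'image/svg+xml': 'svg',
    'image/x-icon': 'ico',
    'image/vnd.microsoft.icon': 'ico',
    'image/webp': 'webp',
}


def format_from_headers(headers):
    """Guess an icon format from response headers; return '' if unknown."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}

    # Try Content-Type first, ignoring parameters such as "; charset=...".
    content_type = lowered.get('content-type', '').split(';')[0].strip().lower()
    if content_type in CONTENT_TYPE_FORMATS:
        return CONTENT_TYPE_FORMATS[content_type]

    # Otherwise fall back to the filename extension in Content-Disposition.
    match = re.search(r'filename="?([^";]+)"?', lowered.get('content-disposition', ''))
    if match:
        return os.path.splitext(match.group(1))[1].lstrip('.').lower()
    return ''
```

With the Gravatar headers quoted above, the Content-Type lookup would give `jpg`; if only the Content-Disposition header were present, the filename extension `jpeg` would be used instead.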
(Is this project still maintained?)
Though there are some valid icons, none are returned. Instead, this error is thrown:
Traceback (most recent call last):
File "/src/logic/website_scraping.py", line 312, in getWebsiteScrapedDataForURL
potentialIcons = favicon.get(
File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 66, in get
link_icons = tags(response.url, response.text)
File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 142, in tags
width, height = dimensions(tag)
File "/.venv/lib/python3.8/site-packages/favicon/favicon.py", line 176, in dimensions
width, height = re.split(r'[x\xd7]', size[0])
ValueError: not enough values to unpack (expected 2, got 1)
size is an array with contents: ['32/32'].
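A more tolerant parser could avoid this class of error by extracting the numbers instead of splitting on a fixed separator. The sketch below is one possible approach, not the library's code; `parse_sizes` is a hypothetical name, and falling back to (0, 0) for unparseable values is an assumption that matches how unknown dimensions appear elsewhere in these reports.

```python
import re


def parse_sizes(sizes):
    """Return (width, height) from a `sizes` attribute value, or (0, 0)."""
    # Use the first token ("16x16 32x32" -> "16x16") and pull out its numbers,
    # so any separator (x, X, the multiplication sign, /) is accepted.
    tokens = (sizes or '').split()
    numbers = re.findall(r'\d+', tokens[0]) if tokens else []
    if len(numbers) >= 2:
        return int(numbers[0]), int(numbers[1])
    return 0, 0  # "any", "x", empty, or otherwise malformed


print(parse_sizes('32/32'))
```

Values with no usable numbers, such as `any` or a bare `x`, fall through to (0, 0) rather than raising.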
ico and apple-touch-icon-precomposed files have a width and height of 0. Investigate methods to partially download the image and parse the width and height from the data stream.
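As a rough illustration of the idea, both PNG and ICO store their dimensions in the first few bytes, so only the start of the file is needed. The sketch below covers the parsing step only; fetching the leading bytes (for example by streaming the response and stopping early) is left out, and `png_size` and `ico_size` are hypothetical helper names.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_size(data):
    """Read (width, height) from the first 24 bytes of a PNG, or None."""
    # Layout: 8-byte signature, 4-byte chunk length, b'IHDR', then
    # big-endian 32-bit width and height.
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b'IHDR':
        return struct.unpack('>II', data[16:24])
    return None


def ico_size(data):
    """Read (width, height) of the first image in an ICO file, or None."""
    if len(data) < 8:
        return None
    # ICONDIR header: reserved (0), type (1 = icon), image count.
    reserved, image_type, count = struct.unpack('<HHH', data[:6])
    if reserved != 0 or image_type != 1 or count < 1:
        return None
    # First directory entry starts at byte 6: one byte each for width, height.
    width, height = data[6], data[7]
    return width or 256, height or 256  # a stored 0 means 256 pixels
```

An ICO file can hold several images, so a fuller version would walk all `count` directory entries and report the largest.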
Fails due to case sensitivity when finding all links with rel and href attributes.
https://www.microsoft.com/en-us/
<link rel="SHORTCUT ICON" href="https://c.s-microsoft.com/favicon.ico?v2" type="image/x-icon">
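To illustrate a case-insensitive match, here is a sketch that lowercases the rel value before comparing. It uses the standard library's `html.parser` rather than the parser favicon itself uses, purely to stay self-contained; `IconLinkParser` and `find_icon_links` are hypothetical names.

```python
from html.parser import HTMLParser


class IconLinkParser(HTMLParser):
    """Collect href values of <link> tags whose rel contains the token "icon"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values,
        # so the rel value has to be lowercased explicitly.
        if tag != 'link':
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get('rel') or '').lower().split()
        href = (attributes.get('href') or '').strip()
        if 'icon' in rel_tokens and href:
            self.hrefs.append(href)


def find_icon_links(html):
    """Return the icon hrefs found in an HTML string."""
    parser = IconLinkParser()
    parser.feed(html)
    return parser.hrefs
```

This sketch only matches the `icon` token; other rel values such as apple-touch-icon would need to be added to the comparison.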
To set max_redirects we have to create a requests.Session() object.
session = requests.Session()
session.max_redirects = 3
session.get(url)
favicon assumes the scheme https when creating an icon URL from <link href="//static.openr.co/main/favicons/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon">. When trying to download the icon using requests, it fails with requests.exceptions.SSLError. Instead of defaulting to https we should parse the scheme from the passed website.
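The standard library already resolves scheme-relative references this way: `urllib.parse.urljoin` reuses the scheme of the base URL. A small sketch, where `resolve_icon_url` is a hypothetical helper and the page URL in the usage line is an assumed example, not taken from the report:

```python
from urllib.parse import urljoin


def resolve_icon_url(page_url, href):
    """Resolve an icon href against the page URL, inheriting its scheme."""
    return urljoin(page_url, href.strip())


# Assumed page URL for illustration; the href is the one from the report.
print(resolve_icon_url('http://openr.co/',
                       '//static.openr.co/main/favicons/favicon.ico'))
```

Because the base URL here uses http, the resolved icon URL uses http as well, which is the behaviour the report asks for.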
Add new icon discovery methods:
manifest.json and browserconfig.xml
<meta name='msapplication-TileImage' content='icon.png'> (#15)
data:image/png;base64,...
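For the manifest.json case, discovery could look roughly like the sketch below. It assumes the manifest text has already been downloaded and follows the common web app manifest layout (an "icons" list whose entries have "src" and "sizes"); `icons_from_manifest` is a hypothetical name and the tuple return shape is an assumption made for brevity.

```python
import json
from urllib.parse import urljoin


def icons_from_manifest(manifest_text, manifest_url):
    """Return (url, width, height) tuples from a manifest's "icons" list."""
    manifest = json.loads(manifest_text)
    found = []
    for entry in manifest.get('icons', []):
        src = (entry.get('src') or '').strip()
        if not src:
            continue
        # "sizes" may hold several values; use the first WxH pair, else 0x0.
        width = height = 0
        for token in (entry.get('sizes') or '').lower().split():
            parts = token.split('x')
            if len(parts) == 2 and parts[0].isdecimal() and parts[1].isdecimal():
                width, height = int(parts[0]), int(parts[1])
                break
        # Icon paths are relative to the manifest, not to the page.
        found.append((urljoin(manifest_url, src), width, height))
    return found
```

Resolving `src` against the manifest URL rather than the page URL matters when the manifest lives in a subdirectory or on another host.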
Hi. Thanks for your module.
It would be great if the get method could accept not only a URL but also a preloaded HTML string as an optional parameter.
requests.exceptions.SSLError: HTTPSConnectionPool(host='monksoftware.it', port=443): Max retries exceeded with url: / (Caused by SSLError(CertificateError("hostname 'monksoftware.it' doesn't match 'www.monksoftware.it'",),))
Add option to ignore SSL certificate mismatch, see http://docs.python-requests.org/en/master/user/advanced/?highlight=verify#ssl-cert-verification
https://www.nytimes.com/ uses the multiplication symbol for sizes (144×144), which causes a ValueError exception:
width, height = size[0].split('x')
ValueError: not enough values to unpack (expected 2, got 1)
<link rel="apple-touch-icon-precomposed" sizes="144×144" href="/vi-assets/static-assets/ios-ipad-144x144-319373aaf4524d94d38aa599c56b8655.png">
Put the package in src/favicon, as recommended in these articles:
The library currently tries to find the default favicon.ico file in the wrong place.
According to https://html.spec.whatwg.org/multipage/links.html#rel-icon:
"In the absence of a link with the icon keyword . . . Let request be a new request whose url is the URL record obtained by resolving the URL "/favicon.ico" against the Document object's URL".
In other words the favicon.ico is stored at the site root.
Example:
Original URL: https://github.com/scottwernervt/favicon/
Correct Favicon URL: github.com/favicon.ico
Incorrect Favicon URL: github.com/scottwernervt/favicon/favicon.ico
Right now the library searches the incorrect url.
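Resolving the absolute path "/favicon.ico" against the page URL, as the spec text quoted above describes, can be sketched with `urljoin`. `default_favicon_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def default_favicon_url(page_url):
    """Return the site-root favicon URL for any page on that site."""
    # A leading slash makes urljoin discard the page's own path.
    return urljoin(page_url, '/favicon.ico')


print(default_favicon_url('https://github.com/scottwernervt/favicon/'))
```

For the example above this prints https://github.com/favicon.ico, regardless of how deep the original page path is.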
The favicon for http://www.iposcoop.com/ has a tab (\t) in the filename. We should handle poor HTML formatting by performing strip() on the filename.
Icon(url='https://www.iposcoop.com/wp-content/uploads/2014/02/favicon.ico\t', width=0, height=0, format='ico\t'),
Icon(url='https://www.iposcoop.com/favicon.ico', width=0, height=0, format='ico'),
Icon(url='https://www.iposcoop.com/wp-content/themes/flatsome/apple-touch-icon-precomposed.png', width=0, height=0, format='png')
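A sketch of the cleanup, stripping the URL before the format is derived from it. `clean_icon_url` is a hypothetical name; taking the extension from the URL path (so that query strings are ignored) is an assumption about how the format might be computed.

```python
import os
from urllib.parse import urlparse


def clean_icon_url(raw_url):
    """Return (url, format) with stray whitespace removed from both."""
    url = raw_url.strip()
    # Use only the path component so "?v2"-style query strings are ignored.
    extension = os.path.splitext(urlparse(url).path)[1]
    return url, extension.lstrip('.').lower()


print(clean_icon_url('https://www.iposcoop.com/wp-content/uploads/2014/02/favicon.ico\t'))
```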
I don't know which site/favicon my code was trying to fetch when the final line in favicon.py raised this error:
File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 66, in get
link_icons = tags(response.url, response.text)
File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 142, in tags
width, height = dimensions(tag)
File "/webapps/oohdir/code/venv/lib/python3.10/site-packages/favicon/favicon.py", line 188, in dimensions
return int(width), int(height)
ValueError: invalid literal for int() with base 10: ''
But I've replicated the error for my tests with an HTML page that has an element like:
<link rel="icon" type="image/jpeg" sizes="x" href="/favicon.jpg" />
That sizes attribute results in the code trying to make a width/height from "" and generating the ValueError.
Hi, thanks for this awesome egg :)
I think that I found a bug when a website contains this tag: <link href="" rel="shortcut icon"/>.
This causes tag.get('href') or tag.get('content') at favicon.py to return None instead of a string.
You may want to exclude tags with an empty href/content from the ones being parsed 😊
Thank you in advance for your time and I wish you a great week :)
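One possible guard is to drop tags whose target is empty before any further processing. In this sketch plain dicts stand in for parsed tags (both support `.get`), and `icon_targets` is a hypothetical helper.

```python
def icon_targets(tags):
    """Return the non-empty href/content values from a list of tag attributes."""
    targets = []
    for attrs in tags:
        # `'' or None` evaluates to None, so normalise to a string first.
        value = (attrs.get('href') or attrs.get('content') or '').strip()
        if value:
            targets.append(value)
    return targets


print(icon_targets([{'href': '', 'rel': 'shortcut icon'}, {'href': '/favicon.ico'}]))
```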
The following tag <meta content="en-US" data-rh="true" itemprop="inLanguage"/> causes an exception because it does not have a name or property attribute.
Traceback (most recent call last):
File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1664, in <module>
main()
File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/opt/pycharm-professional/helpers/pydev/pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/opt/pycharm-professional/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/swerner/development/projects/favicon/debug.py", line 3, in <module>
fav_icons = favicon.get('https://www.nytimes.com/')
File "/home/swerner/development/projects/favicon/src/favicon/favicon.py", line 69, in get
link_icons = tags(response.url, response.text)
File "/home/swerner/development/projects/favicon/src/favicon/favicon.py", line 125, in tags
if meta_type.lower() == name.lower():
AttributeError: 'NoneType' object has no attribute 'lower'
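The comparison in the traceback could be made safe by substituting an empty string when neither attribute is present. A sketch with hypothetical names (`meta_matches`), using a plain dict in place of the parsed tag:

```python
def meta_matches(attrs, wanted):
    """Return True if the tag's name or property equals `wanted`, ignoring case."""
    # Tags with neither attribute yield '' instead of None, so .lower() is safe.
    meta_type = attrs.get('name') or attrs.get('property') or ''
    return meta_type.lower() == wanted.lower()


tag = {'content': 'en-US', 'data-rh': 'true', 'itemprop': 'inLanguage'}
print(meta_matches(tag, 'msapplication-TileImage'))
```

The tag from the report now simply fails to match instead of raising AttributeError.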
The function call runs endlessly and gets stuck for the following URLs
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 8 of the file /home/swerner/development/projects/favicon/tests/test_favicon.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
s = BeautifulSoup('')
-- Docs: https://docs.pytest.org/en/latest/warnings.html
Provide an argument to download the largest sized icon to a user supplied path and filename.
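A rough sketch of how this could work, split into choosing the largest icon and writing already-downloaded bytes to disk; the download itself is left out. The `Icon` tuple mirrors the fields shown in the Icon examples in the reports above, while `largest_icon` and `save_icon` are hypothetical names and not part of favicon.

```python
import os
from collections import namedtuple

# Mirrors the fields shown in the Icon examples in the reports above.
Icon = namedtuple('Icon', ['url', 'width', 'height', 'format'])


def largest_icon(icons):
    """Return the icon with the greatest area, or None for an empty list."""
    return max(icons, key=lambda icon: icon.width * icon.height, default=None)


def save_icon(icon, data, directory, name='favicon'):
    """Write downloaded bytes to <directory>/<name>.<format>; return the path."""
    # Icons without a known format are saved without an extension.
    filename = '{}.{}'.format(name, icon.format) if icon.format else name
    path = os.path.join(directory, filename)
    with open(path, 'wb') as handle:
        handle.write(data)
    return path
```

Icons whose dimensions are unknown (0 by 0) sort last here, so a real implementation might want to inspect the image data before deciding.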