amol9 / imagebot
A web bot to crawl websites and scrape images.
License: MIT License
Hello!
I am behind a firewall, so "python ./setup.py install" failed and I had to install every missing Python package one by one. Is there a way to specify proxy information?
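One possible workaround, sketched here under assumptions: pip accepts a --proxy option, and most Python download tools honour the HTTP_PROXY and HTTPS_PROXY environment variables. The helper names and the proxy address below are hypothetical and are not part of imagebot.

```python
import os
import sys


def pip_install_command(proxy_url, target="."):
    """Build a pip command that installs `target` through `proxy_url`."""
    # `--proxy` is a standard pip option; `target` defaults to the current checkout.
    return [sys.executable, "-m", "pip", "install", "--proxy", proxy_url, target]


def proxy_environment(proxy_url, base_env=None):
    """Return a copy of the environment with the common proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    return env
```

The command list and the environment can be passed to subprocess.run(cmd, env=env) from the imagebot checkout, with a real proxy address in place of a placeholder such as http://proxy.example.com:8080.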
Every time the script is launched it raises a SyntaxError. I checked on multiple devices with different operating systems and different Python versions (3.8, 3.9, 3.10).
Here is the traceback from a virtualenv on 3.8:
Traceback (most recent call last):
  File "C:\Users\xstre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xstre\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\xstre\python3.8\Scripts\imagebot.exe\__main__.py", line 4, in <module>
  File "C:\Users\xstre\python3.8\lib\site-packages\imagebot\main.py", line 11, in <module>
    from imagebot.clear import clear_cache, clear_db, clear_duplicate_images
  File "C:\Users\xstre\python3.8\lib\site-packages\imagebot\clear.py", line 28
    print ose.message
    ^^^^^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
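The traceback points at a Python 2 print statement in clear.py. A minimal sketch of how such a line could be written so that it also runs on Python 3 is shown below. The surrounding function is hypothetical and is not the actual code in imagebot; only the `print ose.message` line comes from the traceback. Exceptions in Python 3 have no `.message` attribute, so the sketch prints the exception itself.

```python
import os


def remove_file(path):
    """Delete `path`, reporting an OSError instead of raising it.

    Hypothetical helper used only to illustrate the print() fix.
    """
    try:
        os.remove(path)
    except OSError as ose:
        # Python 2 form was: print ose.message
        print(ose)
        return False
    return True
```

Running the 2to3 tool over the package would convert the print statements automatically, but attribute accesses such as `.message` still need a manual change.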
D:\>imagebot crawl http://www.bbc.com -m -s 640x480 -l info -is "E:\Photos\BBC"
2017-06-04 19:32:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: imagebot)
2017-06-04 19:32:30 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'imagebot.spiders', 'LOG_LEVEL': 20, 'SPIDER_MODULES': ['imagebot.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'imagebot', 'USER_AGENT': 'imagebot', 'HTTPCACHE_DIR': 'C:\Users\CP\.imagebot\httpcache', 'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy'}
2017-06-04 19:32:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-06-04 19:32:30 [root] ERROR: No module named gi
Unhandled error in Deferred:
2017-06-04 19:32:30 [twisted] CRITICAL: Unhandled error in Deferred:
2017-06-04 19:32:30 [twisted] CRITICAL:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "c:\python27\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "c:\python27\lib\site-packages\scrapy-1.4.0-py2.7.egg\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "c:\python27\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 37, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "c:\python27\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "c:\python27\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ImportError: DLL load failed: %1 is not a valid Win32 application.
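This kind of ImportError on `import win32api` is commonly reported when the installed pywin32 build does not match the bitness of the Python interpreter (for example a 32-bit pywin32 under 64-bit Python). That is an assumption about this report; the log itself does not confirm it. A small sketch for checking which kind of interpreter is running, using only the standard library (the function names are illustrative):

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


def describe_interpreter():
    """Return a one-line summary suitable for pasting into a bug report."""
    return "Python {0} ({1}-bit) on {2}".format(
        platform.python_version(), interpreter_bits(), sys.platform
    )
```

If the reported bitness differs from the pywin32 installer that was used, reinstalling the matching pywin32 build is the usual next step. The separate "No module named gi" error earlier in the log is a different missing dependency and is not addressed by this check.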
Hello, you have created great software.
It works well, but when I try to crawl a website whose domain starts with a number, such as 17173.com, it does not work.
I found that the problem is in the submodule "web": the regex in the class AbsUrl does not match such domains. Could you change it? Thanks.
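To illustrate the kind of change being requested, here is a hedged sketch of a URL pattern whose host labels may begin with a digit. It is not the regex from imagebot's AbsUrl class, which is not quoted in this report; the pattern, the group names, and the helper function are all assumptions made for the example.

```python
import re

# Hypothetical pattern, NOT the one used by imagebot's AbsUrl class.
# Each host label may start and end with a letter or a digit, so
# "17173.com" is accepted, while a label starting with "-" is rejected.
ABS_URL = re.compile(
    r"^(?P<scheme>https?)://"
    r"(?P<host>[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
    r"(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?)+)"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/\S*)?$"
)


def parse_abs_url(url):
    """Return the named parts of an absolute URL, or None if it does not match."""
    match = ABS_URL.match(url)
    return match.groupdict() if match else None
```

For example, parse_abs_url("http://17173.com") yields a host of "17173.com". The sketch deliberately ignores cases such as IPv6 literals and query strings without a path.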
It looks like the library requires the Python Imaging Library (PIL) to work properly, but it is not listed in the project dependencies.
Could you tell me which version is the right one to use?
I'm having the following issue with pillow ^7.1.2 (possibly unrelated?):
Traceback (most recent call last):
  File "d:\projects\.virtualenvs\imagescraper-uzbe7zgj-py3.8\lib\site-packages\scrapy\pipelines\files.py", line 481, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "d:\projects\.virtualenvs\imagescraper-uzbe7zgj-py3.8\lib\site-packages\scrapy\pipelines\images.py", line 106, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "d:\projects\.virtualenvs\imagescraper-uzbe7zgj-py3.8\lib\site-packages\scrapy\pipelines\images.py", line 110, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "d:\projects\.virtualenvs\imagescraper-uzbe7zgj-py3.8\lib\site-packages\scrapy\pipelines\images.py", line 123, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "d:\projects\.virtualenvs\imagescraper-uzbe7zgj-py3.8\lib\site-packages\PIL\Image.py", line 2895, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x0000017497F73A90>
Windows, Py 3.8, installed from master at 8738efc
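Pillow raises UnidentifiedImageError when the downloaded bytes are not in a format it recognises, which can happen when a server answers an image URL with an HTML page or an unsupported format. Whether that is the cause here is an assumption. The sketch below shows one way to inspect the first bytes of a response body before handing it to an image library. It uses only the standard library, covers just three formats, and its names are hypothetical rather than part of imagebot or Scrapy.

```python
# Well-known file signatures (magic bytes) for three common formats.
IMAGE_SIGNATURES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_type(data):
    """Return "png", "jpeg" or "gif" if `data` starts with a known signature, else None."""
    for name, prefixes in IMAGE_SIGNATURES.items():
        # bytes.startswith accepts a tuple of candidate prefixes.
        if data.startswith(prefixes):
            return name
    return None
```

A body for which sniff_image_type returns None could be logged together with its URL and Content-Type header, which would help confirm whether the failure is caused by the response rather than by the Pillow version.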