
Comments (8)

eivaltsec commented on August 31, 2024

(screenshot of the error traceback)
It looks like there are two errors. Environment: Ubuntu 14, Python 3.4.


taogeT commented on August 31, 2024

@ROCHOU Please post your config file rather than a screenshot, and wrap it with insert code.

It looks like a problem with SQLAlchemy and the database connection. Were the tables created properly? Are all the packages from the requirements installed? pip freeze will show which packages are in use.

Start with SQLAlchemy.
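
One quick way to isolate this (a minimal sketch, not from the thread; the URI and credentials are placeholders for the SQLALCHEMY_DATABASE_URI in settings.py) is to try connecting with SQLAlchemy directly, outside of Scrapy:

# connect_check.py -- run with the same Python interpreter that runs the spider
from sqlalchemy import create_engine

# Placeholder URI: substitute the SQLALCHEMY_DATABASE_URI from settings.py
DATABASE_URI = 'mysql+pymysql://user:password@localhost/dbname'

try:
    engine = create_engine(DATABASE_URI)   # raises ImportError if the DBAPI driver is missing
    result = engine.execute('SELECT 1')    # raises OperationalError if the database is unreachable
    print('connection OK:', result.scalar())
except Exception as exc:
    print('connection check failed:', exc)

If the failure already shows up here, the problem is in the database setup or the driver, not in the spider code.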


eivaltsec commented on August 31, 2024

Output of pip freeze:

  • Flask==0.11.1
  • Flask-Bootstrap==3.3.7.0
  • Flask-Celery-py3==0.2.4
  • Flask-Cors==3.0.2
  • Flask-Login==0.3.2
  • Flask-Migrate==2.0.0
  • Flask-OAuthlib==0.9.3
  • Flask-RESTful==0.3.5
  • Flask-SQLAlchemy==2.1
  • Flask-Script==2.0.5
  • Flask-Vue==0.3.4
  • Flask-WTF==0.13.1
  • Jinja2==2.8
  • Mako==1.0.4
  • Markdown==2.6.7
  • MarkupSafe==0.23
  • PyDispatcher==2.0.5
  • SQLAlchemy==1.1.2
  • Scrapy==1.2.1
  • Twisted==16.4.1
  • WTForms==2.1
  • Werkzeug==0.11.11
  • alembic==0.8.8
  • amqp==1.4.9
  • aniso8601==1.2.0
  • anyjson==0.3.3
  • appdirs==1.4.3
  • attrs==16.2.0
  • billiard==3.3.0.23
  • celery==4.0.2
  • cffi==1.8.3
  • chardet==2.2.1
  • click==6.6
  • colorama==0.2.5
  • command-not-found==0.3
  • coverage==4.2
  • cryptography==1.5.2
  • cssselect==1.0.0
  • dominate==2.2.1
  • html5lib==0.999
  • idna==2.1
  • itsdangerous==0.24
  • kombu==3.0.37
  • language-selector==0.1
  • lxml==3.6.4
  • mysqlclient==1.3.10
  • oauthlib==2.0.0
  • packaging==16.8
  • parsel==1.0.3
  • pyOpenSSL==16.2.0
  • pyasn1==0.1.9
  • pyasn1-modules==0.0.8
  • pycparser==2.16
  • pycurl==7.19.3
  • pygobject==3.12.0
  • pyparsing==2.2.0
  • python-apt==0.9.3.5
  • python-dateutil==2.5.3
  • python-editor==1.0.1
  • pytz==2016.7
  • queuelib==1.4.2
  • redis==2.10.5
  • requests==2.2.1
  • requests-oauthlib==0.8.0
  • scrapy-redis==0.6.3
  • service-identity==16.0.0
  • six==1.10.0
  • ufw===0.34-rc-0ubuntu2
  • unattended-upgrades==0.1
  • urllib3==1.7.1
  • visitor==0.1.3
  • w3lib==1.15.0
  • wheel==0.24.0
  • zope.interface==4.3.2

settings.py

# -*- coding: utf-8 -*-
from urllib.parse import quote

# Scrapy settings for gather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'gather'

SPIDER_MODULES = ['gather.spiders']
NEWSPIDER_MODULE = 'gather.spiders'

LOG_LEVEL = 'INFO'
REACTOR_THREADPOOL_MAXSIZE = 50
CLOSESPIDER_TIMEOUT = 1000

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0) Gecko/20121026 Firefox/16.0'
USER_AGENT_FILE = 'ua.txt'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.35
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 8

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,ja;q=0.2',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': 1
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'gather.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'gather.middlewares.RandomUserAgentMiddleware': 500
}

DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.BrowserLikeContextFactory'

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'gather.pipelines.SqlalchemyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Database URI
SQLALCHEMY_DATABASE_URI = 'mysql://root:[email protected]/bdm255611631_db'

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

# Schedule requests using a queue (FIFO).
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
#ITEM_PIPELINES = {
#    'scrapy_redis.pipelines.RedisPipeline': 300
#}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = '127.0.0.1'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'
REDIS_URL = 'redis://:[email protected]:6379'
# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS  = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# How many start urls to fetch at once.
#REDIS_START_URLS_BATCH_SIZE = 16

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

Here is a screenshot of the tables:
(screenshot of the database tables)


taogeT commented on August 31, 2024

@ROCHOU The DBAPI driver is not installed.
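
That is, SQLAlchemy resolved the mysql dialect but could not import a DBAPI driver for it in the interpreter running the spider. A hedged sketch of the fix (host, user, and database names are placeholders): install one of the Python 3 MySQL drivers and, if needed, name it explicitly after the dialect in the URI.

# Hedged sketch: install a MySQL DBAPI driver for Python 3, for example
#   pip install mysqlclient    (C extension, provides the MySQLdb module)
#   pip install PyMySQL        (pure Python)
# then name the driver after the dialect in the connection URI.
SQLALCHEMY_DATABASE_URI = 'mysql+mysqldb://user:password@localhost/dbname'    # mysqlclient
# SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://user:password@localhost/dbname'  # PyMySQL

A plain mysql:// URI defaults to the MySQLdb driver, so it fails with an import error if mysqlclient is not importable.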


eivaltsec commented on August 31, 2024

Thanks, let me check first and fill in the missing packages.


eivaltsec commented on August 31, 2024

I'd still like to ask: which third-party MySQL library did you install for Python 3? Someone on SF said Python 3 doesn't support mysql-python, and that mysqlclient or PyMySQL should be used instead.


taogeT commented on August 31, 2024

@ROCHOU I use PostgreSQL, so I'm not especially familiar with the MySQL side. I'd suggest checking the official SQLAlchemy documentation for its recommended drivers and then the individual driver projects' sites; that should cover everything.
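
For reference, the SQLAlchemy URL carries both the dialect and the DBAPI driver, so the documentation's list of supported drivers maps directly onto the URI prefix (a hedged sketch; hosts and credentials are placeholders):

# Format: dialect+driver://user:password@host[:port]/database
POSTGRES_URI = 'postgresql+psycopg2://user:password@localhost/dbname'  # PostgreSQL via psycopg2
MYSQL_URI = 'mysql+pymysql://user:password@localhost/dbname'           # MySQL via PyMySQL

Leaving out the +driver part falls back to the dialect's default DBAPI (psycopg2 for postgresql, MySQLdb for mysql).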


eivaltsec commented on August 31, 2024

Thanks, OP.

