
Comments (6)

ZakiHe commented on June 6, 2024

Finally, I worked out a demo! That's awesome! The NoneType error was caused by my pipeline not returning the item. Thanks for your help, aaldaber!
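
For reference, a minimal sketch of the kind of fix described here (a hypothetical pipeline, not the actual project code): in Scrapy, process_item must return the item, otherwise the next pipeline component receives None and fails with exactly this kind of NoneType error.

```
# Hypothetical pipeline, for illustration only
class ExamplePipeline(object):
    def process_item(self, item, spider):
        # ... process or store the item here ...
        return item  # without this, the next pipeline component receives None
```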


ZakiHe commented on June 6, 2024

Thanks for your help, aaldaber. My database is online now. That's good news for me.
Another small question: I set up the link generator with the following code:

```
from urlparse import urljoin
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.http import Request
import scrapy

start_urls = ['http://focus.tianya.cn/thread/index.shtml']

rules = [
    Rule(LinkExtractor(allow=r'.*post-.*-1.shtml'), callback='parse'),
]

def parse(self, response):
    sel = scrapy.selector.Selector(response)
    article_link = sel.xpath(
        '//*[@id="main"]/div[position()=7]/table/tbody[position()=2]/tr/td[position()=1]/a/@href').extract()
    article_reply_time = sel.xpath(
        '//*[@id="main"]/div[position()=7]/table/tbody[position()=2]/tr/td[position()=5]/text()').extract()
    next_page = sel.xpath('//*[@id="main"]/div[position()=8]/div/a[position()=2]/@href').extract()
    if len(article_reply_time) > 0:
        last_reply_time = article_reply_time[-1]
    else:
        article_link = sel.xpath(
            '//*[@id="main"]/div[position()=6]/table/tbody[position()=2]/tr/td[position()=1]/a/@href').extract()
        article_reply_time = sel.xpath(
            '//*[@id="main"]/div[position()=6]/table/tbody[position()=2]/tr/td[position()=5]/text()').extract()

    for i in range(len(article_reply_time)):
        url = urljoin(self.baseurl, article_link[i])
        request = Request(url, method="GET",
                          dont_filter=True, priority=(AdminTianyaSpider._priority), callback=self.parse)
        self.insert_link(request)
```

When I run it, it shows finished, but the log file is empty.
Is it performing correctly? I didn't see any errors in the scrapyd log or the spider log.

And the scraper code looks like this:

```
from scrapy.selector import Selector
import re, datetime, time

def parse_article(self, response):
    sel = Selector(response)
    l1 = ItemLoader(item=ArticleItem(), response=response)
    article_url = str(response.url)
    params = re.split('-', str(response.url))
    article_list = params[1]
    article_id = params[1] + '-' + params[2]

    if params[3] == '1.shtml':
        article_content = sel.xpath(
            '//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()').extract()
        content = ''
        for a in article_content:
            content = content + a.strip()
        l1.add_value(field_name='article_list', value=article_list)
        l1.add_value(field_name='article_url', value=article_url)
        l1.add_value(field_name='article_id', value=article_id)
        l1.add_xpath(field_name='article_title', xpath='//*[@id="post_head"]/h1/span[1]/span/text()')
        l1.add_xpath(field_name='article_from'
                     , xpath='//div[@id="doc"]/div[@id="bd"]/div[@class="atl-location clearfix"]/p/em/a[1]/text()')
        l1.add_xpath(field_name='article_author'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[1]/a/text()')
        article_datetime_path = '//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()'
        article_datetime = sel.xpath(article_datetime_path).extract()[0][3::]
        date_time = datetime.datetime.strptime(article_datetime, '%Y-%m-%d %H:%M:%S ')
        now_time = datetime.date.today()
        time_line = now_time + datetime.timedelta(days=self.day)
        deathtime = datetime.datetime.strptime(str(time_line), '%Y-%m-%d')
        if date_time > deathtime:
            l1.add_value(field_name='article_datetime', value=article_datetime)
        else:
            return
        crawl_time = get_tody_timestamp()
        l1.add_value(field_name='crawl_times', value=crawl_time)
        l1.add_xpath(field_name='article_clicks'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[3]/text()')
        l1.add_xpath(field_name='article_reply'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[4]/text()')
        l1.add_value(field_name='article_content', value=content)
        l1.add_value(field_name='ack_signal', value=int(response.meta['ack_signal']))
        yield l1.load_item()

def get_tody_timestamp():
    t = time.time()
    lt = time.localtime(t)
    format_str = '%Y-%m-%d %H:%M:%S'
    todytimestamp = time.strftime(format_str, lt)
    return todytimestamp
```

I get this error in the scraper log:
```
2017-04-27 14:35:13 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <AdminTianyaSpider 'Admin_tianya' at 0x54c5ef0>>
Traceback (most recent call last):
  File "e:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "e:\python27\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "build/bdist.linux-x86_64/egg/Admin_tianya/spiders/Admin_tianya.py", line 28, in spider_closed
AttributeError: 'AdminTianyaSpider' object has no attribute 'statstask'
2017-04-27 14:35:13 [twisted] CRITICAL: Unhandled error in Deferred:
2017-04-27 14:35:13 [twisted] CRITICAL:
```

Hoping you can help me with them.

Thanks,
ZakiHe


aaldaber commented on June 6, 2024

Hi ZakiHe,

I agree with you that a full deployment example would be very helpful. I will write detailed documentation when I get some free time. As for the database connection, please input the address in the following format: 192.168.2.73:27017. There is no need to put "mongodb://"; the URI part of MONGODB_URI must have caused the confusion here. I would also advise enabling authentication and putting the MongoDB user and password in the settings file so that everything works correctly, since the system was designed to be used with an auth-enabled MongoDB. Enabling auth is described here: https://docs.mongodb.com/manual/tutorial/enable-authentication/
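
For reference, a minimal sketch of what the relevant settings could look like; MONGODB_URI and its host:port format come from the comment above, while the user and password setting names are hypothetical and may be named differently in the actual settings file:

```
# settings.py -- illustrative sketch only
MONGODB_URI = '192.168.2.73:27017'   # host:port, without the "mongodb://" prefix
MONGODB_USER = 'scrapy_user'         # hypothetical setting name for the auth-enabled user
MONGODB_PASSWORD = 'change_me'       # hypothetical setting name for that user's password
```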

Best regards,
Aibek


ZakiHe commented on June 6, 2024

I traced the code in the source and found that this error is caused here:

```
import scrapy
from scrapy.http import Request
from scrapy import signals
from ..items import *
from twisted.internet.task import LoopingCall
import datetime

class AdminTianyaSpider(scrapy.Spider):
    islinkgenerator = False
    name = 'Admin_tianya'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        spider._set_crawler(crawler)
        return spider

    def spider_opened(self, spider):
        self.statstask = LoopingCall(self.stats_saver, spider)
        self.statstask.start(60)

    def spider_closed(self, spider):
        self.statstask.stop()  # <-- the line the AttributeError traceback points to
        spider.crawler.stats.inc_value('project_stopped', spider=spider)
        self.stats_saver(spider)

    def item_scraped(self, item, response, spider):
        item_name = item.__class__.__name__
        spider.crawler.stats.inc_value(item_name, spider=spider)

    def stats_saver(self, spider):
        data = spider.crawler.stats.get_stats()
        if data.get('start_time', 0):
            data['start_time'] = str(data['start_time'])
        if data.get('finish_time', 0):
            data['finish_time'] = str(data['finish_time'])
        with open('%s/%s/%s/stats.log' % ('logs', spider.name, spider.name), 'w') as f:
            f.write(str(data))

    from scrapy.selector import Selector
    import re, datetime, time

    def parse_article(self, response):
        sel = Selector(response)
        # i = ItemLoader(item=ReplyItem(), response=response)
        # l = ItemLoader(item=CommentItem(), response=response)
        l1 = ItemLoader(item=ArticleItem(), response=response)
        article_url = str(response.url)
        params = re.split('-', str(response.url))
        article_list = params[1]
        article_id = params[1] + '-' + params[2]

        if params[3] == '1.shtml':
            article_content = sel.xpath(
                '//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()').extract()
            content = ''
            for a in article_content:
                content = content + a.strip()
            l1.add_value(field_name='article_list', value=article_list)
            l1.add_value(field_name='article_url', value=article_url)
            l1.add_value(field_name='article_id', value=article_id)
            l1.add_xpath(field_name='article_title', xpath='//*[@id="post_head"]/h1/span[1]/span/text()')
            l1.add_xpath(field_name='article_from'
                         , xpath='//div[@id="doc"]/div[@id="bd"]/div[@class="atl-location clearfix"]/p/em/a[1]/text()')
            l1.add_xpath(field_name='article_author'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[1]/a/text()')
            article_datetime_path = '//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()'
            article_datetime = sel.xpath(article_datetime_path).extract()[0][3::]
            # print(article_datetime)
            date_time = datetime.datetime.strptime(article_datetime, '%Y-%m-%d %H:%M:%S ')
            now_time = datetime.date.today()
            time_line = now_time + datetime.timedelta(days=self.day)
            deathtime = datetime.datetime.strptime(str(time_line), '%Y-%m-%d')
            if date_time > deathtime:
                l1.add_value(field_name='article_datetime', value=article_datetime)
            else:
                return
            # l1.add_xpath(field_name='article_datetime'
            #              , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()')
            crawl_time = get_tody_timestamp()
            l1.add_value(field_name='crawl_times', value=crawl_time)
            l1.add_xpath(field_name='article_clicks'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[3]/text()')
            l1.add_xpath(field_name='article_reply'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[4]/text()')
            l1.add_value(field_name='article_content', value=content)
            l1.add_value(field_name='ack_signal', value=int(response.meta['ack_signal']))
            yield l1.load_item()

    def get_tody_timestamp():
        t = time.time()
        lt = time.localtime(t)
        format_str = '%Y-%m-%d %H:%M:%S'
        todytimestamp = time.strftime(format_str, lt)
        return todytimestamp
```

It seems to be a scope problem? self.statstask is not defined by the time spider_closed runs? I'm not sure, because I'm new to Python. I hope this can help us get to the bottom of the problem.
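
One way to confirm that the missing attribute is the only problem (a defensive debugging sketch, not the project's actual fix) would be to guard spider_closed so it tolerates the case where spider_opened never set statstask:

```
# Defensive variant of spider_closed, for debugging only
# (assumption: the rest of the spider stays as posted above)
def spider_closed(self, spider):
    task = getattr(self, 'statstask', None)
    if task is not None and task.running:
        task.stop()
    spider.crawler.stats.inc_value('project_stopped', spider=spider)
    self.stats_saver(spider)
```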


aaldaber commented on June 6, 2024

Interesting, that self.statstask.stop() line shouldn't be causing an error. I suspect that there is another error somewhere in your code that is causing the spider to close. I would pay attention to how you use your imported packages, things like:

from scrapy.selector import Selector
import re, datetime,time

Because the spider class is custom, your import statements fall inside the spider class, and thus the imported modules should be referred to as self.Selector, self.re, self.datetime, self.time, etc. This is rather inconvenient, but that's how it works for now. So try adding self. before every imported package you use, and see if that works for you.
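
To illustrate what this means in plain Python (a minimal, self-contained example, not the project's generated code): a module imported inside a class body becomes a class attribute, so instance methods have to reach it through self.

```
# Minimal illustration of imports inside a class body
class Example(object):
    import re  # bound as the class attribute Example.re

    def find_numbers(self, text):
        # a bare `re` would raise NameError here; the module is reached via self
        return self.re.findall(r'\d+', text)

print(Example().find_numbers('post-394-209418-1.shtml'))  # ['394', '209418', '1']
```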


ZakiHe commented on June 6, 2024

Thanks, aaldaber. After adding self. and doing some debugging, the logs show nothing, so I thought it was working fine.
But nothing gets inserted into the db. So I changed the spider's setting log_level='INFO', and the log shows the following error (interestingly, the error did not show up in the logs when log_level='ERROR').

```
2017-05-03 10:15:59 [scrapy.core.scraper] ERROR: Error processing {'ack_signal': [8], 'article_id': ['394-209418'], 'article_title': [u'\u4e0a\u6d77\u4e5d\u9662\u9686\u9f3b\u600e\u4e48\u6837\uff1f\u54ea\u4e2a\u533b\u751f\u9686\u7684\u597d\uff1f'], 'article_url': ['http://bbs.tianya.cn/post-394-209418-1.shtml']}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "build/bdist.linux-x86_64/egg/Admin_tianya/mongodb/scrapy_mongodb.py", line 146, in process_item
  File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 65, in _get_serialized_fields
    field_iter = six.iterkeys(item)
  File "/usr/local/lib/python2.7/site-packages/six.py", line 593, in iterkeys
    return d.iterkeys(**kw)
AttributeError: 'NoneType' object has no attribute 'iterkeys'
```

That's pretty interesting, and I just wanted to share it with you.
BTW, could you please tell me which library versions you use for this project? For example: scrapy, twisted, and so on.

