
Comments (6)

ZakiHe commented on June 6, 2024

Finally, I worked out a demo! That's awesome! The NoneType error was caused by my pipeline not returning the item. Thanks for your help, aaldaber!
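
For reference, a minimal sketch of the kind of fix described here (a hypothetical pipeline, not the actual project code): in Scrapy, process_item must return the item, otherwise the next pipeline component receives None and fails with exactly this kind of NoneType error.

```
# Hypothetical pipeline, for illustration only
class ExamplePipeline(object):
    def process_item(self, item, spider):
        # ... process or store the item here ...
        return item  # without this, the next pipeline component receives None
```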


ZakiHe commented on June 6, 2024

Thanks for your help, aaldaber. My database is online now. That's good news for me.
Another small question: I set up the link generator with the following code:

```
from urlparse import urljoin
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.http import Request
import scrapy

start_urls = ['http://focus.tianya.cn/thread/index.shtml']

rules = [
    Rule(LinkExtractor(allow=r'.*post-.*-1.shtml'), callback='parse'),
]

def parse(self, response):
    sel = scrapy.selector.Selector(response)
    article_link = sel.xpath(
        '//*[@id="main"]/div[position()=7]/table/tbody[position()=2]/tr/td[position()=1]/a/@href').extract()
    article_reply_time = sel.xpath(
        '//*[@id="main"]/div[position()=7]/table/tbody[position()=2]/tr/td[position()=5]/text()').extract()
    next_page = sel.xpath('//*[@id="main"]/div[position()=8]/div/a[position()=2]/@href').extract()
    if len(article_reply_time) > 0:
        last_reply_time = article_reply_time[-1]
    else:
        article_link = sel.xpath(
            '//*[@id="main"]/div[position()=6]/table/tbody[position()=2]/tr/td[position()=1]/a/@href').extract()
        article_reply_time = sel.xpath(
            '//*[@id="main"]/div[position()=6]/table/tbody[position()=2]/tr/td[position()=5]/text()').extract()

    for i in range(len(article_reply_time)):
        url = urljoin(self.baseurl, article_link[i])
        request = Request(url, method="GET",
                          dont_filter=True, priority=(AdminTianyaSpider._priority), callback=self.parse)
        self.insert_link(request)
```

When I run it, it shows finished, but the log file is empty.
Is it performing correctly? I didn't see any errors in the scrapyd log or the spider log.

And the scraper code looks like this:

```
from scrapy.selector import Selector
import re, datetime, time

def parse_article(self, response):
    sel = Selector(response)
    l1 = ItemLoader(item=ArticleItem(), response=response)
    article_url = str(response.url)
    params = re.split('-', str(response.url))
    article_list = params[1]
    article_id = params[1] + '-' + params[2]

    if params[3] == '1.shtml':
        article_content = sel.xpath(
            '//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()').extract()
        content = ''
        for a in article_content:
            content = content + a.strip()
        l1.add_value(field_name='article_list', value=article_list)
        l1.add_value(field_name='article_url', value=article_url)
        l1.add_value(field_name='article_id', value=article_id)
        l1.add_xpath(field_name='article_title', xpath='//*[@id="post_head"]/h1/span[1]/span/text()')
        l1.add_xpath(field_name='article_from'
                     , xpath='//div[@id="doc"]/div[@id="bd"]/div[@class="atl-location clearfix"]/p/em/a[1]/text()')
        l1.add_xpath(field_name='article_author'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[1]/a/text()')
        article_datetime_path = '//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()'
        article_datetime = sel.xpath(article_datetime_path).extract()[0][3::]
        date_time = datetime.datetime.strptime(article_datetime, '%Y-%m-%d %H:%M:%S ')
        now_time = datetime.date.today()
        time_line = now_time + datetime.timedelta(days=self.day)
        deathtime = datetime.datetime.strptime(str(time_line), '%Y-%m-%d')
        if date_time > deathtime:
            l1.add_value(field_name='article_datetime', value=article_datetime)
        else:
            return
        crawl_time = get_tody_timestamp()
        l1.add_value(field_name='crawl_times', value=crawl_time)
        l1.add_xpath(field_name='article_clicks'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[3]/text()')
        l1.add_xpath(field_name='article_reply'
                     , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[4]/text()')
        l1.add_value(field_name='article_content', value=content)
        l1.add_value(field_name='ack_signal', value=int(response.meta['ack_signal']))
        yield l1.load_item()

def get_tody_timestamp():
    t = time.time()
    lt = time.localtime(t)
    format_str = '%Y-%m-%d %H:%M:%S'
    todytimestamp = time.strftime(format_str, lt)
    return todytimestamp
```

I get this error in the scraper log:
```
2017-04-27 14:35:13 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <AdminTianyaSpider 'Admin_tianya' at 0x54c5ef0>>
Traceback (most recent call last):
  File "e:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "e:\python27\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "build/bdist.linux-x86_64/egg/Admin_tianya/spiders/Admin_tianya.py", line 28, in spider_closed
AttributeError: 'AdminTianyaSpider' object has no attribute 'statstask'
2017-04-27 14:35:13 [twisted] CRITICAL: Unhandled error in Deferred:
2017-04-27 14:35:13 [twisted] CRITICAL:
```

Hoping you can help me with them.

Thanks,
ZakiHe


aaldaber commented on June 6, 2024

Hi ZakiHe,

I agree with you that a full deployment example would be very helpful. I will write detailed documentation when I get some free time. As for the database connection, please input the address in the following format: 192.168.2.73:27017. There is no need to put "mongodb://"; the URI part of MONGODB_URI must have caused the confusion here. I would also advise enabling authentication and putting the MongoDB user and password in the settings file so that everything works correctly, since the system was designed to be used with an auth-enabled MongoDB. Enabling auth is described here: https://docs.mongodb.com/manual/tutorial/enable-authentication/
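
For reference, a minimal sketch of what the relevant settings could look like; MONGODB_URI and its host:port format come from the comment above, while the user and password setting names are hypothetical and may be named differently in the actual settings file:

```
# settings.py -- illustrative sketch only
MONGODB_URI = '192.168.2.73:27017'   # host:port, without the "mongodb://" prefix
MONGODB_USER = 'scrapy_user'         # hypothetical setting name for the auth-enabled user
MONGODB_PASSWORD = 'change_me'       # hypothetical setting name for that user's password
```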

Best regards,
Aibek


ZakiHe commented on June 6, 2024

I traced the code in the source and found that this error is caused here:

```
import scrapy
from scrapy.http import Request
from scrapy import signals
from ..items import *
from twisted.internet.task import LoopingCall
import datetime

class AdminTianyaSpider(scrapy.Spider):
    islinkgenerator = False
    name = 'Admin_tianya'
    start_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        spider._set_crawler(crawler)
        return spider

    def spider_opened(self, spider):
        self.statstask = LoopingCall(self.stats_saver, spider)
        self.statstask.start(60)

    def spider_closed(self, spider):
        self.statstask.stop()  # <-- the line the AttributeError traceback points to
        spider.crawler.stats.inc_value('project_stopped', spider=spider)
        self.stats_saver(spider)

    def item_scraped(self, item, response, spider):
        item_name = item.__class__.__name__
        spider.crawler.stats.inc_value(item_name, spider=spider)

    def stats_saver(self, spider):
        data = spider.crawler.stats.get_stats()
        if data.get('start_time', 0):
            data['start_time'] = str(data['start_time'])
        if data.get('finish_time', 0):
            data['finish_time'] = str(data['finish_time'])
        with open('%s/%s/%s/stats.log' % ('logs', spider.name, spider.name), 'w') as f:
            f.write(str(data))

    from scrapy.selector import Selector
    import re, datetime, time

    def parse_article(self, response):
        sel = Selector(response)
        # i = ItemLoader(item=ReplyItem(), response=response)
        # l = ItemLoader(item=CommentItem(), response=response)
        l1 = ItemLoader(item=ArticleItem(), response=response)
        article_url = str(response.url)
        params = re.split('-', str(response.url))
        article_list = params[1]
        article_id = params[1] + '-' + params[2]

        if params[3] == '1.shtml':
            article_content = sel.xpath(
                '//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()').extract()
            content = ''
            for a in article_content:
                content = content + a.strip()
            l1.add_value(field_name='article_list', value=article_list)
            l1.add_value(field_name='article_url', value=article_url)
            l1.add_value(field_name='article_id', value=article_id)
            l1.add_xpath(field_name='article_title', xpath='//*[@id="post_head"]/h1/span[1]/span/text()')
            l1.add_xpath(field_name='article_from'
                         , xpath='//div[@id="doc"]/div[@id="bd"]/div[@class="atl-location clearfix"]/p/em/a[1]/text()')
            l1.add_xpath(field_name='article_author'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[1]/a/text()')
            article_datetime_path = '//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()'
            article_datetime = sel.xpath(article_datetime_path).extract()[0][3::]
            # print(article_datetime)
            date_time = datetime.datetime.strptime(article_datetime, '%Y-%m-%d %H:%M:%S ')
            now_time = datetime.date.today()
            time_line = now_time + datetime.timedelta(days=self.day)
            deathtime = datetime.datetime.strptime(str(time_line), '%Y-%m-%d')
            if date_time > deathtime:
                l1.add_value(field_name='article_datetime', value=article_datetime)
            else:
                return
            # l1.add_xpath(field_name='article_datetime'
            #              , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[2]/text()')
            crawl_time = get_tody_timestamp()
            l1.add_value(field_name='crawl_times', value=crawl_time)
            l1.add_xpath(field_name='article_clicks'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[3]/text()')
            l1.add_xpath(field_name='article_reply'
                         , xpath='//*[@id="post_head"]/div[2]/div[@class="atl-info"]/span[4]/text()')
            l1.add_value(field_name='article_content', value=content)
            l1.add_value(field_name='ack_signal', value=int(response.meta['ack_signal']))
            yield l1.load_item()

    def get_tody_timestamp():
        t = time.time()
        lt = time.localtime(t)
        format_str = '%Y-%m-%d %H:%M:%S'
        todytimestamp = time.strftime(format_str, lt)
        return todytimestamp
```

It seems to be a scope problem? self.statstask is not defined by the time spider_closed runs? I'm not sure, because I'm new to Python. I hope this can help us get to the bottom of the problem.
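
One way to confirm that the missing attribute is the only problem (a defensive debugging sketch, not the project's actual fix) would be to guard spider_closed so it tolerates the case where spider_opened never set statstask:

```
# Defensive variant of spider_closed, for debugging only
# (assumption: the rest of the spider stays as posted above)
def spider_closed(self, spider):
    task = getattr(self, 'statstask', None)
    if task is not None and task.running:
        task.stop()
    spider.crawler.stats.inc_value('project_stopped', spider=spider)
    self.stats_saver(spider)
```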


aaldaber commented on June 6, 2024

Interesting, that self.statstask.stop() line shouldn't be causing an error. I suspect that there is another error somewhere in your code that is causing the spider to close. I would pay attention to how you use your imported packages, things like:

from scrapy.selector import Selector
import re, datetime,time

Because the spider class is custom, your import statements fall inside the spider class, and thus the imported modules should be referred to as self.Selector, self.re, self.datetime, self.time, etc. This is rather inconvenient, but that's how it works for now. So try adding self. before every imported package you use, and see if that works for you.
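
To illustrate what this means in plain Python (a minimal, self-contained example, not the project's generated code): a module imported inside a class body becomes a class attribute, so instance methods have to reach it through self.

```
# Minimal illustration of imports inside a class body
class Example(object):
    import re  # bound as the class attribute Example.re

    def find_numbers(self, text):
        # a bare `re` would raise NameError here; the module is reached via self
        return self.re.findall(r'\d+', text)

print(Example().find_numbers('post-394-209418-1.shtml'))  # ['394', '209418', '1']
```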


ZakiHe commented on June 6, 2024

Thanks, aaldaber. After adding self. and doing some debugging, the logs show nothing, so I thought it was working fine.
But nothing gets inserted into the db. So I changed the spider's setting log_level='INFO', and the log shows the following error (interestingly, the error did not show up in the logs when log_level='ERROR').

```
2017-05-03 10:15:59 [scrapy.core.scraper] ERROR: Error processing {'ack_signal': [8], 'article_id': ['394-209418'], 'article_title': [u'\u4e0a\u6d77\u4e5d\u9662\u9686\u9f3b\u600e\u4e48\u6837\uff1f\u54ea\u4e2a\u533b\u751f\u9686\u7684\u597d\uff1f'], 'article_url': ['http://bbs.tianya.cn/post-394-209418-1.shtml']}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "build/bdist.linux-x86_64/egg/Admin_tianya/mongodb/scrapy_mongodb.py", line 146, in process_item
  File "/usr/local/lib/python2.7/site-packages/scrapy/exporters.py", line 65, in _get_serialized_fields
    field_iter = six.iterkeys(item)
  File "/usr/local/lib/python2.7/site-packages/six.py", line 593, in iterkeys
    return d.iterkeys(**kw)
AttributeError: 'NoneType' object has no attribute 'iterkeys'
```

That's pretty interesting, and I just wanted to share it with you.
BTW, could you please tell me which library versions you use for this project? For example: scrapy, twisted, and so on.

