
webwalker's Introduction

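A secondary-development framework for scraping website data

Built on scrapy. With a little configuration it can scrape specified fields from every item in a site category.
It runs as a long-lived process: once started, tasks are submitted through feed, and redis provides distribution so that multiple spiders on multiple machines watch for and crawl tasks in real time.

This project is obsolete. Its successor, structure_spider, is recommended instead: it follows scrapy's coding conventions more closely and offers better extensibility and more freedom.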

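Required skills

  • At least one of XPath expressions, regular expressions, or CSS selectors
  • Python dict and list data structures

Recommended skills

  • Using Python lambda expressions
  • Writing simple Python functions
  • Basic scrapy concepts (see the short scrapy introduction)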

INSTALL

ubuntu && windows

web-walker 1.2.2 and earlier run on Python 2.7
web-walker 3.0.0 and later run on Python 3.6
git clone https://github.com/ShichaoMa/webWalker.git
cd webWalker/walker && (sudo) python setup.py install

or

(sudo) pip install web-walker==X.X.X

HELLOWORLD

  1. After installation (pip is recommended), use scrapy to generate a project
ubuntu@dev:~/myprojects$ scrapy startproject demo
New Scrapy project 'demo' created in:
    /home/ubuntu/myprojects/demo

You can start your first spider with:
    cd demo
    scrapy genspider example example.com


# The directory structure is as follows
.
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

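  Alternatively, copy myapp directly from test; if you rename the project, remember to update the name in scrapy.cfg. Python 3 users with web-walker>=3.1.0 can run startproject demo to generate a new project directly, which makes steps 1, 2, 3 and 4 unnecessary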
longen@dataServer:~$ startproject demo
New web-walker project 'demo', using template directory '/home/longen/.pyenv/versions/3.6.0/lib/python3.6/site-packages/walker/templates/project', created in:
    /home/longen/demo

You can start the demo spider with:
    custom-redis-server --host 127.0.0.1 -p 6379
    cd demo
    scrapy crawl bluefly
longen@dataServer:~$ cd demo/
longen@dataServer:~/demo$ tree
.
├── demo
│   ├── __init__.py
│   ├── proxy.list
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── item_field.py
│       ├── item_xpath.py
│       ├── page_xpath.py
│       ├── __pycache__
│       └── spiders.py
└── scrapy.cfg

4 directories, 9 files

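  2. Delete demo/items.py and demo/pipelines.py, and replace the original settings.py and spiders/__init__.py with myapp/settings.py and myapp/spiders/__init__.py

  3. In the spiders directory, create page_xpath.py, item_xpath.py, item_field.py and spiders.py with the following contents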

# spiders.py

# -*- coding:utf-8 -*-

SPIDERS = {  # Maps each spider name to a dict of custom attributes for that spider; the dict may be empty
    "bluefly": {}
}

# page_xpath.py

# -*- coding:utf-8 -*-

PAGE_XPATH = {  # How to find the next-page link on a category page; see the wiki for the available strategies
    "bluefly": [
        '//*[@id="page-content"]//a[@rel="next"]/@href',
    ]
}

# item_xpath.py

# -*- coding:utf-8 -*-

ITEM_XPATH = {  # XPath expressions that find item-page links on a category page
    "bluefly": [
        '//ul[@class="mz-productlist-list mz-l-tiles"]/li//a[@class="mz-productlisting-title"]/@href',
    ]
}

# item_field.py

# -*- coding:utf-8 -*-

ITEM_FIELD = {  # How to extract the required fields from an item page; see the wiki for the available strategies
    "bluefly": [
        ('product_id', {
            "xpath": [
                '//li[@itemprop="productID"]/text()',
            ],
        }),
        ('brand', {
            "xpath": [
                '//p[@class="mz-productbrand"]/a/text()',
            ],
        }),
        ('names', {
            "xpath": [
                '//span[@class="mz-breadcrumb-current"]/text()',
            ],
        }),
    ]
}
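
The four files above are keyed by the same spider name, so a typo in one of them is easy to miss. The following is a minimal sketch of a consistency check, not part of web-walker: the helper name missing_config is hypothetical, the XPath lists are shortened copies of the ones above, and whether the framework actually requires an entry in all three tables is an assumption.

```python
# Shortened copies of the configuration shown above.
SPIDERS = {"bluefly": {}}
PAGE_XPATH = {"bluefly": ['//*[@id="page-content"]//a[@rel="next"]/@href']}
ITEM_XPATH = {"bluefly": ['//a[@class="mz-productlisting-title"]/@href']}
ITEM_FIELD = {
    "bluefly": [
        ("product_id", {"xpath": ['//li[@itemprop="productID"]/text()']}),
    ]
}


def missing_config(spiders, **tables):
    """Return (spider, table_name) pairs whose entry is absent or empty.

    Hypothetical helper: it assumes every spider needs a non-empty entry
    in every table that is passed in.
    """
    problems = []
    for spider in spiders:
        for table_name, table in sorted(tables.items()):
            if not table.get(spider):
                problems.append((spider, table_name))
    return problems


problems = missing_config(
    SPIDERS,
    PAGE_XPATH=PAGE_XPATH,
    ITEM_XPATH=ITEM_XPATH,
    ITEM_FIELD=ITEM_FIELD,
)
print(problems)
```

An empty list means every spider named in SPIDERS has an entry in each table.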

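  4. Edit demo/settings.py, or create a new localsettings.py, to add custom settings. The settings that need changing are marked in settings.py

  5. Start redis

# If redis is not installed, the bundled custom-redis can be used; the settings must then include CUSTOM_REDIS=True
custom-redis-server -p 6379

  6. Start the spider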
cd demo
scrapy crawl bluefly

  7. Submit tasks
# When using the bundled custom-redis, add --custom
# Submit a category link
feed -c test_01 -s bluefly -u "http://www.bluefly.com/assortment/the-boot-shop-overarching/women/shoes" --custom
# Submit item links; several item links can be submitted together by putting one link per line in a file
feed -c test_04 -s ashford -uf item.txt --custom
  8. Check task status
# When using the bundled custom-redis, add --custom
check test_01 --custom
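
The file passed to feed with -uf holds one item link per line. As a minimal sketch of producing such a file, the snippet below writes a list of URLs to item.txt in a temporary directory; the helper name write_url_file and the example.com URLs are placeholders, and only the one-link-per-line layout comes from the description above.

```python
import tempfile
from pathlib import Path

# Placeholder item links; replace with real item-page URLs.
item_urls = [
    "http://www.example.com/item/1",
    "http://www.example.com/item/2",
]


def write_url_file(path, urls):
    """Write one URL per line, skipping blank entries; return the line count."""
    lines = [url.strip() for url in urls if url.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


url_file = Path(tempfile.mkdtemp()) / "item.txt"
count = write_url_file(url_file, item_urls)
print(count, url_file)
```

The resulting path can then be given to feed in place of item.txt.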

DOCUMENTATION

See the wiki

webwalker's People

Contributors

cnaafhvk, shichaoma

webwalker's Issues

Install error on Python 3: no suitable version of pdb

On macOS with Python 3.6.2, running pip3 install web-walker==3.0.0 fails with: Could not find a version that satisfies the requirement pdb (from web-walker==3.0.0) (from versions: )
No matching distribution found for pdb (from web-walker==3.0.0)

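SyntaxError caused by statically nested scopes in Python 2

Environment
CentOS 6.5
Python 2.7.8
web-walker 1.2.2
Test case: myapp from the test folder in web-walker

A SyntaxError is raised at runtime. It turns out to be tied to the Python 2 version: Python 2.1 introduced a new language feature, statically nested scopes, and that feature triggers the following error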

  File "/usr/local/lib/python2.7/site-packages/walker/spiders/__init__.py", line 285
    exec("cls_%s = create(k, v)"% index, locals(), globals)
SyntaxError: unqualified exec is not allowed in function 'start' it contains a nested function with free variables

The relevant part of __init__.py is as follows

def start(spiders, globals, module_name, item_field, item_xpath, page_xpath):

    ITEM_FIELD.update(item_field)
    ITEM_XPATH.update(item_xpath)
    PAGE_XPATH.update(page_xpath)

    def create(k, v):
        v["__module__"] = module_name
        return type("%sSpider" % k, (ClusterSpider,), v)

    index = 0

    for k, v in spiders.items():
        v.update({"name": k})
        exec("cls_%s = create(k, v)"% index, locals(), globals)
        index += 1


For details, see the explanation in PEP 227 -- Statically Nested Scopes


If exec is used in a function and the function contains a nested
    block with free variables, the compiler will raise a SyntaxError
    unless the exec explicitly specifies the local namespace for the
    exec.  (In other words, "exec obj" would be illegal, but 
    "exec obj in ns" would be legal.)
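
One possible way around the error is to avoid exec altogether and store each generated class directly in the target namespace dict. The following is a minimal sketch of that idea, not the project's actual fix: ClusterSpider is replaced by a stub, the namespace parameter stands in for the globals argument of the original function, and the ITEM_FIELD, ITEM_XPATH and PAGE_XPATH updates are left out.

```python
class ClusterSpider(object):
    """Stub standing in for the framework's spider base class."""
    name = None


def start(spiders, namespace, module_name):
    """Create one spider class per entry and store it in `namespace`.

    `namespace` plays the role of the `globals` argument in the original
    code; assigning into the dict needs no exec, so the nested function
    `create` no longer conflicts with it.
    """
    def create(k, v):
        attrs = dict(v, name=k, __module__=module_name)
        return type("%sSpider" % k, (ClusterSpider,), attrs)

    for index, (k, v) in enumerate(spiders.items()):
        namespace["cls_%s" % index] = create(k, v)


namespace = {}
start({"bluefly": {}}, namespace, "demo.spiders")
print(namespace["cls_0"].__name__, namespace["cls_0"].name)
```

In a real project the caller would presumably pass the globals() of its spiders module as namespace, as the original signature suggests.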

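PAGE_XPATH configuration question

Hello
I am new to Python and not very clear on how to use git, so I am not sure whether this is the right place to post
I found the project through this post
https://zhuanlan.zhihu.com/p/23178014?refer=zimei
I would like to try scraping the images at http://desk.zol.com.cn/fengjing/
I ran into a problem configuring PAGE_XPATH and would appreciate some guidance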

PAGE_XPATH = {
    "ZOL": [
        '//li[@class="photo-list-padding"]/a[@class="pic"]/@href',
    ]
}

The address the browser returns is /bizhi/6846_85460_2.html...
The actual address should be http://desk.zol.com.cn + the value above
I do not know how to write this here
