ganjiu11 / mini_spider

This project forked from cafedeflore/mini_spider

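Research work often requires targeted crawling of specific websites. Because Python ships with many powerful libraries, targeted crawling is fairly easy to implement in Python. The task: develop a mini targeted crawler, mini_spider.py, in Python that crawls breadth-first from a set of seed links and saves to disk the pages whose URLs match a given pattern.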

Python 100.00%

mini_spider's Introduction

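#####Develop a targeted crawler, mini_spider.py, in Python. It crawls breadth-first from the seed links and saves to disk the pages whose URLs match a given pattern.

To run the program:

    python mini_spider.py -c spider.conf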
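The command-line interface could be sketched with `argparse` as follows. This is a minimal sketch, not the project's actual code: it assumes Python 3, and the long option names (`--version`, `--conf`), the `VERSION` string, and the `build_arg_parser` helper are hypothetical. Only `-h`, `-v`, and `-c` come from the requirements.

```python
import argparse

VERSION = "1.0"  # hypothetical version string, not taken from the project


def build_arg_parser():
    """Build a parser for -h (help), -v (version) and -c (configuration file)."""
    parser = argparse.ArgumentParser(
        prog="mini_spider.py",
        description="Mini targeted crawler (illustrative sketch).",
    )
    # argparse adds -h/--help automatically.
    parser.add_argument(
        "-v", "--version", action="version", version="%(prog)s " + VERSION
    )
    parser.add_argument(
        "-c", "--conf", dest="conf", required=True,
        help="path to the configuration file, e.g. spider.conf",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_arg_parser().parse_args(["-c", "spider.conf"])
print(args.conf)
```

In a real entry point, `parse_args()` would be called without arguments so that it reads `sys.argv`.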

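#####Configuration file spider.conf:

    [spider]
    url_list_file: ./urls ; path to the seed file
    output_directory: ./output ; directory where crawl results are stored
    max_depth: 1 ; maximum crawl depth (seeds are level 0)
    crawl_interval: 1 ; interval between fetches, in seconds
    crawl_timeout: 1 ; fetch timeout, in seconds
    target_url: .*.(gif|png|jpg|bmp)$ ; URL pattern (regular expression) of the pages to store
    thread_count: 8 ; number of crawl threads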
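A configuration file in this format could be loaded with the standard `configparser` module. The sketch below assumes Python 3, where inline `;` comments are only stripped if `inline_comment_prefixes` is set explicitly. The `load_spider_config` helper and the choice to return a plain dict are assumptions made for illustration.

```python
import configparser
import re

# Sample text mirroring the spider.conf shown above (comments shortened).
SAMPLE_CONF = """\
[spider]
url_list_file: ./urls ; seed file path
output_directory: ./output ; result directory
max_depth: 1 ; seeds are level 0
crawl_interval: 1 ; seconds
crawl_timeout: 1 ; seconds
target_url: .*.(gif|png|jpg|bmp)$ ; regex of URLs to store
thread_count: 8 ; number of threads
"""


def load_spider_config(text):
    """Parse the [spider] section and convert each value to a usable type."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=(";",),  # strip the trailing "; ..." comments
        interpolation=None,              # keep regex characters untouched
    )
    parser.read_string(text)
    section = parser["spider"]
    return {
        "url_list_file": section.get("url_list_file"),
        "output_directory": section.get("output_directory"),
        "max_depth": section.getint("max_depth"),
        "crawl_interval": section.getfloat("crawl_interval"),
        "crawl_timeout": section.getfloat("crawl_timeout"),
        "target_url": re.compile(section.get("target_url")),
        "thread_count": section.getint("thread_count"),
    }


config = load_spider_config(SAMPLE_CONF)
print(config["max_depth"], config["thread_count"])
```

To load from disk instead, `parser.read(path, encoding="utf-8")` can replace `read_string`. A missing key would surface here as a `TypeError` or `ValueError`, which a real program would want to report clearly.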

#####The seed file contains one link per line, for example:

    http://www.sina.com.cn
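Reading the seed file might look like the sketch below. The `parse_seed_lines` helper is hypothetical, and skipping blank lines and dropping duplicate seeds are assumptions rather than stated requirements.

```python
def parse_seed_lines(lines):
    """Return the seed URLs from an iterable of lines, in order, without duplicates."""
    seeds = []
    seen = set()
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue  # skip blank lines and repeated seeds
        seen.add(url)
        seeds.append(url)
    return seeds


sample_lines = ["http://www.sina.com.cn\n", "\n", "  http://www.sina.com.cn  \n"]
seeds = parse_seed_lines(sample_lines)
print(seeds)
```

An open file object can be passed directly, for example `with open("./urls", encoding="utf-8") as f: seeds = parse_seed_lines(f)`.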

#####Requirements and notes:

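  • Command-line argument handling must be supported, specifically: -h (help), -v (version), -c (configuration file).

  • Pages must be crawled in breadth-first order.

  • A failure to fetch or parse a single page must not cause the whole program to exit. Record the cause of the error in the log and continue.

  • When all crawl tasks are complete, the program must exit gracefully.

  • When extracting links from HTML, both relative and absolute paths must be handled.

  • Pages in different character encodings (for example utf-8 or gbk) must be handled.

  • Each page is stored as a separate file, with the URL as the file name. Special characters in the URL must be escaped.

  • Multithreaded parallel crawling must be supported.

  • The code must strictly follow the Baidu Python coding standard.

  • The code should be readable and maintainable. Pay attention to the design and division of modules, classes, and functions.

  • Provide unit tests and a usage demo. The demo must be runnable, and the unit tests must be valid and pass.

  • Keep the crawl interval and the total volume of requests under control.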
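Several of the requirements above (breadth-first order with a depth limit, relative and absolute links, escaping URLs into file names, and continuing after a failed page) can be illustrated together. The following is a single-threaded sketch under stated assumptions, not the project's implementation: it assumes Python 3, all function and class names are hypothetical, and `fetch` is a caller-supplied function so the example runs against an in-memory fake site instead of the network. Threading, crawl intervals, timeouts, and encoding detection are left out.

```python
import logging
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import quote, urljoin


class LinkExtractor(HTMLParser):
    """Collect the raw href and src attribute values found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs; urljoin resolves relative paths against base_url."""
    extractor = LinkExtractor()
    extractor.feed(html)
    return [urljoin(base_url, link) for link in extractor.links]


def url_to_filename(url):
    """Percent-encode every reserved character so the URL is a safe file name."""
    return quote(url, safe="")


def crawl(seeds, fetch, max_depth, target_pattern):
    """Visit pages breadth-first and return matching URLs in visit order."""
    queue = deque((url, 0) for url in seeds)  # seeds are level 0
    seen = set(seeds)
    matched = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception as err:  # one bad page must not stop the crawl
            logging.warning("fetch failed for %s: %r", url, err)
            continue
        if target_pattern.match(url):
            matched.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return matched


# In-memory stand-in for a website (hypothetical URLs).
FAKE_SITE = {
    "http://example.com/": '<a href="a.html">A</a><img src="/img/logo.png">',
    "http://example.com/a.html": '<a href="http://example.com/deep.jpg">deep</a>',
    "http://example.com/img/logo.png": "",
    "http://example.com/deep.jpg": "",
}
pattern = re.compile(r".*.(gif|png|jpg|bmp)$")
found = crawl(["http://example.com/"], FAKE_SITE.__getitem__, 1, pattern)
print(found)
print([url_to_filename(url) for url in found])
```

With `max_depth` set to 1, the image linked from the seed page is found, while `deep.jpg`, which is only reachable at level 2, is not visited. A multithreaded version would typically replace the `deque` with a thread-safe `queue.Queue` shared by `thread_count` workers.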

mini_spider's People

Contributors

cafedeflore
