NetDiscovery

@Tony沈哲 on weibo License

Latest version

Module          netdiscovery-core  netdiscovery-extra  netdiscovery-selenium  netdiscovery-dsl
Latest version  Download           Download            Download               Download

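NetDiscovery is a crawler framework built mainly on Vert.x and RxJava 2. It is still at an early stage, and many details are being refined.

For Java projects built with Gradle, jcenter() is not enabled by default, so it must be added to the build.gradle of the relevant module: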

repositories {
    mavenCentral()
    jcenter()
}

Download:

netdiscovery-core

implementation 'com.cv4j.netdiscovery:netdiscovery-core:0.1.8'

netdiscovery-extra

implementation 'com.cv4j.netdiscovery:netdiscovery-extra:0.1.8'

netdiscovery-selenium

implementation 'com.cv4j.netdiscovery:netdiscovery-selenium:0.1.8'

netdiscovery-dsl

implementation 'com.cv4j.netdiscovery:netdiscovery-dsl:0.0.1'

NetDiscovery features:

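1. Spider

A Spider can be used on its own or added to a SpiderEngine.

Spider ships with many built-in components. For example, several downloaders are already supported; they are hot-swappable, so you can replace one at any time or write your own.

The same applies to queue, parser, and pipeline. Multiple pipelines can be registered, and they run in order.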
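The idea of running several pipelines in order can be sketched in plain Java. This is only an illustration of the pattern: `ItemPipeline` and `PipelineChain` are hypothetical names invented for this sketch, not NetDiscovery's actual classes, and the item is modelled as a simple map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a pipeline stage; not NetDiscovery's real interface.
interface ItemPipeline {
    void process(Map<String, Object> item);
}

// Hypothetical helper that runs pipelines in the order they were added.
final class PipelineChain {
    private final List<ItemPipeline> pipelines = new ArrayList<>();

    // Registration order is execution order.
    PipelineChain add(ItemPipeline pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    // Pass the same item through every registered pipeline, one after another.
    Map<String, Object> run(Map<String, Object> item) {
        for (ItemPipeline pipeline : pipelines) {
            pipeline.process(item);
        }
        return item;
    }

    public static void main(String[] args) {
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("title", "  NetDiscovery  ");

        new PipelineChain()
                .add(i -> i.put("title", ((String) i.get("title")).trim()))    // clean the field
                .add(i -> i.put("length", ((String) i.get("title")).length())) // derive a new field
                .add(i -> System.out.println(i))                               // console-style output
                .run(item);
    }
}
```

Because each stage sees the result of the previous one, the second stage here measures the trimmed title rather than the raw one.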

For debugging, use ConsolePipeline or DebugPipeline.

DebugPipeline prints log output like the following:

[Figure: DebugPipeline log output]

2. SpiderEngine

SpiderEngine manages the spiders registered in the engine, including their life cycle.

2.1 Get the status of a spider

http://localhost:{port}/netdiscovery/spider/{spiderName}

Method: GET

2.2 Get the status of all spiders in the SpiderEngine

http://localhost:{port}/netdiscovery/spiders/

Method: GET
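As a rough sketch, the two GET endpoints above can be called with the JDK's built-in `java.net.http.HttpClient` (Java 11+). The port `8080`, the spider name `demoSpider`, and the class name `SpiderStatusQuery` are assumptions made for illustration. The response format is not documented here, so the body is printed as-is.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class SpiderStatusQuery {
    // URL of a single spider's status, following the pattern in section 2.1.
    static URI spiderUri(int port, String spiderName) {
        return URI.create("http://localhost:" + port + "/netdiscovery/spider/" + spiderName);
    }

    // URL listing the status of every spider, following the pattern in section 2.2.
    static URI allSpidersUri(int port) {
        return URI.create("http://localhost:" + port + "/netdiscovery/spiders/");
    }

    static HttpRequest get(URI uri) {
        return HttpRequest.newBuilder(uri).GET().build();
    }

    public static void main(String[] args) throws Exception {
        int port = 8080;                 // assumption: use the port your SpiderEngine listens on
        String spiderName = "demoSpider"; // assumption: a spider registered in the engine

        HttpClient client = HttpClient.newHttpClient();
        for (URI uri : new URI[] {spiderUri(port, spiderName), allSpidersUri(port)}) {
            HttpResponse<String> response =
                    client.send(get(uri), HttpResponse.BodyHandlers.ofString());
            System.out.println(uri + " -> " + response.statusCode());
            System.out.println(response.body());
        }
    }
}
```

Running `main` requires a SpiderEngine listening on the chosen port. Spider names containing spaces or other reserved characters would need URL encoding first.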

2.3 Change the status of a spider

http://localhost:{port}/netdiscovery/spider/{spiderName}/status

Method: POST

Parameters:

{
    "status":2   // pause the spider
}

status  Effect
2       Pause the spider
3       Resume the spider from a pause
4       Stop the spider
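A status change can be sketched with the same JDK `HttpClient`. The port, the spider name, the class name `SpiderStatusUpdate`, and the `Content-Type: application/json` header are assumptions; only the URL pattern, the `status` field, and the values 2, 3, and 4 come from the description above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class SpiderStatusUpdate {
    // Status values as listed in the table above.
    static final int PAUSE = 2;
    static final int RESUME = 3;
    static final int STOP = 4;

    // JSON body with the single "status" field.
    static String statusBody(int status) {
        return "{\"status\":" + status + "}";
    }

    static HttpRequest statusRequest(int port, String spiderName, int status) {
        URI uri = URI.create(
                "http://localhost:" + port + "/netdiscovery/spider/" + spiderName + "/status");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json") // assumption: the engine expects JSON
                .POST(HttpRequest.BodyPublishers.ofString(statusBody(status)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        int port = 8080;                  // assumption: use the port your SpiderEngine listens on
        String spiderName = "demoSpider"; // assumption: a spider registered in the engine

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                statusRequest(port, spiderName, PAUSE),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Sending `RESUME` afterwards should take the spider out of the paused state, and `STOP` ends it, per the table.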

How NetDiscovery works:

1. Basic principle

2. Cluster principle

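Examples:

  • user-agent-list: collects the user agents of common browsers
  • Reply with a cryptocurrency keyword to the WeChat official account "Java与Android技术栈" to get that currency's latest price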

TODO:

  1. Integrate cv4j and Tesseract to add OCR support
  2. Add Elasticsearch support

Contact:

QQ discussion group: 490882934

