
Webster


Overview

Webster is a reliable web crawling and scraping framework written with Node.js, used to crawl websites and extract structured data from their pages.

What sets Webster apart from other crawling frameworks is that it can scrape content rendered by client-side JavaScript and AJAX requests.

Quick Start

Let's start with a simple crawl request against Google:

docker pull zhuyingda/webster-playground

docker run --tty -e URL="https://www.google.com/robots.txt" zhuyingda/webster-playground node crawler.js

# add cookie with sign-in session
docker run --tty -e MOD=debug -e URL="https://www.google.com/robots.txt" -e Cookie="foo=1234; bar=abcd" zhuyingda/webster-playground node crawler.js

# set user-agent
docker run --tty -e URL="https://www.google.com/robots.txt" -e Cookie="foo=1234; bar=abcd" -e UA="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" zhuyingda/webster-playground node crawler.js

# see crawling log
docker run --tty -e MOD=debug -e URL="https://www.google.com/robots.txt" -e Cookie="foo=1234; bar=abcd" -e UA="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" zhuyingda/webster-playground node crawler.js

Requirements

  • Node.js 10.x+
  • Works on Linux and macOS

Alternatively, you can deploy with Docker.

Install

npm install webster

Single spider example

const { spider } = require('webster');

class MySpider extends spider {
    get defUserAgent() {
        // default User-Agent header sent with every request
        return 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36';
    }
    get defDeviceType() {
        return 'pc';
    }
    async parseHtml(html) {
        // inspect the raw HTML here; return true to accept the page
        return true;
    }
}

(async () => {
    // avoid shadowing the imported `spider` class with the instance name
    const mySpider = new MySpider({
        actions: [
            {
                // wait until this selector appears before extracting targets
                type: 'waitForSelector',
                selector: 'div.js-details-container',
            }
        ],
        targets: [
            {
                // extract the text of every matching row into the `sugs` field
                selector: 'div.Box-row[role=row]',
                type: 'text',
                field: 'sugs'
            }
        ],
    });
    const url = 'https://github.com/zhuyingda/webster';
    const crawlResult = await mySpider.startRequest(url);
    console.log(crawlResult);
})();
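The `parseHtml` override above simply returns `true`. In practice you would inspect the raw HTML before accepting it. The following is a standalone, framework-independent sketch of the kind of check such an override might perform (the helper name and heuristics are hypothetical, not part of webster's API):

```javascript
// Hypothetical helper illustrating a sanity check a parseHtml override
// might run before webster extracts the configured targets.
function looksLikeFullPage(html) {
  // Reject truncated or blocked responses so the crawl can be retried
  // instead of silently yielding empty target fields.
  if (typeof html !== 'string' || html.length === 0) return false;
  if (!/<\/html>\s*$/i.test(html.trim())) return false; // truncated page
  if (/captcha/i.test(html)) return false;              // crude anti-bot detection
  return true;
}

console.log(looksLikeFullPage('<html><body>ok</body></html>')); // true
console.log(looksLikeFullPage('<html><body>oops'));             // false
```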

Docker cluster example

Pull the example docker image:

docker pull zhuyingda/webster-demo
docker run -it zhuyingda/webster-demo

The image contains a simple example that can run as a cluster:

// producer
const Webster = require('webster');
const Producer = Webster.producer;
const Task = Webster.task;

let task = new Task({
    spiderType: 'browser',
    engineType: 'playwright',
    browserType: 'chromium',
    url: 'http://quotes.toscrape.com/tag/humor/',
    targets: [
        {
            selector: 'span.text',
            type: 'text',
            field: 'quote'
        },
        {
            selector: 'li.next > a',
            type: 'attr',
            attrName: 'href',
            field: 'link'
        }
    ],
    actions: [
        {
            type: 'waitAfterPageLoading',
            value: 500
        }
    ],
    referInfo: {
        para1: 'this is a refer field 1',
        para2: 'this is a refer field 2'
    }
});

let myProducer = new Producer({
    channel: 'demo_channel1',
    dbConf: {
        redis: {
            // replace with your own Redis instance; never commit real credentials
            host: '<your-redis-host>',
            port: 6379,
            password: '<your-redis-password>'
        }
    }
});
myProducer.generateTask(task).then(() => {
    console.log('done');
    process.exit();
});
// consumer
const Webster = require('webster');
const Consumer = Webster.consumer;

class MyConsumer extends Consumer {
    constructor(option) {
        super(option);
    }
    afterCrawlRequest(result) {
        console.log('your scrape result:', result);
    }
}

let myConsumer = new MyConsumer({
    channel: 'demo_channel1',
    sleepTime: 5000,
    deviceType: 'pc',
    dbConf: {
        redis: {
            // replace with your own Redis instance; never commit real credentials
            host: '<your-redis-host>',
            port: 6379,
            password: '<your-redis-password>'
        }
    }
});
myConsumer.startConsume();
Run the producer once to enqueue the task, then start the consumer:

node demo_producer.js
env MOD=debug node demo_consumer.js

You can organize your crawler cluster with Producers and Consumers in this way.
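Conceptually, the Producer serializes each task onto a shared channel (backed by a Redis list in the real setup) and each Consumer pops and processes tasks from it. The flow can be simulated without Redis; the names below are illustrative, not webster's internals:

```javascript
// In-memory simulation of the Producer/Consumer channel. The real
// implementation uses a Redis list; this sketch only shows the flow.
const channel = [];

function produce(task) {
  // Producer.generateTask analogue: serialize and enqueue the task.
  channel.push(JSON.stringify(task));
}

function consume(handleResult) {
  // Consumer loop analogue: pop one task and hand it to the callback.
  const raw = channel.shift();
  if (raw === undefined) return false; // queue empty: sleep and retry
  handleResult(JSON.parse(raw));
  return true;
}

produce({ url: 'http://quotes.toscrape.com/tag/humor/' });
consume(task => console.log('consumed', task.url));
```

Because the channel is just a queue keyed by name, any number of producers and consumers can share it, which is what makes the cluster horizontally scalable.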

Usage on Raspbian Platform

sudo apt install chromium-browser chromium-codecs-ffmpeg
env MOD=debug EXE_PATH=/usr/bin/chromium-browser node demo_consumer.js

Documentation

See the documentation for more details.

Code Contributors

This project exists thanks to all the people who contribute. [Contribute].

Financial Contributors

Become a financial contributor and help us sustain our community. [Contribute]

Individuals

Organizations

Support this project with your organization. Your logo will show up here with a link to your website. [Contribute]

License

GPL-V3

Copyright (c) 2017-present, Yingda (Sugar) Zhu

webster's People

Contributors

monkeywithacupcake, zhuyingda


webster's Issues

Features I'd like the crawler to support

  • Settings
    • Anti-anti-crawling
      • user agent
      • disable cookies
      • interval between page fetches
      • IP pool; switch IPs dynamically when one gets banned
    • Crawl efficiency
      • skip URLs that have already been crawled
      • distributed crawling
  • Before fetching a page
    • support a callback, typically used for logging in
  • After receiving the page response
    • detect the page encoding to avoid garbled text
    • support a callback
  • Extracting page data
    • support CSS selectors and XPath for parsing HTML
    • set cookies
    • simulate user actions
  • Dynamically add pages to crawl
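One item in the wishlist above, skipping already-crawled URLs, can be sketched with a simple seen-set keyed on normalized URLs. This is illustrative only and not part of webster's current API:

```javascript
// Minimal URL de-duplication sketch (not webster API): normalize URLs
// before checking so trivial variants collapse to one entry.
class UrlDeduper {
  constructor() {
    this.seen = new Set();
  }
  normalize(url) {
    const u = new URL(url); // WHATWG URL parser, built into Node.js
    u.hash = '';            // fragments never reach the server
    return u.toString();
  }
  // Returns true the first time a URL is offered, false afterwards.
  offer(url) {
    const key = this.normalize(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

const dedup = new UrlDeduper();
console.log(dedup.offer('http://quotes.toscrape.com/tag/humor/'));     // true
console.log(dedup.offer('http://quotes.toscrape.com/tag/humor/#top')); // false
```

For a distributed crawl, the same idea would move into shared storage (e.g. a Redis set) so all consumers see one deduplication state.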

Extract Consumer.browserRequest into a single-purpose method

browserRequest is meant to issue the network request, so consider moving
1. the deviceType check, and
2. the parseHtml parsing step
out of browserRequest, leaving it responsible only for making the request.
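The separation proposed in this issue could look roughly like the sketch below. All names are hypothetical; this is not webster's actual code, only an illustration of the suggested single-responsibility split:

```javascript
// Sketch of the proposed split: the request function only fetches,
// and device selection / parsing happen outside it.
async function browserRequest(url, fetchImpl) {
  // Single responsibility: issue the request and return the raw HTML.
  return fetchImpl(url);
}

function pickDeviceProfile(deviceType) {
  // Device handling pulled out of browserRequest.
  return deviceType === 'mobile' ? { width: 375 } : { width: 1280 };
}

async function crawl(url, deviceType, fetchImpl, parseHtml) {
  const profile = pickDeviceProfile(deviceType);     // 1. device decision
  const html = await browserRequest(url, fetchImpl); // 2. pure request
  return { profile, data: parseHtml(html) };         // 3. parsing outside
}

// Demo with a stub fetcher standing in for a real browser.
crawl('http://example.com', 'pc',
      async () => '<p>hi</p>',
      html => html.replace(/<[^>]+>/g, ''))
  .then(r => console.log(r.data)); // prints "hi"
```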


Your Redis credentials are leaked!!

let myConsumer = new MyConsumer({
    channel: 'baidu',
    sleepTime: 5000,
    deviceType: 'pc',
    dbConf: {
        redis: {
            host: 'redis-15455.c80.us-east-1-2.ec2.cloud.redislabs.com',
            port: 15455,
            password: 'L7hfNRGniDYdSZxJpCmdDtafqEsDxpaN'
        }
    }
});
myConsumer.startConsume();
