
spider-rs

The spider project ported to Node.js

Getting Started

  1. npm i @spider-rs/spider-rs --save
import { Website, pageTitle } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withHeaders({
    authorization: "somerandomjwt",
  })
  .withBudget({
    "*": 20, // limit the whole website to a max of 20 pages
    "/docs": 10, // limit the `/docs` path to 10 pages
  })
  .withBlacklistUrl(["/resume"]) // regex or pattern matching to ignore paths
  .build();

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page); // comment out to increase performance if title not needed
  console.info(`Title of ${page.url} is '${title}'`);
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  });
};

await website.crawl(onPageEvent);
await website.exportJsonlData("./storage/rsseau.jsonl");
console.log(website.getLinks());
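
The exported JSONL file should contain one JSON object per line in the shape passed to pushData above. A minimal sketch for reading it back with Node's readline, assuming the crawl above has already produced ./storage/rsseau.jsonl:

import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const rl = createInterface({
  input: createReadStream("./storage/rsseau.jsonl"),
  crlfDelay: Infinity,
});

for await (const line of rl) {
  if (!line.trim()) continue;
  // each record has the shape pushed above: { status, html, url, title }
  const record = JSON.parse(line);
  console.log(record.url, record.title);
}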

Collect the resources for a website.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withBudget({
    "*": 20,
    "/docs": 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(["/resume"])
  .build();

await website.scrape();
console.log(website.getPages());
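
The collected pages can then be post-processed on the main thread. A small sketch building a url → title map with pageTitle, assuming getPages() returns the same page objects the event handlers receive:

import { Website, pageTitle } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr").build();
await website.scrape();

// map each collected url to its document title
const titles = new Map();

for (const page of website.getPages()) {
  titles.set(page.url, pageTitle(page));
}

console.log(titles);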

Run the crawls in the background on another thread.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");

const onPageEvent = (_err, page) => {
  console.log(page);
};

await website.crawl(onPageEvent, true);
// the second param sends the crawl to a background thread, so this call returns immediately
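
Because the call returns right away, results arrive through the event handler while the main thread keeps working. A rough sketch of collecting them, where the fixed 5-second delay is only a stand-in for real coordination logic:

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");
const visited = [];

const onPageEvent = (_err, page) => {
  visited.push(page.url);
};

// second param: run the crawl on a background thread
await website.crawl(onPageEvent, true);

// do other work here; events keep arriving in the meantime
await new Promise((resolve) => setTimeout(resolve, 5000));
console.log(`collected ${visited.length} pages so far`);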

Use headless Chrome rendering for crawls.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr").withChromeIntercept(
  true,
  true,
);

const onPageEvent = (_err, page) => {
  console.log(page);
};

// the third param determines headless Chrome usage.
await website.crawl(onPageEvent, false, true);
console.log(website.getLinks());

Cron jobs can be set up as follows.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://choosealicense.com").withCron(
  "1/5 * * * * *",
);
// sleep function to test cron
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop());
    }, time);
  });
};

const links = [];

const onPageEvent = (err, value) => {
  links.push(value);
};

const handle = await website.runCron(onPageEvent);

// stop the cron in 4 seconds
await stopCron(4000, handle);
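
Once the handle has been stopped, everything the scheduler collected is available for inspection. A tiny follow-up, assuming each value handed to the handler is a page object with a url field as in the other examples:

// count the unique urls collected across cron runs
const uniqueUrls = new Set(links.map((page) => page.url));
console.log(`${uniqueUrls.size} unique urls collected`);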

Use the crawl shortcut to get the page content and URL.

import { crawl } from "@spider-rs/spider-rs";

const { links, pages } = await crawl("https://rsseau.fr");
console.log(pages);
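
The shortcut returns the discovered links plus the fetched pages in one call. A brief sketch filtering for error responses, assuming the returned page objects expose the same url and statusCode fields as the event-handler pages above:

import { crawl } from "@spider-rs/spider-rs";

const { links, pages } = await crawl("https://rsseau.fr");

// report any pages that did not come back with a success status
const failed = pages.filter((page) => page.statusCode >= 400);

console.log(`crawled ${links.length} links, ${failed.length} error pages`);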

Benchmarks

View the benchmarks to see a breakdown between libs and platforms.

Test url: https://espn.com

library           | pages crawled | time
spider (Rust)     | 150,387       | 1m
spider (Node.js)  | 150,387       | 153s
spider (Python)   | 150,387       | 186s
scrapy (Python)   | 49,598        | 1h
crawlee (Node.js) | 18,779        | 30m

The benchmarks above were run on an Apple M1 Mac; on Linux ARM machines, spider performs about 2-10x faster.

Development

Install the napi CLI: npm i @napi-rs/cli --global.

  1. yarn build:test

