screaming-puppeteer

Simple web spider to capture URLs of a domain

Why

Every SEO knows Screaming Frog, and I have always been a big fan. It's great for crawling a site, but it captures a lot of data that I do not need. Yes, you can customize it per crawl, but I wanted a simple Screaming Frog "light" version that uses Puppeteer, hence Screaming Puppeteer.

Problem

I have a large list of domains that are small sites (under 10k URLs) but have no sitemap. I wanted to generate a list of URLs that can be converted to a sitemap later on.
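The later conversion to a sitemap is not part of this repo. As a rough sketch of what it could look like, the snippet below turns newline-separated URLs into a minimal sitemap document. The function names `escapeXml` and `toSitemap` are made up for illustration, and only the required `<loc>` element is emitted.

```javascript
// Escape the five characters that are reserved in XML text.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build a minimal sitemap from newline-separated URLs (hypothetical helper).
function toSitemap(urlListText) {
  const urls = urlListText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // ignore blank lines

  const entries = urls.map(
    (url) => `  <url><loc>${escapeXml(url)}</loc></url>`
  );

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```

To use it, read the crawler's text file with `fs.readFileSync(path, 'utf8')`, pass the contents to `toSitemap`, and write the result to a `.xml` file.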

Code

There are 2 files in this repo:

  • cluster.js
  • crawler.js

The two code bases are very similar. crawler.js is a single-threaded Node app (use it for sites with fewer than 10,000 URLs), while cluster.js uses concurrency to do the same thing faster, given more hardware.

There are pros and cons to each, but I wanted to showcase both, as they are about 100 lines of code and the only real dependencies are puppeteer and puppeteer-cluster.

The output can easily be removed, but I wanted a simple way to see the URL being crawled, the status code, the page title (and its character length), and the meta description for the page. Additionally, the time the run took is output at the end, above the list of URLs.
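For illustration, the per-page fields described above could be gathered roughly as follows. This is a sketch, not the code from crawler.js or cluster.js: `inspectPage` is a hypothetical name, `page` is assumed to be a Puppeteer `Page`, and the `domcontentloaded` wait condition is an assumption.

```javascript
// Collect the fields printed for each page (hypothetical helper).
// `page` is assumed to be a Puppeteer Page instance.
async function inspectPage(page, url) {
  // page.goto resolves to the main response, or null in some edge cases
  const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
  const title = await page.title();

  // $eval rejects when no element matches, so fall back to an empty string
  const description = await page
    .$eval('meta[name="description"]', (el) => el.getAttribute('content') || '')
    .catch(() => '');

  return {
    url,
    status: response ? response.status() : null,
    title,
    titleLength: title.length,
    description,
  };
}
```

Each returned object can then be printed with `console.log`, or ignored if you only want the URL list.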

What

You provide a seed URL. The script fetches it, collects all links (href) on the page, feeds them into a queue, and repeats. The final output is a text file with one URL per line for the hostname you crawled.
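The loop described above can be sketched as a breadth-first crawl. This is a simplified illustration, not the repo's code: `crawl`, `fetchLinks`, and `maxPages` are hypothetical names. `fetchLinks` stands in for the Puppeteer step and is assumed to resolve to the raw href values found on a page.

```javascript
// Breadth-first crawl restricted to the seed URL's hostname.
// `fetchLinks(url)` is a hypothetical callback returning the hrefs on a page.
async function crawl(seedUrl, fetchLinks, maxPages = 10000) {
  const seed = new URL(seedUrl);
  const seen = new Set([seed.href]); // every URL discovered so far
  const queue = [seed.href];         // URLs still waiting to be fetched
  let fetched = 0;

  while (queue.length > 0 && fetched < maxPages) {
    const current = queue.shift();
    fetched += 1;

    let hrefs = [];
    try {
      hrefs = await fetchLinks(current);
    } catch (err) {
      continue; // one failed page should not stop the crawl
    }

    for (const href of hrefs) {
      let next;
      try {
        next = new URL(href, current); // resolve relative links
      } catch {
        continue; // skip malformed hrefs
      }
      next.hash = ''; // treat /page and /page#section as the same URL
      if (!/^https?:$/.test(next.protocol)) continue;  // skip mailto:, tel:, etc.
      if (next.hostname !== seed.hostname) continue;   // stay on the seed host
      if (seen.has(next.href)) continue;               // already discovered
      seen.add(next.href);
      queue.push(next.href);
    }
  }

  // Includes discovered URLs that were never fetched if maxPages was reached
  return [...seen];
}
```

With Puppeteer, `fetchLinks` could load the page and return `page.$$eval('a[href]', (links) => links.map((a) => a.getAttribute('href')))`. Joining the result with `'\n'` and writing it with `fs.writeFileSync` gives the one-URL-per-line text file.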

Install

npm install

Run

Be sure to change the URL in cluster.js or crawler.js

node crawler.js

OR

node cluster.js

Enjoy ☕

If you have any feedback, feel free to hit me up @johnmurch on Twitter.
