Playwright Sitemap Crawler

Playwright Sitemap Crawler is a Node.js script designed for visual regression testing and for warming proxy caches across all pages in a sitemap.

Visual Regression Testing

This tool can be used to detect and inspect visual discrepancies after making changes to a website, for instance after

  • updating content or styling
  • updating plugins or dependencies
  • changing settings
  • adding custom CSS or JavaScript code

NOTE: Since Playwright runs "real" browsers, pages opened with this script will execute JavaScript. Each page load will therefore register in website analytics software such as Google Analytics, Fathom, Plausible, Cloudflare Analytics, MicroAnalytics and similar, so a single run can add hundreds of visits to your statistics.

It's strongly recommended to run these scripts on a website running locally on your own machine or a staging site.

You're welcome to make PRs to block Playwright from making requests to various website analytics tools.

Cache Warming/Priming

This tool can also be used to warm proxy caches. Use it for more advanced cache warming where every file should end up in the proxy cache, not just the HTML generated by something like PHP.

NOTE: All files (JS, CSS, images etc.) that make up each page will be downloaded, not just the HTML. To warm only the HTML, use a sitemap warmer that does not use real browsers.

Installation

To get started, follow these installation steps:

  1. Clone the repository:

    git clone https://github.com/robman87/playwright-sitemap-crawler.git
  2. Change to the project directory:

    cd playwright-sitemap-crawler
  3. Install the project dependencies and browser runtimes for Playwright:

    npm install
    npx playwright install
  4. Copy .env.example and save it as .env

    cp .env.example .env
  5. Open the .env file in your preferred code editor, enter the URL of your sitemap file, and save the file.

    SITEMAP_URL=http://localhost:8080/sitemap.xml

Usage

Visual Regression Testing

Visual regression testing is a quality assurance technique used in web development to detect unintended visual changes between different versions of a web page. It involves comparing screenshots of a web page before and after a change is made and identifying any visual discrepancies. These changes can include variations in layout, styling and content.

What the script does

Your sitemap will be downloaded and a list of all links in the sitemap will be created. Each page in the list will be opened in one or several browsers. After a page has been loaded, a full page screenshot will be saved in tests/visual-regression-sitemap.spec.js-snapshots. It will then be compared with the previous screenshot, if one exists, and a diff image will be saved.

The first time the script is run, baseline images will be created and the test will fail for each page. This is expected, since there are no images to compare against yet.

Then you perform your desired changes: modify CSS, add content, update plugins etc.

The second time the script is run, after you have made your changes, another set of full page screenshots will be created. Since there are now two sets of screenshots, diff images will be created as well.

For each page in the sitemap there should now be 3 images.
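The flow described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual spec file: the helper names `extractSitemapUrls`, `snapshotNameFor` and `screenshotPage` are hypothetical, the regex-based parsing assumes a plain sitemap with `<loc>` entries (no sitemap index), and `screenshotPage` expects the `page` and `expect` objects provided by `@playwright/test`.

```javascript
// Hypothetical helpers; the names are assumptions, not the repository's code.

// Pull every <loc> URL out of a plain XML sitemap (sitemap indexes are not handled).
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;", so undo that before using the URL.
    urls.push(match[1].replace(/&amp;/g, '&'));
  }
  return urls;
}

// Turn a page URL into a file-system-friendly snapshot name.
function snapshotNameFor(url) {
  const { pathname } = new URL(url);
  const slug = pathname
    .replace(/^\/+|\/+$/g, '')
    .replace(/[^a-zA-Z0-9]+/g, '-');
  return `${slug || 'home'}.png`;
}

// Open the page and compare a full page screenshot against the stored baseline.
// `page` and `expect` are assumed to come from @playwright/test.
async function screenshotPage(page, expect, url) {
  await page.goto(url, { waitUntil: 'networkidle' });
  await expect(page).toHaveScreenshot(snapshotNameFor(url), { fullPage: true });
}
```

On a first run `toHaveScreenshot` has no baseline to compare against, so it writes one and reports the test as failed, which matches the behaviour described above.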

Run tests

To perform visual regression testing using a single browser, use the command "visual:{browser}" as in the examples below.

Desktop Chrome

npm run visual:chrome 

Desktop Firefox

npm run visual:firefox 

Desktop Safari

npm run visual:safari 

Mobile Safari (iPhone 11)

npm run visual:iphone 

Run tests for all devices (warning: lots of images will be generated)

npm run visual 

For more advanced configuration use the playwright command directly.

playwright test --config=tests/visual.config.js --workers=2

Running tests for other devices

Open tests/visual-config.js in your preferred code editor (Visual Studio, vim, Notepad++) and add more devices to the projects array:

projects: [
    // ...
    {
        name: 'Mobile landscape', // any name you like
        use: { ...devices['iPhone 14 Pro Max landscape'] } // lots of devices supported, affects mostly screen width
    },
    // ...
]

Analyzing results

Playwright will generate a report and serve it on localhost. After the tests have finished, the URL of the report will be displayed in the terminal. Open it in your browser to compare the 3 images (baseline screenshot, screenshot after changes, and the diff image) for each page.

If all tests pass, there were no differences between the screenshots, meaning your changes to the site didn't affect anything visually. This is the desired outcome after something like WordPress plugin updates or dependency updates.

Failed tests mean that there are visual changes. This is expected after making content or styling changes. If tests pass after such changes, either your changes were not applied or your pages are cached by your host or server.

Warming Cache

To warm FastCGI or proxy cache, run the following command:

npm run warm:cache

Contributing

I welcome contributions from the community! If you'd like to contribute to Playwright Sitemap Crawler, please follow these guidelines:

  1. Fork the repository on GitHub.

  2. Create a new branch for your feature or bug fix:

    git checkout -b feature/new-feature
  3. Make your changes and commit them:

    git commit -m "Add new feature"
  4. Push your branch to your fork:

    git push origin feature/new-feature
  5. Open a pull request on GitHub, explaining your changes and their purpose.

Feature Requests

If you have any feature requests or suggestions, please open an issue on the GitHub repository.

TODO

Support passing a CSV file to script instead of a sitemap

Support passing the path to a CSV file instead of a URL to an XML sitemap.
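One way this could look is sketched below. It is only an illustration of the idea: the function names `parseUrlList` and `readUrlsFromCsvFile` are hypothetical, and the sketch assumes the URL sits in the first column and contains no commas.

```javascript
const fs = require('node:fs');

// Hypothetical helper: keep the first column of every row that looks like a URL.
// Header rows and blank lines are dropped because they do not start with http(s)://.
function parseUrlList(csvText) {
  return csvText
    .split(/\r?\n/)
    .map((line) => line.split(',')[0].trim().replace(/^"|"$/g, ''))
    .filter((cell) => /^https?:\/\//i.test(cell));
}

// Hypothetical helper: read the CSV from disk and return the list of page URLs.
function readUrlsFromCsvFile(filePath) {
  return parseUrlList(fs.readFileSync(filePath, 'utf8'));
}
```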

Block website analytics requests

Make sure page views are not registered in website analytics services. Detect when analytics libraries or scripts are requested and abort those requests in Playwright, the way ad-blockers do. Also abort requests to known URLs used by analytics services, in case the analytics JS files are self-hosted.

  • Google Analytics
  • Fathom
  • Plausible
  • Cloudflare Analytics
  • MicroAnalytics
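A possible starting point, using Playwright's request routing, is sketched below. The host list is an illustrative assumption and should be checked against each service's documentation; `isAnalyticsRequest` and `blockAnalytics` are hypothetical names, and `page` is assumed to be a Playwright `Page`.

```javascript
// Illustrative host names only (assumptions); verify them before relying on this list.
const ANALYTICS_HOSTS = [
  'google-analytics.com',
  'googletagmanager.com',
  'usefathom.com',
  'plausible.io',
  'cloudflareinsights.com',
  'microanalytics.io',
];

// True when the request targets one of the listed hosts or any of its subdomains.
function isAnalyticsRequest(url) {
  const { hostname } = new URL(url);
  return ANALYTICS_HOSTS.some(
    (host) => hostname === host || hostname.endsWith(`.${host}`)
  );
}

// Abort analytics requests and let everything else through.
async function blockAnalytics(page) {
  await page.route('**/*', (route) =>
    isAnalyticsRequest(route.request().url()) ? route.abort() : route.continue()
  );
}
```

Matching on the destination host also covers the case where the tracking script itself is self-hosted, as long as it still reports to one of the listed hosts.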

Add support for command line parameters

  • Use command line arguments as primary values and env variables as fallback values.
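The lookup order could be sketched like this. The `--name=value` flag style and the helper name `resolveOption` are assumptions made for illustration; only the SITEMAP_URL variable comes from the existing .env file.

```javascript
// Hypothetical helper: prefer a --name=value argument, fall back to NAME in the environment.
// "sitemap-url" maps to the env variable SITEMAP_URL.
function resolveOption(name, argv = process.argv.slice(2), env = process.env) {
  const flag = `--${name}=`;
  const arg = argv.find((value) => value.startsWith(flag));
  if (arg !== undefined) {
    return arg.slice(flag.length);
  }
  const envName = name.toUpperCase().replace(/-/g, '_');
  return env[envName];
}

// Example: command line value if given, otherwise the value from .env / the environment.
const sitemapUrl = resolveOption('sitemap-url');
```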

Visual regression testing

Reduce the time to run visual regression tests.

Playwright can spawn several browsers and multiple instances of each browser when running tests. Downloaded files are kept in memory only, to keep the environment pristine between tests. This means all files are downloaded multiple times, once for each browser and context (incognito/private window).

This is not necessary when performing visual regression tests. Most static files (images, JS, CSS) will not have changed, so downloading them again wastes resources.

Reuse page context to minimize server communication

This method would be like opening each URL in the same tab, one after another, so that already downloaded files are reused from memory. This would only help within a single run; all files would still be downloaded again between separate runs.
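A minimal sketch of that idea follows. The function name `visitSequentially` and the `onLoaded` callback are hypothetical, and `page` is assumed to be a single Playwright `Page` that stays open for the whole loop.

```javascript
// Hypothetical helper: open every URL in the same page, one after another,
// so that the browser context (and whatever it has cached) is shared.
async function visitSequentially(page, urls, onLoaded = async () => {}) {
  const visited = [];
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle' });
    // e.g. take the screenshot for this URL here
    await onLoaded(page, url);
    visited.push(url);
  }
  return visited;
}
```

Note that Playwright Test creates a fresh `page` fixture for every test, so a loop like this would have to run inside one test (or use a page created in a `beforeAll` hook). How much is actually reused also depends on the cache headers the server sends.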

Reduce communication with server by caching files on disk

Caching files to disk would reduce server load and bandwidth usage between separate runs. Multiple browsers and all instances can reuse the same files. Each file would be downloaded once; after that, every request for an unchanged file would just get a 304 response from the server.

  • Intercept requests for files with a file-extension (js, css, jpg, png, woff etc.)
  • Check if file exists in disk cache
    • File does not exist in disk cache
      • Download file
      • Save response body to disk cache
      • Return completed request
    • File exists in disk cache
      • Get last modified time of file
      • Create request config with 'If-Modified-Since' header
      • Make request for file to server with extra header
        • Check response status
          • If status 304, file has not changed and was not downloaded again.
            • Return response with status 200 and body content from disk cache
          • If status 200, file has changed and was downloaded
            • Save response body content to disk cache
            • Return response with status 200 and freshly downloaded body content
          • Else throw error (status 404, 500 etc.)
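The steps above could be sketched as a Playwright route handler along these lines. This is an untested outline under stated assumptions, not an existing feature of the script: `cachePathFor` and `handleWithDiskCache` are hypothetical names, the extension list is illustrative, and `route` is assumed to be a Playwright `Route` (its `fetch`, `fulfill` and `continue` methods are used).

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');
const crypto = require('node:crypto');

// Illustrative list of static file extensions to cache (an assumption).
const STATIC_FILE = /\.(js|css|jpe?g|png|gif|svg|webp|woff2?)$/i;

// Map a URL to a cache file; the extension is kept so the content type can be inferred.
function cachePathFor(url, cacheDir) {
  const hash = crypto.createHash('sha1').update(url).digest('hex');
  return path.join(cacheDir, hash + path.extname(new URL(url).pathname));
}

async function handleWithDiskCache(route, cacheDir) {
  const request = route.request();
  const url = request.url();
  // Only intercept GET requests for files with a known static extension.
  if (request.method() !== 'GET' || !STATIC_FILE.test(new URL(url).pathname)) {
    return route.continue();
  }

  const file = cachePathFor(url, cacheDir);
  const stat = await fs.stat(file).catch(() => null); // null: not cached yet
  const headers = { ...request.headers() };
  if (stat) {
    // Cached copy exists: ask the server whether it changed since we saved it.
    headers['if-modified-since'] = stat.mtime.toUTCString();
  }

  const response = await route.fetch({ headers });
  if (response.status() === 304 && stat) {
    // Not modified: answer with status 200 and the body from the disk cache.
    return route.fulfill({ status: 200, path: file });
  }
  if (response.status() === 200) {
    // New or changed file: store it, then return the fresh body.
    const body = await response.body();
    await fs.mkdir(cacheDir, { recursive: true });
    await fs.writeFile(file, body);
    return route.fulfill({ response, body });
  }
  // Follows the outline above: any other status (404, 500 etc.) is an error.
  throw new Error(`Unexpected status ${response.status()} for ${url}`);
}
```

It could be registered with something like `page.route('**/*', (route) => handleWithDiskCache(route, '.cache'))`, where the `.cache` directory name is again only an example.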

Another possibility could be to use a proxy that caches the files on disk, reducing the complexity of the Playwright tests. The steps above would then be performed by the proxy instead. Maybe there's something out there that does this already; please let me know if you have any suggestions ^_^

License

This project is licensed under the MIT License, use freely for fun and profit.
