Playwright Sitemap Crawler

Playwright Sitemap Crawler is a Node.js script for visual regression testing and proxy cache warming across all pages listed in a sitemap.

Visual Regression Testing

This tool can be used to detect and inspect visual discrepancies after making changes to a website, for instance after:

updating content or styling
updating plugins or dependencies
changing settings
adding custom CSS or JavaScript code

NOTE: Since Playwright runs "real" browsers, pages opened with this script will execute JavaScript. Each page load will therefore register in website analytics software such as Google Analytics, Fathom, Plausible, Cloudflare Analytics, MicroAnalytics and similar, so a single run can add hundreds of visits to your analytics.

It's strongly recommended to run these scripts on a website running locally on your own machine or a staging site.

You're welcome to make PRs to block Playwright from making requests to various website analytics tools.

Cache Warming/Priming

This tool can also be used to warm proxy caches. Use it for more advanced cache warming where all files should be cached in a proxy cache, not just the HTML generated by something like PHP.

NOTE: All files (JS, CSS, images etc.) that make up each page will be downloaded, not just the HTML. To warm only the HTML, use a sitemap warmer that does not use real browsers.

Installation

To get started, follow these installation steps:

Clone the repository:

git clone https://github.com/robman87/playwright-sitemap-crawler.git

Change to the project directory:
```
cd playwright-sitemap-crawler
```
Install the project dependencies and browser runtimes for Playwright:
```
npm install
npx playwright install
```
Copy .env.example and save it as .env
```
cp .env.example .env
```
Open the .env file in your preferred code editor, enter the URL to your sitemap file, and save the file.
```
SITEMAP_URL=http://localhost:8080/sitemap.xml
```

Usage

Visual Regression Testing

Visual regression testing is a quality assurance technique used in web development to detect unintended visual changes between different versions of a web page. It involves comparing screenshots of a web page before and after a change is made and identifying any visual discrepancies. These changes can include variations in layout, styling and content.
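To make the idea of "comparing screenshots" concrete, here is a minimal sketch that counts differing pixels between two images. It assumes both screenshots are already decoded into raw RGBA byte arrays of the same size. The function name and the tolerance parameter are made up for illustration, and Playwright's own comparison is more sophisticated than this.

```javascript
// Count pixels whose RGBA channels differ by more than `tolerance`.
// Both inputs hold raw RGBA bytes (4 per pixel), for example decoded PNG data.
function countDifferentPixels(before, after, tolerance = 0) {
    if (before.length !== after.length) {
        throw new Error('Screenshots must have identical dimensions');
    }
    let different = 0;
    for (let i = 0; i < before.length; i += 4) {
        for (let channel = 0; channel < 4; channel += 1) {
            if (Math.abs(before[i + channel] - after[i + channel]) > tolerance) {
                different += 1;
                break; // count each pixel only once
            }
        }
    }
    return different;
}
```

A result of zero means the two screenshots are identical within the chosen tolerance; anything above zero is a visual change worth inspecting.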

What the script does

Your sitemap will be downloaded and a list of all links in the sitemap will be created. Each page in the list will be opened in one or several browsers. After a page has been loaded, a full-page screenshot will be saved in tests/visual-regression-sitemap.spec.js-snapshots. It will then be compared with a previous screenshot if one exists, and a diff image will be saved.

The first time the script is run, baseline images will be created and the test will fail for each page. This is expected, since there are no images to compare against yet.

Then make your desired changes: modify CSS, add content, update plugins etc.

The second time the script is run, after you have made your changes, another set of full-page screenshots will be created. Since there are now two sets of screenshots, diff images will be created as well.

For each page in the sitemap there should now be three images: the baseline screenshot, the screenshot taken after your changes, and the diff image.
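The first step described above, turning a sitemap into a list of page URLs, could look roughly like the sketch below. It is not the script's actual code: both function names are hypothetical, and a regular expression is used instead of a real XML parser to keep the example dependency-free.

```javascript
// Pull every <loc> entry out of a sitemap XML string.
// A regex is enough for simple sitemaps; a real XML parser is safer for edge cases.
function extractSitemapUrls(xml) {
    const matches = xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g);
    // Sitemaps escape "&" as "&amp;", so undo that before using the URL.
    const urls = Array.from(matches, (m) => m[1].replace(/&amp;/g, '&'));
    // Drop duplicates while keeping the original order.
    return [...new Set(urls)];
}

// Hypothetical helper: turn a page URL into a file-system friendly snapshot name.
function snapshotNameFor(url) {
    const { pathname } = new URL(url);
    const slug = pathname
        .replace(/^\/+|\/+$/g, '') // trim leading and trailing slashes
        .replace(/[^a-zA-Z0-9]+/g, '-'); // collapse everything else into dashes
    return `${slug || 'home'}.png`;
}
```

In a Playwright test, each URL would presumably be opened with page.goto and then checked with expect(page).toHaveScreenshot(name, { fullPage: true }), which saves the baseline on the first run and compares against it on later runs.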

Run tests

To perform visual regression testing using a single browser, run the "visual:{browser}" script as in the examples below.

Desktop Chrome

npm run visual:chrome

Desktop Firefox

npm run visual:firefox

Desktop Safari

npm run visual:safari

Mobile Safari (iPhone 11)

npm run visual:iphone

Run tests for all devices (warning: lots of images will be generated)

npm run visual

For more advanced configuration use the playwright command directly.

npx playwright test --config=tests/visual.config.js --workers=2

Running tests for other devices

Open tests/visual-config.js in your preferred code editor (Visual Studio, vim, Notepad++ or similar) and add more devices to the projects array:

projects: [
    // ...
    {
        name: 'Mobile landscape', // any name you like
        use: { ...devices['iPhone 14 Pro Max landscape'] } // lots of devices supported, affects mostly screen width
    },
    // ...
]

Analyzing results

Playwright will generate a report and serve it on localhost. After the tests have finished, the URL to the report will be displayed in the terminal. Open it in your browser to compare the three images (baseline screenshot, screenshot after changes, and diff image) for each page.

If all tests pass, there were no differences between the screenshots, meaning your changes to the site didn't affect anything visually. This is the desired outcome after something like WordPress plugin updates or dependency updates.

Failed tests mean that there are visual changes. This is expected after making content or styling changes. If the tests pass after such changes, it means your changes were not applied or that your pages are cached by your host or server.

Warming Cache

To warm FastCGI or proxy cache, run the following command:

npm run warm:cache

Contributing

I welcome contributions from the community! If you'd like to contribute to Playwright Sitemap Crawler, please follow these guidelines:

Fork the repository on GitHub.
Create a new branch for your feature or bug fix:
```
git checkout -b feature/new-feature
```
Make your changes and commit them:
```
git commit -m "Add new feature"
```
Push your branch to your fork:
```
git push origin feature/new-feature
```
Open a pull request on GitHub, explaining your changes and their purpose.

Feature Requests

If you have any feature requests or suggestions, please open an issue on the GitHub repository.

TODO

Support passing a CSV file to script instead of a sitemap

Support passing the path to a CSV file instead of a URL to an XML sitemap.
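One possible shape for this, as a rough sketch only: the helper names below are hypothetical, and the CSV layout (page URL in the first column, optional header row) is an assumption rather than a decided format.

```javascript
// Hypothetical helper: read page URLs from CSV text, taking the first column of each row.
// Note: this naive split does not support URLs that contain commas.
function parseUrlsFromCsv(csvText) {
    return csvText
        .split(/\r?\n/)
        .map((line) => line.split(',')[0].trim().replace(/^"|"$/g, ''))
        // Keep only absolute http(s) URLs, which also skips blank lines and a header row.
        .filter((value) => /^https?:\/\//i.test(value));
}

// Decide whether the configured source is a local CSV file rather than a sitemap URL.
function isCsvSource(source) {
    return !/^https?:\/\//i.test(source) && source.toLowerCase().endsWith('.csv');
}
```

The resulting array could then be fed into the same loop that currently handles the links extracted from the sitemap.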

Block website analytics requests

Make sure page views are not registered in website analytics services. Detect when libraries or JS files for analytics services are requested and abort the requests in Playwright, like ad blockers do. Also abort requests to known URLs used by analytics services, in case the analytics JS files are self-hosted.
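A minimal sketch of how this could work with Playwright's request routing (page.route, route.abort and route.continue). The host list is illustrative and far from complete, and the function names are hypothetical.

```javascript
// Illustrative, non-exhaustive list of hosts used by common analytics services.
const ANALYTICS_HOSTS = [
    'google-analytics.com',
    'googletagmanager.com',
    'plausible.io',
    'usefathom.com',
    'cloudflareinsights.com',
    'microanalytics.io',
];

// True when the request URL points at a listed host or one of its subdomains.
function isAnalyticsRequest(url) {
    let hostname;
    try {
        hostname = new URL(url).hostname;
    } catch {
        return false; // not an absolute URL, let it through
    }
    return ANALYTICS_HOSTS.some((host) => hostname === host || hostname.endsWith(`.${host}`));
}

// Register a route handler that aborts analytics requests and lets everything else continue.
// `page` is expected to be a Playwright Page.
async function blockAnalytics(page) {
    await page.route('**/*', (route) => {
        if (isAnalyticsRequest(route.request().url())) {
            return route.abort();
        }
        return route.continue();
    });
}
```

Matching on the request host rather than the script file name means the tracking calls are still blocked when the analytics script itself is self-hosted.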

Add support for command line parameters

Use command line arguments as primary values and env variables as fallback values.
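As a sketch of the intended precedence, the snippet below uses Node's built-in util.parseArgs (available in recent Node.js versions). The --sitemap-url and --workers flags and the WORKERS variable are hypothetical names; only SITEMAP_URL comes from the existing .env file.

```javascript
const { parseArgs } = require('node:util');

// Resolve settings with command line arguments taking priority over environment variables.
// Flag names and the WORKERS variable are assumptions made for this example.
function resolveConfig(argv, env) {
    const { values } = parseArgs({
        args: argv,
        options: {
            'sitemap-url': { type: 'string' },
            workers: { type: 'string' },
        },
    });
    return {
        // Argument first, environment variable as fallback.
        sitemapUrl: values['sitemap-url'] ?? env.SITEMAP_URL,
        workers: Number(values.workers ?? env.WORKERS ?? 1),
    };
}
```

It would typically be called as resolveConfig(process.argv.slice(2), process.env).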

Visual regression testing

Reduce the time to run visual regression tests.

Playwright can spawn several browsers and multiple instances of each browser when running tests. Downloaded files are kept in memory to keep the environment pristine between tests, which means all files are downloaded multiple times, once for each browser and context (incognito/private window).

This is not necessary when performing visual regression tests: most static files (images, JS, CSS) will not have changed, so downloading them again wastes resources.

Reuse page context to minimize server communication

This method would be like opening each URL in the same tab one after another, reusing the already downloaded files from memory. It would only help within a single run; all files would still be downloaded again between separate runs.
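A rough sketch of the idea, assuming a single Playwright Page is created once and passed in. The function name is hypothetical; page.goto and response.status are Playwright APIs.

```javascript
// Visit every URL in the same page (tab) so files already fetched in this run can be reused.
// `page` is expected to be a Playwright Page; page.goto resolves to a Response or null.
async function visitInSameTab(page, urls) {
    const results = [];
    for (const url of urls) {
        // Sequential on purpose: one navigation at a time in a single tab.
        const response = await page.goto(url, { waitUntil: 'load' });
        results.push({ url, status: response ? response.status() : null });
    }
    return results;
}
```

The trade-off is that pages are no longer isolated from each other, so cookies or storage set by one page remain visible to the next.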

Reduce communication with server by caching files on disk

Caching files to disk would reduce server load and bandwidth usage between separate runs. Multiple browsers and all instances could reuse the same files. Each file would be downloaded once, and every later request for the same file would just get a 304 response from the server.
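One way this might be approached is with Playwright's routing API (context.route, route.fetch, route.fulfill), sketched below. This is a simplification under stated assumptions: the function names, the file extension list and the cache layout are all made up, and unlike the 304 flow described above, cached files are served straight from disk without contacting the server, so a stale cache has to be cleared manually.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// Only cache static assets; HTML should always come fresh from the server.
const STATIC_ASSET = /\.(?:css|js|png|jpe?g|gif|svg|webp|woff2?)(?:\?.*)?$/i;

// Map a URL to a stable file path inside the cache directory.
function cachePathFor(cacheDir, url) {
    const key = crypto.createHash('sha256').update(url).digest('hex');
    return path.join(cacheDir, key);
}

// Serve static assets from disk when present; otherwise fetch once and store them.
// `context` is expected to be a Playwright BrowserContext.
async function cacheStaticAssets(context, cacheDir) {
    fs.mkdirSync(cacheDir, { recursive: true });
    await context.route(STATIC_ASSET, async (route) => {
        const request = route.request();
        if (request.method() !== 'GET') {
            return route.continue();
        }
        const file = cachePathFor(cacheDir, request.url());
        if (fs.existsSync(`${file}.json`)) {
            // Cache hit: answer from disk without touching the server.
            const meta = JSON.parse(fs.readFileSync(`${file}.json`, 'utf8'));
            return route.fulfill({
                status: 200,
                contentType: meta.contentType,
                body: fs.readFileSync(file),
            });
        }
        // Cache miss: perform the real request, store the result, then pass it on.
        const response = await route.fetch();
        const body = await response.body();
        if (response.ok()) {
            fs.writeFileSync(file, body);
            const contentType = response.headers()['content-type'];
            fs.writeFileSync(`${file}.json`, JSON.stringify({ contentType }));
        }
        return route.fulfill({ response, body });
    });
}
```

Because the cache lives on disk and is keyed by URL, every browser and context in a run, and every later run, could share the same files.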

Another possibility could be to use a proxy that caches the files on disk, reducing the complexity of the Playwright tests. The steps above would then be performed by the proxy instead. Maybe there's something out there that does this already. Please let me know if you have any suggestions ^_^

License

This project is licensed under the MIT License. Use it freely for fun and profit.
