sitediff's Introduction

SiteDiff CLI

Warning: SiteDiff 1.2.0 requires at least Ruby 3.1.2.

Warning: SiteDiff 1.0.0 introduces some backwards incompatible changes.

Introduction

SiteDiff makes it easy to see how a website changes. It can compare two similar sites or it can show how a single site changed over time. It helps identify undesirable changes to the site's HTML and it's a useful tool for conducting QA on re-deployments, site upgrades, and more!

When you run SiteDiff, it produces an HTML report showing whether pages on your site have changed or not. For pages that have changed, you can see a colorized diff showing exactly what changed, or compare the visual differences side-by-side in a browser.

SiteDiff supports a range of normalization / sanitization rules. These allow you to eliminate spurious differences, narrowing down differences to the ones that materially affect the site.

Installation

SiteDiff is fairly easy to install. Please refer to the installation docs.

Demo

After installing all dependencies, including version 2 of the Bundler gem, you can quickly see what SiteDiff can do by running the following commands:

git clone https://github.com/evolvingweb/sitediff
cd sitediff
bundle install
bundle exec thor fixture:serve

Then visit http://localhost:13080/ to view the report.

SiteDiff shows you an overview of all the pages and clearly indicates which pages have changed and which have not.

When you click on a changed page, you see a colorized diff of the page's markup showing exactly what changed on the page.

Usage

Here are some instructions on getting started with SiteDiff. To see a list of commands that SiteDiff offers, you can run:

sitediff help

To get help for a particular command, say, diff, you can run:

sitediff help diff

Getting started

To use SiteDiff on your site, create a configuration for your site:

sitediff init http://mysite.example.com

SiteDiff will generate a configuration file named sitediff.yaml by default.

You can open the configuration file sitediff/sitediff.yaml to see the default configuration generated by SiteDiff. The configuration reference section explains the contents of this file and helps you customize it as per your requirements.

Then get SiteDiff to crawl your site by using:

sitediff crawl

SiteDiff will then crawl your site, finding pages and caching their contents. A list of discovered paths will be saved to a paths.txt file.

Now, you can make alterations to your site. For example, change a word on your site's front page. After you're done, you can check what actually changed:

sitediff diff

For each page, SiteDiff will report whether it did or did not change. For pages that changed, it will display a diff. You can also see an HTML version of the report using the following command:

sitediff serve

SiteDiff will start an internal web server and open a report page on your browser. For each page, you can see the diff and a side-by-side view of the old and new versions.

You can now see if the changes were as you expected, or if some things didn't quite work out as you hoped. If you noticed unexpected changes, congratulations: SiteDiff just helped you find an issue you would have otherwise missed!

As you fix any issues, you can continue to alter your site and run sitediff diff to check the changes against the old version. Once you're satisfied with the state of your site, you can inform SiteDiff that it should re-cache your site:

sitediff store

This takes a snapshot of your website and the next time you run sitediff diff, it will use this new version as the reference for comparison.

Happy diffing!

Comparing 2 sites

Sometimes you have two sites that you want to compare, for example a production site hosted on a public server and a development site hosted on your computer. SiteDiff can handle this situation, too! Just inform SiteDiff that there are two sites to compare:

sitediff init http://mysite.example.com http://localhost/mysite

Then when you run sitediff diff, it will compare the cached version of the first site with the current version of the second site.

If both the first and second sites may be changing, you should tell SiteDiff not to cache either site:

sitediff diff --cached=none

Spurious diffs

Sometimes sites have spurious differences that you don't want to show up in a comparison. For example, many sites protect against Cross-Site Request Forgery using a semi-random token. Since this token changes on each HTTP GET, you probably don't care about such a change.

To help with issues such as this, SiteDiff allows you to normalize the HTML it fetches as it compares pages. In the sitediff.yaml configuration file, you can add "sanitization rules", which specify either DOM transformations or regular expression substitutions.

Here's an example of a rule you might add to remove CSRF-protection tokens generated by Django:

dom_transform:
  - title: Remove CSRF tokens
    type: remove
    selector: input[name=csrfmiddlewaretoken]

You can use one of the presets to apply framework-specific sanitization. Currently, SiteDiff only comes with Drupal-specific presets.

See the preset section for more details.

Command Line Options

Finding configuration files

By default SiteDiff will put everything in the sitediff folder. You can use the --directory flag (written as -C in the examples below) to specify a different directory.

sitediff init -C my_project_folder https://example.com
sitediff diff -C my_project_folder
sitediff serve -C my_project_folder

Specifying paths

When you run sitediff diff, you can specify which pages to look at in 2 ways:

  1. The option --paths /foo /bar ....

    If you're trying to fix one page in particular, specifying just that one path will make sitediff diff run quickly!

  2. The option --paths-file FILE with a newline-delimited text file.

This is particularly useful when you're trying to eliminate all diffs. SiteDiff creates a failures.txt file in its working directory (sitediff/failures.txt by default) containing all paths which had differences, so as you try to fix differences, you can run:

sitediff diff --paths-file sitediff/failures.txt

Debugging rules

When a sanitization rule isn't working quite right for you, you might run sitediff diff many times over. If fetching all the pages is taking too long, try adding the option --cached=all. This tells SiteDiff not to re-fetch the content, but just compare previously cached versions, which is a lot faster!

Including and Excluding URLs

By default sitediff crawls pages that are indicated with an HTML anchor using the <A HREF syntax. Most pages linked will be HTML pages, but some links will contain binaries such as PDF documents and images.

Using the option --exclude='.*\.pdf' ensures the crawler skips links to documents with a .pdf extension. Note that the regular expression is applied to the path of the URL, not the base of the URL.

For example --include='.*\.com' will not match http://www.google.com/, because the path of that URL is / while the base is www.google.com.
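The path-versus-base distinction can be sketched in a few lines of Ruby. This only illustrates the idea and is not SiteDiff's crawler code: path_matches? is a hypothetical helper, the second URL is made up, and the use of an unanchored match is an assumption.

```ruby
require 'uri'

# Hypothetical helper (not part of SiteDiff): test a crawl regex against the
# path component of a URL only, ignoring the scheme and host.
def path_matches?(url, pattern)
  Regexp.new(pattern).match?(URI.parse(url).path)
end

# The path of this URL is "/", so a pattern looking for ".com" finds nothing.
puts path_matches?('http://www.google.com/', '.*\.com')              # false

# Here the path is "/files/report.pdf", so the PDF pattern matches.
puts path_matches?('http://example.com/files/report.pdf', '.*\.pdf') # true
```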

paths / paths-file

SiteDiff allows you to specify a list of paths that you want it to work with. Alternatively, it can crawl the entire site and detect all paths.

  • Running sitediff init configures SiteDiff for crawling and seeing differences.

  • Running sitediff crawl makes sitediff crawl your site and detect available paths. These paths are written to a paths.txt file which you can modify according to your needs.

  • You can also compute diffs only for paths specified in a custom paths file using the --paths-file parameter. This file should contain paths starting with a /, having one path per line.

    sitediff diff --paths-file=/path/to/paths.txt
    
  • You can also compute diffs for a handful of specific paths by specifying them directly on the command line using the --paths parameter. Each path should be separated by a space.

    sitediff diff --paths=/home /about /contact
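If you generate or hand-edit a paths file, it can help to check it against the format described above (one path per line, each starting with a /). The Ruby sketch below does that check on its own; usable_paths is a hypothetical helper, and how SiteDiff itself treats malformed lines is not covered here.

```ruby
# Hypothetical helper (not part of SiteDiff): keep only non-blank lines that
# start with "/", which is the format expected in a paths file.
def usable_paths(text)
  text.each_line.map(&:strip).reject(&:empty?).select { |path| path.start_with?('/') }
end

# Sample file contents with one blank line and one line missing its slash.
SAMPLE_PATHS = "/home\n/about\n\ncontact\n/contact\n"

puts usable_paths(SAMPLE_PATHS).inspect # ["/home", "/about", "/contact"]
```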
    

export

Generate a gzipped tar file containing the HTML report instead of generating and serving live web pages. This option overrides --report-format, forcing HTML.

sitediff diff --export
sitediff diff -e

This will perform the diff and export the results in a gzipped tar file.

Running inside containers

If you run SiteDiff inside a container or virtual machine, the URLs in its report (for example, ones pointing at localhost) might not work from your host. You can fix this by using the --before-url-report and --after-url-report options to tell SiteDiff to use a different URL in the report than the one it uses for fetching.

For example, if you ran sitediff init http://mysite.com http://localhost inside a Vagrant VM, you might then run something like:

sitediff diff --after-url-report=http://vagrant:8080

Configuration

SiteDiff relies on a YAML configuration file, usually called sitediff.yaml. You can create a reasonable one using sitediff init, but there are many useful things you may want to add or change manually.

In the sitediff.yaml, SiteDiff recognizes the keys described below. The config directory contains some example sitediff.yaml files. For example, sitediff.example.yaml.

before_url / after_url

before_url: http://example.com/subsite
after_url: http://localhost:8080/subsite

They can also be paths to directories on the local filesystem.

The after_url MUST be provided either at the command-line or in the sitediff.yaml. If the before_url is provided, SiteDiff will compare the two sites. Otherwise, it will compare the current version of the after site with the stored version of that site, as created by sitediff init or sitediff store.

selector

Chooses the sections of HTML we wish to compare, if you don't want to compare the entire page. For example if you only want to compare breadcrumbs between your two sites, you might specify:

selector: '#breadcrumb'

sanitization

A list of regular expression rules to normalize your HTML for comparison.

Each rule should have a pattern regex, which is used to search the HTML. Each found instance is replaced with the provided substitute or deleted if no substitute is provided. A rule may also have a selector, which constrains it to operate only on HTML fragments which match that CSS selector.

For example, forms on Drupal sites have a randomly generated form_build_id on form pages:

<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>

We're not interested in comparing random content, so we could use the following rule to fix this:

sanitization:
# Remove form build IDs
  - pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
    selector: 'input'
    substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">'

Sanitization rules may also have a path attribute, whose value is a regular expression. If present, the rule will only apply to matching paths.
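To make the pattern/substitute mechanics concrete, here is a minimal Ruby sketch that applies the form_build_id rule above to two sample inputs. It only illustrates the regular-expression substitution and is not SiteDiff's actual implementation: apply_rule is a hypothetical helper, and the second form_build_id value is invented.

```ruby
# Hypothetical helper (not part of SiteDiff): apply one pattern/substitute
# rule to a string, deleting matches when no substitute is given.
def apply_rule(html, pattern, substitute = '')
  html.gsub(Regexp.new(pattern), substitute)
end

# Pattern and substitute copied from the sanitization rule above.
FORM_ID_PATTERN = '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
FORM_ID_SUBSTITUTE = '<input type="hidden" name="form_build_id" value="__form_build_id__">'

# Two fetches of the same form; only the random ID differs (second ID is invented).
BEFORE_HTML = '<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>'
AFTER_HTML  = '<input type="hidden" name="form_build_id" value="form-0123456789abcdef0123456789abcdef"/>'

# After sanitization both sides are identical, so no diff would be reported.
puts apply_rule(BEFORE_HTML, FORM_ID_PATTERN, FORM_ID_SUBSTITUTE) ==
     apply_rule(AFTER_HTML, FORM_ID_PATTERN, FORM_ID_SUBSTITUTE) # true
```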

ignore_whitespace

Ignore whitespace when doing the diff. This passes the -w option to the native OS diff command.

ignore_whitespace: true

On the command line, use -w or --ignore-whitespace.

sitediff diff -w

before / after

Applies rules to just one side of the comparison.

These blocks can contain any of the following sections: selector, sanitization, dom_transform. Such a section placed in before will be applied just to the before side of the comparison and similarly for after.

For example, if you wanted to let different date formatting not create diff failures, you might use the following:

before:
  sanitization:
    - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
      substitute: '__date__'
after:
  sanitization:
    - pattern:  '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
      substitute: '__date__'

The above rule will replace dates of the form 2004/12/05 in before and dates of the form May 12th 2004 in after with __date__.
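As a quick check of the two patterns, the following Ruby sketch runs each one against a sample string. The sample strings are made up, and the sketch only exercises the regular expressions; it does not call SiteDiff.

```ruby
# Patterns copied from the before/after rules above.
BEFORE_DATE = Regexp.new('[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}')
AFTER_DATE  = Regexp.new('[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}')

# Hypothetical page fragments showing the same date in two formats.
BEFORE_TEXT = 'Posted on 2004/12/05'
AFTER_TEXT  = 'Posted on Dec 5th 2004'

NORMALIZED_BEFORE = BEFORE_TEXT.gsub(BEFORE_DATE, '__date__')
NORMALIZED_AFTER  = AFTER_TEXT.gsub(AFTER_DATE, '__date__')

puts NORMALIZED_BEFORE # Posted on __date__
puts NORMALIZED_AFTER  # Posted on __date__
```

Because each side is rewritten to the same placeholder, the date no longer produces a difference.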

includes

The names of other configuration YAML files to merge with this one.

includes:
  - config/sanitize_domains.yaml
  - config/strip_css_js.yaml

dom_transform

A list of transformations to apply to the HTML before comparing.

This is similar to sanitization, but it applies transformations to the structure of the HTML, instead of to the text. Each transformation has a type, and potentially other attributes. The following types are available:

remove

Given a selector, removes all elements that match it.

For example, say we have a block containing the current time, which is expected to change. To ignore that, we might choose to delete the block before comparison:

dom_transform:
# Remove current time block
  - type: remove
    selector: div#block-time

strip

Strip leading and trailing whitespace from the contents of a tag.

Uses the Ruby string strip() method. Whitespace is defined as any of the following characters: null, horizontal tab, line feed, vertical tab, form feed, carriage return, space.

To transform <h1> Foo and Bar\n </h1> to <h1>Foo and Bar</h1>:

dom_transform:
# Strip H1 tags
  - type: strip
    selector: h1
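Since the transformation is described in terms of Ruby's String#strip, its effect on the example text can be shown directly. This is only the string operation on its own, not the DOM handling SiteDiff performs around it.

```ruby
# The whitespace-padded heading text from the example above.
H1_CONTENTS = " Foo and Bar\n "

# strip removes leading and trailing whitespace; interior spaces are kept.
STRIPPED_H1 = H1_CONTENTS.strip

puts "<h1>#{STRIPPED_H1}</h1>" # <h1>Foo and Bar</h1>
```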

unwrap

Given a selector, replaces all matching elements with their children. For example, your content on one side of the comparison might look like this:

<p>This is some text</p>
<img src="lola.png" alt="Lola is a cute kitten." />

But on the other side, it might be wrapped in an article tag:

<article>
  <p>This is some text</p>
  <img src="lola.png" alt="Lola is a cute kitten." />
</article>

You could fix it with the following configuration:

dom_transform:
  - type: unwrap
    selector: article

remove_class

Given a selector and a class, removes that class from each element that matches the selector. It can also take a list of classes, instead of just one.

For example, here are two sample rules for removing a single class and removing multiple classes from all div elements:

dom_transform:
  # Remove class foo from div elements
  - type: remove_class
    selector: div
    class: class-foo
  # Remove class bar and class baz from div elements
  - type: remove_class
    selector: div
    class:
      - class-bar
      - class-baz

unwrap_root

Replaces the entire root element with its children.

report

The settings under the report key allow you to display helpful details on the report.

report:
  title: "Updates to example.com"
  details: "This report verifies updates to example.com."
  before_note: "The old site"
  after_note: "The new site"
  before_url_report: http://example.com
  after_url_report: http://staging.example.com

title

Display a title string at the top of the report.

details

Text displays as a paragraph at the top of the report, below the title.

before_note

Display a brief explanatory note next to before URL.

after_note

Display a brief explanatory note next to after URL.

before_url_report / after_url_report

Changes how SiteDiff reports which URLs it is comparing, but doesn't change what it actually compares.

Suppose you are serving your 'after' website on a virtual machine with IP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links in the report accessible from outside the VM, you might provide:

after_url: http://localhost
report:
  after_url_report: http://192.168.2.3

If you don't wish to have the "Before" or "After" links in the report, set to false:

report:
  after_url_report: false

Miscellaneous

preset

Presets are stored in the /lib/sitediff/presets directory of this gem. You can select a preset as follows:

settings:
  preset: drupal

Include/Exclude Paths

exclude paths

A RegEx indicating the paths that should not be crawled.

include paths

A RegEx indicating the paths that should be crawled.

Organizing configuration files

If your configuration file starts getting really big, SiteDiff lets you separate it out into multiple files. Just have one base file that includes other files:

includes:
  - sanitization.yaml
  - paths.yaml

This allows you to separate your configuration into logical groups. For example, generic rules for your site could live in a generic.yaml file, while rules pertaining to a particular update you're conducting could live in update-8.2.yaml.

Named regions

In major upgrades and migrations where there are significant changes to the markup, simple diffs will not be of much value. To assist in these cases, named regions let you define regions in the page markup and specify the order in which they should be compared. Specifying the order helps in cases where the fields are not in the same order on the new site.

For example, if you have a CMS displaying title, author, and body fields, you could define the named regions and the selectors for the three fields as follows:

  regions:
    - name: title
      selector: h1.title
    - name: author
      selector: .field-name-attribution
    - name: body
      selector: .field-name-body

(You need to define regions for both the before and after sections.)

You must then define the order that the fields should be compared, using the output key.

output:
  - title
  - author
  - body

Before the two versions are compared, SiteDiff generates markup with <region> tags and each region contains the markup matching the corresponding selector.

For example:

<region id="title">
  <h1 class="title">My Blog Post</h1>
</region>
<region id="author">
  <div class="field-name-attribution">
    <span class="label">By:</span> Alfred E. Neuman
  </div>
</region>
<region id="body">
  <div class="field-name-body">
    <p>Lorem ipsum...</p>
  </div>
</region>

The regions are processed first, so you can reference the <region> tags to be more specific in your selectors for dom_transform and sanitization sections.

For example:

dom_transform:
  - name: Remove body div wrapper
    type: unwrap
    selector: region#body .field-name-body

Curl Options

Many options can be passed to the underlying curl library. Add --curl_options=name1:value1 name2:value2 to the command line, for example --curl_options=max_recv_speed_large:100000 (remove the CURLOPT_ prefix and write the name in lowercase), or add them to your configuration file.

settings:
  curl_opts:
    max_recv_speed_large: 10000
    ssl_verifypeer: false

These CURL options can be put under the settings section of sitediff.yaml as demonstrated above.

Throttling

A few options are also available to control how aggressively SiteDiff crawls.

  • There's a command line option --concurrency=N for sitediff init which controls the maximum number of simultaneous connections made. A lower N means less aggressive crawling. The default is 3. You can specify this in the sitediff.yaml file under the settings key.

  • The underlying curl library has many options such as max_recv_speed_large which can be helpful.

  • There is a special command line option --interval=T for sitediff init. This option allows the fetcher to delay for T milliseconds between fetching pages. You can specify this in the sitediff.yaml file under the settings key.

Timeouts

By default, no timeout is set, but one can be added with --curl_options=timeout:60 or in your configuration file.

settings:
  curl_opts:
    timeout: 60 # In seconds; or...
    timeout_ms: 60000 # In milliseconds.

Handling security

Often development or staging sites are protected by HTTP Authentication. SiteDiff allows you to specify a username and password, either by using a URL of the form http://username:password@example.com or by adding a userpwd setting to your file.

SiteDiff ignores untrusted certificates by default. This is equivalent to the following settings:

settings:
  curl_opts:
    ssl_verifypeer: false
    ssl_verifyhost: 0
    userpwd: "username:password"

The settings key contains various parameters which affect the way SiteDiff works. You can have the following keys under settings.

interval

An integer indicating the number of milliseconds SiteDiff should wait for between requests.

concurrency

The maximum number of simultaneous requests that SiteDiff should make.

depth

The depth to which SiteDiff should crawl the website. Defaults to 3, meaning the crawl goes 3 levels deep.

curl_opts

Options to pass to the underlying curl library. Remove the CURLOPT_ prefix in this full list of options and write in lowercase. Useful for throttling.

settings:
  curl_opts:
    connecttimeout: 3
    followlocation: true
    max_recv_speed_large: 10000

Tips and Tricks

Here are some tips and tricks that we've learned using SiteDiff:

  • Use single quotes or double quotes around selectors. Remember that the # is a comment in YAML.
  • Be specific enough with selectors to not affect elements on other pages.
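The quoting tip is easy to verify with Ruby's bundled YAML parser. The sketch below parses the same selector line with and without quotes; the constant names are only for illustration.

```ruby
require 'yaml'

# Without quotes, everything from " #" onward is a YAML comment,
# so the key ends up with no value at all.
UNQUOTED = YAML.safe_load('selector: #breadcrumb')

# With quotes, the selector survives intact.
QUOTED = YAML.safe_load("selector: '#breadcrumb'")

p UNQUOTED['selector'] # nil
p QUOTED['selector']   # "#breadcrumb"
```

A selector such as div#block-time parses fine unquoted because the # is not preceded by whitespace, but quoting every selector avoids having to remember that distinction.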

Removing Empty Elements

If you have an empty <p/> tag appearing in the diff, you can write the following in your sanitization lists:

  - name: remove_empty_p
    pattern: '<p/>'
    substitute: ''

HTML Tag Formatting

There are times when the HTML tags do not have newlines between them on one of the sites you wish to compare. In this case, these sanitization rules are useful:

  - name: remove_space_before
    pattern: '\s*(\n)<'
    substitute: '\1<'

  - name: remove_space_after
    pattern: '>(\n)\s*'
    substitute: '>\1'
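To see what these two rules do in practice, the Ruby sketch below applies them to an indented fragment and to the same fragment without indentation. It uses plain gsub rather than SiteDiff itself, and apply_tag_rules and the sample markup are made up for illustration.

```ruby
# Rules copied from above as [pattern, substitute] pairs.
TAG_RULES = [
  ['\s*(\n)<', '\1<'], # remove_space_before
  ['>(\n)\s*', '>\1']  # remove_space_after
].freeze

# Hypothetical helper: apply each rule in order with a global substitution.
def apply_tag_rules(html)
  TAG_RULES.reduce(html) do |text, (pattern, substitute)|
    text.gsub(Regexp.new(pattern), substitute)
  end
end

# The same markup with and without indentation and trailing spaces.
INDENTED = "<div>\n    <p>Hi</p>   \n</div>"
COMPACT  = "<div>\n<p>Hi</p>\n</div>"

puts apply_tag_rules(INDENTED) == apply_tag_rules(COMPACT) # true
```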

Empty Attributes

After writing rules, you may end up with empty attributes, like class="". Here's a sanitization rule that removes empty class attributes:

  - name: remove_empty_class
    pattern: ' class=""'
    substitute: ''

Acknowledgements

SiteDiff is brought to you by Evolving Web.

sitediff's People

Contributors

amirkdv, carehart, cleaver, dependabot[bot], dergachev, dhuf, dieterholvoet, fbarbanson, friggingee, gardon, jaqx0r, jigarius, jorgediazgutierrez, jrenggli, kdborg, kirk-brown-ew, krisre-sigmabold, ll782, morvans, mvc1095, shahinam, stevetemple, vasi, wucris, yelidrissi

sitediff's Issues

Option to ignore untrusted certificate

As a user I want to be able to crawl sites with untrusted certificates, in order to check sites that don't have valid certificates. Ideally it would be possible to turn off the check completely, like -k in curl.

      params = {
        :connecttimeout => 3,     # Don't hang on servers that don't exist
        :followlocation => true,  # Follow HTTP redirects (code 301 and 302)
        :ssl_verifypeer => false, # dirty workaround for broken certificates
        :ssl_verifyhost => 0,
        :headers => {
          "User-Agent" => "Sitediff - https://github.com/evolvingweb/sitediff"
        }
      }

This would do the trick, but someone who knows ruby should probably make that optional

[sitediff] ERROR (HTTP error https://www.eroswholesale.com/index.php: Unknown Error) /index.php

hey guys, first of all, congratulations for the great work.

I'm getting this error when running:

sitediff diff --cached=none

The failures.txt file doesn't show anything. Only a single "/".

See below my sitediff.yml


after:
  url: https://www.eroswholesale.com
before:
  url: https://www.eroswholesale.com
paths:
  - /

I'm guessing the target is blocking my requests based on the user agent, since this is not happening when I crawl a local website.

Any thoughts?

Cheers,
Rafa F

Invalid byte sequence in UTF-8

I'm running a build on commit 70869f7 of dev branch, and am getting the following error while processing a page during my sitediff diff:

Traceback (most recent call last):
	33: from /usr/bin/sitediff:23:in `<main>'
	32: from /usr/bin/sitediff:23:in `load'
	31: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/bin/sitediff:9:in `<top (required)>'
	30: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	29: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	28: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	27: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	26: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/cli.rb:110:in `diff'
	25: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:120:in `run'
	24: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:26:in `run'
	23: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/hydra/memoizable.rb:51:in `run'
	22: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/hydra/runnable.rb:15:in `run'
	21: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:43:in `perform'
	20: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:164:in `run'
	19: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:151:in `check'
	18: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `complete'
	17: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `each'
	16: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `block in complete'
	15: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/easy_factory.rb:164:in `block in set_callback'
	14: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/operations.rb:35:in `finish'
	13: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:145:in `execute_callbacks'
	12: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:145:in `each'
	11: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:146:in `block in execute_callbacks'
	10: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/uriwrapper.rb:107:in `block in typhoeus_request'
	 9: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:48:in `block (2 levels) in queue_path'
	 8: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:58:in `process_results'
	 7: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:90:in `process_results'
	 6: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:78:in `sanitize'
	 5: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:78:in `map'
	 4: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:81:in `block in sanitize'
	 3: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:35:in `sanitize'
	 2: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:97:in `regexps'
	 1: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:148:in `prettify'
/usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:148:in `sub!': invalid byte sequence in UTF-8 (ArgumentError)

I was able to fix this error with no noticeable detrimental effects to the diff report by adding the following line. The addition encodes the string to UTF-8 before sanitizing, replacing any invalid or undefined bytes with an empty string:

# Remove xml declaration and <html> tags
+ str = str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
str.sub!(/\A<\?xml.*$\n/, '')
str.sub!(/\A^<html>$\n/, '')
str.sub!(%r{</html>\n\Z}, '')     

I'm not sure if this is a proper fix (I'm not a Ruby dev) and I'm not sure where to send a PR, so I'm just submitting this fix as an issue in case it's useful.
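One thing worth knowing about the encode call above: because it treats the input as binary, it drops every non-ASCII byte, including valid multi-byte UTF-8 characters, not just the broken ones. Ruby's String#scrub is a possible alternative that removes only invalid sequences. The sketch below compares the two on a made-up sample string; it is not a tested patch for SiteDiff.

```ruby
# Sample bytes: "caf", a valid two-byte accented e (C3 A9), a space, and one
# invalid byte (FF), all tagged as UTF-8.
RAW = "caf\xC3\xA9 \xFF".dup.force_encoding(Encoding::UTF_8)

# Approach from the issue: reinterpret as binary and drop anything non-ASCII.
VIA_ENCODE = RAW.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# Alternative: remove only the byte sequences that are not valid UTF-8.
VIA_SCRUB = RAW.scrub('')

puts RAW.valid_encoding?         # false
puts VIA_ENCODE == 'caf '        # true: the accented e was dropped too
puts VIA_SCRUB == "caf\u00E9 "   # true: the accented e was kept
```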

Inappropriate file type or format - sitediff/cache.db

Getting this error now:

[sitediff] Visited https://subdomain.example.com, cached
/Library/Ruby/Gems/2.3.0/gems/sitediff-0.0.3/lib/sitediff/cache.rb:56:in `initialize': Inappropriate file type or format - sitediff/cache.db (Errno::EFTYPE)

Sitediff init not creating Paths.txt

Hello,

I just installed Sitediff on a fresh Ubuntu VM with Ruby 2.6.6.
When I run the sitediff init command (I tried it with a few web pages), it only creates sitediff.yaml and no paths.txt at all.

A few problems - time, throttle, and filesize limit error

sitediff looks to be just what I need but i'm running into some issues. i'm trying to compare a dev and prod version of a site.

i'm running into a few problems:

  1. there doesn't seem to be a way to throttle requests. sitediff seems to be requesting as fast as possible?
  2. requests don't time out, or maybe they have a longer timeout than a minute or two. my dev server stopped responding and this led to sitediff just hanging until i restarted it.
  3. Filesize limit exceeded: 25 - this seems to be a common ruby problem but i'm not a ruby guy. sitediff failed to finish running with this error.

and the last problem isn't exactly a problem, but sitediff doesn't write its cache/state as it goes, so after getting my error i'm left with an empty sitediff directory and have to start over.

Feature request - crawl sitemap.xml

We are finding sitediff works quite well, however it may not find all the URLS on a site by links from the home page. We would like (optionally) to be able to add the URLS listed in the sitemap.xml to the paths.

sitediff version throws error

Running sitediff version with sitediff 1.1.1 throws an error:

$ docker run -it evolvingweb/sitediff sitediff version
Traceback (most recent call last):
	7: from /usr/local/bundle/bin/sitediff:23:in `<main>'
	6: from /usr/local/bundle/bin/sitediff:23:in `load'
	5: from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	3: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	2: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	1: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:45:in `version': undefined method `version' for nil:NilClass (NoMethodError)

Internal server error

I get the following error when running the Demo from the readme, as well as using the latest source and 1.0.0-rc1 tag.

On any diff, if I click on the Both button to view the side by side, I get: undefined method `css' for SiteDiff::Diff:Module

Full stacktrace:

127.0.0.1 - - [15/May/2020:09:00:14 UTC] "GET /files/report.html HTTP/1.1" 200 113515
http://localhost:13080/files/report.html -> /files/report.html
[2020-05-15 09:00:16] ERROR NoMethodError: undefined method `css' for SiteDiff::Diff:Module
	(erb):9:in `do_GET'
	/usr/lib/ruby/2.5.0/erb.rb:876:in `eval'
	/usr/lib/ruby/2.5.0/erb.rb:876:in `result'
	/sitediff/lib/sitediff/webserver/resultserver.rb:66:in `do_GET'
	/usr/lib/ruby/2.5.0/webrick/httpservlet/abstract.rb:105:in `service'
	/usr/lib/ruby/2.5.0/webrick/httpserver.rb:140:in `service'
	/usr/lib/ruby/2.5.0/webrick/httpserver.rb:96:in `run'
	/usr/lib/ruby/2.5.0/webrick/server.rb:307:in `block in start_thread'
127.0.0.1 - - [15/May/2020:09:00:16 UTC] "GET /sidebyside/ HTTP/1.1" 500 329
http://localhost:13080/files/report.html -> /sidebyside/

HTML attribute order normalization

Hello,

first of all thanks for sitediff - I am currently in the process of upgrading an old unmaintained project without any tests and your program is very valuable.
One thing that would make it even more robust for me would be the possibility to compare HTML elements with attributes regardless of their order. Because of some internal changes in some libraries I use, I now often get e. g. <input type="text" class="txt"> in one version and <input class="txt" type="text"> in the other. As this has no influence on the resulting HTML function, it would be great to have option to treat both of them as the same.

Unfortunately, this is something that cannot be expressed by regexps, so I have no way to do it in the config file without some support in sitediff itself.

Any help would be much appreciated.
Beda
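
A minimal sketch of the requested normalization, assuming double-quoted attribute values with no `>` inside them: the attributes of each opening tag are sorted by name before comparison. The `normalize_attribute_order` helper is hypothetical and not part of sitediff; a robust version would more likely use an HTML parser than a regex.

```ruby
# Matches one name="value" attribute (double-quoted values only).
ATTRIBUTE = /([\w:\-]+)="([^"]*)"/

# Rewrite every opening tag so its attributes appear in alphabetical order.
# Tags without attributes, closing tags, and text are left untouched.
def normalize_attribute_order(html)
  html.gsub(/<([a-zA-Z][\w:\-]*)((?:\s+[\w:\-]+="[^"]*")+)\s*(\/?)>/) do
    tag, attrs, close = $1, $2, $3
    sorted = attrs.scan(ATTRIBUTE).sort_by(&:first).map { |name, value| %(#{name}="#{value}") }
    "<#{tag} #{sorted.join(' ')}#{close}>"
  end
end

puts normalize_attribute_order('<input type="text" class="txt">')
puts normalize_attribute_order('<input class="txt" type="text">')
```

Both calls print the same line, so the two variants from the report would no longer show up as a difference.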

Where should the output be?

---
before:
  url: https://www.a.com
  regions:
    - name: title
      selector: h1.title
  output:
    - title
after:
  url: https://beta.b.com
  regions:
    - name: title
      selector: h1.title
  output:
    - title
settings: {}

The diff covers the full HTML page and not only the h1.title region, which is what I am trying to achieve.

Where should output be set?

Diff can fail with error

A diff can fail when producing a report. The error is about nil and accessing values off of an object.

A few Questions about Sitediff

Hello, I have been using Sitediff for one site and I think it is a great tool. I am thinking of using it in our environment at work and have a few questions about the product.

  1. Is it possible to add multiple sites to check for differences? (E.g. we have around 100 production and staging sites we would want to compare, like https://engineering.acquiastaging.temple.edu/ to https://engineering.temple.edu/ and https://bursar.acquiastaging.temple.edu/ to https://bursar.temple.edu/.)

  2. If there is a way to diff all 100 sites at one time, would the results be produced per site or in one large file?

  3. Is there a way in the product to list the number of differences? For instance, on the site diff the production site is missing about 6 lines in total compared to the staging site.

  4. Does this have to be run manually, or could a script be run to produce the site diff for all sites at once?

Thanks!
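
One way the scripting question could be approached, sketched under assumptions: a small Ruby wrapper that gives each site pair its own working directory and runs `sitediff init` followed by `sitediff diff` there, so every pair gets a separate report. The `plan_for` and `run_all` helpers and the `sitediffs` root directory are hypothetical names; `dry_run` defaults to true so nothing is executed until you opt in.

```ruby
require 'fileutils'
require 'uri'

# Work out the directory and the commands for one before/after pair.
def plan_for(before_url, after_url, root: 'sitediffs')
  {
    dir: File.join(root, URI(before_url).host),
    commands: [
      ['sitediff', 'init', before_url, after_url],
      ['sitediff', 'diff']
    ]
  }
end

# Plan (and, unless dry_run, execute) the commands for every pair.
# Each pair runs inside its own directory, so reports stay separate.
def run_all(pairs, dry_run: true)
  pairs.map do |before_url, after_url|
    plan = plan_for(before_url, after_url)
    unless dry_run
      FileUtils.mkdir_p(plan[:dir])
      Dir.chdir(plan[:dir]) { plan[:commands].each { |cmd| system(*cmd) } }
    end
    plan
  end
end

pairs = [
  ['https://engineering.temple.edu/', 'https://engineering.acquiastaging.temple.edu/'],
  ['https://bursar.temple.edu/', 'https://bursar.acquiastaging.temple.edu/']
]
run_all(pairs).each { |plan| puts "#{plan[:dir]}: #{plan[:commands].map { |c| c.join(' ') }.join(' && ')}" }
```

Calling `run_all(pairs, dry_run: false)` would actually invoke sitediff, which assumes the `sitediff` executable is on the PATH.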

Changing output/diffs color scheme

I was wondering where we would change the color scheme of the highlighted changes in the output files under /output/diffs. The red is not accessible to people with some forms of color blindness, and I would like to replace it with a color scheme that is accessible to everyone.


Seems to use 'after' site for both before and after

Ubuntu 18.04, sitediff installed via Docker.

1.0.0

Unless I'm doing something wrong, sitediff seems to use the "after" site twice instead of both the "before" and "after" sites.

With the exception of CSRF tokens, the rest of the markup is reported as being the same, but the two sites differ in some other respects. For example:

before:
<input type="hidden" name="action" value="guestEntries/saveEntry">

after:
<input type="hidden" name="action" value="guest-entries/save">

But sitediff shows both before and after with the after site's input.

$ sudo docker run -p 13080:13080 -t -d --name sitediff evolvingweb/sitediff:1.0.0
$ sudo docker exec -it sitediff /bin/bash

root@7004949d432e:/sitediff# bundle install
root@7004949d432e:/sitediff# bundle exec thor fixture:serve

root@7004949d432e:/sitediff# sitediff init https://[site1].com/ https://[site2].co.uk/
[success] Created /sitediff/sitediff/sitediff.yaml
Reading config file: /sitediff/sitediff/sitediff.yaml
Visited <snip>
106 page(s) found.
[done] Created /sitediff/sitediff/paths.txt.
[success] You can now run "sitediff diff".

root@7004949d432e:/sitediff# sitediff diff
Reading config file: /sitediff/sitediff/sitediff.yaml
Read 106 paths from: /sitediff/sitediff/paths.txt
Using sites from cache: before
<snip>
All diff files written to /sitediff/sitediff/diffs
All failures written to /sitediff/sitediff/failures.txt
Report generated to /sitediff/sitediff/report.html
Run "sitediff serve" to see a report.

root@7004949d432e:/sitediff# sitediff serve
Reading config file: /sitediff/sitediff/sitediff.yaml
{"before"=>"https://site1.com/", "after"=>"https://site2.co.uk/", "cached"=>["before", "after"]}
Serving at http://localhost:13080
[2020-09-18 15:48:43] INFO  WEBrick 1.6.0
[2020-09-18 15:48:43] INFO  ruby 2.7.1 (2020-03-31) [x86_64-linux]
[2020-09-18 15:48:43] INFO  WEBrick::HTTPServer#start: pid=176 port=13080

latest

I then tried with "latest" via Docker and found the same thing. Additionally, after adding dom_transform:

---
before:
  url: https://site1.com/
  dom_transform:
    - type: remove
    - selector: input[name=CRAFT_CSRF_TOKEN]
after:
  url: https://site2.co.uk/
  dom_transform:
    - type: remove
    - selector: input[name=CSRF_TOKEN]

and running sitediff diff I get:

Reading config file: /sitediff/sitediff/sitediff.yaml
Read 38 paths from: /sitediff/sitediff/paths.txt
Using sites from cache: before
Traceback (most recent call last):
	10: from /usr/local/bundle/bin/sitediff:23:in `<main>'
	 9: from /usr/local/bundle/bin/sitediff:23:in `load'
	 8: from /usr/local/bundle/gems/sitediff-1.0.0/bin/sitediff:12:in `<top (required)>'
	 7: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	 6: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	 5: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	 4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	 3: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff/cli.rb:158:in `diff'
	 2: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `run'
	 1: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `map'
/usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `block in run': undefined method `success?' for nil:NilClass (NoMethodError)
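
One thing worth checking in the config above: as written, `type` and `selector` are separate list items, so each dom_transform entry carries only one of the two keys. Whether this is what triggers the `success?` error is an assumption, but the parsing difference itself can be confirmed with Ruby's standard YAML library:

```ruby
require 'yaml'

# The structure as posted: "type" and "selector" each start a new list item.
AS_POSTED = YAML.safe_load(<<~YAML)
  dom_transform:
    - type: remove
    - selector: input[name=CSRF_TOKEN]
YAML

# The same rule with both keys indented under a single list item.
COMBINED = YAML.safe_load(<<~YAML)
  dom_transform:
    - type: remove
      selector: input[name=CSRF_TOKEN]
YAML

puts AS_POSTED['dom_transform'].size  # two entries, each with one key
puts COMBINED['dom_transform'].size   # one entry holding both keys
```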

sitediff store ends up in ArgumentError

Steps to reproduce:

docker run -p 13080:13080 -t -it --name sitediff --rm -v $(pwd):/project --workdir=/project evolvingweb/sitediff:latest /bin/bash
root@f80772921e74:/project# sitediff store
Reading config file: /project/sitediff/sitediff.yaml
/usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/fetch.rb:13:in `initialize': wrong number of arguments (given 6, expected 3..5) (ArgumentError)
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/api.rb:232:in `new'
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/api.rb:232:in `store'
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/cli.rb:221:in `store'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/command.rb:27:in `run'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor.rb:392:in `dispatch'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/base.rb:485:in `start'
        from /usr/local/bundle/gems/sitediff-1.2.3/bin/sitediff:12:in `<top (required)>'
        from /usr/local/bundle/bin/sitediff:25:in `load'
        from /usr/local/bundle/bin/sitediff:25:in `<main>'
root@f80772921e74:/project#

Used image: https://hub.docker.com/layers/evolvingweb/sitediff/latest/images/sha256-a2529b4a26e58d5c1288372d832e90094b96096ef870702dd5e40c2e5cf572bf?context=explore

Do you need the configuration / contents of the mounted folder?

Follow redirects

Currently the tool fails if the URLs are redirected, e.g. from http to https.

It would be good to follow redirects, and/or to allow a mixture of http and https paths within the same sitediff.yaml file.

Diff on different URL

When we do a diff between sites on two different URLs, the tool shouldn't report the change in base URL as a diff.

For example, if we scan examplea.com and exampleb.com, and my site generates all hrefs as absolute URLs, I currently get a diff flagged on each href, even if the rest of the URL is the same.

Ignoring the whole href with a regex isn't an option, since there may be a change in the second part of the URL that needs to be flagged.
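
The idea can be illustrated in plain Ruby: replace only the scheme and host with a placeholder and leave the path alone, so a changed path is still flagged. This is a sketch of the technique, not sitediff's implementation; the `strip_domains` helper and the `__domain__` placeholder are illustrative choices.

```ruby
# Scheme plus host, stopping at the first slash or quote.
DOMAIN = %r{https?://[^/"']+}

# Replace every absolute-URL prefix with a fixed placeholder.
def strip_domains(html)
  html.gsub(DOMAIN, '__domain__')
end

before = strip_domains('<a href="https://examplea.com/about">About</a>')
after  = strip_domains('<a href="https://exampleb.com/about">About</a>')
moved  = strip_domains('<a href="https://exampleb.com/about-us">About</a>')

puts before == after  # same path, different host
puts before == moved  # path changed
```

The first comparison is true because only the host differed; the second is false because the path itself changed.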

Future ideas

I'll start the list here.

  • For demo (or free tier) purposes, crawl only X pages (10? 100?)
    • Maybe use the Google Analytics API to download list of top 100 pages, just crawl those

Error: HASH: Out of overflow pages. Increase page size

I'm not sure what this means, and the documentation doesn't say anything about pages that overflow. Some clarification about what this error means and how to fix it would be helpful. Regardless, thank you for this great tool.

Limit which pages are crawled with regexes

Hi,

Is there a way to limit which pages are crawled on a site with some regex rules? I'd like to limit our crawl to one language of our website, and to some sections, while letting the site admins change the URLs as they want. So limiting it with direct URLs in the settings.yaml file isn't a sustainable option.

Thanks!
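
As a stopgap outside the crawler, a list of crawled paths could be filtered with include patterns before diffing. A minimal sketch, assuming hypothetical rules for English news and events sections; `filter_paths` is not a sitediff function.

```ruby
# Keep only the paths that match at least one include pattern.
def filter_paths(paths, include_patterns)
  paths.select { |path| include_patterns.any? { |pattern| pattern.match?(path) } }
end

# Hypothetical rules: English pages in the news and events sections.
INCLUDE = [%r{\A/en/news(/|\z)}, %r{\A/en/events(/|\z)}].freeze

paths = ['/en/news/2020/launch', '/fr/news/2020/lancement', '/en/events', '/en/admin']
puts filter_paths(paths, INCLUDE)
```

With a real crawl, the input could be the lines of the generated paths.txt, read with `File.readlines(file, chomp: true)` and written back after filtering.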

Error with settings key

Unknown configuration key (/Users/josipanic/Projekti/test/sitediff/sitediff.yaml): 'settings'

Exclude does not seem to work for crawl

I never managed to get exclude to work, e.g. to exclude PDF files as in the docs and the example config. The sitediff version used is 1.1.1.

The command line doesn't seem to know an exclude switch:

# sitediff crawl --exclude='.*.pdf'
Unknown switches "--exclude=.*.pdf"

Also no effect when added to the sitediff.yaml config file like shown here, pdf files are still crawled:
https://github.com/evolvingweb/sitediff/blob/master/config/sitediff.example.yaml#L18

Also, could you please add an example how to exclude multiple patterns, not just one?
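
If the exclude setting takes a single regular expression (an assumption; the report above does not confirm how it is parsed), several patterns can be folded into one with alternation. In Ruby, `Regexp.union` does this. Note also that in `.*.pdf` the second dot is unescaped and matches any character, so `\.pdf` is the stricter form. The patterns below are illustrative.

```ruby
# Combine several exclusion rules into one regular expression.
EXCLUDE = Regexp.union(/\.pdf\z/i, /\.zip\z/i, %r{\A/user/})

paths = ['/docs/report.pdf', '/about', '/user/login', '/files/archive.ZIP']
kept = paths.reject { |path| EXCLUDE.match?(path) }
puts kept

# The unescaped dot also matches paths that are not PDF files.
puts(/.*.pdf/.match?('/xpdf'))
```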

Improve documentation for "Export" option

Hello,

I may have overlooked it in the documentation, but it wasn't obvious to me that to use the "export" feature you can run sitediff diff with the -e or --export switch. For example: sitediff diff --paths-file sitediff/paths_sample.txt --export

Can you please update the documentation to make this obvious?

Thank you!

nokogiri requires Ruby version >= 2.1.0.

Excited to try using this. Following the documented instructions, I get the following:

gem install nokogiri --no-rdoc --no-ri -- --use-system-libraries=true --with-xml2-include=/usr/include/libxml2
Fetching: mini_portile2-2.1.0.gem (100%)
Fetching: nokogiri-1.7.0.1.gem (100%)
ERROR:  Error installing nokogiri:
	nokogiri requires Ruby version >= 2.1.0.

Merging sanitisation rules from includes

I'd like to merge two sets of sanitisation rules - one "base" set that we apply to all sites, and then a set that is tailored for a site.

I had hoped to leverage the "include" functionality to put our base rules in a file in a parent directory (this part works), and then add additional rules in the sitediff.yaml. However, it appears that the sanitization: block in the sitediff.yaml overrides the one in the include. I also tried using YAML anchors, but the parser reports an "unknown alias".

I'm not sure if I just have the wrong syntax or if this just currently isn't possible?

Example:
testsite/sitediff.yaml:

includes:
  - "../base-rules.yaml"
before:
  url: https://beforesite.com
after:
  url: https://aftersite.com
sanitization:
- *base_sanitisation
- title: Strip domain names from absolute URLs
  pattern: https?:\/\/[^\/]+
  substitute: __domain__
  disabled: false
...

base-rules.yaml:

sanitization: &base_sanitisation
# base sanitisation rules
- title: Strip revision differences
  pattern: \?rev=[^"]+
  disabled: false
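
One possible workaround, sketched as a standalone preprocessing script rather than a sitediff feature: load both files with Ruby's YAML library and concatenate the rule lists, then write the merged result out as the config to use. The inline YAML strings stand in for base-rules.yaml and testsite/sitediff.yaml, and `merge_sanitization` is a hypothetical helper.

```ruby
require 'yaml'

BASE_YAML = <<~'YAML'
  sanitization:
    - title: Strip revision differences
      pattern: \?rev=[^"]+
      disabled: false
YAML

SITE_YAML = <<~'YAML'
  before:
    url: https://beforesite.com
  after:
    url: https://aftersite.com
  sanitization:
    - title: Strip domain names from absolute URLs
      pattern: https?:\/\/[^\/]+
      substitute: __domain__
      disabled: false
YAML

# Concatenate base rules and site rules instead of letting one list replace the other.
def merge_sanitization(base, site)
  site.merge('sanitization' => Array(base['sanitization']) + Array(site['sanitization']))
end

MERGED = merge_sanitization(YAML.safe_load(BASE_YAML), YAML.safe_load(SITE_YAML))
puts MERGED.to_yaml
```

Base rules come first in the merged list, so site-specific rules run after them.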

Paths with trailing slashes always have the trailing slash removed

We are working with a site that has a number of index pages, by category. It uses URLs of the form /content//, and uses the trailing slash to differentiate between other filters and a category filter. As such, removing the trailing slash results in a URL that is not on the site.

This is done in uriwrapper.rb:194

I understand the desire to canonicalise the URLs, however it might be useful if it were optional for some sites.

Invalid byte sequence in US-ASCII (ArgumentError) when running `sitediff diff`

Summary

I have no problem running sitediff store or sitediff crawl; however, when running sitediff diff I keep getting the following error:

sitediff diff
/usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb:44:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb:44:in `get'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:46:in `block in queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `block in run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff.rb:184:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/api.rb:117:in `diff'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:127:in `diff'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	from /usr/local/bundle/bin/sitediff:23:in `load'
	from /usr/local/bundle/bin/sitediff:23:in `<main>'
Reading config file: /website/sitediff/sitediff.yaml
Read 4582 paths from: /website/sitediff/paths.txt

Solution attempts

I was able to get past that error by patching it with:

sed -i 's/path.split(File::SEPARATOR)/path.encode('\''UTF-8'\'', :invalid => :replace).split(File::SEPARATOR)/g' /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb

But then I started to get another error:

sitediff diff
/usr/local/bundle/gems/addressable-2.5.2/lib/addressable/uri.rb:107:in `scan': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/addressable-2.5.2/lib/addressable/uri.rb:107:in `parse'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/uriwrapper.rb:52:in `initialize'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:54:in `new'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:54:in `block in queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `block in run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff.rb:184:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/api.rb:117:in `diff'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:127:in `diff'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	from /usr/local/bundle/bin/sitediff:23:in `load'
	from /usr/local/bundle/bin/sitediff:23:in `<main>'
Reading config file: /website/sitediff/sitediff.yaml
Read 4581 paths from: /website/sitediff/paths.txt
Using sites from cache: before

I have also tried to declare the encoding in the container before running/installing it with no success:

export LANG="en_US.UTF-8"
export LANGUAGE="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
export LC_NUMERIC="en_US.UTF-8"
export LC_TIME="en_US.UTF-8"
export LC_COLLATE="en_US.UTF-8"
export LC_MONETARY="en_US.UTF-8"
export LC_MESSAGES="en_US.UTF-8"
export LC_PAPER="en_US.UTF-8"
export LC_NAME="en_US.UTF-8"
export LC_ADDRESS="en_US.UTF-8"
export LC_TELEPHONE="en_US.UTF-8"
export LC_MEASUREMENT="en_US.UTF-8"
export LC_IDENTIFICATION="en_US.UTF-8"

Any thoughts?

Tech stack

  • Ubuntu 21.04
  • Docker container Ruby v2.6.9
  • Sitediff v1.1.1
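
The failure mode in the traces above can be reproduced in isolation: a string whose bytes are UTF-8 but which Ruby has labelled US-ASCII is not a valid string, and there are two common ways to repair it. This is a sketch of the general Ruby behaviour, not a fix for sitediff; the helper names are hypothetical.

```ruby
# Simulate a path whose bytes are UTF-8 but which is tagged as US-ASCII.
def tagged_us_ascii(bytes)
  bytes.b.force_encoding(Encoding::US_ASCII)
end

# Option 1: keep the bytes and relabel them as UTF-8.
def reinterpret_as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

# Option 2: transcode, replacing bytes that are invalid in the source
# encoding (similar to the encode call in the sed patch above).
def replace_invalid(str)
  str.encode('UTF-8', invalid: :replace, undef: :replace)
end

raw = tagged_us_ascii("/caf\xC3\xA9/menu")
puts raw.valid_encoding?                        # the non-ASCII bytes make this false
puts reinterpret_as_utf8(raw).split('/').inspect
puts replace_invalid(raw).valid_encoding?
```

Option 1 preserves the original characters, while option 2 substitutes a replacement character for each invalid byte, so paths may no longer match what is stored on disk.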

Always getting binary diffs, even for html pages

I'm always getting binary diffs even on html pages.

Something similar to what is described here: https://github.com/evolvingweb/sitediff/blob/master/lib/sitediff.rb#L103

In case my setup is a factor: I'm adding sitediff as part of a continuous integration system.
When a commit is made, I start two Python simple HTTP servers hosting the updated and old versions of the page, and run:

/sitediff/bin/sitediff init http://0.0.0.0:8311/ http://0.0.0.0:8312/ 
/sitediff/bin/sitediff diff --cached=none

I save the result and later check the generated report.html offline.
I also save full copies of the before and after pages, and can open them fine with a web browser.

Option to ignore whitespace

I'm using sitediff to check a site's content during a CMS migration. The new system generates different whitespace in HTML elements, which I'd like to ignore. So far my attempts to use the sanitization and dom_transform parameters to canonicalise whitespace differences have met with failure, because

I'd like a parameter that works like diff(1)'s -b flag (--ignore-space-change) or -w (--ignore-all-space).

Thanks.
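
For illustration, the two diff(1) behaviours mentioned above can be approximated in Ruby by normalizing both sides before comparing. This is a rough sketch of the idea, not an existing sitediff option; both function names are hypothetical.

```ruby
# Roughly like diff -b: collapse runs of spaces/tabs and drop trailing whitespace.
def ignore_space_change(text)
  text.lines.map { |line| line.rstrip.gsub(/[ \t]+/, ' ') }.join("\n")
end

# Roughly like diff -w: remove all whitespace within each line.
def ignore_all_space(text)
  text.lines.map { |line| line.gsub(/\s+/, '') }.join("\n")
end

old_html = "<p>  a   b </p>  \n"
new_html = "<p> a b </p>\n"
puts ignore_space_change(old_html) == ignore_space_change(new_html)
puts ignore_all_space('<p> a b</p>') == ignore_all_space('<p>a b </p>')
```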

Command "sitediff version" throws error

Traceback (most recent call last):
        7: from /usr/local/bundle/bin/sitediff:23:in `<main>'
        6: from /usr/local/bundle/bin/sitediff:23:in `load'
        5: from /usr/local/bundle/gems/sitediff-1.0.0/bin/sitediff:12:in `<top (required)>'
        4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
        3: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
        2: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
        1: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff/cli.rb:51:in `version': undefined method `version' for nil:NilClass (NoMethodError)

error reading config file in `sitediff serve`

I have a sitediff that I initialized. I can do sitediff diff, and get results, but the serve command crashes.

~/sitediffs/www.mysite.com$ sitediff serve
[sitediff] Reading config file: /home/user/sitediffs/www.mysite.com/sitediff/sitediff.yaml
/usr/lib/ruby/2.1.0/psych.rb:464:in `initialize': No such file or directory @ rb_sysopen - output/settings.yaml (Errno::ENOENT)
	from /usr/lib/ruby/2.1.0/psych.rb:464:in `open'
	from /usr/lib/ruby/2.1.0/psych.rb:464:in `load_file'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/webserver/resultserver.rb:56:in `initialize'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/cli.rb:131:in `new'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/cli.rb:131:in `serve'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/command.rb:27:in `run'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/invocation.rb:126:in `invoke_command'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor.rb:369:in `dispatch'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/base.rb:444:in `start'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/bin/sitediff:10:in `<top (required)>'
	from /usr/local/bin/sitediff:23:in `load'
	from /usr/local/bin/sitediff:23:in `<main>'

$  ruby -v
ruby 2.1.5p273 (2014-11-13) [x86_64-linux-gnu]

I tried to get the version from sitediff, but it didn't recognize -v or --version. I installed it ~ two days ago.

Sitediff seems to crawl the canonical url

It seems that sitediff crawls the canonical url if there is one specified. It would be great if we can disable that as it is impossible to crawl that local page now.

Allow the use of capture groups in substitutions

What I'd like to do is:

sanitization:
  - title: Remove trailing whitespace from class values
    pattern: class="([\w\s]+)\s"
    substitute: class="$1"

With the desire that this would not be a diff:

+ class="foo baa "
- class="foo baa"

BTW awesome tool! Thank you. I've not written Ruby before but if I can help I'd like to.
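
For reference, Ruby's own `String#gsub` writes backreferences in the replacement string as `\1` rather than `$1`. Whether sitediff passes `substitute` straight to `gsub` is an assumption, but the intended rule behaves like this in plain Ruby (`apply_rule` is a hypothetical helper):

```ruby
# Apply one pattern/substitute pair the way String#gsub would.
def apply_rule(html, pattern, substitute)
  html.gsub(Regexp.new(pattern), substitute)
end

PATTERN = 'class="([\w\s]+)\s"'

puts apply_rule('<div class="foo baa ">', PATTERN, 'class="\1"')
# prints: <div class="foo baa">
```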

Can sitediff load pages from disk?

Hello, I am contacting you about the sitediff toolchain. I am trying to use sitediff in my project to compare two pages that are both stored locally on disk. However, I found that sitediff only fetches pages by crawling them from the Internet. Does sitediff provide a feature for loading pages from disk?

Under AWS Linux 2 running "sitediff store" command throws an exception

Following the CentOS directions we assumed sitediff would also work under AWS Linux 2.

When running "sitediff store" we get the following error.
$ sitediff store
Reading config file: /home/ec2-user/sitediff/www/sitediff/sitediff.yaml
Traceback (most recent call last):
	10: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `<main>'
	 9: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `eval'
	 8: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/sitediff:23:in `<main>'
	 7: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/sitediff:23:in `load'
	 6: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	 5: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	 4: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	 3: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	 2: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	 1: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/lib/sitediff/cli.rb:220:in `store'
/home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/lib/sitediff/api.rb:226:in `store': undefined method `get_curl_opts' for #<SiteDiff::Api:0x0000000002a6b010> (NoMethodError)
All other sitediff commands seem to work as expected.

Any suggestions? We really like the concept of this tool.

Thanks
Jason

sitediff fails with "Not a directory @ apply2files" if crawl only produces one page

Sitediff fails to compare 2 single-paged URLs/sites: the before/after entries in the snapshots directory are files, not directories containing other entries, so this should also be taken into account

The error occurs on an Ubuntu Linux laptop and on a MacBook Air 2020 (M1) running macOS Ventura 13.1, with the latest version of sitediff installed via Homebrew.
