
sitediff's Issues

Always getting binary diffs, even for html pages

I'm always getting binary diffs even on html pages.

Something similar to what is described here: https://github.com/evolvingweb/sitediff/blob/master/lib/sitediff.rb#L103

In case my setup is a factor: I'm adding sitediff to a continuous integration system.
When a commit is made, I start two Python simple HTTP servers hosting the updated and old versions of the page, and run:

/sitediff/bin/sitediff init http://0.0.0.0:8311/ http://0.0.0.0:8312/ 
/sitediff/bin/sitediff diff --cached=none

I save the result and later check the generated report.html offline.
I also save full copies of the before and after pages, and can open them fine with a web browser.

Inappropriate file type or format - sitediff/cache.db

Getting this error now:

[sitediff] Visited https://subdomain.example.com, cached
/Library/Ruby/Gems/2.3.0/gems/sitediff-0.0.3/lib/sitediff/cache.rb:56:in `initialize': Inappropriate file type or format - sitediff/cache.db (Errno::EFTYPE)

sitediff version throws error

Running sitediff version with sitediff 1.1.1 throws an error:

➜ docker run -it evolvingweb/sitediff sitediff version
Traceback (most recent call last):
        7: from /usr/local/bundle/bin/sitediff:23:in `<main>'
        6: from /usr/local/bundle/bin/sitediff:23:in `load'
        5: from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
        4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
        3: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
        2: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
        1: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:45:in `version': undefined method `version' for nil:NilClass (NoMethodError)

Future ideas

I'll start the list here.

  • For demo (or free-tier) purposes, crawl only X pages (10? 100?)
    • Maybe use the Google Analytics API to download list of top 100 pages, just crawl those

Command "sitediff version" throws error

Traceback (most recent call last):
        7: from /usr/local/bundle/bin/sitediff:23:in `<main>'
        6: from /usr/local/bundle/bin/sitediff:23:in `load'
        5: from /usr/local/bundle/gems/sitediff-1.0.0/bin/sitediff:12:in `<top (required)>'
        4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
        3: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
        2: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
        1: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
/usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff/cli.rb:51:in `version': undefined method `version' for nil:NilClass (NoMethodError)

Paths with trailing slashes always have the trailing slash removed

We are working with a site that has a number of index pages, by category. It uses URLs of the form /content//, and uses the trailing slash to distinguish a category filter from other filters. As such, removing the trailing slash results in a URL that is not on the site.

This is done in uriwrapper.rb:194

I understand the desire to canonicalise the URLs, however it might be useful if it were optional for some sites.
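A minimal sketch of what an opt-out could look like. The `normalize_path` helper and its `keep_trailing_slash` parameter are hypothetical names used for illustration only; they are not sitediff's API, and the real logic in uriwrapper.rb may work differently.

```ruby
# Hypothetical helper: strip a trailing slash unless the caller opts out.
def normalize_path(path, keep_trailing_slash: false)
  # The root path is always left alone.
  return path if keep_trailing_slash || path == '/'

  path.chomp('/')
end

puts normalize_path('/content/news/')
puts normalize_path('/content/news/', keep_trailing_slash: true)
```

With a switch like this, sites that treat /content/news and /content/news/ as different pages could keep the URL exactly as crawled.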

Where should the output be?

---
before:
  url: https://www.a.com
  regions:
    - name: title
      selector: h1.title
  output:
    - title
after:
  url: https://beta.b.com
  regions:
    - name: title
      selector: h1.title
  output:
    - title
settings: {}

The diff is the full HTML page and not only the h1.title, as I am trying to make it.

Where should output be set?

Sitediff seems to crawl the canonical url

It seems that sitediff crawls the canonical url if there is one specified. It would be great if we can disable that as it is impossible to crawl that local page now.

Improve documentation for "Export" option

Hello,

I may have overlooked it in the documentation, but it wasn't obvious to me that to use the "export" feature you can run sitediff diff with the -e or --export switch. For example: sitediff diff --paths-file sitediff/paths_sample.txt --export

Can you please update the documentation to make this obvious?

Thank you!

[sitediff] ERROR (HTTP error https://www.eroswholesale.com/index.php: Unknown Error) /index.php

hey guys, first of all, congratulations for the great work.

I'm getting this error when running:

sitediff diff --cached=none

The failures.txt file doesn't show anything. Only a single "/".

See below my sitediff.yml


after:
  url: https://www.eroswholesale.com
before:
  url: https://www.eroswholesale.com
paths:
  - /

I'm guessing the target is blocking my requests based on the agent, since this is not happening when I crawl a local website.

Any thoughts?

Cheers,
Rafa F

Diff on different URL

When we do a diff on sites at two different URLs, the tool shouldn't report the change in base URL as a diff...

For example, if we scan examplea.com and exampleb.com, and my site generates all HREFs as absolute URLs, I currently get a diff flagged on the HREF, even if the rest of the URL is the same...

Ignoring the whole href with a regex isn't an option since there may be a change in the second part of this url that needs to be flagged...
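One way to picture the desired behaviour is a substitution that masks only the scheme and host, so a change in the path is still flagged. The sketch below is plain Ruby, not sitediff code; `BASE_URL`, `mask_base_url` and the `__domain__` placeholder are names chosen for illustration.

```ruby
# Matches only the scheme and host of an absolute URL, not the path.
BASE_URL = %r{https?://[^/"'\s>]+}

# Hypothetical helper: swap every base URL for a fixed placeholder.
def mask_base_url(html)
  html.gsub(BASE_URL, '__domain__')
end

before = '<a href="https://examplea.com/news/1">News</a>'
same   = '<a href="https://exampleb.com/news/1">News</a>'
moved  = '<a href="https://exampleb.com/news/2">News</a>'

puts mask_base_url(before) == mask_base_url(same)   # only the host differed
puts mask_base_url(before) == mask_base_url(moved)  # the path differs too
```

After masking, the first pair compares equal while the second still differs, which is the distinction the issue asks for.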

A few Questions about Sitediff

Hello, I have been using Sitediff for one site and I think it is a great tool and was thinking of using it in our Environment at work and I had a few questions about the product.

  1. Is it possible to add multiple sites to check the differences? (E.g. we have around 100 production and staging sites we would want to check the difference of, like https://engineering.acquiastaging.temple.edu/ to https://engineering.temple.edu/ and https://bursar.acquiastaging.temple.edu/ to https://bursar.temple.edu/)

  2. If there is a way to do all 100 site differences at one time, would they be produced site by site or in one large file?

  3. Is there a way in the product to list the number of differences? For instance, on the site difference the production site is missing about 6 lines in total compared to the staging site.

  4. Does this have to be run manually, or could a script be run to produce the site diff for all sites at once?

Thanks!
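For the scripting question, a wrapper along these lines is one possibility. It is only a sketch: it assumes each site pair gets its own working directory, uses just the `sitediff init <before> <after>` and `sitediff diff` commands that appear elsewhere on this page, and the `SITES` list, directory layout and helper names are made up for illustration. The order of the two URLs follows the question and may need swapping.

```ruby
require 'fileutils'

# Hypothetical list of [name, first URL, second URL] entries.
SITES = [
  ['engineering', 'https://engineering.acquiastaging.temple.edu/', 'https://engineering.temple.edu/'],
  ['bursar', 'https://bursar.acquiastaging.temple.edu/', 'https://bursar.temple.edu/']
].freeze

# The two commands to run for one pair of sites.
def commands_for(before_url, after_url)
  [
    ['sitediff', 'init', before_url, after_url],
    ['sitediff', 'diff']
  ]
end

# Runs (or, with dry_run, only prints) the commands, one directory per site.
def run_all(sites, root: 'sitediff-runs', dry_run: true)
  sites.each do |name, before_url, after_url|
    dir = File.join(root, name)
    commands_for(before_url, after_url).each do |cmd|
      if dry_run
        puts "(in #{dir}) #{cmd.join(' ')}"
      else
        FileUtils.mkdir_p(dir)
        system(*cmd, chdir: dir) || warn("#{name}: #{cmd.join(' ')} failed")
      end
    end
  end
end

run_all(SITES)
```

Keeping one directory per site means each pair ends up with its own report rather than one large file.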

Seems to use 'after' site for both before and after

Ubuntu 18.04, sitediff installed via Docker.

1.0.0

Unless I'm doing something wrong, sitediff seems to use the "after" site twice instead of both the "before" and "after" sites.

With the exception of CSRF tokens, the rest of the markup is reported as being the same, but the two sites differ in some other respects. For example:

before:
<input type="hidden" name="action" value="guestEntries/saveEntry">

after:
<input type="hidden" name="action" value="guest-entries/save">

But sitediff shows both before and after with the after site's input.

$ sudo docker run -p 13080:13080 -t -d --name sitediff evolvingweb/sitediff:1.0.0
$ sudo docker exec -it sitediff /bin/bash

root@7004949d432e:/sitediff# bundle install
root@7004949d432e:/sitediff# bundle exec thor fixture:serve

root@7004949d432e:/sitediff# sitediff init https://[site1].com/ https://[site2].co.uk/
[success] Created /sitediff/sitediff/sitediff.yaml
Reading config file: /sitediff/sitediff/sitediff.yaml
Visited <snip>
106 page(s) found.
[done] Created /sitediff/sitediff/paths.txt.
[success] You can now run "sitediff diff".

root@7004949d432e:/sitediff# sitediff diff
Reading config file: /sitediff/sitediff/sitediff.yaml
Read 106 paths from: /sitediff/sitediff/paths.txt
Using sites from cache: before
<snip>
All diff files written to /sitediff/sitediff/diffs
All failures written to /sitediff/sitediff/failures.txt
Report generated to /sitediff/sitediff/report.html
Run "sitediff serve" to see a report.

root@7004949d432e:/sitediff# sitediff serve
Reading config file: /sitediff/sitediff/sitediff.yaml
{"before"=>"https://site1.com/", "after"=>"https://site2.co.uk/", "cached"=>["before", "after"]}
Serving at http://localhost:13080
[2020-09-18 15:48:43] INFO  WEBrick 1.6.0
[2020-09-18 15:48:43] INFO  ruby 2.7.1 (2020-03-31) [x86_64-linux]
[2020-09-18 15:48:43] INFO  WEBrick::HTTPServer#start: pid=176 port=13080

latest

I then tried with "latest" via Docker and found the same thing. Additionally, after adding dom_transform:

---
before:
  url: https://site1.com/
  dom_transform:
    - type: remove
    - selector: input[name=CRAFT_CSRF_TOKEN]
after:
  url: https://site2.co.uk/
  dom_transform:
    - type: remove
    - selector: input[name=CSRF_TOKEN]

and running sitediff diff I get:

Reading config file: /sitediff/sitediff/sitediff.yaml
Read 38 paths from: /sitediff/sitediff/paths.txt
Using sites from cache: before
Traceback (most recent call last):
	10: from /usr/local/bundle/bin/sitediff:23:in `<main>'
	 9: from /usr/local/bundle/bin/sitediff:23:in `load'
	 8: from /usr/local/bundle/gems/sitediff-1.0.0/bin/sitediff:12:in `<top (required)>'
	 7: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	 6: from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	 5: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	 4: from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	 3: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff/cli.rb:158:in `diff'
	 2: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `run'
	 1: from /usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `map'
/usr/local/bundle/gems/sitediff-1.0.0/lib/sitediff.rb:186:in `block in run': undefined method `success?' for nil:NilClass (NoMethodError)

Feature request - crawl sitemap.xml

We are finding sitediff works quite well; however, it may not find all the URLs on a site by following links from the home page. We would like to be able to (optionally) add the URLs listed in the sitemap.xml to the paths.
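Until such an option exists, the paths could be extracted from a sitemap by hand. The sketch below is an assumption-laden illustration: `sitemap_paths` is a hypothetical helper, it uses a simple regular expression instead of a real XML parser, it does not decode XML entities, and it drops query strings.

```ruby
require 'uri'

# Hypothetical helper: pull the path of every <loc> entry out of a sitemap.
def sitemap_paths(xml)
  urls = xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten
  paths = urls.map do |url|
    path = URI.parse(url).path
    path.empty? ? '/' : path
  end
  paths.uniq
end

SAMPLE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com</loc></url>
    <url><loc>https://example.com/about</loc></url>
    <url><loc>https://example.com/blog/post-1</loc></url>
  </urlset>
XML

puts sitemap_paths(SAMPLE)
```

The resulting list has one path per line, so it could be appended to paths.txt.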

Sitediff init not creating Paths.txt

Hello,

I just installed Sitediff on a fresh Ubuntu VM with Ruby 2.6.6.
When I run the sitediff init command (tried with a few webpages), it only creates sitediff.yaml, with no paths.txt at all.

Merging sanitisation rules from includes

I'd like to merge two sets of sanitisation rules - one "base" set that we apply to all sites, and then a set that is tailored for a site.

I had hoped to leverage the "include" functionality to put our base rules in a file in a parent directory (this part works), and then add additional rules in the sitediff.yaml. However, it appears that the sanitisation: block in the sitediff.yaml overrides that in the include. I also tried using YAML anchors, but the parser reports an "unknown alias".

I'm not sure if I just have the wrong syntax or if this just currently isn't possible?

Example:
testsite/sitediff.yaml:

includes:
  - "../base-rules.yaml"
before:
  url: https://beforesite.com
after:
  url: https://aftersite.com
sanitization:
- *base_sanitisation
- title: Strip domain names from absolute URLs
  pattern: https?:\/\/[^\/]+
  substitute: __domain__
  disabled: false
...

base-rules.yaml:

sanitization: &base_sanitisation
# base sanitisation rules
- title: Strip revision differences
  pattern: \?rev=[^"]+
  disabled: false

Error with settings key

Unknown configuration key (/Users/josipanic/Projekti/test/sitediff/sitediff.yaml): 'settings'

Invalid byte sequence in US-ASCII (ArgumentError) when running `sitediff diff`

Summary

I have no problem running sitediff store or sitediff crawl; however, when running sitediff diff I keep getting the following error:

sitediff diff
/usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb:44:in `split': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb:44:in `get'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:46:in `block in queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `block in run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff.rb:184:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/api.rb:117:in `diff'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:127:in `diff'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	from /usr/local/bundle/bin/sitediff:23:in `load'
	from /usr/local/bundle/bin/sitediff:23:in `<main>'
Reading config file: /website/sitediff/sitediff.yaml
Read 4582 paths from: /website/sitediff/paths.txt

Solution attempts

I was able to get past that error by patching it with:

sed -i 's/path.split(File::SEPARATOR)/path.encode('\''UTF-8'\'', :invalid => :replace).split(File::SEPARATOR)/g' /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cache.rb

But then I started to get another error:

sitediff diff
/usr/local/bundle/gems/addressable-2.5.2/lib/addressable/uri.rb:107:in `scan': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/addressable-2.5.2/lib/addressable/uri.rb:107:in `parse'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/uriwrapper.rb:52:in `initialize'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:54:in `new'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:54:in `block in queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:45:in `queue_path'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `block in run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `each'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/fetch.rb:35:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff.rb:184:in `run'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/api.rb:117:in `diff'
	from /usr/local/bundle/gems/sitediff-1.1.1/lib/sitediff/cli.rb:127:in `diff'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	from /usr/local/bundle/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
	from /usr/local/bundle/bin/sitediff:23:in `load'
	from /usr/local/bundle/bin/sitediff:23:in `<main>'
Reading config file: /website/sitediff/sitediff.yaml
Read 4581 paths from: /website/sitediff/paths.txt
Using sites from cache: before

I have also tried to declare the encoding in the container before running/installing it with no success:

export LANG="en_US.UTF-8"
export LANGUAGE="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
export LC_NUMERIC="en_US.UTF-8"
export LC_TIME="en_US.UTF-8"
export LC_COLLATE="en_US.UTF-8"
export LC_MONETARY="en_US.UTF-8"
export LC_MESSAGES="en_US.UTF-8"
export LC_PAPER="en_US.UTF-8"
export LC_NAME="en_US.UTF-8"
export LC_ADDRESS="en_US.UTF-8"
export LC_TELEPHONE="en_US.UTF-8"
export LC_MEASUREMENT="en_US.UTF-8"
export LC_IDENTIFICATION="en_US.UTF-8"

Any thoughts?

Tech stack

  • Ubuntu 21.04
  • Docker container Ruby v2.6.9
  • Sitediff v1.1.1
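Both tracebacks above complain about US-ASCII, which suggests the path strings are tagged with that encoding even though their bytes are UTF-8. The sketch below is plain Ruby, independent of sitediff, and shows the difference between the two tags; whether retagging is the right fix inside cache.rb is an assumption.

```ruby
# Bytes for the UTF-8 path "café/index.html", as they might be read from disk.
bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack('C*') + '/index.html'.b

# Tagged as US-ASCII, the two bytes of the accented letter are invalid.
as_ascii = bytes.dup.force_encoding('US-ASCII')
puts as_ascii.valid_encoding?

# Tagged as UTF-8, the same bytes are valid and can be split safely.
as_utf8 = bytes.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?
puts as_utf8.split(File::SEPARATOR).inspect

# The encoding Ruby assumes for data read from outside the process.
puts Encoding.default_external
```

Unlike `encode` with `:invalid => :replace`, `force_encoding` only changes the tag and leaves the bytes untouched, so the path still refers to the same file.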

Option to ignore untrusted certificate

As a user I want to be able to crawl sites with untrusted certificates, in order to check sites that don't have valid certificates. Ideally it would be possible to turn off the check completely, like -k in curl.

      params = {
        :connecttimeout => 3,     # Don't hang on servers that don't exist
        :followlocation => true,  # Follow HTTP redirects (code 301 and 302)
        :ssl_verifypeer => false, # dirty workaround for broken certificates
        :ssl_verifyhost => 0,
        :headers => {
          "User-Agent" => "Sitediff - https://github.com/evolvingweb/sitediff"
        }
      }

This would do the trick, but someone who knows Ruby should probably make that optional.
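A rough sketch of how the two SSL settings could sit behind a switch. The `request_params` method and its `insecure` parameter are hypothetical; only the option keys are taken from the snippet above.

```ruby
# Hypothetical helper: build the request options, relaxing certificate
# checks only when the caller explicitly asks for it.
def request_params(insecure: false)
  params = {
    connecttimeout: 3,     # Don't hang on servers that don't exist
    followlocation: true,  # Follow HTTP redirects (code 301 and 302)
    headers: {
      'User-Agent' => 'Sitediff - https://github.com/evolvingweb/sitediff'
    }
  }
  if insecure
    params[:ssl_verifypeer] = false # skip certificate chain verification
    params[:ssl_verifyhost] = 0     # skip host name verification
  end
  params
end

puts request_params.key?(:ssl_verifypeer)
puts request_params(insecure: true)[:ssl_verifyhost]
```

Leaving the keys out by default keeps certificate checking on unless a user opts out, much like curl without -k.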

Under AWS Linux 2 running "sitediff store" command throws an exception

Following the CentOS directions we assumed sitediff would also work under AWS Linux 2.

When running "sitediff store" we get the following error.
$ sitediff store
Reading config file: /home/ec2-user/sitediff/www/sitediff/sitediff.yaml
Traceback (most recent call last):
        10: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `<main>'
         9: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/ruby_executable_hooks:24:in `eval'
         8: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/sitediff:23:in `<main>'
         7: from /home/ec2-user/.rvm/gems/ruby-2.7.1/bin/sitediff:23:in `load'
         6: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/bin/sitediff:12:in `<top (required)>'
         5: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
         4: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
         3: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
         2: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
         1: from /home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/lib/sitediff/cli.rb:220:in `store'
/home/ec2-user/.rvm/gems/ruby-2.7.1/gems/sitediff-1.1.1/lib/sitediff/api.rb:226:in `store': undefined method `get_curl_opts' for #<SiteDiff::Api:0x0000000002a6b010> (NoMethodError)
All other sitediff commands seem to work as expected.

Any suggestions? We really like the concept of this tool.

Thanks
Jason

Invalid byte sequence in UTF-8

I'm running a build on commit 70869f7 of dev branch, and am getting the following error while processing a page during my sitediff diff:

Traceback (most recent call last):
	33: from /usr/bin/sitediff:23:in `<main>'
	32: from /usr/bin/sitediff:23:in `load'
	31: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/bin/sitediff:9:in `<top (required)>'
	30: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
	29: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
	28: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
	27: from /usr/lib/ruby/gems/2.5.0/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
	26: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/cli.rb:110:in `diff'
	25: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:120:in `run'
	24: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:26:in `run'
	23: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/hydra/memoizable.rb:51:in `run'
	22: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/hydra/runnable.rb:15:in `run'
	21: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:43:in `perform'
	20: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:164:in `run'
	19: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/multi/operations.rb:151:in `check'
	18: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `complete'
	17: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `each'
	16: from /usr/lib/ruby/gems/2.5.0/gems/ethon-0.11.0/lib/ethon/easy/response_callbacks.rb:68:in `block in complete'
	15: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/easy_factory.rb:164:in `block in set_callback'
	14: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/operations.rb:35:in `finish'
	13: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:145:in `execute_callbacks'
	12: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:145:in `each'
	11: from /usr/lib/ruby/gems/2.5.0/gems/typhoeus-1.3.1/lib/typhoeus/request/callbacks.rb:146:in `block in execute_callbacks'
	10: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/uriwrapper.rb:107:in `block in typhoeus_request'
	 9: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:48:in `block (2 levels) in queue_path'
	 8: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/fetch.rb:58:in `process_results'
	 7: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:90:in `process_results'
	 6: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:78:in `sanitize'
	 5: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:78:in `map'
	 4: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff.rb:81:in `block in sanitize'
	 3: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:35:in `sanitize'
	 2: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:97:in `regexps'
	 1: from /usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:148:in `prettify'
/usr/lib/ruby/gems/2.5.0/gems/sitediff-0.0.4/lib/sitediff/sanitize.rb:148:in `sub!': invalid byte sequence in UTF-8 (ArgumentError)

I was able to fix this error with no noticeable detrimental effects to the diff report by adding the following line. The addition encodes the string to UTF-8 before sanitizing, replacing any invalid or undefined bytes with an empty string:

# Remove xml declaration and <html> tags
+ str = str.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
str.sub!(/\A<\?xml.*$\n/, '')
str.sub!(/\A^<html>$\n/, '')
str.sub!(%r{</html>\n\Z}, '')     

I'm not sure if this is a proper fix (I'm not a Ruby dev) and I'm not sure where to send a PR, so I'm just submitting this fix as an issue in case it's useful.
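One thing worth checking with this fix: converting from 'binary' treats every byte above 127 as undefined, so valid non-ASCII characters are dropped along with the broken bytes. The sketch below compares it with `String#scrub`, which removes only the invalid sequences. It is plain Ruby with made-up sample bytes, and whether `scrub` suits sitediff's sanitize step is an assumption.

```ruby
# Raw bytes for "café", a space, a stray 0xFF byte, a space, then "ok".
raw = [0x63, 0x61, 0x66, 0xC3, 0xA9, 0x20, 0xFF, 0x20, 0x6F, 0x6B].pack('C*')

# Approach from the issue: treat the input as binary and drop what has
# no ASCII equivalent.
via_encode = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# Alternative: tag the bytes as UTF-8 and remove only the invalid sequences.
via_scrub = raw.dup.force_encoding('UTF-8').scrub('')

puts via_encode # the accented letter is gone as well as the stray byte
puts via_scrub  # the accented letter survives
```

For pages that are plain ASCII apart from a few bad bytes the two approaches give the same result, which may be why no detrimental effect was visible in the report.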

HTML attribute order normalization

Hello,

first of all thanks for sitediff - I am currently in the process of upgrading an old unmaintained project without any tests and your program is very valuable.
One thing that would make it even more robust for me would be the possibility to compare HTML elements with attributes regardless of their order. Because of some internal changes in some libraries I use, I now often get e.g. <input type="text" class="txt"> in one version and <input class="txt" type="text"> in the other. As this has no influence on how the resulting HTML behaves, it would be great to have an option to treat both of them as the same.

Unfortunately, this is something that cannot be expressed by regexps, so I have no way to do it in the config file without some support in sitediff itself.

Any help would be much appreciated.
Beda
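To illustrate the kind of normalization being asked for, here is a small stand-alone sketch that sorts the attributes of each opening tag by name before comparison. It is not sitediff code: `sort_attributes`, `OPEN_TAG` and `ATTRIBUTE` are hypothetical names, and the pattern-based approach only handles simple markup; a real implementation would more likely work on the parsed DOM.

```ruby
# One attribute: a name, optionally followed by a quoted or bare value.
ATTRIBUTE = /([\w:-]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?/

# An opening tag: tag name, a run of attributes, optional self-closing slash.
OPEN_TAG = %r{<([a-zA-Z][\w:-]*)((?:\s+[\w:-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?)*)\s*(/?)>}

# Hypothetical helper: rewrite every opening tag with its attributes sorted.
def sort_attributes(html)
  html.gsub(OPEN_TAG) do
    tag, attrs, slash = Regexp.last_match.captures
    sorted = attrs.scan(ATTRIBUTE).map { |name, value| value ? "#{name}=#{value}" : name }.sort
    "<#{[tag, *sorted].join(' ')}#{slash}>"
  end
end

a = '<input type="text" class="txt">'
b = '<input class="txt" type="text">'
puts sort_attributes(a)
puts sort_attributes(a) == sort_attributes(b)
```

Applying the same rewrite to both versions of a page makes the two inputs from the example compare equal.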

sitediff fails with "Not a directory @ apply2files" if crawl only produces one page

Sitediff fails to compare 2 single-paged URLs/sites: the before/after entries in the snapshots directory are files, not directories containing other entries, so this should also be taken into account

Error occurs on Linux ubuntu laptop and on Macbook Air 2020 M1 with MacOS Ventura 13.1 when installing sitediff via homebrew in latest version

A few problems - time, throttle, and filesize limit error

sitediff looks to be just what I need but i'm running into some issues. i'm trying to compare a dev and prod version of a site.

i'm running into a few problems:

  1. there doesn't seem to be a way to throttle requests. sitediff seems to be requesting as fast as possible?
  2. requests don't time out. or maybe they have a longer timeout than a minute or two. my dev server stopped responding and this led to sitediff just hanging until i restarted it.
  3. Filesize limit exceeded: 25 - this seems to be a common ruby problem but i'm not a ruby guy. sitediff failed to finish running with this error.

and the last problem isn't exactly a problem, but sitediff doesn't write its cache/state as it goes, so after getting my error i'm left with an empty sitediff directory and have to start over.

Option to ignore whitespace

I'm using sitediff to check a site's content during a CMS migration. The new system generates different whitespace in HTML elements, which I'd like to ignore. So far my attempts to use the sanitization and dom_transform parameters to canonicalise whitespace differences have met with failure, because

I'd like a parameter that works like diff(1)'s -b flag (--ignore-space-change) or -w (--ignore-all-space).

Thanks.
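As a rough picture of what a -b style option would do, the sketch below collapses runs of spaces and tabs, trims each line and drops blank lines before comparing. `collapse_whitespace` is a hypothetical helper, not part of sitediff, and it ignores cases such as <pre> blocks where whitespace is significant.

```ruby
# Hypothetical helper: normalize whitespace so that only real content
# changes remain when the two strings are compared.
def collapse_whitespace(html)
  html.lines
      .map { |line| line.gsub(/[ \t]+/, ' ').strip }
      .reject(&:empty?)
      .join("\n")
end

old_markup = "<p>  Hello   world </p>\n\n"
new_markup = "<p> Hello world </p>\n"

puts collapse_whitespace(old_markup) == collapse_whitespace(new_markup)
```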

Error: HASH: Out of overflow pages. Increase page size

I'm not sure what this means, and the documentation doesn't say anything about pages that overflow. Some clarification about what this error means and how to fix it would be helpful. Regardless, thank you for this great tool.

Allow the use of capture groups in substitutions

What I'd like to do is:

sanitization:
  - title: Remove trailing whitespace from class values
    pattern: class="([\w\s]+)\s"
    substitute: class="$1"

With the desire that this would not be a diff:

+ class="foo baa "
- class="foo baa"

BTW awesome tool! Thank you. I've not written Ruby before but if I can help I'd like to.
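For reference, Ruby's own `String#gsub` writes backreferences in the replacement string as `\1` rather than `$1`. The sketch below shows this in plain Ruby; whether sitediff hands the substitute value to `gsub` unchanged is an assumption that would need checking against its source.

```ruby
pattern = /class="([\w\s]+)\s"/
html = '<div class="foo baa ">'

# "\1" in the replacement refers to the first capture group.
puts html.gsub(pattern, 'class="\1"')

# "$1" has no special meaning in a gsub replacement string.
puts html.gsub(pattern, 'class="$1"')
```

If the assumption holds, writing the rule with `substitute: class="\1"` might already give the desired result.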

Follow redirects

currently the tool fails if the URLs are redirected, e.g. from http to https

it would be good to follow redirects, and/or enable a mixture of http and https paths within the same sitediff.yaml file

nokogiri requires Ruby version >= 2.1.0.

Excited to try using this, following the documented instructions I get the following:

gem install nokogiri --no-rdoc --no-ri -- --use-system-libraries=true --with-xml2-include=/usr/include/libxml2
Fetching: mini_portile2-2.1.0.gem (100%)
Fetching: nokogiri-1.7.0.1.gem (100%)
ERROR:  Error installing nokogiri:
	nokogiri requires Ruby version >= 2.1.0.

Can sitediff load pages from disk?

Hello, I am contacting you about the sitediff toolchain. I am trying to use sitediff in my project to compare two pages that are both stored locally on disk. However, I found that sitediff only fetches pages by crawling them from the Internet. Does sitediff provide a feature for loading pages from disk?

Diff can fail with error

A diff can fail when producing a report. The error is about nil and accessing values off of an object.

Changing output/diffs color scheme

I was wondering where we would change the color scheme of the highlighted changes or diffs in the output files provided in /output/diffs. The red is not accessible to people with some forms of colorblindness, and I would like to replace it with a color scheme that is accessible to everyone.


Internal server error

I get the following error when running the Demo from the readme, as well as using the latest source and 1.0.0-rc1 tag.

On any diff, if I click on the Both button to view the side by side, I get: undefined method `css' for SiteDiff::Diff:Module

Full stacktrace:

127.0.0.1 - - [15/May/2020:09:00:14 UTC] "GET /files/report.html HTTP/1.1" 200 113515
http://localhost:13080/files/report.html -> /files/report.html
[2020-05-15 09:00:16] ERROR NoMethodError: undefined method `css' for SiteDiff::Diff:Module
	(erb):9:in `do_GET'
	/usr/lib/ruby/2.5.0/erb.rb:876:in `eval'
	/usr/lib/ruby/2.5.0/erb.rb:876:in `result'
	/sitediff/lib/sitediff/webserver/resultserver.rb:66:in `do_GET'
	/usr/lib/ruby/2.5.0/webrick/httpservlet/abstract.rb:105:in `service'
	/usr/lib/ruby/2.5.0/webrick/httpserver.rb:140:in `service'
	/usr/lib/ruby/2.5.0/webrick/httpserver.rb:96:in `run'
	/usr/lib/ruby/2.5.0/webrick/server.rb:307:in `block in start_thread'
127.0.0.1 - - [15/May/2020:09:00:16 UTC] "GET /sidebyside/ HTTP/1.1" 500 329
http://localhost:13080/files/report.html -> /sidebyside/

error reading config file in `sitediff serve`

I have a sitediff that I initialized. I can do sitediff diff, and get results, but the serve command crashes.

~/sitediffs/www.mysite.com$ sitediff serve
[sitediff] Reading config file: /home/user/sitediffs/www.mysite.com/sitediff/sitediff.yaml
/usr/lib/ruby/2.1.0/psych.rb:464:in `initialize': No such file or directory @ rb_sysopen - output/settings.yaml (Errno::ENOENT)
	from /usr/lib/ruby/2.1.0/psych.rb:464:in `open'
	from /usr/lib/ruby/2.1.0/psych.rb:464:in `load_file'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/webserver/resultserver.rb:56:in `initialize'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/cli.rb:131:in `new'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/lib/sitediff/cli.rb:131:in `serve'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/command.rb:27:in `run'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/invocation.rb:126:in `invoke_command'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor.rb:369:in `dispatch'
	from /var/lib/gems/2.1.0/gems/thor-0.19.4/lib/thor/base.rb:444:in `start'
	from /var/lib/gems/2.1.0/gems/sitediff-0.0.3/bin/sitediff:10:in `<top (required)>'
	from /usr/local/bin/sitediff:23:in `load'
	from /usr/local/bin/sitediff:23:in `<main>'

$  ruby -v
ruby 2.1.5p273 (2014-11-13) [x86_64-linux-gnu]

I tried to get the version from sitediff, but it didn't recognize -v or --version. I installed it ~ two days ago.

Exclude does not seem to work for crawl

I never managed to get exclude to work, e.g. to exclude PDF files as in the documentation and example config. The sitediff version used is 1.1.1.

The command line doesn't seem to know an exclude switch:

# sitediff crawl --exclude='.*.pdf'
Unknown switches "--exclude=.*.pdf"

Also no effect when added to the sitediff.yaml config file like shown here, pdf files are still crawled:
https://github.com/evolvingweb/sitediff/blob/master/config/sitediff.example.yaml#L18

Also, could you please add an example how to exclude multiple patterns, not just one?
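On the multiple-patterns point, several expressions can always be folded into one with alternation. The sketch below shows the idea in plain Ruby using `Regexp.union`; the pattern list and sample paths are made up, and how sitediff itself reads the exclude setting is not something this example confirms.

```ruby
# Hypothetical exclusion patterns; note the escaped dot and the \z end anchor.
EXCLUDE = Regexp.union(/\.pdf\z/i, /\.zip\z/i, %r{\A/admin/})

paths = ['/about', '/files/report.pdf', '/files/archive.ZIP', '/admin/login', '/pdf-guide']

# Keep only the paths that match none of the patterns.
kept = paths.reject { |path| EXCLUDE.match?(path) }
puts kept

# With an unescaped dot, '.*.pdf' also matches a path that merely contains "pdf".
puts(/.*.pdf/.match?('/pdf-guide'))
```

The last line is a reminder that the pattern from the command above would also exclude pages such as /pdf-guide.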

sitediff store ends up in ArgumentError

Steps to reproduce:

docker run -p 13080:13080 -t -it --name sitediff --rm -v $(pwd):/project --workdir=/project evolvingweb/sitediff:latest /bin/bash
root@f80772921e74:/project# sitediff store
Reading config file: /project/sitediff/sitediff.yaml
/usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/fetch.rb:13:in `initialize': wrong number of arguments (given 6, expected 3..5) (ArgumentError)
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/api.rb:232:in `new'
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/api.rb:232:in `store'
        from /usr/local/bundle/gems/sitediff-1.2.3/lib/sitediff/cli.rb:221:in `store'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/command.rb:27:in `run'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor.rb:392:in `dispatch'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/base.rb:485:in `start'
        from /usr/local/bundle/gems/sitediff-1.2.3/bin/sitediff:12:in `<top (required)>'
        from /usr/local/bundle/bin/sitediff:25:in `load'
        from /usr/local/bundle/bin/sitediff:25:in `<main>'
root@f80772921e74:/project#

Used image: https://hub.docker.com/layers/evolvingweb/sitediff/latest/images/sha256-a2529b4a26e58d5c1288372d832e90094b96096ef870702dd5e40c2e5cf572bf?context=explore

Do you need the configuration / contents of the mounted folder?

Limit which pages are crawled with regexes

Hi,

Is there a way to limit which pages are crawled on a site with some regex rules? I'd like to limit our crawl to one language on our website, and to some sections, but let the site admins play with the URLs as they want. So limiting it with direct URLs in the settings.yaml file isn't a sustainable option...

Thanks!
