Hi, I've been experimenting with browsertrix-crawler and replayweb.page for a few

Incorrectly decoding URLs or not correctly loading images (replayweb.page issue, closed)

webrecorder commented on June 9, 2024

Incorrectly decoding URLs or not correctly loading images

from replayweb.page.

Comments (7)

DrMint commented on June 9, 2024

I have tried again at smaller scale with this command:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --include "31|https:\/\/img\.atwikiimg\.com" --workers 3 --text --generateWACZ

Unfortunately, it's not replicating the problem.
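For readers checking what the --include pattern in the command above would let through, here is a minimal sketch. It assumes the crawler tests the pattern anywhere in the URL (an unanchored search); the image path and the 131.html URL are hypothetical examples, not taken from the crawl.

```python
import re

# The pattern passed to --include in the command above, copied verbatim.
INCLUDE = re.compile(r"31|https:\/\/img\.atwikiimg\.com")


def in_scope(url: str) -> bool:
    """Assumption: the pattern may match anywhere in the URL (unanchored search)."""
    return INCLUDE.search(url) is not None


candidates = [
    "https://w.atwiki.jp/sinoalice_kousatu/pages/31.html",
    "https://w.atwiki.jp/sinoalice_kousatu/pages/98.html",
    "https://w.atwiki.jp/sinoalice_kousatu/pages/131.html",  # hypothetical page
    "https://img.atwikiimg.com/example.png",  # hypothetical image path
]
for url in candidates:
    print(in_scope(url), url)
```

Under that assumption, the bare "31" alternative also matches any URL that merely contains those two digits, so a tighter pattern may be worth considering for larger crawls.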
That being said, I also tried:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --text --generateWACZ --depth 0

and when clicking on a missing image, it cannot even load the live page image:

despite the image being available online if we copy/paste the link into a browser.
I'm wondering if this could be a misinterpretation of the URL? I see it's adding "null" at the end.


ikreymer commented on June 9, 2024

Thanks for reporting. Could you share the original WARC and WACZ that are loading differently? You can share them via email if you'd like.

The last issue could be something else, so I want to look at one thing at a time. (The 'null' at the end bug was fixed in the replayweb.page hosted version, and the fix will also be in the next app release, coming shortly.)


DrMint commented on June 9, 2024

Of course, here's a link to the collection it created in my first example. I wish I could have recreated this problem on a smaller crawl, but I haven't been able to make the seed and config file work, so I'm limited to the command-line parameters.

The image in the first screenshot was located in rec-20210814070623379161-76e848972cc4.warc.gz

Just for clarity, one problem I had with this crawl is that the wiki uses collapsible sections on most pages (content stays hidden until you click a + button). The images displayed within those collapsed sections are lazy-loaded, so they are not included in the recording. In order to download them, I've included the /upload/XXX.html pages, which contain a list of all the images included in page XXX. Those images are seemingly always stored on another domain (img.atwikiimg.com).

Another question: is it possible to regenerate a WACZ from the WARCs? Maybe it just got corrupted the first time?
Edit: just found py-wacz, I'll try to do just that.

Thanks for your response!


DrMint commented on June 9, 2024

Okay so same behavior after using py-wacz, but I think I understood a lot of things.

In the WARCs, these images are categorized in the URLs tab, and in the WACZ, they are in the Pages tab. None of the images work when they are treated as pages. There are actually 146 images in the URLs tab from the domain https://img.atwikiimg.com in the resulting WACZ, and they work perfectly. The images are from https://w.atwiki.jp/sinoalice_kousatu/pages/100.html, https://w.atwiki.jp/sinoalice_kousatu/pages/99.html, and https://w.atwiki.jp/sinoalice_kousatu/pages/98.html, and sure enough, those pages have all the images displayed when browsing them.

Also, there are WAY more images in the WARCs than in the resulting WACZ: when searching for "https://img.atwikiimg.com" in URLs and Pages, we get a total of 171 images. When doing the same in rec-20210814070623379161-76e848972cc4.warc.gz, there are none in the Pages, but 854 in the URLs.
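One way to cross-check the Pages count outside the viewer is to read the page list stored inside the WACZ. The sketch below assumes the usual WACZ layout, where the archive is a ZIP containing pages/pages.jsonl with one JSON object per line and a header record first. The function name page_urls and the in-memory demo archive are hypothetical stand-ins, not data from this crawl.

```python
import io
import json
import zipfile


def page_urls(wacz, prefix=""):
    """Return URLs listed in pages/pages.jsonl of a WACZ that start with prefix."""
    urls = []
    with zipfile.ZipFile(wacz) as zf:
        with zf.open("pages/pages.jsonl") as fh:
            for raw in fh:
                line = raw.strip()
                if not line:
                    continue
                entry = json.loads(line)
                # The first line is normally a header record without a "url" key.
                url = entry.get("url")
                if url and url.startswith(prefix):
                    urls.append(url)
    return urls


# Tiny in-memory ZIP standing in for a real WACZ (hypothetical entries).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("pages/pages.jsonl", "\n".join([
        json.dumps({"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}),
        json.dumps({"id": "1", "url": "https://w.atwiki.jp/sinoalice_kousatu/pages/98.html"}),
        json.dumps({"id": "2", "url": "https://img.atwikiimg.com/example.png"}),
    ]))
demo.seek(0)
print(page_urls(demo, "https://img.atwikiimg.com"))
```

With a real file, passing its path (for example page_urls("crawl.wacz", "https://img.atwikiimg.com"), where the filename is a placeholder) would list the image URLs that were recorded as pages rather than as plain resources.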


ikreymer commented on June 9, 2024

Thanks for reporting, I think I've been able to repro the issue with the images and am taking a look.


ikreymer commented on June 9, 2024

Thanks for reporting. The main issue was that the URLs that had a www65. in them were being incorrectly canonicalized (due to a bug in warcio.js), preventing them from being loaded! This was specific to this URL pattern, and quite an edge case, but glad it was detected.
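To illustrate why a canonicalization mismatch can make a recorded URL unloadable, here is a hypothetical sketch. It is not the warcio.js code and does not reproduce the actual bug, which is not described in this thread. It only assumes a common convention in which index keys drop a leading "www" or "www<digits>" host label, and shows that a lookup fails when the index and the replay side disagree on that rule. The function lookup_key and the example URL are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Leading "www" or "www<digits>" host label, which SURT-style
# canonicalizers commonly strip before building an index key.
_WWW = re.compile(r"^www\d*\.")


def lookup_key(url: str, strip_www: bool = True) -> str:
    """Build a simplified index key from a URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    if strip_www:
        host = _WWW.sub("", host)
    key = host + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


# Index built with one rule, looked up with another: the keys disagree.
url = "https://www65.example.com/img/a.png"  # hypothetical URL
index = {lookup_key(url, strip_www=True): "record"}
print(lookup_key(url, strip_www=False) in index)  # lookup misses
print(lookup_key(url, strip_www=True) in index)   # lookup hits
```

The record is present in the index in both cases; only the key derivation differs, which matches the symptom of images existing in the archive yet failing to load.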

The pages in the WACZ should work now in the latest replayweb.page site.

There's a separate issue about images not loading in the prefix query in the URL tab, which will be fixed in a follow-up update.
There should be no difference with how a page loads via either the WARC or the WACZ. (The page I tested specifically was https://w.atwiki.jp/sinoalice_kousatu/pages/98.html).

(There are also the ad images, which appear to load in Firefox but not Chrome, but that's a separate, complicated issue.)

Let us know if the issue is still happening.


DrMint commented on June 9, 2024

Hi! I've just quickly skimmed through the archive using the latest version, and it seems like everything is loading fine now: awesome! I'm glad this got fixed because other than that, it's the best set of tools I've found for the purpose of archiving and replaying websites. Thanks a bunch!
