Comments (7)

DrMint commented on June 9, 2024

I have tried again at smaller scale with this command:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --include "31|https:\/\/img\.atwikiimg\.com" --workers 3 --text --generateWACZ

Unfortunately, it's not replicating the problem.
That being said, I also tried:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --text --generateWACZ --depth 0

and when clicking on a missing image, it cannot even load the live page image:
[screenshot]
despite the image being available online if we copy/paste the link into a browser.
I'm wondering if this could be a misinterpretation of the URL? I see it's adding "null" at the end.
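
For reference, here is a rough sketch of what the --include pattern from the first command would put in scope. It assumes the crawler tests the pattern anywhere in the URL (an unanchored search), which is an assumption about browsertrix-crawler and not something confirmed in this thread. Python's re module is used as a stand-in for the JavaScript regex engine, and the 131.html and example.png URLs are made up for illustration.

```python
import re

# The pattern passed to --include in the first command, copied verbatim.
INCLUDE = re.compile(r"31|https:\/\/img\.atwikiimg\.com")


def in_scope(url: str) -> bool:
    """Return True if the pattern matches anywhere in the URL.

    Assumption: the crawler uses an unanchored search, not a full match.
    """
    return INCLUDE.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://w.atwiki.jp/sinoalice_kousatu/pages/31.html",   # the seed page
        "https://w.atwiki.jp/sinoalice_kousatu/pages/131.html",  # hypothetical URL
        "https://img.atwikiimg.com/example.png",                 # hypothetical URL
        "https://w.atwiki.jp/sinoalice_kousatu/pages/98.html",
    ]
    for url in samples:
        print(in_scope(url), url)
```

Under that assumption, the bare "31" alternative matches any URL that contains those two digits, so something like pages/131.html would be in scope as well. A tighter alternative such as pages\/31\.html would avoid that.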

from replayweb.page.

ikreymer commented on June 9, 2024

Thanks for reporting. Could you share the original WARC and WACZ that are loading differently? You can share them via email if you'd like.

The last issue could be something else, so I want to look at one thing at a time. (The 'null' at the end bug was fixed in the replayweb.page hosted version, and the fix will be in the next app release coming shortly as well.)

DrMint commented on June 9, 2024

Of course, here's a link to the collection it created in my first example. I wish I could have recreated this problem on a smaller crawl, but I haven't been able to make the seed and config file work, so I'm limited to the command line parameters.

The image in the first screenshot was located in rec-20210814070623379161-76e848972cc4.warc.gz

Just for clarity, one problem I had with this crawl is that the wiki uses collapsible sections on most pages (content hidden until you click on a + button). The images displayed within those collapsed sections are lazy-loaded, so they are not included in the recording. In order to download them, I've included the /upload/XXX.html pages, which contain a list of all the images included in page XXX. And those images are seemingly always stored on another domain (img.atwikiimg.com).

Another question: is it possible to re-generate a WACZ from the WARCs? Maybe it just got corrupted the first time?
Edit: just found py-wacz, I'll try to do just that.

Thanks for your response!

DrMint commented on June 9, 2024

Okay so same behavior after using py-wacz, but I think I understood a lot of things.

In the WARCs, these images are listed in the URLs tab, and in the WACZ, they are in the Pages tab. None of the images work when they are treated as pages. There are actually 146 images in the URLs tab from the domain https://img.atwikiimg.com in the resulting WACZ, and they work perfectly. The images are from https://w.atwiki.jp/sinoalice_kousatu/pages/100.html, https://w.atwiki.jp/sinoalice_kousatu/pages/99.html, and https://w.atwiki.jp/sinoalice_kousatu/pages/98.html, and sure enough, those pages have all the images displayed when browsing them.

Also, there are WAY more images in the WARCs than in the resulting WACZ: when searching for "https://img.atwikiimg.com" in URLs and Pages, we get a total of 171 images. When doing the same in rec-20210814070623379161-76e848972cc4.warc.gz, there are none in Pages, but 854 in URLs.
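
A count like the one above could also be cross-checked outside the replayweb.page UI. The following is only a minimal sketch using the Python standard library, not how replayweb.page or py-wacz index records. It assumes WARC 1.x headers, counts only records with WARC-Type: response, and could be confused by a payload line that happens to start with "WARC/1.", so the numbers may not match the UI exactly. The function names are made up for this example.

```python
import gzip
from typing import BinaryIO


def count_responses(stream: BinaryIO, prefix: str) -> int:
    """Count WARC response records whose WARC-Target-URI starts with prefix."""
    count = 0
    in_header = False
    rec_type = ""
    uri = ""
    for raw in stream:
        line = raw.rstrip(b"\r\n")
        if not in_header:
            # A version line marks the start of a record's header block.
            if line.startswith(b"WARC/1."):
                in_header, rec_type, uri = True, "", ""
            continue
        if line == b"":
            # A blank line ends the header block; decide whether it counts.
            if rec_type == "response" and uri.startswith(prefix):
                count += 1
            in_header = False
            continue
        name, _, value = line.partition(b":")
        key = name.strip().lower()
        if key == b"warc-type":
            rec_type = value.strip().decode("utf-8", "replace")
        elif key == b"warc-target-uri":
            # Some writers wrap the URI in angle brackets; strip them.
            uri = value.strip().decode("utf-8", "replace").strip("<>")
    return count


def count_in_file(path: str, prefix: str) -> int:
    """Open a .warc.gz (possibly multi-member gzip) and count matching records."""
    with gzip.open(path, "rb") as fh:
        return count_responses(fh, prefix)
```

For example, count_in_file("rec-20210814070623379161-76e848972cc4.warc.gz", "https://img.atwikiimg.com") would give a rough number to compare against what the URLs tab reports.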

ikreymer commented on June 9, 2024

Thanks for reporting, I think I've been able to repro the issue with the images and am taking a look.

ikreymer commented on June 9, 2024

Thanks for reporting. The main issue was that URLs containing www65. were being incorrectly canonicalized (due to a bug in warcio.js), which prevented them from being loaded! This was specific to this URL pattern, and quite an edge case, but glad it was detected.
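
To illustrate how a canonicalization mismatch of this kind can make a stored resource unreachable, here is a hypothetical sketch. It is not the warcio.js code, and the thread does not say where in the URL the www65. appeared or how it was mishandled; the sketch simply assumes one plausible mechanism, where a "strip the www prefix" rule is applied anywhere in the URL instead of only at the start of the host. All function names and the example URL are made up.

```python
import re


def canon_unanchored(url: str) -> str:
    """Hypothetical buggy rule: strips 'www<digits>.' anywhere in the URL."""
    return re.sub(r"www\d*\.", "", url)


def canon_host_only(url: str) -> str:
    """Hypothetical intended rule: strips 'www<digits>.' only at the host start."""
    return re.sub(r"^(https?://)www\d*\.", r"\1", url)


def build_index(urls, canon):
    """Map canonical key -> original URL, as a stand-in for a replay index."""
    return {canon(u): u for u in urls}


def lookup(index, url, canon):
    """Look a URL up by its canonical key; returns None on a miss."""
    return index.get(canon(url))


if __name__ == "__main__":
    # Made-up URL with a www65. segment in the path, not in the host.
    url = "https://img.example.com/www65.example.org/pic.png"
    index = build_index([url], canon_host_only)
    print(lookup(index, url, canon_host_only))   # same rule on both sides: found
    print(lookup(index, url, canon_unanchored))  # different rule at lookup: None
```

The point of the sketch is only that the key used when indexing and the key used when looking up must come from the same canonicalization; if they differ for some URL pattern, the record exists but cannot be found.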

The pages in the WACZ should work now in the latest replayweb.page site.

There's a separate issue about images not loading in the prefix query in the URL tab, which will be fixed in a follow-up update.
There should be no difference in how a page loads via either the WARC or the WACZ. (The page I tested specifically was https://w.atwiki.jp/sinoalice_kousatu/pages/98.html).

(There are also the ad images, which appear to load in Firefox but not Chrome, but that's a separate, complicated issue.)

Let us know if the issue is still happening.

DrMint commented on June 9, 2024

Hi! I've just quickly skimmed through the archive using the latest version, and it seems like everything is loading fine now: awesome! I'm glad this got fixed because other than that, it's the best set of tools I've found for the purpose of archiving and replaying websites. Thanks a bunch!
