Comments (7)
I have tried again at smaller scale with this command:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --include "31|https:\/\/img\.atwikiimg\.com" --workers 3 --text --generateWACZ
Unfortunately, it's not replicating the problem.
That being said, I also tried:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://w.atwiki.jp/sinoalice_kousatu/pages/31.html --text --generateWACZ --depth 0
and when clicking on a missing image, it cannot even load the live page image, despite the image being available online if we copy/paste the link into a browser.
I'm wondering if this could be a misinterpretation of the URL? I see it's adding "null" at the end.
from replayweb.page.
Thanks for reporting. Could you share the original WARC and WACZ that are loading differently? You can share them via email if you'd like.
The last issue could be something else, so I want to look at one thing at a time. (The 'null' at the end bug was fixed in the hosted version of replayweb.page, and the fix will also be in the next app release, coming shortly.)
Of course, here's a link to the collection it created in my first example. I wish I could have recreated this problem on a smaller crawl, but I haven't been able to make the seed and config file work, so I'm limited to the command line parameters.
The image in the first screenshot was located in rec-20210814070623379161-76e848972cc4.warc.gz
Just for clarity, one problem I had with this crawl is that the wiki uses collapsible sections on most pages (content stays hidden until you click on a + button). The images displayed within those collapsed sections are lazy-loaded, so they are not included in the recording. In order to download them, I've included the /upload/XXX.html pages, which contain a list of all the images included in page XXX. Those images are seemingly always stored on another domain (img.atwikiimg.com).
Another question: is it possible to regenerate a WACZ from the WARCs? Maybe it just got corrupted the first time?
Edit: just found py-wacz, I'll try to do just that.
Thanks for your response!
Okay, so same behavior after using py-wacz, but I think I understand a lot more now.
In the WARCs, these images are categorized in the URLs tab, and in the WACZ, they are in the Pages tab. None of the images work when they are considered pages. There are actually 146 images in the URLs tab from the domain https://img.atwikiimg.com in the resulting WACZ, and they work perfectly. The images are from https://w.atwiki.jp/sinoalice_kousatu/pages/100.html, https://w.atwiki.jp/sinoalice_kousatu/pages/99.html, and https://w.atwiki.jp/sinoalice_kousatu/pages/98.html, and sure enough, those pages have all the images displayed when browsing them.
Also, there are WAY more images in the WARCs than in the resulting WACZ: when searching for "https://img.atwikiimg.com" in URLs and Pages, we get a total of 171 images. When doing the same in rec-20210814070623379161-76e848972cc4.warc.gz, there are none in the Pages, but 854 in the URLs.
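For anyone who wants to cross-check a per-WARC count like this outside the replay UI, here is a rough shell sketch. It assumes the WARC is gzip-compressed, that record headers use the usual "WARC-Type" and "WARC-Target-URI" spellings, and that gzip and awk are available. The function name count_warc_responses is made up for illustration. It scans the record headers only and does not use Content-Length, so treat it as an approximation rather than a real WARC parser.

```shell
# count_warc_responses FILE PREFIX   (hypothetical helper, not part of any tool)
# Prints the number of "response" records whose target URI starts with PREFIX.
count_warc_responses() {
  gzip -dc -- "$1" | LC_ALL=C awk -v prefix="$2" '
    { sub(/\r$/, "") }                          # WARC header lines end in CRLF
    /^WARC\/1\.[01]$/ { type = ""; uri = ""; in_hdr = 1; next }   # start of a record
    in_hdr && /^WARC-Type: /       { type = substr($0, 12) }
    in_hdr && /^WARC-Target-URI: / { uri  = substr($0, 18) }
    in_hdr && $0 == "" {                        # blank line ends the WARC header block
      if (type == "response" && index(uri, prefix) == 1) n++
      in_hdr = 0
    }
    END { print n + 0 }
  '
}
```

It could be called as count_warc_responses rec-20210814070623379161-76e848972cc4.warc.gz https://img.atwikiimg.com. Note that it counts records rather than unique URLs, so the number may not match what the URLs tab reports.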
Thanks for reporting, I think I've been able to repro the issue with the images and am taking a look.
Thanks for reporting. The main issue was that the URLs that had "www65." in them were being incorrectly canonicalized (due to a bug in warcio.js), preventing them from being loaded! This was specific to this URL pattern, and quite an edge case, but I'm glad it was detected.
The pages in the WACZ should work now on the latest replayweb.page site.
There's a separate issue with images not loading in the prefix query in the URLs tab, which will be fixed in a follow-up update.
There should be no difference in how a page loads via either the WARC or the WACZ. (The page I tested specifically was https://w.atwiki.jp/sinoalice_kousatu/pages/98.html.)
(There are also the ad images, which appear to load in Firefox but not Chrome, but that's a separate, complicated issue.)
Let us know if the issue is still happening.
Hi! I've just quickly skimmed through the archive using the latest version, and it seems like everything is loading fine now: awesome! I'm glad this got fixed because other than that, it's the best set of tools I've found for the purpose of archiving and replaying websites. Thanks a bunch!