
Comments (5)

gitreich commented on September 27, 2024

So far, the hang/restart problem seems to be gone with crawler version 0.10.2.

The other issue is filed here:
#999

from browsertrix.

ikreymer commented on September 27, 2024

What version of browsertrix-crawler are you using? The crawler is versioned separately, and the version is printed at the beginning of the log file. We released 0.10.2 yesterday.
The time threshold is an internal setting: the crawler pod is restarted after that amount of time, but the crawl itself should continue to run (this is precisely to avoid a long-running process that could hang). This value is set in values.yaml:

# max time in seconds after which crawler will restart, if set
crawler_session_time_limit_seconds: 18000
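
To illustrate the idea behind this setting, here is a minimal Python sketch of a crawl that is restarted whenever a session reaches its time limit while the pending work survives the restart. It is only a simulation under stated assumptions, not Browsertrix's actual implementation: the function name `run_with_session_limit`, the in-memory queue standing in for externally stored crawl state, and the fixed per-page duration are all hypothetical.

```python
from collections import deque


def run_with_session_limit(pages, session_limit_seconds, seconds_per_page):
    """Simulate a crawl whose worker is restarted once a session hits its time limit.

    Hypothetical model: `queue` and `done` stand in for crawl state that is kept
    outside the worker, so a restart does not lose any progress.
    """
    queue = deque(pages)  # pending pages, preserved across restarts
    done = []             # finished pages, preserved across restarts
    sessions = 0
    while queue:
        sessions += 1     # a fresh worker session starts here
        elapsed = 0       # simulated seconds spent in this session
        while queue and elapsed < session_limit_seconds:
            done.append(queue.popleft())
            elapsed += seconds_per_page
        # Limit reached (or queue empty): the session ends, the state remains.
    return done, sessions


pages = [f"page-{i}" for i in range(10)]
done, sessions = run_with_session_limit(pages, 18000, 6000)
print(len(done), sessions)  # 10 pages finished across 4 sessions
```

In this model a restart only ends the current session, so every page is still crawled exactly once and in order.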

Was that message the last thing you saw in the log? If so, can you share the log file?
Also, if you're using an older version of the crawler, I recommend updating to 0.10.2 by setting

crawler_image: "webrecorder/browsertrix-crawler:0.10.2"

or

crawler_pull_policy: "Always"
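
When checking whether a running crawler is older than 0.10.2, note that comparing version strings as plain text gives the wrong answer ("0.9.0" sorts after "0.10.2" alphabetically). A minimal Python sketch of a numeric comparison follows; the helper names `parse_version` and `is_older` are hypothetical, the version format is assumed to be MAJOR.MINOR.PATCH with an optional pre-release suffix, and pre-release labels are compared only as plain strings, which is a simplification.

```python
import re


def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH[-prerelease]' into a tuple that sorts correctly."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-(.+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch, pre = match.groups()
    # A pre-release (e.g. 'beta.1') sorts before the final release with the same numbers.
    return (int(major), int(minor), int(patch), 0 if pre else 1, pre or "")


def is_older(current, target):
    """Return True if `current` is a strictly older version than `target`."""
    return parse_version(current) < parse_version(target)


print(is_older("0.9.0-beta.1", "0.10.2"))  # True
print("0.9.0" < "0.10.2")                  # False: plain string comparison is misleading
```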


gitreich commented on September 27, 2024

Sorry for missing this information; Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)
values.yaml has the default values, as you expected:
crawler_session_time_limit_seconds: 18000
crawler_pull_policy is set to IfNotPresent

I'm attaching the last log file from one of these crawls. I think the interesting part is why no new crawler instance was started. The other 3 log files are too large to upload here on GitHub.
The seed was: https://www.parlament.gv.at/
crawl-20230706205043842.log


gitreich commented on September 27, 2024

Ah, and there is another interesting phenomenon that could be related to the crawler not restarting:
Scheduled Crawls are not starting at all. Maybe it would make sense to track down this error on my instance first, and retest this problem once scheduled crawls are running again. But I cannot really tell where the error is coming from. In the data directory there is not even an entry for redis-sheduled or crawl-sheduled. Do you have a hint as to where a scheduled crawl should be triggered, and whether there is any logging for it?
Here is the result of microk8s.kubectl get all (attached as a screenshot).

It looks to me like the job.batch entries are hanging entirely.
Shall I just delete them?


ikreymer commented on September 27, 2024

Sorry for missing this information; Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)

Please try again with latest 0.10.2 - a lot has changed since 0.9.0 beta, and hopefully this issue (crawler hanging w/o restart) has been fixed.

Scheduled Crawls are not starting at all.

This is likely a separate problem, and we should probably open a separate issue for it. Is it possible to check whether newly created scheduled crawls run?
