Browsertrix Cloud Version v1.6.0-beta.1-4c8de31

[Bug]: Time threshold reached but configuration was unlimited (closed)

gitreich commented on September 27, 2024

[Bug]: Time threshold reached but configuration was unlimited


Comments (5)

gitreich commented on September 27, 2024

So far, the hang/restart problem seems to be gone with crawler version 0.10.2.

The other issue is filed here:
#999


ikreymer commented on September 27, 2024

What version of browsertrix-crawler are you using? The crawler is versioned separately, and the version is printed at the beginning of the log file. We just released 0.10.2 yesterday.
The time threshold is an internal setting: the crawler pod gets restarted after that amount of time, but should continue to run the crawl (this is precisely to avoid the issue of a long-running process that could hang). This value is set in values.yaml:

# max time in seconds after which crawler will restart, if set
crawler_session_time_limit_seconds: 18000

Was that message the last thing you saw in the log? If so, can you share the log file?
Also, if you're using an older version of the crawler, I recommend updating to 0.10.2 by setting:

crawler_image: "webrecorder/browsertrix-crawler:0.10.2"

crawler_pull_policy: "Always"
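
To confirm which of these settings a deployment actually uses, the relevant lines of values.yaml can be checked with a short script. The following is a minimal sketch, not part of Browsertrix: it assumes the three keys appear as flat top-level `key: value` lines, and the sample text (including the older image tag) and the helper names `read_flat_settings`, `image_version`, and `describe` are hypothetical.

```python
import re

# Hypothetical sample standing in for the relevant lines of values.yaml.
SAMPLE_VALUES = '''
# max time in seconds after which crawler will restart, if set
crawler_session_time_limit_seconds: 18000
crawler_image: "webrecorder/browsertrix-crawler:0.9.0-beta.1"
crawler_pull_policy: "IfNotPresent"
'''

def read_flat_settings(text):
    """Collect top-level `key: value` pairs, skipping comments, blanks, and nested lines."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#") or line[:1].isspace():
            continue
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            settings[key.strip()] = value.strip().strip("\"'")
    return settings

def image_version(image):
    """Return the numeric (major, minor, patch) part of the image tag, or None."""
    _, _, tag = image.rpartition(":")
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(int(part) for part in match.groups()) if match else None

def describe(settings, minimum=(0, 10, 2)):
    """Summarize the session limit, crawler version, and pull policy."""
    limit = int(settings.get("crawler_session_time_limit_seconds", 0))
    version = image_version(settings.get("crawler_image", ""))
    return {
        "session_limit_hours": limit / 3600,
        "crawler_version": version,
        "needs_update": version is not None and version < minimum,
        "pull_policy": settings.get("crawler_pull_policy"),
    }

report = describe(read_flat_settings(SAMPLE_VALUES))
print(report)
```

With the default of 18000 seconds, the session limit works out to 5 hours, which is when the restart message would be expected to appear in the log.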


gitreich commented on September 27, 2024

Sorry for missing this information: Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)
values.yaml has the default values, as you expected:
crawler_session_time_limit_seconds: 18000
crawler_pull_policy is set to IfNotPresent

I have attached the last log file from one of these crawls. I think the interesting part is why no new crawler instance was started. The other 3 log files are too large to upload here on GitHub.
The seed was: https://www.parlament.gv.at/
crawl-20230706205043842.log


gitreich commented on September 27, 2024

Ah, and there is another interesting phenomenon, which could be related to this problem of the crawler not restarting:
Scheduled Crawls are not starting at all. Maybe it would make sense to track down this error on my instance first, and retest this problem once scheduled crawls are running again. But I cannot really tell where the error is coming from. In the data directory there is not even an entry for redis-sheduled or crawl-sheduled. Do you have a hint about where a scheduled crawl should be triggered, and whether there is any kind of logging for it?
Here is the result of microk8s.kubectl get all

It looks to me like the job.batch entries are hanging entirely.
Shall I just delete them?
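
One way to see which Jobs are actually stuck is to read the JSON form of the same listing and keep only Jobs that have no `Complete` or `Failed` condition. This is a minimal sketch under assumptions: the job names in the sample are hypothetical, `fetch_jobs_json` needs a working microk8s install and is not called here, and the sketch only lists Jobs without deleting anything.

```python
import json
import subprocess

def fetch_jobs_json():
    """Run kubectl and return its JSON output (requires microk8s on this machine)."""
    result = subprocess.run(
        ["microk8s.kubectl", "get", "jobs", "--all-namespaces", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def unfinished_jobs(kubectl_json):
    """Return (name, active pod count) for Jobs with no Complete/Failed condition."""
    pending = []
    for job in json.loads(kubectl_json).get("items", []):
        status = job.get("status", {})
        finished = any(
            cond.get("type") in ("Complete", "Failed") and cond.get("status") == "True"
            for cond in (status.get("conditions") or [])
        )
        if not finished:
            pending.append((job["metadata"]["name"], status.get("active", 0)))
    return pending

# Hypothetical sample shaped like `kubectl get jobs -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "example-job-a"}, "status": {"active": 1}},
        {"metadata": {"name": "example-job-b"},
         "status": {"succeeded": 1,
                    "conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"name": "example-job-c"}, "status": {}},
    ]
})

stuck = unfinished_jobs(SAMPLE)
print(stuck)
```

On a real cluster, `unfinished_jobs(fetch_jobs_json())` would give the same kind of list. A Job with zero active pods and no terminal condition is a candidate for a closer look with `microk8s.kubectl describe job` before deciding whether to delete it.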


ikreymer commented on September 27, 2024

> Sorry for missing this information; Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)

Please try again with the latest 0.10.2. A lot has changed since the 0.9.0 beta, and hopefully this issue (crawler hanging without a restart) has been fixed.

> Scheduled Crawls are not starting at all.

This is likely a separate problem, so we should probably open a separate issue for it. Is it possible to check whether newly created scheduled crawls run?

