Comments (5)
So far, the hang/restart problem seems to be gone with crawler version 0.10.2.
The other issue, 999, is filed here:
#999
from browsertrix.
What version of browsertrix-crawler are you using? The crawler is versioned separately, and the version is printed at the beginning of the log file. We released 0.10.2 yesterday.
The time threshold is an internal setting: the crawler pod is restarted after that amount of time, but it should continue to run the crawl (this is precisely to avoid the issue of a long-running process that could hang). This value is set in values.yaml:
# max time in seconds after which crawler will restart, if set
crawler_session_time_limit_seconds: 18000
Was that message the last thing you saw in the log? If so, can you share the log file?
Also, if you're using an older version of the crawler, we recommend updating to 0.10.2 by setting
crawler_image: "webrecorder/browsertrix-crawler:0.10.2"
or
crawler_pull_policy: "Always"
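For reference, the settings mentioned above could sit together in values.yaml roughly as follows. This is only a sketch: the keys and values are the ones quoted in this comment, while the grouping and the explanatory comments are illustrative, and the rest of the file is omitted.

```yaml
# values.yaml (fragment; all other settings omitted)

# max time in seconds after which crawler will restart, if set
# (18000 seconds = 5 hours)
crawler_session_time_limit_seconds: 18000

# Option 1: pin the crawler image to the 0.10.2 release
crawler_image: "webrecorder/browsertrix-crawler:0.10.2"

# Option 2 (alternative to pinning): always re-pull the configured image
# crawler_pull_policy: "Always"
```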
Sorry for missing this information; Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)
values.yaml has the default value set, as you expected:
crawler_session_time_limit_seconds: 18000
crawler_pull_policy is set to IfNotPresent.
I am attaching the last log file from one of these crawls. I think the interesting part is why no new crawler instance was started. The other 3 log files are too large to upload here on GitHub.
The seed was: https://www.parlament.gv.at/
crawl-20230706205043842.log
Ah, and there is another interesting phenomenon, which could be related to this problem of the crawler not restarting:
Scheduled Crawls are not starting at all. Maybe it would make sense to track down this error on my instance first, and then retest this problem once scheduled crawls are running again. But I cannot really tell where the error is coming from. In the data directory there is not even an entry for redis-sheduled or crawl-sheduled. Do you have a hint about where a scheduled crawl should be triggered, and whether there is any kind of logging for it?
Here is the result of microk8s.kubectl get all
It looks to me like job.batch is hanging entirely.
Shall I just delete them?
Sorry for missing this information; Browsertrix-Crawler 0.9.0-beta.1 (with warcio.js 1.6.2 pywb 2.7.3)
Please try again with latest 0.10.2 - a lot has changed since 0.9.0 beta, and hopefully this issue (crawler hanging w/o restart) has been fixed.
Scheduled Crawls are not starting at all.
This is likely a separate problem, and we should probably track it in its own issue. Is it possible to check whether newly created scheduled crawls run?