Comments (4)
We had a bunch of issues with Aeron (and we also run in GKE 😉). For long-running jobs, here are our de facto settings, to be taken with a grain of salt, I might add...
aeron.properties
# Timeout for client liveness in nanoseconds.
aeron.client.liveness.timeout=20000000000
# Timeout for image liveness in nanoseconds.
aeron.image.liveness.timeout=20000000000
# Increase the size of the maximum transmission unit to reduce system calls in a throughput scenario.
aeron.mtu.length=16384
# Set the initial window for flow control to account for BDP.
#aeron.rcv.initial.window.length=2097152
# Increase the size of OS socket receive buffer (SO_RCVBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_rcvbuf=2097152
# Increase the size of OS socket send buffer (SO_SNDBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_sndbuf=2097152
# Length (in bytes) of the log buffers for publication terms.
aeron.term.buffer.length=65536
# Use sparse files for the term buffers (set to false to pre-allocate them and avoid page faults).
aeron.term.buffer.sparse.file=true
# Disable bounds checking to shorten the instruction path on private, secure networks.
agrona.disable.bounds.checks=true
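To sanity-check the units in a file like this, a short Python sketch can parse the key/value pairs and print the timeouts in seconds and the lengths in KiB. The helper names (`parse_properties`, `describe`) are made up for illustration, and the parser only handles simple `key=value` lines with `#` comments, not the full Java properties syntax.

```python
# Minimal sketch: parse simple key=value properties and convert units.
# parse_properties and describe are hypothetical helpers, not part of Aeron.

SAMPLE = """\
# Timeout for client liveness in nanoseconds.
aeron.client.liveness.timeout=20000000000
aeron.image.liveness.timeout=20000000000
aeron.mtu.length=16384
aeron.term.buffer.length=65536
"""


def parse_properties(text):
    """Return a dict of key -> value, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def describe(props):
    """Render timeouts (nanoseconds) as seconds and lengths (bytes) as KiB."""
    out = {}
    for key, value in props.items():
        if key.endswith(".timeout"):
            out[key] = f"{int(value) / 1e9:g} s"
        elif key.endswith(".length"):
            out[key] = f"{int(value) / 1024:g} KiB"
    return out


if __name__ == "__main__":
    for key, text in describe(parse_properties(SAMPLE)).items():
        print(f"{key}: {text}")
```

With the values above, both liveness timeouts come out at 20 s, the MTU at 16 KiB and the term buffer at 64 KiB.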
We run Aeron as a sidecar container, with the /dev/shm directory mounted as an in-memory directory:
...
- args:
  - /oscaro/etc/aeron.properties
  env:
  - name: JAVA_OPTS
    value: -Xmx256m
  - name: PROMETHEUS_METRICS_PORT
    value: "8091"
  image: eu.gcr.io/oscaro-cloud/oscaro/aeron-driver:1.9.3-e678e95
  imagePullPolicy: IfNotPresent
  name: aeron
  ports:
  - containerPort: 40200
    protocol: TCP
  - containerPort: 40200
    protocol: UDP
  - containerPort: 8091
    name: aeron-metrics
    protocol: TCP
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi
  volumeMounts:
  - mountPath: /oscaro/etc
    name: config
  - mountPath: /dev/shm
    name: aeron
...
volumes:
- configMap:
    name: pipeline
  name: config
- emptyDir:
    medium: Memory
  name: aeron
Apparently, as was mentioned elsewhere, we should not be setting CPU limits on the container. We'll see what happens, but for now it seems relatively stable, even if it is a bit of a damper in terms of latency. We are unable to set the UDP buffers because we run our cluster on COS, and as of 1.10 it just doesn't allow changing sysctl parameters from within the pods.
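Since the socket buffer settings only help if the kernel actually grants them, one way to see what a pod gets is to request a buffer size on a UDP socket and read it back. The sketch below uses Python's standard `socket` module; `probe_udp_buffers` is a hypothetical helper name, and 2097152 is simply the value from the commented-out settings above. On Linux the request is capped by `net.core.rmem_max` and `net.core.wmem_max`, and the kernel reports about double the stored value to account for bookkeeping overhead, so treat the numbers as indicative only.

```python
import socket


def probe_udp_buffers(requested=2097152):
    """Request SO_RCVBUF/SO_SNDBUF on a UDP socket and return what was granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        return {
            "requested": requested,
            "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        }


if __name__ == "__main__":
    result = probe_udp_buffers()
    print(result)
    if result["rcvbuf"] < result["requested"]:
        print("receive buffer was capped below the requested size")
```

Running this inside the pod and on the node itself should show whether the limit comes from the node's kernel settings.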
from onyx.
Just for reference, here are our aeron.properties:
aeron.socket.so_sndbuf=2097152
aeron.socket.so_rcvbuf=2097152
aeron.term.buffer.length=65536
aeron.image.liveness.timeout=10000000000
aeron.conductor.idle.strategy=org.agrona.concurrent.BusySpinIdleStrategy
We also spent countless hours trying to find a good configuration in archived Slack discussions, so this thread is much appreciated.
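For context on the 2097152-byte socket buffers, the bandwidth-delay product (BDP) is the link bandwidth multiplied by the round-trip time, and a buffer needs to be at least that large to keep the link full. The following Python sketch works the arithmetic both ways. The function names and the 1 Gbit/s and 10 Gbit/s figures are assumptions chosen for illustration, not measurements from anyone's cluster.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes for a given bandwidth and round-trip time."""
    return round(bandwidth_bits_per_s * rtt_ms / 1000 / 8)


def max_rtt_ms(buffer_bytes, bandwidth_bits_per_s):
    """Longest round-trip time (ms) a buffer of this size can cover at full rate."""
    return buffer_bytes * 8 * 1000 / bandwidth_bits_per_s


if __name__ == "__main__":
    buffer_bytes = 2097152  # value used for so_sndbuf / so_rcvbuf in this thread
    for label, rate in (("1 Gbit/s", 1_000_000_000), ("10 Gbit/s", 10_000_000_000)):
        print(f"{label}: a {buffer_bytes}-byte buffer covers "
              f"{max_rtt_ms(buffer_bytes, rate):.2f} ms of round-trip time")
    print("BDP at 1 Gbit/s and 16 ms RTT:", bdp_bytes(1_000_000_000, 16), "bytes")
```

If the measured round-trip time between peers is well below the figure printed for your link speed, a 2 MiB buffer is already larger than the BDP.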
from onyx.
We took our Onyx cluster (well, the 0.14 one), isolated it into its own pool, and dropped the CPU limits. It seems to have done the trick. I didn't want to jinx it over the weekend, but we've been running since Friday afternoon with no Aeron exceptions. Previously we couldn't go 24 hours without the exception and a killed job.
No matter which way you slice it, even if we get the exception today, this is a tremendous improvement.
from onyx.
That's a ton of great information, thanks!
I was reluctant to increase settings like the liveness timeout (beyond our current 10 seconds) because I was afraid we were just masking the issue.
Did you confirm that the cpu throttling is your issue and you're just trying to mitigate at this point?
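One way to check whether throttling is actually happening is to read the CFS counters the kernel exposes for the container's cgroup. The Python sketch below parses `cpu.stat` and reports the share of scheduling periods in which the container was throttled. The two file paths are the usual cgroup v2 and v1 locations but are an assumption about the node setup, and the helper names are hypothetical. It also shows what a 500m limit means under the default 100 ms CFS period.

```python
from pathlib import Path

# Candidate locations: cgroup v2 first, then cgroup v1. These are the usual
# paths inside a container, but they are an assumption about the node setup.
CPU_STAT_PATHS = ("/sys/fs/cgroup/cpu.stat", "/sys/fs/cgroup/cpu/cpu.stat")


def parse_cpu_stat(text):
    """Parse 'name value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def cfs_quota_us(millicores, period_us=100_000):
    """CPU time allowed per period for a limit given in millicores."""
    return millicores * period_us // 1000


if __name__ == "__main__":
    print("quota for a 500m limit:", cfs_quota_us(500), "us per 100 ms period")
    for path in CPU_STAT_PATHS:
        p = Path(path)
        if p.is_file():
            stats = parse_cpu_stat(p.read_text())
            print(path, stats, f"throttled in {throttled_ratio(stats):.1%} of periods")
            break
    else:
        print("no cpu.stat found at the assumed paths")
```

A steadily growing `nr_throttled` in the media driver container while the exceptions occur would point at the CPU limit rather than at the liveness timeouts.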
from onyx.