
Comments (4)

neuromantik33 commented on June 1, 2024

We had a bunch of issues with Aeron (and we also run in GKE 😉), and for long-running jobs here are our de facto settings, to be taken with a grain of salt, I might add...

aeron.properties

# Timeout for client liveness in nanoseconds.
aeron.client.liveness.timeout=20000000000

# Timeout for image liveness in nanoseconds.
aeron.image.liveness.timeout=20000000000

# Increase the size of the maximum transmission unit to reduce system calls in a throughput scenario.
aeron.mtu.length=16384

# Set the initial window for flow control to account for BDP.
#aeron.rcv.initial.window.length=2097152

# Increase the size of OS socket receive buffer (SO_RCVBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_rcvbuf=2097152

# Increase the size of OS socket send buffer (SO_SNDBUF) to account for Bandwidth Delay Product (BDP) on a high bandwidth network.
#aeron.socket.so_sndbuf=2097152

# Length (in bytes) of the log buffers for publication terms.
aeron.term.buffer.length=65536

# Use sparse files for the term buffers. (Set this to false to pre-allocate the files and avoid page faults.)
aeron.term.buffer.sparse.file=true

# Disable bound checking to reduce instruction path on private secure networks.
agrona.disable.bounds.checks=true

and we run Aeron as a sidecar container with the /dev/shm directory mounted as an in-memory dir:

...
- args:
- /oscaro/etc/aeron.properties
env:
- name: JAVA_OPTS
  value: -Xmx256m
- name: PROMETHEUS_METRICS_PORT
  value: "8091"
image: eu.gcr.io/oscaro-cloud/oscaro/aeron-driver:1.9.3-e678e95
imagePullPolicy: IfNotPresent
name: aeron
ports:
- containerPort: 40200
  protocol: TCP
- containerPort: 40200
  protocol: UDP
- containerPort: 8091
  name: aeron-metrics
  protocol: TCP
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi
volumeMounts:
- mountPath: /oscaro/etc
  name: config
- mountPath: /dev/shm
  name: aeron
...
volumes:
- configMap:
    name: pipeline
  name: config
- emptyDir:
    medium: Memory
  name: aeron

Apparently, as was mentioned elsewhere, we should not be setting CPU limits on the container. We'll see what happens, but for now it seems relatively stable, even if it is a bit of a dampener in terms of latency. We are unable to set the UDP buffers because we run our cluster on COS, and as of 1.10 it just doesn't allow changing sysctl parameters within the pods.
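For illustration, dropping the CPU limit from the sidecar spec above might look like the fragment below. This is only a sketch, not what is actually deployed: it keeps the CPU request and the memory values from the manifest above and simply removes the `cpu` entry under `limits`. Whether those values suit a given workload is an assumption to verify.

```yaml
resources:
  limits:
    # No cpu entry here: without a CPU limit the container is not
    # subject to CFS quota throttling.
    memory: 512Mi
  requests:
    # The request is kept so the scheduler still reserves CPU for the driver.
    cpu: 250m
    memory: 256Mi
```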

from onyx.

thenonameguy commented on June 1, 2024

Just for reference here are our aeron.properties:

aeron.socket.so_sndbuf=2097152
aeron.socket.so_rcvbuf=2097152
aeron.term.buffer.length=65536
aeron.image.liveness.timeout=10000000000
aeron.conductor.idle.strategy=org.agrona.concurrent.BusySpinIdleStrategy

We also spent countless hours trying to find a good configuration in archived Slack discussions, so this thread is much appreciated.


jgerman commented on June 1, 2024

We took our onyx cluster (well, the 0.14 one) and isolated it into its own pool and dropped the CPU limits. It seems to have done the trick. I didn't want to jinx it over the weekend, but we've been running since Friday afternoon with no Aeron exceptions. Previously we couldn't go 24 hours without the exception and a killed job.

No matter which way you slice it, even if we get the exception today this is a tremendous improvement.
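As a rough sketch of the isolation step, assuming "its own pool" means a dedicated GKE node pool, the pod spec could pin the workload to that pool as shown below. The pool name, taint key, and taint value are hypothetical; only the `cloud.google.com/gke-nodepool` node label is something GKE sets itself.

```yaml
spec:
  nodeSelector:
    # GKE labels every node with the name of its node pool.
    cloud.google.com/gke-nodepool: onyx-pool   # hypothetical pool name
  tolerations:
  # Only needed if the pool's nodes are tainted to keep other workloads off;
  # the key and value here are hypothetical.
  - key: dedicated
    operator: Equal
    value: onyx
    effect: NoSchedule
```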


jgerman commented on June 1, 2024

That's a ton of great information, thanks!

I was reluctant to increase settings like the liveness timeout (beyond our current 10 seconds) because I was afraid we were just masking the issue.

Did you confirm that the cpu throttling is your issue and you're just trying to mitigate at this point?

