openmessaging-benchmark's People

Contributors

ballard26, bewaremypower, bharathv, cbornet, cdbartholomew, codelipenghui, dave2wave, dependabot[bot], eladleev, emaxerrno, eolivelli, gousteris, hcoyote, hello-ming, hellojungle, hscarb, lucperkins, merlimat, patrickangeles, rkruze, rockwotj, rystsov, sijie, tmgstevens, travisdowns, vongosling, voutilad, ymwneu, zhaijack, zhangjidi2016


openmessaging-benchmark's Issues

Small message sizes are handled inefficiently in the client

The Java producer client is particularly inefficient at filling the buffer with small messages at high volumes. (To reproduce, attempt to send 1.5 GB/sec of 100-byte messages with any reasonable number of client nodes and memory configuration.)

Client threads also appear to abort without notice, and the benchmark job does not properly detect that they have stopped sending traffic: it neither reports it, takes corrective action, nor aborts.
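
A minimal sketch of the kind of stall detection being asked for, assuming the worker exposes a per-producer counter of sent messages (the counter, check interval, and abort behavior here are hypothetical, not the benchmark's actual internals):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ProducerStallWatchdog {
    private final AtomicLong messagesSent; // hypothetical counter incremented by the producer send callback
    private long lastObservedCount = -1;

    public ProducerStallWatchdog(AtomicLong messagesSent) {
        this.messagesSent = messagesSent;
    }

    // Periodically check whether the counter has advanced; if not, report the stall and abort.
    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long current = messagesSent.get();
            if (current == lastObservedCount) {
                System.err.println("Producer appears stalled: no messages sent in the last 30s, aborting");
                System.exit(1); // or surface the failure to the benchmark coordinator instead
            }
            lastObservedCount = current;
        }, 30, 30, TimeUnit.SECONDS);
    }
}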

All topics are deleted by default

reset is true by default, and this causes all topics to be deleted, not just topics created by this tool. It would be better if only topics that were created by this tool were deleted.

A few snippets where a change would be required:
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-redpanda/src/main/java/io/openmessaging/benchmark/driver/redpanda/RedpandaBenchmarkDriver.java#L91-L102
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-kafka/src/main/java/io/openmessaging/benchmark/driver/kafka/KafkaBenchmarkDriver.java#L91-L101
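
A minimal sketch of one possible fix, assuming the benchmark's topics can be identified by a shared name prefix (the prefix constant and helper class here are illustrative, not the driver's actual code):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;

public class TopicCleanup {
    // Assumption: benchmark topics share a recognizable prefix, as seen in the test-topic-* names elsewhere in this tracker.
    private static final String BENCHMARK_TOPIC_PREFIX = "test-topic-";

    // Delete only topics that match the benchmark's naming scheme, leaving pre-existing topics alone.
    public static void deleteBenchmarkTopics(AdminClient admin) throws Exception {
        List<String> benchmarkTopics = admin.listTopics().names().get().stream()
                .filter(name -> name.startsWith(BENCHMARK_TOPIC_PREFIX))
                .collect(Collectors.toList());
        if (!benchmarkTopics.isEmpty()) {
            admin.deleteTopics(benchmarkTopics).all().get();
        }
    }
}

Tracking the exact topic names created during the run (rather than matching a prefix) would be safer still, at the cost of persisting that list across the create and cleanup phases.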

Benchmarking producer client issue with high-volume (1.5 million/sec), small (100-byte) messages

While recently attempting to generate 1.5 GB/sec of traffic to an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we encountered difficulties with the OMB producers. Throughput falls apart somewhere around 1 to 1.2 million messages per second. The same cluster handles 1.5 GB/sec with 1024-byte messages with excellent performance, but 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.

@travisdowns has additional details about the nature of these failures that he can add to this issue.

Additional experiments using other client technologies are in progress by @larsenpanda.

Ultimately we need to update our client so that these types of workloads succeed out of the box, and document the client's limits so customers know the upper boundaries of the producers and consumers and do not infer poor performance on Redpanda's part.

Lack of instance_type setting should not cause critical failure

This code introduces a regression: people using already-existing hosts files receive a critical failure because instance_type is not set. A default value should have been added here as an option. (Blame: @tmgstevens)

line: "name: Redpanda{{ '+SASL' if sasl_enabled | default(False) | bool == True }}{{ '+TLS' if tls_enabled | default(False)|bool == True }}+{{ groups['redpanda'] | length }}x{{ instance_type }}"

Dockerfile.build needs -Dlicense.skip=true

I believe the mvn install command inside the Dockerfile.build file is incomplete based on errors I got while building it.

#10 2.830 [ERROR] Failed to execute goal com.mycila:license-maven-plugin:3.0:check (default) on project messaging-benchmark: Some files do not have the expected license header -> [Help 1]
#10 2.830 [ERROR]
#10 2.830 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
#10 2.830 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
#10 2.830 [ERROR]
#10 2.830 [ERROR] For more information about the errors and possible solutions, please read the following articles:
#10 2.830 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

I think it needs to be
mvn install -Dlicense.skip=true

which is consistent with the Maven install command given elsewhere in the repo, and allows the Docker image to build successfully. It could probably also use an appropriate image tag to make it a little easier for users to spin up.

Errors when running Redpanda benchmark tests

Greetings,
I'm running Redpanda locally on Ubuntu via Docker. I tried to run the benchmark tests with the command:
sudo bin/benchmark -d driver-redpanda/redpanda-ack-all-group-linger-10ms.yaml \
  workloads/blog/1-topic-100-partitions-1kb-4-producers-500k-rate.yaml
I just changed partition count to 1, since I only have one node/broker.

I have a few issues. It would be great if you could point me in the right direction.

After running the command, I see this kind of error:
(screenshot: IndexOutOfBoundsException stack trace)

  1. What might be causing this IndexOutOfBoundsException, and how can I fix it?
  2. I see this kind of stats output being printed periodically. What should the final outcome of the test be?
     (screenshot: periodic stats output)
  3. I set the test duration to 1 minute, but it never finishes. I see a bunch of the errors mentioned above. Could that be why the test does not finish?

@rkruze Can you share your thoughts about it?

Better error messages needed when only 1 client worker deployed

03:20:20.240 [main] INFO - Using DistributedWorkersEnsemble workers topology
Exception in thread "main" java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
	at io.openmessaging.benchmark.worker.DistributedWorkersEnsemble.<init>(DistributedWorkersEnsemble.java:71)
	at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:158)

DistributedWorkersEnsemble throws this arg error if there happens to be only one configured worker.

Swarm at least throws a useful error message telling you it requires more than 1 worker.

03:27:43.539 [main] INFO - Using SwarmWorker workers topology
Exception in thread "main" java.lang.IllegalArgumentException: Workers must be > 1
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at io.openmessaging.benchmark.worker.SwarmWorker.<init>(SwarmWorker.java:119)
	at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:161)

The Terraform setup should probably also throw an error if the number of clients is less than 2.
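
A minimal sketch of the kind of message that would help, assuming the check happens on the worker list passed to the DistributedWorkersEnsemble constructor (the helper class, method, and wording here are illustrative, not the actual code):

import com.google.common.base.Preconditions;
import java.util.List;

public class WorkerCountCheck {
    // Fail with an explanatory message instead of a bare IllegalArgumentException.
    public static void requireAtLeastTwoWorkers(List<String> workers) {
        Preconditions.checkArgument(workers.size() > 1,
                "Distributed workers topology requires more than 1 worker, but only %s were configured",
                workers.size());
    }
}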

Kafka Benchmark Tests Erroring out

Hi,

I'm getting the same errors as described in this closed issue. Per the closed issue, I updated the org.hdrhistogram.HdrHistogram dependency to 2.1.12. I also had to comment out the com.mycila license-maven-plugin in the parent pom, as the build would complain about various files missing the license agreement wording. I ran mvn install and everything built successfully. I then launched two workers locally (on different ports) and then the driver locally using Kafka and the workload 1-topic-1-partition-100b.yaml, which is set to run for 15 minutes. After about 2 minutes, the producer starts dumping out WARN-level messages similar to the linked issue:

09:03:14.712 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) [?:?]
        at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:49) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
        at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:228) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:653) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:634) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:554) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.lambda$sendProduceRequest$0(Sender.java:743) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:566) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:558) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
        at org.HdrHistogram.AbstractHistogram.handleRecordException(AbstractHistogram.java:571) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:563) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at io.openmessaging.benchmark.worker.LocalWorker.lambda$submitProducersToExecutor$8(LocalWorker.java:266) ~[classes/:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:783) ~[?:?]

I am on JDK 11. Due to this error, the Pub rate starts dropping until it hits zero. I also noticed these errors in the producer:

09:04:48.734 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch creation
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
        at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:47) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
        at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:231) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:676) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:381) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch

Any guidance on how I might be able to get this sample workload to pass?

Thanks!
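
For reference, the histogram error above means a recorded latency value (67,283,144, roughly 67 seconds if the unit is microseconds) exceeded the Recorder's highest trackable value. A minimal sketch of how HdrHistogram's range could be widened, assuming microsecond latencies; the range, precision, and clamping policy here are assumptions, not the benchmark's actual settings:

import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class LatencyRecording {
    // Assumption: latencies are recorded in microseconds; allow up to 5 minutes
    // so a ~67-second outlier no longer overflows the covered range.
    private static final long HIGHEST_TRACKABLE_VALUE = TimeUnit.MINUTES.toMicros(5);
    private final Recorder recorder = new Recorder(HIGHEST_TRACKABLE_VALUE, 5);

    public void recordLatency(long latencyMicros) {
        // Clamp instead of letting recordValue throw ArrayIndexOutOfBoundsException on outliers.
        recorder.recordValue(Math.min(latencyMicros, HIGHEST_TRACKABLE_VALUE));
    }

    public Histogram snapshot() {
        return recorder.getIntervalHistogram();
    }
}

Constructing the Recorder with only a significant-digits argument (so it auto-resizes) is another option, at the cost of unbounded memory growth for extreme outliers.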

Prometheus and Grafana ports should not be accessible from any address, only myip

#Prometheus/Dashboard access
ingress {
  from_port   = 9090
  to_port     = 9090
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}
ingress {
  from_port   = 3000
  to_port     = 3000
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

These should probably be locked down to the same address used for general access to the security group from the tester's home node, for example:

 cidr_blocks = ["${chomp(data.http.myip.body)}/32"]

Ansible Galaxy node_exporter download fails intermittently

This issue crops up quite regularly on larger build-outs. The workaround is to rerun. We should consider adding some internal retries.

TASK [geerlingguy.node_exporter : Download and unarchive node_exporter into temporary location.] ***********************************************************************************************************************
fatal: [35.247.77.147]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
fatal: [35.247.13.169]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.247.15.158]
changed: [34.82.201.249]
changed: [34.127.49.156]
fatal: [34.127.124.253]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.233.238.241]
changed: [34.145.49.5]
changed: [35.230.12.29]
changed: [34.82.118.174]
changed: [35.230.87.240]

Document producer workload tunings for high-volume producer configs in driver-redpanda

https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1694729911166379

We need to document some additional OMB workload configuration details for high-volume testing.

There are a few items in this thread about how to get the producer to keep up with the expected rates:

  1. Possible quirks with the key distributor not keeping up with the expected rate. Random Nano seems to act weird in high-rate setups across many producers and partition spreads; NoKey and Round Robin seem to keep up. This could be related to the next item.
  2. For high-volume produce rates (>1 million/s) across many producers (tens) going to many partitions (thousands), the Java client may also need buffer.memory significantly increased to handle the amount of data being generated in the batch (see the config sketch after this discussion).

In an example test, @travisdowns calculated that for 1.8M messages/sec on 10 topics with thousands of partitions each, coming from ~100 producers, the buffer size needed was likely 3-4x larger than what we were setting in the test (around 32-33 MB).

2300 partitions per topic * 32000 batch size = 73.6 MB

According to the Java client docs, when using larger batch sizes:

A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records.

So we may not have been able to fill the batches due to buffer limits in the original tests.
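
A minimal sketch of the kind of producer settings under discussion, assuming the Kafka Java client; the specific numbers are illustrative, not recommended values:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class HighVolumeProducerConfig {
    public static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // batch.size: 32 KB batches, as in the tests described above.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32_000);
        // buffer.memory: sized so one in-flight batch per partition fits,
        // e.g. 2300 partitions * 32000 bytes ≈ 73.6 MB, rounded up with headroom.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128L * 1024 * 1024);
        // linger.ms: give small messages a chance to fill batches before sending.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        return props;
    }
}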
