License: Apache License 2.0
The Java producer client is particularly inefficient at filling the buffer with small messages at high volumes. (To reproduce, try to send 1.5 GB/sec of 100-byte messages with any reasonable number of client nodes and memory configuration.)
Client threads also appear to abort without notice, and the benchmark job does not detect that they have stopped sending traffic; it neither reports the failure, takes corrective action, nor aborts.
`result` is true by default, which causes all topics to be deleted (not just topics created by this tool). It would be better if only topics created by this tool were deleted.
A few snippets where a change would be required:
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-redpanda/src/main/java/io/openmessaging/benchmark/driver/redpanda/RedpandaBenchmarkDriver.java#L91-L102
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-kafka/src/main/java/io/openmessaging/benchmark/driver/kafka/KafkaBenchmarkDriver.java#L91-L101
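One way the snippets above could be changed is to filter the cluster's topic list by the benchmark's own naming prefix before issuing the delete. A minimal sketch of the filtering step (the `test-topic` prefix is assumed from logs elsewhere on this page; the resulting set would be handed to the admin client's delete call):

```java
// Hypothetical sketch: select only topics created by this tool, by prefix,
// instead of deleting every topic on the cluster.
import java.util.Set;
import java.util.stream.Collectors;

public class TopicCleanup {
    static Set<String> topicsToDelete(Set<String> allTopics, String benchmarkPrefix) {
        return allTopics.stream()
                .filter(name -> name.startsWith(benchmarkPrefix))
                .collect(Collectors.toSet());
    }
}
```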
Recently, in attempting to drive 1.5 GB/sec to an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we have encountered difficulties with the OMB producers. Messages per second falls apart somewhere around 1 to 1.2 million messages per second. The same cluster handles 1.5 GB/sec with 1024-byte messages with excellent performance; 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.
@travisdowns has additional details around the nature of these failures he can add to this issue.
Additional experiments using other client technologies are in progress by @larsenpanda.
Ultimately we need to update our client configuration so that these types of workloads succeed out of the box, and document the limits of the client so customers know the upper boundaries of the producers and consumers and do not infer poor performance by Redpanda.
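A back-of-envelope check of the numbers in this issue (the ~1.2M msgs/sec per-producer ceiling is the observed fall-off point above, not a documented limit):

```java
// Back-of-envelope math from the issue's numbers: 1.5 GB/s of 100-byte
// messages is 15M msgs/sec; at an observed ~1.2M msgs/sec per producer,
// at least 13 producers are needed just to hit the target rate.
public class ThroughputMath {
    static long messagesPerSecond(long bytesPerSec, int msgSizeBytes) {
        return bytesPerSec / msgSizeBytes;
    }

    static long producersNeeded(long msgsPerSec, long perProducerCeiling) {
        // ceiling division
        return (msgsPerSec + perProducerCeiling - 1) / perProducerCeiling;
    }
}
```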
This code introduces a regression: people using already-existing hosts files hit a critical failure because instance_type is not set. A default value should have been added here as an option. (Blame: @tmgstevens)
line: "name: Redpanda{{ '+SASL' if sasl_enabled | default(False) | bool == True }}{{ '+TLS' if tls_enabled | default(False)|bool == True }}+{{ groups['redpanda'] | length }}x{{ instance_type }}"
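A hedged fix for the line above, assuming a Jinja2 `default` filter is acceptable here (the `'unknown'` placeholder value is my assumption, not from the repo):

```yaml
# Sketch: fall back to a placeholder when instance_type is undefined, so
# pre-existing hosts files that omit it don't hard-fail.
line: "name: Redpanda{{ '+SASL' if sasl_enabled | default(False) | bool == True }}{{ '+TLS' if tls_enabled | default(False)|bool == True }}+{{ groups['redpanda'] | length }}x{{ instance_type | default('unknown') }}"
```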
Recent updates to the repo have introduced a requirement for ansible.posix, but it was not added to the requirements file.
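A sketch of the missing entry, assuming the repo declares collections via an Ansible Galaxy requirements file (filename assumed):

```yaml
# requirements.yml (filename assumed): declare the collection dependency
collections:
  - name: ansible.posix
```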
Follow-up on this PR's comment: I believe the `mvn install` command inside the Dockerfile.build file is incomplete, based on errors I got while building it.
#10 2.830 [ERROR] Failed to execute goal com.mycila:license-maven-plugin:3.0:check (default) on project messaging-benchmark: Some files do not have the expected license header -> [Help 1]
#10 2.830 [ERROR]
#10 2.830 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
#10 2.830 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
#10 2.830 [ERROR]
#10 2.830 [ERROR] For more information about the errors and possible solutions, please read the following articles:
#10 2.830 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
I think it needs to be `mvn install -Dlicense.skip=true`, which is consistent with the Maven install command given elsewhere in the repo and allows the Docker image to build successfully. It could probably also use an appropriate image tag to make it a little easier for users to spin up.
Greetings,
I'm running Redpanda locally on Ubuntu via Docker. I tried to run benchmark tests via the command:
sudo bin/benchmark -d driver-redpanda/redpanda-ack-all-group-linger-10ms.yaml workloads/blog/1-topic-100-partitions-1kb-4-producers-500k-rate.yaml
I just changed partition count to 1, since I only have one node/broker.
I have a few issues; it would be great if you could point me in the right direction.
1) After running the command, I see errors like this. What might be causing this IndexOutOfBoundsException, and how can I fix it?
2) I see these kinds of stats printed periodically. What should the final outcome of the test be?
@rkruze Can you share your thoughts about it?
03:20:20.240 [main] INFO - Using DistributedWorkersEnsemble workers topology
Exception in thread "main" java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
at io.openmessaging.benchmark.worker.DistributedWorkersEnsemble.<init>(DistributedWorkersEnsemble.java:71)
at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:158)
DistributedWorkersEnsemble throws this argument error if there happens to be only one configured worker.
SwarmWorker, at least, throws a useful error message telling you it requires more than one worker.
03:27:43.539 [main] INFO - Using SwarmWorker workers topology
Exception in thread "main" java.lang.IllegalArgumentException: Workers must be > 1
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
at io.openmessaging.benchmark.worker.SwarmWorker.<init>(SwarmWorker.java:119)
at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:161)
The Terraform config should probably also throw an error if the number of clients is less than 2.
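A minimal sketch of the kind of check that would help here (not the actual OMB code; class and message mirror the SwarmWorker behavior quoted above):

```java
// Hypothetical sketch: validate the worker list up front with a descriptive
// message, instead of a bare IllegalArgumentException from a precondition.
import java.util.List;

public class WorkersCheck {
    static void requireMultipleWorkers(List<String> workers) {
        if (workers.size() <= 1) {
            throw new IllegalArgumentException(
                    "Workers must be > 1, got " + workers.size()
                            + " (DistributedWorkersEnsemble needs at least one driver and one worker)");
        }
    }
}
```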
Hi,
I'm getting the same errors as described in this closed issue. Per that issue, I updated the `org.hdrhistogram.HdrHistogram` dependency to 2.1.12. I also had to comment out the `com.mycila` license-maven-plugin in the parent pom, as the build would complain about various files missing license-agreement wording. I ran `mvn install` and everything built successfully. I then launched two workers locally (on different ports) and then the driver locally using Kafka and the workload `1-topic-1-partition-100b.yaml`, which is set to run for 15 minutes. After about 2 minutes, the producer starts dumping out WARN-level messages similar to the linked issue:
09:03:14.712 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) [?:?]
at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:49) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:228) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:653) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:634) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:554) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.lambda$sendProduceRequest$0(Sender.java:743) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:566) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:558) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
at org.HdrHistogram.AbstractHistogram.handleRecordException(AbstractHistogram.java:571) ~[HdrHistogram-2.1.12.jar:2.1.12]
at org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:563) ~[HdrHistogram-2.1.12.jar:2.1.12]
at org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) ~[HdrHistogram-2.1.12.jar:2.1.12]
at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) ~[HdrHistogram-2.1.12.jar:2.1.12]
at io.openmessaging.benchmark.worker.LocalWorker.lambda$submitProducersToExecutor$8(LocalWorker.java:266) ~[classes/:?]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:783) ~[?:?]
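The trace above shows a latency sample (67,283,144 µs, about 67 s) landing outside the histogram's covered range. One workaround sketch, not OMB's actual fix, is to clamp samples to the highest trackable value before recording (the 60 s ceiling here is an assumption; HdrHistogram also supports auto-resizing histograms, which avoids the clamp entirely):

```java
// Hypothetical workaround: clamp latency samples to the histogram's highest
// trackable value before calling recordValue, so out-of-range values don't
// throw ArrayIndexOutOfBoundsException. The 60 s ceiling is assumed.
public class LatencyClamp {
    static final long HIGHEST_TRACKABLE_MICROS = 60_000_000L; // assumed ceiling

    static long clamp(long micros) {
        return Math.min(micros, HIGHEST_TRACKABLE_MICROS);
    }
}
```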
I am on JDK 11. Due to this error, the `Pub rate` starts dropping until it hits zero. I also noticed these errors in the producer:
09:04:48.734 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch creation
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:47) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:231) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:676) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:381) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324) [kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch
Any guidance on how I might be able to get this sample workload to pass?
Thanks!
openmessaging-benchmark/driver-redpanda/deploy/provision-redpanda-aws.tf
Lines 176 to 188 in 8411e4a
This should probably be locked down to the same address used for general access to the security group from the tester's home node:
cidr_blocks = ["${chomp(data.http.myip.body)}/32"]
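In Terraform terms, the change would look roughly like this; only the `cidr_blocks` line is from the file, and the port range and block position are assumptions for illustration:

```hcl
# Sketch: scope ingress to the tester's current public IP instead of 0.0.0.0/0
# (data.http.myip is assumed to already be defined in the existing config).
ingress {
  from_port   = 0
  to_port     = 65535
  protocol    = "tcp"
  cidr_blocks = ["${chomp(data.http.myip.body)}/32"]
}
```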
This issue crops up quite regularly on larger build-outs. The workaround is to rerun. We should consider adding some internal retries.
TASK [geerlingguy.node_exporter : Download and unarchive node_exporter into temporary location.] ***********************************************************************************************************************
fatal: [35.247.77.147]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
fatal: [35.247.13.169]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.247.15.158]
changed: [34.82.201.249]
changed: [34.127.49.156]
fatal: [34.127.124.253]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.233.238.241]
changed: [34.145.49.5]
changed: [35.230.12.29]
changed: [34.82.118.174]
changed: [35.230.87.240]
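A sketch of how the flaky download could be retried in Ansible; the task name is from the log above, but the module, URL handling, and retry counts are assumptions (the change would ultimately belong in the geerlingguy.node_exporter role or a wrapper around it):

```yaml
# Sketch: retry transient "Connection reset by peer" failures on the download
# task instead of failing the whole play (retry/delay values are assumed).
- name: Download and unarchive node_exporter into temporary location.
  ansible.builtin.unarchive:
    src: "https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: true
  register: download_result
  retries: 5
  delay: 10
  until: download_result is succeeded
```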
https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1694729911166379
Need to document some additional OMB workload configuration detail for high volume testing.
There are a few items in this thread discussing how to get the producer to keep up with the expected rates.
In an example test, @travisdowns calculated that for 1.8M messages/sec across 10 topics with thousands of partitions each, coming from ~100 producers, the buffer size needed was likely 3-4x larger than what we were setting in the test (around 32-33 MB).
2300 partitions per topic * 32000 batch size = 73.6 MB
According to the Java client docs, when using larger batch sizes:
"A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records."
So we may not have been able to fill the batches due to buffer limits in the original tests.
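The arithmetic above can be sketched as a small helper: with 2300 partitions per topic and a 32 KB batch size, one in-flight batch per partition alone needs about 73.6 MB, well over the producer's default ~32 MB `buffer.memory`:

```java
// Back-of-envelope helper for the sizing above: buffer.memory should cover
// roughly one in-flight batch per partition (numbers from the thread, not measured).
public class BufferSizing {
    static long requiredBufferBytes(int partitions, int batchSizeBytes) {
        return (long) partitions * batchSizeBytes;
    }
}
```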