openmessaging-benchmark's People

Contributors

ballard26, bewaremypower, bharathv, cbornet, cdbartholomew, codelipenghui, dave2wave, dependabot[bot], eladleev, emaxerrno, eolivelli, gousteris, hcoyote, hello-ming, hellojungle, hscarb, lucperkins, merlimat, patrickangeles, rkruze, rockwotj, rystsov, sijie, tmgstevens, travisdowns, vongosling, voutilad, ymwneu, zhaijack, zhangjidi2016


openmessaging-benchmark's Issues

Small message sizes are handled inefficiently in the client

The Java producer client is particularly inefficient at filling the buffer with small messages at high volumes. (To reproduce, attempt to send 1.5 GB/sec of 100-byte messages with any reasonable number of client nodes and memory configuration.)

Client threads also appear to abort without notice, and the benchmark job does not properly detect that they have stopped sending traffic: it neither reports it, takes corrective action, nor aborts.
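
A minimal sketch of the kind of stall detection being asked for, assuming the worker exposes a per-producer counter of sent messages (the counter, check interval, and abort behavior here are hypothetical, not the benchmark's actual internals):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ProducerStallWatchdog {
    private final AtomicLong messagesSent; // hypothetical counter incremented by the producer send callback
    private long lastObservedCount = -1;

    public ProducerStallWatchdog(AtomicLong messagesSent) {
        this.messagesSent = messagesSent;
    }

    // Periodically check whether the counter has advanced; if not, report the stall and abort.
    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long current = messagesSent.get();
            if (current == lastObservedCount) {
                System.err.println("Producer appears stalled: no messages sent in the last 30s, aborting");
                System.exit(1); // or surface the failure to the benchmark coordinator instead
            }
            lastObservedCount = current;
        }, 30, 30, TimeUnit.SECONDS);
    }
}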

All topics are deleted by default

reset is true by default, and this causes all topics to be deleted, not just topics created by this tool. It would be better if only topics that were created by this tool were deleted.

A few snippets where a change would be required:
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-redpanda/src/main/java/io/openmessaging/benchmark/driver/redpanda/RedpandaBenchmarkDriver.java#L91-L102
https://github.com/redpanda-data/openmessaging-benchmark/blob/main/driver-kafka/src/main/java/io/openmessaging/benchmark/driver/kafka/KafkaBenchmarkDriver.java#L91-L101
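
A minimal sketch of one possible fix, assuming the benchmark's topics can be identified by a shared name prefix (the prefix constant and helper class here are illustrative, not the driver's actual code):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;

public class TopicCleanup {
    // Assumption: benchmark topics share a recognizable prefix, as seen in the test-topic-* names elsewhere in this tracker.
    private static final String BENCHMARK_TOPIC_PREFIX = "test-topic-";

    // Delete only topics that match the benchmark's naming scheme, leaving pre-existing topics alone.
    public static void deleteBenchmarkTopics(AdminClient admin) throws Exception {
        List<String> benchmarkTopics = admin.listTopics().names().get().stream()
                .filter(name -> name.startsWith(BENCHMARK_TOPIC_PREFIX))
                .collect(Collectors.toList());
        if (!benchmarkTopics.isEmpty()) {
            admin.deleteTopics(benchmarkTopics).all().get();
        }
    }
}

Tracking the exact topic names created during the run (rather than matching a prefix) would be safer still, at the cost of persisting that list across the create and cleanup phases.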

Benchmarking producer client issue with high-volume (1.5 million/sec), small (100-byte) messages

While recently attempting to generate 1.5 GB/sec of traffic to an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we encountered difficulties with the OMB producers. Throughput falls apart somewhere around 1 to 1.2 million messages per second. The same cluster handles 1.5 GB/sec with 1024-byte messages with excellent performance, but 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.

@travisdowns has additional details about the nature of these failures that he can add to this issue.

Additional experiments using other client technologies are in progress by @larsenpanda.

Ultimately we need to update our client so that these types of workloads succeed out of the box, and document the client's limits so customers know the upper boundaries of the producers and consumers and do not infer poor performance on Redpanda's part.

Lack of instance_type setting should not cause critical failure

This code introduces a regression: people using already-existing hosts files receive a critical failure because instance_type is not set. A default value should have been added here as an option. (Blame: @tmgstevens)

line: "name: Redpanda{{ '+SASL' if sasl_enabled | default(False) | bool == True }}{{ '+TLS' if tls_enabled | default(False)|bool == True }}+{{ groups['redpanda'] | length }}x{{ instance_type }}"

Dockerfile.build needs -Dlicense.skip=true

I believe the mvn install command inside the Dockerfile.build file is incomplete based on errors I got while building it.

#10 2.830 [ERROR] Failed to execute goal com.mycila:license-maven-plugin:3.0:check (default) on project messaging-benchmark: Some files do not have the expected license header -> [Help 1]
#10 2.830 [ERROR]
#10 2.830 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
#10 2.830 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
#10 2.830 [ERROR]
#10 2.830 [ERROR] For more information about the errors and possible solutions, please read the following articles:
#10 2.830 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

I think it needs to be
mvn install -Dlicense.skip=true

which is consistent with the Maven install command given elsewhere in the repo, and allows the Docker image to build successfully. It could probably also use an appropriate image tag to make it a little easier for users to spin up.

Errors when running Redpanda benchmark tests

Greetings,
I'm running Redpanda locally on Ubuntu via Docker. I tried to run the benchmark tests with the command:
sudo bin/benchmark -d driver-redpanda/redpanda-ack-all-group-linger-10ms.yaml \
  workloads/blog/1-topic-100-partitions-1kb-4-producers-500k-rate.yaml
I just changed partition count to 1, since I only have one node/broker.

I have a few issues. It would be great if you could point me in the right direction.

After running the command, I see this kind of error:
(screenshot: IndexOutOfBoundsException stack trace)

  1. What might be causing this IndexOutOfBoundsException, and how can I fix it?
  2. I see this kind of stats output being printed periodically. What should the final outcome of the test be?
     (screenshot: periodic stats output)
  3. I set the test duration to 1 minute, but it never finishes. I see a bunch of the errors mentioned above. Could that be why the test does not finish?

@rkruze Can you share your thoughts about it?

Better error messages needed when only 1 client worker deployed

03:20:20.240 [main] INFO - Using DistributedWorkersEnsemble workers topology
Exception in thread "main" java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
	at io.openmessaging.benchmark.worker.DistributedWorkersEnsemble.<init>(DistributedWorkersEnsemble.java:71)
	at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:158)

DistributedWorkersEnsemble throws this arg error if there happens to be only one configured worker.

Swarm at least throws a useful error message telling you it requires more than 1 worker.

03:27:43.539 [main] INFO - Using SwarmWorker workers topology
Exception in thread "main" java.lang.IllegalArgumentException: Workers must be > 1
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
	at io.openmessaging.benchmark.worker.SwarmWorker.<init>(SwarmWorker.java:119)
	at io.openmessaging.benchmark.Benchmark.main(Benchmark.java:161)

The Terraform setup should probably also throw an error if the number of clients is less than 2.
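
A minimal sketch of the kind of message that would help, assuming the check happens on the worker list passed to the DistributedWorkersEnsemble constructor (the helper class, method, and wording here are illustrative, not the actual code):

import com.google.common.base.Preconditions;
import java.util.List;

public class WorkerCountCheck {
    // Fail with an explanatory message instead of a bare IllegalArgumentException.
    public static void requireAtLeastTwoWorkers(List<String> workers) {
        Preconditions.checkArgument(workers.size() > 1,
                "Distributed workers topology requires more than 1 worker, but only %s were configured",
                workers.size());
    }
}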

Kafka Benchmark Tests Erroring out

Hi,

I'm getting the same errors as described in this closed issue. Per the closed issue, I updated the org.hdrhistogram.HdrHistogram dependency to 2.1.12. I also had to comment out the com.mycila license-maven-plugin in the parent pom, as the build would complain about various files missing the license agreement wording. I ran mvn install and everything built successfully. I then launched two workers locally (on different ports) and then the driver locally using Kafka and the workload 1-topic-1-partition-100b.yaml, which is set to run for 15 minutes. After about 2 minutes, the producer starts dumping out WARN-level messages similar to the linked issue:

09:03:14.712 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:787) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) [?:?]
        at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:49) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
        at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:228) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:653) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:634) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:554) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.lambda$sendProduceRequest$0(Sender.java:743) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:566) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:558) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:325) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: value 67283144 outside of histogram covered range. Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1311060 out of bounds for length 1310720
        at org.HdrHistogram.AbstractHistogram.handleRecordException(AbstractHistogram.java:571) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:563) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) ~[HdrHistogram-2.1.12.jar:2.1.12]
        at io.openmessaging.benchmark.worker.LocalWorker.lambda$submitProducersToExecutor$8(LocalWorker.java:266) ~[classes/:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:783) ~[?:?]

I am on JDK 11. Due to this error, the Pub rate starts dropping until it hits zero. I also noticed these errors in the producer:

09:04:48.734 [kafka-producer-network-thread | producer-1] WARN - Write error on message
java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch creation
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) [?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) [?:?]
        at io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkProducer.lambda$sendAsync$0(KafkaBenchmarkProducer.java:47) [driver-kafka-0.0.1-SNAPSHOT.jar:?]
        at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1363) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:231) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:197) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:676) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:381) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324) [kafka-clients-2.6.0.jar:?]
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:240) [kafka-clients-2.6.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1180 record(s) for test-topic-eYYvPlM-0000-0:123237 ms has passed since batch

Any guidance on how I might be able to get this sample workload to pass?

Thanks!
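
For reference, the histogram error above means a recorded latency value (67,283,144, roughly 67 seconds if the unit is microseconds) exceeded the Recorder's highest trackable value. A minimal sketch of how HdrHistogram's range could be widened, assuming microsecond latencies; the range, precision, and clamping policy here are assumptions, not the benchmark's actual settings:

import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class LatencyRecording {
    // Assumption: latencies are recorded in microseconds; allow up to 5 minutes
    // so a ~67-second outlier no longer overflows the covered range.
    private static final long HIGHEST_TRACKABLE_VALUE = TimeUnit.MINUTES.toMicros(5);
    private final Recorder recorder = new Recorder(HIGHEST_TRACKABLE_VALUE, 5);

    public void recordLatency(long latencyMicros) {
        // Clamp instead of letting recordValue throw ArrayIndexOutOfBoundsException on outliers.
        recorder.recordValue(Math.min(latencyMicros, HIGHEST_TRACKABLE_VALUE));
    }

    public Histogram snapshot() {
        return recorder.getIntervalHistogram();
    }
}

Constructing the Recorder with only a significant-digits argument (so it auto-resizes) is another option, at the cost of unbounded memory growth for extreme outliers.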

Prometheus and Grafana ports should not be accessible from any address, only myip

#Prometheus/Dashboard access
ingress {
  from_port   = 9090
  to_port     = 9090
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}
ingress {
  from_port   = 3000
  to_port     = 3000
  protocol    = "tcp"
  cidr_blocks = ["0.0.0.0/0"]
}

These should probably be locked down to the same address used for general access to the security group from the tester's home node, for example:

 cidr_blocks = ["${chomp(data.http.myip.body)}/32"]

Ansible Galaxy node_exporter download fails intermittently

This issue crops up quite regularly on larger build-outs. The workaround is to rerun. We should consider adding some internal retries.

TASK [geerlingguy.node_exporter : Download and unarchive node_exporter into temporary location.] ***********************************************************************************************************************
fatal: [35.247.77.147]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
fatal: [35.247.13.169]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.247.15.158]
changed: [34.82.201.249]
changed: [34.127.49.156]
fatal: [34.127.124.253]: FAILED! => {"changed": false, "msg": "Failure downloading https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz, Request failed: <urlopen error [Errno 104] Connection reset by peer>"}
changed: [35.233.238.241]
changed: [34.145.49.5]
changed: [35.230.12.29]
changed: [34.82.118.174]
changed: [35.230.87.240]

Document producer workload tunings for high-volume producer configs in driver-redpanda

https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1694729911166379

We need to document some additional OMB workload configuration details for high-volume testing.

There are a few items in this thread about how to get the producer to keep up with the expected rates:

  1. Possible quirks with the key distributor not keeping up with the expected rate. Random Nano seems to act weird in high-rate setups across many producers and partition spreads; NoKey and Round Robin seem to keep up. This could be related to the next item.
  2. For high-volume produce rates (>1 million/s) across many producers (tens) going to many partitions (thousands), the Java client may also need buffer.memory significantly increased to handle the amount of data being generated in the batch (see the config sketch after this discussion).

In an example test, @travisdowns calculated that for 1.8M messages/sec on 10 topics with thousands of partitions each, coming from ~100 producers, the buffer size needed was likely 3-4x larger than what we were setting in the test (around 32-33 MB).

2300 partitions per topic * 32000 batch size = 73.6 MB

According to the Java client docs, when using larger batch sizes:

A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records.

So we may not have been able to fill the batches due to buffer limits in the original tests.
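
A minimal sketch of the kind of producer settings under discussion, assuming the Kafka Java client; the specific numbers are illustrative, not recommended values:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class HighVolumeProducerConfig {
    public static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // batch.size: 32 KB batches, as in the tests described above.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32_000);
        // buffer.memory: sized so one in-flight batch per partition fits,
        // e.g. 2300 partitions * 32000 bytes ≈ 73.6 MB, rounded up with headroom.
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128L * 1024 * 1024);
        // linger.ms: give small messages a chance to fill batches before sending.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        return props;
    }
}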
