Comments (10)
GREAT question @fykaa 💯
Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I may be missing.
I think 12 was arbitrary, I don't recall there being a specific reason for it.
I think wrt coverage we should be good in terms of the properties we want to test. It would've been a concern if we chose fail points at random per client, but we do that before we start simulating traffic, so we are good even in terms of the code we are hitting:
etcd/tests/robustness/main_test.go
Line 77 in 00a6097
Having higher concurrency is beneficial because it lets us test more realistic and complex histories from a linearization perspective, and shows how etcd handles the load. However, if we can make do with 10 clients, then that validation is also good enough. Plus, the upside is that we get a clearer signal out of the robustness tests if we reduce the flake rate caused by this specific timeout.
from etcd.
I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: Is there a specific reason we initially set the ClientCount to 12? Could there be potential trade-offs or impacts on the test coverage and robustness if we reduce it to 10? I'm keen to understand the broader context before making changes.
from etcd.
Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for the linearization timeout. #18214 should already help a lot (reduced linearization time from timeout to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well we might not need to reduce the concurrency at all.
from etcd.
One request: before we make changes in the concurrency, I would like to remove all other flake sources. #18240
Then we can measure if we are happy with the timeout chance and if not reduce the concurrency.
from etcd.
@fykaa these are most of the knobs that exist:
etcd/tests/robustness/traffic/traffic.go
Lines 33 to 51 in 9314ef7
etcd/tests/robustness/traffic/kubernetes.go
Lines 35 to 47 in 9314ef7
Feel free to assign the PR to me once you have it up.
from etcd.
> full on etcd model to generate the revisions
Wouldn't this also increase the work porcupine has to do?
> model has a bug this will make it much harder to debug.
+1
@fykaa do you want to send in a PR? Maybe we can try letting that bake in the CI for a couple of days to see how it works out, and then take a call on whether we should revert, keep it that way, or change something.
from etcd.
Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation about the trade-offs between higher concurrency and reducing flakiness.
Also, I am trying to understand, are there any other areas in the robustness tests where we might need to adjust configurations to balance between test coverage and reliability?
from etcd.
Or, based on today's robustness test results, I'm wrong. Sorry for the confusion, but results cannot be guaranteed until we make a change and wait a day for the CI to get confirmation.
Linearization timed out in CI in 2 tests at 5 minutes:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64/1805934408956383232
On my local machine it succeeded after 118s and 141s. Even with my latest ideas (main...serathius:etcd:robustness-history-patching) I only improve the time for one of the cases (141s -> 1s). The other one, 118s, is just lowered to 116s, which is still expected to time out on CI :(
Based on the linearization results, I'm not surprised. Compaction caused watch events to not be delivered, so we don't know a lot of revisions.
from etcd.
Might try another approach, using a full-on etcd model to generate the revisions. I don't like the idea too much, because if the model has a bug this will make it much harder to debug.
from etcd.
Thanks for the detailed insights, everyone!
@MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed.
@serathius, I understand that other flake sources need to be addressed as well. I would raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact?
Thanks again for all the guidance!
from etcd.