Giter VIP home page Giter VIP logo

Comments (10)

MadhavJivrajani avatar MadhavJivrajani commented on August 16, 2024 4

GREAT question @fykaa 💯

Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I maybe missing.

I think 12 was arbitrary, I don't recall there being a specific reason for it.

I think wrt coverage we should be good in terms of the properties we want to test. It would've been a concern if we chose fail points at random per client, but we do that before we start simulating traffic, so we are good even in terms of the code we are hitting:

if s.failpoint == nil {

Having higher concurrency is beneficial because it can help us test more realistic and complex histories from a linearization perspective and how etcd handles the load, however if we can make do with 10 clients then that validation is also good enough. And plus the upside is we can get a clearer signal out of robustness tests if we can reduce the flake rate that happens because of this specific timeout.

from etcd.

fykaa avatar fykaa commented on August 16, 2024 3

I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: Is there a specific reason we initially set the ClientCount to 12? Could there be potential trade-offs or impacts on the test coverage and robustness if we reduce it to 10? I'm keen to understand the broader context before making changes.

from etcd.

serathius avatar serathius commented on August 16, 2024 2

Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for linearization for timeout. #18214 should already help a lot (reduced linearization time from timeout to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well we might not need to reduce the concurrency at all.

from etcd.

serathius avatar serathius commented on August 16, 2024 2

One request, before me make changes in the concurrency I would like to remove all other flake sources. #18240

Then we can measure if we are happy with the timeout chance and if not reduce the concurrency.

from etcd.

MadhavJivrajani avatar MadhavJivrajani commented on August 16, 2024 1

@fykaa these are most of the knobs that exist:

var (
DefaultLeaseTTL int64 = 7200
RequestTimeout = 200 * time.Millisecond
WatchTimeout = time.Second
MultiOpTxnOpCount = 4
LowTraffic = Profile{
MinimalQPS: 100,
MaximalQPS: 200,
ClientCount: 8,
MaxNonUniqueRequestConcurrency: 3,
}
HighTrafficProfile = Profile{
MinimalQPS: 200,
MaximalQPS: 1000,
ClientCount: 12,
MaxNonUniqueRequestConcurrency: 3,
}
)

var (
Kubernetes Traffic = kubernetesTraffic{
averageKeyCount: 10,
resource: "pods",
namespace: "default",
writeChoices: []random.ChoiceWeight[KubernetesRequestType]{
{Choice: KubernetesUpdate, Weight: 85},
{Choice: KubernetesDelete, Weight: 5},
{Choice: KubernetesCreate, Weight: 5},
{Choice: KubernetesCompact, Weight: 5},
},
}
)

Feel free to assign the PR to me once you have it up.

from etcd.

MadhavJivrajani avatar MadhavJivrajani commented on August 16, 2024 1

@serathius

full on etcd model to generate the revisions

Wouldn't this also increase the work porcupine has to do?

model has a bug this will make it much harder to debug.

+1

@fykaa do you want to send in a PR? Maybe we can try letting that bake in the CI for a a couple of days to see how it works out and can then take a call based on if we should revert, keep it that way or change something.

from etcd.

fykaa avatar fykaa commented on August 16, 2024

Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation about the trade-offs between higher concurrency and reducing flakiness.

Also, I am trying to understand, are there any other areas in the robustness tests where we might need to adjust configurations to balance between test coverage and reliability?

from etcd.

serathius avatar serathius commented on August 16, 2024

Or based on the today's robustness test results I'm wrong. Sorry for confusion but results cannot be guaranteed until we make a change and wait a day for the CI to get confirmation.

Linearization timeout in CI in 2 tests on 5 minutes
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64/1805934408956383232

On my local machine it succeeded after 118s and 141s. Even with my latest ideas (main...serathius:etcd:robustness-history-patching) I only take improve the time for one of the cases (141s -> 1s). The other 118s is just lowered to 116s, which is still expected to timeout on CI :(

Based on the linearization results, I'm not surprized. Compaction caused watch events to not be delivered, so we don't know a lot of revisions.
Selection_018

from etcd.

serathius avatar serathius commented on August 16, 2024

Might try another approach with using full on etcd model to generate the revisions, I don't like the idea too much as if the model has a bug this will make it much harder to debug.

from etcd.

fykaa avatar fykaa commented on August 16, 2024

Thanks for the detailed insights, everyone!

@MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed.

@serathius, I understand that other flake sources need to be addressed as well, I would raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact?

Thanks again for all the guidance!

from etcd.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.