
Comments (3)

dmuino commented on August 10, 2024

We used to keep track of the last poll time, and we had a special case for missed polls, but we decided to simplify the behavior:

29d77a4#diff-2b45014017c4137b6080dc2c186ca63f

We used to return NaN instead of a rate based on the deltas, though. The major motivation for the change was the significant reduction in the memory footprint of the class.

from servo.

brharrington commented on August 10, 2024

it seems that implementation assumes that polling will never really lag and does not try to handle stragglers gracefully

Generally with telemetry at Netflix, we distinguish between operational and business use-cases. For operational use-cases, the key question we are trying to answer is what is going on right now, and we have SLAs around when data must be actionable. That is, we have alerts and automated remediation based on the data that is coming in, and we have to know when the data is complete and can be trusted so we can act on it. There has also been a push over time to reduce the mean time to detect, meaning we need to keep the SLA as tight as possible. So generally we do not want the operational data to lag or to cover up that it is failing to meet the publishing SLA, but that does mean that some data loss is more likely.

For business intelligence use-cases, avoiding data loss is a big concern, but our team is not focused on those use-cases.


cykl commented on August 10, 2024

Thank you both for your input. It makes sense.

Generally with telemetry at Netflix, we distinguish between operational and business use-cases. [...] So generally we do not want the operational data to lag or to cover up that it is failing to meet the publishing SLA, but that does mean that some data loss is more likely. For business intelligence use-cases, avoiding data loss is a big concern

My use of "straggler" was perhaps misleading. My use-case is not a BI one. What led me to file this issue was a feature team questioning me about the following graph (CW: 5mn period):

[Graph: 5mn period]

They were puzzled to see only 0s. "Why would a Counter be reactivated if there is no increment?" By zooming to a 1mn period and troubleshooting their service, I was able to explain to them what is going on (the thread pool is saturated, some steps are not polled, data is lost, and you will never know what the error rate was).

[Graph: 1mn period]

However, they then asked, "Why would Servo discard missed steps rather than gracefully handle them by computing the rate over the observed polling period?" I wanted to give them a first-hand answer.
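To make the feature team's question concrete, here is a minimal Java sketch of what "computing the rate over the observed polling period" could look like. It is purely illustrative and is not Servo's implementation: the class `ObservedPeriodRate` and its `poll` method are hypothetical names, and the caller is assumed to pass in a monotonically increasing count and a wall-clock time in milliseconds.

```java
/**
 * Hypothetical helper (not part of Servo): derives a per-second rate from
 * the count delta and the time actually elapsed between two polls.
 */
final class ObservedPeriodRate {
    private long lastCount;       // count seen at the previous poll
    private long lastPollMillis;  // timestamp of the previous poll

    ObservedPeriodRate(long initialCount, long nowMillis) {
        this.lastCount = initialCount;
        this.lastPollMillis = nowMillis;
    }

    /**
     * Returns increments per second since the previous poll, however long
     * ago that was. Returns NaN when no time has elapsed, because no rate
     * can be derived in that case.
     */
    double poll(long currentCount, long nowMillis) {
        long elapsedMillis = nowMillis - lastPollMillis;
        long delta = currentCount - lastCount;
        lastCount = currentCount;
        lastPollMillis = nowMillis;
        if (elapsedMillis <= 0) {
            return Double.NaN;
        }
        return delta * 1000.0 / elapsedMillis;
    }
}
```

With this approach a poll that arrives two steps late still yields an average rate over the whole gap instead of a lost interval. The cost is one extra timestamp per counter, which is the kind of state the earlier comment says was dropped to reduce the memory footprint, and a late value no longer lines up with a single publishing step.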

BTW, if Servo expects that polling must not be delayed too much, which is reasonable, wouldn't it be useful to warn when late polling is observed? (IIRC HikariCP does something like that.) I know how to fix their issue, but I am afraid some other feature team will make the same mistake. How do you prevent such a broken config from reaching production?
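As a rough idea of what such a warning could look like, the sketch below logs whenever the gap between two polls exceeds the expected step plus a tolerance. Everything in it is an assumption made for illustration: `LatePollDetector`, `recordPoll`, and `latePollCount` are hypothetical names, not Servo or HikariCP APIs, and the step and tolerance values are chosen by the caller.

```java
import java.util.logging.Logger;

/**
 * Hypothetical late-poll detector (not a Servo API): flags polls whose gap
 * from the previous poll exceeds the step plus a tolerance.
 */
final class LatePollDetector {
    private static final Logger LOG =
            Logger.getLogger(LatePollDetector.class.getName());

    private final long stepMillis;       // expected interval between polls
    private final long toleranceMillis;  // slack allowed before warning
    private long lastPollMillis;
    private long latePolls;

    LatePollDetector(long stepMillis, long toleranceMillis, long startMillis) {
        this.stepMillis = stepMillis;
        this.toleranceMillis = toleranceMillis;
        this.lastPollMillis = startMillis;
    }

    /** Records a poll and returns true if it arrived late. */
    boolean recordPoll(long nowMillis) {
        long gapMillis = nowMillis - lastPollMillis;
        lastPollMillis = nowMillis;
        boolean late = gapMillis > stepMillis + toleranceMillis;
        if (late) {
            latePolls++;
            LOG.warning("Poll gap of " + gapMillis + " ms exceeds step of "
                    + stepMillis + " ms; some steps were probably not polled");
        }
        return late;
    }

    /** Number of late polls seen so far; could itself be published as a metric. */
    long latePollCount() {
        return latePolls;
    }
}
```

Exposing the late-poll count as its own metric, in addition to the log line, would let an alert catch a saturated poller before anyone has to puzzle over a graph of 0s.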

