Hi friends,
I've tested the sidecar successfully for 24 hours.
There are a couple of takeaways from this experience.
The implementation I was attempting was at a fairly large scale: there are 4 instances of Prometheus per cluster, with 6 clusters in total. This gives us a total of 24 active sidecars.
In 24 hours, these sidecars were responsible for a whopping 82,099,610 API calls.
This is an absolutely unacceptable amount.
Assuming the calls are spread evenly across instances, this translates into roughly 3,420,817 calls per sidecar per day.
The aggregated cost of these calls (and only the API calls) was £820.99.
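For anyone who wants to check the figures above, here is a small sketch of the arithmetic. The only inputs are the numbers quoted in this post; the per-call and per-sidecar costs are derived by simple division and assume every call is billed at the same rate, which I have not verified against the pricing page.

```python
# Figures quoted in this post.
TOTAL_CALLS = 82_099_610
CLUSTERS = 6
PROMETHEUS_PER_CLUSTER = 4
TOTAL_COST_GBP = 820.99

sidecars = CLUSTERS * PROMETHEUS_PER_CLUSTER

# Even split across sidecars (an assumption; real load will vary per instance).
calls_per_sidecar = TOTAL_CALLS // sidecars

# Implied average prices, assuming uniform billing per call.
cost_per_call = TOTAL_COST_GBP / TOTAL_CALLS
cost_per_sidecar = TOTAL_COST_GBP / sidecars

print(f"sidecars:          {sidecars}")
print(f"calls per sidecar: {calls_per_sidecar:,}")
print(f"cost per call:     GBP {cost_per_call:.7f}")
print(f"cost per sidecar:  GBP {cost_per_sidecar:.2f}")
```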
One could argue that reducing the number of unused incoming metrics would be an efficient way to reduce cost (and they would be right); however, some of the retrying done by the sidecar is absolutely unreasonable.
Example 1:
Metric name checks are made by sending calls to the API and then checking the error response.
The regex for the name is available in the Google docs, so there is absolutely no reason why the checks can't be done locally, saving everyone who uses your software some cash.
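A minimal sketch of what a local check could look like, in Python for brevity. The pattern below is a placeholder I made up for illustration, not the real rule; it would need to be replaced with the regex from the Google documentation. The function names are hypothetical as well.

```python
import re

# ASSUMPTION: placeholder pattern for illustration only.
# Replace it with the metric-name regex published in the Google docs.
METRIC_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_./]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the whole name matches the pattern (no API call needed)."""
    return METRIC_NAME_PATTERN.fullmatch(name) is not None


def partition_by_name(names):
    """Split names into (valid, rejected) before anything is sent to the API."""
    valid, rejected = [], []
    for name in names:
        (valid if is_valid_metric_name(name) else rejected).append(name)
    return valid, rejected


valid, rejected = partition_by_name(["http_requests_total", "bad name!", ""])
print("send:", valid)
print("skip:", rejected)
```

Names in the rejected list never reach the API, so they cost nothing and cannot trigger retries.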
Example 2:
Tremendously aggressive retrying. Surely if we weren't able to upload a metric, the result is not likely to change suddenly. My suggestion here would be to add exponential backoff: 1 second before the 1st retry, 2 seconds before the 2nd, 4 seconds before the 3rd, 8 seconds before the 4th, and so on, up to a cap.
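The retry schedule suggested above (1, 2, 4, 8 seconds, doubling up to a cap) could be sketched roughly like this. It is Python rather than the sidecar's own code, and `send`, `max_retries`, and the 60-second cap are assumptions chosen for the example.

```python
import time


def backoff_delays(max_retries, base=1.0, cap=60.0):
    """Delays before each retry: base, 2*base, 4*base, ... limited to cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]


def send_with_backoff(send, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call send() once, then retry with growing delays until it returns True.

    `send` is a hypothetical callable that returns True on success.
    Returns False if every attempt fails.
    """
    if send():
        return True
    for delay in backoff_delays(max_retries, base, cap):
        sleep(delay)
        if send():
            return True
    return False


print(backoff_delays(8))
```

In practice a bit of random jitter is often added to each delay so that many sidecars do not retry at the same instant; that is left out here to keep the sketch short.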
Example 3:
GCP imposes a restriction of one write per minute for a given time series of a specific metric. This means that if various exporters scrape the endpoints 3 times per minute and report on 25 metrics (20 seconds between scrapes is not that uncommon), the sidecar will make 25 valid API calls and 50 invalid ones. Extrapolated to the full volume of metrics, this becomes absolutely unbearable and a waste of resources.
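One way to avoid the invalid calls would be to drop, on the client side, any sample that arrives less than a minute after the last one sent for the same series. The sketch below is in Python, with a hypothetical `PerSeriesThrottle` class, and it replays the 25-metric, 3-scrapes-per-minute scenario from this example. The 60-second interval is taken from the limit described in this post.

```python
class PerSeriesThrottle:
    """Allow at most one sample per series within min_interval seconds."""

    def __init__(self, min_interval=60.0):
        self.min_interval = min_interval
        self._last_sent = {}  # series key -> timestamp of last allowed sample

    def allow(self, series_key, timestamp):
        last = self._last_sent.get(series_key)
        if last is not None and timestamp - last < self.min_interval:
            return False  # too soon: the API would reject this write
        self._last_sent[series_key] = timestamp
        return True


def simulate(n_metrics=25, scrape_times=(0, 20, 40)):
    """Count samples sent vs. dropped locally for one minute of scrapes."""
    throttle = PerSeriesThrottle()
    sent = dropped = 0
    for t in scrape_times:
        for i in range(n_metrics):
            if throttle.allow(f"metric_{i}", t):
                sent += 1
            else:
                dropped += 1
    return sent, dropped


sent, dropped = simulate()
print(f"sent to API: {sent}, dropped locally: {dropped}")
```

With this filter in place, the 50 calls that would have failed are never made, so they are neither billed nor retried.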
As an example of what I'm talking about, here are the API statistics for the 24 hours:
API: Stackdriver Monitoring API
Number of requests: 82,099,610
% of Error API CALLS: 89
This means that out of the 82 million calls in 1 day, only about 9 million were actually successful.
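The successful-call figure follows directly from the two statistics above. A quick sketch of the calculation, keeping in mind that the 89% is a rounded value from the console, so the result is approximate:

```python
TOTAL_CALLS = 82_099_610
ERROR_PERCENT = 89  # rounded figure reported for the API

# Approximate split between failed and successful calls.
failed = round(TOTAL_CALLS * ERROR_PERCENT / 100)
successful = TOTAL_CALLS - failed

print(f"failed:     ~{failed:,}")
print(f"successful: ~{successful:,}")
```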
I intend to help fix some of these problems when I get a chance to contribute to your project. I just thought I'd raise them so that you are aware, or at least so that you can put a warning in the project's documentation, letting people research the pricing and the metrics they have before deploying it.
Thank you for your time and effort.
Miguel