netflix-skunkworks / service-capacity-modeling
License: Apache License 2.0
Now that gp3 is a thing, let's add it to the hardware descriptions so we can use it.
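As a sketch of what such an entry could look like: the field names below mirror the gp2-style `attached_drives` fields seen in the cost summary sample later in this page and are assumptions, not the actual schema; the numbers are published AWS us-east-1 gp3 list prices and baselines at the time of writing.

```python
# Hypothetical gp3 hardware description entry. Field names are
# assumptions modeled on the gp2-style attached_drives fields; the
# schema in the repo may differ.
gp3_drive = {
    "name": "gp3",
    # $0.08 per GB-month -> roughly $0.96 per GiB-year
    "annual_cost_per_gib": 0.08 * 12,
    # gp3 includes a 3000 IOPS / 125 MiB/s baseline at no extra cost,
    # so baseline IO is effectively free (extra provisioned IOPS cost more)
    "annual_cost_per_read_io": 0.0,
    "annual_cost_per_write_io": 0.0,
    "baseline_iops": 3000,
    "baseline_throughput_mib_s": 125,
}
```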
In our current logic (https://github.com/Netflix-Skunkworks/service-capacity-modeling/blob/main/service_capacity_modeling/models/org/netflix/key_value.py#L85), we scale the C* cluster by a factor of `1 - estimated_kv_cache_hit_rate`, where `estimated_kv_cache_hit_rate` is configurable (default 0.8).
In a previous conversation with @jolynch and @szimmer1, we discussed tying the read/write ratio from the user desires into this calculation.
One toy example:
```python
estimated_cache_hit_rate = extra_model_arguments.get("estimated_cache_hit_rate", 0.8)
estimated_cache_miss_rate = 1 - estimated_cache_hit_rate
rps_interval.scale(min(estimated_cache_miss_rate, max(0.1, 1 - read_write_ratio)))
```
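Pulled out as a standalone function so the behavior is easy to check. Two assumptions here for illustration: `read_write_ratio` is taken to mean reads / (reads + writes), and the 0.1 floor keeps the model from scaling the read RPS below 10% regardless of the workload mix.

```python
def kv_read_scale_factor(
    estimated_cache_hit_rate: float = 0.8,
    read_write_ratio: float = 0.5,
    floor: float = 0.1,
) -> float:
    """Factor to scale the backing C* read RPS by, per the toy example.

    Assumes read_write_ratio = reads / (reads + writes); the floor
    prevents scaling below `floor` of the original read rate.
    """
    estimated_cache_miss_rate = 1 - estimated_cache_hit_rate
    return min(estimated_cache_miss_rate, max(floor, 1 - read_write_ratio))

# Defaults: 20% miss rate, 50% writes -> scale by 0.2
# Read-heavy (95% reads): the 0.1 floor wins -> scale by 0.1
```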
Right now we just make a recommendation like "12 m5d.2xlarge", but for software that can autoscale (stateless Java apps, Elasticsearch, etc.) it would be nice if we could also return a hint of the autoscaling policy.
Step 1: Define how we will represent a scaling policy (e.g. how to represent various metrics like CPU utilization etc ...)
Step 2: Make the models return them
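For Step 1, one option is a small typed structure for a target-tracking policy on a named metric. Every name below (`ScalingMetric`, `TargetTrackingPolicy`, `AutoScalingHint`, their fields) is a hypothetical sketch, not an existing API in this repo:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ScalingMetric(str, Enum):
    cpu_utilization = "cpu_utilization"
    network_in = "network_in"
    request_rate = "request_rate"


@dataclass
class TargetTrackingPolicy:
    metric: ScalingMetric
    target_value: float      # e.g. 0.5 -> scale to hold ~50% CPU
    min_count: int
    max_count: int
    cooldown_seconds: int = 300


@dataclass
class AutoScalingHint:
    policies: List[TargetTrackingPolicy] = field(default_factory=list)


# A model could then return, alongside "12 m5d.2xlarge":
hint = AutoScalingHint(
    policies=[
        TargetTrackingPolicy(
            metric=ScalingMetric.cpu_utilization,
            target_value=0.5,
            min_count=8,
            max_count=24,
        )
    ]
)
```

Target tracking keeps the representation declarative (a target, bounds, and a cooldown), which maps onto most cloud autoscalers without encoding provider-specific step policies.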
I'm working on summarizing the cost, CPU, and disk (local & attached) for both regional and zonal clusters. I want more consistency in the way repetition is represented.
```yaml
us-east-1: # trimmed
us-west-2:
  least_regret:
    - candidate_clusters:
        total_annual_cost: # redacted
        zonal:
          - cluster_type: cassandra # trimmed
          - cluster_type: cassandra # trimmed
          - cluster_type: cassandra # trimmed
        regional:
          - cluster_type: dgwkv
            total_annual_cost: # redacted
            count: 3
            instance:
              total_annual_cost: # redacted
              name: r5.large
            attached_drives:
              - name: gp2
                size_gib: 20
                annual_cost_per_gib: # redacted
                annual_cost_per_read_io: # redacted
                annual_cost_per_write_io: # redacted
```
In the sample above there are:
Right now most models are split into two parts:
This makes sense for provisioning, where we are trying to guess CPU time from e.g. payload sizes and RPS. For rightsizing it might make more sense to just provide the existing choices in the desires along with utilization, and then the model can produce an ideal hardware recommendation for that specific requirement. Perhaps modify `CapacityDesires` to have an additional field called `existing_deployment` that takes either a `Requirements` or a `Clusters`. Maybe, instead of supplying a frequency to requirements, have a hardware shape/count (the cpu_count would be cpu * utilization, for example).
Then models can short-circuit the requirements generation, or at least use the provided numbers as good defaults. RAM is the only one that seems tricky to me and might require merging.
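A rough sketch of what that field could carry. Everything here is hypothetical: `ExistingCluster`/`ExistingDeployment` stand in for whatever the real `Requirements`/`Clusters` types would be, and `effective_cpu_count` applies the "cpu_count would be cpu * utilization" convention suggested above.

```python
from dataclasses import dataclass


@dataclass
class ExistingCluster:
    # Hypothetical stand-in for the repo's real cluster type
    instance_name: str       # e.g. "m5d.2xlarge"
    count: int
    cpu_utilization: float   # observed average, 0..1


@dataclass
class ExistingDeployment:
    cluster: ExistingCluster

    def effective_cpu_count(self, cpus_per_instance: int) -> float:
        # cpu_count = cpus * utilization, per the proposal above
        return (
            self.cluster.count * cpus_per_instance * self.cluster.cpu_utilization
        )


deployment = ExistingDeployment(
    cluster=ExistingCluster(
        instance_name="m5d.2xlarge", count=12, cpu_utilization=0.4
    )
)
# 12 instances x 8 vCPUs x 40% utilization = 38.4 effective cores
```

A model that receives such a field could seed its requirements from `effective_cpu_count` instead of estimating CPU time from payload sizes and RPS.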
Greetings,
I was asked to add a new model to your capacity planner. Are there general directions or documentation on what it takes to add a new model? I see that the existing models vary significantly in how they are implemented. Since the model is provided to the capacity planner, there must be a protocol somewhere that the planner relies on to understand the model, but if there is, I can't find it.
Thank you,
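To illustrate what such a "protocol" usually means in Python: the planner would depend on a structural interface rather than a concrete base class. The sketch below is purely hypothetical and is not the repo's actual interface; all names (`CapacityModelProtocol`, `capacity_plan`, `plan`) are invented for illustration.

```python
from typing import Any, Dict, Protocol


class CapacityModelProtocol(Protocol):
    # Hypothetical: any object with this method shape would satisfy
    # the planner, no inheritance required.
    def capacity_plan(
        self, desires: Dict[str, Any], extra_model_arguments: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Return a proposed cluster layout for the given desires."""
        ...


class ToyModel:
    def capacity_plan(self, desires, extra_model_arguments):
        # Trivially echo a fixed recommendation
        return {"instance": "m5d.2xlarge", "count": 12}


def plan(model: CapacityModelProtocol, desires: Dict[str, Any]) -> Dict[str, Any]:
    # The planner only relies on the protocol, not a concrete class
    return model.capacity_plan(desires, {})
```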