The microservices-observability_and_monitoring from codecritics

Note: For the screenshots, you can store all of your answer images in the answer-img directory.

Verify the monitoring installation

TODO: run kubectl command to show the running pods and services for all components. Take a screenshot of the output and include it here to verify the installation

Set up the Jaeger and Prometheus source

TODO: Expose Grafana to the internet and then set up Prometheus as a data source. Provide a screenshot of the home page after logging into Grafana.

Create a Basic Dashboard

TODO: Create a dashboard in Grafana that shows Prometheus as a source. Take a screenshot and include it here.

Describe SLO/SLI

TODO: Describe, in your own words, what the SLIs are, based on an SLO of monthly uptime and request response time.

SLIs (Service Level Indicators) are specific metrics used to measure the performance of a service. For an SLO of monthly uptime and request response time, the SLIs are the measurements that the SLO is checked against: the percentage of time the service was actually available during the month, and the response time actually observed for requests. SLIs show in real time how a service is performing, and therefore whether the SLOs are being achieved.
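The relationship between an SLI (what is measured) and an SLO (the target) can be sketched in a few lines of Python. This is only an illustration: the probe results, the latency samples, and the targets of 99.95% uptime and 50 ms are made-up values, and the function names are hypothetical.

```python
def uptime_sli(probe_results):
    """Percentage of health probes that succeeded (True = service was up)."""
    return 100.0 * sum(probe_results) / len(probe_results)


def latency_sli(latencies_ms, threshold_ms):
    """Percentage of requests that were served within threshold_ms."""
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Made-up measurements for illustration only.
probes = [True] * 9998 + [False] * 2      # 2 failed probes out of 10,000
latencies = [20, 35, 40, 48, 120]         # response times in milliseconds

# Assumed SLO targets.
UPTIME_SLO_PCT = 99.95
LATENCY_THRESHOLD_MS = 50

measured_uptime = uptime_sli(probes)
measured_fast = latency_sli(latencies, LATENCY_THRESHOLD_MS)

# The SLI is the measured number; the SLO is the target it is compared to.
uptime_slo_met = measured_uptime >= UPTIME_SLO_PCT
```

In a real setup these numbers would come from Prometheus rather than from hard-coded lists.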

Creating SLI metrics.

TODO: It is important to know why we want to measure certain metrics for our customer. Describe in detail 5 metrics to measure these SLIs.

Request Latency: The time taken to serve a request (usually measured in ms).
System Throughput: The number of requests served per second.
Uptime: The percentage of time the service is available during a specific period (minutes, hours, etc.).
Traffic: The amount of stress on a system from demand (such as the number of HTTP requests per second).
Error Rate: The number of errors, often expressed as a fraction of all requests received (e.g. the percentage of HTTP 500 responses).
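As a rough sketch, several of these metrics can be derived from a simple request log. The log format, its values, and the `summarize` helper below are assumptions made for illustration. Uptime is left out because it needs health probes rather than request records.

```python
from statistics import mean

# Hypothetical request log: (timestamp in seconds, HTTP status, latency in ms).
requests_log = [
    (0.0, 200, 12.0),
    (0.5, 200, 30.0),
    (1.0, 500, 250.0),
    (1.5, 404, 8.0),
    (2.0, 200, 20.0),
]


def summarize(log, window_seconds):
    """Compute latency, throughput, and error rate for one time window."""
    total = len(log)
    server_errors = sum(1 for _, status, _ in log if status >= 500)
    return {
        # Request latency: average time taken to serve a request.
        "avg_latency_ms": mean(latency for _, _, latency in log),
        # Throughput / traffic: requests handled per second in the window.
        "throughput_rps": total / window_seconds,
        # Error rate: share of requests that ended in a 5xx response.
        "error_rate_pct": 100.0 * server_errors / total,
    }


summary = summarize(requests_log, window_seconds=2.0)
```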

Create a Dashboard to measure our SLIs

TODO: Create a dashboard to measure the uptime of the frontend and backend services. We will also want to measure 40x and 50x errors. Create a dashboard that shows these values over a 24 hour period and take a screenshot.

Tracing our Flask App

TODO: We will create a Jaeger span to measure the processes on the backend. Once you fill in the span, provide a screenshot of it here. Also provide a (screenshot) sample Python file containing a trace and span code used to perform Jaeger traces on the backend service.

Jaeger in Dashboards

TODO: Now that the trace is running, let's add the metric to our current Grafana dashboard. Once this is completed, provide a screenshot of it here.

Report Error

TODO: Using the template below, write a trouble ticket for the developers to explain the errors that you are seeing (400, 500, latency) and to let them know the file that is causing the issue. Also include a screenshot of the tracer span to demonstrate how we can use a tracer to locate errors easily.

TROUBLE TICKET

Name: Error on trial/app/app.py

Date: 02/08/2022

Subject: Cannot retrieve the number of jobs from provided URL

Affected Area: Endpoint: File "/app/app.py", line 66, in homepage

Severity: High

Description: JSONDecodeError: There is an issue with the way the request-response data is structured, so the length of the JSON output cannot be evaluated.
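For the developers, the failure mode can be reproduced with the standard library alone. The sketch below is not the code from app.py (which is not shown here); `count_jobs` and the sample payloads are hypothetical and only illustrate one way to guard the length calculation against a body that is not valid JSON.

```python
import json


def count_jobs(response_text):
    """Return the number of jobs in a JSON array body.

    Returns None when the body is not valid JSON or is not a JSON array,
    instead of letting JSONDecodeError propagate to the request handler.
    """
    try:
        jobs = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(jobs, list):
        return None
    return len(jobs)


valid_count = count_jobs('[{"id": 1}, {"id": 2}]')
invalid_count = count_jobs("<html>Service Unavailable</html>")
```

With a guard like this the endpoint could return a clear error response and tag the tracer span, rather than failing with an unhandled exception.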

Creating SLIs and SLOs

TODO: We want to create an SLO guaranteeing that our application has a 99.95% uptime per month. Name four SLIs that you would use to measure the success of this SLO.

Uptime - Services should be up and running for at least 99.95% of the time on a monthly basis.
HTTP error rate - At least 98% of responses per month should return a 20x HTTP status; errors per second should be <= 0.05% of the requests.
HTTP request latency - Request responses should take less than 50 ms.
CPU and memory usage - Services should not be overloaded.
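To make the 99.95% monthly uptime SLO concrete, it helps to convert it into an error budget, that is, the downtime that is still allowed. The helper below is a small worked calculation; the function name is hypothetical and a 30-day month is assumed.

```python
def allowed_downtime_minutes(slo_pct, days=30):
    """Minutes of downtime permitted in the period for a given uptime SLO."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - slo_pct) / 100.0


# 99.95% uptime leaves roughly 21.6 minutes of downtime per 30-day month.
budget_minutes = allowed_downtime_minutes(99.95)
```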

Building KPIs for our plan

TODO: Now that we have our SLIs and SLOs, create a list of 2-3 KPIs to accurately measure these metrics as well as a description of why those KPIs were chosen. We will make a dashboard for this, but first write them down here.

Uptime (pod uptime):

Backend uptime
Frontend uptime

4XX and 5XX errors

Number of successful requests / number of failing requests (for frontend and backend)

Traffic:

Average response time

Resource Usage:

CPU usage
RAM usage

Final Dashboard

TODO: Create a Dashboard containing graphs that capture all the metrics of your KPIs and adequately represent your SLIs and SLOs. Include a screenshot of the dashboard here, and write a text description of what graphs are represented in the dashboard.

1. Measure uptime of backend and frontend
2. Measure backend and frontend HTTP [4XX, 5XX] errors
3. Measure average response time over the last 30s
4a. Measure CPU usage of backend and frontend
4b. Measure memory usage of backend and frontend
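The panels above could be backed by PromQL queries along the following lines, collected here in a Python dictionary for reference. This is a sketch under several assumptions that the text does not confirm: that the Flask services export `flask_http_request_total` and `flask_http_request_duration_seconds` (as the prometheus_flask_exporter package does), that container metrics come from cAdvisor, and that the job and pod names contain "backend" and "frontend". The actual queries used in the dashboard may differ.

```python
# Assumed PromQL per dashboard panel; all metric names, label names, and
# label values are assumptions and must be adapted to the real deployment.
PANEL_QUERIES = {
    # 1. Uptime: share of successful scrapes over the last 24 hours.
    "uptime_pct": 'avg_over_time(up{job=~"backend|frontend"}[24h]) * 100',
    # 2. HTTP errors, split into 4XX and 5XX.
    "http_4xx": 'sum by (job) (rate(flask_http_request_total{status=~"4.."}[5m]))',
    "http_5xx": 'sum by (job) (rate(flask_http_request_total{status=~"5.."}[5m]))',
    # 3. Average response time over the last 30 seconds
    #    (needs a scrape interval short enough to have two samples in 30s).
    "avg_response_time_30s": (
        "sum(rate(flask_http_request_duration_seconds_sum[30s])) / "
        "sum(rate(flask_http_request_duration_seconds_count[30s]))"
    ),
    # 4a. CPU usage per pod.
    "cpu_usage": (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        '{pod=~"backend.*|frontend.*"}[5m]))'
    ),
    # 4b. Memory usage per pod.
    "memory_usage": (
        "sum by (pod) (container_memory_working_set_bytes"
        '{pod=~"backend.*|frontend.*"})'
    ),
}
```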
