Azure Synapse Spark Metrics

Introduction

This project mainly aims to provide:

  • Azure Synapse Apache Spark metrics monitoring for Azure Synapse Spark applications by leveraging Prometheus, Grafana and Azure APIs.
  • Azure Synapse Prometheus connector for connecting the on-premises Prometheus server to Azure Synapse Analytics workspace metrics API.
  • Grafana dashboards for Synapse Spark metrics visualization.
  • Helm chart for Prometheus and Grafana deployment on AKS, including the connector, Prometheus servers and Grafana dashboards for metrics users.

The dataflow:

Dataflow Chart

Grafana dashboard screenshot:

Grafana dashboard

Prerequisites

  1. Azure CLI
  2. Helm 3.30+
  3. kubectl

Alternatively, use Azure Cloud Shell, which includes all of the above tools out of the box.

Getting Started

  1. Create an Azure Kubernetes Service (AKS) cluster (1.16+), or use Minikube instead

    az login
    az account set --subscription "<subscription_id>"
    az aks create --name <kubernetes_cluster_name> --resource-group <kubernetes_cluster_rg> --location eastus --node-vm-size Standard_D2s_v3
    az aks get-credentials --name <kubernetes_cluster_name> --resource-group <kubernetes_cluster_rg>
  2. Create a service principal and grant it permission to the Synapse workspace

    az ad sp create-for-rbac --name <service_principal_name>

    The result should look like:

    {
        "appId": "abcdef...",
        "displayName": "<service_principal_name>",
        "name": "http://<service_principal_name>",
        "password": "abc....",
        "tenant": "<tenant_id>"
    }

    Note down the appId, password, and tenant id.

    1. Log in to your Azure Synapse Analytics workspace as Synapse Administrator
    2. In Synapse Studio, on the left-side pane, select Manage > Access control
    3. Click the Add button on the upper left to add a role assignment
    4. For Scope choose Workspace
    5. For Role choose Synapse Compute Operator
    6. For Select user, enter your <service_principal_name> and select your service principal
    7. Click Apply

    Wait 3 minutes for the permission to take effect.

    screenshot-grant-permission-srbac

    Note: Make sure your service principal has at least a "Reader" role in your Synapse workspace. Go to the Access Control (IAM) tab in the Azure portal and check the permission settings.
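
To avoid copying the values by hand, the JSON printed by az ad sp create-for-rbac can be parsed in the shell. This is a minimal sketch, assuming the flat output shape shown above; json_field is a hypothetical helper, and the sample values are placeholders rather than real credentials.

```shell
# Hypothetical helper: pull one string field out of the JSON printed by
# `az ad sp create-for-rbac`. It handles only flat "key": "value" pairs,
# one per line, as in the example output above.
json_field() {
    # $1 = field name; the JSON is read from stdin
    sed -n "s/^[[:space:]]*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Sample output in the same shape as the result shown above (placeholder values).
sp_json='{
    "appId": "00000000-aaaa-bbbb-cccc-000000000000",
    "displayName": "my-sp",
    "password": "example-secret",
    "tenant": "11111111-aaaa-bbbb-cccc-111111111111"
}'

APP_ID=$(printf '%s\n' "$sp_json" | json_field appId)
SP_PASSWORD=$(printf '%s\n' "$sp_json" | json_field password)
TENANT_ID=$(printf '%s\n' "$sp_json" | json_field tenant)

echo "appId=$APP_ID tenant=$TENANT_ID"
```

With real output, redirect the command to a file first (for example, az ad sp create-for-rbac --name <service_principal_name> > sp.json) and feed that file to json_field instead of the sample string.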

  3. Install Synapse Prometheus Operator

    Add the synapse-prometheus-operator repo to the Helm client:

    helm repo add synapse-charts https://github.com/microsoft/azure-synapse-spark-metrics/releases/download/helm-chart

    Install by Helm client:

    helm install spo synapse-charts/synapse-prometheus-operator --create-namespace --namespace spo \
        --set synapse.workspaces[0].workspace_name="<workspace_name>" \
        --set synapse.workspaces[0].tenant_id="<tenant_id>" \
        --set synapse.workspaces[0].service_principal_name="<service_principal_app_id>" \
        --set synapse.workspaces[0].service_principal_password="<service_principal_password>" \
        --set synapse.workspaces[0].subscription_id="<subscription_id>" \
        --set synapse.workspaces[0].resource_group="<workspace_resource_group_name>"
    • workspace_name: Synapse workspace name.
    • subscription_id: Synapse workspace subscription id.
    • workspace_resource_group_name: Synapse workspace resource group name.
    • tenant_id: Synapse workspace tenant id.
    • service_principal_name: The service principal application id (the "appId" value from the previous step).
    • service_principal_password: The service principal password you just created.

    For more details, please refer to config.example.yaml
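
The same settings can also be supplied through a values file rather than a series of --set flags, which keeps the password off the helm command line. The sketch below generates such a file; the key layout is inferred from the --set paths above, and the variable names, the placeholder values, and the temporary file location are all assumptions.

```shell
# Placeholder inputs (assumptions); replace them with your own values.
WORKSPACE_NAME="my-workspace"
TENANT_ID="11111111-aaaa-bbbb-cccc-111111111111"
APP_ID="00000000-aaaa-bbbb-cccc-000000000000"
SP_PASSWORD="example-secret"
SUBSCRIPTION_ID="22222222-aaaa-bbbb-cccc-222222222222"
RESOURCE_GROUP="my-workspace-rg"

# Hypothetical location for the generated file.
VALUES_FILE="$(mktemp)"

# Mirror the synapse.workspaces[0].* paths used with --set as YAML.
cat > "$VALUES_FILE" <<EOF
synapse:
  workspaces:
    - workspace_name: "${WORKSPACE_NAME}"
      tenant_id: "${TENANT_ID}"
      service_principal_name: "${APP_ID}"
      service_principal_password: "${SP_PASSWORD}"
      subscription_id: "${SUBSCRIPTION_ID}"
      resource_group: "${RESOURCE_GROUP}"
EOF
chmod 600 "$VALUES_FILE"   # the file contains a secret

# Then install with the file instead of the --set flags:
#   helm install spo synapse-charts/synapse-prometheus-operator \
#       --create-namespace --namespace spo -f "$VALUES_FILE"
cat "$VALUES_FILE"
```

Because workspaces is a list, monitoring a second workspace should only need another list entry, although config.example.yaml remains the authoritative reference for the supported keys.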

  4. Open Grafana and enjoy!

    # Get password
    kubectl get secret --namespace spo spo-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
    # Get the service IP, paste the external IP into a browser, and log in with username 'admin' and the password.
    kubectl -n spo get svc spo-grafana

    Find the Synapse dashboards in the upper-left corner of the Grafana page (Home -> Synapse Workspace / Synapse Application). Then run some example code in a Synapse Studio notebook and wait a few seconds for the metrics to be pulled.

Uninstall

Remove the operators.

# helm delete <release> -n <namespace>
helm delete spo -n spo

Remove the Kubernetes cluster.

az aks delete --name <kubernetes_cluster_name> --resource-group <kubernetes_cluster_rg>

Install Helm Chart Locally

helm install spo ./synapse-prometheus-operator --create-namespace --namespace spo \
    --set synapse.workspaces[0].workspace_name="<workspace_name>" \
    --set synapse.workspaces[0].tenant_id="<tenant_id>" \
    --set synapse.workspaces[0].service_principal_name="<service_principal_app_id>" \
    --set synapse.workspaces[0].service_principal_password="<service_principal_password>" \
    --set synapse.workspaces[0].subscription_id="<subscription_id>" \
    --set synapse.workspaces[0].resource_group="<workspace_resource_group_name>"

Build Docker Image

cd synapse-prometheus-connector
docker build -t "synapse-prometheus-connector:${Version}" -f Dockerfile .
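
The build command above relies on the Version variable being set; if it is empty, the tag ends in a bare colon and docker rejects it. A small guard can catch that before the build starts. This is a sketch only: image_tag is a hypothetical helper, and the version passed in the example is simply the tag mentioned in the issues section.

```shell
# Hypothetical helper: build the image reference, refusing an empty version.
image_tag() {
    local version="${1:-}"
    if [ -z "$version" ]; then
        echo "Version must not be empty" >&2
        return 1
    fi
    printf 'synapse-prometheus-connector:%s\n' "$version"
}

# Example with an explicit version.
IMAGE="$(image_tag "0.0.25")"
echo "Would build ${IMAGE}"

# The actual build would then be:
#   docker build -t "${IMAGE}" -f Dockerfile .
```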

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

azure-synapse-spark-metrics's Issues

Observability gap: jobs that fail on startup

Because the scrape configuration is dynamically generated from the list of running jobs, a job that fails on startup or shortly after (say within 15-20 seconds of startup) may produce no metrics at all: either the dynamic configuration never gets updated, or it is updated and then removed before Prometheus has a chance to scrape.

I don't have a solution for this issue, but I wanted to call it out as a problem. Because the reporting API only returns active jobs, and the Prometheus API in Synapse that this project calls only returns data while the job is running, the issue is really upstream with Synapse.
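
The timing argument can be made concrete with a rough worst-case check. The sketch below assumes a simple model in which the target list is refreshed on a fixed interval and Prometheus scrapes on a fixed interval; guaranteed_scrape is a hypothetical helper and both interval values are assumptions, not settings taken from this project.

```shell
# Succeeds when a job of the given lifetime is guaranteed at least one scrape
# in the worst case: the job starts just after a target refresh, and the first
# scrape comes a full scrape interval after the job is discovered.
# All three arguments are in seconds.
guaranteed_scrape() {
    local lifetime="$1" discovery_interval="$2" scrape_interval="$3"
    [ "$lifetime" -gt $((discovery_interval + scrape_interval)) ]
}

DISCOVERY_INTERVAL=30   # hypothetical target-list refresh period
SCRAPE_INTERVAL=15      # hypothetical Prometheus scrape interval

for lifetime in 20 60; do
    if guaranteed_scrape "$lifetime" "$DISCOVERY_INTERVAL" "$SCRAPE_INTERVAL"; then
        echo "job living ${lifetime}s: at least one scrape expected"
    else
        echo "job living ${lifetime}s: may finish without being scraped"
    fi
done
```

Under this model, shortening either interval narrows the gap but cannot close it, since a job can always fail faster than the combined delay.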

Soap box: I've been trying to figure out how to get observability in a standard tool outside of the Synapse portal, and it really doesn't seem like there is anything. Azure Synapse jobs don't expose any standard Azure Monitor metrics, so this really feels like the only tool we have.

Not seeing a full list of available spark pools in grafana application dashboard

In the production environment, the Synapse Workspace Spark Application dashboard shows only 5 Spark pools in the top label, even though 7 Spark pools have application logs during the time range.

When I tested it in my test environment, it showed 6 spark pools which had submitted application metrics during the time range.
I think the includeAll option would show all available variable values, but it could cause performance problems.
https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#include-all-option

These differences between the production and test environments seem to be caused by the amount of data and the performance of the data sources.
What determines the number of spark pools that appear in the top drop-down list?

Better release structure: docker image version matching code commit?

The current docker image deployed as part of the helm chart is mcr.microsoft.com/azuresynapse/synapse-prometheus-connector:0.0.25. Where is 0.0.25 coming from? There is only one release here, and I can't find anything in the commit history or project files that indicates where this version tag comes from. I would like to know which code commit matches the docker image this is deploying.

High Metric Cardinality

There isn't much happening in this project but I figured I'd write up my biggest issue with this project.

The Synapse metrics exposed by this project have extremely high cardinality. Every submitted job has a unique label value and every job submitted adds at least 40-60 time series. This is not sustainable for any reasonably high volume of Azure Synapse jobs.

I understand the problem is that the API exposed by Synapse is only available on a per-job basis, but this project is not sustainable from a metrics storage perspective as currently written.
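
To put rough numbers on this, the 40-60 series per job can be multiplied out over a retention window. The sketch below does only that arithmetic; series_estimate is a hypothetical helper, the workload and retention figures are made-up assumptions, and since the issue describes 40-60 as a minimum, the real totals may be higher.

```shell
# Prints "<low> <high>": the estimated number of time series created by the
# given number of jobs, using the 40-60 series-per-job range from the issue.
series_estimate() {
    local jobs="$1"
    echo "$((jobs * 40)) $((jobs * 60))"
}

JOBS_PER_DAY=1000      # hypothetical workload
RETENTION_DAYS=15      # hypothetical retention window

# Every job carries a unique label value, so series accumulate per job.
read -r low high <<EOF
$(series_estimate $((JOBS_PER_DAY * RETENTION_DAYS)))
EOF
echo "series retained: between ${low} and ${high}"
```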

Upgrade Kubernetes api versions

Some of the resource types are deprecated and will soon be unavailable in newer versions of Kubernetes.
We should upgrade:

  • apiextensions.k8s.io/v1beta1 CustomResourceDefinition to apiextensions.k8s.io/v1 CustomResourceDefinition
  • admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration to admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
  • rbac.authorization.k8s.io/v1beta1 to rbac.authorization.k8s.io/v1

Helm install logs:

W1130 16:59:44.524022   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:44.860813   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:45.942810   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:46.341809   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:46.747813   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:47.363204   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:50.037879   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:50.348874   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:50.676881   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:51.009879   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:51.305889   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:51.639880   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 17:00:10.330688   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1130 17:00:12.706693   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W1130 17:00:14.761705   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 Role is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 Role
W1130 17:00:15.635708   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
W1130 17:00:21.398342   30840 warnings.go:70] admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
W1130 17:00:31.766340   30840 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
W1130 17:00:54.576299   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1130 17:00:54.873303   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W1130 17:00:55.173551   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 Role is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 Role
W1130 17:00:55.475548   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 RoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 RoleBinding
W1130 17:00:57.050638   30840 warnings.go:70] admissionregistration.k8s.io/v1beta1 MutatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 MutatingWebhookConfiguration
W1130 17:01:08.548662   30840 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
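
Since most of these lines repeat, it can help to collapse them into one line per deprecated API with a count. This is a minimal sketch that assumes the log format shown above; summarize_warnings is a hypothetical helper and the sample holds three lines copied from the log.

```shell
# Strip the timestamp/PID prefix up to "warnings.go:NN] ", then count
# identical messages and list the most frequent first.
summarize_warnings() {
    sed -n 's/^.*warnings\.go:[0-9]*] //p' | sort | uniq -c | sort -rn
}

# Three lines taken from the Helm install log above.
sample_log='W1130 16:59:44.524022   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 16:59:44.860813   30840 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W1130 17:00:10.330688   30840 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole'

printf '%s\n' "$sample_log" | summarize_warnings
```

For a real install, the helm output can be piped through the same function by appending 2>&1 | summarize_warnings to the helm install command.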

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your azure-synapse-spark-metrics repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/azure-synapse-spark-metrics/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @jeffzhengwei, @wezhang, @kaiyuezhou
