microsoftdocs / pipelines-azureml

Example Azure Pipeline to train and deploy a machine learning model using the Azure Machine Learning service

License: Creative Commons Attribution 4.0 International

Python 100.00%

pipelines-azureml's Issues

Issue running Azure DevOps pipeline from pipelines/diabetes-train-and-deploy.yml

I've followed the instructions in the readme to set up the repo, created the service connection as directed, and created an Azure DevOps pipeline based on the diabetes-train-and-deploy.yml file. The workspace the pipeline points to is an existing resource that was created prior to finding the pipelines-azureml repo. When I run the pipeline it always fails on the Train Model step with the following error:

"error": {
    "code": "UserError",
    "message": "Image build run on compute failed: User starting the run is not an owner or assigned user to the Compute Instance",
    "details": []
},

I was able to dig further into the error in ML Studio, which shows that the calling user is the service connection I set up for the pipeline. On the off chance that it was a permissions issue, I added that user as a Contributor on the workspace, but I see the same error. I've also tried the PowerShell commands from the "Run CLI scripts..." section at the bottom of the README.md file, and I get the same message when running under my Azure account, which has the Owner role on the ML workspace.

The pipeline was able to create the compute cluster, but it seems that it doesn't have access to the cluster after it's created? Another possibility is that our workspace has something locked down that is preventing this pipeline from working properly. Any help is greatly appreciated. Thank you!

Readme instructions broken

Since the last change to azure-pipelines.yml, the instructions in readme.md are no longer valid:

Modify the azure-pipelines.yml and change myresourcegroup to the Azure resource group that contains your workspace. You must also change the myworkspace entry to the name of your Azure Machine Learning service workspace.

azureSubscription (service connection) is now "build-demo" everywhere instead of "azmldemows"
resource group name is now "scottgu-all-hands" instead of "myresourcegroup"
ML workspace name is now "build-2019-demo" instead of "myworkspace"

Retry pipeline and/or task on failure

I use the Python SDK to develop ML pipelines for Azure ML.

How do I get my PythonScriptStep tasks or the encompassing Pipeline object to simply rerun upon failure?
I reckon it's pretty common for pipelines to temporarily break upon temporary network, storage, etc. issues so a simple rerun / retry seems pretty basic for task orchestration frameworks to provide (see e.g. Apache Airflow).

I've spent a fair amount of time going over the documentation for Azure ML and I just can't figure out how to get "retry upon failure" behaviour.

The closest thing I found is the continue_on_step_failure pipeline / task parameter, which doesn't really do what's needed.

Any advice please?
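One possible client-side workaround, sketched below, is to wrap the submission in a generic retry loop with exponential backoff. This is not a built-in Azure ML feature; the `retry` helper and its parameters are hypothetical names introduced here for illustration, and `operation` stands for whatever function submits the pipeline and raises an exception when the run fails.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper: `operation` is assumed to raise on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reached when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

Narrowing `retry_on` to the exception types that signal transient problems (network or storage errors, for example) avoids rerunning a pipeline that fails for a deterministic reason.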

Model not found in cache or in root at ./diabetes-model

Hello,

While following the steps of the Azure Pipeline, I got this error:

"message": "Service deployment polling reached non-successful terminal state, current service state: Unhealthy\nOperation ID: e9252f0d-81f8-44e5-bd6d-983076eca1f5\nMore information can be found using '.get_logs()'\nError:\n{\n "code": "DeploymentTimedOut",\n "statusCode": 504,\n "message": "The deployment operation polling has TimedOut. The service creation is taking longer than our normal time. We are still trying to achieve the desired state for the web service. Please check the webservice state for the current webservice health. You can run print(service.state) from the python SDK to retrieve the current state of the webservice."\n}

Looking at the logs with get_logs(), I extracted this part of the message:
Model not found in cache or in root at ./diabetes-model
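The error message above points to `print(service.state)` and `.get_logs()` for diagnosis. A minimal sketch of collecting both in one place follows. The `diagnose` helper is a hypothetical name, and it only assumes that the object passed in exposes a `state` attribute and a `get_logs()` method, as the message describes.

```python
from typing import Any, Dict


def diagnose(service: Any, tail: int = 20) -> Dict[str, Any]:
    """Collect the state and the last `tail` log lines of a web service object.

    Hypothetical helper: `service` is assumed to expose `.state` and
    `.get_logs()`, as mentioned in the deployment error message.
    """
    logs = service.get_logs() or ""
    lines = logs.splitlines()
    return {
        "state": service.state,
        "log_tail": lines[-tail:] if tail > 0 else [],
        # Flag the specific message quoted in this issue, if present.
        "model_missing": any("Model not found" in line for line in lines),
    }
```

With the v1 Python SDK, the service object would typically come from something like `Webservice(workspace, name="diabetes-qa-aci")` in `azureml.core.webservice`; that lookup is left out of the sketch so that it runs without an Azure subscription.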

The az CLI command is the following: az ml model deploy -n diabetes-qa-aci -f model.json --ic config/inference-config.yml --dc config/deployment-config-aci.yml --overwrite -v

model.json is created by the previous step and contains:
{
"cpu": "",
"createdTime": "2020-06-09T04:57:54.550301+00:00",
"description": "",
"experimentName": "diabetes-exp",
"framework": "Custom",
"frameworkVersion": null,
"gpu": "",
"id": "diabetes_reg_model:2",
"memoryInGB": "",
"name": "diabetes_reg_model",
"properties": "",
"runId": "diabetes-exp_1591678184_b25da442",
"sampleInputDatasetId": "",
"sampleOutputDatasetId": "",
"tags": "",
"version": 2
}

Any ideas?

Issue with model train command

Hi,

We are getting an error when running the command below.
az ml run submit-script -c config/train --ct $(ml-ct) -e $(ml-exp) -t run.json train.py

Running h2o.ai in Azure ML (Installing Java is a must)

mcr.microsoft.com/azureml/base:0.2.4 is a pretty minimal image, so I tried a few approaches to install Java.

Adding a custom base dockerfile

script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: train-env.yml
  docker:
    enabled: true
    baseDockerfile: Dockerfile

Returns error:

Output from dependency scanning: fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

Adding an argument to the docker command

According to this documentation, and this one as well, I can add an argument to the docker command, so I tried the following.

script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: config/train-conda.yml
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/base:0.2.4
    arguments: ["--run","apt-get install default-jdk"]

I also tried arguments: "apt-get install default-jdk" in this form.

As there is no documentation about this, I am having trouble installing Java in the environment. I am looking for your help.

Any example of model deployment on local compute?

Instead of ACI, what if we want to test our deployment locally via Azure DevOps?

What would the steps be? Could you please add them? So far I have this in deployment-config-local.yml:

computeType: local
port: 13579

and in the pipeline I have

az ml model deploy -n diabetes-qa-local --model diabetes-model:1 --ic config/inference-config.yml --dc config/deployment-config-local.yml

But it returns

Downloading model diabetes-model:1 to C:\Users\mkrdi\AppData\Local\Temp\azureml_s5877b_f\diabetes-model\1
Generating Docker build context.

then it fails

{'Azure-cli-ml Version': '1.4.0', 'Error': WebserviceException:
        Message: Received bad response from service:
Response Code: 400
Headers: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}
Content: b'{"code":"BadRequest","statusCode":400,"message":"The request is invalid.","details":[{"code":"ServiceModelConflict","message":"Exactly one of the ModelIds or Models must be specified for a service."}],"correlation":{"RequestId":"e734f89cdce14741bf8dc8ca879a8bab"}}'
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Received bad response from service:\nResponse Code: 400\nHeaders: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}\nContent: b'{\"code\":\"BadRequest\",\"statusCode\":400,\"message\":\"The request is invalid.\",\"details\":[{\"code\":\"ServiceModelConflict\",\"message\":\"Exactly one of the ModelIds or Models must be specified for a service.\"}],\"correlation\":{\"RequestId\":\"e734f89cdce14741bf8dc8ca879a8bab\"}}'"     
    }
}}

Problems executing the pipeline examples

Hello there,
I'm trying to follow the tutorial, but when I ran it I got the following error:

##[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request
Pool: Azure Pipelines
Image: Ubuntu-16.04
Started: Just now
Duration: 11s

Job preparation parameters
ContinueOnError: False
TimeoutInMinutes: 60
CancelTimeoutInMinutes: 5
Expand:
  MaxConcurrency: 0
  ########## System Pipeline Decorator(s) ##########

  Begin evaluating template 'system-pre-steps.yml'
Evaluating: eq('true', variables['system.debugContext'])
Expanded: eq('true', Null)
Result: False
Evaluating: resources['repositories']['self']
Expanded: Object
Result: True
Evaluating: not(containsValue(job['steps']['*']['task']['id'], '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Expanded: not(containsValue(Object, '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Result: True
Evaluating: resources['repositories']['self']['checkoutOptions']
Result: Object
Finished evaluating template 'system-pre-steps.yml'
********************************************************************************
Template and static variable resolution complete. Final runtime YAML document:
steps:
- task: 6d15af64-176c-496d-b583-fd2ae21d4df4@1
  inputs:
    repository: self

I found that you now have to request permission from Microsoft. Is there any way to run it without requesting their permission?

Thank you

Unable to delete pipeline drafts?

The Designer UI has a feature to delete pipeline drafts.

This feature is grayed out. There is no ability to select the pipeline draft and delete it either. Is this a defect?

'Error': TypeError("'<' not supported between instances of 'int' and 'NoneType'",)}

I get this error when executing the computetarget create command below:
"az ml computetarget create amlcompute -n $(ml-ct) --vm-size STANDARD_D2_V2 --max-nodes 1"

Version:
'Azure-cli-ml Version': '1.24.0'

Compute name 'cpu-cluster-1' is invalid

I'm raising a ticket because the compute name 'cpu-cluster-1' is invalid. My suggestion would be to change it to 'cpu'. See the error message below:

Command group 'ml' is experimental and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Creating compute instance...
{'Azure-cli-ml Version': '1.29.0', 'Error': ComputeTargetException:
        Message: Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-)，must start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-)\uff0cmust start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name"
    }
}}
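The naming rules quoted in the error message can be checked locally before calling the CLI. The sketch below is one reading of those rules, not the service's own validation: the function name is hypothetical, and the last rule is interpreted as "the name must not end in a hyphen followed only by digits", which is an assumption about the wording.

```python
import re

# 3 to 24 characters, starting with a letter, then letters, digits or hyphens.
_ALLOWED = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")
# Assumed reading of the last rule: no trailing "-<digits>".
_HYPHEN_THEN_DIGITS = re.compile(r"-[0-9]+\Z")


def is_valid_compute_instance_name(name: str) -> bool:
    """Check a name against the rules quoted in the error message (hypothetical helper)."""
    if _ALLOWED.fullmatch(name) is None:
        return False
    return _HYPHEN_THEN_DIGITS.search(name) is None
```

Under this reading, 'cpu-cluster-1' is rejected because it ends in '-1', while the suggested 'cpu' passes.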

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Error in train model

I'm having trouble completing the getting_started example (getting_started.md): the pipeline stalls on the Train model job, which runs for about 60 minutes before being canceled automatically. Here are the last log lines before the cancellation (the attached file, Complete Logs.txt, contains the entire log):

2022-02-07T00:52:37.0050192Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6b/b2/c0d62a3a91c13641e09af294c13fe16929f88dc5902718388cd9b292217f/azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl
2022-02-07T00:52:37.0052090Z Downloading azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl (112 kB)
2022-02-07T00:52:37.0052735Z
2022-02-07T00:57:40.9228879Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/a1/71/9a20913e92771b3c23564f1bea54d376d09fb30a75585087c70b769d75c8/azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl
2022-02-07T00:58:41.5520782Z Downloading azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl (111 kB)
2022-02-07T00:58:41.5521395Z
2022-02-07T00:59:42.2727333Z INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
2022-02-07T01:03:45.8869909Z Downloading azure_mgmt_authorization-0.51.0-py2.py3-none-any.whl (111 kB)
2022-02-07T01:09:52.4374279Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6f/17/55b974603c16be89c7a7c16bac57b7bce48527bf1bebc3f116f7215176e6/azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl
2022-02-07T01:09:52.4376241Z Downloading azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl (81 kB)
2022-02-07T01:09:52.4376835Z
2022-02-07T01:26:07.6809069Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/67/e4/b3535daae30db9b3f73046a0c151c5c2ae2d2bff96ba0c28c1f26a21dbf1/azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl
2022-02-07T01:26:07.6811091Z Downloading azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl (38 kB)
2022-02-07T01:26:07.6811445Z
2022-02-07T01:39:04.9650251Z ##[error]The operation was canceled.
2022-02-07T01:39:04.9664245Z ##[section]Finishing: Train model

Error in Attach folder to workspace step

Hi,
When I run the pipeline, I get the error below. The problem seems to be at the Attach folder to workspace step.

task: AzureCLI@2
displayName: 'Attach folder to workspace'
inputs:
  azureSubscription: $(ml-ws-connection)
  workingDirectory: $(ml-path)
  scriptLocation: inlineScript
  scriptType: 'bash'
  inlineScript: 'az ml folder attach -w $(ml-ws) -g $(ml-rg)'

ERROR: ProjectSystemException:
Message: {
"error_details": {
"error": {
"code": "AuthorizationFailed",
"message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."
}
},
"status_code": 403,
"url": "https://management.azure.com/subscriptions/ce55f75a-7c5d-4393-ac9e-601083781d51/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01"
}
InnerException None
ErrorResponse
{
"error": {
"message": "{\n "error_details": {\n "error": {\n "code": "AuthorizationFailed",\n "message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."\n }\n },\n "status_code": 403,\n "url": "https://management.azure.com/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01\"\n}"
}
}
##[error]Script failed with exit code: 1

Testing the model

My deployments to AKS and ACI completed successfully. How can I test whether the service is running as expected?
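One common way to check a deployed service is to send it a sample scoring request over HTTP. The sketch below uses only the Python standard library. Everything specific in it is an assumption: the function names are hypothetical, the `{"data": [...]}` payload shape depends on how the scoring script parses its input, and the bearer key applies only when key-based authentication is enabled on the service.

```python
import json
import urllib.request
from typing import Any, List, Optional


def build_scoring_request(
    scoring_uri: str,
    rows: List[List[float]],
    key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a JSON POST request for a scoring endpoint (hypothetical helper)."""
    # Assumed payload shape; adjust to match the service's scoring script.
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if key:
        # Assumed: key-based auth passed as a bearer token.
        headers["Authorization"] = f"Bearer {key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


def score(
    scoring_uri: str,
    rows: List[List[float]],
    key: Optional[str] = None,
    timeout: float = 30.0,
) -> Any:
    """Send the request and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows, key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `score(uri, rows)` with the scoring URI of the ACI or AKS service and one row of feature values should return a prediction if the service is healthy; an HTTP error or timeout suggests checking the service state and logs.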
