pipelines-azureml's Issues

Issue running Azure DevOps pipeline from pipelines/diabetes-train-and-deploy.yml

I've followed the instructions in the readme to set up the repo, created the service connection as directed, and created an Azure DevOps pipeline based on the diabetes-train-and-deploy.yml file. The workspace the pipeline points to is an existing resource that was created prior to finding the pipelines-azureml repo. When I run the pipeline it always fails on the Train Model step with the following error:

"error": {
    "code": "UserError",
    "message": "Image build run on compute failed: User starting the run is not an owner or assigned user to the Compute Instance",
    "details": []
},

Digging further into the error in ML Studio shows that the calling user is the service connection I set up for the pipeline. On the off chance that it was a permissions issue, I added that user as a Contributor on the workspace, but I see the same error. I've also tried the PowerShell commands from the "Run CLI scripts..." section at the bottom of the README.md file, and I get the same message when running under my own Azure account, which has the Owner role on the ML workspace.

The pipeline was able to create the compute cluster, but it seems that it doesn't have access to the cluster after it's created? Another possibility is that our workspace has something locked down that is preventing this pipeline from working properly. Any help is greatly appreciated. Thank you!

Readme instructions broken

Since the last change to azure-pipelines.yml, the instructions in readme.md are no longer valid:

Modify the azure-pipelines.yml and change myresourcegroup to the Azure resource group that contains your workspace. You must also change the myworkspace entry to the name of your Azure Machine Learning service workspace.

  • azureSubscription (service connection) is now "build-demo" everywhere instead of "azmldemows"
  • resource group name is now "scottgu-all-hands" instead of "myresourcegroup"
  • ML workspace name is now "build-2019-demo" instead of "myworkspace"

Retry pipeline and/or task on failure

I use the Python SDK to develop ML pipelines for Azure ML.

How do I get my PythonScriptStep tasks or the encompassing Pipeline object to simply rerun upon failure?
I reckon it's fairly common for pipelines to fail on transient network or storage issues, so a simple rerun / retry seems like a basic feature for a task orchestration framework to provide (see e.g. Apache Airflow).

I've spent a fair amount of time going over the documentation for Azure ML and I just can't figure out how to get "retry upon failure" behaviour.

The closest thing I found is the continue_on_step_failure pipeline / task parameter, which doesn't really do what's needed.

Any advice please?
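
One possible workaround, independent of any SDK feature, is to retry on the client side. The sketch below is a generic retry wrapper with exponential backoff. `submit` is a hypothetical zero-argument callable standing in for whatever code submits the pipeline and waits for it to finish; it is assumed to raise an exception on failure. Nothing here is an Azure ML API.

```python
import time


def run_with_retry(submit, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is reached.

    submit is a hypothetical zero-argument callable that raises an
    exception when the run fails and returns a result otherwise.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as error:  # narrow this to transient errors in real use
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Catching every exception is a simplification; in practice it is probably better to retry only on errors known to be transient, so that genuine bugs in a step fail fast.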

Model not found in cache or in root at ./diabetes-model

Hello,

Following the steps of the Azure pipeline, I got this error:

"message": "Service deployment polling reached non-successful terminal state, current service state: Unhealthy\nOperation ID: e9252f0d-81f8-44e5-bd6d-983076eca1f5\nMore information can be found using '.get_logs()'\nError:\n{\n "code": "DeploymentTimedOut",\n "statusCode": 504,\n "message": "The deployment operation polling has TimedOut. The service creation is taking longer than our normal time. We are still trying to achieve the desired state for the web service. Please check the webservice state for the current webservice health. You can run print(service.state) from the python SDK to retrieve the current state of the webservice."\n}

Looking at the logs from get_logs(), I found this message:
Model not found in cache or in root at ./diabetes-model

The az CLI command is the following: az ml model deploy -n diabetes-qa-aci -f model.json --ic config/inference-config.yml --dc config/deployment-config-aci.yml --overwrite -v

And model.json is created by the previous step and contains:
{
    "cpu": "",
    "createdTime": "2020-06-09T04:57:54.550301+00:00",
    "description": "",
    "experimentName": "diabetes-exp",
    "framework": "Custom",
    "frameworkVersion": null,
    "gpu": "",
    "id": "diabetes_reg_model:2",
    "memoryInGB": "",
    "name": "diabetes_reg_model",
    "properties": "",
    "runId": "diabetes-exp_1591678184_b25da442",
    "sampleInputDatasetId": "",
    "sampleOutputDatasetId": "",
    "tags": "",
    "version": 2
}

Any ideas?
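
The log mentions ./diabetes-model while model.json reports the name diabetes_reg_model, so one thing that may be worth checking is whether the name the scoring script asks for matches the registered name. The sketch below only reads the JSON and compares names. `expected_name` is an assumption about what the scoring script requests, not something the logs confirm.

```python
import json


def registered_model_name(model_json_text):
    """Return the 'name' field from the model.json written by the previous step."""
    return json.loads(model_json_text)["name"]


def name_matches(model_json_text, expected_name):
    """True if the registered name equals the name the scoring script expects.

    expected_name is a hypothetical value supplied by the caller.
    """
    return registered_model_name(model_json_text) == expected_name
```

For example, `name_matches(open("model.json").read(), "diabetes-model")` would return False for the file shown in this issue, since its name is diabetes_reg_model.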

Issue with model train command

Hi,

We are getting an error when running the command below:
az ml run submit-script -c config/train --ct $(ml-ct) -e $(ml-exp) -t run.json train.py

Running h2o.ai in Azure ML (Installing Java is a must)

mcr.microsoft.com/azureml/base:0.2.4 is pretty bare, so I tried a few approaches to install Java.

  1. Adding a custom base dockerfile
script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: train-env.yml
  docker:
    enabled: true
    baseDockerfile: Dockerfile

Returns error:

Output from dependency scanning: fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

  2. Adding an argument to the docker command. According to this documentation and this one as well, I can add an argument to the docker command, so I tried the following.
script: train.py
arguments: []
framework: Python
environment:
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: config/train-conda.yml
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/base:0.2.4
    arguments: ["--run","apt-get install default-jdk"] 

I also tried arguments: "apt-get install default-jdk".

As there is no documentation about this, I'm having trouble installing Java in the environment. Any help would be appreciated.

Any example of model deployment on local compute?

Instead of ACI, what if we want to test our deployment via Azure DevOps locally?

What would the steps be? Could you please add an example? So far I have this in deployment-config-local.yml:

computeType: local
port: 13579

and in the pipeline I have

az ml model deploy -n diabetes-qa-local --model diabetes-model:1 --ic config/inference-config.yml --dc config/deployment-config-local.yml

But it returns

Downloading model diabetes-model:1 to C:\Users\mkrdi\AppData\Local\Temp\azureml_s5877b_f\diabetes-model\1
Generating Docker build context.

then it fails

{'Azure-cli-ml Version': '1.4.0', 'Error': WebserviceException:
        Message: Received bad response from service:
Response Code: 400
Headers: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}
Content: b'{"code":"BadRequest","statusCode":400,"message":"The request is invalid.","details":[{"code":"ServiceModelConflict","message":"Exactly one of the ModelIds or Models must be specified for a service."}],"correlation":{"RequestId":"e734f89cdce14741bf8dc8ca879a8bab"}}'
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Received bad response from service:\nResponse Code: 400\nHeaders: {'Date': 'Wed, 06 May 2020 02:01:46 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-v1:2d2e8e63-272e-4b3c-8598-4ee570a0e70d', 'x-ms-client-request-id': 'e734f89cdce14741bf8dc8ca879a8bab', 'x-ms-client-session-id': '71665c61-45e2-465a-9b6b-10d23ce6b0f8', 'api-supported-versions': '1.0, 2018-03-01-preview, 2018-11-19', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains; preload'}\nContent: b'{\"code\":\"BadRequest\",\"statusCode\":400,\"message\":\"The request is invalid.\",\"details\":[{\"code\":\"ServiceModelConflict\",\"message\":\"Exactly one of the ModelIds or Models must be specified for a service.\"}],\"correlation\":{\"RequestId\":\"e734f89cdce14741bf8dc8ca879a8bab\"}}'"     
    }
}}

Problems executing the pipeline examples

Hello there,
I'm trying to follow the tutorial, but when I ran it I got the following error:

##[error]No hosted parallelism has been purchased or granted. To request a free parallelism grant, please fill out the following form https://aka.ms/azpipelines-parallelism-request
Pool: Azure Pipelines
Image: Ubuntu-16.04
Started: Just now
Duration: 11s

Job preparation parameters
ContinueOnError: False
TimeoutInMinutes: 60
CancelTimeoutInMinutes: 5
Expand:
  MaxConcurrency: 0
  ########## System Pipeline Decorator(s) ##########

  Begin evaluating template 'system-pre-steps.yml'
Evaluating: eq('true', variables['system.debugContext'])
Expanded: eq('true', Null)
Result: False
Evaluating: resources['repositories']['self']
Expanded: Object
Result: True
Evaluating: not(containsValue(job['steps']['*']['task']['id'], '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Expanded: not(containsValue(Object, '6d15af64-176c-496d-b583-fd2ae21d4df4'))
Result: True
Evaluating: resources['repositories']['self']['checkoutOptions']
Result: Object
Finished evaluating template 'system-pre-steps.yml'
********************************************************************************
Template and static variable resolution complete. Final runtime YAML document:
steps:
- task: 6d15af64-176c-496d-b583-fd2ae21d4df4@1
  inputs:
    repository: self

I found that you now have to request a parallelism grant from Microsoft. Is there any way to run the pipeline without requesting it?

Thank you

Unable to delete pipeline drafts?

The Designer UI has a feature to delete pipeline drafts.

This feature is grayed out. There is no ability to select the pipeline draft and delete it either. Is this a defect?


Compute name 'cpu-cluster-1' is invalid

Raising a ticket because the compute name 'cpu-cluster-1' is invalid. I suggest changing it to 'cpu'. See the error message below:

Command group 'ml' is experimental and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Creating compute instance...
{'Azure-cli-ml Version': '1.29.0', 'Error': ComputeTargetException:
        Message: Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-),must start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name
        InnerException None
        ErrorResponse
{
    "error": {
        "message": "Compute name 'cpu-cluster-1' is not available. Reason: Invalid. Message: A name for an Azure ML Compute Instance must be between 3 and 24 characters in length and must use only numbers, letters and minus symbol (-)\uff0cmust start with letters. Numbers cannot be the ending of the name if the previous character is a minus symbol (-). Please specify a different Azure ML Instance name"
    }
}}
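
The naming rule quoted in the error message can be checked locally before submitting a name. The sketch below encodes that rule as read from the message: 3 to 24 characters, only letters, digits and minus, starting with a letter, and not ending in a digit that directly follows a minus. The function name is made up for illustration, and the last condition is one literal reading of the message, so the service may be stricter or looser.

```python
import re

# Letters, digits and '-' only; must start with a letter; 3 to 24 characters.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")


def is_valid_compute_instance_name(name):
    """Check a name against the rule quoted in the error message above."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    # Reject a final digit whose previous character is a minus, e.g. "-1".
    if re.search(r"-[0-9]$", name):
        return False
    return True
```

Under this reading, 'cpu-cluster-1' is rejected because it ends in "-1", while 'cpu' and 'cpu-cluster' pass.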

Error in train model

I'm having trouble completing the getting_started example (getting_started.md): the pipeline stalls on the Train model job, which runs for about 60 minutes before being canceled automatically. Here are the last log lines before the cancellation (the attached file, Complete Logs.txt, contains the full logs):

2022-02-07T00:52:37.0050192Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6b/b2/c0d62a3a91c13641e09af294c13fe16929f88dc5902718388cd9b292217f/azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl
2022-02-07T00:52:37.0052090Z Downloading azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl (112 kB)
2022-02-07T00:52:37.0052735Z
2022-02-07T00:57:40.9228879Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/a1/71/9a20913e92771b3c23564f1bea54d376d09fb30a75585087c70b769d75c8/azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl
2022-02-07T00:58:41.5520782Z Downloading azure_mgmt_authorization-0.51.1-py2.py3-none-any.whl (111 kB)
2022-02-07T00:58:41.5521395Z
2022-02-07T00:59:42.2727333Z INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
2022-02-07T01:03:45.8869909Z Downloading azure_mgmt_authorization-0.51.0-py2.py3-none-any.whl (111 kB)
2022-02-07T01:09:52.4374279Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/6f/17/55b974603c16be89c7a7c16bac57b7bce48527bf1bebc3f116f7215176e6/azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl
2022-02-07T01:09:52.4376241Z Downloading azure_mgmt_authorization-0.50.0-py2.py3-none-any.whl (81 kB)
2022-02-07T01:09:52.4376835Z
2022-02-07T01:26:07.6809069Z WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)",)': /packages/67/e4/b3535daae30db9b3f73046a0c151c5c2ae2d2bff96ba0c28c1f26a21dbf1/azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl
2022-02-07T01:26:07.6811091Z Downloading azure_mgmt_authorization-0.40.0-py2.py3-none-any.whl (38 kB)
2022-02-07T01:26:07.6811445Z
2022-02-07T01:39:04.9650251Z ##[error]The operation was canceled.
2022-02-07T01:39:04.9664245Z ##[section]Finishing: Train model
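
The log shows pip downloading one release of azure_mgmt_authorization after another, and pip's own message suggests giving the resolver stricter constraints. To see which packages are being backtracked over in a long log, a small parser like the sketch below can help. It only recognises "Downloading <name>-<version>-...whl" lines of the kind shown above; the function names are made up for illustration.

```python
import re

# Matches wheel downloads such as
# "Downloading azure_mgmt_authorization-0.52.0-py2.py3-none-any.whl (112 kB)".
_WHEEL = re.compile(
    r"Downloading (?P<name>[A-Za-z0-9_.]+)-(?P<version>[0-9][A-Za-z0-9_.]*)-"
)


def versions_tried(log_lines):
    """Map each package name to the versions pip downloaded, in log order."""
    tried = {}
    for line in log_lines:
        match = _WHEEL.search(line)
        if match:
            versions = tried.setdefault(match.group("name"), [])
            if match.group("version") not in versions:
                versions.append(match.group("version"))
    return tried


def backtracking_packages(log_lines):
    """Return only the packages that were downloaded in more than one version."""
    return {
        name: versions
        for name, versions in versions_tried(log_lines).items()
        if len(versions) > 1
    }
```

Packages reported by `backtracking_packages` are candidates for a pinned version in the conda or pip dependency file used by the training step.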

Error in Attach folder to workspace step

Hi,
When I run the pipeline, I get the error below. The problem seems to be in the Attach folder to workspace step.

- task: AzureCLI@2
  displayName: 'Attach folder to workspace'
  inputs:
    azureSubscription: $(ml-ws-connection)
    workingDirectory: $(ml-path)
    scriptLocation: inlineScript
    scriptType: 'bash'
    inlineScript: 'az ml folder attach -w $(ml-ws) -g $(ml-rg)'

ERROR: ProjectSystemException:
Message: {
"error_details": {
"error": {
"code": "AuthorizationFailed",
"message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."
}
},
"status_code": 403,
"url": "https://management.azure.com/subscriptions/ce55f75a-7c5d-4393-ac9e-601083781d51/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01"
}
InnerException None
ErrorResponse
{
"error": {
"message": "{\n "error_details": {\n "error": {\n "code": "AuthorizationFailed",\n "message": "The client 'a43e0215-c079-499e-b242-2c8cdc19e0ec' with object id 'a43e0215-c079-499e-b242-2c8cdc19e0ec' does not have authorization to perform action 'Microsoft.MachineLearningServices/workspaces/read' over scope '/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo' or the scope is invalid. If access was recently granted, please refresh your credentials."\n }\n },\n "status_code": 403,\n "url": "https://management.azure.com/subscriptions/#######-####-####-####-###########/resourceGroups/aml-demo/providers/Microsoft.MachineLearningServices/workspaces/aml-demo?api-version=2020-01-01\"\n}"
}
}
##[error]Script failed with exit code: 1

Testing the model

My deployments to AKS and ACI completed successfully. How can I test that the service is running as expected?
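
One way to smoke-test a deployed service is to POST a sample payload to its scoring URI and inspect the response. The sketch below uses only the Python standard library. The scoring URI, the {"data": ...} payload shape, and the Bearer key header are all assumptions here: the real payload depends on what the scoring script expects, and the key is only needed if authentication is enabled on the service.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, key=None):
    """Build a JSON POST request for a scoring endpoint.

    The {"data": rows} body is an assumed shape; adjust it to match
    whatever the scoring script parses.
    """
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if key:
        # Assumed key-based auth; omit key if the service has auth disabled.
        headers["Authorization"] = f"Bearer {key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers)


def score(scoring_uri, rows, key=None, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows, key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `score(uri, [[...feature values...]])` with one row of known inputs and comparing the result with a prediction made offline gives a quick check that the service is both reachable and returning sensible output.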
