
mlops's Introduction

page_type: sample
languages: python
products: azure, azure-machine-learning-service, azure-devops
description: MLOps end to end examples & solutions. A collection of examples showing different end to end scenarios operationalizing ML workflows with Azure Machine Learning, integrated with GitHub and other Azure services such as Data Factory and DevOps.

Updated MLOps Guidance on Azure (2023)

To learn more about the latest MLOps guidance from Microsoft, review the following links.


MLOps on Azure

What is MLOps?

MLOps empowers data scientists and app developers to help bring ML models to production. MLOps enables you to track / version / audit / certify / re-use every asset in your ML lifecycle and provides orchestration services to streamline managing this lifecycle.

MLOps podcast

Check out the recent TwiML podcast on MLOps here

How does Azure ML help with MLOps?

Azure ML contains a number of asset management and orchestration services to help you manage the lifecycle of your model training & deployment workflows.

With Azure ML + Azure DevOps you can effectively and cohesively manage your datasets, experiments, models, and ML-infused applications.

[Figure: ML lifecycle]

New MLOps features

If you are using the Machine Learning DevOps extension, you can access model name and version info using these variables:

  • Model Name: Release.Artifacts.{alias}.DefinitionName contains the model name.
  • Model Version: Release.Artifacts.{alias}.BuildNumber contains the model version.

Here, {alias} is the source alias set when adding the release artifact.
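As a minimal sketch, a Python script running inside a release job could read these two variables from its environment. This assumes the agent exposes pipeline variables as environment variables with the name upper-cased and each "." replaced by "_"; the alias "mymodel" and the helper names are hypothetical and exist only for illustration.

```python
import os


def release_variable_to_env_name(variable_name: str) -> str:
    """Convert a pipeline variable name to its assumed environment-variable form.

    Assumption: the name is upper-cased and every '.' becomes '_'.
    """
    return variable_name.replace(".", "_").upper()


def read_model_info(alias: str, environ=None) -> dict:
    """Look up the model name and version for a given artifact source alias."""
    environ = os.environ if environ is None else environ
    name_key = release_variable_to_env_name(
        f"Release.Artifacts.{alias}.DefinitionName"
    )
    version_key = release_variable_to_env_name(
        f"Release.Artifacts.{alias}.BuildNumber"
    )
    return {
        "model_name": environ.get(name_key),
        "model_version": environ.get(version_key),
    }


# Illustration with a fake environment; "mymodel" is a hypothetical alias.
fake_env = {
    "RELEASE_ARTIFACTS_MYMODEL_DEFINITIONNAME": "mymodel",
    "RELEASE_ARTIFACTS_MYMODEL_BUILDNUMBER": "3",
}
info = read_model_info("mymodel", fake_env)
```

Passing the environment as a parameter keeps the lookup easy to test outside a real release.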

Getting Started / MLOps Workflow

An example repo which exercises our recommended flow can be found here

MLOps Best Practices

Train Model

  • Data scientists work in topic branches off of master.
  • When code is pushed to the Git repo, trigger a CI (continuous integration) pipeline.
  • First run: Provision infra-as-code (ML workspace, compute targets, datastores).
  • For new code: Every time new code is committed to the repo, run unit tests and data quality checks, then train the model.

We recommend the following steps in your CI process:

  • Train Model - run training code / algo & output a model file which is stored in the run history.
  • Evaluate Model - compare the performance of newly trained model with the model in production. If the new model performs better than the production model, the following steps are executed. If not, they will be skipped.
  • Register Model - take the best model and register it with the Azure ML Model registry. This allows us to version control it.
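The train, evaluate, and register steps above can be sketched as a simple gate. This is an illustration only: ModelCandidate, should_register, and run_ci_steps are hypothetical names, the in-memory list stands in for the model registry, and the sketch assumes a single metric where higher is better.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelCandidate:
    name: str
    metric: float  # assumption: higher is better (e.g. accuracy)


def should_register(new: ModelCandidate,
                    production: Optional[ModelCandidate]) -> bool:
    """Evaluate step: accept the new model only if it beats production."""
    if production is None:
        # Nothing deployed yet, so the first trained model is accepted.
        return True
    return new.metric > production.metric


def run_ci_steps(new: ModelCandidate,
                 production: Optional[ModelCandidate],
                 registry: List[ModelCandidate]) -> bool:
    """Register the new model, or skip registration if it is not better."""
    if not should_register(new, production):
        return False
    registry.append(new)  # stand-in for registering with a model registry
    return True
```

In a real pipeline the comparison would use whatever metric the team has agreed on, and the skipped branch would end the run without touching the deployed model.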

Operationalize Model

  • You can package and validate your ML model using the Azure ML CLI.
  • Once you have registered your ML model, you can use Azure ML + Azure DevOps to deploy it.
  • You can define a release definition in Azure Pipelines to help coordinate a release. Using the DevOps extension for Machine Learning, you can include artifacts from Azure ML, Azure Repos, and GitHub as part of your Release Pipeline.
  • In your release definition, you can leverage the Azure ML CLI's model deploy command to deploy your Azure ML model to the cloud (ACI or AKS).
  • Define your deployment as a gated release. This means that once the model web service deployment in the Staging/QA environment is successful, a notification is sent to approvers to manually review and approve the release. Once the release is approved, the model scoring web service is deployed to Azure Kubernetes Service (AKS) and the deployment is tested.

MLOps Solutions

We are committed to providing a collection of best-in-class solutions for MLOps, both in terms of well documented & fully managed cloud solutions, as well as reusable recipes which can help your organization to bootstrap its MLOps muscle. These examples are community supported and are not guaranteed to be up-to-date as new features enter the product.

All of our examples will be built in the open and we welcome contributions from the community!

How is MLOps different from DevOps?

  • Data/model versioning != code versioning - how to version data sets as the schema and origin data change
  • Digital audit trail requirements change when dealing with code + (potentially customer) data
  • Model reuse is different than software reuse, as models must be tuned based on input data / scenario.
  • To reuse a model you may need to fine-tune / transfer learn on it (meaning you need the training pipeline)
  • Models tend to decay over time & you need the ability to retrain them on demand to ensure they remain useful in a production context.
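To make the first point concrete, here is a minimal sketch of one way to derive a dataset version that changes whenever the schema, the origin, or the rows change. The dataset_version function and the sample values are hypothetical, and hashing the content is only one possible approach, not a description of how Azure ML versions datasets.

```python
import hashlib
import json


def dataset_version(rows, schema, source):
    """Return a short fingerprint covering schema, origin, and row content."""
    hasher = hashlib.sha256()
    # Serialize with sorted keys so logically equal inputs hash identically.
    hasher.update(json.dumps(schema, sort_keys=True).encode("utf-8"))
    hasher.update(source.encode("utf-8"))
    for row in rows:
        hasher.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return hasher.hexdigest()[:12]


# Hypothetical example data.
schema_v1 = {"age": "int", "churned": "bool"}
schema_v2 = {"age": "int", "churned": "bool", "region": "str"}
rows = [{"age": 31, "churned": False}, {"age": 47, "churned": True}]

v1 = dataset_version(rows, schema_v1, "crm-export")
v2 = dataset_version(rows, schema_v2, "crm-export")
```

Unlike a Git commit hash, this fingerprint moves even when no code changes, which is the point of the comparison with code versioning.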

What are the key challenges we wish to solve with MLOps?

Model reproducibility & versioning

  • Track, snapshot & manage assets used to create the model
  • Enable collaboration and sharing of ML pipelines

Model auditability & explainability

  • Maintain asset integrity & persist access control logs
  • Certify model behavior meets regulatory & adversarial standards

Model packaging & validation

  • Support model portability across a variety of platforms
  • Certify model performance meets functional and latency requirements

Model deployment & monitoring

  • Release models with confidence
  • Monitor & know when to retrain by analyzing signals such as data drift
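As a rough illustration of a drift signal, the sketch below compares the mean of a feature in recent data against its training baseline, scaled by the baseline standard deviation. The function names and the threshold of 1.0 are assumptions chosen for the example; production drift monitoring typically uses richer statistics across many features.

```python
import statistics


def drift_score(baseline, current):
    """Shift of the current mean, in units of baseline standard deviations."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.pstdev(baseline)
    current_mean = statistics.fmean(current)
    if base_std == 0:
        # A constant baseline: any change in the mean counts as full drift.
        return 0.0 if current_mean == base_mean else float("inf")
    return abs(current_mean - base_mean) / base_std


def needs_retraining(baseline, current, threshold=1.0):
    """Flag retraining when the drift score exceeds the chosen threshold."""
    return drift_score(baseline, current) > threshold


# Hypothetical feature values seen at training time and in production.
training_values = [1.0, 2.0, 3.0, 4.0, 5.0]
stable_values = [1.5, 2.5, 3.0, 3.5, 4.5]
shifted_values = [6.0, 7.0, 8.0, 9.0, 10.0]
```

A check like this could run on a schedule and trigger the training pipeline when it fires.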

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Related projects

Microsoft AI Labs GitHub: find other best practice projects and Azure AI design patterns in our central repository.

mlops's People

Contributors

abeomor, akshaya-a, blackmist, buchananwp, chris-lauren, datashinobi, dciborow, deeikele, dtzar, graememalcolm, grmadala, jomit, jpe316, lostmygithubaccount, microsoft-github-policy-service[bot], microsoftopensource, msftgits, pgmisc, praneet22, rastala, saachigopal, setuc, shivp950, singankit, sungyonhong, swinner95

mlops's Issues

Customer churn model training error

I'm working through the customer_churn MLOps example.
During the hyperparameter tuning stage, I'm getting the following error:
User program failed with RuntimeError: expand(torch.FloatTensor{[2, 64]}, size=[64]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.13342905044555664 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 115
Traceback (most recent call last):
  File "svdkl_entry.py", line 61, in <module>
    trainer.fit(train_dataloader)
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlops-aml-ws/azureml/hd_88232b3c-5bdb-4f2f-8dad-6edb3674a6ce_3/mounts/workspaceblobstore/azureml/HD_88232b3c-5bdb-4f2f-8dad-6edb3674a6ce_3/trainer.py", line 87, in fit
    output = self.model(data)
  File "/azureml-envs/azureml_8a75b4939559f760114d357d5e253d59/lib/python3.6/site-packages/gpytorch/module.py", line 24, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlops-aml-ws/azureml/hd_88232b3c-5bdb-4f2f-8dad-6edb3674a6ce_3/mounts/workspaceblobstore/azureml/HD_88232b3c-5bdb-4f2f-8dad-6edb3674a6ce_3/svdkl.py", line 145, in forward
    res = self.gp_layer(features)
  File "/azureml-envs/azureml_8a75b4939559f760114d357d5e253d59/lib/python3.6/site-packages/gpytorch/models/approximate_gp.py", line 81, in __call__
    return self.variational_strategy(inputs, prior=prior)
  File "/azureml-envs/azureml_8a75b4939559f760114d357d5e253d59/lib/python3.6/site-packages/gpytorch/variational/_variational_strategy.py", line 108, in __call__
    self.variational_distribution.initialize_variational_distribution(prior_dist)
  File "/azureml-envs/azureml_8a75b4939559f760114d357d5e253d59/lib/python3.6/site-packages/gpytorch/variational/cholesky_variational_distribution.py", line 50, in initialize_variational_distribution
    self.variational_mean.data.copy_(prior_dist.mean)
RuntimeError: expand(torch.FloatTensor{[2, 64]}, size=[64]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

Pipeline doesn't register model trained by the pipeline

It looks to me like these two steps of the pipeline train a bunch of models and save them to outputs/ridge_{alpha}.pkl in train-sklearn.py, but then register an entirely different .pkl file that was already checked in under model-deployment/sklearn_regression_model.pkl. Shouldn't the pipeline register one of the models it created?

- task: AzureCLI@1
  inputs:
    azureSubscription: 'build-demo'
    scriptLocation: 'inlineScript'
    inlineScript: 'az ml run submit-script -c sklearn -e test -d training-env.yml train-sklearn.py'
    workingDirectory: 'model-training'

- task: AzureCLI@1
  inputs:
    azureSubscription: 'build-demo'
    scriptLocation: 'inlineScript'
    inlineScript: 'az ml model register -n mymodel -p sklearn_regression_model.pkl -t model.json'
    workingDirectory: 'model-deployment'

Not sure if this is a TensorFlow issue or Docker issue

I'm getting a strange error on one of my embedding layers when using this with Keras.

restype:container
2019-08-14 21:00:10,145|azureml.core.authentication|DEBUG|Time to expire 604466.854539 seconds
2019-08-14 azureml.history._tracking.PythonWorkingDirectory.workingdir|DEBUG|Calling pyfs
2019-08-14 21:00:29,324|azureml.history._tracking.PythonWorkingDirectory|INFO|Current working dir: /mnt/batch/tasks/....
2019-08-14
2019-08-14 21:00:29,324|azureml.WorkingDirectoryCM|ERROR|<class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>: indices[8,0] = 565 is not in [0, 562)
[[node master_Embedding/GatherV2 (defined at /azureml-envs/azureml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

invalidArgumentError (see above for traceback): indices[8,0] = 565 is not in [0, 562)
[[node broker_master_Embedding/GatherV2 (defined at /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]

Any ideas?
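One reading of the message "indices[8,0] = 565 is not in [0, 562)" is that the embedding layer was built for 562 distinct ids while the input batch contains the id 565. The sketch below is a plain-Python check for that condition; the helper names and the sample batch are hypothetical, and it does not depend on TensorFlow or Docker.

```python
def find_out_of_range(indices, input_dim):
    """Return (row, col, value) for every id outside [0, input_dim)."""
    bad = []
    for row_number, row in enumerate(indices):
        for col_number, value in enumerate(row):
            if not 0 <= value < input_dim:
                bad.append((row_number, col_number, value))
    return bad


def required_input_dim(indices):
    """Smallest embedding vocabulary size that covers every id in the data."""
    return max(max(row) for row in indices) + 1


# Hypothetical batch of integer-encoded category ids, one id per sample.
batch = [[3], [561], [565]]
problems = find_out_of_range(batch, input_dim=562)
needed = required_input_dim(batch)
```

Running a check like this over the training and validation data before fitting would show whether the encoder produces ids beyond the size the embedding layer was created with.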

The driver_log.txt shows:

WARNING - From /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 72626 samples, validate on 4035 samples
Epoch 1/100
2019-08-14 21:00:15.382966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-14 21:00:15.388250: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-14 21:00:15.388560: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55dbbf606c20 executing computations on platform Host. Devices:
2019-08-14 21:00:15.388579: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>

Azure Primers

Hi

Could you let me know what's wrong here, please?

I get the following error at Step 9:
AttributeError: 'NoneType' object has no attribute 'get_metrics'
Thanks

Dataset url broken

"The dataset we will use (located on a public blob [here](https://msdocsdatasets.blob.core.windows.net/pytorchfowl/fowl_data.zip) as a zip file) consists of about 120 training images each for turkeys and chickens, with 100 validation images for each class. The images are a subset of the [Open Images v5 Dataset](https://storage.googleapis.com/openimages/web/index.html). The unzipped files are in provided in the repository. The cell below you can learn how to easily upload your data into a datastore for traceability. "

The URL to the dataset seems to be invalid; it would be useful if it were updated.

Cannot Run deploy to ACI. provided all the required libraries but init() fails in score.py

error: undefined,
stdout: '',
stderr: 'ERROR: {'Azure-cli-ml Version': '1.0.79', 'Error': WebserviceException:\n\tMessage: Service deployment polling reached non-successful terminal state, current service state: Failed\nMore information can be found using '.get_logs()'\nError:\n{\n "code": "AciDeploymentFailed",\n "message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: diabetes-aci-1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image scrmlops1amlcr.azurecr.io/azureml/azureml_7658761db4e29457d2df7ab09e40fa3e locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information.",\n "details": [\n {\n "code": "CrashLoopBackOff",\n "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: diabetes-aci-1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can also try to run image scrmlops1amlcr.azurecr.io/azureml/azureml_7658761db4e29457d2df7ab09e40fa3e locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information."\n }\n ]\n}\n\tInnerException None\n\tErrorResponse \n{\n "error": {\n "message": "Service deployment polling reached non-successful terminal state, current service state: Failed\nMore information can be found using '.get_logs()'\nError:\n{\n \"code\": \"AciDeploymentFailed\",\n \"message\": \"Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.\\nPlease check the logs for your container instance: diabetes-aci-1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. 
\\nYou can also try to run image scrmlops1amlcr.azurecr.io/azureml/azureml_7658761db4e29457d2df7ab09e40fa3e locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information.\",\n \"details\": [\n {\n \"code\": \"CrashLoopBackOff\",\n \"message\": \"Your container application crashed. This may be caused by errors in your scoring file's init() function.\\nPlease check the logs for your container instance: diabetes-aci-1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \\nYou can also try to run image scrmlops1amlcr.azurecr.io/azureml/azureml_7658761db4e29457d2df7ab09e40fa3e locally. Please refer to http://aka.ms/debugimage#service-launch-fails for more information.\"\n }\n ]\n}"\n }\n}}\n' }

  File "/azureml-envs/azureml_3921e717c180127e72729a5bd9ba1fe6/lib/python3.7/pickle.py", line 1426, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.linear_model._ridge'
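An error like this during unpickling usually points to a different scikit-learn version in the scoring environment than the one used for training, since the private module path stored in the pickle has to exist at load time. As a minimal sketch, the package versions recorded at training time can be compared with those installed in the scoring image before the model is loaded. The version numbers and the check_versions helper below are hypothetical.

```python
def check_versions(trained_with, installed):
    """List packages whose installed version differs from the training one."""
    mismatches = []
    for package, wanted in trained_with.items():
        found = installed.get(package)
        if found != wanted:
            mismatches.append(
                f"{package}: trained with {wanted}, found {found}"
            )
    return mismatches


# Hypothetical versions: what was recorded next to the model at training
# time versus what the scoring environment reports.
trained_with = {"scikit-learn": "0.22.1", "numpy": "1.18.1"}
installed = {"scikit-learn": "0.20.3", "numpy": "1.18.1"}

report = check_versions(trained_with, installed)
```

Calling a check like this at the top of init() in score.py would turn the import failure into a readable message in the container logs.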

SAS token expiration on azuremlsdktestpypi.azureedge.net packages

I am attempting to install wheel packages from azuremlsdktestpypi.azureedge.net. The SAS token expired 04/30/2019. Can you update the keys, or make the storage location public? I am trying to run the Explainability example for a client.

https://azuremlsdktestpypi.blob.core.windows.net/repo/AzureML-Contrib-Explain-Model-Gated/3010237/azureml_defaults-0.1.0.3010237-py2.py3-none-any.whl?sv=2017-07-29&sr=b&sig=ZBnDJxbIkYn%2FdGkj3819MFEfXhwVkLaWZaL8kIugbLs%3D&st=2019-04-30T21%3A21%3A18Z&se=2020-04-30T21%3A21%3A18Z&sp=rl

Deep predictive maintenance hyperdrive experiment failing.

Experiment fails with:

Could not locate the provided model_path outputs/network.pth in the set of files uploaded to the run: ['azureml-logs/55_azureml-execution-tvmps_7c64ad713a50bc22098f7505ce5fb74b47976f9280fc094fda12d73f64dee461_d.txt', 'azureml-logs/65_job_prep-tvmps_7c64ad713a50bc22098f7505ce5fb74b47976f9280fc094fda12d73f64dee461_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/129_azureml.log', 'logs/azureml/job_prep_azureml.log']

Where can I access the result of mlflow-enabled training for experiment tracking?

I have run a ScriptRunConfig job that runs an MLflow-enabled train.py script.

After the job finished, I got these two URIs (they are not browsable):

Tracking URI: azureml://eastus2.api.azureml.ms/mlflow/v1.0/subscriptions/number/resourceGroups/resource-group/providers/Microsoft.MachineLearningServices/workspaces/mlops-test?
Artifact URI: azureml://experiments/experiment-name/runs/experiment-name_number/artifacts

How can I access my MLFlow experiments within Azure?

These are all the tabs I have for the said job:

Screenshot from 2023-01-11 09-38-24

not sure if the code is using 4 nodes each having 4 GPUs?

I requested 4 nodes with 4 GPUs each, and created that cluster myself under Compute. When my cluster is in use, it shows only 1 active run but 4 busy nodes. I am using DistributedDataParallel in PyTorch for computer-vision deep learning code.

Could you explain why I have 1 active run with 4 busy nodes instead of 4 active runs? What is wrong, and how can it be fixed?

jobs:
  train:
    type: command
    component: file:train.yaml
    # compute: azureml:gpu-cluster
    compute: azureml:mona-gpu-cluster
    # compute: azureml:single-node-cluster
    resources:
      instance_count: 4    # number of nodes
      # instance_count: 1  # set to 1 for testing purposes
    distribution:
      type: pytorch
      process_count_per_instance: 4 # number of gpus
      # process_count_per_instance: 1 # set to 1 for testing purposes

This is from the pipeline.yaml I am using, which is based on mlops-templates.
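For reference, the arithmetic implied by the configuration above can be sketched as follows. With instance_count: 4 and process_count_per_instance: 4, a single job would start 16 processes spread over 4 nodes, which would be consistent with seeing 1 active run and 4 busy nodes. The function names are hypothetical, and the rank formula assumes the common convention of numbering processes node by node.

```python
def world_size(instance_count, process_count_per_instance):
    """Total number of training processes across all nodes."""
    return instance_count * process_count_per_instance


def global_rank(node_rank, local_rank, process_count_per_instance):
    """Global rank of a process, assuming node-by-node numbering."""
    return node_rank * process_count_per_instance + local_rank


def layout(instance_count, process_count_per_instance):
    """List (node_rank, local_rank, global_rank) for every process."""
    return [
        (node, local, global_rank(node, local, process_count_per_instance))
        for node in range(instance_count)
        for local in range(process_count_per_instance)
    ]


# Values taken from the YAML above: 4 nodes, 4 processes (GPUs) per node.
total = world_size(4, 4)
processes = layout(4, 4)
```

Logging the world size and each process's rank from inside the training script is one way to confirm that all 16 GPUs take part in the run.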

Screenshot from 2023-04-06 17-12-16
