Merged
41 commits
0abdb5d
getting started
Aug 12, 2019
a460ab0
removed garbage
Aug 12, 2019
033dbbf
getting started update
Aug 13, 2019
2afe237
update IaC pipelines
Aug 13, 2019
5a1e994
document update progress
Aug 13, 2019
3459d7f
document update progress
Aug 13, 2019
67fd322
update document progress
Aug 13, 2019
d75895c
adjust image size
Aug 13, 2019
03b94ca
document update progress
Aug 13, 2019
9c43038
azure-cli to requirements
Aug 13, 2019
5f300f4
document update progress
Aug 13, 2019
e9fb924
document update progress
Aug 13, 2019
4eca7b1
document update progress
Aug 14, 2019
3c129b8
Update getting_started.md
eedorenko Aug 14, 2019
5c7ebe4
Update getting_started.md
eedorenko Aug 14, 2019
8fc0ba1
Update getting_started.md
eedorenko Aug 14, 2019
5a05f00
Update getting_started.md
eedorenko Aug 14, 2019
8b5b76a
Update getting_started.md
eedorenko Aug 14, 2019
a50b0d4
Update getting_started.md
eedorenko Aug 14, 2019
bd2a43b
Update getting_started.md
eedorenko Aug 14, 2019
f930580
Update getting_started.md
eedorenko Aug 14, 2019
e433ca8
Update getting_started.md
eedorenko Aug 14, 2019
a4f44bb
Update getting_started.md
eedorenko Aug 14, 2019
d4b343e
readme update
Aug 14, 2019
62377c5
Merge branch 'eedorenko/documentation-update' of https://github.com/m…
Aug 14, 2019
d421847
Update README.md
eedorenko Aug 14, 2019
5897a31
readme update
Aug 14, 2019
569b4ed
Merge branch 'eedorenko/documentation-update' of https://github.com/m…
Aug 14, 2019
3f6f056
update document progress
Aug 14, 2019
2b122c5
update documentation progress
Aug 15, 2019
4d34ace
azure-cli library update
Aug 15, 2019
71208d6
docker image update
Aug 15, 2019
9736e0a
docker image update
Aug 15, 2019
7b7e382
liniting
Aug 15, 2019
5294441
Update getting_started.md
eedorenko Aug 16, 2019
73db451
image update
Aug 16, 2019
0c3688d
Merge branch 'eedorenko/documentation-update' of https://github.com/m…
Aug 16, 2019
c8c2824
minor add and typo fix
dtzar Aug 16, 2019
48718f3
duplicate file paths and `code` snippet for file paths
eedorenko Aug 16, 2019
2d4bdee
Model Deploy tasks parameters in tables
eedorenko Aug 16, 2019
c0860e5
tables with task parameters uopdate
eedorenko Aug 16, 2019
1 change: 0 additions & 1 deletion .gitignore
@@ -104,5 +104,4 @@ venv.bak/
# mypy
.mypy_cache/

aml_config/config.json
.DS_Store
4 changes: 1 addition & 3 deletions .pipelines/azdo-ci-build-train.yml
@@ -7,9 +7,7 @@ trigger:
pool:
vmImage: 'ubuntu-latest'

container:
image: mlopscr.azurecr.io/public/mlops/mlopspython:latest
endpoint: acrconnection
container: mcr.microsoft.com/mlops/python:latest


variables:
4 changes: 1 addition & 3 deletions .pipelines/azdo-pr-build-train.yml
@@ -7,9 +7,7 @@ pr:
pool:
vmImage: 'ubuntu-latest'

container:
image: mlopscr.azurecr.io/public/mlops/mlopspython:latest
endpoint: acrconnection
container: mcr.microsoft.com/mlops/python:latest


variables:
22 changes: 4 additions & 18 deletions README.md
@@ -3,7 +3,6 @@

[![Build Status](https://dev.azure.com/customai/DevopsForAI-AML/_apis/build/status/Microsoft.MLOpsPython?branchName=master)](https://dev.azure.com/customai/DevopsForAI-AML/_build/latest?definitionId=25&branchName=master)

### Author: Praneet Solanki | Richin Jain

MLOps will help you understand how to build a Continuous Integration and Continuous Delivery pipeline for an ML/AI project. We will be using Azure DevOps Projects for the build and release/deployment pipelines, along with Azure ML services for the model retraining pipeline, model management and operationalization.

@@ -25,20 +24,15 @@ To deploy this solution in your subscription, follow the manual instructions in

This reference architecture shows how to implement continuous integration (CI), continuous delivery (CD), and a retraining pipeline for an AI application using Azure DevOps and Azure Machine Learning. The solution is built on the scikit-learn diabetes dataset but can be easily adapted for any AI scenario and other popular build systems such as Jenkins and Travis.

![Architecture](/docs/images/Architecture_DevOps_AI.png)
![Architecture](/docs/images/main-flow.png)


## Architecture Flow

### Train Model
1. Data Scientist writes/updates the code and push it to git repo. This triggers the Azure DevOps build pipeline (continuous integration).
2. Once the Azure DevOps build pipeline is triggered, it runs following types of tasks:
- Run for new code: Every time new code is committed to the repo, the build pipeline performs data sanity tests and unit tests on the new code.
- One-time run: These tasks runs only for the first time the build pipeline runs. It will programatically create an [Azure ML Service Workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace), provision [Azure ML Compute](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute) (used for model training compute), and publish an [Azure ML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines). This published Azure ML pipeline is the model training/retraining pipeline.

> Note: The Publish Azure ML pipeline task currently runs for every code change

3. The Azure ML Retraining pipeline is triggered once the Azure DevOps build pipeline completes. All the tasks in this pipeline runs on Azure ML Compute created earlier. Following are the tasks in this pipeline:
2. Once the Azure DevOps build pipeline is triggered, it performs code quality checks, data sanity tests and unit tests, builds an [Azure ML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) and publishes it in an [Azure ML Service Workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace).
3. The [Azure ML Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) is triggered once the Azure DevOps build pipeline completes. All the tasks in this pipeline run on Azure ML Compute. The pipeline runs the following tasks:

- **Train Model** task executes model training script on Azure ML Compute. It outputs a [model](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#model) file which is stored in the [run history](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#run).
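
The **Train Model** step can be sketched locally. The following is a minimal, hypothetical example using the scikit-learn diabetes dataset the solution is built on; the repo's actual `code/training/train.py` differs (it also logs metrics and the model file to the Azure ML run history):

```python
# Minimal local sketch of the training step (hypothetical; the real
# train.py uploads the model and metrics to the Azure ML run history).
import joblib
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train_model(alpha=0.5):
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    # Persist the model artifact; in the pipeline this file is what gets
    # registered in the model registry.
    joblib.dump(model, "sklearn_regression_model.pkl")
    return model, mse

if __name__ == "__main__":
    model, mse = train_model()
    print(f"mse: {mse:.2f}")
```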

@@ -50,16 +44,8 @@ This reference architecture shows how to implement continuous integration (CI),

Once you have registered your ML model, you can use Azure ML + Azure DevOps to deploy it.

The **Package Model** task packages the new model along with the scoring file and its python dependencies into a [docker image](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#image) and pushes it to [Azure Container Registry](https://docs.microsoft.com/en-us/azure/container-registry/container-registry-intro). This image is used to deploy the model as [web service](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#web-service).

The **Deploy Model** task handles deploying your Azure ML model to the cloud (ACI or AKS).
This pipeline deploys the model scoring image into Staging/QA and PROD environments.

In the Staging/QA environment, one task creates an [Azure Container Instance](https://docs.microsoft.com/en-us/azure/container-instances/container-instances-overview) and deploys the scoring image as a [web service](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#web-service) on it.

The second task invokes the web service by calling its REST endpoint with dummy data.
[Azure DevOps release pipeline](https://docs.microsoft.com/en-us/azure/devops/pipelines/release/?view=azure-devops) packages the new model along with the scoring file and its Python dependencies into a [docker image](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#image) and pushes it to [Azure Container Registry](https://docs.microsoft.com/en-us/azure/container-registry/container-registry-intro). This image is used to deploy the model as a [web service](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#web-service) across the QA and Prod environments. The QA environment runs on top of [Azure Container Instances (ACI)](https://azure.microsoft.com/en-us/services/container-instances/) and the Prod environment is built with [Azure Kubernetes Service (AKS)](https://docs.microsoft.com/en-us/azure/aks/intro-kubernetes).

5. The deployment in production is a [gated release](https://docs.microsoft.com/en-us/azure/devops/pipelines/release/approvals/gates?view=azure-devops). This means that once the model web service deployment in the Staging/QA environment is successful, a notification is sent to approvers to manually review and approve the release. Once the release is approved, the model scoring web service is deployed to [Azure Kubernetes Service (AKS)](https://docs.microsoft.com/en-us/azure/aks/intro-kubernetes) and the deployment is tested.
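
The smoke test against a deployed web service boils down to a POST with dummy data to the scoring endpoint. A hedged sketch of building such a request follows; the endpoint URL and the `{"data": [...]}` payload shape are assumptions and must match what the deployed scoring script expects:

```python
import json

def build_smoke_request(scoring_uri, rows, api_key=None):
    """Build the pieces of an HTTP request to a model scoring endpoint.

    The {"data": [...]} payload shape is an assumption; adjust it to
    whatever your scoring script's run() function parses."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # AKS deployments are typically key-protected
        headers["Authorization"] = "Bearer " + api_key
    body = json.dumps({"data": rows})
    return scoring_uri, headers, body

# Usage (hypothetical URI); send with e.g. requests.post(uri, data=body, headers=headers)
uri, headers, body = build_smoke_request(
    "http://example-aci.azurecontainer.io/score", [[0.0] * 10])
```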

### Repo Details

64 changes: 21 additions & 43 deletions docs/code_description.md
@@ -2,59 +2,37 @@

### Environment Setup

- requirements.txt : It consist of list of python packages which are needed by the train.py to run successfully on host agent (locally).
- `environment_setup/requirements.txt` : It consists of the list of Python packages needed by train.py to run successfully on the host agent (locally).

- install_requirements.sh : This script prepare the python environment i.e. install the Azure ML SDK and the packages specified in requirements.txt
- `environment_setup/install_requirements.sh` : This script prepares the Python environment, i.e. installs the Azure ML SDK and the packages specified in requirements.txt.

### Config Files
All the scripts inside the ./aml_config are config files. These are the files where you need to provide details about the subscription, resource group, workspace, conda dependencies, remote vm, AKS etc.
- `environment_setup/iac-*.yml, arm-templates` : Infrastructure as Code pipelines to create and delete required resources, along with the corresponding ARM templates.

- config.json : This is a mandatory config file. Provide the subscription id, resource group name, workspace name and location where you want to create Azure ML services workspace. If you have already created the workspace, provide the existing workspace details in here.
- `environment_setup/Dockerfile` : Dockerfile for the build agent image, containing Python 3.6 and all required packages.

- conda_dependencies.yml : This is a mandatory file. This files contains the list of dependencies which are needed by the training/scoring script to run. This file is used to prepare environment for the local run(user managed/system managed) and docker run(local/remote).
- `environment_setup/docker-image-pipeline.yml` : An Azure DevOps pipeline that builds and pushes the [microsoft/mlopspython](https://hub.docker.com/_/microsoft-mlops-python) image.

- security_config.json : This file contains the credentials to the remove vm where we want to train the model. This config is used by the script 02-AttachTrainingVM.py to attach remote vm as a compute to the workspace. Attaching remote vm to workspace is one time operation. It is recommended not to publish this file with credentials populated in it. You can put the credentials, run the 02-AttachTrainingVM.py manually and clear the credentials before pushing it to git.
### Pipelines

- aks_webservice.json : This is an optional config. If you already have an AKS attached to your workspace, then provide the details in this file. If not, you do not have to check in this file to git.
- `.pipelines/azdo-base-pipeline.yml` : a pipeline template used by both the ci-build-train and pr-build-train pipelines. It contains steps performing linting, data and unit testing.
- `.pipelines/azdo-ci-build-train.yml` : a pipeline triggered when the code is merged into **master**. It performs linting, data integrity testing, unit testing, and builds and publishes an ML pipeline.
- `.pipelines/azdo-pr-build-train.yml` : a pipeline triggered when a **pull request** to the **master** branch is created. It performs linting, data integrity testing and unit testing only.

### Build Pipeline Scripts
### ML Services

The script under ./aml_service are used in build pipeline. All the scripts starting with 0 are the one time run scripts. These are the scripts which need to be run only once. There is no harm of running these scripts every time in build pipeline.
- `ml_service/pipelines/build_train_pipeline.py` : builds and publishes an ML training pipeline.
- `ml_service/pipelines/run_train_pipeline.py` : invokes a published ML training pipeline via REST API.
- `ml_service/util` : contains common utility functions used to build and publish an ML training pipeline.
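
Invoking a published Azure ML pipeline via the REST API, as `run_train_pipeline.py` does, amounts to an authenticated POST to the pipeline's REST endpoint. A hedged sketch of the request construction follows; the endpoint URL and the AAD token acquisition (e.g. via a service principal) are assumed to be handled elsewhere:

```python
import json

def build_pipeline_trigger(rest_endpoint, aad_token, experiment_name):
    """Build the POST request that triggers a published Azure ML pipeline.

    The {"ExperimentName": ...} body follows the published-pipeline REST
    API convention; send with e.g. requests.post(endpoint, headers=headers,
    data=body)."""
    headers = {
        "Authorization": "Bearer " + aad_token,
        "Content-Type": "application/json",
    }
    body = json.dumps({"ExperimentName": experiment_name})
    return rest_endpoint, headers, body
```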

- 00-WorkSpace.py : This is a onetime run script. It reads the workspace details from ./aml_config/config.json file and create (if workspace not available) or get (existing workspace).
### Code

- 01-Experiment.py : This is a onetime run script. It registers the root directory as project. It is not included as a step in build pipeline.
- `code/training/train.py` : a training step of an ML training pipeline.
- `code/evaluate/evaluate_model.py` : an evaluating step of an ML training pipeline.
- `code/evaluate/register_model.py` : registers a newly trained model if evaluation shows it performs better than the previous one.
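
The evaluate-then-register decision described above reduces to comparing one evaluation metric of the new model against the production model's. A minimal sketch, assuming a single metric such as MSE (the metric name and direction are assumptions, not the repo's exact logic):

```python
def should_register(new_metric, prod_metric, smaller_is_better=True):
    """Decide whether a newly trained model should replace the current
    production model, based on a single evaluation metric (e.g. MSE)."""
    if prod_metric is None:  # no model registered yet: always register
        return True
    if smaller_is_better:
        return new_metric < prod_metric
    return new_metric > prod_metric
```

For example, with MSE (lower is better), `should_register(3.2, 3.5)` returns `True` and the new model is registered; `should_register(4.0, 3.5)` returns `False` and the rest of the pipeline is skipped.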

- 02-AttachTrainingVM.py : This is a onetime run script. It attaches a remote VM to the workspace. It reads the config from ./aml_config/security_config.json. It is not included as a step in build pipeline.
### Scoring
- `code/scoring/score.py` : a scoring script which is packaged into a Docker image along with the model when deployed to the QA/Prod environments.
- `code/scoring/conda_dependencies.yml` : contains the list of dependencies required by score.py, installed in the deployable Docker image.
- `code/scoring/inference_config.yml`, `deployment_config_aci.yml`, `deployment_config_aks.yml` : configuration files for the [AML Model Deploy](https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.private-vss-services-azureml&ssr=false#overview) pipeline task for ACI and AKS deployment targets.
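
An Azure ML scoring script conventionally exposes an `init()` function (load the model once) and a `run()` function (score each request). The sketch below is hypothetical: instead of loading the registered model from the path Azure ML mounts, it fits a throwaway model on the diabetes dataset so the shape of the script can be run locally; the `{"data": [...]}` payload format is also an assumption:

```python
import json
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

model = None

def init():
    """In a real score.py this would load the registered model file
    (e.g. with joblib). Here we fit a throwaway model so the sketch
    runs without an Azure ML workspace."""
    global model
    X, y = load_diabetes(return_X_y=True)
    model = Ridge().fit(X, y)

def run(raw_data):
    """Score a JSON payload of shape {"data": [[...feature rows...]]}."""
    try:
        data = json.loads(raw_data)["data"]
        return model.predict(data).tolist()
    except Exception as exc:
        return {"error": str(exc)}

init()
preds = run(json.dumps({"data": [[0.0] * 10]}))  # one row of 10 features
```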

- 10-TrainOnLocal.py : This scripts triggers the run of ./training/train.py script on the local compute(Host agent in case of build pipeline). If you are training on remote vm, you do not need this script in build pipeline. All the training scripts (1x) generates an output file aml_config/run_id.json which records the run_id and run history name of the training run. run_id.json is used by 20-RegisterModel.py to get the trained model.

- 11-TrainOnLocalEnv.py : Its functionality is same as 10-TrainOnLocal.py, the only difference is that it creates a virtual environment on local compute and run training script on virtual env.

- 12-TrainOnVM.py : As we want to train the model on remote VM, this script is included as a task in build pipeline. It submits the training job on remote vm.

- 15.EvaluateModel.py : It gets the metrics of latest model trained and compares it with the model in production. If the production model still performs better, all below scripts are skipped.

- 20-RegisterModel.py : It gets the run id from training steps output json and registers the model associated with that run along with tags. This scripts outputs a model.json file which contains model name and version. This script included as build task.

- 30-CreateScoringImage.py : This takes the model details from last step, creates a scoring webservice docker image and publish the image to ACR. This script included as build task. It writes the image name and version to image.json file.

### Deployment/Release Scripts
File under the directory ./aml_service starting with 5x and 6x are used in release pipeline. They are basically to deploy the docker image on AKS and ACI and publish webservice on them.

- 50-deployOnAci.py : This script reads the image.json which is published as an artifact from build pipeline, create aci cluster and deploy the scoring web service on it. It writes the scoring service details to aci_webservice.json

- 51-deployOnAks.py : This script reads the image.json which is published as an artifact from build pipeline, create aks cluster and deploy the scoring web service on it. If the aks_webservice.json file was checked in with existing aks details, it will update the existing webservice with new Image. It writes the scoring service details to aks_webservice.json

- 60-AciWebServiceTest.py : Reads the ACI info from aci_webservice.json and test it with sample data.

- 61-AksWebServiceTest.py : Reads the AKS info from aks_webservice.json and test it with sample data.

### Training/Scoring Scripts

- /code/training/train.py : This is the model training code. It uploads the model file to AML Service run id once the training is successful. This script is submitted as run job by all the 1x scripts.

- /code/scoring/score.py : This is the score file used to create the webservice docker image. There is a conda_dependencies.yml in this directory which is exactly same as the one in aml_config. These two files are needed by the 30-CreateScoringImage.py scripts to be in same root directory while creating the image.

**Note: In CICD Pipeline, please make sure that the working directory is the root directory of the repo.**
