This post was written in collaboration with Bhajandeep Singh and Ajay Vishwakarma from Wipro’s AWS AI/ML Practice. Many organizations have been using a combination of on-premises and open source data science solutions to create and manage machine learning (ML) models. Data science and DevOps teams may face challenges managing these isolated tool stacks and systems. Integrating multiple tool stacks into a cohesive solution might involve building custom connectors or workflows. Managing dependencies across the current version of each stack, and keeping those dependencies up to date as each stack releases new versions, further complicates the solution. This increases the cost of infrastructure maintenance and hampers productivity.
Artificial intelligence (AI) and ML offerings from Amazon Web Services (AWS), along with integrated monitoring and notification services, help organizations achieve the required level of automation, scalability, and model quality at optimal cost. AWS also helps data science and DevOps teams collaborate and streamlines the overall model lifecycle process. The AWS portfolio of ML services includes a robust set of services that you can use to accelerate the development, training, and deployment of ML applications, and that support the complete model lifecycle, including monitoring and retraining. In this post, we discuss model development and MLOps framework implementation for one of Wipro’s customers that uses Amazon SageMaker and other AWS services. Wipro is an AWS Premier Tier Services Partner and Managed Service Provider (MSP). Its AI/ML solutions drive enhanced operational efficiency, productivity, and customer experience for many of its enterprise clients.
Current challenges
Let’s first understand a few of the challenges the customer’s data science and DevOps teams faced with their current setup. We can then examine how the integrated SageMaker AI/ML offerings helped solve those challenges.
Collaboration – Data scientists each worked on their own local Jupyter notebooks to create and train ML models. They lacked an effective method for sharing and collaborating with other data scientists.
Scalability – Training and retraining ML models took longer and longer as models became more complex, while the allocated infrastructure capacity remained static.
MLOps – Model monitoring and ongoing governance weren’t tightly integrated and automated with the ML models. There were dependencies and complexities involved in integrating third-party tools into the MLOps pipeline.
Reusability – Without reusable MLOps frameworks, each model had to be developed and governed separately, which added to the overall effort and delayed model operationalization.
The following diagram summarizes the challenges and shows how Wipro’s implementation on SageMaker addressed them with built-in SageMaker services and offerings.
Figure 1 – SageMaker offerings for ML workload migration
Wipro defined an architecture that addresses the challenges in a cost-optimized and fully automated way. The following are the use case and models used to build the solution:
Use case: Price prediction based on the used car dataset
Problem type: Regression
Models used: XGBoost and Linear Learner (SageMaker built-in algorithms)
Solution architecture
Wipro consultants conducted a deep-dive discovery workshop with the customer’s data science, DevOps, and data engineering teams to understand the current environment as well as their requirements and expectations for a modern solution on AWS. By the end of the consulting engagement, the team had implemented the following architecture that effectively addressed the core requirements of the customer team, including:
Code sharing – SageMaker notebooks enable data scientists to experiment and share code with other team members. Wipro further accelerated the customer’s ML model journey by implementing code accelerators and snippets to expedite feature engineering, model training, model deployment, and pipeline creation.
Continuous integration and continuous delivery (CI/CD) pipeline – Using the customer’s GitHub repository enabled code versioning and automated scripts to launch pipeline deployment whenever new versions of the code are committed.
MLOps – The architecture implements a SageMaker model monitoring pipeline for continuous model quality governance by validating data and model drift as required by the defined schedule. Whenever drift is detected, an event is launched to notify the respective teams to take action or initiate model retraining.
Event-driven architecture – The pipelines for model training, model deployment, and model monitoring are well integrated by using Amazon EventBridge, a serverless event bus. When defined events occur, EventBridge can invoke a pipeline to run in response. This provides a loosely coupled set of pipelines that can run as needed in response to the environment.
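To illustrate this pattern, the following is a minimal sketch of an EventBridge rule that starts a training state machine when new training data lands in Amazon S3. The bucket name, key prefix, state machine ARN, and role ARN are placeholders, not values from the customer environment.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical names -- replace with your own bucket, state machine, and role ARNs.
BUCKET = "ml-training-data-bucket"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:training-pipeline"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions"

# Rule that fires when new objects land under the train/ prefix
# (requires EventBridge notifications to be enabled on the bucket).
events.put_rule(
    Name="trigger-training-on-new-data",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [BUCKET]},
            "object": {"key": [{"prefix": "train/"}]},
        },
    }),
    State="ENABLED",
)

# Point the rule at the training pipeline state machine.
events.put_targets(
    Rule="trigger-training-on-new-data",
    Targets=[{"Id": "training-pipeline", "Arn": STATE_MACHINE_ARN, "RoleArn": EVENTS_ROLE_ARN}],
)
```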
Figure 2 – Event Driven MLOps architecture with SageMaker
Solution components
This section describes the various solution components of the architecture.
Experiment notebooks
Purpose: The customer’s data science team wanted to experiment with various datasets and multiple models to come up with the optimal features, using those as further inputs to the automated pipeline.
Solution: Wipro created SageMaker experiment notebooks with code snippets for each reusable step, such as reading and writing data, model feature engineering, model training, and hyperparameter tuning. Feature engineering tasks can also be prepared in Data Wrangler, but the client specifically asked for SageMaker processing jobs and AWS Step Functions because they were more comfortable using those technologies. We used the AWS Step Functions Data Science SDK to create a state machine for flow testing directly from the notebook instance to enable well-defined inputs for the pipelines. This helped the data science team create and test pipelines at a much faster pace.
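For illustration, the following is a minimal sketch of creating and executing such a state machine from a notebook with the Step Functions Data Science SDK. The role ARNs, S3 locations, workflow name, and hyperparameters are assumptions, not the customer’s actual configuration.

```python
# Minimal sketch of defining, creating, and flow-testing a training workflow with the
# AWS Step Functions Data Science SDK (pip install stepfunctions). Role ARNs,
# S3 locations, workflow name, and hyperparameters are illustrative assumptions.
import sagemaker
from sagemaker.inputs import TrainingInput
from stepfunctions.steps import TrainingStep, Chain
from stepfunctions.workflow import Workflow

session = sagemaker.Session()
bucket = session.default_bucket()
sagemaker_role = "arn:aws:iam::111122223333:role/sagemaker-execution-role"
workflow_role = "arn:aws:iam::111122223333:role/stepfunctions-execution-role"

# SageMaker built-in XGBoost estimator for the used-car price regression model.
xgb = sagemaker.estimator.Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/used-cars/model",
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=200)

# One training step shown here; the real pipeline chains processing, training,
# baseline, and deployment steps in the same way. Job names must be unique per run.
train_step = TrainingStep(
    "Model training",
    estimator=xgb,
    data={"train": TrainingInput(f"s3://{bucket}/used-cars/train", content_type="text/csv")},
    job_name="used-car-price-training",
)

workflow = Workflow(name="used-car-training-pipeline", definition=Chain([train_step]), role=workflow_role)
workflow.create()   # registers the state machine
workflow.execute()  # starts a flow-test run directly from the notebook
```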
Automated training pipeline
Purpose: To enable an automated training and retraining pipeline with configurable parameters such as instance type, hyperparameters, and an Amazon Simple Storage Service (Amazon S3) bucket location. The pipeline should also be launched automatically when new data is pushed to the S3 bucket.
Solution: Wipro implemented a reusable training pipeline using the Step Functions SDK, SageMaker processing and training jobs, the SageMaker Model Monitor container for baseline generation, AWS Lambda, and EventBridge. Using an event-driven architecture, the pipeline is configured to launch automatically when new data is pushed to the mapped S3 bucket. Notifications are configured to be sent to the defined email addresses. At a high level, the training flow looks like the following diagram:
Figure 3 – Training pipeline state machine
Flow description for the automated training pipeline
The preceding diagram shows an automated training pipeline built using Step Functions, Lambda, and SageMaker. It’s a reusable pipeline for setting up automated model training, generating predictions, creating baselines for model and data monitoring, and creating and updating an endpoint based on the previous model’s threshold value.
Pre-processing: This step takes data from an Amazon S3 location as input and uses the SageMaker SKLearn container to perform necessary feature engineering and data pre-processing tasks, such as the train, test, and validate split.
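A minimal sketch of what such a processing job can look like with the SageMaker Python SDK is shown below; the script name, S3 prefixes, role ARN, and instance settings are illustrative, not taken from the customer’s implementation.

```python
# Sketch of the pre-processing step using a SageMaker SKLearn processing job.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::111122223333:role/sagemaker-execution-role",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # feature engineering plus train/test/validate split
    inputs=[ProcessingInput(source="s3://my-bucket/used-cars/raw", destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
)
```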
Model training: Using the SageMaker SDK, this step runs a training job with the respective model image and the datasets produced by the pre-processing step, generating the trained model artifacts.
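As an illustration, the following sketch launches a training job with the built-in Linear Learner image (the XGBoost variant is analogous); the role ARN, S3 paths, and hyperparameters are assumptions.

```python
# Sketch of the model training step with the built-in Linear Learner algorithm.
import sagemaker
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name, version="1")

estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/used-cars/model",
)
estimator.set_hyperparameters(predictor_type="regressor", mini_batch_size=100)

# Train on the channels produced by the pre-processing step.
estimator.fit({
    "train": TrainingInput("s3://my-bucket/used-cars/train", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/used-cars/validation", content_type="text/csv"),
})
```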
Save model: This step creates a model from the trained model artifacts. The model name is stored for reference in another pipeline using the AWS Systems Manager Parameter Store.
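A possible shape of this step with boto3 is sketched below; the model name, role ARN, and artifact path are hypothetical, and the Parameter Store export itself is shown with the export configuration step later in the flow.

```python
# Sketch of the "Save model" step: register the trained artifacts as a SageMaker
# model whose name downstream pipelines can look up later.
import boto3
import sagemaker

region = boto3.Session().region_name
sm = boto3.client("sagemaker")

model_name = "used-car-price-xgboost-v1"  # hypothetical naming convention

sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn="arn:aws:iam::111122223333:role/sagemaker-execution-role",
    PrimaryContainer={
        "Image": sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1"),
        "ModelDataUrl": "s3://my-bucket/used-cars/model/model.tar.gz",
    },
)
```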
Query training results: This step calls the Lambda function to fetch the metrics of the completed training job from the earlier model training step.
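A sketch of the kind of Lambda handler this step can call is shown below; the event keys and return shape are assumptions about how the state machine passes data between steps.

```python
# Sketch of the Lambda function behind "Query training results": it looks up the
# final metrics of the completed training job.
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    job_name = event["TrainingJobName"]
    job = sm.describe_training_job(TrainingJobName=job_name)

    # FinalMetricDataList holds the metrics emitted by the training container,
    # for example validation:rmse for the built-in XGBoost algorithm.
    metrics = {m["MetricName"]: m["Value"] for m in job.get("FinalMetricDataList", [])}
    return {"TrainingJobName": job_name, "Metrics": metrics}
```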
RMSE threshold: This step verifies the trained model metric (RMSE) against a defined threshold to decide whether to proceed towards endpoint deployment or reject this model.
Model accuracy too low: In this step, the model’s accuracy is checked against the previous best model. If the model fails metric validation, a Lambda function sends a notification to the target topic registered in Amazon Simple Notification Service (Amazon SNS), and the flow exits because the newly trained model didn’t meet the defined threshold.
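The threshold check and failure notification from the two preceding steps could be expressed with the Data Science SDK roughly as follows; the JSONPath, threshold value, Lambda function name, and placeholder states are illustrative.

```python
# Sketch of the RMSE threshold check and failure notification as Step Functions states.
from stepfunctions.steps import Choice, Fail, Pass, LambdaStep
from stepfunctions.steps.choice_rule import ChoiceRule

deploy_branch = Pass("Proceed to deployment")  # placeholder for the real deployment chain

# Lambda that publishes the rejection details to the registered SNS topic,
# then the flow exits through a Fail state.
notify_step = LambdaStep(
    "Notify model rejection",
    parameters={"FunctionName": "ml-pipeline-notify", "Payload.$": "$"},
)
notify_step.next(Fail("Model accuracy too low", cause="RMSE above threshold"))

rmse_check = Choice("RMSE threshold")
rmse_check.add_choice(
    rule=ChoiceRule.NumericLessThanEquals(variable="$.Metrics['validation:rmse']", value=3000.0),
    next_step=deploy_branch,
)
rmse_check.default_choice(notify_step)
```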
Baseline job data drift: If the trained model passes the validation steps, baseline statistics are generated for this trained model version to enable monitoring, and the parallel branch steps are run to generate the baseline for the model quality check.
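A minimal sketch of the data-quality baseline generation with SageMaker Model Monitor follows; the dataset locations, role ARN, and instance settings are assumptions.

```python
# Sketch of generating the data-quality baseline for the newly trained model version.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/sagemaker-execution-role",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/used-cars/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/used-cars/monitoring/baseline",
    wait=True,
)
# The generated statistics.json and constraints.json are later referenced by the
# monitoring schedule to detect data drift against this model version.
```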
Create model endpoint configuration: This step creates an endpoint configuration for the model evaluated in the previous step, with data capture enabled.
Check endpoint: This step checks whether the endpoint already exists or needs to be created. Based on the output, the next step creates or updates the endpoint.
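The endpoint configuration and check-endpoint steps above could be sketched with boto3 as follows; the names, capture settings, and S3 locations are illustrative.

```python
# Sketch of creating an endpoint configuration with data capture enabled, then
# creating the endpoint if it does not exist or updating it otherwise.
import boto3
from botocore.exceptions import ClientError

sm = boto3.client("sagemaker")

endpoint_name = "used-car-price-endpoint"
endpoint_config_name = "used-car-price-endpoint-config-v1"

sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "used-car-price-xgboost-v1",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    # Data capture feeds the model monitoring pipeline.
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-bucket/used-cars/data-capture",
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    },
)

try:
    sm.describe_endpoint(EndpointName=endpoint_name)
    # Endpoint already exists: roll it over to the new configuration.
    sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
except ClientError:
    # Endpoint not found: create it.
    sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
```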
Export configuration: This step exports the model name, endpoint name, and endpoint configuration as parameters to AWS Systems Manager Parameter Store. Alerts and notifications are sent to the configured SNS topic email when the state machine succeeds or fails.
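A minimal sketch of the export step follows; the parameter paths and values are hypothetical naming conventions, not the customer’s.

```python
# Sketch of the export step: persist the names produced by this run so that the
# deployment and monitoring pipelines can look them up in Parameter Store.
import boto3

ssm = boto3.client("ssm")

for name, value in {
    "/ml/used-cars/model-name": "used-car-price-xgboost-v1",
    "/ml/used-cars/endpoint-name": "used-car-price-endpoint",
    "/ml/used-cars/endpoint-config-name": "used-car-price-endpoint-config-v1",
}.items():
    ssm.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)
```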
The same pipeline…