Machine Learning model training with AKS

Solution Idea

If you'd like to see us expand this article with more information (implementation details, pricing guidance, code examples, etc), let us know with GitHub Feedback!

Training of models using large datasets is a complex and resource intensive task. Use familiar tools such as TensorFlow and Kubeflow to simplify training of Machine Learning models. Your ML models will run in AKS clusters backed by GPU enabled VMs.

Architecture

Architecture diagram Download an SVG of this architecture.

Data Flow

  1. Package ML model into a container and publish to ACR
  2. Azure Blob storage hosts training data sets and trained model
  3. Use Kubeflow to deploy training job to AKS, distributed training job to AKS includes Parameter servers and Worker nodes
  4. Serve production model using Kubeflow, promoting a consistent environment across test, control and production
  5. AKS supports GPU enabled VM
  6. Developer can build features querying the model running in AKS cluster