Machine Learning using Kubernetes
Arun Gupta, @arungupta
© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Centerpiece for digital transformation
Machine Learning 101
The Amazon ML stack: Broadest & deepest set of capabilities
ML Services
• Vision: Rekognition Image, Rekognition Video, Textract
• Speech: Polly, Transcribe
• Language: Translate, Comprehend
• Chatbots: Lex
• Forecasting: Forecast
• Recommendations: Personalize
Amazon SageMaker
• Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting
ML Frameworks + Infrastructure
• Frameworks, Interfaces, Infrastructure
• EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia
“A little less conversation, a little more action, please”
Analytics
• Amazon Redshift + Redshift Spectrum
• Amazon QuickSight
• Amazon EMR (Hadoop, Spark, Presto, Pig, Hive… 19 total)
• Amazon Athena
• Amazon Kinesis
• Amazon Elasticsearch Service
• AWS Glue
Storage
• Amazon S3 Standard
• Amazon S3 Standard-IA
• Amazon S3 One Zone-IA
• Amazon S3 Intelligent-Tiering (new)
• Amazon Glacier
• Amazon S3 Glacier Deep Archive (new)
• Amazon EBS
Machine learning
• ML services: Rekognition Image, Rekognition Video, Textract, Polly, Transcribe, Translate, Comprehend, Lex, Forecast, Personalize
• Amazon SageMaker: Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting
• Frameworks, interfaces, and infrastructure: EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia
Machine Learning using Kubernetes
ML Frameworks + Infrastructure: Frameworks, Interfaces, Infrastructure (EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Greengrass, Elastic Inference, Inferentia)
Why Machine Learning on Kubernetes?
• Composability
• Portability
• Scalability
On-premises and cloud
Image credit: http://www.shutterstock.com/gallery-635827p1.html
Amazon EKS: run Kubernetes in the cloud
Managed Kubernetes control plane; you attach the data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
Amazon EKS deployment
[Diagram: kubectl talks to the cluster endpoint mycluster.eks.amazonaws.com; the cluster spans Availability Zones 1, 2, and 3]
Getting started with Amazon EKS
eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
Creating an EKS cluster using eksctl
[Command screenshots: install eksctl, create a default cluster, create a GPU-powered cluster]
Defaults of the created cluster:
• Auto-generated cluster name
• 2x m5.large nodes
• Uses the AWS EKS AMI
• us-west-2 region
• Dedicated VPC
• Static AMI resolver
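As a rough sketch, the clusters described on this slide could be requested with `eksctl create cluster`. The Python below only assembles the command lines and does not run them. The helper name `eksctl_create_args` and the GPU instance type `p3.2xlarge` are assumptions for illustration and do not come from the slides; `--nodes`, `--node-type`, `--region`, and `--name` are existing eksctl flags.

```python
import shlex


def eksctl_create_args(node_type="m5.large", nodes=2, region="us-west-2", name=None):
    """Build the argument list for `eksctl create cluster` (hypothetical helper)."""
    args = [
        "eksctl", "create", "cluster",
        "--nodes", str(nodes),
        "--node-type", node_type,
        "--region", region,
    ]
    if name is not None:
        # Without --name, eksctl generates a cluster name on its own.
        args += ["--name", name]
    return args


# Cluster with the values listed on the slide: 2x m5.large in us-west-2.
default_cmd = eksctl_create_args()

# GPU-powered variant; the instance type is an assumed example.
gpu_cmd = eksctl_create_args(node_type="p3.2xlarge")

print(shlex.join(default_cmd))
print(shlex.join(gpu_cmd))
```

Passing the values explicitly keeps the sketch independent of whatever defaults a given eksctl release ships with.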
GPUs for Machine Learning training
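$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}$$
$$C = AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} & a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} & a_{11}b_{13} + a_{12}b_{23} + a_{13}b_{33} \\ a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} & a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32} & a_{21}b_{13} + a_{22}b_{23} + a_{23}b_{33} \\ a_{31}b_{11} + a_{32}b_{21} + a_{33}b_{31} & a_{31}b_{12} + a_{32}b_{22} + a_{33}b_{32} & a_{31}b_{13} + a_{32}b_{23} + a_{33}b_{33} \end{pmatrix}$$
• Training maps to matrix multiplications
• Operations can be parallelized across 1,000s of cores
• Coupled with extremely high memory bandwidth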
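To make the parallelism point concrete, here is a minimal pure-Python sketch of the same 3x3 product. Each entry of C depends only on one row of A and one column of B, so every entry could in principle be computed by a separate core. The function names and sample values are illustrative assumptions.

```python
def cell(a, b, i, j):
    """Compute C[i][j] from row i of A and column j of B only."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    assert all(len(row) == len(b) for row in a), "inner dimensions must match"
    # Every (i, j) pair is independent of the others, which is what
    # lets a GPU spread the work across many cores.
    return [[cell(a, b, i, j) for j in range(len(b[0]))] for i in range(len(a))]


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = matmul(A, B)
print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```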
Set up K8s for ML: option 1
Train and inference in separate clusters
[Diagram: data (1) feeds the training cluster, which produces a trained model for the inference cluster; steps are numbered 1 to 4]
Create K8s cluster for ML: option 1
Create training cluster
Create inference cluster
Scaling the cluster
Set up K8s for ML: option 2a
Train & inference
[Diagram: three nodes labeled role: train and two nodes labeled role: inference; data (1) feeds training, which produces a trained model for inference; steps are numbered 1 to 4]
Create K8s cluster for ML: option 2
eksctl cluster configuration with two node groups
Create cluster
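A two-node-group configuration along these lines might look like the sketch below, which builds the document as a Python dictionary and prints it as JSON. The field names follow the eksctl `ClusterConfig` schema (`eksctl.io/v1alpha5`); the cluster name, node group names, and instance types are assumptions, and the node counts mirror the three `role: train` and two `role: inference` labels in the diagram.

```python
import json

cluster_config = {
    "apiVersion": "eksctl.io/v1alpha5",
    "kind": "ClusterConfig",
    "metadata": {"name": "ml-cluster", "region": "us-west-2"},  # hypothetical name
    "nodeGroups": [
        {
            "name": "train",
            "instanceType": "p3.2xlarge",  # assumed GPU instance type
            "desiredCapacity": 3,
            "labels": {"role": "train"},
        },
        {
            "name": "inference",
            "instanceType": "c5.xlarge",  # assumed CPU instance type
            "desiredCapacity": 2,
            "labels": {"role": "inference"},
        },
    ],
}

print(json.dumps(cluster_config, indent=2))
```

Because JSON is a subset of YAML, the printed document could be saved to a file and passed to `eksctl create cluster -f <file>`; treat that step as an assumption to verify against the eksctl version in use.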
Set up K8s for ML: option 2b
Train, inference, & applications
[Diagram: three nodes labeled role: train, two nodes labeled role: inference, and two nodes labeled role: apps]
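One way workloads can be steered to the labeled nodes is a Kubernetes `nodeSelector`. The sketch below builds a Pod manifest that only schedules onto nodes labeled `role: train` and requests a GPU through the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin. The helper name, pod name, and image are hypothetical.

```python
import json


def training_pod(name, image, gpus=1):
    """Return a Pod manifest pinned to nodes labeled role=train (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only nodes carrying this label are eligible for scheduling.
            "nodeSelector": {"role": "train"},
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Pod name and image are placeholders, not values from the slides.
pod = training_pod("fashion-mnist-train", "example.com/trainer:latest")
print(json.dumps(pod, indent=2))
```

The same pattern with `role: inference` or `role: apps` would keep serving and application pods off the training nodes.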
Challenges in setting up containers for ML
AWS deep learning containers
Key features
• Pre-packaged Docker container images, fully configured and validated
• Customizable container images
• Support for TensorFlow and Apache MXNet
• Single and multi-node training and inference
• Best performance and scalability without tuning
• Works with Amazon EKS, Amazon ECS, and Amazon EC2
16 container images
• Training and inference
• GPU and CPU
• Python 2.7 and Python 3.6
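The count of 16 is consistent with taking every combination of the three axes above for each of the two supported frameworks (2 x 2 x 2 x 2). That reading is an assumption, and the label strings below are illustrative rather than actual image tags.

```python
from itertools import product

frameworks = ["tensorflow", "mxnet"]
jobs = ["training", "inference"]
devices = ["gpu", "cpu"]
pythons = ["py27", "py36"]

# One label per combination of framework, job type, device, and Python version.
variants = ["-".join(combo) for combo in product(frameworks, jobs, devices, pythons)]

print(len(variants))  # 16
print(variants[0])    # tensorflow-training-gpu-py27
```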
ML on K8s: without Kubeflow
Credits: @aronchik
ML on K8s: with Kubeflow
Credits: @aronchik
What’s in Kubeflow?
MNIST database
• Database of grayscale handwritten digits
• Training set of 60k examples
• Test set of 10k examples
• Size-normalized (28x28 pixels)
• Centered in a fixed-size image
Fashion MNIST
• Database of Zalando’s article images
• Each image is labeled with one of 10 item classes
• Drop-in replacement for MNIST
TensorFlow
• Open source library to develop and train ML models
• Created by the Google Brain team
• Can run on desktops, servers, mobile devices, and edge devices
AWS is the platform of choice to run TensorFlow
[…]% of all TensorFlow workloads in the cloud run on AWS
Train twice as fast with TensorFlow
• Stock TensorFlow: 65% scaling efficiency with 256 GPUs
• AWS-optimized TensorFlow: 90% scaling efficiency with 256 GPUs
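As a worked example of what these figures imply, scaling efficiency is commonly defined as the achieved speedup divided by the number of GPUs. Under that assumed definition, the sketch below converts each percentage into an equivalent number of fully used GPUs. The resulting ratio covers throughput at 256 GPUs only; the headline claim may rest on measurements not shown on the slide.

```python
def effective_gpus(gpu_count, scaling_efficiency):
    """Speedup over a single GPU, assuming efficiency = speedup / gpu_count."""
    return gpu_count * scaling_efficiency


stock = effective_gpus(256, 0.65)      # stock TensorFlow
optimized = effective_gpus(256, 0.90)  # AWS-optimized TensorFlow
ratio = optimized / stock

print(f"stock: {stock:.1f}, optimized: {optimized:.1f}, ratio: {ratio:.2f}")
# stock: 166.4, optimized: 230.4, ratio: 1.38
```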
Machine Learning using TensorFlow on K8s
• Read training data
• Build training model
• Feed test data and match the expected output
• Report accuracy, improve with each run

• Download Keras-consumable Fashion-MNIST training and test data
• Run 40 epochs on the model
• Export the model to an S3 bucket
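The steps above can be sketched with the Keras API roughly as follows. This is not the code used in the talk: the layer sizes, the file name `fashion_mnist.h5`, and the bucket argument are assumptions, and TensorFlow and boto3 are imported inside the function so the module itself loads without them.

```python
EPOCHS = 40        # number of epochs stated on the slide
NUM_CLASSES = 10   # Fashion-MNIST has 10 classes
MODEL_FILE = "fashion_mnist.h5"  # assumed local file name


def train_and_export(bucket=None, key=MODEL_FILE):
    """Train a small classifier on Fashion-MNIST and optionally upload it to S3."""
    import tensorflow as tf

    # Download the Keras-consumable training and test data.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Assumed architecture: a single hidden layer is enough for a demo.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=EPOCHS)
    _, accuracy = model.evaluate(x_test, y_test)

    model.save(MODEL_FILE)
    if bucket is not None:
        import boto3
        boto3.client("s3").upload_file(MODEL_FILE, bucket, key)
    return accuracy


# Example call (downloads data and trains; the bucket name is hypothetical):
# train_and_export(bucket="my-model-bucket")
```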
Apache MXNet
Advantages of Kubeflow on AWS
• EKS cluster provisioning with […]
• External traffic with […]
• […] to manage Lustre file system
• Centralized and unified K8s logs in […]
• TLS and Auth with […] and […]
• […] for your K8s API server endpoint
• Detect GPU instance and install […]
kubeflow.org/docs/aws
Distributed training using Horovod
• Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet
• Named after a traditional Russian dance where participants dance in a circle with linked hands
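The circle image matches the ring-allreduce pattern that Horovod builds on: workers pass gradient chunks around a ring until every worker holds the full sum. The slides do not show this; the sketch below is a plain-Python simulation for illustration, with one number per chunk and no real networking.

```python
def ring_allreduce(worker_grads):
    """Simulate ring-allreduce; each of the n workers holds a vector of n chunks."""
    n = len(worker_grads)
    data = [list(g) for g in worker_grads]
    assert all(len(g) == n for g in data), "this sketch uses one chunk per worker"

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        values = [data[i][chunk] for i, chunk in sends]  # snapshot before updating
        for (i, chunk), value in zip(sends, values):
            data[(i + 1) % n][chunk] += value

    # Phase 2, allgather: completed chunks travel around the ring,
    # overwriting the partial sums on each neighbor.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        values = [data[i][chunk] for i, chunk in sends]
        for (i, chunk), value in zip(sends, values):
            data[(i + 1) % n][chunk] = value

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
result = ring_allreduce(grads)
print(result)  # every worker ends with [111, 222, 333]
```

Each worker only ever talks to its neighbor, so the data sent per worker stays roughly constant as more workers join the ring.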
Machine Learning pipeline
• Collect & prepare training data
• Choose and optimize your ML algorithm
• Set up and manage environments for training
• Train and tune model (trial and error)
• Deploy model in production
• Scale & manage environment in production
Machine Learning pipeline for K8s
• Collect & prepare training data: EMR, Redshift, S3
• Choose and optimize your ML algorithm: linear regression, decision tree, BYOA
• Set up and manage environments for training: GPU- and CPU-based clusters, *operators (TensorFlow, MXNet, …)
• Train and tune model (trial and error): TensorFlow, MXNet, PyTorch, Keras, …
• Deploy model in production: TensorFlow Serving, MXNet Model Server, Seldon, …
• Scale & manage environment in production: EKS
Machine Learning pipeline using SageMaker
• Collect & prepare training data: prebuilt notebooks for common problems
• Choose and optimize your ML algorithm: built-in high-performance algorithms
• Set up and manage environments for training: one-click training
• Train and tune model (trial and error): optimization
• Deploy model in production: one-click deployment
• Scale & manage environment in production: fully managed, auto-scaling, health and security checks