# Deploy cnvrg CORE on AWS with native support for Habana Gaudi
In this tutorial, you will learn how to install cnvrg.io on your private Amazon EKS cluster with native support for Habana Gaudi devices. We will walk through setting up the cluster using DL1 EC2 instances and then installing cnvrg.io using the cnvrg.io Operator and Helm.
# Requirements
Before you can complete the installation, you must install and prepare the following dependencies on your local machine (each is used in the steps below):
- AWS CLI, configured with credentials for your account
- eksctl
- kubectl
- Helm
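As a quick sanity check, you can confirm that the required CLIs are on your `PATH` before starting (a minimal sketch; it only reports presence, not versions):

```bash
# Report which of the required CLI tools are installed.
for tool in aws eksctl kubectl helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```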
# Prepare an EKS cluster
We will use the eksctl CLI to deploy an EKS cluster containing two node groups: the first for the cnvrg control plane, and the second for Habana Gaudi workloads. Before deploying Habana Gaudi workloads, we need to choose the correct AWS AMI based on the region and Kubernetes version, and find an availability zone in which the machine type is available.
us-west-2:

| Kubernetes Version | AMI ID |
| --- | --- |
| 1.18 | ami-012dcf16d655c757f |
| 1.19 | ami-0f093da0acaaf4a92 |
| 1.20 | ami-095898da639cbbbbe |
| 1.21 | ami-0a6edcea4417c9e9c |
us-east-1:

| Kubernetes Version | AMI ID |
| --- | --- |
| 1.18 | ami-077ca8fa36be9b6d3 |
| 1.19 | ami-0786b638b54209f05 |
| 1.20 | ami-0feac853dfbaaaefc |
| 1.21 | ami-0e30d8dea91e2d389 |
Now, let's find the availability zone in which the machine type is available:
```bash
aws ec2 describe-instance-type-offerings \
  --filters Name=instance-type,Values=dl1.24xlarge \
  --location-type availability-zone --region us-east-1
```
Create a cluster configuration file named `eks-cluster.yaml` with the following content:
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: habana-eks-cluster
  region: us-east-1
  version: "1.21"
availabilityZones: ['us-east-1a','us-east-1b']
nodeGroups:
  - name: cnvrg-control-plan-01
    instanceType: m5a.2xlarge # minimum
    volumeSize: 100
    minSize: 2
    maxSize: 3
    desiredCapacity: 2
    privateNetworking: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonS3FullAccess # you can create a policy specific to the bucket you created
      withAddonPolicies:
        autoScaler: true
        imageBuilder: true
    tags:
      k8s.io/cluster-autoscaler/enabled: 'true'
    availabilityZones: ['us-east-1a']
  - name: dl1-ng-1d2
    instanceType: dl1.24xlarge
    instancePrefix: dl1-ng-1d-worker
    volumeSize: 200
    minSize: 0
    maxSize: 4
    desiredCapacity: 1
    privateNetworking: true
    ami: ami-0c385d0d99fce057d
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonS3FullAccess # you can create a policy specific to the bucket you created
    tags:
      k8s.io/cluster-autoscaler/enabled: 'true'
      k8s.io/cluster-autoscaler/node-template/resources/habana.ai/gaudi: "8"
      k8s.io/cluster-autoscaler/node-template/resources/hugepages-2Mi: "30000Mi"
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh habana-eks-cluster
    availabilityZones: ['us-east-1d']
```
Use eksctl to create the cluster using the cluster config file:

```bash
eksctl create cluster -f eks-cluster.yaml
```
Once the EKS cluster is ready, use Helm to deploy an automatic node autoscaler. First, add the Helm repository of the autoscaler:

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
```
Create the chart deployment using the `helm install` command. Make sure to change:
- `autoDiscovery.clusterName`: the cluster name as defined in the ClusterConfig
- `image.tag`: must match your Kubernetes version

```bash
helm install cluster-autoscaler autoscaler/cluster-autoscaler -n kube-system \
  --set autoDiscovery.clusterName=habana-eks-cluster \
  --set awsRegion=us-east-1 \
  --set image.tag=v1.21.0 \
  --set replicaCount=1 \
  --set extraArgs.skip-nodes-with-system-pods=false,extraArgs.skip-nodes-with-local-storage=false,extraArgs.cloud-provider=aws
```
Verify the readiness of the autoscaler using kubectl:

```bash
kubectl get pods -n kube-system | grep autoscaler
```
# Install cnvrg
# Install and Update the Helm repo
Run the following commands to download the latest cnvrg Helm charts:

```bash
helm repo add cnvrg https://charts.cnvrg.io
helm repo update
```
# Deploy cnvrg
The following is the minimum Helm command with the parameters required by cnvrg to run on and utilize Habana Gaudi devices. It will install cnvrg with an Istio ingress controller and S3 as the object storage backend.
Before deploying cnvrg you will need:
- A wildcard DNS record for cnvrg
- The IP address of your cluster
- The details of the S3 bucket
- An EKS kubeconfig file

Optional:
- cnvrg premium user credentials for deploying cnvrg premium
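To point your wildcard DNS record at the cluster, you can look up the external address of the ingress gateway once it exists. This is a sketch: the `istio-system` namespace and `istio-ingressgateway` service name are assumptions and may differ in your installation.

```bash
# Print the external hostname (or IP) of the Istio ingress gateway.
# On AWS this is typically an ELB hostname; create a CNAME for
# *.YOUR-DOMAIN pointing at it.
kubectl get svc istio-ingressgateway -n istio-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```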
```bash
helm install cnvrg cnvrg/cnvrg --timeout 1500s --wait --namespace=cnvrg --create-namespace \
  --set clusterDomain=YOUR-DOMAIN \
  --set registry.user=cnvrghelm \
  --set controlPlane.image=cnvrg/core:3.10.1 \
  --set controlPlane.baseConfig.featureFlags.HABANA_ENABLED="true" \
  --set gpu.habanaDp.enabled=true \
  --set monitoring.habanaExporter.enabled=true \
  --set controlPlane.objectStorage.bucket=YOUR-S3-BUCKET \
  --set controlPlane.objectStorage.accessKey=YOUR-S3-BUCKET-ACCESSKEY \
  --set controlPlane.objectStorage.secretKey=YOUR-S3-BUCKET-SECRETKEY \
  --set controlPlane.objectStorage.region=YOUR-S3-BUCKET-REGION
```
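If you prefer to keep these settings in version control, the same flags can be expressed as a Helm values file and passed with `-f`. This is a sketch mirroring the command above; the placeholders are yours to fill in:

```yaml
# values.yaml -- equivalent to the --set flags above
clusterDomain: YOUR-DOMAIN
registry:
  user: cnvrghelm
controlPlane:
  image: cnvrg/core:3.10.1
  baseConfig:
    featureFlags:
      HABANA_ENABLED: "true"
  objectStorage:
    bucket: YOUR-S3-BUCKET
    accessKey: YOUR-S3-BUCKET-ACCESSKEY
    secretKey: YOUR-S3-BUCKET-SECRETKEY
    region: YOUR-S3-BUCKET-REGION
gpu:
  habanaDp:
    enabled: true
monitoring:
  habanaExporter:
    enabled: true
```

You would then run `helm install cnvrg cnvrg/cnvrg --timeout 1500s --wait --namespace=cnvrg --create-namespace -f values.yaml`.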
# Advanced Helm Options
There are numerous ways to customize the installation to best fit your own infrastructure and requirements, including disk sizes, memory settings, versions, and so on. For the full list of customizable flags, see the cnvrg Helm chart documentation.
# Completing the Setup
The helm install command can take up to 10 minutes. When the deployment completes, you can go to the URL of your newly deployed cnvrg or add the new cluster as a resource inside your organization. The helm command will inform you of the correct URL:

```
🚀 Thank you for installing cnvrg.io!

Your installation of cnvrg.io is now available, and can be reached via:
http://app.mydomain.com

Talk to our team via email at hi@cnvrg.io
```
# Monitoring the deployment
You can monitor and validate the deployment process by running the following command:

```bash
kubectl -n cnvrg get pods
```
When the status of all the containers is `Running` or `Completed`, cnvrg will have been successfully deployed. It should look similar to the below output example:

```
NAME                                    READY   STATUS      RESTARTS   AGE
cnvrg-app-69fbb9df98-6xrgf              1/1     Running     0          2m
cnvrg-sidekiq-b9d54d889-5x4fc           1/1     Running     0          2m
controller-65895b47d4-s96v6             1/1     Running     0          2m
init-app-vs-config-wv9c4                0/1     Completed   0          9m
init-gateway-vs-config-2zbpp            0/1     Completed   0          9m
init-minio-vs-config-cd2rg              0/1     Completed   0          9m
istio-citadel-c58d68844-bcwv7           1/1     Running     0          2m
istio-galley-67dfcd65c5-vb2jf           1/1     Running     0          2m
istio-ingressgateway-6d48767f5b-mw4q8   1/1     Running     0          2m
istio-pilot-7bb78bbfb9-dpq6q            2/2     Running     0          2m
minio-0                                 1/1     Running     0          2m
postgres-0                              1/1     Running     0          2m
redis-695c49c986-kcbt9                  1/1     Running     0          2m
seeder-wh655                            0/1     Completed   0          2m
speaker-5sghr                           1/1     Running     0          2m
```
NOTE: The exact list of pods may be different, as it depends on the flags that you used with the `helm install` command. As long as the statuses are `Running` or `Completed`, the deployment will have been successful.
# Monitoring your cluster using Kibana and Grafana
Now that cnvrg has been deployed, you can access the Kibana and Grafana dashboards of your cluster. They are great tools for monitoring the health of your cluster and analyzing its logs.

To access Kibana, go to: `kibana.<your_domain>.com`

To access Grafana, go to: `grafana.<your_domain>.com`
# Delete cnvrg CORE
If you would like to delete the cnvrg deployment using Helm, run the following command:

```bash
helm uninstall cnvrg -n cnvrg
```
# Upgrade a cnvrg Installation
If you would like to upgrade an existing Helm installation, run the following command with the other settings as required for your install:

```bash
helm upgrade cnvrg cnvrg/cnvrg --reuse-values \
  --set cnvrgApp.image=<image>
```