# Compute

As a data scientist, you need to tap large amounts of compute power for extended periods of time, reliably and efficiently.

cnvrg provides powerful features for managing compute, whether on a Kubernetes cluster or an on-premise machine.

cnvrg provides several default compute templates (predefined sets of compute resources). You can modify these and add new compute templates with minimal effort.

In most instances, you configure the computes you'll need once, in the form of compute templates, at system setup. These become available across your entire organization.

Then you select one or more computes when creating an experiment, workspace, flow, app or serving endpoint.

You can do this through the cnvrg web UI, or by using a cnvrg CLI command or cnvrg SDK call.

When running the job, cnvrg attempts to use the computes in the order in which you attach them, based on available resources. If the first compute is unavailable, cnvrg attempts to use the second compute, and so on.
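
The fallback order above can be sketched in a few lines of Python. This is purely illustrative (not cnvrg source code, and the function name is hypothetical): try each attached compute in order, and queue the job if none is available.

```python
# Illustrative sketch of cnvrg's compute fallback order (hypothetical names).

def pick_compute(computes, is_available):
    """Return the first available compute, or None when the job must queue."""
    for compute in computes:
        if is_available(compute):
            return compute
    return None  # the job waits in a queue until a compute frees up

availability = {"large": False, "medium": True}
chosen = pick_compute(["large", "medium"], availability.get)
# "large" is busy, so the job falls through to "medium"
```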

# Compute Templates

A compute template is a predefined set of compute resources. It consists of information defining CPUs, GPUs, memory and other metadata. A template exists for a specific compute resource (a Kubernetes cluster or an on-premise machine).

Technically, a compute template describes either one or more Kubernetes pods, or an on-premise machine. Each template you add is a different set of resources to use as a compute engine when running a job - either as a pod on a Kubernetes cluster or as a specific on-premise machine.

Your templates are available for selection whenever you encounter a Compute drop-down list. When you select one of your templates, cnvrg checks whether the required resources are available in your cluster (or whether the machine is available) and attempts to allocate resources of the requested size. For more information, see Using Compute in Jobs.

For each resource you are connected to, cnvrg automatically creates a set of default compute templates. You can customize existing templates, add templates or remove templates as desired.

There are three types of compute templates:

| Type of Template | Explanation |
| --- | --- |
| Regular | A set of resources defining a single pod that is launched on a single node (that is, non-distributed). A machine's template is also a regular template. |
| Open MPI | Used for multi-node distributed jobs. It allows you to run any Open MPI-compatible workload as a distributed job over more than one node. The master node and worker nodes can be defined separately. |
| Spark | Used for running distributed processing jobs using Spark. The master node and worker nodes can be defined separately. |

# Add a compute template

You can add compute templates to a cluster.

  1. Click Compute > Templates.
  2. Click Add Compute Template.
  3. In the list that appears, select the cluster you want to add the compute template to.
  4. Set the title for the template.
  5. Select which type of template you are creating, Regular, Open MPI or Spark.
  6. Fill in the rest of the relevant Specifications:
    • Number of CPUs
    • Amount of Memory
    • Number of GPUs
    • The Labels or selector for the node pools the template should use. For example, gputype=v100. Ensure it matches the labels on the target nodes. To specify several, separate them with commas.
  7. (For Open MPI and Spark) Fill in the Workers Specifications of the worker nodes:
    • Workers Count: The number of worker nodes to use with this template.
    • Number of CPUs
    • Amount of Memory
    • Number of GPUs
    • The Labels or selector for the node pools the template should use. For example, gputype=v100. Ensure it matches the labels on the target nodes. To specify several, separate them with commas.
  8. Click Save.

cnvrg adds the new template to the selected cluster.
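
Since a Regular template maps to a single Kubernetes pod, the specification fields above correspond roughly to a pod's resource requests. A hypothetical sketch of what a template with 2 CPUs, 8 GB of memory, 1 GPU and the gputype=v100 selector might ask the cluster for (cnvrg generates the actual pod spec internally, so names here are illustrative):

```yaml
# Illustrative only: the rough Kubernetes equivalent of a Regular template.
apiVersion: v1
kind: Pod
metadata:
  name: template-sketch      # hypothetical name
spec:
  nodeSelector:
    gputype: v100            # the template's Labels/selector field
  containers:
    - name: job
      image: <job-image>     # supplied by the job, not the template
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
        limits:
          nvidia.com/gpu: 1  # GPUs are requested via limits
```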

# Delete a compute template

To delete a compute template:

  1. Click Compute > Templates.
  2. Click the Delete icon at the far right of the compute template that you want to delete.
  3. Confirm the deletion.

cnvrg deletes the template from your organization.

# Edit an existing template

To edit a compute template:

  1. Go to Compute > Templates.
  2. Click the template you wish to edit. A page similar to the new template page is presented.
  3. Edit the fields as desired.
  4. Click Save.

cnvrg updates the details of the template.

# Compute Resources

In cnvrg, Kubernetes clusters and on-premise machines are referred to as compute resources.

cnvrg seamlessly integrates with your Kubernetes clusters and allows you to start leveraging your nodes quickly and easily. Your organization can connect to many Kubernetes clusters.

You can also connect to your on-premise machines and add them as resources as well.

You can see all your connected compute resources by going to Compute > Resources.

Here, you can also add, edit and delete Kubernetes clusters and machines from your organization.

# Kubernetes

# Add a Kubernetes cluster

  1. Go to Compute > Resources.
  2. Click on + Add Resource and then click Kubernetes from the list.
  3. Click the type of cluster you are adding: On-premise, Google GKE, Amazon EKS or Azure AKS.
  4. Type in the Title of the cluster (for use within cnvrg).
  5. Type in the Domain for the cluster.
  6. Paste the Kube Config for the cluster.
  7. Choose the provider for the cluster (GKE - Google, EKS - Amazon, AKS - Microsoft or On-Premise).
  8. (When type is not On-Premise) Check Use Persistent Volumes if relevant.
  9. (When type is not On-Premise) Check Spots Enabled if relevant.
  10. Click Create. cnvrg adds the cluster to your organization and automatically generates a set of default compute templates for it, displaying them in the panel that appears.
  11. Click Save to complete the creation of the Kubernetes cluster.

If you would like to edit the cluster's templates, remove templates or add additional templates, you can do so from the Compute > Templates page.

# Kubernetes cluster information

Each cluster that you have added as a resource to cnvrg is accompanied by a dashboard. You can use the cluster's page to update information about your cluster.

You can access this page by clicking on the name of the cluster in Compute > Resources.

At the top is a summary of the cluster's details including the cluster's name, who created the cluster, the status of the cluster and the date of the last health check.

The page has the following tabs with deeper information:

| Tab | Contents |
| --- | --- |
| Logs | The logs of health checks can be found here. |
| Kibana | The Kibana dashboard for the cluster can be accessed from this tab. Use it to gain deep insights into the logs of your cluster. |
| Grafana | The Grafana dashboard for the cluster can be accessed from this tab. Grafana allows you to monitor the compute usage for jobs running on the cluster. |
| Config | This tab holds the configuration details for the cluster. You can edit all of the cluster's details here as well. |
| System Dashboard | This tab holds a dashboard providing at-a-glance, in-depth insights into the health and utilization of the cluster. |

# Edit a Kubernetes cluster

To edit an existing Kubernetes cluster:

  1. Go to Compute > Resources.
  2. Click on the cluster you wish to edit.
  3. Click on the Config tab.
  4. Click Edit.
  5. Edit the fields as desired.
  6. Click Save.

cnvrg updates the details of the cluster.

# Delete a Kubernetes cluster

To delete a Kubernetes cluster from your organization:

  1. Go to Compute > Resources.
  2. Click on the cluster you wish to delete.
  3. Click on the Config tab.
  4. Click Delete.
  5. Confirm you want to delete the cluster by clicking Delete in the pop-up.

cnvrg deletes the cluster from your organization.

# On-Premise machines

# Add an on-premise machine

Before you add an on-premise machine, verify that its required dependencies are installed on the machine. Then, to add the machine:

  1. Go to Compute > Resources.
  2. Click on + Add Resource and then click Machine from the list.
  3. Type in the Title of your machine (for use within cnvrg).
  4. Type in the SSH details for the machine:
    • Username
    • Host
    • Port
  5. Choose an SSH authentication method and add the relevant credentials:
    • SSH Password or
    • SSH Key
  6. If it is a GPU machine, enable the GPU Machine toggle.
  7. Fill out the advanced settings (optional):
    • Set CPU Cores.
    • Set Memory in GB.
    • Set GPU Count (if relevant).
    • Set GPU Memory (if relevant).
    • Set the GPU Type (if relevant).
  8. Click Add.

cnvrg saves the details and adds the machine to your organization.

# On-Premise machine information

Each machine that you have added as a resource to cnvrg is accompanied by a dashboard. You can use the machine's page to update information about your machine.

You can access this page by clicking on the name of the machine in Compute > Resources.

At the top is a summary of the machine's details, including the machine's name, who created the machine, the status of the machine and the date of the last health check.

The page has the following tabs with deeper information:

| Tab | Contents |
| --- | --- |
| Logs | The logs for health checks can be found in this tab. |
| Config | This tab holds the configuration details for the machine. You can edit all of the machine's details here as well. |
| System Dashboard | This tab holds a dashboard providing at-a-glance, in-depth insights into the health and utilization of the machine. |

# Edit an on-premise machine

You can edit the settings of an on-premise machine in your organization:

  1. Go to Compute > Resources.
  2. Click on the machine you wish to edit.
  3. Click on the Config tab.
  4. Click Edit.
  5. Edit the fields as desired.
  6. Click Save.

cnvrg updates the details of the machine.

# Delete an on-premise machine

You can delete an on-premise machine from your organization:

  1. Go to Compute > Resources.
  2. Click on the machine you wish to delete.
  3. Click on the Config tab.
  4. Click Delete.
  5. Confirm you want to delete the machine by clicking Delete in the pop-up.

cnvrg deletes the machine from your organization.

# Compute Dashboards

One of the key goals of cnvrg is to simplify DevOps and provide tools to easily manage all of your compute resources. To make this possible, cnvrg builds in support for many different compute dashboards:

# System Dashboard (On-Premise and Kubernetes)

The system dashboard provides at-a-glance insights into the health and utilization of all of your resources, letting you monitor your GPUs and CPUs more easily than ever before.

The system dashboard can be found in the System Dashboard tab on the information page of each cluster and machine.

Inside the tab, dynamic, live charts for every relevant metric of your resource are displayed. You can get at-a-glance insights into:

• GPU charts:
  • GPU Utilization (%)
  • Memory (%)
  • Temperature (°C)
  • Power (W)
• CPU charts:
  • CPU Utilization (%)
  • Memory (MiB)
  • Disk IO
  • Network Traffic

At the top, you can set the time horizon for the displayed charts:

• Live
• 1 Hour
• 24 Hours
• 30 Days
• 60 Days

# Kibana (Kubernetes Only)

Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps.

Kibana is natively integrated with cnvrg and you can use it to dynamically visualize the logs of your Kubernetes cluster.

You can access the Kibana dashboard for your cluster from the cluster's information page. Additionally, any endpoints you create will be accompanied by a specific log dashboard using Kibana, which can be accessed in the Kibana tab of the service.

You can learn more about using Kibana in the Kibana docs.

# Grafana (Kubernetes Only)

Grafana allows you to query, visualize, alert on and understand the metrics of your Kubernetes cluster. Create, explore and share dashboards with your team with this simple, easy-to-use tool.

Its integration with cnvrg allows you to easily monitor the health of your cluster. You can check the resource usage of pods and create dynamic charts to keep an eye on your entire cluster.

You can access the Grafana dashboard for your cluster from the cluster's information page. Additionally, any endpoints you create will be accompanied by a specific resource dashboard using Grafana, which can be accessed in the Grafana tab of the service.

You can learn more about using Grafana in the Grafana docs.

# Compute Health Checks

To help you manage your compute effectively, cnvrg regularly checks the health of all your connected compute resources.

cnvrg queries the resources every 5 minutes to see if they are reachable and usable in jobs. You can follow the logs for this process in the Logs tab on the information page of each cluster and machine.

If the status is Online, the resource can be used.
If the status is Offline, cnvrg could not connect to the resource. You will need to troubleshoot the resource for any issues and check that its configuration details are correct.

When the status of a compute resource changes, an email notification is sent to the administrators of the organization.
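
The check-and-notify cycle can be sketched as follows. This is a hypothetical illustration, not cnvrg's actual implementation; all names are invented:

```python
# Hypothetical sketch of one health-check cycle: probe the resource,
# update its status, and report whether the status changed (a change is
# what would trigger the admin email notification).
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    status: str = "Offline"

def run_health_check(resource, is_reachable):
    """Set the resource to Online/Offline and report whether that changed."""
    new_status = "Online" if is_reachable(resource) else "Offline"
    changed = new_status != resource.status
    resource.status = new_status
    return changed

cluster = Resource("prod-cluster")
changed = run_health_check(cluster, lambda r: True)  # reachable -> Online
```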

# Using Compute in Jobs

Using the web UI, you can choose a compute when starting a workspace or experiment, or when building a flow.

You can also set a compute for jobs when running them using the cnvrg CLI and cnvrg SDK.

In running the job, cnvrg attempts to use the computes in the order in which you set them. If the first compute is unavailable, cnvrg attempts to use the second compute and so on. If none of the selected computes are available, the experiment enters a queued state and starts running when a compute becomes available.

# When starting a workspace or experiment using the web UI

When you start a workspace or experiment from the UI, you can choose one or more computes to attempt to run on, using the compute selector.

  1. Click Start Workspace or New Experiment and fill in the relevant details in the pane that appears.
    To choose your compute(s), select the Compute drop-down list (for a workspace), or the Compute drop-down list under Environment (for an experiment).
  2. Click each compute you want to attempt to use in the workspace or experiment. You can remove a compute from the list by clicking the X next to its name. The number next to each selected compute's title indicates the order in which cnvrg will try to use them.
  3. Click Start Workspace or Run.

# When building a flow using the web UI

  1. Open the flow for which you wish to choose the compute.
  2. In the Advanced tab, select the Compute drop-down list.
  3. Click each compute you want to attempt to use in the flow. You can remove a compute from the list by clicking the X next to its name. The number next to each selected compute's title indicates the order in which cnvrg will try to use them.
  4. Click Save Changes.

# Using the CLI

To set a compute when running an experiment, use the --machine flag:

```bash
cnvrg run python3 train.py --machine='medium'
```

You can include multiple computes in the list. See the full cnvrg CLI documentation for more information about running experiments using the CLI.

# Using the SDK

To set a compute when running an experiment using the Python SDK, use the compute parameter:

```python
from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                   compute='medium')
```

You can include multiple computes in the list. See the full cnvrg SDK documentation for more information about running experiments using the SDK.

# Viewing Job History

You can view a history of the jobs that ran recently.

Click Compute > Jobs.

The Jobs pane is displayed, showing a summary of the recent jobs that have run. There are columns showing:

• Title. Clicking the title brings you to the experiments page.
• Project the job ran in. Clicking the project name brings you to the project page.
• Status
• Duration
• User
• Created at
• Compute
• Image

# Controlling which Jobs run on which Nodes

Kubernetes is designed to efficiently allocate and orchestrate your compute. By default, any node can be used by any job/compute template if the requested resources exist. However, this means GPU nodes may be used for CPU jobs: a CPU job can occupy a GPU machine that is needed for a GPU job. This is just one example of behavior you might want to limit.

You can use Kubernetes and cnvrg to control which jobs run on which nodes. There are two ways to enforce this:

1. Adding a taint to the GPU node pool.
2. Using node labels.

# Using node taints

If you wish to restrict GPU nodes to only GPU jobs, add the following node taint to the node:

• key: nvidia.com/gpu
• value: present
• effect: NoSchedule

You can use the following kubectl command to taint a specific node:

```bash
kubectl taint nodes <node_name> nvidia.com/gpu=present:NoSchedule
```
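
Once the taint is applied, only pods that tolerate it can be scheduled onto those nodes. As an illustration (the exact mechanism cnvrg uses for its GPU pods is internal), a pod would need a matching toleration like:

```yaml
# Illustrative fragment: a toleration matching the taint above.
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```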

# Using node labels

You can use node labels to attach specific compute jobs to specific nodes.

To label your nodes with a custom label, use the following kubectl command:

```bash
kubectl label nodes <node-name> <label-key>=<label-value>
```

Now add the label to the compute templates you want to run on the labelled node. Go to Compute > Templates and edit the template you want to use only on the labelled node. In Selector, add the same <label-key>=<label-value>.

Now any jobs run on the template will only run on nodes with the matching labels.
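
Under the hood, a selector of this kind corresponds to a Kubernetes nodeSelector on the job's pod. A minimal illustration, assuming a node labelled gputype=v100 (the example label used earlier on this page):

```yaml
# Illustrative fragment: the pod-level selector implied by the template's
# Selector field. Only nodes carrying this label are eligible to run the pod.
spec:
  nodeSelector:
    gputype: v100
```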

# Installing Docker Compose on a GPU Machine

Docker Compose is not yet officially supported for GPU machines. However, it is still possible to set up Docker Compose to work on a GPU machine using the NVIDIA drivers. This relies on altering the default Docker runtime. Follow the guide below:

  1. Install Docker Compose.
  2. Download and install the nvidia-container-runtime:

     ```bash
     sudo apt-get install nvidia-container-runtime
     ```

  3. Set the NVIDIA runtime as the default Docker runtime by running the following command:

     ```bash
     sudo tee /etc/docker/daemon.json <<EOF
     {
         "default-runtime": "nvidia",
         "runtimes": {
             "nvidia": {
                 "path": "nvidia-container-runtime",
                 "runtimeArgs": []
             }
         }
     }
     EOF
     sudo pkill -SIGHUP dockerd
     ```

  4. Restart Docker:

     ```bash
     sudo systemctl restart docker
     ```

Last Updated: 4/13/2020, 8:31:42 AM