# Compute

As a data scientist, you need to tap large amounts of compute power for extended periods of time, reliably and efficiently.

cnvrg provides powerful features for managing compute, whether on a Kubernetes cluster or an on-premise machine.

cnvrg provides several default compute templates (predefined sets of compute resources). You can modify these and add new compute templates with no special effort.

In most instances, you configure the computes you'll need once, in the form of compute templates, at system setup. These become available across your entire organization.

Then you select one or more computes when creating your experiments, workspaces, flows, apps and serving endpoints.

You can do this through the cnvrg web UI, or by using a cnvrg CLI command or cnvrg SDK call.

When running the job, cnvrg attempts to use the computes in the order in which you attach them, based on available resources. If the first compute is unavailable, cnvrg attempts to use the second compute, and so on.

# Compute Templates

A compute template is a predefined set of compute resources. It consists of information defining CPUs, GPUs, memory and other metadata. A template exists for a specific compute resource (a Kubernetes cluster or an on-premise machine).

Technically, a compute template describes either one or more Kubernetes pods or an on-premise machine. Each template you add is a different set of resources to use as a compute engine when running a job - either as a pod on a Kubernetes cluster or as a specific on-premise machine.
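For example, when a job runs on a Kubernetes template, the pod cnvrg launches requests the CPU, memory and GPU defined in the template. A minimal sketch for inspecting those requests while a job is running, assuming you have kubectl access and that jobs run in a namespace named cnvrg (adjust the namespace to your deployment):

```bash
# Show the resource requests of the pods cnvrg launched for running jobs.
# The namespace name "cnvrg" is an assumption; change it to match your install.
kubectl get pods -n cnvrg \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests}{"\n"}{end}'
```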

Your templates are available for selection whenever you encounter a Compute drop-down list. When you select one of your templates, cnvrg checks if the required resources are available in your cluster or the machine is available and attempts to allocate the resources of the requested size. For more information, see Using Compute in Jobs.

For each resource you are connected to, cnvrg automatically creates a set of default compute templates. You can customize existing templates, add templates or remove templates as desired.

The templates for a remote Spark cluster must be configured before use and are not generated automatically.

There are five types of compute templates:

| Type of Template | Explanation |
| --- | --- |
| Regular | A regular template is a set of resources defining a single pod that is launched on a single node (that is, non-distributed). A machine's template is also a regular template. |
| Open MPI | This type of template is used for multi-node distributed jobs. It allows you to run any Open MPI compatible workload or job as a distributed workload over more than one node. The details of the master node and worker nodes can be defined separately. |
| PyTorch Distributed | A PyTorch Distributed template is used for running code that uses the torch.distributed framework. The template defines the configuration for the worker nodes. |
| Spark on Kubernetes | A Spark on Kubernetes template is used for running distributed processing jobs using Spark on Kubernetes. The details of the master node and worker nodes can be defined separately. |
| Remote Spark | A Remote Spark template is used for running distributed processing jobs using a remote Spark cluster. You choose an existing regular template as the Spark driver and then define the configuration of the desired worker nodes. |

# Add a compute template

You can add compute templates for a Kubernetes cluster or a remote Spark cluster.

# Delete a compute template

To delete a compute template:

1. Click Compute > Templates.
2. Click the Delete icon at the far right of the compute template you want to delete.
3. Confirm the deletion.

cnvrg deletes the template from your organization.

# Edit an existing template

To edit a compute template:

1. Go to Compute > Templates.
2. Click the template you wish to edit. A page similar to the new template page is displayed.
3. Edit the fields as desired.
4. Click Save.

cnvrg updates the details of the template.

# Compute Resources

In cnvrg, Kubernetes clusters and on-premise machines are referred to as compute resources.

cnvrg seamlessly integrates with your Kubernetes clusters and allows you to start leveraging your nodes quickly and easily. Your organization can connect to many Kubernetes clusters.

You can also connect your on-premise machines and add them as resources.

Additionally, you can add remote Spark clusters as compute resources.

You can see all your connected compute resources by going to Compute > Resources.

Here, you can also add, edit and delete Kubernetes clusters, machines and Spark clusters from your organization.

# Kubernetes

# Add a Kubernetes cluster

1. Go to Compute > Resources.
2. Click + Add Resource and then click Kubernetes in the list.
3. Click the type of cluster you are adding: On-premise, Google GKE, Amazon EKS or Azure AKS.
4. Type in the Title of the cluster (for use within cnvrg).
5. Type in the Domain for the cluster.
6. Paste the Kube Config for the cluster (see the sketch after these steps).
7. Choose the provider for the cluster (GKE - Google, EKS - Amazon, AKS - Microsoft or On-Premise).
8. (When the type is not On-Premise) Check Use Persistent Volumes if relevant.
9. (When the type is not On-Premise) Check Spots Enabled if relevant.
10. Click Create. cnvrg adds the cluster to your organization and displays the default compute templates it creates in the panel that appears.
11. Click Save to complete the creation of the Kubernetes cluster.
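If you need the raw kube config for step 6, a minimal sketch (assuming kubectl on your workstation is already pointed at the target cluster) is to print the active context's config and copy the output:

```bash
# Print the kube config for the current context only, with credentials embedded,
# so it can be pasted into the Kube Config field in cnvrg.
kubectl config view --minify --raw
```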

If you would like to edit the cluster's templates, remove templates or add additional templates, you can do so from the Compute > Templates page.

# Kubernetes cluster information

Each cluster that you have added as a resource to cnvrg is accompanied by a dashboard. You can use the cluster's page to update information about your cluster.

You can access this page by clicking the name of the cluster in Compute > Resources.

At the top is a summary of the cluster's details, including the cluster's name, who created the cluster, the status of the cluster and the date of the last health check.

The page has the following tabs with deeper information:

| Tab | Contents |
| --- | --- |
| Logs | The logs of health checks can be found here. |
| Kibana | The Kibana dashboard for the cluster can be accessed from this tab. Use it to gain deep insights into the logs of your cluster. |
| Grafana | The Grafana dashboard for the cluster can be accessed from this tab. Grafana allows you to monitor the compute usage of jobs running on the cluster. |
| Config | This tab holds the configuration details for the cluster. You can edit all of the cluster's details here as well. |
| System Dashboard | This tab holds a dashboard providing at-a-glance, in-depth insights into the health and utilization of the cluster. |

# Edit a Kubernetes cluster

To edit an existing Kubernetes cluster:

1. Go to Compute > Resources.
2. Click on the cluster you wish to edit.
3. Click on the Config tab.
4. Click Edit.
5. Edit the fields as desired.
6. Click Save.

cnvrg updates the details of the cluster.

# Delete a Kubernetes cluster

You can delete a Kubernetes cluster from your organization:

1. Go to Compute > Resources.
2. Click on the cluster you wish to delete.
3. Click on the Config tab.
4. Click Delete.
5. Confirm you want to delete the cluster by clicking Delete in the pop-up.

cnvrg deletes the cluster from your organization.

# On-Premise machines

# Add an on-premise machine

Before you add an on-premise machine, verify that the required dependencies are installed on it. Then, to add the machine:

1. Go to Compute > Resources.
2. Click on + Add Resource and then click Machine from the list.
3. Type in the Title of your machine (for use within cnvrg).
4. Type in all the SSH details for the machine:
   - Username
   - Host
   - Port
5. Choose an SSH authentication method and add the relevant authentication (see the connectivity sketch after these steps):
   - SSH Password or
   - SSH Key
6. If it is a GPU machine, enable the GPU Machine toggle.
7. Fill out the advanced settings (optional):
   - Set CPU Cores.
   - Set Memory in GB.
   - Set GPU Count (if relevant).
   - Set GPU Memory (if relevant).
   - Set the GPU Type (if relevant).
8. Click Add.

cnvrg saves the details and adds the machine to your organization.
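Before adding the machine, it can help to confirm that the SSH details you are about to enter actually work. A minimal sketch, using hypothetical values (user ubuntu, host 203.0.113.10, port 22) and assuming Docker and the NVIDIA drivers are among the dependencies your machine needs:

```bash
# Check that the SSH credentials cnvrg will use can log in (hypothetical values).
ssh -p 22 ubuntu@203.0.113.10 'echo connection ok'

# For key-based authentication, test with the same private key you will paste into cnvrg,
# and confirm Docker (and, for GPU machines, the NVIDIA drivers) are present.
ssh -i ~/.ssh/id_rsa -p 22 ubuntu@203.0.113.10 'docker --version && nvidia-smi'
```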

# On-Premise machine information

Each machine that you have added as a resource to cnvrg is accompanied by a dashboard. You can use the machine's page to update information about your machine.

You can access this page by clicking on the name of the machine in Compute > Resources.

At the top is a summary of the machine's details, including the machine's name, who created the machine, the status of the machine and the date of the last health check.

The page has the following tabs with deeper information:

| Tab | Contents |
| --- | --- |
| Logs | The logs for health checks can be found in this tab. |
| Config | This tab holds the configuration details for the machine. You can edit all of the machine's details here as well. |
| System Dashboard | This tab holds a dashboard providing at-a-glance, in-depth insights into the health and utilization of the machine. |

# Edit an on-premise machine

You can edit settings for an on-premise machine in your organization:

1. Go to Compute > Resources.
2. Click on the machine you wish to edit.
3. Click on the Config tab.
4. Click Edit.
5. Edit the fields as desired.
6. Click Save.

cnvrg updates the details of the machine.

# Delete an on-premise machine

You can delete an on-premise machine from your organization:

1. Go to Compute > Resources.
2. Click on the machine you wish to delete.
3. Click on the Config tab.
4. Click Delete.
5. Confirm you want to delete the machine by clicking Delete in the pop-up.

cnvrg deletes the machine from your organization.

# Spark clusters

While cnvrg natively supports running Spark on Kubernetes without any setup, you can additionally add an existing remote Spark cluster for use in cnvrg. To get started:

1. Add the Spark cluster as a compute resource.
2. Create the desired Spark compute templates for that cluster.

# Add a remote Spark cluster

1. Go to Compute > Resources.
2. Click on + Add Resource and then click Spark from the list.
3. Set the Title of the Spark cluster (for use in cnvrg).
4. Set the Spark Configuration. These key-value pairs are used to construct the spark-defaults.conf (see the sketch after these steps). Enter all of the settings that you need for your cluster. For example, the key spark.master with the value spark://your_host:your_port. A full list of options can be found here.
5. Add any desired Environment Variables. These key-value pairs will be exposed as environment variables in the Spark driver, in addition to any that exist in the chosen Docker image or are added in the project settings. For example, SPARK_HOME with the value /spark.
6. Upload Files that are relevant to your Spark configuration. Click Browse Files or drag and drop the files you wish to upload. Then add the exact target path and file name for where the file should be copied in the Spark driver. For example, yarn-conf.xml with Target Path /spark/yarn-conf.xml.
7. Click Save.
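As a sketch of what the key-value pairs in step 4 become, the following hypothetical entries (host, port and resource sizes are placeholders) would produce a spark-defaults.conf along these lines:

```bash
# Hypothetical spark-defaults.conf assembled from the Spark Configuration key-value pairs.
cat <<'EOF' > spark-defaults.conf
spark.master            spark://your_host:your_port
spark.executor.memory   4g
spark.executor.cores    2
EOF
```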

The Spark cluster will be added as a compute resource. Now you will need to define compute templates for use with the cluster. These define which regular compute template to use as the Spark driver and how many Spark worker nodes should be used, allowing you to run Spark jobs with different numbers of executors.

# Remote Spark cluster information

You can review the settings of your remote Spark cluster, including which user added it and when it was created.

1. Go to Compute > Resources.
2. Click on the Spark cluster you would like to view.

# Edit a remote Spark cluster

You can edit the settings of a remote Spark cluster in your organization:

1. Go to Compute > Resources.
2. Click on the Spark cluster you would like to edit.
3. Edit the fields as desired.
4. Click Save.

cnvrg updates the details of the Spark cluster.

# Delete a remote Spark cluster

You can delete a Spark cluster from your organization:

1. Go to Compute > Resources.
2. Click on the Spark cluster you would like to remove.
3. Click Delete.
4. Confirm the deletion by clicking Yes in the popup.

cnvrg will delete the Spark cluster and remove it as a compute resource.

# Compute Dashboards

One of the key goals of cnvrg is to simplify DevOps and provide tools to easily manage all of your compute resources. To make this possible, cnvrg builds in support for several compute dashboards:

# System Dashboard (machines and Kubernetes)

The system dashboard provides at-a-glance insights into the health and utilization of all of your resources, making it easier than ever to monitor your GPUs and CPUs.

The system dashboard can be found in the System Dashboard tab on the information page of each cluster and machine.

Inside the tab, dynamic, live charts for every relevant metric of your resource are displayed. You can get at-a-glance insights into:

- GPU charts:
  - GPU Utilization (%)
  - Memory (%)
  - Temperature (°C)
  - Power (W)
- CPU charts:
  - CPU Utilization (%)
  - Memory (MiB)
  - Disk IO
  - Network Traffic

At the top, you can set the time horizon for the charts that are displayed:

- Live
- 1 Hour
- 24 Hours
- 30 Days
- 60 Days

# Kibana (Kubernetes only)

Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack, so you can do anything from tracking query load to understanding the way requests flow through your apps.

Kibana is natively integrated with cnvrg and you can use it to dynamically visualize the logs of your Kubernetes cluster.

You can access the Kibana dashboard for your cluster from the cluster's information page. Additionally, any endpoints you create will be accompanied by a dedicated log dashboard using Kibana, which can be accessed in the Kibana tab of the service.

You can learn more about using Kibana in the Kibana docs.

# Grafana (Kubernetes only)

Grafana allows you to query, visualize, alert on and understand the metrics of your Kubernetes cluster. Create, explore and share dashboards with your team with this simple, easy-to-use tool.

Its integration with cnvrg allows you to easily monitor the health of your cluster. You can check the resource usage of pods and create dynamic charts to keep an eye on your entire cluster.

You can access the Grafana dashboard for your cluster from the cluster's information page. Additionally, any endpoints you create will be accompanied by a dedicated resource dashboard using Grafana, which can be accessed in the Grafana tab of the service.

You can learn more about using Grafana in the Grafana docs.

# Compute Health Checks

To help manage your compute effectively, cnvrg regularly checks the health of your connected machines and Kubernetes clusters.

cnvrg queries the resources every 5 minutes to see if they are reachable and usable in jobs. You can follow the logs for this process in the Logs tab on the information page of each cluster and machine.

If the status is Online, the resource can be used.
If the status is Offline, cnvrg could not connect to the resource. You will need to troubleshoot the resource for any issues and check that the configuration details for the resource are correct.
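When a Kubernetes cluster shows Offline, a couple of quick checks (a sketch, assuming you have kubectl access with the same kube config that was added to cnvrg) can help narrow down whether the cluster itself is reachable:

```bash
# Confirm the API server answers and the nodes are Ready.
kubectl cluster-info
kubectl get nodes
```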

When the status of a compute resource changes, an email notification is sent to the administrators of the organization.

# Using Compute in Jobs

Using the web UI, you can choose a compute when starting a workspace or experiment or when building a flow.

You can also set a compute for jobs when running them using the cnvrg CLI and cnvrg SDK.

When running the job, cnvrg attempts to use the computes in the order in which you set them. If the first compute is unavailable, cnvrg attempts to use the second compute, and so on. If none of the selected computes are available, the job enters a queued state and starts running when a compute becomes available.

# When starting a workspace or experiment using the web UI

When you start a workspace or experiment from the UI, you can choose one or more computes to attempt to run on, using the compute selector.

1. Click Start Workspace or New Experiment and fill in the relevant details in the pane that appears.
   To choose your compute(s), select the Compute drop-down list (for a workspace), or the Compute drop-down list under Environment (for an experiment).
2. Click each compute you want to attempt to use in the workspace or experiment. You can remove a compute from the list by clicking the X next to its name. The number displayed next to each selected compute indicates the order in which cnvrg will try to use them.
3. Click Start Workspace or Run.

# When building a flow using the web UI

1. Open the flow for which you wish to choose the compute.
2. In the Advanced tab, select the Compute drop-down list.
3. Click each compute you want to attempt to use in the flow. You can remove a compute from the list by clicking the X next to its name. The number displayed next to each selected compute indicates the order in which cnvrg will try to use them.
4. Click Save Changes.

# Using the CLI

To set a compute when running an experiment, use the --machine flag:

```bash
cnvrg run python3 train.py --machine='medium'
```

You can include multiple computes in the array. See the full cnvrg CLI documentation for more information about running experiments using the CLI.

# Using the SDK

To set a compute when running an experiment using the Python SDK, use the compute parameter:

```python
from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                   compute='medium')
```

You can include multiple computes in the array. See the full cnvrg SDK documentation for more information about running experiments using the SDK.

# Viewing Job History

You can view a history of the jobs that ran recently.

Click Compute > Jobs.

The Jobs pane is displayed, showing a summary of the recent jobs that have run. There are columns showing:

- Title. Clicking the title brings you to the experiment's page.
- Project the job ran in. Clicking the project name brings you to the project page.
- Status
- Duration
- User
- Created at
- Compute
- Image

# Controlling which Jobs run on which Nodes

Kubernetes is designed to efficiently allocate and orchestrate your compute. By default, any job/compute template can use any node if the requested resources exist. However, this may mean that GPU nodes get used for CPU jobs, tying up a GPU machine that should be reserved for GPU workloads. This is just one example of when you might want to limit this behavior.

You can use Kubernetes and cnvrg to control which jobs run on which nodes. There are three ways to enforce this:

1. Adding a taint to the GPU node pool
2. Using node labels
3. Controlling using instance type

# Add a taint to the GPU node pool

If you wish to restrict GPU nodes to only GPU jobs, add the following node taint to the node:

- key: nvidia.com/gpu
- value: present
- effect: NoSchedule

You can use the following kubectl command to taint a specific node:

```bash
kubectl taint nodes <node_name> nvidia.com/gpu=present:NoSchedule
```
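To confirm the taint was applied, a quick check (a sketch, assuming kubectl access to the cluster) is:

```bash
# List the taints currently set on the node.
kubectl describe node <node_name> | grep -A 2 Taints
```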
      

# Using node labels

You can use node labels to attach specific compute jobs to specific nodes.

To label your nodes with a custom label use the following kubectl command:

```bash
kubectl label nodes <node-name> <label-key>=<label-value>
```

Now add the label to the compute templates you want to run on the labelled node. Go to Compute > Templates and edit the template you want to use only on the labelled node. In Labels, add the same <label-key>=<label-value>.

Now any jobs run on the template will only run on nodes with the matching labels.
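For instance, a hypothetical setup that dedicates a node to GPU templates could look like this (the node name and label are examples only; you would then enter the same cnvrg-dedicated=gpu-jobs value in the template's Labels field):

```bash
# Label the node that should be reserved for GPU templates (hypothetical names).
kubectl label nodes gpu-node-1 cnvrg-dedicated=gpu-jobs

# Verify the label was applied.
kubectl get nodes -L cnvrg-dedicated
```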

# Controlling using instance type

You can also add a selector to a job template that uses the instance type of the desired node group. This does not require adding a label or taint to the node group.

Go to Compute > Templates and edit the template you want to restrict to a specific instance type. In Labels, add the following: beta.kubernetes.io/instance-type: 'desired-instance-type'

For example, to force the template to use an m5.xlarge, add beta.kubernetes.io/instance-type: 'm5.xlarge'.
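To see which instance types your nodes actually report (a sketch, assuming a cluster that still exposes the beta label used above), you can list the label values directly:

```bash
# Show each node's instance type as reported by the beta label.
kubectl get nodes -L beta.kubernetes.io/instance-type
```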

# Installing Docker Compose on a GPU Machine

Docker Compose is not yet officially supported for GPU machines. However, it is still possible to set up Docker Compose to work in the context of a GPU machine using NVIDIA drivers. This relies on altering the default Docker runtime that Docker Compose uses. Follow the guide below (a verification sketch follows these steps):

1. Install Docker Compose.
2. Download and install the nvidia-container-runtime:

   ```bash
   sudo apt-get install nvidia-container-runtime
   ```

3. Set the NVIDIA runtime as the default Docker runtime (which Docker Compose will use) by running the following commands:

   ```bash
   sudo tee /etc/docker/daemon.json <<EOF
   {
       "default-runtime": "nvidia",
       "runtimes": {
           "nvidia": {
               "path": "nvidia-container-runtime",
               "runtimeArgs": []
           }
       }
   }
   EOF
   sudo pkill -SIGHUP dockerd
   ```

4. Restart Docker:

   ```bash
   sudo systemctl restart docker
   ```
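To sanity-check the setup, one option (the CUDA image tag is just an example) is to run a CUDA base image and confirm the GPU is visible through the default runtime:

```bash
# With "nvidia" as the default runtime, this should print the nvidia-smi GPU table.
docker run --rm nvidia/cuda:11.0-base nvidia-smi
```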
        