# Compute

Data scientists require efficient and reliable access to large amounts of compute power for extended periods of time.

cnvrg provides powerful features to manage compute, whether in a Kubernetes cluster or an on-premise machine.

cnvrg provides several default compute templates (predefined sets of compute resources), which users can easily modify to create new, customized compute templates.

In most instances, you configure your required computes once at system setup in the form of compute templates. These become available across your entire organization.

Then, select one or more computes when creating an experiment, workspace, flow, app, or serving endpoint.

You can do this through the cnvrg web UI or with a cnvrg CLI command or SDK call.

When running the job, cnvrg attempts to use the computes in the order they were attached, based on available resources. If the first compute is unavailable, cnvrg attempts the second, and so on.

# Compute Templates

A compute template is a predefined set of compute resources. It consists of information defining CPUs, GPUs, memory, and other metadata. Each template belongs to a specific compute resource (a Kubernetes cluster or an on-premise machine).

Technically, a compute template describes either an on-premise machine or one or more Kubernetes pods. Each template added is a different set of resources to use as a compute engine when running a job, either as a specific on-premise machine or as a pod on a Kubernetes cluster.

Your templates are available for selection from a Compute drop-down list. When you select one of your templates, cnvrg checks for the availability of the machine or the required cluster resources and attempts to allocate the resources of the requested size. For more information, see Compute Usage in Jobs.

For each connected resource, cnvrg automatically creates a set of default compute templates. You can customize existing templates, add templates, or remove templates as desired.

The templates for a remote Spark cluster must be configured before use, as they are not generated automatically.

There are five types of cnvrg compute templates:

| Type | Explanation |
| --- | --- |
| Regular | A set of resources defining a single pod that is launched on a single node (that is, non-distributed). A machine's template is also a regular template. |
| Open MPI | Used for multi-node distributed jobs. It allows running any Open MPI-compatible job as a distributed workload over more than one node. Define the details of the master node and worker nodes separately. |
| PyTorch Distributed | Used for running code using the torch.distributed framework. The template defines the configuration for the worker nodes. |
| Spark on Kubernetes | Used for running distributed processing jobs using Spark on Kubernetes. Define the details of the master node and worker nodes separately. |
| Remote Spark | Used for running distributed processing jobs using a remote Spark cluster. Choose an existing regular template as the Spark driver and then configure the desired worker nodes. |
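Conceptually, every template type above reduces to a named set of resource requests. As a rough illustration only (this is not the actual cnvrg schema), a regular template could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class ComputeTemplate:
    """Illustrative model of a regular (single-pod) compute template."""
    title: str
    cpu_cores: float
    memory_gb: float
    gpu_count: int = 0

    def is_gpu(self) -> bool:
        # A template requesting at least one GPU is a GPU template.
        return self.gpu_count > 0

medium = ComputeTemplate(title="medium", cpu_cores=4, memory_gb=8)
gpu = ComputeTemplate(title="gpu", cpu_cores=8, memory_gb=16, gpu_count=1)
print(medium.is_gpu(), gpu.is_gpu())  # → False True
```

The distributed types (Open MPI, PyTorch Distributed, Spark) would add a worker count and a separate spec for master and worker nodes on top of this.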

# Add a compute template

You can add compute templates to Kubernetes and Spark clusters.

# Delete a compute template

Complete the following steps to delete a compute template:

1. Go to Compute > Templates.
2. Click the Delete button at the right end of the compute template to delete.
3. Confirm the deletion.

cnvrg deletes the template from your organization.

# Edit an existing template

Complete the following steps to edit a compute template:

1. Go to Compute > Templates.
2. Select the template to edit. A page similar to the new template page displays.
3. Edit the fields as desired.
4. Click Save.

cnvrg updates the details of the template.

# Compute Resources

In cnvrg, Kubernetes clusters and on-premise machines are referred to as compute resources.

cnvrg seamlessly integrates with your Kubernetes clusters and allows you to quickly and easily leverage your nodes. Your organization can connect to many Kubernetes clusters.

You can also connect to your on-premise machines and add them as resources.

Additionally, you can add remote Spark clusters as compute resources.

Navigate to Compute > Resources to view all your connected compute resources.

Here, you can also add, edit, and delete machines as well as Kubernetes and Spark clusters from your organization.

# Kubernetes

# Add a Kubernetes cluster

Complete the following steps to add a Kubernetes cluster:

1. Go to Compute > Resources.
2. Click + Add Resource and then select Kubernetes from the list.
3. Select the cluster type: On-premise, Google GKE, Amazon EKS, or Azure AKS.
4. Enter the Title of the cluster (for use within cnvrg).
5. Enter the Domain for the cluster.
6. Paste the Kube Config for the cluster.
7. Select the provider for the cluster: GKE - Google, EKS - Amazon, AKS - Microsoft, or On-Premise.
8. Check Use Persistent Volumes, if relevant (when the type is not On-Premise).
9. Check Spots Enabled, if relevant (when the type is not On-Premise).
10. Click Create. cnvrg adds the cluster to your organization and automatically creates a set of default templates for it, listing them in the displayed panel.
11. Click Save to complete the creation of the Kubernetes cluster.

To edit, remove, or add templates, navigate to the Compute > Templates page.

# Access Kubernetes cluster information

Each cluster added as a resource to cnvrg is accompanied by a dashboard. You can use the cluster's page to update information about your cluster.

Go to Compute > Resources and select the cluster name to access its information page.

At the top is a summary of the cluster's details, including its name, its creator, its status, and the date of its last health check.

The page has the following tabs with additional information:

| Tab | Contents |
| --- | --- |
| Logs | The cluster's logs for health checks. |
| Kibana | The cluster's Kibana dashboard, for gaining insights into its logs. |
| Grafana | The cluster's Grafana dashboard, for monitoring the compute usage of jobs running on the cluster. |
| Config | The cluster's configuration details, to view and edit. |
| System Dashboard | The dashboard for at-a-glance insights into the health and utilization of the cluster. |

# Edit a Kubernetes cluster

Complete the following steps to edit an existing Kubernetes cluster:

1. Go to Compute > Resources.
2. Select the cluster to edit.
3. Click the Config tab.
4. Click Edit.
5. Edit the fields as desired.
6. Click Save.

cnvrg updates the details of the cluster.

# Delete a Kubernetes cluster

Complete the following steps to delete a cluster from your organization:

1. Go to Compute > Resources.
2. Select the cluster to delete.
3. Click the Config tab.
4. Click Delete.
5. Click Delete to confirm the cluster deletion.

cnvrg deletes the cluster from your organization.

# On-Premise machines

# Add an on-premise machine

Before you can add an on-premise machine, verify the required dependencies are installed on it.

Complete the following steps to add an on-premise machine:

1. Go to Compute > Resources.
2. Click + Add Resource and then select Machine from the list.
3. Enter the Title of the machine (for use within cnvrg).
4. Provide all the SSH details for the machine:
  • Username
  • Host
  • Port
5. Choose an SSH authentication method and add the relevant authentication:
  • SSH Password or
  • SSH Key
6. If it is a GPU machine, enable the GPU Machine toggle.
7. Complete the advanced settings (optional):
  • Set CPU Cores.
  • Set Memory in GB.
  • Set GPU Count (if relevant).
  • Set GPU Memory (if relevant).
  • Set the GPU Type (if relevant).
8. Click Add.

cnvrg saves the details and adds the machine to your organization.

# Access on-premise machine information

Each machine added as a resource to cnvrg is accompanied by a dashboard. You can use the machine's page to update its information.

Go to Compute > Resources and select the machine name to access its information page.

At the top is a summary of the machine's details, including its name, its creator, its status, and the date of its last health check.

The page has the following tabs with additional information:

| Tab | Contents |
| --- | --- |
| Logs | The machine's logs for health checks. |
| Config | The machine's configuration details, to view and edit. |
| System Dashboard | The machine's dashboard for at-a-glance insights into its health and utilization. |

# Edit an on-premise machine

Complete the following steps to edit settings for an on-premise machine in your organization:

1. Go to Compute > Resources.
2. Select the machine to edit.
3. Click the Config tab.
4. Click Edit.
5. Edit the fields as desired.
6. Click Save.

cnvrg updates the details of the machine.

# Delete an on-premise machine

Complete the following steps to delete an on-premise machine from your organization:

1. Go to Compute > Resources.
2. Select the machine to delete.
3. Click the Config tab.
4. Click Delete.
5. Click Delete to confirm the machine deletion.

cnvrg deletes the machine from your organization.

# Spark clusters

While cnvrg natively supports running Spark on Kubernetes without any setup, you can also add an existing remote Spark cluster for use in cnvrg. The following sections describe the process.

# Add a remote Spark cluster

Complete the following steps to add a remote Spark cluster:

1. Go to Compute > Resources.
2. Click + Add Resource and then select Spark from the list.
3. Set the Title of the Spark cluster (for use in cnvrg).
4. Set the Spark Configuration. These key-value pairs are used to construct the spark-defaults.conf file. Enter all the settings required for your cluster. For example, spark.master and spark://your_host:your_port. A full list of options can be found in the Spark configuration documentation.
5. Add any desired Environment Variables. These key-value pairs are exposed as environment variables in the Spark driver, along with those located in the selected Docker image and those added in the Project Settings. For example, SPARK_HOME and /spark.
6. Upload Files relevant to your Spark configuration. Click Browse Files or drag and drop the files to upload. Then add the correct file path and name for where the file is to be copied in the Spark driver. For example, yarn-conf.xml with Target Path /spark/yarn-conf.xml.
7. Click Save.

The Spark cluster is added as a compute resource. Now, define compute templates for use with the cluster, identifying the regular compute template to use as the Spark driver and the number of Spark worker nodes. This allows you to run Spark jobs with different numbers of executors.
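The Spark Configuration key-value pairs map one-to-one onto lines of spark-defaults.conf, where each line is a key followed by its value. A minimal sketch of that rendering, using the spark.master example above plus a hypothetical extra setting:

```python
def render_spark_defaults(settings: dict) -> str:
    """Render key-value pairs in spark-defaults.conf format: one 'key value' line each."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())

conf = render_spark_defaults({
    "spark.master": "spark://your_host:your_port",  # from the example above
    "spark.executor.memory": "4g",                  # hypothetical extra setting
})
print(conf)
```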

# Access remote Spark cluster information

Complete the following steps to review the settings of your remote Spark cluster (including who added it and when it was created):

1. Go to Compute > Resources.
2. Select the Spark cluster to view.

# Edit a remote Spark cluster

Complete the following steps to edit settings for a remote Spark cluster in your organization:

1. Go to Compute > Resources.
2. Select the Spark cluster to edit.
3. Edit the fields as desired.
4. Click Save.

cnvrg updates the details of the Spark cluster.

# Delete a remote Spark cluster

Complete the following steps to delete a Spark cluster from your organization:

1. Go to Compute > Resources.
2. Select the Spark cluster to remove.
3. Click Delete.
4. Click Yes to confirm the Spark cluster deletion.

cnvrg deletes the Spark cluster and removes it as a compute resource from your organization.

# Compute Dashboards

One of the key cnvrg objectives is to simplify DevOps and provide tools to easily manage compute resources. To make this possible, cnvrg builds in support for many different compute dashboards:

# System Dashboard (machines and Kubernetes)

The system dashboard provides at-a-glance insights into the health and utilization of all of your resources. This allows you to easily monitor your GPUs and CPUs.

To display the system dashboard, click the System Dashboard tab on the information page of each cluster and machine.

The system dashboard displays dynamic and live charts for every relevant metric of your resource and provides at-a-glance insights into:

• GPU charts:
  • GPU Utilization (%)
  • Memory (%)
  • Temperature (°C)
  • Power (W)
• CPU charts:
  • CPU Utilization (%)
  • Memory (MiB)
  • Disk IO
  • Network Traffic

At the top, you can determine the time horizon for the displayed charts:

• Live
• 1 Hour
• 24 Hours
• 30 Days
• 60 Days

# Kibana (Kubernetes only)

Kibana enables you to visualize your Elasticsearch data and navigate the Elastic Stack, so you can perform tasks from tracking query loads to following request flows through your apps.

Kibana is natively integrated with cnvrg, and you can use it to dynamically visualize the logs of your Kubernetes cluster.

To display your cluster's Kibana dashboard, click the Kibana tab on the cluster's information page. Additionally, any endpoints you create are accompanied by a specific Kibana log dashboard.

You can learn more about using Kibana in the Kibana docs.

# Grafana (Kubernetes only)

Grafana allows you to query, visualize, alert on, and analyze the metrics of your Kubernetes cluster. Create, explore, and share dashboards with your team with this simple tool.

Its integration with cnvrg allows you to easily monitor your cluster's health. You can check pod resource usage and create dynamic charts to monitor your entire cluster.

To display your cluster's Grafana dashboard, click the Grafana tab on the cluster's information page. Additionally, any endpoints you create are accompanied by a specific Grafana resource dashboard.

You can learn more about using Grafana in the Grafana docs.

# Compute Health Checks

To help manage your compute effectively, cnvrg regularly checks the health of your connected machines and Kubernetes clusters.

cnvrg queries the resources every 5 minutes to determine whether they are reachable and usable in jobs. To follow the logs for this process, click the Logs tab on the information page of each cluster and machine.

If the status is Online, the resource can be used. If the status is Offline, cnvrg cannot connect to the resource. Troubleshoot the resource for any issues and check its configuration details to confirm their correctness.

When the status of a compute resource changes, cnvrg sends an email notification to the administrators of the organization.
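The health-check cycle can be sketched as a polling loop. This is an illustration only; probe and notify_admins below are stand-ins, not cnvrg internals:

```python
import time

def monitor(resources, probe, notify_admins, interval=300, cycles=1):
    """Poll each resource every `interval` seconds and report status changes."""
    status = {name: None for name in resources}
    for _ in range(cycles):
        for name in resources:
            new = "Online" if probe(name) else "Offline"
            # Notify only on a change, never on the initial check.
            if status[name] is not None and new != status[name]:
                notify_admins(f"{name} changed to {new}")
            status[name] = new
        if cycles > 1:
            time.sleep(interval)
    return status

# Example with a fake probe: the cluster responds, the machine does not.
result = monitor(["k8s-cluster", "on-prem-1"],
                 probe=lambda name: name == "k8s-cluster",
                 notify_admins=print)
print(result)  # → {'k8s-cluster': 'Online', 'on-prem-1': 'Offline'}
```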

# Compute Usage in Jobs

Using the web UI, you can select a compute when starting a workspace or experiment or when building a flow.

You can also set a compute for jobs when running them using the cnvrg CLI and SDK.

When running the job, cnvrg attempts to use the computes in the order you set them. If the first compute is unavailable, cnvrg attempts the second, and so on. If none of the selected computes are available, the experiment enters a queued state. When a compute becomes available, it starts running.
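The fallback behavior amounts to an ordered scan over the attached computes. A sketch, with is_available standing in for cnvrg's resource check:

```python
def pick_compute(computes, is_available):
    """Return the first available compute in the attached order, or None to queue."""
    for compute in computes:
        if is_available(compute):
            return compute
    return None  # no compute available: the job enters the queued state

available = {"large": False, "medium": True, "small": True}
chosen = pick_compute(["large", "medium", "small"], available.get)
print(chosen)  # → medium
```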

# Using the web UI when starting a workspace or experiment

When starting a workspace or experiment from the UI, complete the following steps to select one or more computes to run on:

1. Click Start Workspace or New Experiment and provide the relevant details in the displayed pane.
2. Display your computes by completing one of the following steps:
  • Click the Compute drop-down list (for a workspace).
  • Click the Compute drop-down list in the Environment section (for an experiment).
3. Select each compute to attempt to use in the workspace or experiment. The numbers next to their titles indicate the order in which cnvrg will attempt to use them. You can remove a compute from the list by clicking the X next to its name.
4. Click Start Workspace or Run.

# Using the web UI when building a flow

When building a flow, complete the following steps to select one or more computes:

1. Open the flow to select its computes.
2. Click the Advanced tab and then the Compute drop-down list.
3. Select each compute to attempt to use in the flow. The numbers next to their titles indicate the order in which cnvrg will attempt to use them. You can remove a compute from the list by clicking the X next to its name.
4. Click Save Changes.

# Using the CLI

To set a compute when running an experiment with the CLI, use the --machine flag:

```bash
cnvrg run python3 train.py --machine='medium'
```

You can include multiple computes in the array. See the full cnvrg CLI documentation for more information about running experiments using the CLI.

# Using the SDK

To set a compute when running an experiment with the Python SDK, use the compute parameter:

```python
from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                   compute='medium')
```

You can include multiple computes in the array. See the full cnvrg SDK documentation for more information about running experiments using the SDK.

# Job History Summary

Go to Compute > Jobs to view the history of recently run jobs. The displayed Jobs pane shows a summary of the recently run jobs with the following columns:

• Title: The job's title. Clicking the title displays the experiments page.
• Project: The job's project. Clicking the project name displays the job's project page.
• Status: Pending, Initializing, Running, Success, Aborted, Error, or Debug
• Duration: The time span the job ran
• User: The user who ran the job
• Created at: The time the job started
• Compute: The compute template on which the job ran
• Image: The image the job used

# Job Node Usage

Kubernetes is designed to efficiently allocate and orchestrate your compute. By default, any job and compute template can use any node if the requested resources exist. However, this means a GPU node may be used for a CPU job, occupying a GPU machine that is then unavailable for a GPU job. This is one example of why you may want to limit this behavior.

You can use Kubernetes and cnvrg to control the jobs running on specific nodes. There are three ways to enforce this:

# Adding a taint to the GPU node pool

To restrict GPU nodes to only GPU jobs, add the following taint to the node:

• key: nvidia.com/gpu
• value: present
• effect: NoSchedule

Use the following kubectl command to taint a specific node:

```bash
kubectl taint nodes <node_name> nvidia.com/gpu=present:NoSchedule
```

# Using node labels

You can use node labels to attach specific compute jobs to specific nodes.

To label your nodes with a custom label, use the following kubectl command:

```bash
kubectl label nodes <node-name> <label-key>=<label-value>
```

Now add the label to the compute templates that should run on the labelled node. Go to Compute > Templates and edit the template to use only the labelled node. In Labels, add the same <label-key>=<label-value>.

Now, any jobs using the template run only on nodes with the matching labels.
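Under the hood this is standard Kubernetes node-selector matching: every label on the template must be present on a node, with the same value, for the job's pod to schedule there. A sketch of that subset rule:

```python
def selector_matches(node_labels: dict, selector: dict) -> bool:
    """True if every selector key-value pair appears among the node's labels."""
    return all(node_labels.get(key) == value for key, value in selector.items())

node = {"accelerator": "a100", "disktype": "ssd"}       # hypothetical node labels
print(selector_matches(node, {"accelerator": "a100"}))  # → True
print(selector_matches(node, {"accelerator": "v100"}))  # → False
```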

# Controlling with instance type

You can also add a selector to a job template that uses the instance type of the desired node group. This does not require adding a label or taint to the node group.

Go to Compute > Templates and edit the template to use only a specific instance type. In Labels, add the following: beta.kubernetes.io/instance-type: 'desired-instance-type'.

For example, to enforce the template to use an m5.xlarge node, add beta.kubernetes.io/instance-type: 'm5.xlarge' in Labels.

# Docker Compose Installation on a GPU Machine

Docker Compose is not yet officially supported for GPU machines. However, it is still possible to set up Docker Compose to work on a GPU machine using the NVIDIA drivers. This relies on altering the default Docker runtime. Complete the following steps:

1. Install Docker Compose.
2. Download and install the nvidia-container-runtime:

   ```bash
   sudo apt-get install nvidia-container-runtime
   ```

3. Set the NVIDIA runtime as the default Docker runtime by running the following commands:

   ```bash
   sudo tee /etc/docker/daemon.json <<EOF
   {
       "default-runtime": "nvidia",
       "runtimes": {
           "nvidia": {
               "path": "nvidia-container-runtime",
               "runtimeArgs": []
           }
       }
   }
   EOF
   sudo pkill -SIGHUP dockerd
   ```

4. Restart Docker using the following command:

   ```bash
   sudo systemctl restart docker
   ```
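With the NVIDIA runtime set as the default, an ordinary Compose service can see the GPU without extra per-service settings. A hypothetical docker-compose.yml for a quick check (the image and command are placeholders; nvidia-smi should list the machine's GPUs if the setup worked):

```yaml
version: "2.3"
services:
  gpu-check:
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: nvidia-smi
```

Run it with docker-compose up; because the default runtime is now nvidia, no runtime key is required in the service definition.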
        
Last Updated: 12/1/2022, 9:47:18 PM