# Compute

Data scientists need robust, dependable access to substantial computational resources for extended durations.

The Intel® Tiber™ AI Studio platform offers advanced capabilities for managing compute resources, whether within a Kubernetes cluster or on on-premises machines. The software includes several preconfigured compute templates, which are predefined collections of computational resources. Users can modify these default templates or create entirely new, customized templates tailored to specific requirements.

Typically, users configure the necessary compute resources once during initial system setup using compute templates. These computes are then made available organization-wide, so they can be selected when creating experiments, workspaces, flows, apps, and serving jobs.

Compute template configuration can be performed through the AI Studio web UI, or alternatively through a CLI command or an SDK call.

During job execution, AI Studio prioritizes the use of compute resources based on their attachment order and availability. If the first compute resource is not available, AI Studio will sequentially attempt to utilize subsequent resources in the defined order until it finds an available compute instance.


# Compute Templates

A compute template is a predefined set of compute resources. It contains the information that defines CPUs, GPUs, memory, and other metadata. Each template exists for a specific compute resource, such as a Kubernetes cluster.

Technically, a compute template describes one or more Kubernetes pods. Each template added is a different set of resources to use as a compute engine when running an AI Studio job, typically as a specific pod on a Kubernetes cluster.
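
For example, once a job is running, you can inspect the resources its pod actually requested. This is a generic kubectl check, not an AI Studio command; it assumes kubectl access to the cluster, a hypothetical pod name, and that jobs run in the cnvrg namespace (substitute the namespace configured for your resource):

    kubectl -n cnvrg get pod <job-pod-name> -o jsonpath='{.spec.containers[0].resources}'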

For each connected resource, AI Studio automatically creates a set of default compute templates. Users can customize existing templates, add templates, or remove templates as required.

# Compute Template Settings

When users create or modify an AI Studio compute template, they can configure various resource parameters, including memory, CPUs, HPUs, and GPUs, to meet their specific needs. Compute templates include the following configurable parameters:

# General settings:

# Specifications:

  • CPU: Specifies the number of central processing units (CPUs) allocated to the compute template; this may be a whole number or a fraction in multi-threaded and virtualized environments.
  • Memory: Total system memory allocated to the compute template, expressed in gigabytes (GB), specifying the upper limit of RAM available for workloads.
  • GPU: Defines the number of NVIDIA GPU accelerators assigned to the compute template.
  • Custom Resources: To fully utilize platform capabilities, deployments must define custom resource requests (e.g., rdma/gpu_ib_fabric: 1). This ensures the necessary resources are allocated, optimizing workload performance while maintaining security and control.
  • Security Context: Allows admins to configure container-level security settings. By specifying capabilities such as IPC_LOCK and NET_RAW, admins control how containers interact with system resources, ensuring secure and compliant execution. (Available only to admins.)
  • Shared Memory: Indicates the amount of RAM allocated for shared access among the compute template's CPUs and GPUs.
  • Node Labels: Provides key-value pairs for node labeling. This label selector assigns the compute template to specific node pools. For example, use gputype=v100 to target nodes with NVIDIA V100 GPUs. Ensure that the labels specified match those in the node pool configuration. Separate multiple key-value pairs with commas. For additional information, refer to the section on Using Node Labels.
  • Node Taints: Defines key-value pairs for node tainting. This taint selector restricts the compute template to specific node pools, such as restricting GPU nodes to GPU-specific tasks (e.g., nvidia.com/gpu=present). Ensure that the taints specified match those configured in the node pool. Separate multiple key-value pairs with commas. For more details, see Adding Node Taints.
  • HPU (optional): Specifies the number of Gaudi AI accelerators assigned to the compute template.
  • Hugepages (optional): Configures access to memory pages larger than the default 4 KB size, allowing more efficient memory management within the pod. A node-side check is shown after this list.
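
Hugepages must also be preallocated on the underlying nodes. One way to check what a node exposes, using plain kubectl rather than anything AI Studio specific:

    kubectl describe node <node-name> | grep -i hugepages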

# Permissions:

  • Public: Grants access to the compute template to all users within the AI Studio organization.
  • Private: Restricts access to the compute template to a specified subset of users within the AI Studio organization.

# Compute template selection

When setting up a job in the AI Studio UI, users can select a compute template from the Compute drop-down list when building a flow or starting a workspace or experiment. A compute can also be set when running jobs using the CLI and SDK. When a template is selected, AI Studio checks the availability of the required cluster resources and attempts to allocate resources of the requested size. For more information, see Compute Usage in Jobs.

# Compute template types

The following table lists the different types of AI Studio compute templates:

| Type | Explanation |
| --- | --- |
| Regular | A set of resources defining a single pod that is launched on a single node (i.e., non-distributed). A machine's template is also a regular template. |
| Open MPI | Used for multi-node distributed jobs. It allows running any Open MPI-compatible job as a distributed workload over more than one node. Define the details of the master node and worker nodes separately. |
| PyTorch Distributed | Used for running code with the torch.distributed framework. The template defines the configuration for the worker nodes. |
| Spark | Used for running Spark distributed processing jobs on a Kubernetes cluster. Configure the Spark Driver specifications and the spark-defaults.conf properties in the Spark Configuration section. |
| Ray | Used for running Ray distributed computing. |
| Modin | Used for running Modin distributed computing. |

# Compute Template Functions

The AI Studio platform enables users to add templates, remove templates, and edit existing templates.

# Add a compute template

Compute templates can be added to Kubernetes and Spark clusters.

# Delete a compute template

Complete the following steps to delete a compute template:

1. Go to Compute > Templates.
2. Click the Delete button at the right end of the compute template to delete.
3. Confirm the deletion.

The AI Studio software deletes the template from your organization.

# Edit an existing template

Complete the following steps to edit a compute template:

1. Go to Compute > Templates.
2. Select the template to edit. A page similar to the new template page displays.
3. Edit the fields as desired.
4. Click Save.

The AI Studio software updates the details of the template.

# Compute Resources

In AI Studio, Kubernetes clusters are categorized as compute resources. The platform provides seamless integration with these clusters, enabling users to efficiently utilize their nodes. An organization can connect multiple Kubernetes clusters for enhanced resource management.

To view all connected compute resources, navigate to Compute > Resources. Within this section, you can:

• Add new Kubernetes or Spark clusters
• Edit configurations of existing clusters
• Delete clusters as needed

# Kubernetes

# Add a Kubernetes cluster

Complete the following steps to add a Kubernetes cluster:

1. Go to Compute > Resources.
2. Click + Add Resource and then select Kubernetes from the list.
3. Enter the Resource Name of the cluster (for use within AI Studio).
4. Enter the Domain for the cluster.
5. Paste the Kube Config for the cluster (one way to obtain it is shown after these steps).
6. Set the Scheduler from the list.
7. Define the Namespace.
8. Define the Prometheus URL (optional).
9. Check Use Persistent Volumes, if relevant.
10. Check HTTPS Scheme, if relevant.
11. Click Create.

AI Studio automatically adds the cluster to your organization and generates default templates for it. These templates are listed in the Templates tab, providing a streamlined overview of the available configurations for the newly added cluster. This allows for efficient management and deployment of resources within your organization.
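
For step 5, one common way to produce a self-contained kubeconfig to paste, assuming you already have kubectl access to the cluster from your workstation:

    kubectl config view --raw --minify --flatten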

TIP

To edit, remove, or add additional templates, navigate to the Compute > Templates page.

# Access Kubernetes cluster information

Each cluster added as a resource to AI Studio is accompanied by a dashboard. Users can use the cluster's page to update information about their cluster.

Go to Compute > Resources and select the cluster name to access its information page. At the top is a summary of the cluster's details, including its name, its creator, its status, and the date of its last health check.

The page has the following tabs with additional information:

| Tab | Contents |
| --- | --- |
| Logs | The cluster's logs for health checks. |
| Kibana | The cluster's Kibana dashboard to gain insights into its logs. |

# Edit a Kubernetes cluster

Complete the following steps to edit an existing Kubernetes cluster:

1. Go to Compute > Resources.
2. Click the menu icon at the right end of the resource entry and select Edit Resource from the dropdown menu to access the configuration settings.
3. Edit the fields as desired.
4. Click Save.

The AI Studio software updates the details of the cluster.

# Delete a Kubernetes cluster

Complete the following steps to delete a cluster from your organization:

1. Go to Compute > Resources.
2. Click the menu icon at the right end of the resource entry and select Delete Resource from the dropdown menu.
3. In the window that opens, type the word confirm.
4. Click Confirm and Delete.

The AI Studio software deletes the cluster from your organization.

# On-Premises Machines - Deprecated in AI Studio v4.8.0

# Add an on-premises machine

Before adding an on-premises machine, verify that the required dependencies are installed on it.

Complete the following steps to add an on-premises machine:

1. Go to Compute > Resources.
2. Click + Add Resource and then select Machine from the list.
3. Enter the Title of the machine (for use within AI Studio).
4. Provide all the SSH details for the machine:
   • Username
   • Host
   • Port
5. Choose an SSH authentication method and add the relevant authentication:
   • SSH Password or
   • SSH Key
6. If it is a GPU machine, enable the GPU Machine toggle.
7. Complete the advanced settings (optional):
   • Set CPU Cores.
   • Set Memory in GB.
   • Set GPU Count (if relevant).
   • Set GPU Memory (if relevant).
   • Set the GPU Type (if relevant).
8. Click Add.

The AI Studio software saves the details and adds the machine to your organization.
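
To verify the SSH details independently of AI Studio, a minimal sanity check from any terminal (substitute your own port, username, and host; for a GPU machine, appending nvidia-smi to the remote command also confirms the drivers respond):

    ssh -p <port> <username>@<host> 'uname -a'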

# Access on-premises machine information

Each machine added as a resource to AI Studio is accompanied by a dashboard. Users can use the machine's page to update its information.

Go to Compute > Resources and select the machine name to access its information page. At the top is a summary of the machine's details, including its name, its creator, its status, and the date of its last health check.

The page has the following tabs with additional information:

| Tab | Contents |
| --- | --- |
| Logs | The machine's logs for health checks. |
| Config | The machine's configuration details to view and edit. |
| System Dashboard | The machine's dashboard for at-a-glance insights into the health and utilization of the machine. |

# Edit an on-premises machine

Complete the following steps to edit settings for an on-premises machine in your organization:

1. Go to Compute > Resources.
2. Select the machine to edit.
3. Click the Config tab.
4. Click Edit.
5. Edit the fields as desired.
6. Click Save.

The AI Studio software updates the details of the machine.

# Delete an on-premises machine

Complete the following steps to delete an on-premises machine from your organization:

1. Go to Compute > Resources.
2. Select the machine to delete.
3. Click the Config tab.
4. Click Delete.
5. Click Delete to confirm the machine deletion.

The AI Studio software deletes the machine from your organization.

# Kibana

Kibana enables users to visualize their Elasticsearch data and navigate the Elastic Stack, so they can perform tasks from tracking query loads to following request flows through their apps.

Kibana is natively integrated with AI Studio and can be used to dynamically visualize the logs of a Kubernetes cluster.

To display a cluster's Kibana dashboard, click the Kibana tab on the cluster's information page. Additionally, any endpoints created are accompanied by a specific Kibana log dashboard.

Learn more about using Kibana in the Kibana docs.

# Compute Health Checks

To help manage compute effectively, AI Studio regularly checks the health of connected machines and Kubernetes clusters.

The AI Studio software queries the resources every 5 minutes to determine whether they are reachable and usable in jobs. To follow the logs for this process, click the Logs tab on the information page of each cluster and machine.

If the status is Online, the resource can be used. If the status is Offline, AI Studio cannot connect to the resource. Troubleshoot the resource for any issues and check its configuration details to confirm they are correct.

When the status of a compute resource changes, AI Studio sends an email notification to the administrators of the organization.

# Compute Usage in Jobs

Using the web UI, select a compute when starting a workspace or experiment, or when building a flow. A compute can also be set when running jobs using the CLI or SDK.

When running the job, AI Studio attempts to use the computes in the order the user set them. If the first compute is unavailable, AI Studio attempts to use the second compute, and so on. If none of the selected computes are available, the experiment enters a queued state and starts running when a compute becomes available.

# Using the web UI when starting a workspace or experiment

When starting a workspace or experiment from the UI, complete the following steps to select one or more computes to attempt to run on:

1. Click Start Workspace or New Experiment and provide the relevant details in the displayed pane.
2. Display your compute(s) in one of the following ways:
   • For a workspace: Click the Compute drop-down list.
   • For an experiment: Click the Compute drop-down list in the Environment section.
3. Select each compute to attempt to use in the workspace or experiment. As you select computes, the numbers next to their titles indicate the order in which AI Studio will attempt to use them. You can remove a compute from the list by clicking the X next to its name.
4. Click Start Workspace or Run.

# Using the web UI when building a flow

When building a flow, complete the following steps to select one or more computes:

1. Open the flow whose compute(s) you want to select.
2. Click the Advanced tab and then the Compute drop-down list.
3. Select each compute to attempt to use in the flow. As you select computes, the numbers next to their titles indicate the order in which AI Studio will attempt to use them. A compute can be removed from the list by clicking the X next to its name.
4. Click Save Changes.

# Using the CLI

To set a compute when running an experiment with the CLI, use the --templates flag:

    cnvrgv2 experiment run --command 'python3 train.py' --templates medium

A user can include multiple computes in the list. See the full AI Studio CLI documentation for more information about running experiments using the CLI.
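
For example, to try a second template if the first is unavailable, multiple templates can be passed. The comma-separated form below is an assumption; confirm the exact syntax against your CLI version's help:

    cnvrgv2 experiment run --command 'python3 train.py' --templates medium,small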

# Using the SDK

To set a compute when running an experiment with the Python SDK, use the templates parameter:

    from cnvrgv2 import Cnvrg

    # Connect to AI Studio and fetch the target project
    cnvrg = Cnvrg()
    myproj = cnvrg.projects.get("myproject")

    # Create the experiment; templates lists the computes in the
    # order AI Studio should attempt to use them
    e = myproj.experiments.create(title="my new exp",
                                  templates=["medium", "small"],
                                  command="python3 test.py")

A user can include multiple computes in the array. See the full AI Studio SDK documentation for more information about running experiments using the SDK.

# Job Node Usage

Kubernetes is designed to efficiently allocate and orchestrate compute. By default, any job and compute template can use all nodes if the requested resources exist. However, this means GPU nodes may be used for CPU jobs, leaving a GPU machine occupied by a CPU job when it is needed for a GPU job. This is one example of when a user may want to limit this behavior.

A user can use Kubernetes and AI Studio to control which jobs run on specific nodes. There are three ways to enforce this:

# Adding a taint to the GPU node pool

To restrict GPU nodes to only GPU jobs, add the following taint to the node:

• key: nvidia.com/gpu
• value: present
• effect: NoSchedule

Use the following kubectl command to taint a specific node:

    kubectl taint nodes <node_name> nvidia.com/gpu=present:NoSchedule
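
To confirm the taint was applied, a generic kubectl check:

    kubectl describe node <node_name> | grep -i taints

Then add the matching key-value pair (nvidia.com/gpu=present) to a GPU template's Node Taints field so that GPU jobs can still schedule onto the tainted nodes.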
      

# Using node labels

A user can use node labels to attach specific compute jobs to specific nodes.

To label nodes with a custom label, use the following kubectl command:

    kubectl label nodes <node-name> <label-key>=<label-value>

Now add the label to the compute templates that should run on the labeled node. Go to Compute > Templates and edit the template to use only the labeled node. In Labels, add the same <label-key>=<label-value>.

Now, any jobs run on the template run only on nodes with the matching labels.
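
To confirm which nodes carry the label, a generic kubectl check using a label selector:

    kubectl get nodes -l <label-key>=<label-value>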

# Controlling with instance type

A user can also add a selector to a job template that uses the instance type of the desired node group. This does not require adding a label or taint to the node group.

Go to Compute > Templates and edit the template to use only a specific instance type. In Labels, add the following: beta.kubernetes.io/instance-type: 'desired-instance-type'.

For example, to enforce that the template uses an m5.xlarge, add beta.kubernetes.io/instance-type: 'm5.xlarge' to the selector.
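
To verify which nodes match a given instance type before editing the template, list them with a label selector (note that newer Kubernetes versions expose the same information under node.kubernetes.io/instance-type):

    kubectl get nodes -l beta.kubernetes.io/instance-type=m5.xlarge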

# Docker Compose Installation on a GPU Machine

Docker Compose is not yet officially supported on GPU machines. However, it is still possible to set up Docker Compose to work on a GPU machine using the NVIDIA drivers. This relies on changing Docker's default runtime. Complete the following steps:

1. Install Docker Compose.
2. Download and install the nvidia-container-runtime:

        sudo apt-get install nvidia-container-runtime

3. Make the NVIDIA runtime the default Docker runtime by running the following command:

        sudo tee /etc/docker/daemon.json <<EOF
        {
            "default-runtime": "nvidia",
            "runtimes": {
                "nvidia": {
                    "path": "nvidia-container-runtime",
                    "runtimeArgs": []
                }
            }
        }
        EOF
        sudo pkill -SIGHUP dockerd

4. Restart Docker using the following command:

        sudo systemctl restart docker
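
To verify that containers now see the GPUs by default, run any CUDA-enabled image and check that nvidia-smi responds. The image tag below is only an example; substitute one available on your system:

    docker run --rm nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi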
        

# Queues for Job Scheduling

Utilizing compute resources efficiently is crucial in any organization, as they represent a significant investment. When faced with a limited amount of compute and multiple jobs requiring the same resources, it is essential to prioritize and allocate those resources in a manner that aligns with the organization's goals and priorities.

To address this challenge, AI Studio introduces "Queues," a system that enables job scheduling based on priority across resources. Through this system, users can assign a priority to a job and specify the list of compute resources required to execute it. AI Studio then runs the job on the most suitable compute resource based on its priority, optimizing resource utilization in line with the organization's objectives. The Queues feature is built on top of the AI Studio scheduler.

The AI Studio scheduler strategy prioritizes nodes that are already running GPU tasks for new GPU workloads. This approach consolidates GPU jobs onto fewer nodes, which can optimize GPU utilization and reduce resource fragmentation. Essentially, it stacks GPU tasks where GPUs are already in use, keeping some nodes focused on GPU work and others free for different tasks. This helps maintain dedicated resources for high-demand GPU workloads and improves overall efficiency.

# Enable

Before proceeding, ensure that the following prerequisites are met:

• AI Studio app version > v3.10.0. The version is displayed in the bottom left corner of the AI Studio UI. You can also run this command to check:

    kubectl -n cnvrg get cnvrgapp

• AI Studio operator version > 4.0.0. You can use kubectl with access to the cluster to check the current version by running the following command:

    kubectl -n cnvrg get deploy cnvrg-operator -o yaml | grep "image: "

# Install

When the prerequisites are met, queues are enabled automatically.

1. Make sure the compute resource is configured to use the AI Studio scheduler: in the UI, navigate to Compute → Resources → choose the relevant cluster's menu → Edit Resource.
2. Confirm that cnvrgScheduler is selected.
3. Using kubectl with access to the cluster, check that cnvrg-scheduler is running (an example command follows these steps).
4. Using kubectl with access to the cluster, check that cnvrgScheduler is enabled:

        kubectl -n cnvrg edit cnvrgapp

        # navigate to 'cnvrgScheduler'; it should be enabled:
        cnvrgScheduler:
          enabled: true
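
For step 3, one quick way to check, assuming the scheduler runs as a pod in the cnvrg namespace:

    kubectl -n cnvrg get pods | grep scheduler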
      

Note: each cluster is assigned a dedicated AI Studio scheduler; however, users may opt to disable the scheduler and use Kubernetes as the cluster scheduler.
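
To switch back to the Kubernetes scheduler, the same cnvrgapp spec shown above can be edited to disable the AI Studio scheduler (a sketch based on the field from the previous step):

    kubectl -n cnvrg edit cnvrgapp
    # then set:
    #   cnvrgScheduler:
    #     enabled: false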

# How to use

After enabling the scheduler feature, organizations can create queues to prioritize the execution of jobs.

Each queue has:

• Title
• Value
• Preemption (On/Off): whether jobs running on this queue can be stopped to allow a higher-priority job to execute
• Permissions to use the queue

Once a queue is defined, users can schedule a job on it, and the job will be executed according to the queue's priority relative to other jobs waiting to be executed.

For example, consider a scenario where an organization maintains two queues, Urgent and Default, for executing jobs.

If a user (or multiple users) initiates two jobs simultaneously, one job is assigned to the Default queue and the other to the Urgent queue, which has a higher priority than the Default queue.

The job assigned to the Urgent queue is given priority for execution ahead of the Default queue. Only after the Urgent job completes does the job assigned to the Default queue commence execution.

Additionally, if two jobs are scheduled to run on the same queue at the same time, they are executed on a FIFO (First In, First Out) basis by default.

# Preemption

Preemption in the AI Studio platform allows a running job to be removed in favor of a higher-priority job. For example, suppose a job is running on a queue with priority 1 and preemption enabled. If a higher-priority (> 1) task is scheduled, the lower-priority job is halted and its resources are reassigned to the higher-priority task.

• Example workflow:
  • Job A commences on a queue with priority 1.
  • Job B is launched on a queue with priority 2.
  • When a request to allocate resources for Job B is received, Job A is stopped and moved back to the queue (same priority, new start_commit), and Job B is executed. Note: Job A is only stopped if the resources it uses are sufficient for Job B.
  • Once Job B concludes and resources are available, Job A resumes with the artifacts created in its previous run.

FAQ

Q: Is it possible to run this without using a queue?
A: The Default queue is selected unless a different one is specified.

Q: What happens if I want to run a job at 'urgent' priority but the necessary resources are not available, while at the same time I want to run a task at 'default' priority for which the resources are available?
A: The default job will run. Resources are only allocated to requests that can be satisfied.
