# Compute
Data scientists require efficient and reliable access to large amounts of compute power for extended periods of time.
The cnvrg platform includes powerful features to manage compute, whether in a Kubernetes cluster or an on-premises machine. The software provides several default compute templates (predefined sets of computer resources), which users can easily modify to create new, customized compute templates.
In most instances, users configure their required computes once at system setup in the form of compute templates. These become available across an entire organization. Then, they can select one or more computes during the creation of an experiment, workspace, flow, apps, and serving.
Compute template setup can be accomplished through the cnvrg web UI or with a cnvrg CLI command or SDK call.
When running a job, cnvrg attempts to use the computes in the order they were attached, based on available resources. If the first compute is unavailable, cnvrg attempts to use the second compute, and so on.
# Compute Templates
A compute template is a predefined set of computer resources. It consists of information to define CPUs, GPUs, memory, and other metadata. A template exists for a specific compute resource such as a Kubernetes cluster or on-premises machine.
Technically, a compute template describes either an on-premises machine or one or more Kubernetes pods. Each template added is a different set of resources to use as a compute engine when running a cnvrg job, either as a specific on-premises machine or as a pod on a Kubernetes cluster.
For each connected resource, cnvrg automatically creates a set of default compute templates. Users can customize existing templates, add templates, or remove templates as required.
# Compute Template Settings
When users add or edit a cnvrg compute template, they can adjust the resources such as memory, CPUs, HPUs, and GPUs according to their requirements. Compute templates include the following configurable information:
- General settings
- Title – the compute template title
- Type – the compute template type
- Specifications
- CPU – the number of processors
- Memory – the amount of total RAM
- GPU – the fractional number or whole number of NVIDIA GPU accelerators; see metaGPU
- HPU – the number of Gaudi AI accelerators
- Hugepages – the pod access to memory pages larger than the default 4-KB memory page size
- Shared Memory – the amount of RAM shared among the pod’s set CPUs and GPUs
- Node Labels – key-value pair label selector for the template’s node pools, used to attach the compute to a specific node, for example `gputype=v100`. This must also match the template. Separate more than one key-value pair with commas. See Using Node Labels.
- Node Taints – key-value pair taint selector for the template’s node pools, used, for example, to restrict a GPU node to only GPU jobs with `nvidia.com/gpu=present`. This must also match the template. Separate more than one key-value pair with commas. See Adding Node Taints.
- Permissions
- Public – all users within a cnvrg organization have access to the compute template
- Private – only specified users within a cnvrg organization have access to the compute template
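As a minimal sketch of the comma-separated key-value format described for Node Labels and Node Taints, the following Python helper splits such a string into a dictionary. The function name `parse_pairs` is hypothetical and is not part of the cnvrg SDK; it only illustrates the input shape the fields expect.

```python
from typing import Dict


def parse_pairs(text: str) -> Dict[str, str]:
    """Split 'key=value,key=value' into a dict (hypothetical helper)."""
    pairs: Dict[str, str] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty segments such as a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


labels = parse_pairs("gputype=v100, nvidia.com/gpu=present")
print(labels)
```

A malformed entry (for example, a key with no `=`) raises `ValueError`, which is one way to catch typos before saving a template.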
# Compute template selection
If setting up a job within the cnvrg UI, users can select a compute template using the Compute drop-down list when building a flow or starting a workspace or experiment. A compute can also be set when running jobs using the cnvrg CLI and SDK. When a template is selected, cnvrg checks for the availability of the machine or the required cluster resources and attempts to allocate the resources of the requested size. For more information, see Compute Usage in Jobs.
# Compute template types
The following table lists the different types of cnvrg compute templates:
Type | Explanation |
---|---|
Regular | This template is a set of resources defining a single pod that is launched on a single node (i.e., non-distributed). A machine's template is also a regular template. |
Open MPI | This template is used for multi-node distributed jobs. It allows running any Open MPI compatible job as a distributed workload over more than one node. Define the details of the master node and worker nodes separately. |
PyTorch Distributed | This template is used for running code using the torch.distributed framework. The template defines the configuration for the worker nodes. |
Spark | This template is used for running Spark distributed processing jobs on a Kubernetes cluster. Configure the Spark Driver specifications and the Spark Configuration section's spark-defaults.conf properties. |
Ray | This template is used for running Ray distributed computing. |
Modin | This template is used for running Modin distributed computing. |
# Compute Template Functions
The cnvrg platform enables users to add templates, remove templates, and edit existing templates.
# Add a compute template
Compute templates can be added to Kubernetes and Spark clusters.
# Delete a compute template
Complete the following steps to delete a compute template:
- Go to Compute > Templates.
- Click the Delete button at the right end of the compute template to delete.
- Confirm the deletion.
The cnvrg software deletes the template from your organization.
# Edit an existing template
Complete the following steps to edit a compute template:
- Go to Compute > Templates.
- Select the template to edit. A page similar to the new template page displays.
- Edit the fields as desired.
- Click Save.
The cnvrg software updates the details of the template.
# Compute Resources
In cnvrg, Kubernetes clusters and on-premises machines are referred to as compute resources. The cnvrg platform seamlessly integrates with Kubernetes clusters and allows users to quickly and easily leverage their nodes. An organization can connect to many Kubernetes clusters.
Users can also connect to their on-premises machines and add them as resources.
Navigate to Compute > Resources to view all the connected compute resources. Here, add, edit, and delete machines as well as Kubernetes and Spark clusters from an organization.
# Kubernetes
# Add a Kubernetes cluster
Complete the following steps to add a Kubernetes cluster:
- Go to Compute > Resources.
- Click + Add Resource and then select Kubernetes from the list.
- Select the cluster type: On-premise, Google GKE, Amazon EKS, or Azure AKS.
- Enter the Title of the cluster (for use within cnvrg).
- Enter the Domain for the cluster.
- Paste the Kube Config for the cluster.
- Select the provider for the cluster: GKE - Google, EKS - Amazon, AKS - Microsoft, or On-Premise.
- Check Use Persistent Volumes, if relevant. (When type is not On-Premise.)
- Check Spots Enabled, if relevant. (When type is not On-Premise.)
- Click Create. cnvrg adds the cluster to your organization and creates default templates for the cluster, listing them in the displayed panel.
- Click Save to complete the creation of the Kubernetes cluster. A set of default templates also automatically generates.
To edit, remove, or add additional templates, navigate to the Compute > Templates page.
# Access Kubernetes cluster information
Each cluster added as a resource to cnvrg is accompanied by a dashboard. Users can use the cluster's page to update information about their cluster.
Go to Compute > Resources and select the cluster name to access its information page. At the top is a summary of the cluster's details including its name, its creator, its status, and the date of its last health check.
The page has the following tabs with additional information:
Tab | Contents |
---|---|
Logs | The cluster's logs for health checks. |
Kibana | The cluster's Kibana dashboard to gain insights into its logs. |
Grafana | The cluster's Grafana dashboard to monitor the compute usage for jobs running on the cluster. |
Config | The cluster's configuration details to view and edit. |
System Dashboard | The dashboard to obtain at-a-glance insights into the health and utilization of the cluster. |
# Edit a Kubernetes cluster
Complete the following steps to edit an existing Kubernetes cluster:
- Go to Compute > Resources.
- Select the cluster to edit.
- Click the Config tab.
- Click Edit.
- Edit the fields as desired.
- Click Save.
The cnvrg software updates the details of the cluster.
# Delete a Kubernetes cluster
Complete the following steps to delete a cluster from your organization:
- Go to Compute > Resources.
- Select the cluster to delete.
- Click the Config tab.
- Click Delete.
- Click Delete to confirm the cluster deletion.
The cnvrg software deletes the cluster from your organization.
# On-Premises machines - Deprecated on cnvrg v4.8.0
# Add an on-premises machine
Before adding an on-premises machine, verify the following dependencies are installed on it:
Complete the following steps to add an on-premises machine:
- Go to Compute > Resources.
- Click + Add Resource and then select Machine from the list.
- Enter the Title of the machine (for use within cnvrg).
- Provide all the SSH details for the machine:
- Username
- Host
- Port
- Choose an SSH authentication method and add the relevant authentication:
- SSH Password or
- SSH Key
- If it is a GPU machine, enable the GPU Machine toggle.
- Complete the advanced settings (optional):
- Set CPU Cores.
- Set Memory in GB.
- Set GPU Count (if relevant).
- Set GPU Memory (if relevant).
- Set the GPU Type (if relevant).
- Click Add.
The cnvrg software saves the details and adds the machine to your organization.
# Access on-premises machine information
Each machine added as a resource to cnvrg is accompanied by a dashboard. Users can use the machine's page to update its information.
Go to Compute > Resources and select the machine name to access its information page. At the top is a summary of the machine's details including its name, its creator, its status, and the date of its last health check.
The page has the following tabs with additional information:
Tab | Contents |
---|---|
Logs | The machine's logs for health checks. |
Config | The machine's configuration details to view and edit. |
System Dashboard | The machine's dashboard to obtain at-a-glance insights into the health and utilization of the machine. |
# Edit an on-premises machine
Complete the following steps to edit settings for an on-premises machine in your organization:
- Go to Compute > Resources.
- Select the machine to edit.
- Click the Config tab.
- Click Edit.
- Edit the fields as desired.
- Click Save.
The cnvrg software updates the details of the machine.
# Delete an on-premises machine
Complete the following steps to delete an on-premises machine from an organization:
- Go to Compute > Resources.
- Select the machine to delete.
- Click the Config tab.
- Click Delete.
- Click Delete to confirm the machine deletion.
The cnvrg software deletes the machine from an organization.
# Compute Dashboards
One of the key cnvrg objectives is to simplify DevOps and provide tools to easily manage compute resources. To make this possible, cnvrg builds in support for many different compute dashboards:
# System Dashboard (machines and Kubernetes)
The system dashboard provides at-a-glance insights into the health and utilization of all of an organization's resources. This allows users to easily monitor their GPUs, HPUs, and CPUs.
To display the system dashboard, click the System Dashboard tab on the information page of each cluster and machine.
The system dashboard displays dynamic and live charts for every relevant metric of a resource and provides at-a-glance insights into:
- GPU charts:
- GPU Utilization (%)
- Memory (%)
- Temperature (°C)
- Power (W)
- CPU charts:
- CPU Utilization (%)
- Memory (MiB)
- Disk IO
- Network Traffic
At the top, you can determine the time horizon for the displayed charts:
- Live
- 1 Hour
- 24 Hours
- 30 Days
- 60 Days
# Kibana (Kubernetes only)
Kibana enables users to visualize their Elasticsearch data and navigate the Elastic Stack so they can perform tasks from tracking query loads to following request flows through their apps.
Kibana is natively integrated with cnvrg and can be used to dynamically visualize the logs of a Kubernetes cluster.
To display a cluster's Kibana dashboard, click the service's Kibana tab in the cluster's information page. Additionally, any endpoints created are accompanied by a specific Kibana log dashboard.
Learn more about using Kibana in the Kibana docs.
# Grafana (Kubernetes only)
Grafana allows users to query, visualize, alert on, and analyze the metrics of their Kubernetes clusters. They can create, explore, and share dashboards with their teams with this simple tool.
Its integration with cnvrg allows users to easily monitor a cluster's health. They can check pod resource usage and create dynamic charts to monitor an entire cluster.
To display a cluster's Grafana dashboard, click the service's Grafana tab in the cluster's information page. Additionally, any endpoints created are accompanied by a specific Grafana resource dashboard.
Learn more about using Grafana in the Grafana docs.
# Compute Health Checks
To help manage compute effectively, cnvrg regularly checks the health of connected machines and Kubernetes clusters.
The cnvrg software queries the resources every 5 minutes to determine if they are reachable and usable in jobs. To follow the logs for this process, click the Logs tab on the information page of each cluster and machine.
If the status is Online, the resource can be used.
If the status is Offline, cnvrg cannot connect to the resource. Troubleshoot the resource for any issues and check its configuration details to confirm their correctness.
When the status of a compute resource changes, cnvrg sends an email notification to the administrators of the organization.
# Compute Usage in Jobs
Using the web UI, select a compute when starting a workspace or experiment or when building a flow. A compute can also be set when running jobs using the cnvrg CLI and SDK.
When running the job, cnvrg attempts to use the computes in the order a user sets them. If the first compute is unavailable, cnvrg attempts to use the second compute, and so on. If none of the selected computes are available, the experiment enters into a queued state. When a compute becomes available, it starts running.
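The fallback order described above can be sketched in a few lines of Python. This is an illustration of the selection logic only, not cnvrg's implementation; `pick_compute` and the availability snapshot are hypothetical names invented for the example.

```python
from typing import Callable, Optional, Sequence


def pick_compute(ordered_computes: Sequence[str],
                 is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first available compute in the user's order, or None."""
    for name in ordered_computes:
        if is_available(name):
            return name
    return None  # nothing is free: the job would wait in a queued state


# Hypothetical availability snapshot, for illustration only.
free = {"small": False, "medium": True, "large": True}
chosen = pick_compute(["small", "medium", "large"],
                      lambda name: free.get(name, False))
print(chosen or "queued")
```

Here `small` is skipped because it is unavailable, so the job lands on `medium` even though `large` is also free.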
# Using the web UI when starting a workspace or experiment
When starting a workspace or experiment from the UI, complete the following steps to select one or more computes to attempt to run on:
- Click Start Workspace or New Experiment and provide the relevant details in the displayed pane.
- To display your compute(s), complete one of the following two steps:
- Click the Compute drop-down list (for a workspace)
- Click the Compute drop-down list in the Environment section (for an experiment)
- Select each compute to attempt to use in the workspace or experiment. The numbers next to their titles (when clicked) indicate the order cnvrg will attempt to use them. You can remove a compute from the list by clicking the X next to its name.
- Click Start Workspace or Run.
# Using the web UI when building a flow
When building a flow, complete the following steps to select one or more computes:
- Open the flow to select its compute(s).
- Click the Advanced tab and then the Compute drop-down list.
- Select each compute to attempt to use in the flow. The numbers next to their titles (when clicked) indicate the order cnvrg will attempt to use them. A compute can be removed from the list by clicking the X next to its name.
- Click Save Changes.
# Using the CLI
To set a compute when running an experiment with the CLI, use the `--machine` flag:
cnvrg run python3 train.py --machine='medium'
A user can include multiple computes in the array. See the full cnvrg CLI documentation for more information about running experiments using the CLI.
# Using the SDK
To set a compute when running an experiment with the Python SDK, use the `compute` parameter:
from cnvrg import Experiment
e = Experiment.run('python3 train.py',
                   compute='medium')
A user can include multiple computes in the array. See the full cnvrg SDK documentation for more information about running experiments using the SDK.
# Job History Summary
Go to Compute > Jobs to view the history of recently run jobs. The displayed Jobs pane shows a summary of the recently run jobs with the following columns:
- Title: The job's title. Clicking the title displays the experiments page.
- Project: The job's project. Clicking the project name displays the job's project page.
- Status: Pending, Initializing, Running, Success, Aborted, Error, or Debug
- Duration: The time span the job ran
- User: The user who ran the job
- Created at: The time the job started
- Compute: The compute template on which the job ran
- Image: The image the job used
# Job Node Usage
Kubernetes is designed to efficiently allocate and orchestrate compute. By default, all nodes can be used by any job and compute template if the requested resources exist. However, this means that GPU nodes may be used for CPU jobs, so a CPU job may occupy a GPU machine at a time when it is needed for a GPU job. This is one example of when a user may want to limit this behavior.
A user can use Kubernetes and cnvrg to control the jobs running on specific nodes. There are three ways to enforce this:
# Adding a taint to the GPU node pool
To restrict GPU nodes to only GPU jobs, add the following taint to the node:
- key: `nvidia.com/gpu`
- value: `present`
- effect: `NoSchedule`
Use the following kubectl command to taint one or more specific nodes:
kubectl taint nodes <node_name> nvidia.com/gpu=present:NoSchedule
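To make the `key=value:effect` form that `kubectl taint` expects explicit, here is a small Python sketch that assembles the command from the three fields listed above. The helper name `taint_command` and the node name `gpu-node-1` are hypothetical; the sketch only builds and prints the command, it does not run it.

```python
import shlex
from typing import List

# Taint effects accepted by Kubernetes.
ALLOWED_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def taint_command(node: str, key: str, value: str, effect: str) -> List[str]:
    """Build the argument list for 'kubectl taint nodes' (hypothetical helper)."""
    if effect not in ALLOWED_EFFECTS:
        raise ValueError(f"unsupported taint effect: {effect!r}")
    return ["kubectl", "taint", "nodes", node, f"{key}={value}:{effect}"]


cmd = taint_command("gpu-node-1", "nvidia.com/gpu", "present", "NoSchedule")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.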
# Using node labels
A user can use node labels to attach specific compute jobs to specific nodes.
To label nodes with a custom label, use the following kubectl command:
kubectl label nodes <node-name> <label-key>=<label-value>
Now add the label to the compute templates to run on the labeled node.
Go to Compute > Templates and edit the template to use only on the labeled node. In Labels, add the same `<label-key>=<label-value>`.
Now, any jobs run on the template run only on nodes with the matching labels.
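The matching rule can be pictured as a subset check: every label on the template must appear on the node with the same value. The Python sketch below illustrates that rule under this assumption; it is not the Kubernetes scheduler's code, and `node_matches` is a hypothetical name.

```python
from typing import Dict


def node_matches(template_labels: Dict[str, str],
                 node_labels: Dict[str, str]) -> bool:
    """True when the node carries every label the template asks for."""
    return all(node_labels.get(key) == value
               for key, value in template_labels.items())


template = {"gputype": "v100"}
node_a = {"gputype": "v100", "zone": "a"}  # extra labels on the node are fine
node_b = {"gputype": "t4"}

print(node_matches(template, node_a))
print(node_matches(template, node_b))
```

A template with no labels matches every node, which mirrors the default behavior where any node can run any job.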
# Controlling with instance type
A user can also add a selector to a job template that uses the instance type of the desired node group. This does not require adding a label or taint to the node group.
Go to Compute > Templates and edit the template to use only a specific instance type.
In Labels, add the following: `beta.kubernetes.io/instance-type: 'desired-instance-type'`.
For example, to enforce the template to use an `m5.xlarge`, add `beta.kubernetes.io/instance-type: 'm5.xlarge'` to the Selector.
# Docker Compose Installation on a GPU Machine
Docker Compose is not yet officially supported for GPU machines. However, it is still possible to set up Docker Compose to work in the context of a GPU machine using NVIDIA drivers. This relies on altering the `runtime` for Docker Compose. Complete the following steps using the guidelines provided:
- Install Docker Compose.
- Download and install the `nvidia-container-runtime`:
sudo apt-get install nvidia-container-runtime
- Add the NVIDIA runtime as the default Docker Compose runtime by running the following command:
sudo tee /etc/docker/daemon.json <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo pkill -SIGHUP dockerd
- Restart Docker using the following command:
sudo systemctl restart docker
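If you prefer to generate the `daemon.json` content programmatically (for example, to merge it with existing settings), the following Python sketch produces the same structure as the command above and checks that it is valid JSON. The function name `nvidia_daemon_config` is hypothetical, and the sketch only prints the result; writing it to `/etc/docker/daemon.json` is left to you.

```python
import json
from typing import Any, Dict


def nvidia_daemon_config() -> Dict[str, Any]:
    """Return the Docker daemon settings that make 'nvidia' the default runtime."""
    return {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": [],
            }
        },
    }


text = json.dumps(nvidia_daemon_config(), indent=2)
json.loads(text)  # round-trip to confirm the output parses as JSON
print(text)
```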
# Queues for Job Scheduling
Utilizing compute resources efficiently is crucial in any organization, as it represents a significant investment. When faced with a limited amount of compute resources and multiple jobs requiring access to the same resources, it is essential to prioritize and allocate those resources in a manner that aligns with the organization's goals and priorities.
To address this challenge, cnvrg introduces Queues, a system that enables job scheduling based on priority across resources. Through this system, users can assign a priority to a job and specify the list of compute resources required to execute it. cnvrg then runs the job on the most suitable compute resource based on its priority, optimizing resource utilization in line with the organization's objectives. The Queues feature is built on top of the cnvrg scheduler.
The cnvrg scheduler strategy aims to prioritize nodes that are already running GPU tasks for new GPU workloads. This approach consolidates GPU jobs onto fewer nodes, which can optimize GPU utilization and reduce resource fragmentation. Essentially, it stacks GPU tasks where GPUs are already in use, keeping some nodes focused on GPU work and others free for different tasks. This can help maintain dedicated resources for high-demand GPU workloads and improve overall efficiency.
# Enable
Before proceeding, ensure that the following prerequisites are met:
cnvrg app version > v3.10.0. The version is displayed in the bottom left corner of the cnvrg UI. You can also run this command to check:
kubectl -n cnvrg get cnvrgapp
cnvrg operator version > 4.0.0
You can also use kubectl with access to the cluster to check the current version, by running the following command:
kubectl -n cnvrg get deploy cnvrg-operator -o yaml | grep "image: "
# Install
When the prerequisites are met, queues are enabled automatically.
- Make sure compute resource is configured to use the cnvrg scheduler: Through the UI navigate to Compute → Resources → choose the relevant cluster’s menu → Edit Resource
- Then confirm that the `cnvrgScheduler` is selected:
- Using kubectl with access to the cluster, check that cnvrg-scheduler is running:
- Using kubectl with access to the cluster, check that `cnvrgScheduler` is enabled:
kubectl -n cnvrg edit cnvrgapp
## navigate to 'cnvrgScheduler', it should be enabled
cnvrgScheduler:
  enabled: true
Note: each cluster is assigned a dedicated cnvrg scheduler; however, users may opt to disable the scheduler and utilize Kubernetes as the cluster scheduler.
# How to use
After enabling the scheduler feature, organizations can create queues to prioritize the execution of jobs.
Each Queue will have:
- Title
- Value
- Preemption (On / Off): whether jobs running on this queue can be stopped to allow a higher-priority job to be executed
- Permissions to use this queue
Once a queue is defined, users can schedule a job on it, and it will be executed according to the queue's priority relative to other jobs waiting to be executed.
For example:
Consider the following scenario where an organization maintains two queues, Urgent and Default, for executing jobs.
If a user (or multiple users) initiates the execution of two jobs simultaneously, one job is assigned to the Default queue and the other to the Urgent queue, which has a higher priority than the Default queue.
The job assigned to the Urgent queue is executed ahead of the job in the Default queue. The job assigned to the Default queue starts only after the Urgent job completes.
Additionally, if two jobs are scheduled to run on the same queue at the same time, they will be executed based on the FIFO (First In First Out) principle by default.
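The ordering described in this example (higher-priority queue first, FIFO within the same priority) can be sketched with a heap in Python. This is a simplified model, not the cnvrg scheduler: the `JobQueue` class is hypothetical, and it assumes a larger number means a higher priority.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class JobQueue:
    """Priority queue that falls back to FIFO for equal priorities."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # increasing submission counter

    def submit(self, job: str, priority: int) -> None:
        # Negate the priority so larger values pop first; the arrival
        # counter breaks ties in submission order (FIFO).
        heapq.heappush(self._heap, (-priority, next(self._arrival), job))

    def next_job(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = JobQueue()
queue.submit("default-1", priority=1)
queue.submit("urgent-1", priority=2)
queue.submit("default-2", priority=1)

order = [queue.next_job() for _ in range(3)]
print(order)
```

The urgent job runs first even though it was submitted second, and the two default jobs keep their submission order.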
# Preemption
Preemption in the cnvrg platform allows a running job to be stopped in favor of a higher-priority job. As an illustration, suppose a job is running on a queue with priority 1 in which preemption is enabled. If a higher-priority (> 1) task is scheduled, the running lower-priority job is halted, and its resources are reassigned to the higher-priority task.
- Example workflow:
- Job A commences with priority: 1
- Job B is launched on a queue with priority: 2
- When a request to allocate resources for job B is received, job A is stopped and moved back to the queue (same priority, new start_commit). Then job B is executed. Note: Job A is stopped only if the resources it uses are sufficient for job B.
- Once Job B concludes and resources are available, Job A resumes with artifacts created in previous run.
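The decision in the workflow above can be summarized as three checks: preemption is enabled on the running job's queue, the incoming job has a higher priority, and the running job holds enough resources for the incoming one. The Python sketch below models this under simplifying assumptions: resources are reduced to a single GPU count, and the `Job` class and `should_preempt` function are hypothetical names, not cnvrg APIs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # assumed: larger number means higher priority
    gpus: float    # simplified stand-in for the job's resources


def should_preempt(running: Job, incoming: Job, preemption_enabled: bool) -> bool:
    """Decide whether the running job is stopped in favor of the incoming one."""
    if not preemption_enabled:
        return False
    if incoming.priority <= running.priority:
        return False
    # Stop the running job only if freeing it yields enough resources.
    return running.gpus >= incoming.gpus


job_a = Job("A", priority=1, gpus=1.0)
job_b = Job("B", priority=2, gpus=1.0)
job_c = Job("C", priority=2, gpus=2.0)

print(should_preempt(job_a, job_b, preemption_enabled=True))
print(should_preempt(job_a, job_c, preemption_enabled=True))
```

Job C does not preempt job A because stopping A would not free enough GPUs for C to start.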
# FAQ
Q: Is it possible to run this without using a queue?
A: The Default queue is selected unless a different one is specified.
Q: What happens if I want to run a job on an 'urgent' priority, but the necessary resources are not available? At the same time, I also want to run a task on 'default' priority, for which the resources are available.
A: The 'default' job runs. Resources are taken only for requests that can actually be allocated.
# GPU Sharing - MetaGPU
MetaGPU Device Plugin for Kubernetes is an open-source tool that enables sharing of NVIDIA GPUs between multiple Kubernetes workloads to improve GPU utilization and reduce operating costs.
Each GPU machine is divided into 100 MetaGPU units that can be used flexibly across different jobs.
The minimum amount of GPU that can be allocated through cnvrg is 0.1 GPU (10 MetaGPU units).
MetaGPU is an open-source solution and can be installed on any cluster independently.
This section outlines a guide for installing MetaGPU on your Kubernetes cluster, accompanied by detailed instructions to verify that all components are functioning as intended.
For more details, see the MetaGPU Overview.
# Prerequisites
Before proceeding, ensure that the following prerequisites are met:
- Nodegroup with NVIDIA GPU instances
- `kubectl` tool installed and kubeconfig with cluster access available
- cnvrg app version > v4.6.0. It is displayed in the bottom left corner of the cnvrg UI, but you can also run this command to check:
kubectl -n cnvrg get cnvrgapp
- cnvrg operator version > 4.0.0. You can check the current version by running the following command:
kubectl -n cnvrg get deploy cnvrg-operator -o yaml | grep "image: "
- helm tool (required for manual installation only)
# Installation
In order to enable MetaGPU, you must install the MetaGPU Device Plugin for Kubernetes in the cluster where you plan to run your GPU workloads.
- Edit the cnvrginfra CRD:
kubectl -n cnvrg edit cnvrginfra
### navigate to the MetaGpuDp configuration and add enabled: true
gpu:
  MetaGpuDp:
    enabled: true
Save the new configuration and exit.
- Terminate the current application pods to initiate new ones with updated configuration:
kubectl -n cnvrg rollout restart deploy app sidekiq searchkiq systemkiq
# Validation
After completing the installation process, perform the following verification tests to ensure that the changes have been successfully implemented:
- Verify MetaGPU presence is enabled in the configMap:
kubectl -n cnvrg describe configmap metagpu-presence
You should see enabled: true under the Data section.
- Verify the device plugin daemonSet exists:
kubectl -n cnvrg get daemonset metagpu-device-plugin
- Ensure the GPU node is properly labeled and tainted as required:
kubectl label node GPU_NODE accelerator=nvidia
kubectl taint node GPU_NODE cnvrg.io/metagpu=present:NoSchedule
kubectl taint node GPU_NODE nvidia.com/gpu=present:NoSchedule
This can also be done at the nodegroup level, to ensure any newly created or scaled nodes carry the same labels and taints. The method for applying them varies between cloud providers and Kubernetes distributions.
# Installing MetaGPU device plugin manually outside of cnvrg - OPTIONAL
While MetaGPU can be enabled directly from cnvrg, the device plugin itself can also be installed manually and independently outside of cnvrg context. This will allow you to run independent fractional GPU workloads.
- Clone the MetaGPU git repo:
git clone https://github.com/cnvrg/MetaGPU.git
- Navigate to the helm chart folder:
cd MetaGPU/chart
- Install the helm chart. Set ocp=true if installing on an OpenShift cluster:
helm install metagpu . --set ocp=false
- Ensure the GPU node is properly labeled and tainted as required:
kubectl label node GPU_NODE accelerator=nvidia
kubectl taint node GPU_NODE cnvrg.io/metagpu=present:NoSchedule
kubectl taint node GPU_NODE nvidia.com/gpu=present:NoSchedule
After the helm installation is complete, the MetaGPU device plugin daemonSet will deploy a pod on each GPU node, which will allow you to request the resource cnvrg.io/metagpu in your deployments.
Note: Each GPU is split into 100 MetaGPU units. When requesting the resource manually, state the number of units in your deployment's resource request. For example, for half of a GPU:
cnvrg.io/metagpu: 50
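The conversion from a fractional GPU value to MetaGPU units is a multiplication by 100. The Python sketch below illustrates it and builds the corresponding resource entry. The helper names are hypothetical, and the 10-unit floor applies the cnvrg minimum of 0.1 GPU stated earlier on this page; a manual installation may not enforce the same limit.

```python
from typing import Dict

UNITS_PER_GPU = 100  # each GPU is split into 100 MetaGPU units
MIN_UNITS = 10       # assumed floor: cnvrg's stated minimum of 0.1 GPU


def metagpu_units(gpu_fraction: float) -> int:
    """Convert a GPU fraction such as 0.5 into MetaGPU units."""
    units = round(gpu_fraction * UNITS_PER_GPU)  # round() absorbs float noise
    if units < MIN_UNITS:
        raise ValueError("minimum allocation is 0.1 GPU (10 MetaGPU units)")
    return units


def resource_request(gpu_fraction: float) -> Dict[str, int]:
    """Build the resource entry to place in a deployment's resource request."""
    return {"cnvrg.io/metagpu": metagpu_units(gpu_fraction)}


print(resource_request(0.5))
print(resource_request(1.5))
```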
# How to use
Now that MetaGPU is enabled, you can go ahead and create a custom compute template with fractional GPU specification:
- Navigate to the Compute tab → Templates → Add Compute Template → Choose the cluster you’ve enabled MetaGPU on.
- Provide all needed specifications. For GPU, you can now input a fractional value. Enter a number with a single decimal place, such as 0.5. You may also input values greater than 1.
Note: When you allocate 0.5 GPU, you also allocate the relative amount (50%) of GPU memory that exists on your machine.
Note: If you’re not able to specify fractions of GPU, the MetaGPU installation did not succeed. Please review the previous steps or contact support.
- The compute template is versatile and can be used with any cnvrg job.
Note that when fractional GPU workloads are executed, it is possible to overcommit and surpass the allocated portion of GPU utilization.
To illustrate, assuming a workspace runs on a compute template that has a 0.5 GPU value, it is possible for its running process to utilize the entire GPU capacity if no other consumers are utilizing it.
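The proportional memory allocation mentioned in the note above is simple arithmetic: the memory share equals the GPU fraction multiplied by the GPU's total memory. The Python sketch below works through it; the 16384 MB total is a hypothetical figure chosen for illustration, and `gpu_memory_share_mb` is not a cnvrg function.

```python
def gpu_memory_share_mb(total_memory_mb: int, gpu_fraction: float) -> int:
    """Return the GPU memory (in MB) that goes with a fractional GPU allocation."""
    if gpu_fraction <= 0:
        raise ValueError("gpu_fraction must be positive")
    return int(total_memory_mb * gpu_fraction)


# Hypothetical 16 GB GPU shared by a template with 0.5 GPU.
share = gpu_memory_share_mb(16384, 0.5)
print(share)
```

Unlike compute utilization, which can exceed the allocated fraction when the GPU is otherwise idle, this memory share acts as a limit when memory enforcement is on.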
# Memory Allocation and the MetaGPU Binary Tool
When you dedicate 0.5 of your GPU's computational power, the system also reserves a proportional 50% of your GPU's memory for your processes.
The system is designed not to exceed this 50% memory allocation limit. If your processes exceed it, you will encounter an out-of-memory (OOM) error.
To prevent this OOM error, there are a couple of solutions:
- Set the `memoryEnforcer` parameter to "false" at the cluster level, allowing over-allocation for all MetaGPU processes. This permits your system to allocate more memory than the initial 50% boundary, thus averting potential OOM errors. For more information, see Memory enforcement.
- Through your code, limit the maximum amount of memory your process is permitted to use. This reduces the likelihood of your code surpassing the memory limit and triggering an OOM error. You can use the `METAGPU_MAX_MEM` environment variable, which is added to every MetaGPU job.
In this section, we use an environment variable to limit memory allocation in code execution. This is an integral part of managing a cnvrg job with MetaGPU.
For example, TensorFlow's tf.config.LogicalDeviceConfiguration class can be used to limit the amount of GPU memory available for a particular logical device. Here's how to set the memory limit for a GPU using TensorFlow, with the "METAGPU_MAX_MEM" environment variable:
import os
import tensorflow as tf

# Example values seen inside a MetaGPU job:
# CNVRG_COMPUTE_MEMORY=4.0
# METAGPU_MAX_MEM=8150

# Read the limit that MetaGPU exposes to the job and convert it to an integer.
# This raises KeyError if the variable is not set.
memory_limit = int(os.environ['METAGPU_MAX_MEM'])

# Set the limit to GPU memory usage (TensorFlow expects the value in MB)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Set the memory limit for each GPU
        for gpu in gpus:
            tf.config.set_logical_device_configuration(
                gpu,
                [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)])
        print("GPU memory limit set successfully.")
    except RuntimeError as e:
        # Logical devices must be configured before the GPUs are initialized
        print("Error setting GPU memory limit:", e)

# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize pixel values to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create TensorFlow model
with tf.device('/GPU:0'):  # Use GPU
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, batch_size=4)

# Evaluate the model
model.evaluate(x_test, y_test)
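The environment lookup in the example above fails if `METAGPU_MAX_MEM` is missing or malformed, for instance when the same script runs outside a MetaGPU job. A more defensive variant, sketched below in plain Python with no TensorFlow dependency, returns a fallback instead. The helper name `metagpu_memory_limit` and the choice of fallback are assumptions for illustration.

```python
import os
from typing import Optional


def metagpu_memory_limit(default: Optional[int] = None) -> Optional[int]:
    """Read METAGPU_MAX_MEM as a positive integer, or return the default."""
    raw = os.environ.get("METAGPU_MAX_MEM")
    if raw is None or not raw.strip():
        return default  # variable not set: likely not a MetaGPU job
    try:
        limit = int(raw)
    except ValueError:
        return default  # unexpected format: fall back rather than crash
    return limit if limit > 0 else default


limit = metagpu_memory_limit()
if limit is None:
    print("METAGPU_MAX_MEM not set; leaving the GPU memory limit unchanged")
else:
    print(f"Limiting GPU memory to {limit}")
```

When the function returns `None`, the calling code can skip the `set_logical_device_configuration` step entirely.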
# MetaGPU Binary Tool - mgctl
To assist you in managing your resources effectively, MetaGPU includes a binary tool, mgctl. This tool enables you to monitor your current GPU memory status and how much of your GPU's computational power is being utilized in real time. When running the container, you can get more information on the "Memory for GPU" with this command:
mgctl get process -w
- Here the `-w` flag stands for "watch".
Review the "MEMORY" output column. Per the code above, it shows slightly more than 8 GB allocated.
# Support Matrix
Environment | Operator version | App version | Kubernetes version |
---|---|---|---|
eks | 4.3.28 | v4.7.80 | 1.23 |
aks | 4.3.28 | v4.7.80 | 1.23 |
On-premise | 4.3.28 | v4.7.80 | 1.23 |