# Distributed Jobs
cnvrg supports running distributed jobs, allowing you to run a single job using more than one Kubernetes node. This enables you to achieve higher utilization of your resources and perform the best compute methodology for each job.
cnvrg supports two forms of distributed jobs:
- Open MPI
The topics in this page:
Running distributed jobs using Spark is fully supported. To make use of Spark in a job, ensure you have chosen a compatible Spark container, Spark compatible code and a Spark template.
To use Spark, ensure you pick both a Spark compute and Spark container. cnvrg comes preconfigured with a default cnvrg Spark container you can use.
# Spark compute templates
A Spark compute template represents a master node and a certain amount of executors. The template sets the resources for the master nodes and executors separately.
For example, a template called
spark-medium with resources
4 CPUs 8 GB + 4 Workers: 4 CPUs 8.0 GB 0 GPUs, indicates that the Spark master node will have 4 CPUs and 8 GB of memory. It also means that when a Spark Session is opened, 4 new executor pods will be spun up with 4 CPUs and 8 GB of memory each.
# Master nodes and executors (Spark)
When you spin up a job using a Spark compute template and compatible Spark container, you are initially only spinning up a master node. This is true for both workspaces and experiments.
When you open a SparkSession within the job, the executor pods are spun up. As the resources need to be allocated and the pods spun up, this process is not immediate. The request may even trigger a scale up of your Kubernetes cluster and more nodes may be added. You can track this process from the SparkUI. Once all the executors are live, the Spark job will run.
When the SparkSession concludes, the executors will be spun down, but the master node will remain live until the workspace is closed or the experiment ends.
Whenever you create a SparkSession, wether in an experiment or a workspace, you will gain access to the SparkUI.
In a workspace
You can access the SparkUI by clicking SparkUI on the workspaces' sidebar.
In a experiment
You can access the SparkUI in an experiment from the menu on the experiments' page.
# Using data with Spark
When using Spark, a Spark compute template and a compatible Spark container, the data that you will be processing using Spark needs to be made available to all the Spark executors.
Therefore, you will need to use the remote link to the data you want to train on and not simply train on a mounted dataset. Instead of running your spark session using data from
/data use the link to the actual file.
spark = SparkSession.builder.getOrCreate() df = spark.read.csv("s3a://link-to-file")
If you are using cnvrg datasets, you can find the correct link by navigating to the file viewer for the dataset, finding the file and then copying the link.
# Open MPI
Open MPI is a technology that allows th running of distributed jobs over multiple nodes on your Kubernetes. cnvrg fully supports this use case and will simplify the DevOps, allowing you to spin up Open MPI jobs with little additional effort.
To run an Open MPI job, you must use a compatible Open MPI container, Open MPI compatible code and an Open MPI compute template.
To use Open MPI, ensure you pick both an Open MPI compute and Open MPI container.
# Open MPI compute templates
An Open MPU compute template represents a master node and a certain amount of executors. The template sets the resources for the master nodes and executors separately.
For example, a template called
Open-MPI-medium with resources
2 CPUs 4 GB + 2 Workers: 4 CPUs 16.0 GB 1 GPUs, indicates that the Open MPI master node will have 2 CPUs and 4 GB of memory. It also means that when the
mpirun command is called 2 new executor pods will be spun up with 4 CPUs and 16 GB of memory and 1 GPU each.
# Master nodes and executors (Open MPI)
When you spin up a job using an Open MPI compute template and compatible Open MPI container, both the master node and the workers nodes (executors) are spun up.
The code (git repo and Files) as well as chosen datasets are cloned to the master node and the workers nodes, so you can access them as you would in a regular cnvrg job.
When the job concludes, all the resources will be spun down.
# More information
To learn how to leverage Open MPI within your code, consult the Open MPI documentation.