# Distributed Jobs

cnvrg supports running distributed jobs, allowing you to run a single job using more than one Kubernetes node. This enables you to achieve higher utilization of your resources and apply the compute methodology best suited to each job.

cnvrg supports four forms of distributed jobs:

  • Open MPI
  • PyTorch Distributed
  • Spark on Kubernetes
  • Remote Spark Clusters

# Spark in cnvrg

Spark is a great tool for running distributed jobs and processing data efficiently. cnvrg supports two different ways of using Spark:

# Spark on Kubernetes

There is no setup required to use Spark on Kubernetes. As soon as you add a new Kubernetes cluster, cnvrg automatically creates Spark templates. When you run a job on one of these templates with a compatible Spark container and Spark-compatible code, cnvrg will spin up the Spark nodes on Kubernetes for you.

When the Spark session concludes, cnvrg will spin down the Spark nodes.

WARNING

To use Spark, ensure you pick both a Spark compute and Spark container. cnvrg comes pre-configured with a default cnvrg Spark container you can use.

# Spark on Kubernetes compute templates

A Spark on Kubernetes compute template represents a master node and a certain number of executors. The template sets the resources for the master node and executors separately.

You can create your own custom Spark compute templates.

For example, a template called spark-medium with resources 4 CPUs 8 GB + 4 Workers: 4 CPUs 8.0 GB 0 GPUs, indicates that the Spark master node will have 4 CPUs and 8 GB of memory. It also means that when a SparkSession is opened, 4 new executor pods will be spun up with 4 CPUs and 8 GB of memory each.
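
Once a session is live, you can read these values back from inside the job as a quick sanity check. The snippet below is a minimal sketch assuming the spark-medium example above; the configuration keys cnvrg actually sets may differ, so fallback defaults are supplied:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming the spark-medium example template above.
# The exact configuration keys cnvrg sets are not guaranteed, so fallback
# defaults avoid errors if a key is missing.
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

print(conf.get("spark.executor.instances", "not set"))  # expected: 4
print(conf.get("spark.executor.memory", "not set"))     # expected: 8g
print(conf.get("spark.executor.cores", "not set"))      # expected: 4
```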

# Spark on Kubernetes master nodes and executors

When you spin up a job using a Spark compute template and compatible Spark container, you are initially only spinning up a master node. This is true for both workspaces and experiments.

When you open a SparkSession within the job, the executor pods are spun up. As the resources need to be allocated and the pods spun up, this process is not immediate. The request may even trigger a scale up of your Kubernetes cluster and more nodes may be added. You can track this process from the SparkUI. Once all the executors are live, the Spark job will run.

When the SparkSession concludes, the executors will be spun down, but the master node will remain live until the workspace is closed or the experiment ends.
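
A minimal sketch of this lifecycle from inside a workspace or experiment (the workload itself is purely illustrative):

```python
from pyspark.sql import SparkSession

# Opening the session is what causes cnvrg to spin up the executor pods;
# you can track their progress in the SparkUI.
spark = SparkSession.builder.getOrCreate()

# Illustrative workload: runs once the executors are live.
df = spark.range(1_000_000)
print(df.count())

# Stopping the session concludes it, so the executors are spun down.
# The master node stays live until the workspace is closed or the
# experiment ends.
spark.stop()
```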

# Remote Spark clusters

If you have created a persistent remote Spark cluster, you can easily connect with and leverage it within cnvrg. There are some key differences when compared to Spark on Kubernetes:

  1. You must first add the remote Spark cluster as a compute resource.
  2. Then you need to define templates for the remote Spark cluster.

# Remote Spark compute templates

A remote Spark compute template represents a regular compute template (for the driver) on a Kubernetes cluster or machine and a certain number of executors in the remote Spark cluster.

To get started using the remote Spark cluster, you must first create one or more templates. In each template, you define the compute template that cnvrg will spin up as the driver and the specifications for the workers inside the remote Spark cluster.

For example, a template called remoteSparkSmall could have its driver set as small and have 2 workers each with 2 CPUs and 4 GB of memory.

# Remote Spark driver and executors

When you spin up a job using a remote Spark compute template and compatible Spark container, you are only spinning up a driver node on a Kubernetes cluster or machine (as defined in the compute template). This is true for both workspaces and experiments.

When you open a SparkSession within the job, instead of spinning up new nodes on Kubernetes, cnvrg will connect the driver to the remote Spark cluster using the configuration you supplied when setting up the Spark cluster in cnvrg. It will then use the number of workers and their specifications (as defined in the compute template) and start the Spark processing on the remote cluster. cnvrg will not directly manage the remote Spark cluster to spin up or spin down nodes.

When the SparkSession concludes, the driver node will remain live until the workspace is closed or the experiment ends.

# SparkUI

Whenever you create a SparkSession, whether in an experiment or a workspace, you will gain access to the SparkUI.

TIP

You can only view the SparkUI while the Spark context is live. Before the Spark context is created and after it ends, you will be unable to view the SparkUI.

In a workspace

You can access the SparkUI by clicking SparkUI in the workspace's sidebar.

In an experiment

You can access the SparkUI in an experiment from the menu on the experiment's page.

# Using data with Spark

When using Spark with a Spark compute template and a compatible Spark container, the data you will be processing needs to be available to all the Spark executors.

Therefore, you will need to use the remote link to the data you want to train on, rather than simply training on a mounted dataset. Instead of running your Spark session against data from /data, use the link to the actual file.

For example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3a://link-to-file")
```

If you are using cnvrg datasets, you can find the correct link by navigating to the file viewer for the dataset, finding the file and then copying the link.

# Open MPI

Open MPI is a technology that allows the running of distributed jobs over multiple nodes on your Kubernetes cluster. cnvrg fully supports this use case and will simplify the DevOps, allowing you to spin up Open MPI jobs with little additional effort.

To run an Open MPI job, you must use a compatible Open MPI container, Open MPI-compatible code, and an Open MPI compute template.

WARNING

To use Open MPI, ensure you pick both an Open MPI compute and Open MPI container.

# Open MPI compute templates

An Open MPI compute template represents a master node and a certain number of executors. The template sets the resources for the master node and executors separately.

You can create your own custom Open MPI compute templates.

For example, a template called Open-MPI-medium with resources 2 CPUs 4 GB + 2 Workers: 4 CPUs 16.0 GB 1 GPU, indicates that the Open MPI master node will have 2 CPUs and 4 GB of memory. It also means that when the mpirun command is called, 2 new executor pods will be spun up with 4 CPUs, 16 GB of memory, and 1 GPU each.
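
The code you run with mpirun is ordinary MPI code. Below is a minimal sketch using mpi4py, assuming it is available in your Open MPI container (the script name and payload are illustrative):

```python
# hello_mpi.py -- minimal sketch of Open MPI-compatible code using mpi4py.
# Assumes mpi4py is installed in the Open MPI container (an assumption,
# not a cnvrg guarantee).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's index across all MPI processes
size = comm.Get_size()   # total number of MPI processes

# Each rank contributes its own value; every rank receives the sum.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank} of {size}: allreduce total = {total}")
```

You would then launch it across the executors with mpirun, for example mpirun python3 hello_mpi.py (the exact flags depend on your container and template).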

# Master nodes and executors (Open MPI)

When you spin up a job using an Open MPI compute template and a compatible Open MPI container, both the master node and the worker nodes (executors) are spun up.

The code (git repo and Files) as well as chosen datasets are cloned to the master node and the worker nodes, so you can access them as you would in a regular cnvrg job.

When the job concludes, all the resources will be spun down.

# More information

To learn how to leverage Open MPI within your code, consult the Open MPI documentation.

# PyTorch Distributed

PyTorch has a fully native module, torch.distributed, that allows data scientists to easily parallelize their computations across multiple nodes. To do so, it leverages message-passing semantics, allowing each process to communicate data to any of the other processes. cnvrg has full support for using PyTorch in this fashion without you needing to perform any of the DevOps or MLOps work yourself.

To run a PyTorch Distributed job, you must use a compatible PyTorch container and a PyTorch Distributed compute template.

WARNING

To use PyTorch Distributed, ensure you pick both a PyTorch Distributed compute and PyTorch container.

# PyTorch Distributed compute templates

A PyTorch Distributed compute template represents a certain number of worker nodes (executors). The template sets the resources for the executors.

You can create your own custom PyTorch Distributed compute templates.

For example, a template called PyTorch-Dist-medium with resources 4 Workers: 4 CPUs 16.0 GB 1 GPU, indicates that each of the 4 worker nodes will have 4 CPUs, 16 GB of memory, and 1 GPU.

# Worker nodes (PyTorch Distributed)

When you spin up a job using a PyTorch Distributed compute template and a compatible PyTorch container, the defined worker nodes (executors) are spun up.

The code (git repo and Files) as well as chosen datasets are cloned to the worker nodes, so you can access them as you would in a regular cnvrg job.

When the job concludes, all the resources will be spun down.

# PyTorch Distributed commands

Once you have selected a compatible PyTorch Distributed template, you can run your code with a regular python command. For example, python my_script.py. cnvrg will automatically add the required flags to your command to ensure it runs correctly.

NOTE

Normally, you would need this full line to run your code: python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="localhost" --master_port=3000 my_script.py.

However, cnvrg automatically injects the extra part into your command, so you only need to enter python my_script.py.
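
For reference, here is a minimal sketch of a script that could be launched this way; the collective operation and script contents are illustrative, not a cnvrg-provided example:

```python
import argparse
import torch
import torch.distributed as dist

# torch.distributed.launch (which cnvrg wraps around your command) passes
# --local_rank to every process it starts and sets MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE in the environment.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args, _ = parser.parse_known_args()

# The default env:// initialization picks up the variables set by the launcher.
dist.init_process_group(backend="gloo")  # use "nccl" when training on GPUs

rank = dist.get_rank()
world_size = dist.get_world_size()

# Illustrative collective: every worker contributes its rank, all receive the sum.
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world_size} -> sum of ranks = {t.item()}")

dist.destroy_process_group()
```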

# More information

To learn how to leverage PyTorch Distributed within your code, consult the torch.distributed documentation.
