# Architecture Overview

# cnvrg Software Architecture

cnvrg is a Kubernetes-based deployment managed by a Kubernetes Operator. The platform consists of control plane nodes and worker nodes that run the ML workloads.

# Control Plane

cnvrg Application: Runs the main application; in charge of the Web UI, API services, cnvrg application logic, and cnvrg Scheduler.
cnvrg Sidekiq: Handles job orchestration, executes all cnvrg jobs, and monitors the lifecycle of each job. It also manages system metrics and sends alerts.
cnvrg Scheduler (when enabled): Picks app jobs submitted by users according to their submission time and priority.
PostgreSQL: A free and open-source relational database management system (RDBMS) that stores the cnvrg platform metadata. External PostgreSQL instances and managed solutions such as Amazon RDS are supported.
Redis: A distributed in-memory key-value database, cache, and message broker used as Sidekiq’s database. It stores job executions, schedules, and cron-type jobs.
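
The scheduler's ordering by priority and submission time can be illustrated with a small priority queue. This is a minimal sketch, not the cnvrg Scheduler's actual implementation: it assumes that a larger priority value is picked first and that ties go to the earlier submission, and the job names and the `pick_order` helper are hypothetical.

```python
import heapq


def pick_order(jobs):
    """Return job names in the order a priority-then-FIFO queue would pick them.

    Each job is a (name, priority, submitted_at) tuple.
    Assumption: a larger priority value means the job is picked earlier.
    """
    heap = []
    for seq, (name, priority, submitted_at) in enumerate(jobs):
        # heapq is a min-heap, so negate the priority; seq breaks exact ties.
        heapq.heappush(heap, (-priority, submitted_at, seq, name))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[-1])
    return order


# Hypothetical jobs: (name, priority, submission time in seconds).
jobs = [
    ("train-a", 1, 100.0),
    ("train-b", 5, 130.0),
    ("train-c", 5, 120.0),
    ("train-d", 1, 90.0),
]
print(pick_order(jobs))  # ['train-c', 'train-b', 'train-d', 'train-a']
```

Both priority-5 jobs are picked before the priority-1 jobs, and within each priority level the earlier submission goes first.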

# Logging Stack

Elasticsearch: Stores cnvrg logs and endpoint logs, and indexes dataset metadata.
Kibana: A free and open-source interface that helps visualize Elasticsearch data, navigate the Elastic Stack, and provide a dashboard for viewing data.
Fluent Bit: Collects logs from different cnvrg pods and forwards them to Elasticsearch.
ElastAlert: An alerting framework on top of Elasticsearch that monitors data and alerts on specific rules. Used to configure custom alerts on cnvrg Endpoints.

# Monitoring Stack

Prometheus: A time-series database used to store system metrics and custom metrics from cnvrg job exporters and other system exporters.
Grafana: A dashboard to view different metrics and visualizations from Prometheus and other sources.
Node Exporter: Provides hardware and OS-level system metrics exposed by *NIX kernels through metric collectors.

# cnvrg Storage

# Object Store

cnvrg uses S3-compatible storage components so that users can save their project files, artifacts, and datasets under managed, data-science-oriented version control.

Supported Storage Types:

S3 Bucket: For EKS clusters.
Google Cloud Bucket: For GKE clusters.
Azure Blob Storage: For AKS clusters.
MinIO: For on-premises clusters.

cnvrg can connect to different storage solutions as long as they support S3-compatible object storage.

# NFS Server

cnvrg can connect to an external NFS server to enable “Dataset Caching”. With caching enabled, a job can start immediately instead of re-downloading the datasets and copying the files from storage to the pod on every run. If a PVC that is already provisioned on the cluster can serve as the NFS server (that is, it allows read/write access), cnvrg can use it to enable dataset caching.

# cnvrg Networking

AI Studio supports different network configurations. The cluster can be installed and configured internally within the customer’s network or can be publicly available. Users interact with the cnvrg application over HTTP (port 80) or HTTPS (port 443). For HTTPS, a trusted wildcard TLS certificate should be provided.

# Ingress Controller

AI Studio supports different types of ingress controllers:

Istio (default)
Vanilla K8S Ingress Rules
OpenShift Routes
NodePort

# cnvrg Installation Requirements

This section describes the minimum and recommended resource requirements of a Kubernetes cluster. Ensure the following requirements are met for each orchestration.

# Compute Resources

| Node Type | CPU (Per Node) | Memory (Per Node) | Storage (Per Node) | Node Count |
|---|---|---|---|---|
| Kubernetes control plane | 4 CPU | 4 GB | 100 GB | 3 |
| cnvrg control plane | 8 CPU | 32 GB | 100 GB | 3 |
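
To see what these per-node figures add up to for the whole cluster, the arithmetic can be sketched as follows. This is only an illustration of the table above: the dictionary layout and the `cluster_totals` helper are hypothetical, and worker nodes for ML workloads are not included because the table does not size them.

```python
# Per-node figures copied from the compute resources table.
NODE_GROUPS = {
    "Kubernetes control plane": {"cpu": 4, "memory_gb": 4, "storage_gb": 100, "nodes": 3},
    "cnvrg control plane": {"cpu": 8, "memory_gb": 32, "storage_gb": 100, "nodes": 3},
}


def cluster_totals(groups):
    """Multiply each per-node figure by the node count and sum across groups."""
    totals = {"cpu": 0, "memory_gb": 0, "storage_gb": 0}
    for spec in groups.values():
        for key in totals:
            totals[key] += spec[key] * spec["nodes"]
    return totals


print(cluster_totals(NODE_GROUPS))
# {'cpu': 36, 'memory_gb': 108, 'storage_gb': 600}
```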

# Storage Resources

| Workload | Size (Minimum) | Size (Recommended) | Type |
|---|---|---|---|
| PostgreSQL | 80 GB | 200 GB | CSI-compatible (preferably block) |
| Elasticsearch | 80 GB | 200 GB | CSI-compatible (preferably block) |
| Prometheus | 50 GB | 100 GB | CSI-compatible (preferably block) |
| ElastAlert | 30 GB | 50 GB | CSI-compatible (preferably block) |
| Object storage | 1 TB | - | - |
| Notebooks/Experiments | 500 GB | - | CSI-compatible (preferably block) |
| Dataset caching | 500 GB | - | CSI-compatible/NFS |

# Network Resources

An unused, allocatable IP address from the Kubernetes subnet, for kube-proxy in IPVS mode.
Domain name (internal or external) e.g., cnvrg.my-company.com
DNS A wildcard record e.g., *.cnvrg.my-company.com -> 192.168.1.2
Trusted wildcard TLS certificates for HTTPS e.g., *.cnvrg.my-company.com
User/Password for SMTP access (if enabled)
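
When planning hostnames under the wildcard record and certificate, it helps to remember that a certificate wildcard conventionally covers exactly one left-most DNS label. The sketch below illustrates that rule; the `matches_wildcard` helper and the `app` hostname are hypothetical and are not part of cnvrg.

```python
def matches_wildcard(pattern: str, hostname: str) -> bool:
    """Return True if hostname is covered by a pattern like '*.cnvrg.my-company.com'.

    The wildcard stands for exactly one left-most DNS label, which is how
    wildcards in TLS certificates are conventionally matched.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard needs a non-empty label; all other labels must be identical.
        return bool(host_labels[0]) and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


pattern = "*.cnvrg.my-company.com"
print(matches_wildcard(pattern, "app.cnvrg.my-company.com"))    # True
print(matches_wildcard(pattern, "cnvrg.my-company.com"))        # False
print(matches_wildcard(pattern, "a.b.cnvrg.my-company.com"))    # False
```

Under this rule the bare domain and any deeper subdomains are not covered by the wildcard and would need their own certificate names.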

# Control Plane Pods Resource Requirements

The CPU and memory requirements below are expressed as:

Request: Minimum resources required to run the application.
Limit: Resources to allow components to burst under load.

cnvrg deploys a Horizontal Pod Autoscaler (HPA) that automatically scales a workload to match demand by increasing the component's pod count.
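
For intuition, the standard Kubernetes HPA computes the desired replica count as `ceil(currentReplicas * currentMetric / targetMetric)` and clamps it between the configured minimum and maximum. The following is a minimal sketch of that rule only: the target value and the replica bounds are hypothetical, since this page does not state the values cnvrg configures, and the controller's tolerance and stabilization behavior are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply the basic HPA scaling rule and clamp to [min_replicas, max_replicas].

    min_replicas and max_replicas are illustrative defaults, not cnvrg settings.
    """
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# Two pods averaging 90% CPU against a hypothetical 60% target scale to three.
print(desired_replicas(2, 90, 60))   # 3
# Load drops: four pods at 30% against the same target scale down to two.
print(desired_replicas(4, 30, 60))   # 2
```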

| Workload | Replicas | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
|---|---|---|---|---|---|---|
| webapp | 1 | 2000m | 4000m | 4Gi | 8Gi | - |
| sidekiq | 2 | 1000m | 2000m | 3750Mi | 8Gi | - |
| searchkiq | 1 | 750m | 2000m | 1Gi | 8Gi | - |
| systemkiq | 1 | 500m | 2000m | 1Gi | 8Gi | - |
| hyper | 1 | 100m | 2000m | 200Mi | 4Gi | - |
| postgres | 1 | 4000m | 12000m | 4Gi | 32Gi | 80Gi |
| redis | 1 | 100m | 1000m | 200Mi | 2Gi | 10Gi |
| elasticsearch | 1 | 2000m | 4000m | 4Gi | 8Gi | 80Gi |
| kibana | 1 | 100m | 1000m | 200Mi | 2Gi | - |
| elastalert | 1 | 100m | 400m | 200Mi | 800Mi | 30Gi |
| grafana | 1 | 100m | 200m | 100Mi | 200Mi | - |
| prometheus | 1 | 200m | 2000m | 500Mi | 4Gi | 50Gi |
| capsule | 1 | 200m | 1000m | 500Mi | 1Gi | 100Gi |
| cnvrg-operator | 1 | 500m | 1000m | 200Mi | 1Gi | - |
| config-reloader | 1 | 100m | 1000m | 200Mi | 1Gi | - |
| kube-state-metrics | 1 | 200m | 1000m | 200Mi | 1Gi | - |
| mpi-operator | 1 | 100m | 1000m | 100Mi | 1Gi | - |
| scheduler | 1 | 500m | 2000m | 100Mi | 4Gi | - |
| fluentbit | per node | 200m | 200m | 2000m | 2Gi | - |
| node-exporter | per node | 10m | 20m | 20Mi | 40Mi | - |
| dcgm-exporter | per NVIDIA GPU node | 100m | 500m | 100Mi | 1Gi | - |

TOTAL: 19 components, 12550m (~13 CPU) Request, 37600m (~38 CPU) Limit, 20Gi Request, 94Gi Limit, 350Gi Storage
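
The CPU request and storage figures in the totals line can be reproduced from the table. The sketch below assumes each workload is counted once and the per-node daemons (fluentbit, node-exporter, dcgm-exporter) are left out, which is the reading that matches the stated numbers. The `cpu_millicores` helper is hypothetical and handles only plain and `m`-suffixed CPU quantities.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '2000m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


# CPU requests copied from the table; per-node daemons are intentionally excluded.
CPU_REQUESTS = {
    "webapp": "2000m", "sidekiq": "1000m", "searchkiq": "750m",
    "systemkiq": "500m", "hyper": "100m", "postgres": "4000m",
    "redis": "100m", "elasticsearch": "2000m", "kibana": "100m",
    "elastalert": "100m", "grafana": "100m", "prometheus": "200m",
    "capsule": "200m", "cnvrg-operator": "500m", "config-reloader": "100m",
    "kube-state-metrics": "200m", "mpi-operator": "100m", "scheduler": "500m",
}

# Persistent storage per workload, in Gi, copied from the table.
STORAGE_GI = {
    "postgres": 80, "redis": 10, "elasticsearch": 80,
    "elastalert": 30, "prometheus": 50, "capsule": 100,
}

total_cpu_request = sum(cpu_millicores(q) for q in CPU_REQUESTS.values())
total_storage = sum(STORAGE_GI.values())

print(f"{total_cpu_request}m CPU request")  # 12550m CPU request
print(f"{total_storage}Gi storage")         # 350Gi storage
```

On clusters with many nodes, the per-node daemons add their own requests on top of these totals, multiplied by the node count.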

*Disclaimer: The above table refers to cnvrg control plane components and not to users' workloads (experiments, workspaces, and so on).
