# Architecture Overview
# cnvrg Software Architecture
cnvrg is a Kubernetes-based deployment managed by a Kubernetes Operator. The platform consists of control plane nodes and worker nodes that run the ML workloads.
# Control Plane
- cnvrg Application: Runs the main application; in charge of the Web UI, API services, cnvrg application logic, and cnvrg Scheduler.
- cnvrg Sidekiq: Handles job orchestration, executes all cnvrg jobs, and monitors the lifecycle of each job. It also manages system metrics and sends alerts.
- cnvrg Scheduler (when enabled): Picks app jobs submitted by users according to their submission time and priority.
- PostgreSQL: A free and open-source relational database management system (RDBMS) that stores the cnvrg platform metadata. External PostgreSQL and managed solutions like Amazon RDS are supported.
- Redis: A distributed in-memory key-value database, cache, and message broker used as Sidekiq’s database. It stores job executions, schedules, and cron-type jobs.
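
The scheduler's ordering by submission time and priority can be illustrated with a small priority queue. This is only a sketch: the `Job` type, the `pick_order` function, and the rule "higher priority first, then earlier submission" are assumptions made for illustration, not the actual cnvrg Scheduler implementation.

```python
import heapq
from datetime import datetime
from typing import List, NamedTuple


class Job(NamedTuple):
    """Hypothetical job record; field names are illustrative."""
    name: str
    priority: int          # assumption: a larger number means more urgent
    submitted_at: datetime


def pick_order(jobs: List[Job]) -> List[str]:
    """Return job names in the order a priority/time scheduler would pick them."""
    # heapq is a min-heap, so negate the priority to pop the highest first.
    # Ties on priority fall back to the earlier submission time.
    heap = [(-job.priority, job.submitted_at, job.name) for job in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


jobs = [
    Job("train-a", priority=1, submitted_at=datetime(2024, 1, 1, 9, 0)),
    Job("train-b", priority=5, submitted_at=datetime(2024, 1, 1, 9, 5)),
    Job("train-c", priority=5, submitted_at=datetime(2024, 1, 1, 9, 0)),
]
print(pick_order(jobs))
```

Here the two priority-5 jobs are picked before the priority-1 job, and the one submitted earlier goes first.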
# Logging Stack
- Elasticsearch: Stores cnvrg logs and endpoint logs, and indexes dataset metadata.
- Kibana: A free and open-source interface that helps visualize Elasticsearch data, navigate the Elastic Stack, and provides a dashboard for viewing data.
- Fluent Bit: Collects logs from different cnvrg pods and forwards them to Elasticsearch.
- ElastAlert: An alerting framework on top of Elasticsearch that monitors data and alerts on specific rules. Used to configure custom alerts on cnvrg Endpoints.
# Monitoring Stack
- Prometheus: A time-series database used to store system metrics and custom metrics from cnvrg job exporters and other system exporters.
- Grafana: A dashboard to view different metrics and visualizations from Prometheus and other sources.
- Node Exporter: Provides hardware and OS-level system metrics exposed by *NIX kernels through metric collectors.
# cnvrg Storage
# Object Store
cnvrg uses S3-compatible storage components so that users can save their project files, artifacts, and datasets under managed, data-science-oriented version control.
Supported Storage Types:
- S3 Bucket: For EKS clusters.
- Google Cloud Bucket: For GKE clusters.
- Azure Blob Storage: For AKS clusters.
- MinIO: For on-premises clusters.
cnvrg can connect to different storage solutions as long as they support S3-compatible object storage.
# NFS Server
cnvrg can connect to an external NFS server to enable “Dataset Caching”. With caching, a job starts immediately instead of re-downloading the datasets and copying the files from storage to the pod on every run. If a PVC that is already provisioned on the cluster can serve as the NFS server (that is, it allows read/write access), cnvrg can use it to enable dataset caching.
# cnvrg Networking
AI Studio supports different network configurations. The cluster can be installed and configured internally within the customer’s network or made publicly available. Users interact with the cnvrg application over HTTP (port 80) or HTTPS (port 443). For HTTPS, a trusted wildcard TLS certificate should be provided.
# Ingress Controller
AI Studio supports different types of ingress controllers:
- Istio (default)
- Vanilla K8S Ingress Rules
- OpenShift Routes
- NodePort
# cnvrg Installation Requirements
This section describes the minimum and recommended resource requirements of a Kubernetes cluster. Ensure the following requirements are met for each orchestration.
# Compute Resources
Node Type | CPU (Per Node) | Memory (Per Node) | Storage (Per Node) | Nodes Count |
---|---|---|---|---|
Kubernetes control plane | 4 CPU | 4GB | 100GB | 3 |
cnvrg control plane | 8 CPU | 32GB | 100GB | 3 |
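
A pre-installation check against the table above could be sketched as follows. The `NodeSpec` type, the role keys, and the `shortfalls` helper are hypothetical names introduced for illustration; only the numbers come from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources plus node count (hypothetical helper type)."""
    cpu: int
    memory_gb: int
    storage_gb: int
    count: int


# Values copied from the Compute Resources table; the keys are illustrative.
REQUIREMENTS = {
    "kubernetes-control-plane": NodeSpec(cpu=4, memory_gb=4, storage_gb=100, count=3),
    "cnvrg-control-plane": NodeSpec(cpu=8, memory_gb=32, storage_gb=100, count=3),
}


def shortfalls(role: str, available: NodeSpec) -> List[str]:
    """List every resource where the available nodes fall below the requirement."""
    required = REQUIREMENTS[role]
    problems = []
    for attr in ("cpu", "memory_gb", "storage_gb", "count"):
        have, need = getattr(available, attr), getattr(required, attr)
        if have < need:
            problems.append(f"{attr}: have {have}, need {need}")
    return problems


# Example: three 8-CPU nodes with only 16 GB of memory each.
print(shortfalls("cnvrg-control-plane", NodeSpec(8, 16, 100, 3)))
```

An empty list means the described nodes meet the listed requirements for that role.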
# Storage Resources
Workload | Size (Minimum) | Size (Recommended) | Type |
---|---|---|---|
PostgreSQL | 80GB | 200GB | CSI-compatible (preferably block) |
ElasticSearch | 80GB | 200GB | CSI-compatible (preferably block) |
Prometheus | 50GB | 100GB | CSI-compatible (preferably block) |
ElastAlert | 30GB | 50GB | CSI-compatible (preferably block) |
Object storage | 1TB | - | - |
Notebooks/Experiments | 500GB | - | CSI-compatible (preferably block) |
Dataset caching | 500GB | - | CSI-compatible/NFS |
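
For capacity planning, the volume sizes in the table can be added up. The sketch below assumes that object storage is provisioned separately (the table lists no volume type for it) and that a row without a recommended size falls back to its minimum. Both assumptions, and the names used, are illustrative.

```python
# (minimum GB, recommended GB or None where the table shows "-")
VOLUME_SIZES_GB = {
    "PostgreSQL": (80, 200),
    "ElasticSearch": (80, 200),
    "Prometheus": (50, 100),
    "ElastAlert": (30, 50),
    "Notebooks/Experiments": (500, None),
    "Dataset caching": (500, None),
}


def total_minimum_gb(sizes):
    """Sum of the minimum size of every volume."""
    return sum(minimum for minimum, _ in sizes.values())


def total_recommended_gb(sizes):
    """Sum of recommended sizes, using the minimum where none is listed."""
    return sum(
        recommended if recommended is not None else minimum
        for minimum, recommended in sizes.values()
    )


print(total_minimum_gb(VOLUME_SIZES_GB), total_recommended_gb(VOLUME_SIZES_GB))
```

Under these assumptions the volumes add up to 1240GB at minimum and 1550GB at the recommended sizes, in addition to the 1TB of object storage.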
# Network Resources
- An unused, allocatable IP address from the Kubernetes subnet, for kube-proxy in IPVS mode.
- Domain name (internal or external) e.g., cnvrg.my-company.com
- DNS A wildcard record e.g., *.cnvrg.my-company.com -> 192.168.1.2
- Trusted wildcard TLS certificates for HTTPS e.g., *.cnvrg.my-company.com
- User/Password for SMTP access (if enabled)
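
When choosing hostnames, it helps to remember that a wildcard TLS certificate covers exactly one DNS label in place of the `*`. The function below is a minimal sketch of that matching rule; the function name and the sample hostnames such as `app.cnvrg.my-company.com` are hypothetical.

```python
def covered_by_wildcard(hostname: str, pattern: str) -> bool:
    """Check whether a hostname is covered by a certificate name.

    A pattern such as "*.cnvrg.my-company.com" matches exactly one
    extra label to the left of the base domain.
    """
    host = hostname.lower()
    name = pattern.lower()
    if not name.startswith("*."):
        # Not a wildcard: only an exact match counts.
        return host == name
    suffix = name[1:]  # e.g. ".cnvrg.my-company.com"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    # The wildcard must stand for a single, non-empty label.
    return bool(label) and "." not in label


pattern = "*.cnvrg.my-company.com"
for host in ("app.cnvrg.my-company.com",
             "cnvrg.my-company.com",
             "a.b.cnvrg.my-company.com"):
    print(host, covered_by_wildcard(host, pattern))
```

Only the first hostname is covered. The bare domain and names nested two levels deep would need their own certificate entries.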
# Control Plane Pods Resource Requirements
The CPU and memory requirements below are expressed as:
- Request: Minimum resources required to run the application.
- Limit: Resources to allow components to burst under load.
cnvrg deploys a Horizontal Pod Autoscaler (HPA) to automatically scale workloads to match demand by increasing the pod count of a component.
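
The scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. The sketch below applies that rule with a tolerance band and replica bounds. The metric values and the bounds are hypothetical, since this page does not state which metrics or thresholds cnvrg configures.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Compute the replica count an HPA-style controller would aim for."""
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Example: 2 pods at 90% average CPU utilization with a 60% target.
print(desired_replicas(2, current_metric=90, target_metric=60))
```

With these example numbers the controller would aim for 3 pods.
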
Workload | Replicas | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage |
---|---|---|---|---|---|---|
webapp | 1 | 2000m | 4000m | 4Gi | 8Gi | - |
sidekiq | 2 | 1000m | 2000m | 3750Mi | 8Gi | - |
searchkiq | 1 | 750m | 2000m | 1Gi | 8Gi | - |
systemkiq | 1 | 500m | 2000m | 1Gi | 8Gi | - |
hyper | 1 | 100m | 2000m | 200Mi | 4Gi | - |
postgres | 1 | 4000m | 12000m | 4Gi | 32Gi | 80Gi |
redis | 1 | 100m | 1000m | 200Mi | 2Gi | 10Gi |
elasticsearch | 1 | 2000m | 4000m | 4Gi | 8Gi | 80Gi |
kibana | 1 | 100m | 1000m | 200Mi | 2Gi | - |
elastalert | 1 | 100m | 400m | 200Mi | 800Mi | 30Gi |
grafana | 1 | 100m | 200m | 100Mi | 200Mi | - |
prometheus | 1 | 200m | 2000m | 500Mi | 4Gi | 50Gi |
capsule | 1 | 200m | 1000m | 500Mi | 1Gi | 100Gi |
cnvrg-operator | 1 | 500m | 1000m | 200Mi | 1Gi | - |
config-reloader | 1 | 100m | 1000m | 200Mi | 1Gi | - |
kube-state-metrics | 1 | 200m | 1000m | 200Mi | 1Gi | - |
mpi-operator | 1 | 100m | 1000m | 100Mi | 1Gi | - |
scheduler | 1 | 500m | 2000m | 100Mi | 4Gi | - |
fluentbit | per node | 200m | 200m | 2000m | 2Gi | - |
node-exporter | per node | 10m | 20m | 20Mi | 40Mi | - |
dcgm-exporter | per NVIDIA GPU node | 100m | 500m | 100Mi | 1Gi | - |
TOTAL: 19 components, 12550m (~13 CPU) Request, 37600m (~38 CPU) Limit, 20Gi Request, 94Gi Limit, 350Gi Storage
Disclaimer: The table above covers the cnvrg control plane components only, not users' workloads (experiments, workspaces, and so on).
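
To cross-check the totals, the rows can be parsed and summed. This sketch counts each workload once regardless of its replica count and leaves out the per-node rows (fluentbit, node-exporter, dcgm-exporter). Both choices are assumptions, so a sum over the rows as listed may not reproduce every figure in the summary line.

```python
TABLE = """\
webapp | 1 | 2000m | 4000m | 4Gi | 8Gi | - |
sidekiq | 2 | 1000m | 2000m | 3750Mi | 8Gi | - |
searchkiq | 1 | 750m | 2000m | 1Gi | 8Gi | - |
systemkiq | 1 | 500m | 2000m | 1Gi | 8Gi | - |
hyper | 1 | 100m | 2000m | 200Mi | 4Gi | - |
postgres | 1 | 4000m | 12000m | 4Gi | 32Gi | 80Gi |
redis | 1 | 100m | 1000m | 200Mi | 2Gi | 10Gi |
elasticsearch | 1 | 2000m | 4000m | 4Gi | 8Gi | 80Gi |
kibana | 1 | 100m | 1000m | 200Mi | 2Gi | - |
elastalert | 1 | 100m | 400m | 200Mi | 800Mi | 30Gi |
grafana | 1 | 100m | 200m | 100Mi | 200Mi | - |
prometheus | 1 | 200m | 2000m | 500Mi | 4Gi | 50Gi |
capsule | 1 | 200m | 1000m | 500Mi | 1Gi | 100Gi |
cnvrg-operator | 1 | 500m | 1000m | 200Mi | 1Gi | - |
config-reloader | 1 | 100m | 1000m | 200Mi | 1Gi | - |
kube-state-metrics | 1 | 200m | 1000m | 200Mi | 1Gi | - |
mpi-operator | 1 | 100m | 1000m | 100Mi | 1Gi | - |
scheduler | 1 | 500m | 2000m | 100Mi | 4Gi | - |
"""


def to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '750m' to millicores."""
    return int(quantity[:-1])


def to_mebibytes(quantity: str) -> int:
    """Convert '4Gi' or '200Mi' to MiB; '-' means no value."""
    if quantity == "-":
        return 0
    value, unit = int(quantity[:-2]), quantity[-2:]
    return value * 1024 if unit == "Gi" else value


def totals(table: str) -> dict:
    """Sum each resource column over all rows of the table."""
    sums = {"cpu_request_m": 0, "cpu_limit_m": 0,
            "memory_request_mi": 0, "memory_limit_mi": 0, "storage_gi": 0}
    for line in table.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        _, _, cpu_req, cpu_lim, mem_req, mem_lim, storage = cells[:7]
        sums["cpu_request_m"] += to_millicores(cpu_req)
        sums["cpu_limit_m"] += to_millicores(cpu_lim)
        sums["memory_request_mi"] += to_mebibytes(mem_req)
        sums["memory_limit_mi"] += to_mebibytes(mem_lim)
        sums["storage_gi"] += to_mebibytes(storage) // 1024
    return sums


for key, value in totals(TABLE).items():
    print(key, value)
```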