Workload Support

Workloads are the basic unit of work in Run:ai. Researchers and engineers use workloads at every stage of the AI project lifecycle: to build, train, or deploy a model. Run:ai supports all types of Kubernetes workloads. Researchers can work with any workload in their organization but get the most value from Run:ai native workloads.

Run:ai offers three native types of workloads:

  • Workspace. Used for data preparation and model-building tasks.
  • Training. Used for training tasks.
  • Inference. Used for inference and model serving tasks.

Run:ai native workloads can be created via the Run:ai user interface, API, or command-line interface.

Levels of support

Different types of workloads have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai native workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are partially supported. The list of capabilities can change between Run:ai versions.

| Functionality | Run:ai: Training - Standard | Run:ai: Workspace | Run:ai: Inference | Run:ai: Training - distributed | Third-party: All K8s workloads |
|---|---|---|---|---|---|
| Fairness | v | v | v | v | v |
| Priority and preemption | v | v | v | v | v |
| Over quota | v | v | v | v | v |
| Node pools | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Fractions | v | v | v | v | v |
| Dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| GPU swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| Gang scheduling | v | v | v | v | v |
| Monitoring | v | v | v | v | v |
| RBAC | v | v | v | v | |
| Workload awareness | v | v | v | v | |
| Workload submission | v | v | v | v | |
| Workload actions (stop/run) | v | v | v | v | |
| Policies | v | v | v | v | |
| Scheduling rules | v | v | v | v | |

Note

Workload awareness

Workload awareness is workload-specific visibility: the different pods of a workload are identified and treated as a single workload (for example, in GPU utilization, the workload view, and dashboards).

Note

Workload actions, Scheduling rules

Actions and scheduling rules for distributed training are supported on clusters v2.20 and above with the matching training operator versions (see the installation docs).
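When choosing a workload type, it can help to check the support table programmatically. The sketch below encodes the table as a Python dictionary and looks up which required capabilities a workload type lacks. The type identifiers and the function names are illustrative, not part of any Run:ai API. It assumes that rows with only four marks apply to the four Run:ai native columns and not to third-party workloads, and it does not model the cluster version requirement for distributed training.

```python
# Illustrative identifiers for the table columns (not Run:ai API names).
RUNAI_TYPES = (
    "training-standard",
    "workspace",
    "inference",
    "training-distributed",
)
THIRD_PARTY = "third-party"
ALL_TYPES = RUNAI_TYPES + (THIRD_PARTY,)

_ALL = frozenset(ALL_TYPES)        # marked in every column
_NATIVE = frozenset(RUNAI_TYPES)   # assumption: four marks = native columns only

SUPPORT = {
    "Fairness": _ALL,
    "Priority and preemption": _ALL,
    "Over quota": _ALL,
    "Node pools": _ALL,
    "Bin packing / Spread": _ALL,
    "Fractions": _ALL,
    "Dynamic fractions": _ALL,
    "Node level scheduler": _ALL,
    "GPU swap": _ALL,
    # NA for standard training and workspaces in the table.
    "Elastic scaling": frozenset({"inference", "training-distributed", THIRD_PARTY}),
    "Gang scheduling": _ALL,
    "Monitoring": _ALL,
    "RBAC": _NATIVE,
    "Workload awareness": _NATIVE,
    "Workload submission": _NATIVE,
    "Workload actions (stop/run)": _NATIVE,
    "Policies": _NATIVE,
    "Scheduling rules": _NATIVE,
}


def is_supported(capability, workload_type):
    """Return True if the table marks the capability for the workload type."""
    if workload_type not in ALL_TYPES:
        raise ValueError(f"unknown workload type: {workload_type!r}")
    return workload_type in SUPPORT[capability]


def missing_capabilities(required, workload_type):
    """Return the required capabilities the workload type lacks, sorted by name."""
    return sorted(c for c in required if not is_supported(c, workload_type))
```

For example, `missing_capabilities(["Policies", "Fractions", "RBAC"], "third-party")` returns `['Policies', 'RBAC']`, which suggests a native workload type if policies or RBAC are required.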

Workload scopes

Workloads must be created under a project. A project is the fundamental organizational unit in the Run:ai account. To manage workloads, you must first create a project or have an administrator create one for you.

Policies and rules

Policies and rules let administrators set default values and enforce restrictions on workloads. This gives administrators tighter control, keeps workloads compliant with organizational policies, and optimizes resource usage.
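To illustrate the idea of defaults and restrictions, the sketch below applies a policy to a submitted workload specification. It is a conceptual example only: the field names (`gpus`, `image_pull_policy`), the dictionary layout, and the `apply_policy` function are hypothetical and do not reflect the Run:ai policy format.

```python
def apply_policy(spec, defaults, limits):
    """Fill in default values, then reject values above the allowed maximum.

    spec, defaults, and limits are plain dictionaries with hypothetical
    field names; this is not the Run:ai policy schema.
    """
    # Values submitted by the user override the administrator's defaults.
    result = {**defaults, **spec}
    for field, maximum in limits.items():
        if field in result and result[field] > maximum:
            raise ValueError(
                f"{field}={result[field]} exceeds the allowed maximum of {maximum}"
            )
    return result


# Hypothetical policy: default to 1 GPU, allow at most 4.
defaults = {"gpus": 1, "image_pull_policy": "IfNotPresent"}
limits = {"gpus": 4}

accepted = apply_policy({"gpus": 2}, defaults, limits)
```

Here `accepted` keeps the submitted `gpus` value and picks up the default `image_pull_policy`, while a request for 8 GPUs would raise a `ValueError`.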

Workload statuses

The following table describes the different phases in a workload life cycle.

| Phase | Description | Entry condition | Exit condition |
|---|---|---|---|
| Creating | Workload setup is initiated in the cluster. Resources and pods are now provisioning. | A workload is submitted | A multi-pod group is created |
| Pending | Workload is queued and awaiting resource allocation. | A pod group exists | All pods are scheduled |
| Initializing | Workload is retrieving images, starting containers, and preparing pods. | All pods are scheduled (handling of multi-pod groups TBD) | All pods are initialized or a failure to initialize is detected |
| Running | Workload is currently in progress with all pods operational. | All pods initialized (all containers in pods are ready) | Workload completion or failure |
| Degraded | Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details. | Pending: All pods are running but with issues. Running: All pods are running with no issues. | Running: All resources are OK. Completed: Workload finished with fewer resources. Failed: Workload failure or user-defined rules. |
| Deleting | Workload and its associated resources are being decommissioned from the cluster. | Deleting the workload | Resources are fully deleted |
| Stopped | The workload is on hold and resources are intact but inactive. | Stopping the workload without deleting resources | Transitioning back to the initializing phase or proceeding to deleting the workload |
| Failed | Image retrieval failed or containers experienced a crash. Check your logs for specific details. | An error occurs preventing the successful completion of the workload | Terminal state |
| Completed | Workload has successfully finished its execution. | The workload has finished processing without errors | Terminal state |
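The phases above can be read as a simple state machine. The following sketch is one possible interpretation of the entry and exit conditions, written in Python for illustration; it is not Run:ai code. The transition map is an assumption derived from the table: Degraded is treated as reachable from Pending and Running, and the user actions that lead into Stopped and Deleting are not modeled because the table does not list which phases they can start from.

```python
from enum import Enum


class Phase(Enum):
    CREATING = "Creating"
    PENDING = "Pending"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    DEGRADED = "Degraded"
    DELETING = "Deleting"
    STOPPED = "Stopped"
    FAILED = "Failed"
    COMPLETED = "Completed"


# Assumed transitions, inferred from the entry and exit conditions.
TRANSITIONS = {
    Phase.CREATING: {Phase.PENDING},
    Phase.PENDING: {Phase.INITIALIZING, Phase.DEGRADED},
    Phase.INITIALIZING: {Phase.RUNNING, Phase.FAILED},
    Phase.RUNNING: {Phase.COMPLETED, Phase.FAILED, Phase.DEGRADED},
    Phase.DEGRADED: {Phase.RUNNING, Phase.COMPLETED, Phase.FAILED},
    Phase.STOPPED: {Phase.INITIALIZING, Phase.DELETING},
    Phase.DELETING: set(),   # ends when resources are fully deleted
    Phase.FAILED: set(),     # terminal state
    Phase.COMPLETED: set(),  # terminal state
}

# The table labels only these two phases as terminal.
TERMINAL = frozenset({Phase.FAILED, Phase.COMPLETED})


def can_transition(src, dst):
    """Return True if the assumed map allows moving from src to dst."""
    return dst in TRANSITIONS[src]


def is_terminal(phase):
    """Return True for phases the table marks as terminal."""
    return phase in TERMINAL


def first_invalid_step(history):
    """Return the index of the first disallowed step in a phase history, or None."""
    for i in range(1, len(history)):
        if not can_transition(history[i - 1], history[i]):
            return i
    return None
```

A helper like `first_invalid_step` could be used to sanity-check a recorded sequence of phases, for example when reviewing logs for a workload that skipped the Pending or Initializing phase.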