Workload Support
Workloads are the basic unit of work in Run:ai. Researchers and engineers use workloads at every stage of the AI project lifecycle: to build, train, or deploy a model. Run:ai supports all types of Kubernetes workloads. Researchers can work with any workload in their organization but get the most value from Run:ai native workloads.
Run:ai offers three native types of workloads:
- Workspace. Used for data preparation and model-building tasks.
- Training. Used for training tasks.
- Inference. Used for inference and model serving tasks.
Run:ai native workloads can be created via the Run:ai user interface, API, or command-line interface.
Levels of support
Different types of workloads have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in Run:ai. Run:ai native workloads are fully supported with all of Run:ai's advanced features and capabilities, while third-party workloads are partially supported. The list of capabilities can change between Run:ai versions.
Functionality | Run:ai: Training - Standard | Run:ai: Workspace | Run:ai: Inference | Run:ai: Training - distributed | Third-party: All K8s workloads |
---|---|---|---|---|---|
Fairness | v | v | v | v | v |
Priority and preemption | v | v | v | v | v |
Over quota | v | v | v | v | v |
Node pools | v | v | v | v | v |
Bin packing / Spread | v | v | v | v | v |
Fractions | v | v | v | v | v |
Dynamic fractions | v | v | v | v | v |
Node level scheduler | v | v | v | v | v |
GPU swap | v | v | v | v | v |
Elastic scaling | NA | NA | v | v | v |
Gang scheduling | v | v | v | v | v |
Monitoring | v | v | v | v | v |
RBAC | v | v | v | v | |
Workload awareness | v | v | v | v | |
Workload submission | v | v | v | v | |
Workload actions (stop/run) | v | v | v | v | |
Policies | v | v | v | v | |
Scheduling rules | v | v | v | v | |
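To make the table easier to query when choosing a workload type, the sketch below transcribes it into a Python lookup. This is illustrative only and not part of any Run:ai API: the identifiers (`WORKLOAD_TYPES`, `SUPPORT`, `types_supporting`) and the short type names are hypothetical, and it assumes that a blank cell means "not supported" and that "NA" means "not applicable".

```python
# Capability matrix transcribed from the table above (illustrative only).
# Assumption: a blank cell means "not supported"; "NA" is treated the same way.
WORKLOAD_TYPES = [
    "training-standard",
    "workspace",
    "inference",
    "training-distributed",
    "third-party",
]

_ALL = set(WORKLOAD_TYPES)
_NATIVE = _ALL - {"third-party"}

SUPPORT = {
    "Fairness": _ALL,
    "Priority and preemption": _ALL,
    "Over quota": _ALL,
    "Node pools": _ALL,
    "Bin packing / Spread": _ALL,
    "Fractions": _ALL,
    "Dynamic fractions": _ALL,
    "Node level scheduler": _ALL,
    "GPU swap": _ALL,
    "Elastic scaling": {"inference", "training-distributed", "third-party"},
    "Gang scheduling": _ALL,
    "Monitoring": _ALL,
    "RBAC": _NATIVE,
    "Workload awareness": _NATIVE,
    "Workload submission": _NATIVE,
    "Workload actions (stop/run)": _NATIVE,
    "Policies": _NATIVE,
    "Scheduling rules": _NATIVE,
}


def types_supporting(required):
    """Return the workload types that support every capability in `required`."""
    return [t for t in WORKLOAD_TYPES if all(t in SUPPORT[c] for c in required)]
```

For example, `types_supporting(["Elastic scaling", "RBAC"])` narrows the choice to the inference and distributed training types, because third-party workloads have no RBAC entry in the table.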
Note
Workload awareness
Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).
Note
Workload actions, Scheduling rules
Actions and scheduling rules for distributed training are supported on clusters v2.20 and above with the matching training operator versions (see the installation docs).
Workload scopes
Workloads must be created under a project. A project is the fundamental organizational unit in the Run:ai account. To manage workloads, you must first create a project or have an administrator create one for you.
Policies and rules
Policies and rules let administrators set default values and enforce restrictions on workloads. This gives administrators tighter control, keeps workloads compatible with organizational policies, and helps optimize resource usage.
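As a rough illustration of "defaults plus restrictions", the sketch below fills in missing fields of a submitted workload spec and then rejects values outside an allowed range. It does not use Run:ai's policy schema or API: the function `apply_policy`, the field names (`gpu`, `cpu`), and the `(low, high)` limit format are all hypothetical.

```python
def apply_policy(spec, defaults, limits):
    """Fill missing fields from `defaults`, then enforce `limits`.

    All names and formats here are hypothetical, not Run:ai's schema.
    `limits` maps a field name to an inclusive (low, high) range.
    """
    # Values submitted by the user override the administrator's defaults.
    merged = {**defaults, **spec}
    for field, (low, high) in limits.items():
        value = merged.get(field)
        if value is not None and not (low <= value <= high):
            raise ValueError(
                f"{field}={value} is outside the allowed range [{low}, {high}]"
            )
    return merged


# Hypothetical administrator settings.
defaults = {"gpu": 1, "cpu": 2}
limits = {"gpu": (0, 4)}

# A submission that sets only `gpu` inherits the default `cpu` value.
accepted = apply_policy({"gpu": 2}, defaults, limits)
```

A submission such as `{"gpu": 8}` would raise `ValueError` under these example limits, which mirrors how a restriction blocks a non-compliant workload at submission time.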
Workload statuses
The following table describes the different phases in a workload life cycle.
Phase | Description | Entry condition | Exit condition |
---|---|---|---|
Creating | Workload setup is initiated in the cluster. Resources and pods are being provisioned. | A workload is submitted | A multi-pod group is created |
Pending | Workload is queued and awaiting resource allocation. | A pod group exists | All pods are scheduled |
Initializing | Workload is retrieving images, starting containers, and preparing pods. | All pods are scheduled (handling of multi-pod groups TBD) | All pods are initialized or a failure to initialize is detected |
Running | Workload is currently in progress with all pods operational. | All pods are initialized (all containers in pods are ready) | Workload completion or failure |
Degraded | Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details. | Pending: all pods are running but with issues. Running: all pods are running with no issues. | Running: all resources are OK. Completed: workload finished with fewer resources. Failed: workload failure or user-defined rules. |
Deleting | Workload and its associated resources are being decommissioned from the cluster. | The workload is deleted | Resources are fully deleted |
Stopped | The workload is on hold and resources are intact but inactive. | The workload is stopped without deleting resources | Transitioning back to the Initializing phase or proceeding to delete the workload |
Failed | Image retrieval failed or containers experienced a crash. Check your logs for specific details. | An error occurs preventing the successful completion of the workload | Terminal state |
Completed | Workload has successfully finished its execution. | The workload has finished processing without errors | Terminal state |
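The phases above can be read as a small state machine. The following sketch encodes only the exits that the table states explicitly. It is an interpretation, not Run:ai's implementation: the names `Phase`, `TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical, and mapping "a failure to initialize is detected" to the Failed phase is an assumption. The table does not say from which phases a workload may enter Degraded, Stopped, or Deleting, so those entries are left out.

```python
from enum import Enum


class Phase(Enum):
    CREATING = "Creating"
    PENDING = "Pending"
    INITIALIZING = "Initializing"
    RUNNING = "Running"
    DEGRADED = "Degraded"
    DELETING = "Deleting"
    STOPPED = "Stopped"
    FAILED = "Failed"
    COMPLETED = "Completed"


# Exits read from the "Exit condition" column of the table.
# Assumption: an initialization failure leads to the Failed phase.
TRANSITIONS = {
    Phase.CREATING: {Phase.PENDING},
    Phase.PENDING: {Phase.INITIALIZING},
    Phase.INITIALIZING: {Phase.RUNNING, Phase.FAILED},
    Phase.RUNNING: {Phase.COMPLETED, Phase.FAILED},
    Phase.DEGRADED: {Phase.RUNNING, Phase.COMPLETED, Phase.FAILED},
    Phase.STOPPED: {Phase.INITIALIZING, Phase.DELETING},
    Phase.DELETING: set(),   # ends when resources are fully deleted
    Phase.FAILED: set(),     # terminal state
    Phase.COMPLETED: set(),  # terminal state
}

# Only these two phases are marked "Terminal state" in the table.
TERMINAL = {Phase.FAILED, Phase.COMPLETED}


def can_transition(src, dst):
    """Return True if the table lists `dst` as an exit from `src`."""
    return dst in TRANSITIONS[src]


def is_terminal(phase):
    """Return True for phases the table marks as terminal."""
    return phase in TERMINAL
```

A monitoring script could use `is_terminal` to decide when to stop polling a workload, and `can_transition` to flag an unexpected phase change for investigation.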