Metrics and telemetry¶
Telemetry¶
Telemetry is a numeric measurement that is recorded in real time, at the moment it is emitted from the Run:ai cluster.
Metrics¶
Metrics are numeric measurements recorded over time that are emitted from the Run:ai cluster. Typical metrics cover utilization, allocation, and time measurements. Metrics are used in Run:ai dashboards as well as in the Run:ai administration user interface.
The purpose of this document is to detail the structure and purpose of metrics emitted by Run:ai. This enables customers to create custom dashboards or integrate metric data into other monitoring systems.
Run:ai provides metrics via the Run:ai Control-plane API. Previously, Run:ai provided metrics information through direct access to an internal metrics store. That method is deprecated but is still documented here.
Metric and telemetry Scopes¶
The Run:ai Control-plane API supports and aggregates metrics at the following levels.
| Level | Description |
|---|---|
| Cluster | A cluster is a set of Node Pools and Nodes. Cluster metrics are aggregated at the Cluster level. |
| Node | Data is aggregated at the Node level. |
| Node Pool | Data is aggregated at the Node Pool level. |
| Workload | Data is aggregated at the Workload level. For some Workloads, such as distributed workloads, these metrics aggregate data from all worker pods. |
| Pod | The basic execution unit. |
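
Each level in the table above corresponds to its own endpoint in the Control-plane API. The sketch below shows one way a client might build a metrics request for a given level. It is illustrative only: the route template, the host name, and the query parameter names (`metricType`, `start`, `end`, `numberOfSamples`) are assumptions, not taken from this document, so check them against the Run:ai API reference before use.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_metrics_url(base_url, path_template, resource_id, metric_types,
                      start, end, samples=20):
    """Return a GET URL asking for `metric_types` between `start` and `end`.

    `path_template` is supplied by the caller because the real route for
    each level (cluster, node pool, workload, pod) is not given here.
    """
    path = path_template.format(id=resource_id)
    query = urlencode(
        {
            "metricType": metric_types,   # assumed name; repeated once per metric
            "start": start.isoformat(),   # assumed name; ISO 8601 timestamp
            "end": end.isoformat(),       # assumed name; ISO 8601 timestamp
            "numberOfSamples": samples,   # assumed name
        },
        doseq=True,  # expand the metric list into repeated parameters
    )
    return f"{base_url.rstrip('/')}{path}?{query}"


url = build_metrics_url(
    "https://runai.example.com",        # hypothetical control-plane host
    "/api/v1/clusters/{id}/metrics",    # hypothetical cluster-level route
    "my-cluster-id",
    ["GPU_UTILIZATION", "ALLOCATED_GPU"],
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(url)
```

Passing the route as a parameter keeps the helper usable at every level without hard-coding paths that may differ between Run:ai versions.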
Supported Metrics¶
| Metric | Cluster | Node Pool | Node | Workload | Pod |
|---|---|---|---|---|---|
| API | Cluster API | Node Pool API | | Workload API | Pod API |
| ALLOCATED_GPU | TRUE | TRUE | TRUE | ||
| AVG_WORKLOAD_WAIT_TIME | TRUE | TRUE | |||
| CPU_LIMIT_CORES | TRUE | ||||
| CPU_MEMORY_LIMIT_BYTES | TRUE | ||||
| CPU_MEMORY_REQUEST_BYTES | TRUE | ||||
| CPU_MEMORY_USAGE_BYTES | TRUE | TRUE | TRUE | ||
| CPU_MEMORY_UTILIZATION | TRUE | TRUE | TRUE | ||
| CPU_REQUEST_CORES | TRUE | ||||
| CPU_USAGE_CORES | TRUE | TRUE | TRUE | ||
| CPU_UTILIZATION | TRUE | TRUE | TRUE | ||
| GPU_ALLOCATION | TRUE | ||||
| GPU_MEMORY_REQUEST_BYTES | TRUE | ||||
| GPU_MEMORY_USAGE_BYTES | TRUE | TRUE | |||
| GPU_MEMORY_USAGE_BYTES_PER_GPU | TRUE | TRUE | |||
| GPU_MEMORY_UTILIZATION | TRUE | TRUE | |||
| GPU_MEMORY_UTILIZATION_PER_GPU | TRUE | ||||
| GPU_QUOTA | TRUE | TRUE | |||
| GPU_UTILIZATION | TRUE | TRUE | TRUE | TRUE | |
| GPU_UTILIZATION_PER_GPU | TRUE | TRUE | |||
| POD_COUNT | TRUE | ||||
| RUNNING_POD_COUNT | TRUE | ||||
| TOTAL_GPU | TRUE | TRUE | |||
| TOTAL_GPU_NODES | TRUE | TRUE | |||
| GPU_UTILIZATION_DISTRIBUTION | TRUE | TRUE | |||
| UNALLOCATED_GPU | TRUE | TRUE | |||
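
Because metrics are recorded over time, a custom dashboard usually reduces each returned series to a few summary values. The following is a minimal sketch of that step, assuming the series has already been fetched and converted to `(timestamp, value)` pairs. That shape and the sample values are hypothetical and do not reflect the actual API response format.

```python
from statistics import mean


def summarize_series(samples):
    """Reduce (timestamp, value) samples to their min, mean, and max.

    Returns None for an empty series so callers can tell "no data"
    apart from a genuine zero reading.
    """
    values = [float(value) for _, value in samples]
    if not values:
        return None
    return {"min": min(values), "avg": mean(values), "max": max(values)}


# Hypothetical GPU_UTILIZATION samples (percent) for one workload.
gpu_utilization = [
    ("2024-01-01T00:00:00Z", 40.0),
    ("2024-01-01T00:05:00Z", 60.0),
    ("2024-01-01T00:10:00Z", 80.0),
]
summary = summarize_series(gpu_utilization)
print(summary)
```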
Advanced Metrics¶
NVIDIA provides extended metrics at the Pod level. These are documented here. To enable these metrics, contact Run:ai customer support.
| Metric | Cluster | Node Pool | Workload | Pod |
|---|---|---|---|---|
| GPU_FP16_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP32_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_FP64_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU | | | | TRUE |
| GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU | | | | TRUE |
| GPU_SM_ACTIVITY_PER_GPU | | | | TRUE |
| GPU_SM_OCCUPANCY_PER_GPU | | | | TRUE |
| GPU_TENSOR_ACTIVITY_PER_GPU | | | | TRUE |
Supported telemetry¶
| Telemetry | Node | Workload |
|---|---|---|
| API | Node API | Workload API |
| WORKLOADS_COUNT | TRUE | |
| ALLOCATED_GPUS | TRUE | TRUE |
| READY_GPU_NODES | TRUE | |
| READY_GPUS | TRUE | |
| TOTAL_GPU_NODES | TRUE | |
| TOTAL_GPUS | TRUE | |
| IDLE_ALLOCATED_GPUS | TRUE | |
| FREE_GPUS | TRUE | |
| TOTAL_CPU_CORES | TRUE | |
| USED_CPU_CORES | TRUE | |
| ALLOCATED_CPU_CORES | TRUE | |
| TOTAL_GPU_MEMORY_BYTES | TRUE | |
| USED_GPU_MEMORY_BYTES | TRUE | |
| TOTAL_CPU_MEMORY_BYTES | TRUE | |
| USED_CPU_MEMORY_BYTES | TRUE | |
| ALLOCATED_CPU_MEMORY_BYTES | TRUE | |
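
Telemetry values are point-in-time readings, so they combine naturally into derived figures for an external monitoring system. The sketch below aggregates per-node `TOTAL_GPUS` and `ALLOCATED_GPUS` readings into a cluster-wide allocation ratio. The input shape (one dictionary per node) and the node names are assumptions made for illustration, not the API's response format.

```python
def gpu_summary(node_readings):
    """Aggregate per-node telemetry readings into cluster-wide GPU totals."""
    total = sum(reading["TOTAL_GPUS"] for reading in node_readings)
    allocated = sum(reading["ALLOCATED_GPUS"] for reading in node_readings)
    return {
        "total": total,
        "allocated": allocated,
        # Guard against an empty cluster to avoid dividing by zero.
        "allocation_ratio": allocated / total if total else 0.0,
    }


# Hypothetical readings for two nodes.
readings = [
    {"node": "node-a", "TOTAL_GPUS": 8, "ALLOCATED_GPUS": 6},
    {"node": "node-b", "TOTAL_GPUS": 8, "ALLOCATED_GPUS": 2},
]
summary = gpu_summary(readings)
print(summary)
```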