AMD Resource Manager cluster node GPU metrics

Node GPU Metrics#

The node detail page provides per-GPU metric charts for each node in a cluster. Use these charts to monitor GPU health, resource consumption, and interconnect performance over time.

Filtering and time range#

At the top of the Device metrics section you can:

  • Filter by device – restrict charts to a specific GPU using the device selector.

  • Set the time range – choose between the last 1 hour, 24 hours, or 7 days.

Device metrics filter and time range selector.
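The effect of the two controls above can be pictured as a filter over timestamped samples. The following is a minimal sketch only: the range keys, the `(timestamp, device_id, value)` sample layout, and the `select_samples` helper are hypothetical names chosen for illustration, not part of AMD Resource Manager.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical keys for the three time ranges offered by the selector.
TIME_RANGES = {
    "1h": timedelta(hours=1),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
}


def select_samples(samples, range_key, now, device=None):
    """Keep samples inside the chosen time range, optionally for one device.

    `samples` is assumed to be a list of (timestamp, device_id, value) tuples.
    """
    start = now - TIME_RANGES[range_key]
    return [
        sample
        for sample in samples
        if start <= sample[0] <= now and (device is None or sample[1] == device)
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    samples = [
        (now - timedelta(minutes=30), "gpu0", 41.0),
        (now - timedelta(hours=5), "gpu1", 55.0),
    ]
    # Only the gpu0 sample falls inside the last hour.
    print(select_samples(samples, "1h", now))
```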

GPU Utilization#

Shows GPU activity over time. Switch between three views using the tabs:

| Tab | Description |
|---|---|
| Memory utilization | VRAM used as a percentage of total GPU memory |
| Clock speed | GPU clock frequency in MHz |
| GPU usage | Compute activity (GFX activity) as a percentage |

GPU Utilization chart with tab selector.
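As a worked illustration of the Memory utilization view, the percentage is VRAM used divided by total GPU memory. This is a sketch of that arithmetic only; the function name and the byte-based inputs are assumptions, and the product may compute or round the value differently.

```python
def memory_utilization_percent(used_bytes, total_bytes):
    """Return VRAM used as a percentage of total GPU memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    GIB = 2**30
    # Example figures chosen for illustration: 48 GiB used on a 192 GiB device.
    print(memory_utilization_percent(48 * GIB, 192 * GIB))  # 25.0
```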

GPU Temperature#

Shows thermal metrics over time. Switch between two views using the tabs:

| Tab | Description |
|---|---|
| Junction temperature | Die (hotspot) temperature in °C |
| Memory temperature | HBM temperature in °C |

GPU Temperature chart with tab selector.

GPU Power Consumption#

Shows the power draw in watts for each GPU over time.

GPU Power Consumption chart.

PCIe Traffic#

Shows PCIe interconnect data over time. Switch between two views using the tabs:

| Tab | Description |
|---|---|
| PCIe bandwidth | Raw data throughput (B/s, KB/s, MB/s, or GB/s depending on scale) |
| PCIe performance | Link speed as a percentage of the GPU’s maximum PCIe link speed (0–100%) |

PCIe bandwidth helps identify whether the interconnect is a bottleneck for data-intensive workloads. PCIe performance (efficiency) shows the current link speed relative to the GPU’s maximum PCIe link speed, which helps you check whether the link is operating at its expected generation and width.

PCIe Traffic chart.
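The two PCIe views can be illustrated with a short sketch: one helper scales a raw byte rate to the unit that fits, and the other expresses the current link speed as a percentage of the maximum. Both function names are hypothetical. The sketch assumes decimal unit steps (1 KB/s = 1000 B/s) and link speeds given in GT/s, neither of which is stated in this page.

```python
def format_throughput(bytes_per_second):
    """Scale a byte rate to B/s, KB/s, MB/s, or GB/s (assumes 1000-based steps)."""
    units = ["B/s", "KB/s", "MB/s", "GB/s"]
    value = float(bytes_per_second)
    for unit in units[:-1]:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    # Anything that is still large after three steps is reported in GB/s.
    return f"{value:.1f} {units[-1]}"


def link_speed_percent(current_gts, max_gts):
    """Return the current link speed as a percentage of the maximum, clamped to 0-100."""
    if max_gts <= 0:
        raise ValueError("max_gts must be positive")
    return max(0.0, min(100.0, 100.0 * current_gts / max_gts))


if __name__ == "__main__":
    print(format_throughput(1_500_000))    # 1.5 MB/s
    print(link_speed_percent(16.0, 32.0))  # 50.0
```

In this sketch, a link running at 16 GT/s on a GPU whose maximum is 32 GT/s reports 50%, which is the kind of reading that would prompt a check of the negotiated generation and width.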

GPU Devices#

Below the charts, the GPU Devices table lists each GPU on the node together with its current snapshot values for memory utilization, junction temperature, memory temperature, and power usage.

GPU Devices table with per-device snapshot values.

Workloads on Node#

The Workloads on Node table shows the workloads currently running on this node.

| Column | Description |
|---|---|
| Name | Unique identifier of the workload |
| Type | Workload type (e.g. Inference, Custom) |
| Status | Current execution status (e.g. Running, Pending, Failed) |
| GPU Devices | GPU device(s) assigned to the workload on this node |
| VRAM | Amount of GPU memory consumed by the workload |
| Created at | Date and time the workload was created |
| Project | Project the workload belongs to |

Workloads on Node table listing active workloads with type, status, GPU devices, VRAM, and project.