
Connecting a cluster to AMD Resource Manager#


Overview#

To connect a cluster to AMD Resource Manager, several components need to be installed on the cluster. They let the cluster communicate with AMD Resource Manager and integrate with the product. This guide walks through that install. The main pieces are:

| Component | What it does |
| --- | --- |
| Resource Manager cluster agent and Admission Webhook | Connects the cluster to AMD Resource Manager over RabbitMQ, applies cluster commands, keeps supported workload state in sync with the UI, and enforces policy on those workloads. |
| AMD GPU Operator and AMD Device Metrics Exporter | The operator makes AMD GPUs schedulable to Kubernetes workloads (driver, device plugin, GPU node labels) and runs the Device Metrics Exporter on every GPU node. The exporter reads AMD GPU hardware telemetry directly from each device and exposes it as a per-node Prometheus /metrics endpoint with per-GPU and per-workload labels (pod, namespace, container), so AMD Resource Manager can attribute utilization, clocks, power, thermals, memory (VRAM), ECC error counts, PCIe, and XGMI activity back to the workload that produced it. |

Details and versions follow in the next sections. If your cluster was already built by Cluster Forge and already includes the dependencies below, follow that product’s docs and only add what is missing.

Cluster Prerequisites#

  • Kubernetes 1.29+ on the cluster you are connecting.

  • kubectl: configured for the target cluster context.

  • Helm 3.8+: required to helm install / helm upgrade charts from OCI registries (for example oci://registry-1.docker.io/...). OCI support became generally available in Helm 3.8.0; earlier Helm 3 releases treated it as an experimental feature that had to be enabled explicitly.
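
As a quick preflight, the Helm minimum above can be checked from a shell. The sketch below is illustrative and not part of the product: version_ge is a hypothetical helper written for this example, it relies on sort -V (provided by GNU coreutils), and it inspects only the local Helm client, not the cluster.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeeds when dotted version $1 is >= $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_HELM="3.8.0"

if command -v helm >/dev/null 2>&1; then
  # `helm version --template` prints only the requested field, e.g. v3.14.0.
  helm_version="$(helm version --template '{{.Version}}' | sed 's/^v//')"
  if version_ge "$helm_version" "$MIN_HELM"; then
    echo "Helm $helm_version meets the $MIN_HELM minimum"
  else
    echo "Helm $helm_version is older than $MIN_HELM; upgrade before installing OCI charts" >&2
  fi
else
  echo "helm not found on PATH; install Helm $MIN_HELM or newer" >&2
fi
```

The comparison is kept in a small function so the same helper can be reused for other minimum-version checks your platform team requires.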

Important - Before you start#

Two things to line up before the installs: the Resource Manager cluster agent values you will pass to Helm, and a Prometheus-compatible time-series database for the metrics path you will wire up later in Forward Device Metrics Exporter metrics.

Resource Manager cluster agent (RabbitMQ and cluster name)#

| Value | Notes |
| --- | --- |
| RabbitMQ host | <AIRM_RABBITMQ_HOST>: Hostname or IP of RabbitMQ as used by AMD Resource Manager; set on the chart as airm.agent.rabbitmq.host. Must resolve from inside the cluster (for example a service DNS name or a host your network allows). |
| RabbitMQ port | <AIRM_RABBITMQ_PORT>: RabbitMQ AMQP port; set as airm.agent.rabbitmq.port. If you omit it, the chart defaults to 5672; confirm the port for your environment. |
| RabbitMQ username | <AIRM_RABBITMQ_USERNAME>: From the Connect a cluster command in AMD Resource Manager (see Connect a cluster in the UI). Use the cluster-id value shown in the UI snippet; it is the RabbitMQ user the agent must use. Store it under the username key in the Secret for the agent chart. |
| RabbitMQ password | <AIRM_RABBITMQ_PASSWORD>: From the same Connect a cluster snippet, use the cluster-secret value. Store it under the password key in the same Secret as username so the agent can authenticate to RabbitMQ. |
| Cluster name | <KUBE_CLUSTER_NAME>: User-defined and unique in your org. Set as airm.agent.clusterName; AMD Resource Manager uses it to identify the cluster. Use the same string in the Device Metrics Exporter config as CustomLabels.KUBE_CLUSTER_NAME so node heartbeats and metrics stay aligned (see Forward Device Metrics Exporter metrics). |

Time-series database#

You will need a Prometheus-compatible time-series database to receive Device Metrics Exporter samples — AMD Resource Manager queries it to render GPU and workload metrics in the UI. The choice of backend, scrape-and-forward agent, endpoint, tenancy, and authentication is yours; coordinate with your platform or observability team and see Forward Device Metrics Exporter metrics for the behavior the pipeline must satisfy.

Install platform prerequisites#

These two operators are needed by both the AMD GPU Operator and the Resource Manager cluster agent that follow, so install them first. Skip any item that is already on the cluster (healthy, at a compatible version, and exposing the APIs the rest of this guide expects)—confirm with kubectl, the vendor docs, or your platform team. Baselines that ship Cluster Forge-style stacks often already include these.

| Prerequisite | Why it is required |
| --- | --- |
| cert-manager | Issues TLS certificates for the admission webhooks used in this guide: the AMD GPU Operator, the Resource Manager Admission Webhook, and others such as Kaiwo. Your chosen metrics-forwarding stack may also require it. |
| External Secrets Operator | Supports ClusterSecretStore / SecretStore patterns common in AMD-qualified environments; configure stores for your secret backend (you can still create the RabbitMQ Secret manually if your process does not use ESO yet). |

cert-manager#

Install using the official method your platform has qualified — see the cert-manager installation guide. After installing, confirm the operator is healthy with kubectl get pods -n cert-manager.


External Secrets Operator#

Install using the official guide — see External Secrets — getting started. After installing, confirm the operator is healthy with kubectl get pods -n external-secrets.

You must still configure SecretStores / ClusterSecretStores and policies for your secret backend; that is environment-specific.


Install the AMD GPU Operator#

The AMD GPU Operator deploys the GPU control plane: NFD, KMM (for driver modules where that applies), the device plugin and node labeller, and the Device Metrics Exporter through the same stack—see the operator overview and the product features list. This guide assumes the operator is installed and that you let it manage the metrics exporter; the Configure the DeviceConfig and Metrics Exporter subsection then configures that operator-managed exporter for Resource Manager—no separate exporter chart is installed.

Prerequisites: cert-manager (admission webhooks for the operator) must be installed and healthy from the previous section. Match the Kubernetes and Helm minimums in the AMD GPU Operator install documentation.

Install the operator#

Install the AMD GPU Operator using the upstream guides; commands, chart versions, values, and platform-specific notes are maintained there.

Use the chart version your AMD or platform team has qualified. NFD and KMM ship as subcharts of the same install; you only install them separately if you changed chart values to disable the bundled subcharts. The upstream guides also cover post-install verification (controller, NFD, KMM, and device-stack pod health). After install, follow Configure the DeviceConfig and Metrics Exporter below to wire the operator-managed Device Metrics Exporter to Resource Manager.

Configure the DeviceConfig and Metrics Exporter#

Create or apply a DeviceConfig so the operator runs the device plugin, node labeller, and the Device Metrics Exporter on GPU nodes. For Resource Manager, your DeviceConfig must:

  • Enable the operator-managed metrics exporter — set spec.metricsExporter.enable: true.

  • Reference the gpu-config ConfigMap (created below) so the exporter emits the CustomLabels.KUBE_CLUSTER_NAME that Resource Manager expects. The exact field path for the ConfigMap reference varies by operator release — check the DeviceConfig reference for your version.

Create the gpu-config Metrics Exporter ConfigMap. The operator-managed exporter reads its JSON configuration from the ConfigMap named in spec.metricsExporter.config.name. Use gpu-config as the metadata.name (and reference the same name in your DeviceConfig). Start from the upstream Metrics Exporter ConfigMap example in the AMD GPU Operator reference, then add the AIRM-specific stanzas below to the GPUConfig block so Resource Manager can correlate workloads, projects, and the cluster name. Keep the rest of your upstream GPUConfig settings alongside them; standard JSON does not allow comments, so do not leave placeholder comments in the file. Set CustomLabels.KUBE_CLUSTER_NAME to the same value as airm.agent.clusterName on the Resource Manager cluster agent (see Before you start):

{
 "GPUConfig": {
  "ExtraPodLabels": {
   "WORKLOAD_ID": "airm.silogen.ai/workload-id",
   "PROJECT_ID": "airm.silogen.ai/project-id"
  },
  "CustomLabels": {
   "KUBE_CLUSTER_NAME": "<KUBE_CLUSTER_NAME>"
  }
 }
}

Save the Metrics Exporter ConfigMap manifest to a file (for example, gpu-metrics-exporter-config.yaml), then apply it before you apply (or re-apply) the DeviceConfig that references it, so the operator picks it up on the first reconcile:

kubectl apply -f gpu-metrics-exporter-config.yaml
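
The manifest applied by the command above is not shown in full in this guide. A minimal sketch of how the file could be generated follows; it is an illustration under stated assumptions, not the upstream example. The data key config.json is taken from the troubleshooting note later in this guide, the kube-amd-gpu namespace is the one used by the guide's kubectl commands, and the my-cluster default is hypothetical. Only the AIRM-specific stanzas are included, so merge in the rest of your upstream GPUConfig settings before applying.

```shell
#!/bin/sh
set -eu

# Hypothetical default; replace with the value you use for airm.agent.clusterName.
KUBE_CLUSTER_NAME="${KUBE_CLUSTER_NAME:-my-cluster}"

# Write the ConfigMap manifest. The JSON under config.json contains only the
# AIRM-specific stanzas; add your other upstream GPUConfig settings as needed.
cat > gpu-metrics-exporter-config.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-config
  namespace: kube-amd-gpu
data:
  config.json: |
    {
      "GPUConfig": {
        "ExtraPodLabels": {
          "WORKLOAD_ID": "airm.silogen.ai/workload-id",
          "PROJECT_ID": "airm.silogen.ai/project-id"
        },
        "CustomLabels": {
          "KUBE_CLUSTER_NAME": "${KUBE_CLUSTER_NAME}"
        }
      }
    }
EOF

echo "Wrote gpu-metrics-exporter-config.yaml for cluster ${KUBE_CLUSTER_NAME}"
```

Run the script before the kubectl apply step so the file exists, and compare the result against the upstream ConfigMap example for your operator release.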

Update your DeviceConfig so that it enables the operator-managed exporter and references the gpu-config ConfigMap you just applied. The relevant stanza looks like this; merge it into the spec of the DeviceConfig you copied from the upstream Full DeviceConfig example:

apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: gpu-operator
  namespace: kube-amd-gpu
spec:
  # ...other fields from the upstream Full DeviceConfig example...
  metricsExporter:
    enable: true
    config:
      name: gpu-config

Apply your edited DeviceConfig with kubectl apply -f <your-deviceconfig>.yaml and let the operator roll out the exporter and other GPU-stack pods.

Verify the Device Metrics Exporter#

Confirm the gpu-config ConfigMap is in place and that the operator-managed exporter pods and Service are running:

kubectl get configmap gpu-config -n kube-amd-gpu
kubectl get pods,svc -n kube-amd-gpu

The exporter pods should be Running, and the Service (typically on HTTP port 5000) should expose Prometheus-format metrics for your scrape-and-forward agent. If pods enter CrashLoopBackOff, check kubectl logs -n kube-amd-gpu <exporter-pod>; the most common causes are a malformed config.json in the ConfigMap or the exporter being scheduled on a non-GPU node. Note the Service DNS name and port for the next step.
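
Beyond pod status, it can help to confirm that the exporter output actually carries the cluster-name label. The sketch below is one possible check, not an official procedure: has_cluster_label is a hypothetical helper, the Service name in the usage comment is a placeholder, and the match is case-insensitive because the exact casing of the emitted label is an assumption.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: reads Prometheus exposition text on stdin and succeeds
# when at least one sample carries the expected cluster-name label value.
# Matching is case-insensitive because the emitted label casing may differ.
has_cluster_label() {
  grep -qiF -- "kube_cluster_name=\"$1\""
}

# Example usage against a live exporter (requires kubectl and curl; replace
# the placeholders with your Service name and cluster name):
#   kubectl port-forward -n kube-amd-gpu svc/<exporter-service> 5000:5000 &
#   curl -s http://localhost:5000/metrics | has_cluster_label "<KUBE_CLUSTER_NAME>" \
#     && echo "cluster label present"
```

If the check fails, review the CustomLabels block in the gpu-config ConfigMap and confirm the DeviceConfig references that ConfigMap.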

Forward Device Metrics Exporter metrics#

With the DeviceConfig and gpu-config ConfigMap applied and the exporter verified healthy, you need to ship the exporter’s metrics into a Prometheus-compatible time-series database AMD Resource Manager can query. The choice of agent and backend is yours — follow your platform team’s standard.

What AMD Resource Manager expects:

  • The metrics emitted by the operator-managed Device Metrics Exporter Service (Prometheus exposition format on the exporter’s HTTP port) end up in a time-series database the Resource Manager UI can read.

  • Each sample carries the CustomLabels.KUBE_CLUSTER_NAME you set in the gpu-config ConfigMap, matching airm.agent.clusterName on the cluster agent, so heartbeats and metrics correlate to the same cluster in the UI.

Pick a scrape-and-forward agent and a backing store that meet that expectation, then wire the agent to the Prometheus-compatible time-series database you lined up in Before you start.
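
As one illustration of the expectations above, the sketch below writes a Prometheus configuration fragment that scrapes the exporter endpoints and forwards samples with remote_write. It assumes you chose Prometheus (or a Prometheus-compatible agent) as the forwarder, which this guide does not require. The Service name and remote-write URL are hypothetical placeholders, and the file name is arbitrary.

```shell
#!/bin/sh
set -eu

# Hypothetical values; replace both with the ones from your environment.
EXPORTER_SERVICE="${EXPORTER_SERVICE:-replace-with-exporter-service-name}"
REMOTE_WRITE_URL="${REMOTE_WRITE_URL:-https://tsdb.example.com/api/v1/push}"

cat > prometheus-amd-exporter.yaml <<EOF
scrape_configs:
  - job_name: amd-device-metrics-exporter
    kubernetes_sd_configs:
      # Discover the endpoints behind Services in the GPU operator namespace,
      # so every per-node exporter pod is scraped individually.
      - role: endpoints
        namespaces:
          names:
            - kube-amd-gpu
    relabel_configs:
      # Keep only the endpoints that belong to the exporter Service.
      - source_labels: [__meta_kubernetes_service_name]
        regex: ${EXPORTER_SERVICE}
        action: keep
remote_write:
  # Forward scraped samples to the Prometheus-compatible backend.
  - url: ${REMOTE_WRITE_URL}
EOF

echo "Wrote prometheus-amd-exporter.yaml"
```

Endpoint discovery is used here instead of a single static Service target so that each node's exporter is scraped rather than whichever pod the Service happens to route to. The cluster-name label comes from the exporter itself, so this fragment adds no relabeling for it; authentication and tenancy settings for the backend are left out and depend on your environment.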


Installing the Resource Manager cluster agent and Admission Webhook#

This is the in-cluster part of the flow: install agent-specific CRD providers, create RabbitMQ credentials from the UI, then install the Helm chart that runs the Resource Manager cluster agent and Admission Webhook Deployments.

Install agent-specific prerequisites#

The agent and its Admission Webhook reconcile workload CRDs supplied by Kaiwo and AIM Engine. Install whichever of these is not yet on the cluster (skip if your baseline already includes them at a compatible version).

| Prerequisite | Why it is required |
| --- | --- |
| Kaiwo | Supplies KaiwoJob, KaiwoService, and related APIs the Resource Manager cluster agent and Admission Webhook integrate with. |
| AIM Engine | Supplies AIMService and related CRDs/controllers the Resource Manager cluster agent and Admission Webhook reconcile. |

Kaiwo#

Install Kaiwo (dependencies and operator) using the official guide. That guide covers Helm and manifest installs, optional dependency bundles, GPU-related prerequisites, and how to verify the installation.

Follow the release or version your AMD or platform team has qualified for use with AMD Resource Manager.


AIM Engine#

Install AIM Engine (CRDs and controllers) using the AIM Engine GitHub repository — follow the README / docs installation section (or the tag your platform was validated against).


Install the cluster agent (Helm)#

This example assumes the platform prerequisites and agent-specific prerequisites are installed, and that you have the RabbitMQ credentials from Connect a cluster at hand (see Connect a cluster in the UI and the placeholders in Important - Before you start).

1. Inspect the chart from OCI before you install. Without --version, helm show chart prints the metadata for the latest published chart; its version and appVersion fields can help you choose the <CHART_VERSION> to use in step 2. Use a version your platform has qualified:

helm show chart oci://registry-1.docker.io/amdenterpriseai/airm-agent-chart

2. Create the namespace and the RabbitMQ Secret, then install from OCI:

# Create the airm namespace
kubectl create namespace airm --dry-run=client -o yaml | kubectl apply -f -

# RabbitMQ user credentials (AMD Resource Manager > Connect a cluster: cluster-id and cluster-secret)
kubectl create secret generic airm-rabbitmq-common-vhost-user \
  --namespace airm \
  --from-literal=username='<AIRM_RABBITMQ_USERNAME>' \
  --from-literal=password='<AIRM_RABBITMQ_PASSWORD>'

# Install the Cluster Agent and Admission Webhook from the OCI chart. Pick <CHART_VERSION> from
#   `helm show chart`.
helm upgrade --install agent oci://registry-1.docker.io/amdenterpriseai/airm-agent-chart \
  --version '<CHART_VERSION>' \
  -n airm --create-namespace \
  --set airm.agent.clusterName='<KUBE_CLUSTER_NAME>' \
  --set airm.agent.rabbitmq.host='<AIRM_RABBITMQ_HOST>'

airm.agent.rabbitmq.port defaults to 5672; add --set airm.agent.rabbitmq.port=<port> for a different AMQP port.
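
If you prefer a values file to repeated --set flags, the same three settings can be written to YAML. The sketch below derives the key layout directly from the --set paths shown above; the file name and the default values are hypothetical, and no other chart values are assumed.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults; replace with the values described in
# "Important - Before you start".
KUBE_CLUSTER_NAME="${KUBE_CLUSTER_NAME:-my-cluster}"
AIRM_RABBITMQ_HOST="${AIRM_RABBITMQ_HOST:-rabbitmq.example.com}"
AIRM_RABBITMQ_PORT="${AIRM_RABBITMQ_PORT:-5672}"

# Mirror airm.agent.clusterName, airm.agent.rabbitmq.host, and
# airm.agent.rabbitmq.port as nested YAML keys.
cat > airm-agent-values.yaml <<EOF
airm:
  agent:
    clusterName: "${KUBE_CLUSTER_NAME}"
    rabbitmq:
      host: "${AIRM_RABBITMQ_HOST}"
      port: ${AIRM_RABBITMQ_PORT}
EOF

# Then install with the values file instead of the --set flags:
#   helm upgrade --install agent oci://registry-1.docker.io/amdenterpriseai/airm-agent-chart \
#     --version '<CHART_VERSION>' -n airm --create-namespace -f airm-agent-values.yaml
```

A values file keeps the cluster name and RabbitMQ endpoint under version control, which makes later helm upgrade runs repeatable. The RabbitMQ credentials stay in the Secret and are not written to this file.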

Verify agent and webhook#

After install, confirm all controller Pods are healthy: the agent should show a good RabbitMQ connection in its logs, and the webhook should become Ready once cert-manager has delivered the webhook TLS secret. Use the commands below to list Pods and tail logs if something is stuck (for example ImagePullBackOff or the webhook never Ready).

kubectl get pods -n airm
kubectl logs -n airm deploy/agent-agent --tail=50
kubectl logs -n airm deploy/agent-agent-webhook --tail=30

Cluster readiness checklist (before expecting “healthy” in the AMD Resource Manager UI)#

After completing the installation, connecting the UI, and installing the Resource Manager cluster agent, open AMD Resource Manager and confirm the cluster shows Connected (heartbeats can take a short time).

| Check | Command or action |
| --- | --- |
| cert-manager | kubectl get pods -n cert-manager |
| External Secrets | kubectl get pods -n external-secrets |
| AMD GPU Operator | kubectl get pods -n kube-amd-gpu (or your org’s namespace); expect controller, NFD, KMM, and device stack Ready after you applied DeviceConfig (see Install the AMD GPU Operator). Skip if Cluster Forge already installed it. |
| Kaiwo | As described in the Kaiwo installation guide (for example operator pods in kaiwo-system and CRDs present). |
| AIM Engine | As described in the AIM Engine installation instructions for your release. |
| Resource Manager cluster agent + Admission Webhook | kubectl get pods -n airm; expect …-agent and …-agent-webhook Running. kubectl logs -n airm deploy/agent-agent should show the RabbitMQ connection and heartbeats; kubectl logs -n airm deploy/agent-agent-webhook should show the Admission Webhook serving without TLS errors after cert-manager issues the certificate. |
| Device Metrics Exporter | Operator-managed Device Metrics Exporter running on GPU nodes after DeviceConfig is applied and the gpu-config ConfigMap is in place (see Verify the Device Metrics Exporter); check with kubectl get pods,svc -n kube-amd-gpu. |
| Metrics forwarding | Your chosen scrape-and-forward agent is healthy and writing exporter samples to your time-series database; verify by querying a known AMD exporter metric in your backend a few scrape intervals after deploy (see Forward Device Metrics Exporter metrics). |
| AMD Resource Manager metrics backend | The AMD Resource Manager API queries metrics with a Prometheus client, so the time-series database you chose must expose the Prometheus query API (Mimir, Cortex, Thanos, VictoriaMetrics, and Prometheus all do). Configure the AIRM control plane with that backend’s query URL (plus tenant and auth if required), reachable from the AIRM API. Verify by opening the cluster in the UI; GPU and workload metrics should appear a few minutes after the agent connects. |
| Resource Manager UI | Cluster shows healthy after heartbeats (typical window up to a few minutes). |
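
The pod checks in the checklist can be run in one pass. The sketch below is a convenience under stated assumptions, not a supported tool: count_not_running is a hypothetical helper, the namespace list is taken from the kubectl commands in this guide and may differ in your org, and a Running status is a coarser signal than Ready, so treat a clean result as a first pass only.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods have a STATUS (third column) other than Running
# or Completed. Blank lines are ignored.
count_not_running() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Namespaces used by the kubectl commands in this guide; adjust for your org.
NAMESPACES="cert-manager external-secrets kube-amd-gpu airm"

if command -v kubectl >/dev/null 2>&1; then
  for ns in $NAMESPACES; do
    if pods="$(kubectl get pods -n "$ns" --no-headers 2>/dev/null)"; then
      echo "$ns: $(printf '%s\n' "$pods" | count_not_running) pod(s) not Running or Completed"
    else
      echo "$ns: could not list pods (check the namespace and your kubectl context)" >&2
    fi
  done
else
  echo "kubectl not found on PATH; skipping the pod checks" >&2
fi
```

Any namespace that reports a non-zero count is a good place to start with kubectl describe pod and kubectl logs before checking the Resource Manager UI.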