
Install AMD Enterprise AI Suite on Premises#

This article explains how to install AMD Enterprise AI Suite in an on-premises environment, covering the full stack from metal to application layer in a streamlined manner.

You will use an installation tool called Cluster Bloom to first install and configure a Kubernetes cluster, and then install AMD Enterprise AI Suite.

Prerequisites#

Before beginning the software installation, ensure that your system and network environment meet the following requirements. Proper configuration of these elements is crucial for the security, accessibility, and functionality of the application.

System requirements#

To install the platform, your system should meet the following requirements:

  • AMD MI300X GPU

  • Ubuntu (supported versions checked at runtime)

  • Sufficient disk space (500GB+ recommended for root partition, 3TB+ for data and workloads)

  • Unformatted (“raw”) NVMe drives for optimal storage configuration

  • Root/sudo access
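Some of these requirements can be checked before you start. The following is a minimal sketch, not part of Cluster Bloom: the function names are illustrative, the 500 GB figure is taken from the list above, and the disk check assumes GNU/Linux `df` and `lsblk` as found on Ubuntu.

```shell
#!/bin/sh
# Preflight sketch. Function names are hypothetical helpers, not Cluster Bloom commands.

# Succeeds if the filesystem holding path $1 has at least $2 GB available.
has_min_avail_gb() {
    # df -Pk prints one line per filesystem in 1024-byte blocks; field 4 is "Available".
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$((avail_kb / 1048576))" -ge "$2" ]
}

# Succeeds if the current user is root.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

# Reads "NAME TYPE [FSTYPE]" lines (as printed by: lsblk -dn -o NAME,TYPE,FSTYPE)
# and prints whole disks that carry no filesystem signature.
# Caveat: a disk that has partitions also shows an empty FSTYPE, so review the
# full lsblk output before treating a device as raw.
raw_disks() {
    awk 'NF == 2 && $2 == "disk" {print $1}'
}
```

For example, `has_min_avail_gb / 500 || echo "root partition below 500 GB"` reports a small root partition, and `lsblk -dn -o NAME,TYPE,FSTYPE | raw_disks` lists candidate devices.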

Domain names#

Before installing the AMD Enterprise AI Suite, you’ll need a domain name (such as myapp.example.com) that points to your server’s IP address.

If you don’t have a DNS-enabled domain available, you may use a .nip.io domain, which automatically resolves to your service’s IP address. Example: https://203.0.113.10.nip.io (format is https://<master-node-ip-address>.nip.io). This will resolve directly to 203.0.113.10 and allow HTTPS access without DNS setup.

Note

A .nip.io domain is automatically created for you as part of the installation process.
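To illustrate the naming scheme, the sketch below builds the .nip.io hostname from an IPv4 address. The helper names `is_ipv4` and `nip_domain` are hypothetical and are not provided by the installer.

```shell
#!/bin/sh
# Succeeds if $1 is a dotted-quad IPv4 address with each octet in 0..255.
is_ipv4() {
    printf '%s\n' "$1" | awk -F. '
        NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }'
}

# Prints <ip>.nip.io for a valid IPv4 address; fails otherwise.
nip_domain() {
    is_ipv4 "$1" || { echo "not an IPv4 address: $1" >&2; return 1; }
    printf '%s.nip.io\n' "$1"
}
```

Calling `nip_domain 203.0.113.10` prints `203.0.113.10.nip.io`, which matches the example above.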

TLS certificates#

A valid TLS certificate must be configured for the chosen domain to enable secure HTTPS connections to your services. To set up a TLS certificate, select one of the following options:

  • You can provide your own trusted TLS certificate purchased from a certificate authority.

  • You can create a self-signed certificate during the installation process.

  • You can use the free Let’s Encrypt service to automatically generate one. When using Let’s Encrypt, your setup must meet one of these requirements: either have port 80 accessible from the internet (allowing Let’s Encrypt to verify domain ownership through your website), or have DNS management capabilities that allow automated domain validation (where Let’s Encrypt can verify ownership by temporarily adding DNS records to your domain).
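If you bring your own certificate, it can help to confirm that it has not expired and that it covers your domain before installing. The sketch below uses the standard `openssl x509` options `-checkend` and `-checkhost`. The function names and the file name `cert.pem` are assumptions for illustration only.

```shell
#!/bin/sh
# Succeeds if the PEM certificate in file $1 is still valid $2 days from now.
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Succeeds if the certificate in file $1 matches hostname $2.
# openssl prints "... does match certificate" or "... does NOT match certificate".
cert_matches_host() {
    openssl x509 -in "$1" -noout -checkhost "$2" | grep -q 'does match'
}
```

A possible use is `cert_valid_for_days cert.pem 30 && cert_matches_host cert.pem myapp.example.com`, which succeeds only if the certificate has at least 30 days left and is issued for that host name.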

Network access#

AMD Enterprise AI Suite requires HTTPS for all external access. When you specify a domain for the service (whether it’s a custom domain or a convenience domain like *.nip.io), the application will also use this domain name internally when making callbacks to itself. Specifically, you should ensure that:

  • The application is reachable at https://your-domain.

  • Port 443 is open on your firewall and routed to the application service.

Note

.nip.io only provides DNS resolution. You still need to ensure port 443 is open and that your TLS certificate is valid for the chosen domain. If your application is configured to call itself at a .nip.io address, you must use the same domain consistently for external and internal access.
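One way to check reachability from outside the cluster is to request the domain over HTTPS and look at the status code. This is a sketch under assumptions: it relies on `curl` being installed, the function names are hypothetical, and treating any 2xx or 3xx response as "reachable" is a simplification rather than a documented health check.

```shell
#!/bin/sh
# Prints the HTTP status code returned by https://$1/ (curl prints 000 when the
# connection or TLS handshake fails).
https_status() {
    curl --silent --output /dev/null --max-time 10 \
         --write-out '%{http_code}' "https://$1/"
}

# Succeeds for 2xx and 3xx status codes.
status_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `status_ok "$(https_status 203.0.113.10.nip.io)" && echo reachable`. With a self-signed certificate, curl rejects the connection unless you pass the certificate with `--cacert` or disable verification for a quick test.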

Load balancing#

For production environments, the domain should point to a load balancer that distributes traffic across multiple servers. For smaller setups or demonstrations, the domain can point directly to a single server’s IP address, and MetalLB will be configured to handle load balancing within the Kubernetes cluster.

Install AMD Enterprise AI Suite#

You will use an installation tool called Cluster Bloom to first install and configure a Kubernetes cluster, and then install the applications. The installation tool performs the following steps to prepare an AMD GPU node to be part of a Kubernetes cluster:

  • Automated RKE2 Kubernetes cluster deployment

  • ROCm setup and configuration for AMD GPU nodes

  • Disk management and Longhorn storage integration

  • Multi-node cluster support with easy node joining

  • 1Password integration for secrets management

  • Installation of AMD Enterprise AI Suite using the Cluster Forge tool

Note

The current platform version supports only one cluster per installation. Support for multiple clusters is on the product roadmap.

SSH to node as root user#

Access the node using SSH as the root user.

Download the software#

Go to the working folder where you want to install the platform.

Run the following command to download the software release. This includes both the installation script (“bloom”) and the software application:

wget https://github.com/silogen/cluster-bloom/releases/download/v1.5.2/bloom

Make the installation script executable:

chmod +x bloom

Edit the values in the configuration file#

Before you can start the installation, you need to edit the installation configuration, which adapts the installation to your environment. You can edit the values directly in the configuration file.

Note

For the standard installation, use the default values. The only exceptions are the DOMAIN and OIDC_URL values, which should match your specific domain name. The configuration values are described in the Installation configuration values section of the Appendix.

Open the configuration file bloom.yaml. Edit the DOMAIN and OIDC_URL values to match your specific domain name.

DOMAIN: <your-ip-address>.nip.io
CERT_OPTION: generate
CLUSTERFORGE_RELEASE: none
FIRST_NODE: true
GPU_NODE: true
USE_CERT_MANAGER: false
CLUSTER_DISKS: /dev/vdc1
OIDC_URL: https://kc.<your-ip-address>.nip.io/realms/airm
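If you prefer to generate the file rather than edit it by hand, a small script can fill in the two domain-dependent values. This is only a sketch: `render_bloom_config` is a hypothetical helper, and the keys and defaults simply mirror the example above.

```shell
#!/bin/sh
# Prints a bloom.yaml body for node IP $1 and cluster disk $2.
# All keys and default values are copied from the example configuration.
render_bloom_config() {
    ip=$1
    disk=$2
    cat <<EOF
DOMAIN: ${ip}.nip.io
CERT_OPTION: generate
CLUSTERFORGE_RELEASE: none
FIRST_NODE: true
GPU_NODE: true
USE_CERT_MANAGER: false
CLUSTER_DISKS: ${disk}
OIDC_URL: https://kc.${ip}.nip.io/realms/airm
EOF
}
```

For example, `render_bloom_config 203.0.113.10 /dev/vdc1 > bloom.yaml` writes a configuration for a node at 203.0.113.10. Generating both values from one input keeps DOMAIN and OIDC_URL in agreement.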

Install Kubernetes cluster#

Run the following command to start the installation (this sets up a Kubernetes cluster for you):

sudo ./bloom --config bloom.yaml

The installation takes roughly 20 minutes. You can follow its progress through the user interface. Follow the instructions on the screen.

Optional step: Adding a second node to cluster#

After successful installation, Cluster Bloom generates additional_node_command.txt, which contains the command for installing additional nodes into the cluster.

View the Kubernetes installation#

Exit and log in again to source .bashrc, or run:

source ~/.bashrc

Then run the following command to view the software installation on Kubernetes:

k9s
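As a non-interactive alternative to k9s, the pod list can be filtered for pods that are not yet healthy. The sketch below assumes `kubectl` is on the PATH after sourcing .bashrc, which this guide does not state explicitly. `unhealthy_pods` is a hypothetical helper.

```shell
#!/bin/sh
# Reads the output of `kubectl get pods -A --no-headers` on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE) and prints
# "namespace/name status" for pods that are neither Running nor Completed.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2 " " $4}'
}
```

For example, run `kubectl get nodes` to confirm that the node is Ready, then `kubectl get pods -A --no-headers | unhealthy_pods`. Empty output means every pod is Running or Completed.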

Specify Hugging Face token#

In order to download and access gated models from Hugging Face, you need to provide a Hugging Face token. Tokens contain sensitive information. To keep them secure and prevent unauthorized access, they should not be stored in plain text in your code or configuration files. Instead, they are stored as secrets, a secure way to manage sensitive data.

How to get a Hugging Face token#

  1. Create or log in to your Hugging Face account: https://huggingface.co.

  2. Navigate to your account settings.

  3. Under Access Tokens, generate a new token.

  4. Copy the token (keep it safe; don’t share it).

Where to install the token#

The AMD AI Workbench provides a built-in secrets management system for storing Hugging Face tokens securely. You can add tokens through the UI when you deploy or fine-tune a model. You will be prompted to enter your Hugging Face token, which is then saved securely on your Kubernetes cluster.

Adding HF Tokens#

  1. In the deployment or fine-tune dialog, you’ll see a Hugging Face Authentication section.

  2. Choose one of two options:

    • Select existing token: Pick from previously added tokens in this project.

    • Add new token: Enter a name and paste your Hugging Face token.

  3. The token is automatically stored as a Kubernetes secret scoped to your project.

Token Management#

  • Scope: Hugging Face tokens are scoped to individual projects for security.

  • Reusability: Once added, tokens can be reused for multiple fine-tuning jobs or AIM deployments within the same project.

  • Storage: Tokens are stored as Kubernetes secrets and never exposed in plain text after creation.

  • Validation: Tokens must start with hf_ to be accepted.
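The prefix rule can be checked locally before you paste a token into the UI. The sketch below mirrors only the stated rule, that the token starts with hf_. The function name is hypothetical, and the UI may apply further checks that are not described here.

```shell
#!/bin/sh
# Succeeds if $1 starts with the hf_ prefix required by the validation rule.
is_hf_token() {
    case "$1" in
        hf_*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_hf_token "$HF_TOKEN" || echo "token does not start with hf_"`, where HF_TOKEN is an environment variable you set yourself for the check.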

Login#

To confirm that the installation was successful, ensure you are able to log in to the AMD AI Workbench:

  • Access the login URL (your domain name).

    • For nip.io domain: https://airmui.<master-node-ip-address>.nip.io

    • If using a registered domain, the web address of the service will be: https://airmui.<your-domain>

  • Log in as the devuser@domain user with the default password. The default password is initially password, and you are prompted to change it on first login.

See more details about login here.

Install software into an existing Kubernetes cluster#

To install AMD Enterprise AI Suite in an existing Kubernetes cluster, download a Cluster Forge release package and run bootstrap.sh. This assumes there is a working Kubernetes cluster to deploy into, and that the current kubeconfig context refers to that cluster.

Run the following commands to install the software:

wget https://github.com/silogen/cluster-forge/releases/download/v1.5.2/release-enterprise-ai-v1.5.2.tar.gz
tar -xzvf release-enterprise-ai-v1.5.2.tar.gz
cd cluster-forge/scripts
bash ./bootstrap.sh <DOMAIN>
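Before running these commands, it can be worth confirming that the required tools are present and that kubectl points at the intended cluster. The following is a sketch: `require_cmds` is a hypothetical helper, and the list of tools is inferred from the commands above.

```shell
#!/bin/sh
# Succeeds if every command named in the arguments is on PATH.
# Prints each missing command to stderr.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, `require_cmds kubectl wget tar bash && kubectl config current-context && kubectl get nodes` prints the active context and the nodes of the cluster that the installation would target.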

Appendix#

Installation configuration values#

This section describes the installation configuration values.

FIRST_NODE:
Specifies if this is the first node in the cluster. Set to false for additional nodes joining an existing cluster.

GPU_NODE:
Specifies whether the node has GPUs. Set to false for CPU-only nodes. When true, ROCm will be installed and configured.

OIDC_URL:
URL of the OIDC provider for authentication. To use the bundled cluster-internal Keycloak, use https://kc.<your-ip-address>.nip.io/realms/airm. Leave empty to skip OIDC configuration.

NO_DISKS_FOR_CLUSTER:
In rare instances you may choose to install a node without storage.

SKIP_RANCHER_PARTITION_CHECK:
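By default, the installer requires the partition that holds /var/lib (used by Rancher) to be at least 500 GB. Enabling this option skips that check. Use it at your own risk: the installation can proceed, but capacity for downloaded images may be limited.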

CLUSTER_DISKS:
List of disk devices to use. Example: /dev/sdb. This is the preferred way to provide storage.

CLUSTER_PREMOUNTED_DISKS:
Absolute path to pre-mounted disks. This is not preferred but sometimes necessary. This also requires:
DISABLED_STEPS: CleanLonghornMountsStep, CleanDisksStep

CLUSTERFORGE_RELEASE:
The Cluster Forge release URL, or none to skip the software installation.

DOMAIN:
Domain name for the cluster, for example, cluster.example.com. The domain name is used for ingress configuration. If you don’t have a DNS-enabled domain available, you may use a .nip.io domain with your IP address. Example: <master-node-ip-address>.nip.io.

USE_CERT_MANAGER:
Set to true to use cert-manager with Let’s Encrypt for automatic TLS certificates. Set to false to provide your own certificates.

CERT_OPTION:
Certificate option when USE_CERT_MANAGER is false. Choose existing to use existing certificate files, or generate to create a self-signed certificate.

Create installation configuration using the Installation wizard#

Instead of editing the configuration file directly, you can use the installation wizard, which guides you through creating the configuration for the installation.

To start the wizard:

sudo ./bloom

Next, complete each step in the wizard.

Note

For the standard installation, select the default values. The only exceptions are the domain, OIDC_URL, and certificate option, where you should provide a specific value.

The configuration values are described in the Installation configuration values section above.

Configuration complete!
Once the wizard has completed you can find the configuration file in bloom.yaml.

Start the installation
To run the actual installation, answer y at the prompt: Would you like to run bloom with this configuration now? (y/n)