
Advanced Deployment Options#

When deploying AMD Inference Microservices (AIMs) from the model catalog, advanced configuration options let you tailor the deployment to specific use cases and requirements.

Performance Metric Selection#

The performance metric option allows you to specify the optimization goal for your model deployment. This setting influences how the AIM container configures itself to best serve your workload.

Available Metrics#

Latency: Prioritizes low end-to-end latency for individual requests. Choose this metric when:

  • You need fast responses for real-time applications

  • User experience depends on quick individual predictions

  • You’re building interactive chatbots or live inference services

Throughput: Prioritizes sustained requests per second. Choose this metric when:

  • You need to process high volumes of requests

  • Batch processing is more important than individual response time

  • You’re running background inference tasks or bulk predictions

How to Select a Metric#

  1. Navigate to the Models page and go to the AIM Catalog tab

  2. Click Deploy on your chosen model

  3. In the deployment drawer, locate the Performance metric section

  4. Select your preferred metric from the dropdown:

    • Latency - Optimize for low response times

    • Throughput - Optimize for high request volume

    • Default - Let the AIM automatically select the best metric based on the model and hardware

Figure: Performance metric selection in the deployment drawer

Note

If you don’t specify a metric, the AIM will automatically select the most appropriate optimization based on the model type and available hardware. For most use cases, the default selection provides excellent performance.

Performance Metric Impact#

The selected metric affects:

  • Memory allocation strategies

  • Batch processing configurations

  • Cache utilization patterns

  • GPU resource scheduling

Different models may respond differently to metric selection based on their architecture and size. Larger language models typically benefit more from throughput optimization when handling multiple concurrent users, while smaller models may show better results with latency optimization.
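
For a rough client-side view of this trade-off, you can benchmark the deployed endpoint yourself. The sketch below is not part of AMD AI Workbench: it assumes the deployed AIM exposes an OpenAI-compatible chat completions API, and the endpoint URL and model name are placeholders you would replace with values from your own deployment. It approximates per-request latency with sequential calls and sustained throughput with concurrent calls.

```python
# Hedged sketch of a client-side latency/throughput check.
# Assumes an OpenAI-compatible endpoint; ENDPOINT and MODEL are placeholders.
import concurrent.futures
import time

import requests

ENDPOINT = "http://<your-aim-host>:8000/v1/chat/completions"  # placeholder
MODEL = "your-deployed-model"                                  # placeholder

def one_request(prompt: str) -> float:
    """Send a single chat completion and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# Sequential requests approximate per-request latency.
latencies = [one_request("Summarize the benefits of GPU inference.") for _ in range(5)]
print(f"mean latency: {sum(latencies) / len(latencies):.2f} s")

# Concurrent requests approximate sustained throughput (requests per second).
n = 20
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(one_request, ["Summarize the benefits of GPU inference."] * n))
elapsed = time.perf_counter() - start
print(f"throughput: {n / elapsed:.2f} req/s")
```

Running a comparison like this against a latency-optimized and a throughput-optimized deployment of the same model is a simple way to confirm that the selected metric matches your traffic pattern.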

Unoptimized Deployments#

By default, AMD AI Workbench only deploys models using fully optimized configurations that have been validated for production performance on AMD hardware. However, you can optionally enable unoptimized deployments to access preview and experimental configurations.

Profile Types#

AIM deployments use one of three profile types (see the sketch after this list for the thresholds expressed as code):

Optimized Profiles: Configurations that achieve ≥90% of the performance of competitive platforms

  • Production-ready and fully supported

  • Recommended for all production workloads

  • Thoroughly tested and validated

  • Only type used by default

Preview Profiles: Configurations that achieve ≥70% of the performance of competitive platforms

  • Suitable for testing and development

  • May be promoted to optimized in future releases

  • Generally stable but may have minor limitations

  • Only available when unoptimized deployments are enabled

Unoptimized Profiles: Experimental configurations that achieve <70% of the performance of competitive platforms

  • Not recommended for production use

  • May result in significantly subpar performance

  • Useful for testing new hardware or experimental features

  • Only available when unoptimized deployments are enabled
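
To make the thresholds concrete, the minimal sketch below (not an actual AIM API) classifies a measured performance ratio, relative to a competitive platform, into the three tiers described above.

```python
# Minimal sketch: classify a relative performance ratio (1.0 == parity with a
# competitive platform) into the documented profile tiers.

def profile_tier(relative_performance: float) -> str:
    if relative_performance >= 0.90:
        return "optimized"    # >= 90% of competitive performance
    if relative_performance >= 0.70:
        return "preview"      # >= 70% of competitive performance
    return "unoptimized"      # < 70% of competitive performance

print(profile_tier(0.95))  # optimized
print(profile_tier(0.75))  # preview
print(profile_tier(0.50))  # unoptimized
```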

When to Enable Unoptimized Deployments#

Enable unoptimized deployments when:

  • Testing models on newly released AMD hardware (e.g., MI325X)

  • Participating in early access programs or demos

  • Evaluating preview or experimental features

  • Working with AMD support on performance optimization

  • Deploying models that don’t yet have optimized profiles for your hardware

Do not enable unoptimized deployments for:

  • Production workloads

  • Performance benchmarking or comparisons

  • Customer-facing applications

  • Any use case where performance matters

How to Enable Unoptimized Deployments#

Warning

Unoptimized deployments may result in significantly lower performance. Only enable this option if you specifically need access to preview or experimental configurations.

  1. Navigate to the Models page and go to the AIM Catalog tab

  2. Click Deploy on your chosen model

  3. Scroll to the Unoptimized deployment section

  4. Toggle the Allow switch to enable unoptimized deployments

When enabled, the AIM deployment system will:

  • First attempt to use optimized profiles (if available)

  • Fall back to preview profiles if no optimized profile exists

  • Use unoptimized profiles only if no better option is available

When disabled (default), the deployment will fail if no optimized profiles are available for your hardware configuration.

Important

Even with this option enabled, the system always prefers optimized configurations when available. The toggle simply allows the use of preview and unoptimized profiles when necessary.

Profile Selection Priority#

The deployment system selects profiles in the following order (the sketch after these lists expresses the same fallback logic as code):

With “Allow unoptimized” disabled (default):

  1. Optimized profiles only

  2. Deployment fails if no optimized profiles are available

With “Allow unoptimized” enabled:

  1. Optimized profiles (highest priority)

  2. Preview profiles (if no optimized available)

  3. Unoptimized profiles (if no optimized or preview available)

  4. Deployment fails only if no profiles exist at all
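
The priority order can be written as a short sketch. This is not the real scheduler implementation, only the documented fallback behavior expressed as Python for illustration.

```python
# Illustrative sketch of the documented profile selection priority.
from typing import Iterable, Optional

PROFILE_PRIORITY = ["optimized", "preview", "unoptimized"]

def select_profile(available: Iterable[str], allow_unoptimized: bool) -> Optional[str]:
    """Return the profile tier the deployment would use, or None if it would fail."""
    allowed = PROFILE_PRIORITY if allow_unoptimized else ["optimized"]
    for tier in allowed:
        if tier in available:
            return tier
    return None  # no acceptable profile -> deployment fails

# Toggle disabled (default): only optimized profiles are acceptable.
print(select_profile({"preview", "unoptimized"}, allow_unoptimized=False))  # None
# Toggle enabled: the best available tier is used.
print(select_profile({"preview", "unoptimized"}, allow_unoptimized=True))   # preview
```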

Performance Expectations#

When using unoptimized deployments with preview or unoptimized profiles, you may experience:

  • Higher latency per request

  • Lower throughput (requests per second)

  • Increased resource utilization

  • Less stable performance characteristics

  • Reduced GPU efficiency

These limitations are typically temporary as AMD continues to optimize configurations for new hardware and models. Preview profiles that demonstrate stable performance are regularly promoted to optimized status.

Combining Deployment Options#

You can combine both performance metric selection and unoptimized deployment options:

Example: Deploying a model on MI325X hardware with throughput optimization

  1. Select Throughput as your performance metric

  2. Enable Allow unoptimized deployments (if MI325X profiles are still in preview/unoptimized status)

  3. Deploy the model

The system will select the best available throughput-focused profile, prioritizing optimized configurations first, then falling back to preview or unoptimized profiles if necessary.

Monitoring Deployment Performance#

After deploying with custom options, monitor your deployment’s performance:

  1. Navigate to the Workloads page

  2. Select Open details from your inference workload

  3. View the Inference metrics section

Key metrics to watch:

  • Time to First Token: Indicates responsiveness (latency-focused)

  • Throughput: Shows requests per second (throughput-focused)

  • Resource Utilization: Helps identify whether the configuration is appropriate

See Inference Metrics for detailed information about monitoring your deployments.
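
As a client-side cross-check of the dashboard, you can also measure time to first token directly. The sketch below assumes the deployment exposes an OpenAI-compatible streaming endpoint; the endpoint URL and model name are placeholders, and the arrival of the first streamed chunk is used as an approximation of the first token.

```python
# Hedged sketch: approximate time to first token from the client side.
# Assumes an OpenAI-compatible streaming endpoint; ENDPOINT and MODEL are placeholders.
import time

import requests

ENDPOINT = "http://<your-aim-host>:8000/v1/chat/completions"  # placeholder
MODEL = "your-deployed-model"                                  # placeholder

start = time.perf_counter()
with requests.post(
    ENDPOINT,
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # first non-empty server-sent-event chunk, used as a proxy for the first token
            print(f"approx. time to first token: {time.perf_counter() - start:.3f} s")
            break
```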

Best Practices#

  1. Start with defaults: Use default settings unless you have specific requirements

  2. Use optimized profiles only in production: Keep “Allow unoptimized” disabled for production workloads

  3. Test thoroughly: When using preview or unoptimized deployments, conduct thorough performance testing before production use

  4. Monitor actively: Keep an eye on metrics to ensure your configuration meets expectations

  5. Stay updated: Check release notes for when preview profiles are promoted to optimized

  6. Contact support: Work with AMD support when deploying on new hardware or encountering performance issues

Troubleshooting#

Deployment Fails with “No optimized profile available”#

If your deployment fails with this error:

  1. Verify your hardware is supported for this model

  2. Consider enabling “Allow unoptimized deployments” if you need to proceed with preview or experimental configurations

  3. Check the AIM catalog for alternative models that have optimized profiles for your hardware

  4. Contact AMD support for information about optimization timelines

Performance Lower Than Expected#

If your deployed model shows poor performance:

  1. Check if you’re using an unoptimized or preview profile (visible in deployment details)

  2. Try different performance metrics (latency vs throughput)

  3. Review the inference metrics to identify bottlenecks

  4. Ensure you’re using the latest AIM version (check for updates in the catalog)