Inference Metrics#

AMD AI Workbench provides comprehensive real-time metrics for monitoring AIM inference workload performance. These metrics are available on the workload details page when an inference workload is in a running state.

Available metrics#

Latency metrics#

Time to First Token

  • Measures the latency from request submission to the generation of the first token

  • Critical for evaluating perceived responsiveness in streaming applications

  • Lower values indicate a better user experience
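
As a concrete illustration, TTFT can also be measured from the client side by timing a streaming request. The sketch below assumes the workload exposes an OpenAI-compatible streaming endpoint; the URL and model name are placeholders, so substitute the values shown on your workload details page.

```python
import json
import time

import requests

# Placeholder endpoint and model; use the values for your workload.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "my-aim-model"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(ENDPOINT, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-compatible servers stream Server-Sent Events: b"data: {...}".
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:  # first chunk carrying actual text
            ttft = time.perf_counter() - start
            break

if ttft is not None:
    print(f"Time to first token: {ttft:.3f}s")
```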

Inter-Token Latency

  • Tracks the time elapsed between consecutive token generations

  • Important for smooth streaming output

  • Consistently low values ensure fluid text generation
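
Building on the TTFT sketch above, inter-token latency can be estimated by recording a timestamp for every streamed chunk and summarizing the gaps. The helper below assumes such timestamps have already been collected (one per received chunk, via time.perf_counter()); the numbers in the example are illustrative.

```python
import statistics


def inter_token_latency(timestamps: list[float]) -> dict[str, float]:
    """Summarize the gaps (seconds) between consecutive chunk arrivals."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "mean": statistics.mean(gaps),
        "p95": statistics.quantiles(gaps, n=20)[-1],  # ~95th percentile
        "max": max(gaps),
    }


# Illustrative timestamps, one per streamed chunk (from time.perf_counter()).
stamps = [0.000, 0.031, 0.060, 0.094, 0.125, 0.310, 0.342]
print(inter_token_latency(stamps))
```

A low mean combined with an occasional large max typically points at scheduler stalls or preemption rather than steady-state slowness.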

End-to-End Latency

  • Captures the total time from request submission to complete response

  • Comprehensive measure of overall request processing performance

  • Useful for batch processing and non-streaming scenarios
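
For non-streaming calls, end-to-end latency is simply the wall-clock time around the request. A minimal sketch, reusing the placeholder endpoint and model from above, times a handful of sequential requests and reports percentiles:

```python
import statistics
import time

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder
payload = {
    "model": "my-aim-model",  # placeholder
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": False,
}

latencies = []
for _ in range(10):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=payload, timeout=120).raise_for_status()
    latencies.append(time.perf_counter() - start)

cuts = statistics.quantiles(latencies, n=20)  # 19 cut points
print(f"E2E latency p50={cuts[9]:.2f}s p95={cuts[18]:.2f}s")
```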

Request metrics#

Inference Requests

  • Displays a real-time count of running and waiting requests

  • Helps identify processing bottlenecks and queue buildup

  • Color-coded visualization for quick status assessment
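
If you want to alert on queue buildup outside the dashboard, many inference servers additionally expose these counters on a Prometheus-style /metrics endpoint. The sketch below scrapes such an endpoint; the URL and the vLLM-style gauge names are assumptions rather than a documented AIM interface, so verify them against your workload's actual /metrics output.

```python
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumed Prometheus endpoint

# vLLM-style gauge names; these are assumptions, check your server's output.
WANTED = ("num_requests_running", "num_requests_waiting")

for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    if any(name in line for name in WANTED):
        print(line)  # e.g. 'vllm:num_requests_running{...} 3.0'
```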

Max Concurrent Requests

  • Shows the peak number of concurrent requests handled

  • Useful for capacity planning and load testing

  • Indicates the maximum concurrent load the workload has sustained; see the load-test sketch below
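
To exercise this metric during a load test, fire a burst of concurrent requests and watch the peak register on the dashboard. A minimal thread-pool sketch, again using placeholder endpoint and model values:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder
PAYLOAD = {
    "model": "my-aim-model",  # placeholder
    "messages": [{"role": "user", "content": "Count to twenty."}],
}
CONCURRENCY = 32  # burst size; the dashboard's peak should approach this


def one_request(_: int) -> float:
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=300).raise_for_status()
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    durations = list(pool.map(one_request, range(CONCURRENCY)))

print(f"{CONCURRENCY} requests completed, slowest took {max(durations):.2f}s")
```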

Resource metrics#

Total Tokens

  • Cumulative count of tokens processed by the workload

  • Useful for usage tracking and billing estimates

  • Includes both input and output tokens
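
For a rough billing estimate, OpenAI-compatible responses report these counts in a usage object. The sketch below assumes that response shape, and the per-token prices are made-up placeholders:

```python
# Parsed JSON of a non-streaming chat completion (illustrative values).
response = {
    "usage": {"prompt_tokens": 412, "completion_tokens": 128, "total_tokens": 540}
}

# Made-up prices; substitute your own cost model.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

usage = response["usage"]
cost = (
    usage["prompt_tokens"] / 1000 * PRICE_PER_1K_INPUT
    + usage["completion_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
)
print(f"{usage['total_tokens']} tokens, estimated cost ${cost:.6f}")
```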

KV Cache Utilization

  • Percentage of the key-value (KV) cache currently in use

  • Critical for understanding memory pressure

  • High utilization may indicate a need for scaling or optimization; the sketch below relates utilization to token capacity
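
To translate a utilization percentage into tokens, you can estimate the cache's capacity from the model's shape. The back-of-the-envelope sketch below uses example values for a hypothetical Llama-style configuration; substitute your model's actual parameters.

```python
# Example shape for a hypothetical Llama-style model; substitute your config.
num_layers = 32
num_kv_heads = 8     # grouped-query attention
head_dim = 128
dtype_bytes = 2      # fp16 / bf16

# Each cached token stores one key and one value vector per layer per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KiB per token")  # 128 KiB here

# With, say, 16 GiB reserved for the KV cache:
cache_bytes = 16 * 1024**3
capacity_tokens = cache_bytes // bytes_per_token
print(f"~{capacity_tokens:,} tokens fit; 80% utilization ≈ {int(0.8 * capacity_tokens):,} tokens")
```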

Time range selection#

Metrics can be viewed across different time periods:

  • 1 Hour - Real-time monitoring and immediate troubleshooting

  • 24 Hours - Daily performance patterns and trends

  • 7 Days - Weekly analysis and capacity planning

The metrics dashboard automatically refreshes to display the latest data based on the selected time range.

Accessing metrics#

  1. Navigate to the Workloads page.

  2. Select Open details from the context menu of a running inference workload.

  3. View the Inference metrics section on the workload details page.

  4. Use the time range selector to adjust the viewing period.

Note

Metrics are only available for AIM inference workloads. Custom and fine-tuned models do not support this feature yet.