Inference Metrics#
AMD AI Workbench provides comprehensive real-time metrics for monitoring AIM inference workload performance. These metrics are available on the workload details page when an inference workload is in a running state.
Available metrics#
Latency metrics#
Time to First Token
Measures the latency from request submission to the generation of the first token
Critical for evaluating perceived responsiveness in streaming applications
Lower values indicate better user experience
Inter-Token Latency
Tracks the time elapsed between consecutive token generations
Important for smooth streaming output
Consistent low values ensure fluid text generation
End-to-End Latency
Captures the total time from request submission to receipt of the complete response
Comprehensive measure of overall request processing performance
Useful for batch processing and non-streaming scenarios
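All three latency metrics can also be sampled client-side for comparison against the dashboard. The sketch below assumes the workload exposes an OpenAI-compatible streaming chat endpoint; the URL, model name, and prompt are placeholders, and each streamed chunk is treated as one token even though servers may batch several tokens per chunk.

```python
import json
import time

import requests

# Placeholder endpoint and model name for an OpenAI-compatible streaming API;
# substitute the values for your workload.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "my-aim-model"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "stream": True,
    "max_tokens": 128,
}

chunk_times = []
start = time.perf_counter()

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # Timestamp every chunk that carries generated text. A chunk may hold
        # more than one token, so this approximates per-token timing.
        if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
            chunk_times.append(time.perf_counter())

end = time.perf_counter()

ttft = chunk_times[0] - start                       # Time to First Token
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
avg_itl = sum(gaps) / len(gaps) if gaps else 0.0    # mean Inter-Token Latency
e2e = end - start                                   # End-to-End Latency

print(f"TTFT {ttft * 1000:.1f} ms | ITL {avg_itl * 1000:.1f} ms | E2E {e2e:.2f} s")
```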
Request metrics#
Inference Requests
Displays real-time count of running and waiting requests
Helps identify processing bottlenecks and queue buildup
Color-coded visualization for quick status assessment
Max Concurrent Requests
Shows the peak number of concurrent requests handled
Useful for capacity planning and load testing
Reflects peak sustained concurrency rather than raw throughput
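For a rough external view of these counters, you can poll the workload's Prometheus-style metrics endpoint, if one is exposed. The endpoint path and gauge names below follow vLLM's conventions and are assumptions; verify both against your own deployment before relying on them.

```python
import time

import requests

# Hypothetical metrics endpoint and vLLM-style gauge names.
METRICS_URL = "http://localhost:8000/metrics"
RUNNING = "vllm:num_requests_running"
WAITING = "vllm:num_requests_waiting"

def gauge(name: str, body: str) -> float:
    """Return the first sample value for a gauge in Prometheus text format."""
    for line in body.splitlines():
        if line.startswith(name):          # matches with or without labels
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

peak = 0.0
for _ in range(30):                        # sample once per second for ~30 s
    body = requests.get(METRICS_URL, timeout=5).text
    running = gauge(RUNNING, body)
    waiting = gauge(WAITING, body)
    peak = max(peak, running)              # observed max concurrent requests
    print(f"running={running:.0f} waiting={waiting:.0f} peak={peak:.0f}")
    time.sleep(1)
```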
Resource metrics#
Total Tokens
Cumulative count of tokens processed by the workload
Useful for usage tracking and billing estimations
Includes both input and output tokens
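If you need usage numbers outside the dashboard, OpenAI-compatible responses carry per-request token counts that you can accumulate yourself. A minimal sketch, assuming a hypothetical local endpoint and model name, with a made-up price used only to illustrate a billing estimate:

```python
import requests

# Placeholder endpoint and model name; the usage fields follow the
# OpenAI-compatible response schema.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "my-aim-model"

total_prompt = total_completion = 0

for question in ["What is ROCm?", "Summarize KV caching in two sentences."]:
    resp = requests.post(
        URL,
        json={"model": MODEL, "messages": [{"role": "user", "content": question}]},
        timeout=120,
    ).json()
    usage = resp["usage"]
    total_prompt += usage["prompt_tokens"]          # input tokens
    total_completion += usage["completion_tokens"]  # output tokens

total = total_prompt + total_completion             # both directions, as on the dashboard
print(f"prompt={total_prompt} completion={total_completion} total={total}")
# Illustrative billing estimate at a made-up blended rate of $0.002 / 1K tokens.
print(f"estimated cost: ${total / 1000 * 0.002:.4f}")
```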
KV Cache Utilization
Percentage of key-value cache currently in use
Critical for understanding memory pressure
High utilization may indicate need for scaling or optimization
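To reason about when the cache will fill, a back-of-envelope sizing helps: each cached token stores one key and one value vector per layer. The sketch below uses illustrative 7B-class model dimensions and a hypothetical cache budget; it is not a description of AIM internals.

```python
# Back-of-envelope KV cache sizing. The model dimensions are illustrative
# (roughly a 7B Llama-style model with FP16 cache entries), not AIM internals.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_element = 2                     # FP16

# Each cached token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # 512 KiB here

cache_budget_gib = 16                     # hypothetical memory reserved for the cache
capacity = cache_budget_gib * 1024**3 // bytes_per_token
print(f"~{capacity:,} tokens fit in a {cache_budget_gib} GiB cache")

# Utilization as shown on the dashboard: share of that capacity in use.
active_tokens = 24_000                    # e.g. cached tokens across live requests
print(f"utilization: {active_tokens / capacity:.0%}")
```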
Time range selection#
Metrics can be viewed across different time periods:
1 Hour - Real-time monitoring and immediate troubleshooting
24 Hours - Daily performance patterns and trends
7 Days - Weekly analysis and capacity planning
The metrics dashboard automatically refreshes to display the latest data based on the selected time range.
Accessing metrics#
1. Navigate to the Workloads page.
2. Select Open details from the context menu of a running inference workload.
3. View the Inference metrics section on the workload details page.
4. Use the time range selector to adjust the viewing period.
Note
Metrics are only available for AIM inference workloads. Custom and fine-tuned models do not support this feature yet.