Inference Metrics#
AMD AI Workbench provides comprehensive real-time metrics for monitoring AIM inference workload performance. These metrics are available on the workload details page when an inference workload is in a running state.
Available metrics#
Latency metrics#
Time to First Token
Measures the latency from request submission to the generation of the first token
Critical for evaluating perceived responsiveness in streaming applications
Lower values indicate better user experience
Inter-Token Latency
Tracks the time elapsed between consecutive token generations
Important for smooth streaming output
Consistent low values ensure fluid text generation
End-to-End Latency
Captures the total time from request submission to receipt of the complete response
Comprehensive measure of overall request processing performance
Useful for batch processing and non-streaming scenarios
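All three latency metrics can also be sampled client-side for comparison against the dashboard. The sketch below assumes the workload exposes an OpenAI-compatible streaming chat endpoint; the URL, model name, and prompt are placeholders, and each streamed chunk is treated as one token even though servers may batch several tokens per chunk.

```python
import json
import time

import requests

# Placeholder endpoint and model name for an OpenAI-compatible streaming API;
# substitute the values for your workload.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "my-aim-model"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "stream": True,
    "max_tokens": 128,
}

chunk_times = []
start = time.perf_counter()

with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # Timestamp every chunk that carries generated text. A chunk may hold
        # more than one token, so this approximates per-token timing.
        if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
            chunk_times.append(time.perf_counter())

end = time.perf_counter()

ttft = chunk_times[0] - start                       # Time to First Token
gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
avg_itl = sum(gaps) / len(gaps) if gaps else 0.0    # mean Inter-Token Latency
e2e = end - start                                   # End-to-End Latency

print(f"TTFT {ttft * 1000:.1f} ms | ITL {avg_itl * 1000:.1f} ms | E2E {e2e:.2f} s")
```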
Request metrics#
Inference Requests
Displays real-time count of running and waiting requests
Helps identify processing bottlenecks and queue buildup
Color-coded visualization for quick status assessment
Max Concurrent Requests
Shows the peak number of concurrent requests handled
Useful for capacity planning and load testing
Reflects peak sustained concurrency rather than raw throughput
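For a rough external view of these counters, you can poll the workload's Prometheus-style metrics endpoint, if one is exposed. The endpoint path and gauge names below follow vLLM's conventions and are assumptions; verify both against your own deployment before relying on them.

```python
import time

import requests

# Hypothetical metrics endpoint and vLLM-style gauge names.
METRICS_URL = "http://localhost:8000/metrics"
RUNNING = "vllm:num_requests_running"
WAITING = "vllm:num_requests_waiting"

def gauge(name: str, body: str) -> float:
    """Return the first sample value for a gauge in Prometheus text format."""
    for line in body.splitlines():
        if line.startswith(name):          # matches with or without labels
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

peak = 0.0
for _ in range(30):                        # sample once per second for ~30 s
    body = requests.get(METRICS_URL, timeout=5).text
    running = gauge(RUNNING, body)
    waiting = gauge(WAITING, body)
    peak = max(peak, running)              # observed max concurrent requests
    print(f"running={running:.0f} waiting={waiting:.0f} peak={peak:.0f}")
    time.sleep(1)
```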
Resource metrics#
Total Tokens
Cumulative count of tokens processed by the workload
Useful for usage tracking and billing estimations
Includes both input and output tokens
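If you need usage numbers outside the dashboard, OpenAI-compatible responses carry per-request token counts that you can accumulate yourself. A minimal sketch, assuming a hypothetical local endpoint and model name, with a made-up price used only to illustrate a billing estimate:

```python
import requests

# Placeholder endpoint and model name; the usage fields follow the
# OpenAI-compatible response schema.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "my-aim-model"

total_prompt = total_completion = 0

for question in ["What is ROCm?", "Summarize KV caching in two sentences."]:
    resp = requests.post(
        URL,
        json={"model": MODEL, "messages": [{"role": "user", "content": question}]},
        timeout=120,
    ).json()
    usage = resp["usage"]
    total_prompt += usage["prompt_tokens"]          # input tokens
    total_completion += usage["completion_tokens"]  # output tokens

total = total_prompt + total_completion             # both directions, as on the dashboard
print(f"prompt={total_prompt} completion={total_completion} total={total}")
# Illustrative billing estimate at a made-up blended rate of $0.002 / 1K tokens.
print(f"estimated cost: ${total / 1000 * 0.002:.4f}")
```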
KV Cache Utilization
Percentage of key-value cache currently in use
Critical for understanding memory pressure
High utilization may indicate need for scaling or optimization
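To reason about when the cache will fill, a back-of-envelope sizing helps: each cached token stores one key and one value vector per layer. The sketch below uses illustrative 7B-class model dimensions and a hypothetical cache budget; it is not a description of AIM internals.

```python
# Back-of-envelope KV cache sizing. The model dimensions are illustrative
# (roughly a 7B Llama-style model with FP16 cache entries), not AIM internals.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_element = 2                     # FP16

# Each cached token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # 512 KiB here

cache_budget_gib = 16                     # hypothetical memory reserved for the cache
capacity = cache_budget_gib * 1024**3 // bytes_per_token
print(f"~{capacity:,} tokens fit in a {cache_budget_gib} GiB cache")

# Utilization as shown on the dashboard: share of that capacity in use.
active_tokens = 24_000                    # e.g. cached tokens across live requests
print(f"utilization: {active_tokens / capacity:.0%}")
```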
Time range selection#
Metrics can be viewed across different time periods:
1 Hour - Real-time monitoring and immediate troubleshooting
24 Hours - Daily performance patterns and trends
7 Days - Weekly analysis and capacity planning
The metrics dashboard automatically refreshes to display the latest data based on the selected time range.
Accessing metrics#
1. Navigate to the Workloads page.
2. Select Open details from the context menu of a running inference workload.
3. View the Inference metrics section on the workload details page.
4. Use the time range selector to adjust the viewing period.
Note
Metrics are only available for AIM inference workloads. Custom and fine-tuned models do not support this feature yet.