AMD Instinct MI50 benchmark summary

Abstract

This field note consolidates public AMD Instinct MI50 benchmark observations with a local 3×MI50 llama.cpp run. The objective is to document practical performance behavior, compare power-limit and clock effects under the V420.rom VBIOS, and establish a baseline for future local AI workload testing.

The external reference comes from a public MI50 32GB VBIOS note that compares V420.rom benchmark behavior across power caps, SCLK/MCLK settings, and LLM throughput. The local benchmark was performed on a 3×MI50/gfx906 ROCm system using Qwen2.5-Coder-32B-Instruct in Q4_K_M GGUF format.

This is not a strict apples-to-apples comparison. The external reference uses 4×MI50 with gpt-oss:120b through Ollama, while the local run uses 3×MI50 with Qwen2.5-Coder-32B-Instruct through llama.cpp. The value of the comparison is operational: it highlights what to measure, what configuration variables matter, and where future controlled tests should focus.

Benchmark Context

The public benchmark table was designed to check stability under different overclocks and power caps. The source also notes that power limits and thermals can hold the card back. All listed tests were performed with rocm-smi --setperflevel high, after warm-up, on a Ryzen 9 5950X test system.

The external LLM test used the following configuration:

Item             External Reference
GPU              4× AMD Instinct MI50
ROM              V420.rom
Workload         gpt-oss:120b
Runtime          Ollama ROCm container
Context          32768 tokens
KV cache         q8_0
Flash attention  Enabled
Output metrics   Prompt processing rate and token generation rate

Extracted V420.rom Results

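The table below extracts the average values from the public benchmark data. Clocks are in MHz, Avg FPS is the Cyberpunk 2077 game result, pp/s is the prompt processing rate in tokens per second, and tg/s is the token generation rate in tokens per second.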

ROM / Power Cap   SCLK / MCLK (MHz)  Avg FPS  Avg pp/s  Avg tg/s
V420.rom @ 178W   1800 / 1000        72.65    2562.27   31.52
V420.rom @ 178W   1800 / 1180        72.98    2623.45   33.01
V420.rom @ 178W   2000 / 1000        73.46    2679.17   33.20
V420.rom @ 178W   2000 / 1180        74.01    2810.14   34.86
V420.rom @ 225W   2000 / 1000        N/A      2683.52   33.29
V420.rom @ 225W   2000 / 1180        78.51    2796.71   34.78
V420.rom @ 300W   2000 / 1000        N/A      2682.30   33.24
V420.rom @ 300W   2000 / 1150        80.73    2764.77   34.60
V420.rom @ 300W   2000 / 1180        80.85    2802.37   34.85

[Figure: V420 generation throughput chart]

Key Observations from the External Benchmark

The strongest LLM generation result in the extracted table was 34.86 tg/s at 178W, 2000/1180 MHz. The same clocks produced 34.78 tg/s at 225W and 34.85 tg/s at 300W, a spread of well under 1%. This suggests that, for this specific workload, raising the power cap alone did not materially improve token generation once core and memory clocks were already favorable.

Memory frequency appears more important than raw power cap in several rows. At 178W and 2000 MHz SCLK, increasing MCLK from 1000 to 1180 MHz improved average generation throughput from 33.20 tg/s to 34.86 tg/s, an improvement of roughly 5%.
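The memory clock comparison can be checked at every power cap, not only at 178W. The following is a minimal sketch that uses only the 2000 MHz SCLK rows from the extracted table; the function and variable names are illustrative and not part of the original benchmark tooling.

```python
# Average tg/s at SCLK 2000 MHz, copied from the extracted V420.rom table.
# Layout: power cap in watts -> {MCLK in MHz: average tg/s}
TG_AT_SCLK_2000 = {
    178: {1000: 33.20, 1180: 34.86},
    225: {1000: 33.29, 1180: 34.78},
    300: {1000: 33.24, 1180: 34.85},
}


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Percentage change going from baseline to improved."""
    return (improved - baseline) / baseline * 100.0


def mclk_gains(table: dict) -> dict:
    """tg/s gain from raising MCLK 1000 -> 1180 MHz at each power cap."""
    return {
        cap: relative_gain_pct(row[1000], row[1180])
        for cap, row in table.items()
    }


if __name__ == "__main__":
    for cap, gain in mclk_gains(TG_AT_SCLK_2000).items():
        print(f"{cap} W: {gain:.2f}% tg/s gain from MCLK 1000 -> 1180 MHz")
```

In this data the gain stays between roughly 4.5% and 5% at all three power caps, which is consistent with the memory clock mattering more than the cap for this workload.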

Game FPS behaved differently. The Cyberpunk 2077 result improved from 74.01 FPS at 178W to 80.85 FPS at 300W with clocks held at 2000/1180 MHz, a gain of roughly 9%. This indicates that game and LLM workloads do not stress the card in the same way.
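To make the contrast between the two workloads concrete, the sketch below computes how FPS and tg/s change relative to the lowest power cap at fixed 2000/1180 MHz clocks. It uses only values from the extracted table; the helper names are illustrative assumptions.

```python
# Rows at SCLK/MCLK 2000/1180 MHz from the extracted table.
# Layout: power cap in watts -> (average FPS, average tg/s)
ROWS_2000_1180 = {
    178: (74.01, 34.86),
    225: (78.51, 34.78),
    300: (80.85, 34.85),
}


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


def scaling_vs_lowest_cap(rows: dict) -> dict:
    """Return {cap: (FPS change %, tg/s change %)} relative to the lowest cap."""
    base_cap = min(rows)
    base_fps, base_tg = rows[base_cap]
    return {
        cap: (pct_change(base_fps, fps), pct_change(base_tg, tg))
        for cap, (fps, tg) in rows.items()
        if cap != base_cap
    }


if __name__ == "__main__":
    for cap, (fps_pct, tg_pct) in sorted(scaling_vs_lowest_cap(ROWS_2000_1180).items()):
        print(f"{cap} W vs 178 W: FPS {fps_pct:+.2f}%, tg/s {tg_pct:+.2f}%")
```

The FPS column rises with each step in power cap while the tg/s column stays within a fraction of a percent of the 178W value.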

Local Benchmark Baseline

The local benchmark was performed using a 3×MI50 ROCm setup and llama.cpp.

Local llama.cpp run

Item               Local Run
GPUs               3× AMD Instinct MI50 / gfx906
Runtime            llama.cpp / llama-server
Model              Qwen2.5-Coder-32B-Instruct
Quantization       Q4_K_M GGUF
Context            4096 tokens
Model size         18.48 GiB
Parameters         32.76B
GPU offload        65 / 65 layers
Prompt eval        313.68 ms / 35 tokens
Prompt throughput  111.58 tokens/s
Decode eval        183.36 ms / 4 tokens
Decode throughput  21.82 tokens/s
Total              497.03 ms / 39 tokens
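The throughput figures in the table follow directly from the reported timings. The short sketch below reproduces the arithmetic so the same check can be applied to future runs; the function name is an illustrative assumption.

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return tokens / (elapsed_ms / 1000.0)


# Timings reported by the local llama.cpp run.
PROMPT_MS, PROMPT_TOKENS = 313.68, 35
DECODE_MS, DECODE_TOKENS = 183.36, 4
TOTAL_MS, TOTAL_TOKENS = 497.03, 39

if __name__ == "__main__":
    print(f"prompt: {tokens_per_second(PROMPT_MS, PROMPT_TOKENS):.2f} tokens/s")
    print(f"decode: {tokens_per_second(DECODE_MS, DECODE_TOKENS):.2f} tokens/s")
    # The two phases should add up to the reported total, within rounding.
    print(f"phase sum: {PROMPT_MS + DECODE_MS:.2f} ms vs reported {TOTAL_MS:.2f} ms")
```

The computed rates agree with the reported 111.58 and 21.82 tokens/s to within rounding, and the two phase timings sum to within 0.01 ms of the reported total.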

Interpretation

The local run confirms that the model loads correctly, that all 65 layers are offloaded to GPU, and that the stack is operational. The prompt evaluation rate of 111.58 tokens/s is a positive signal for prompt ingestion on the current setup, although it was measured over only 35 prompt tokens and should be read as indicative rather than stable.

The decode result of 21.82 tokens/s should be treated carefully because it was measured over only four generated tokens. This is too short to represent stable long-generation throughput. A better future benchmark should use longer output, repeated runs, warm-up, and consistent logging of GPU clocks, power draw, temperature, and memory usage.
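One way to structure such a benchmark is sketched below: warm-up runs are discarded, measured runs that generate too few tokens are rejected, and the remaining results are summarized with mean and standard deviation. This is a minimal sketch, not the tooling used for the run above. `RunResult`, `benchmark`, and the `run_once` callable are hypothetical names, and the defaults of 5 runs and 256 tokens mirror the plan later in this note.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, Dict, List


@dataclass
class RunResult:
    prompt_tps: float    # prompt processing rate, tokens/s
    decode_tps: float    # token generation rate, tokens/s
    decode_tokens: int   # number of tokens actually generated


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation (%)."""
    avg = mean(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    cv = sd / avg * 100.0 if avg else 0.0
    return {"mean": avg, "stdev": sd, "cv_pct": cv}


def benchmark(
    run_once: Callable[[], RunResult],
    warmup: int = 1,
    runs: int = 5,
    min_decode_tokens: int = 256,
) -> Dict[str, Dict[str, float]]:
    """Run warm-up passes (discarded), then summarize the measured passes."""
    for _ in range(warmup):
        run_once()  # warm-up result is intentionally ignored
    results = [run_once() for _ in range(runs)]
    for r in results:
        if r.decode_tokens < min_decode_tokens:
            raise ValueError(
                f"run generated only {r.decode_tokens} tokens; "
                f"need at least {min_decode_tokens} for a stable tg/s figure"
            )
    return {
        "prompt": summarize([r.prompt_tps for r in results]),
        "decode": summarize([r.decode_tps for r in results]),
    }
```

In practice `run_once` would wrap a single request to the inference runtime and return its measured rates. That wrapper is left out here because it depends on how the server is launched and queried.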

Practical Benefits

This benchmark note provides four practical benefits:

  1. It documents a working MI50 local inference baseline.
  2. It separates external reference behavior from local results.
  3. It shows that memory clock may matter more than simply increasing the power cap for LLM token generation.
  4. It gives a repeatable structure for future tests on the same server.

Recommended Next Benchmark Plan

A stronger next run should use the same model and runtime across all tests:

Test Area   Recommended Method
Runtime     llama.cpp only
Model       Same GGUF across all tests
Prompt      Fixed prompt with fixed token length
Output      At least 256 generated tokens
Runs        5 runs after warm-up
Metrics     pp/s, tg/s, total time, GPU temperature, SCLK, MCLK, power
Comparison  Same VBIOS and same GPU count before changing one variable

The most important rule is to change only one variable at a time: model, quantization, context size, GPU count, clocks, or VBIOS.
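The one-variable rule can be enforced mechanically before two results are compared. The sketch below records each test configuration and reports which fields differ. `BenchConfig` and its field names are hypothetical, splitting clocks into SCLK and MCLK is a choice made here, and the clock and VBIOS values in the demo are placeholders taken from the external table rather than measured local settings.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class BenchConfig:
    model: str
    quantization: str
    context_size: int
    gpu_count: int
    sclk_mhz: int
    mclk_mhz: int
    vbios: str


def changed_fields(a: BenchConfig, b: BenchConfig) -> List[str]:
    """Names of the fields whose values differ between two configurations."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_controlled_comparison(a: BenchConfig, b: BenchConfig) -> bool:
    """True only when exactly one variable differs between the two runs."""
    return len(changed_fields(a, b)) == 1


if __name__ == "__main__":
    # Placeholder baseline: clocks and VBIOS are illustrative, not measured locally.
    baseline = BenchConfig(
        model="Qwen2.5-Coder-32B-Instruct",
        quantization="Q4_K_M",
        context_size=4096,
        gpu_count=3,
        sclk_mhz=2000,
        mclk_mhz=1000,
        vbios="V420.rom",
    )
    candidate = replace(baseline, mclk_mhz=1180)
    print(changed_fields(baseline, candidate), is_controlled_comparison(baseline, candidate))
```

Storing the configuration next to each result also makes it easier to audit later which runs are directly comparable.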

Conclusion

The AMD Instinct MI50 remains a useful low-cost accelerator for local AI workloads when VBIOS, ReBAR, ROCm, cooling, and runtime configuration are handled carefully. The public V420.rom data shows that LLM throughput can remain nearly flat from 178W to 300W when clocks are already favorable, while raising the memory clock from 1000 to 1180 MHz at 2000 MHz SCLK had a clearer effect of roughly 4.5% to 5%.

The local 3×MI50 llama.cpp run confirms functional multi-GPU offload with Qwen2.5-Coder-32B-Instruct Q4_K_M, reaching 111.58 tokens/s in prompt evaluation and 21.82 tokens/s in a short decode sample. The result is a valid operational baseline, not a final performance ceiling.

Future testing should use longer generations, multiple runs, and consistent telemetry. That will make it possible to determine whether the next improvement should come from VBIOS tuning, clock configuration, quantization choice, model selection, or runtime parameters.

References

  • evilJazz, AMD Instinct MI50 32GB VBIOS, GitHub Gist.
  • Local benchmark notes from the current 3×MI50 llama.cpp setup.