Benchmark Memory
This guide shows how to measure HBM and DDR memory bandwidth on a V80 board
using v80-smi validate and the performance example.
Prerequisites
The SLASH stack is installed,
vrtdis running, and a V80 board is visible inv80-smi list.Root or sufficient permissions for device reset (
v80-smi validateperforms a full reset before testing).
Quick Start — v80-smi validate
The fastest way to benchmark memory is with the built-in validate command:
v80-smi validate -d <BDF> [-j <threads>]
<BDF>— board address fromv80-smi list(e.g.0000:03:00).[threads]— number of parallel buffers (default: 8). Each buffer is 64 MB (one allocator sub-region).
Example:
v80-smi validate -d 0000:03:00
v80-smi validate -d 0000:03:00 -j 4 # 4 parallel buffers
What validate Measures
The command runs three phases:
Device reset — performs a full hotplug reset sequence via
vrtdto bring the board to a clean state.HBM test — allocates N buffers in HBM, then:
Integrity check: fills each buffer with an XOR pattern (
data[i] = i ^ seed), syncs host-to-device, clears the host copy, syncs device-to-host, and verifies every word matches.Bandwidth measurement: launches N threads in parallel. Each thread performs a full-buffer host-to-device (H2C) transfer, then a device-to-host (C2H) transfer. Wall-clock time is recorded for each direction.
DDR test — repeats the same integrity and bandwidth measurements using DDR memory.
Output includes per-direction bandwidth in MB/s and a pass/fail verdict for data integrity.
Using Example 05 (perf)
The 05_perf example provides a more configurable benchmark that runs
inside your own application. Build it against the repository:
cd examples/05_perf
cmake -B build -S . -G Ninja -DSLASH_USE_REPO=ON
cmake --build build
cmake --build build --target hls
cmake --build build --target perf_hw # or perf_emu / perf_sim
Run:
./05_perf <BDF> perf_hw.vbin
This example allocates buffers across HBM and DDR banks and measures round-trip throughput, giving you a baseline for your own kernel designs.
Interpreting Results
Metric |
Direction |
Description |
|---|---|---|
Write bandwidth |
H2C (host-to-device) |
Rate at which data is DMA’d from host memory to the device. |
Read bandwidth |
C2H (device-to-host) |
Rate at which data is DMA’d from the device back to host memory. |
Data integrity |
Both |
PASS if every word survives the round-trip; FAIL indicates a transfer or memory error. |
HBM typically delivers significantly higher aggregate bandwidth than DDR due to its 32 independent channels. Increasing the thread count can improve utilisation of these parallel channels.
Note
Bandwidth numbers depend on PCIe link width and generation, BIOS settings, IOMMU configuration, and host CPU/memory speed. Results are most useful for relative comparisons (e.g. before/after a configuration change) rather than absolute guarantees.
Tuning Parameters
Thread count — more parallel buffers can saturate more HBM channels, but returns diminish once the PCIe link is the bottleneck.
Buffer size — each buffer is 64 MB by default (one allocator sub-region). The validate command does not expose a size parameter; use the VRT API directly if you need different sizes.
See Also
Memory Model — how HBM and DDR banks are organised.
Buffers and Memory — buffer allocation and synchronisation in application code.
Command Reference — full
v80-smicommand reference.