Chain Streaming Kernels
This guide shows how to connect multiple HLS kernels via AXI-Stream to form a processing pipeline where data flows kernel-to-kernel without touching device memory.
Prerequisites
The SLASH stack is installed, vrtd is running, and a V80 board is visible.
Familiarity with HLS kernel basics. See Your First Kernel.
Streaming Pipeline Concept
In a streaming pipeline, kernels are wired together through on-chip AXI-Stream channels. Data bypasses device memory entirely between stages:
Host Memory ──► [dma_in] ──axis──► [passthrough] ──axis──► [dma_out] ──► Host Memory
dma_in — reads from device memory and writes to a stream.
passthrough — a freerunning kernel that processes each element as it arrives (in this example, a simple pass-through).
dma_out — reads from a stream and writes to device memory.
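The data flow above can be modelled in ordinary host C++ before any hardware is built. The following is a minimal sketch, not the example design itself: std::queue stands in for the on-chip AXI-Stream channels, std::vector stands in for device memory, and the *_model function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Software stand-in for an AXI-Stream channel (assumption: a plain FIFO).
using Stream = std::queue<std::uint64_t>;

// Models dma_in: copies each element from "device memory" onto the stream.
void dma_in_model(const std::vector<std::uint64_t>& mem, Stream& out) {
    for (std::uint64_t v : mem) out.push(v);
}

// Models passthrough: forwards whatever is available on the input stream.
void passthrough_model(Stream& in, Stream& out) {
    while (!in.empty()) {
        out.push(in.front());
        in.pop();
    }
}

// Models dma_out: drains up to `size` elements from the stream into "device memory".
std::vector<std::uint64_t> dma_out_model(Stream& in, std::size_t size) {
    std::vector<std::uint64_t> mem;
    mem.reserve(size);
    for (std::size_t i = 0; i < size && !in.empty(); ++i) {
        mem.push_back(in.front());
        in.pop();
    }
    return mem;
}

// Runs the three stages back to back and returns what lands in output memory.
std::vector<std::uint64_t> run_pipeline_model(const std::vector<std::uint64_t>& input) {
    Stream a, b;  // the two stream links between the three kernels
    dma_in_model(input, a);
    passthrough_model(a, b);
    return dma_out_model(b, input.size());
}
```

Unlike the hardware, this model runs the stages one after another rather than concurrently, so it only illustrates the ordering of data through the chain, not its timing.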
Writing Streaming HLS Kernels
DMA-In Kernel (Stream Producer)
The DMA-in kernel reads from a memory-mapped port and pushes each element onto an AXI-Stream output:
#include <ap_int.h>
#include <hls_stream.h>

void dma_in(ap_uint<64>* in, hls::stream<ap_uint<64>>& axis_out, ap_uint<32> size) {
#pragma HLS INTERFACE mode=s_axilite port=size
#pragma HLS INTERFACE mode=axis port=axis_out
#pragma HLS INTERFACE m_axi bundle=gmem0 port=in max_widen_bitwidth=64
#pragma HLS INTERFACE mode=s_axilite port=return
    // Push each 64-bit word from memory onto the output stream.
    for (ap_uint<32> i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        axis_out.write(in[i]);
    }
}
Key pragmas:
m_axi: memory-mapped master for the input buffer.
axis: AXI-Stream output port.
s_axilite port=return: lets the host start the kernel and poll it for completion.
Freerunning Kernel (Stream Processor)
A freerunning kernel has no host control interface. It runs continuously, processing data whenever the input stream has elements:
#include <ap_int.h>
#include <hls_stream.h>

void passthrough(hls::stream<ap_uint<64>>& axis_in, hls::stream<ap_uint<64>>& axis_out) {
#pragma HLS INTERFACE axis port=axis_in
#pragma HLS INTERFACE axis port=axis_out
#pragma HLS INTERFACE ap_ctrl_none port=return
    ap_uint<64> data;
    // Runs forever: forward a word whenever one is available on the input.
    while (true) {
#pragma HLS PIPELINE II=1
        if (!axis_in.empty()) {
            data = axis_in.read();
            axis_out.write(data);
        }
    }
}
The ap_ctrl_none pragma is critical — it removes the start/done/idle
control registers, making the kernel autonomous. You do not call
kernel.start() or kernel.wait() for freerunning kernels.
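The pass-through body is where real per-element work would go. As a minimal sketch, assuming a stage that reverses the byte order of each word, the transform can be written and checked as ordinary C++ before it is placed between the read and the write in the kernel loop. The function name swap_bytes64 is hypothetical and not part of the example design. It operates on a plain 64-bit integer; converting to and from ap_uint<64> at the call site is left out here.

```cpp
#include <cstdint>

// Hypothetical per-element transform: reverses the byte order of a 64-bit word.
// Not part of the example design; it only illustrates where processing would go.
std::uint64_t swap_bytes64(std::uint64_t word) {
    std::uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = (result << 8) | (word & 0xFFu);  // append the current lowest byte
        word >>= 8;                               // move on to the next byte
    }
    return result;
}
```

Keeping the transform a pure function of one input word makes it easy to test on the host and keeps the kernel loop itself unchanged apart from the one call.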
DMA-Out Kernel (Stream Consumer)
The DMA-out kernel reads from a stream and writes each element to device memory:
#include <ap_int.h>
#include <hls_stream.h>

void dma_out(ap_uint<32> size, hls::stream<ap_uint<64>>& axis_in, ap_uint<64>* out) {
#pragma HLS INTERFACE mode=s_axilite port=size
#pragma HLS INTERFACE mode=axis port=axis_in
#pragma HLS INTERFACE m_axi bundle=gmem0 port=out max_widen_bitwidth=64
#pragma HLS INTERFACE mode=s_axilite port=return
    // Pop each 64-bit word from the input stream and store it to memory.
    for (ap_uint<32> i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<64> val;
        axis_in.read(val);
        out[i] = val;
    }
}
Linker Configuration
Connect the kernels with stream_connect directives in config.cfg:
[connectivity]
nk=dma_in:1:dma_in_0
nk=passthrough:1:passthrough_0
nk=dma_out:1:dma_out_0
stream_connect=dma_in_0.axis_out:passthrough_0.axis_in
stream_connect=passthrough_0.axis_out:dma_out_0.axis_in
nk: instantiates each kernel (same syntax as non-streaming designs).
stream_connect: wires AXI-Stream ports between kernel instances using <instance>.<port>:<instance>.<port> syntax.
No sp= lines are needed for the streaming ports themselves. Only the
memory-mapped ports on dma_in and dma_out require memory mapping, which
the linker assigns automatically when no explicit sp= is given.
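For longer chains, the connectivity section follows a regular pattern: one nk line per kernel and one stream_connect line per adjacent pair. The sketch below generates that text for a linear chain. It is a hypothetical helper, not part of the SLASH tooling, and it assumes one instance per kernel named <kernel>_0, as in the configuration above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One pipeline stage. An empty port name means "no stream port on that side".
struct Stage {
    std::string kernel;    // HLS top-function name, e.g. "dma_in"
    std::string in_port;   // AXI-Stream input port, "" for the first stage
    std::string out_port;  // AXI-Stream output port, "" for the last stage
};

// Builds a [connectivity] section for a linear chain of stages,
// with one instance per kernel named <kernel>_0.
std::string chain_connectivity(const std::vector<Stage>& stages) {
    std::string cfg = "[connectivity]\n";
    for (const Stage& s : stages) {
        cfg += "nk=" + s.kernel + ":1:" + s.kernel + "_0\n";
    }
    // Connect each stage's output port to the next stage's input port.
    for (std::size_t i = 0; i + 1 < stages.size(); ++i) {
        cfg += "stream_connect=" + stages[i].kernel + "_0." + stages[i].out_port +
               ":" + stages[i + 1].kernel + "_0." + stages[i + 1].in_port + "\n";
    }
    return cfg;
}
```

Called with the three stages of this guide (dma_in, passthrough, dma_out and their axis_in/axis_out ports), it reproduces the config.cfg content shown above. A design that uses the same kernel more than once would need distinct instance names, which this sketch does not handle.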
Host Application
In the host code, only the DMA endpoint kernels need to be controlled. The
freerunning passthrough kernel gets no host-side kernel object:
vrt::Kernel dma_in(device, "dma_in_0");
vrt::Kernel dma_out(device, "dma_out_0");
// passthrough_0 is freerunning — no host handle needed
Allocate buffers using argMemoryConfig() so the VRT runtime automatically
selects the correct memory bank for each kernel’s memory-mapped argument:
vrt::Buffer<uint64_t> buffer_in(device, size, dma_in.argMemoryConfig("in"));
vrt::Buffer<uint64_t> buffer_out(device, size, dma_out.argMemoryConfig("out"));
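Sync the input buffer to the device, set the arguments, start both DMA kernels, wait for them to finish, and sync the output buffer back to the host so it can be verified: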
buffer_in.sync(vrt::SyncType::HOST_TO_DEVICE);
dma_in.setArg(0, buffer_in);
dma_in.setArg(1, size);
dma_out.setArg(0, size);
dma_out.setArg(1, buffer_out);
dma_in.start();
dma_out.start();
dma_in.wait();
dma_out.wait();
buffer_out.sync(vrt::SyncType::DEVICE_TO_HOST);
Note
Both dma_in and dma_out must be started. If dma_out is not
ready to consume data, the pipeline will stall due to back-pressure.
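The host snippet above stops after copying the result back. One way to finish the check is to look for the first element that differs from the input. The sketch below works on plain pointers because this guide does not show how host data is read out of a vrt::Buffer; that access is left as an assumption. The helper name first_mismatch is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the first element where `actual` differs from `expected`,
// or `count` when all `count` elements match.
std::size_t first_mismatch(const std::uint64_t* expected,
                           const std::uint64_t* actual,
                           std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (expected[i] != actual[i]) return i;
    }
    return count;
}
```

For the pass-through pipeline, the expected data is the input data itself, so a return value equal to the element count means the run succeeded. Reporting the first failing index rather than a plain true/false makes dropped or shifted elements easier to spot.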
Build and Run
Ensure you have sourced Vivado and Vitis HLS before building:
source <path-to-vivado>/settings64.sh
source <path-to-vitis-hls>/settings64.sh
cd examples/02_chain
cmake -B build -S . -G Ninja -DSLASH_USE_REPO=ON
cmake --build build
cmake --build build --target hls
cmake --build build --target chain_hw # or chain_emu / chain_sim
./02_chain <BDF> chain_hw.vbin
Replace <BDF> with your board’s address from v80-smi list.
Key Design Considerations
ap_ctrl_none kernels cannot be started or stopped from the host. They run whenever data is available on their input streams.
Stream widths must match between connected ports. In this example all three kernels use ap_uint<64>.
Back-pressure is handled automatically: if a downstream kernel is not consuming, the upstream kernel stalls.
For multi-stage pipelines, extend the stream_connect chain in config.cfg.
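The back-pressure behaviour can be illustrated with a small cycle-based software model. This is a sketch under stated assumptions, not a description of the actual hardware: the channel is modelled as a bounded FIFO whose depth is chosen arbitrarily, the producer tries to write one element per cycle, and the consumer reads one element every consumer_period cycles. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct SimResult {
    std::vector<std::uint64_t> received;  // elements seen by the consumer, in order
    std::size_t producer_stalls;          // cycles in which the producer had to wait
};

// Simulates `count` elements crossing a FIFO of `depth` entries.
// Requires depth >= 1 and consumer_period >= 1; otherwise returns an empty result.
SimResult simulate(std::size_t count, std::size_t depth, std::size_t consumer_period) {
    SimResult r;
    r.producer_stalls = 0;
    if (depth == 0 || consumer_period == 0) return r;

    std::deque<std::uint64_t> fifo;
    std::size_t next = 0;  // next value the producer wants to send
    for (std::size_t cycle = 0; r.received.size() < count; ++cycle) {
        // Consumer: reads one element on its active cycles if data is present.
        if (cycle % consumer_period == 0 && !fifo.empty()) {
            r.received.push_back(fifo.front());
            fifo.pop_front();
        }
        // Producer: writes if there is room, otherwise stalls for this cycle.
        if (next < count) {
            if (fifo.size() >= depth) {
                ++r.producer_stalls;
            } else {
                fifo.push_back(next++);
            }
        }
    }
    return r;
}
```

With a consumer that keeps up (consumer_period of 1) the producer never stalls. With a slower consumer the producer accumulates stall cycles, but every element still arrives in order, which is the property the pipeline relies on.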
Next Steps
Your First Kernel — basic kernel authoring.
Buffers and Memory — buffer management for DMA endpoints.
Use CMake Modules — CMake setup for HLS and vrtbin linking.
Architecture — how streaming fits in the SLASH stack.