Chain Streaming Kernels

This guide shows how to connect multiple HLS kernels via AXI-Stream to form a processing pipeline where data flows kernel-to-kernel without touching device memory.

Prerequisites

The SLASH stack is installed, vrtd is running, and a V80 board is visible to the host (for example, listed by v80-smi list).
Familiarity with HLS kernel basics. See Your First Kernel.

Streaming Pipeline Concept

In a streaming pipeline, kernels are wired together through on-chip AXI-Stream channels. Data bypasses device memory entirely between stages:

Device Memory ──► [dma_in] ──axis──► [passthrough] ──axis──► [dma_out] ──► Device Memory

dma_in — reads from device memory and writes to a stream.
passthrough — a freerunning kernel that processes each element as it arrives (in this example, a simple pass-through).
dma_out — reads from a stream and writes to device memory.
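The data flow described above can be sketched as a plain C++ software model. This is only an illustration for reasoning about the pipeline on a host machine: std::queue stands in for an on-chip AXI-Stream channel, std::vector stands in for device memory, and all function names (model_dma_in, model_passthrough, model_dma_out, model_pipeline) are hypothetical and not part of SLASH or Vitis HLS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Stand-in for an on-chip AXI-Stream channel (hypothetical; not hls::stream).
using Channel = std::queue<uint64_t>;

// dma_in: read from "device memory" and push every element onto a stream.
void model_dma_in(const std::vector<uint64_t>& mem, Channel& out) {
    for (uint64_t v : mem) out.push(v);
}

// passthrough: forward each element that is available on the input stream.
void model_passthrough(Channel& in, Channel& out) {
    while (!in.empty()) {
        out.push(in.front());
        in.pop();
    }
}

// dma_out: pop `size` elements from a stream and store them in "device memory".
std::vector<uint64_t> model_dma_out(Channel& in, std::size_t size) {
    std::vector<uint64_t> mem;
    mem.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
        assert(!in.empty());  // hardware would block here; the model only checks
        mem.push_back(in.front());
        in.pop();
    }
    return mem;
}

// Run the three stages back to back; only the two end points touch memory.
std::vector<uint64_t> model_pipeline(const std::vector<uint64_t>& input) {
    Channel a, b;
    model_dma_in(input, a);
    model_passthrough(a, b);
    return model_dma_out(b, input.size());
}
```

Calling model_pipeline on any input vector returns an identical vector, which is the behaviour the hardware pipeline in this guide is expected to reproduce. In hardware the three stages run concurrently rather than one after another.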

Writing Streaming HLS Kernels

DMA-In Kernel (Stream Producer)

The DMA-in kernel reads from a memory-mapped port and pushes each element onto an AXI-Stream output:

void dma_in(ap_uint<64>* in, hls::stream<ap_uint<64>>& axis_out, ap_uint<32> size) {
    #pragma HLS INTERFACE mode=s_axilite port=size
    #pragma HLS INTERFACE mode=axis port=axis_out
    #pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=in max_widen_bitwidth=64
    #pragma HLS INTERFACE mode=s_axilite port=return

    for (ap_uint<32> i = 0; i < size; i++) {
        #pragma HLS PIPELINE II=1
        axis_out.write(in[i]);
    }
}

Key pragmas:

m_axi — memory-mapped master for the input buffer.
axis — AXI-Stream output port.
s_axilite port=return — allows the host to start and poll the kernel.

Freerunning Kernel (Stream Processor)

A freerunning kernel has no host control interface. It runs continuously, processing data whenever the input stream has elements:

void passthrough(hls::stream<ap_uint<64>>& axis_in, hls::stream<ap_uint<64>>& axis_out) {
    #pragma HLS INTERFACE axis port=axis_in
    #pragma HLS INTERFACE axis port=axis_out
    #pragma HLS INTERFACE ap_ctrl_none port=return

    ap_uint<64> data;
    while (true) {
        #pragma HLS PIPELINE II=1
        if (!axis_in.empty()) {
            data = axis_in.read();
            axis_out.write(data);
        }
    }
}

The ap_ctrl_none pragma is critical: it removes the block-level control handshake (start, done, idle), so the kernel has no control registers for the host to drive and runs autonomously. You do not call kernel.start() or kernel.wait() for freerunning kernels.
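Because the kernel loop never exits, it can be useful to reason about its body with a bounded software model. The sketch below is plain C++ under stated assumptions: ModelStream is a hypothetical stand-in that mimics only the empty/read/write calls used above (it is not the hls::stream class), and passthrough_model runs the same loop body for a fixed number of iterations so that it returns when executed as ordinary software.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Minimal stand-in for a stream with the empty/read/write calls used above.
// Hypothetical helper for host-side reasoning; not the hls::stream class.
struct ModelStream {
    std::deque<uint64_t> fifo;
    bool empty() const { return fifo.empty(); }
    uint64_t read() {
        assert(!fifo.empty());
        uint64_t v = fifo.front();
        fifo.pop_front();
        return v;
    }
    void write(uint64_t v) { fifo.push_back(v); }
};

// Same loop body as the passthrough kernel, but bounded to `iterations`
// so the function returns. Returns how many elements were forwarded.
std::size_t passthrough_model(ModelStream& in, ModelStream& out,
                              std::size_t iterations) {
    std::size_t forwarded = 0;
    for (std::size_t it = 0; it < iterations; ++it) {
        if (!in.empty()) {  // iterations with no input do nothing
            out.write(in.read());
            ++forwarded;
        }
    }
    return forwarded;
}
```

With two queued elements and five iterations, the model forwards both elements in order and spends the remaining iterations idle, which mirrors how the freerunning kernel waits for input without any host involvement.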

DMA-Out Kernel (Stream Consumer)

The DMA-out kernel reads from a stream and writes each element to device memory:

void dma_out(ap_uint<32> size, hls::stream<ap_uint<64>>& axis_in, ap_uint<64>* out) {
    #pragma HLS INTERFACE mode=s_axilite port=size
    #pragma HLS INTERFACE mode=axis port=axis_in
    #pragma HLS INTERFACE mode=m_axi bundle=gmem0 port=out max_widen_bitwidth=64
    #pragma HLS INTERFACE mode=s_axilite port=return

    for (ap_uint<32> i = 0; i < size; i++) {
        #pragma HLS PIPELINE II=1
        ap_uint<64> val;
        axis_in.read(val);
        out[i] = val;
    }
}

Linker Configuration

Connect the kernels with stream_connect directives in config.cfg:

[connectivity]
nk=dma_in:1:dma_in_0
nk=passthrough:1:passthrough_0
nk=dma_out:1:dma_out_0

stream_connect=dma_in_0.axis_out:passthrough_0.axis_in
stream_connect=passthrough_0.axis_out:dma_out_0.axis_in

nk — instantiates each kernel (same syntax as non-streaming designs).
stream_connect — wires AXI-Stream ports between kernel instances using <instance>.<port>:<instance>.<port> syntax.

No sp= lines are needed for the streaming ports themselves. Only the memory-mapped ports on dma_in and dma_out require memory mapping, which the linker assigns automatically when no explicit sp= is given.
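For longer chains, the stream_connect lines follow a regular pattern and can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of the SLASH tooling; it assumes every stage names its ports axis_out and axis_in, as the three kernels in this guide do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Build the stream_connect lines for a linear chain of kernel instances.
// Assumes each instance exposes ports named "axis_out" and "axis_in".
std::vector<std::string> stream_connect_lines(
        const std::vector<std::string>& instances) {
    std::vector<std::string> lines;
    for (std::size_t i = 0; i + 1 < instances.size(); ++i) {
        assert(!instances[i].empty() && !instances[i + 1].empty());
        // Format: <instance>.<port>:<instance>.<port>
        lines.push_back("stream_connect=" + instances[i] + ".axis_out:" +
                        instances[i + 1] + ".axis_in");
    }
    return lines;
}
```

Passing the instance names dma_in_0, passthrough_0, and dma_out_0 yields the two stream_connect lines shown in the config.cfg above. A chain of N instances always produces N - 1 lines.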

Host Application

In the host code, only the DMA endpoint kernels need to be controlled. The freerunning passthrough kernel is part of the linked design but needs no host-side handle:

vrt::Kernel dma_in(device, "dma_in_0");
vrt::Kernel dma_out(device, "dma_out_0");
// passthrough_0 is freerunning — no host handle needed

Allocate buffers using argMemoryConfig() so the VRT runtime automatically selects the correct memory bank for each kernel’s memory-mapped argument:

vrt::Buffer<uint64_t> buffer_in(device, size, dma_in.argMemoryConfig("in"));
vrt::Buffer<uint64_t> buffer_out(device, size, dma_out.argMemoryConfig("out"));

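Copy the input to the device, set the arguments, start both DMA kernels, wait for them to finish, and copy the output back so it can be verified on the host. In this example the setArg indices skip the AXI-Stream ports, so size is argument 1 of dma_in although it is the third parameter of the HLS function: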

buffer_in.sync(vrt::SyncType::HOST_TO_DEVICE);

dma_in.setArg(0, buffer_in);
dma_in.setArg(1, size);
dma_out.setArg(0, size);
dma_out.setArg(1, buffer_out);

dma_in.start();
dma_out.start();
dma_in.wait();
dma_out.wait();

buffer_out.sync(vrt::SyncType::DEVICE_TO_HOST);
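The comparison step itself is not shown above. A minimal sketch, assuming the contents of buffer_in and buffer_out have already been copied into std::vector<uint64_t> objects (how to read host data out of a vrt::Buffer is not covered here, so that part is left out), could look like this. The function name count_mismatches is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count positions where the pipeline output differs from the expected data.
// A size difference counts every missing or extra element as one error.
std::size_t count_mismatches(const std::vector<uint64_t>& expected,
                             const std::vector<uint64_t>& actual) {
    const std::size_t common = std::min(expected.size(), actual.size());
    std::size_t errors = std::max(expected.size(), actual.size()) - common;
    for (std::size_t i = 0; i < common; ++i) {
        if (expected[i] != actual[i]) ++errors;
    }
    return errors;
}
```

For the pass-through pipeline in this guide the expected data is simply the input, so a result of zero indicates that every element arrived unchanged and in order.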

Note

Both dma_in and dma_out must be started. If dma_out is not ready to consume data, the pipeline will stall due to back-pressure.
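The stall can be illustrated with a small cycle-by-cycle simulation in plain C++. This is a simplified model under stated assumptions: one producer, one consumer, and a FIFO between them whose depth is an illustrative value rather than a property of the real interconnect. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Result of a cycle-by-cycle simulation of one producer and one consumer.
struct SimResult {
    std::size_t produced;  // elements the producer managed to write
    std::size_t consumed;  // elements the consumer read
};

// Simulate `cycles` clock cycles. The producer wants to send `total` elements
// through a FIFO of `depth` entries. The consumer only reads from cycle
// `consumer_start` onward, which models a kernel that is started late.
SimResult simulate_backpressure(std::size_t total, std::size_t depth,
                                std::size_t consumer_start,
                                std::size_t cycles) {
    assert(depth > 0);
    std::deque<int> fifo;
    SimResult r{0, 0};
    for (std::size_t c = 0; c < cycles; ++c) {
        // Consumer side: take one element per cycle once it has been started.
        if (c >= consumer_start && !fifo.empty()) {
            fifo.pop_front();
            ++r.consumed;
        }
        // Producer side: write only when the FIFO has room, otherwise stall.
        if (r.produced < total && fifo.size() < depth) {
            fifo.push_back(0);
            ++r.produced;
        }
    }
    return r;
}
```

With a consumer that never starts, the producer fills the FIFO and then makes no further progress, no matter how many cycles pass. With a consumer that starts at cycle 0, all elements flow through.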

Build and Run

Source the Vivado and Vitis HLS environment scripts before building:

source <path-to-vivado>/settings64.sh
source <path-to-vitis-hls>/settings64.sh

cd examples/02_chain
cmake -B build -S . -G Ninja -DSLASH_USE_REPO=ON
cmake --build build
cmake --build build --target hls
cmake --build build --target chain_hw    # or chain_emu / chain_sim

./02_chain <BDF> chain_hw.vbin

Replace <BDF> with your board’s PCIe address (bus:device.function), as reported by v80-smi list.

Key Design Considerations

ap_ctrl_none kernels cannot be started or stopped from the host. They run whenever data is available on their input streams.
Stream widths must match between connected ports. In this example all three kernels use ap_uint<64>.
Back-pressure is handled automatically by the AXI-Stream handshake: if a downstream kernel is not consuming, the upstream kernel stalls on its stream write.
For multi-stage pipelines, extend the stream_connect chain in config.cfg.
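The width-matching rule above can be checked at compile time in a host-side model. The sketch below is plain C++14 and purely illustrative: StagePorts and can_connect are hypothetical names, and uint64_t stands in for ap_uint<64> so that the example compiles without the HLS headers. It does not replace the checks performed by the linker.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical description of one kernel's stream ports by element type.
// `void` marks a side that is memory-mapped rather than a stream.
template <typename InT, typename OutT>
struct StagePorts {
    using in_type = InT;
    using out_type = OutT;
};

// True when `Up` can feed `Down` directly: the element types agree.
template <typename Up, typename Down>
constexpr bool can_connect =
    std::is_same<typename Up::out_type, typename Down::in_type>::value;

// The three stages of this guide, with uint64_t standing in for ap_uint<64>.
using DmaIn = StagePorts<void, uint64_t>;
using Passthrough = StagePorts<uint64_t, uint64_t>;
using DmaOut = StagePorts<uint64_t, void>;

// Compilation fails here if two connected ports disagree on element type.
static_assert(can_connect<DmaIn, Passthrough>,
              "dma_in -> passthrough stream type mismatch");
static_assert(can_connect<Passthrough, DmaOut>,
              "passthrough -> dma_out stream type mismatch");
```

Adding a stage to the chain means adding one alias and one static_assert per new connection, so a mismatched 32-bit stage feeding a 64-bit stage would be rejected by the compiler in this model.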

Next Steps

Your First Kernel — basic kernel authoring.
Buffers and Memory — buffer management for DMA endpoints.
Use CMake Modules — CMake setup for HLS and vrtbin linking.
Architecture — how streaming fits in the SLASH stack.