What you’ll build and learn

You’ll explore operator fusion in Halide, where each stage is computed inside its consumer instead of storing intermediate results. This approach reduces memory traffic and improves cache efficiency. You’ll also learn when it’s better to materialize intermediates using compute_root() or compute_at(), such as with large filters or when results are reused by multiple stages. By the end, you’ll understand how to choose between fusion and materialization for real-time image processing on Arm devices.

You’ll also use print_loop_nest() to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s.

Note on scope

This section doesn’t cover loop fusion using the fuse directive. You’ll focus instead on operator fusion, which is Halide’s default behavior.

Explore the code

To explore how fusion in Halide works, create a new file called camera-capture-fusion.cpp and copy in the code below. The code runs a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants that toggle operator fusion versus materialization, and prints ms / FPS / MPix/s so you can see the impact immediately:

    

        
        
#include "Halide.h"
#include <opencv2/opencv.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>
#include <cstdint>
#include <exception>

using namespace Halide;
using namespace cv;
using namespace std;

enum class Schedule : int {
    Simple = 0,              // materialize gray + blur
    FuseBlurAndThreshold = 1,// materialize gray; fuse blur+threshold
    FuseAll = 2,             // fuse everything (default)
    Tile = 3,                // tile output; materialize gray per tile; blur fused
};

static const char* schedule_name(Schedule s) {
    switch (s) {
        case Schedule::Simple:               return "Simple";
        case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
        case Schedule::FuseAll:              return "FuseAll";
        case Schedule::Tile:                 return "Tile";
        default:                             return "Unknown";
    }
}

// Build the BGR->Gray -> 3x3 binomial blur -> threshold pipeline.
// Clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19).
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
    Var x("x"), y("y");

    // Assume 3-channel BGR interleaved frames (converted if needed).
    input.dim(0).set_stride(3);      // x-stride = channels
    input.dim(2).set_stride(1);      // c-stride = 1
    input.dim(2).set_bounds(0, 3);   // three channels

    Func inputClamped = BoundaryConditions::repeat_edge(input);

    // Gray (Rec.601)
    Func gray("gray");
    gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
                             + 0.587f * inputClamped(x, y, 1)
                             + 0.299f * inputClamped(x, y, 2));

    // 3x3 binomial blur (sum/16)
    Func blur("blur");
    const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
    Expr blurSum = cast<uint16_t>(0);
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 3; ++i)
            blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
    blur(x, y) = cast<uint8_t>(blurSum / 16);

    // Threshold (binary)
    Func thresholded("thresholded");
    Expr T = cast<uint8_t>(128);
    thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));

    // Final output
    Func output("output");
    output(x, y) = thresholded(x, y);
    output.compute_root(); // always realize 'output'

    // Scheduling to demonstrate OPERATOR FUSION vs MATERIALIZATION
    // Default in Halide = fusion/inlining (no schedule on producers).
    Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

    switch (schedule) {
        case Schedule::Simple:
            // Materialize gray and blur (two loop nests); thresholded fuses into output
            gray.compute_root();
            blur.compute_root();
            break;

        case Schedule::FuseBlurAndThreshold:
            // Materialize gray; blur and thresholded remain fused into output
            gray.compute_root();
            break;

        case Schedule::FuseAll:
            // No schedule on producers: gray, blur, thresholded all fuse into output
            break;

        case Schedule::Tile:
            // Tile the output; compute gray per tile; blur stays fused within tile
            output.tile(x, y, xo, yo, xi, yi, 64, 64);
            gray.compute_at(output, xo);
            break;
    }

    // (Optional) Print loop nest once to “x-ray” the schedule
    std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
    output.print_loop_nest();
    std::cout << "-----------------------------------------------\n";

    return Pipeline(output);
}

int main(int argc, char** argv) {
    // Optional CLI: start with a given schedule number 0..3
    Schedule current = Schedule::FuseAll;
    if (argc >= 2) {
        int s = std::atoi(argv[1]);
        if (s >= 0 && s <= 3) current = static_cast<Schedule>(s);
    }
    std::cout << "Starting with schedule: " << schedule_name(current)
              << " (press 0..3 to switch; q/Esc to quit)\n";

    // Open camera
    VideoCapture cap(0);
    if (!cap.isOpened()) {
        std::cerr << "Error: Unable to open camera.\n";
        return 1;
    }
    cap.set(CAP_PROP_CONVERT_RGB, true); // ask OpenCV for BGR frames

    // Grab one frame to get size/channels
    Mat frame;
    cap >> frame;
    if (frame.empty()) {
        std::cerr << "Error: empty first frame.\n";
        return 1;
    }
    if (frame.channels() == 4) {
        cvtColor(frame, frame, COLOR_BGRA2BGR);
    } else if (frame.channels() == 1) {
        cvtColor(frame, frame, COLOR_GRAY2BGR);
    }
    if (!frame.isContinuous()) frame = frame.clone();

    const int width  = frame.cols;
    const int height = frame.rows;

    // Halide inputs/outputs
    ImageParam input(UInt(8), 3, "input");
    Buffer<uint8_t, 2> outBuf(width, height, "out");

    // Build pipeline for the starting schedule
    Pipeline pipe = make_pipeline(input, current);

    bool warmed_up = false;
    namedWindow("Fusion Demo (live)", WINDOW_NORMAL);

    for (;;) {
        cap >> frame;
        if (frame.empty()) break;
        if (frame.channels() == 4) {
            cvtColor(frame, frame, COLOR_BGRA2BGR);
        } else if (frame.channels() == 1) {
            cvtColor(frame, frame, COLOR_GRAY2BGR);
        }
        if (!frame.isContinuous()) frame = frame.clone();

        // Wrap interleaved frame
        Halide::Buffer<uint8_t> inputBuf = Runtime::Buffer<uint8_t>::make_interleaved(
            frame.data, frame.cols, frame.rows, frame.channels());
        
        input.set(inputBuf);

        // Time the Halide realize() only
        auto t0 = std::chrono::high_resolution_clock::now();
        try {
            pipe.realize(outBuf);
        } catch (const Halide::RuntimeError& e) {
            std::cerr << "Halide runtime error: " << e.what() << "\n";
            break;
        } catch (const std::exception& e) {
            std::cerr << "std::exception: " << e.what() << "\n";
            break;
        }
        auto t1 = std::chrono::high_resolution_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        double fps = ms > 0.0 ? 1000.0 / ms : 0.0;
        double mpixps = ms > 0.0 ? (double(width) * double(height)) / (ms * 1000.0) : 0.0;

        std::cout << std::fixed << std::setprecision(2)
                  << (warmed_up ? "" : "[warm-up] ")
                  << schedule_name(current) << " | "
                  << ms << " ms  |  "
                  << fps << " FPS  |  "
                  << mpixps << " MPix/s\r" << std::flush;
        warmed_up = true;

        // Show result
        Mat view(height, width, CV_8UC1, outBuf.data());
        imshow("Fusion Demo (live)", view);
        int key = waitKey(1);
        if (key == 27 || key == 'q' || key == 'Q') break;

        // Hotkeys 0..3 to switch schedules live
        if (key >= '0' && key <= '3') {
            Schedule next = static_cast<Schedule>(key - '0');
            if (next != current) {
                std::cout << "\nSwitching to schedule: " << schedule_name(next) << "\n";
                current = next;
                try {
                    pipe = make_pipeline(input, current); // rebuild JIT with new schedule
                } catch (const Halide::CompileError& e) {
                    std::cerr << "Halide compile error: " << e.what() << "\n";
                    break;
                }
                warmed_up = false; // next frame includes JIT, label as warm-up
            }
        }
    }

    std::cout << "\n";
    destroyAllWindows();
    return 0;
}

    

The heart of this program is the make_pipeline function. This function builds the camera processing pipeline in Halide and lets you switch between different scheduling modes. Each mode changes how intermediate results are handled, by either fusing stages together to minimize memory use, or materializing them to avoid recomputation. By adjusting the schedule, you can see how these choices affect both the loop structure and the real-time performance of your image processing pipeline.

Start by declaring Var x, y to represent pixel coordinates. The camera frames use a 3-channel interleaved BGR format. This means:

  • The stride along the x-axis is 3, because each step moves across all three color channels.
  • The stride along the channel axis (c) is 1, so channels are stored contiguously.
  • The channel bounds are set from 0 to 2, covering the three BGR channels.

These settings tell Halide exactly how the image data is organized in memory, so it can process each pixel and channel correctly.

To avoid errors when applying filters near the edges of an image, clamp the input at the borders. In Halide 19, you can use BoundaryConditions::repeat_edge directly on an ImageParam, because it includes dimension information. This ensures that all stages in your pipeline can safely access pixels, even at the image boundaries.

    

        
        
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
    Var x("x"), y("y");

    // (a) Interleaved constraints for BGR frames
    input.dim(0).set_stride(3);      // x stride = channels
    input.dim(2).set_stride(1);      // channel stride = 1
    input.dim(2).set_bounds(0, 3);   // channels = 0..2

    // (b) Border handling: clamp the *ImageParam* (works cleanly in Halide 19)
    Func inputClamped = BoundaryConditions::repeat_edge(input);

    

The next stage converts the image to grayscale. Use the Rec.601 weights for BGR to gray conversion, just like in the previous section. For the blur, apply a 3×3 binomial kernel with values:

    

        
        
1 2 1
2 4 2
1 2 1

    

This kernel closely approximates a Gaussian filter. Instead of using Halide’s reduction domain (RDom), unroll the sum directly in C++ using two nested loops over the kernel values. For each pixel, calculate the weighted sum of its 3×3 neighborhood and divide by 16 to get the blurred result. This approach makes the computation straightforward and easy to follow.

Now, add a threshold stage to your pipeline. This stage checks each pixel value after the blur and sets it to white (255) if it’s above 128, or black (0) otherwise. This produces a binary image, making it easy to see which areas are brighter than the threshold.

Here’s how you define the thresholded stage and the output Func:

    

        
        
// Threshold (binary)
Func thresholded("thresholded");
Expr T = cast<uint8_t>(128);
thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));

// Final output
Func output("output");
output(x, y) = thresholded(x, y);
output.compute_root(); // Realize 'output' explicitly when running the pipeline

    

This setup ensures that the output is a binary image, and Halide will compute and store the result when you run the pipeline. By calling compute_root() on the output Func, you tell Halide to materialize the final result, making it available for display or further processing.

Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.

  • Simple: Here you’ll explicitly compute and store both gray and blur across the whole frame with compute_root(). This makes them easy to reuse or parallelize, but requires extra memory traffic.
  • FuseBlurAndThreshold: You compute gray once as a planar buffer, but leave blur and thresholded fused into output. This often works well when the input is interleaved, because subsequent stages read from a planar gray.
  • FuseAll: You’ll apply no scheduling to producers, so gray, blur, and thresholded are all inlined into output. This minimizes memory usage but can recompute gray many times inside the 3×3 stencil.
  • Tile: You’ll split the output into 64×64 tiles. Within each tile, you materialize gray (compute_at(output, xo)), so the working set is small and stays in cache. blur remains fused within each tile.

To help you examine what’s happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation.

    

        
        
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

switch (schedule) {
    case Schedule::Simple:
        // Materialize gray and blur as whole-frame buffers.
        gray.compute_root();
        blur.compute_root();
        break;

    case Schedule::FuseBlurAndThreshold:
        // Materialize only gray; leave blur+threshold fused into output.
        gray.compute_root();
        break;

    case Schedule::FuseAll:
        // No schedules on producers → gray, blur, thresholded all inline into output.
        break;

    case Schedule::Tile:
        // Tile the output; compute gray per tile; blur stays fused inside each tile.
        output.tile(x, y, xo, yo, xi, yi, 64, 64);
        gray.compute_at(output, xo);
        break;
}

// Optional: print loop nest to “x-ray” the shape of the generated loops
std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
output.print_loop_nest();
std::cout << "-----------------------------------------------\n";

return Pipeline(output);
}

    

All the camera handling is just like before: open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. Time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].

The new part is that you can toggle scheduling modes from the keyboard while the application is running:

Keys:
  • 0 – Simple (materialize gray and blur)
  • 1 – FuseBlurAndThreshold (materialize gray; fuse blur+threshold)
  • 2 – FuseAll (default fusion: fuse gray+blur+threshold)
  • 3 – Tile (tile output; materialize gray per tile; blur fused inside tile)
  • q / Esc – quit

Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule:

  1. Map the key to a Schedule enum value.
  2. Call make_pipeline(input, next) to construct the new scheduled pipeline.
  3. Reset the warm-up flag, so the next line of stats is labeled [warm-up] (that frame includes JIT).
  4. The main loop keeps grabbing frames; only the Halide schedule changes.

This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.

Now, build and run the sample:

    

        
        
g++ -std=c++17 camera-capture-fusion.cpp -o camera-capture-fusion \
  -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
  $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
  -Wl,-rpath,/path/to/halide/lib
./camera-capture-fusion

    

You’ll see output similar to the following (timings vary with your machine and camera):

    

        
        % ./camera-capture-fusion
Starting with schedule: FuseAll (press 0..3 to switch; q/Esc to quit)

---- Loop structure (FuseAll) ----
produce output:
  for y:
    for x:
      output(...) = ...
-----------------------------------------------
FuseAll | 18.90 ms  |  52.92 FPS  |  109.74 MPix/s
Switching to schedule: FuseBlurAndThreshold

---- Loop structure (FuseBlurAndThreshold) ----
produce gray:
  for y:
    for x:
      gray(...) = ...
consume gray:
  produce output:
    for y:
      for x:
        output(...) = ...
-----------------------------------------------
FuseBlurAndThreshold | 4.85 ms  |  206.19 FPS  |  427.55 MPix/s
Switching to schedule: FuseAll

---- Loop structure (FuseAll) ----
produce output:
  for y:
    for x:
      output(...) = ...
-----------------------------------------------
FuseAll | 18.14 ms  |  55.12 FPS  |  114.30 MPix/s
Switching to schedule: Tile

---- Loop structure (Tile) ----
produce output:
  for y.yo:
    for x.xo:
      produce gray:
        for y:
          for x:
            gray(...) = ...
      consume gray:
        for y.yi in [0, 63]:
          for x.xi in [0, 63]:
            output(...) = ...
-----------------------------------------------
Tile | 4.98 ms  |  200.73 FPS  |  416.23 MPix/s
Switching to schedule: Simple

---- Loop structure (Simple) ----
produce gray:
  for y:
    for x:
      gray(...) = ...
consume gray:
  produce blur:
    for y:
      for x:
        blur(...) = ...
  consume blur:
    produce output:
      for y:
        for x:
          output(...) = ...
-----------------------------------------------
Simple | 6.01 ms  |  166.44 FPS  |  345.12 MPix/s

        
    

The console output combines two kinds of information:

  1. Loop nests – printed by print_loop_nest(). These show how Halide actually arranges the computation for the chosen schedule. They’re a great “x-ray” view of fusion and materialization:
  • In FuseAll, the loop nest contains only output. That’s because gray, blur, and thresholded are all inlined (fused) into it. Each pixel of output recomputes its 3×3 neighborhood of gray.
  • In FuseBlurAndThreshold, there is an extra loop for gray, because we explicitly called gray.compute_root(). The blur and thresholded stages are still fused into output. This reduces recomputation of gray and makes downstream loops simpler to vectorize.
  • In Simple, both gray and blur have their own loop nests, and thresholded fuses into output. This introduces two extra buffers, but each stage is computed once and can be parallelized independently.
  • In Tile, you see the outer tile loops (y.yo and x.xo) and the inner per-tile loops (y.yi, x.xi). Inside each tile, gray is produced once and then consumed by the fused blur and threshold. This keeps the working set small and cache-friendly.
  2. Performance metrics – printed after each realize(). They report:
  • ms – the time the realize() call took for the current frame.
  • FPS – frames per second (1000 / ms).
  • MPix/s – millions of pixels per second processed.

Comparing the numbers:

  • FuseAll runs at ~53 FPS. It has minimal memory traffic but pays for recomputation of gray under the blur.
  • FuseBlurAndThreshold jumps to over 200 FPS. By materializing gray, redundant recomputation is avoided and blur+threshold stays fused. This is often the sweet spot for interleaved camera input.
  • Simple reaches ~166 FPS. Both gray and blur are materialized, so no recomputation occurs, but memory traffic is higher than in FuseBlurAndThreshold.
  • Tile achieves similar speed (~200 FPS). Producing gray per tile balances recomputation and memory traffic by keeping intermediates local to cache.

By toggling schedules live, you can see and measure how operator fusion and materialization change both the loop structure and the throughput:

  • Fusion is the default in Halide and eliminates temporary storage, but may cause recomputation for spatial filters.
  • Materializing selected stages with compute_root() or compute_at() can reduce recomputation and improve locality. It can also make vectorization and parallelization easier or more effective, but they are not strictly required by materialization and can be applied independently. For best performance, consider these choices together and measure on your target.
  • Tile-level materialization (compute_at) provides a hybrid - fusing within tiles while keeping intermediates small and cache-resident.

This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.

What “fusion” means in Halide

Halide’s defining feature is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren’t stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer’s loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.

Why is this important? Fusion reduces memory traffic, because Halide doesn’t need to write intermediates out to RAM and read them back again. On CPUs, where memory bandwidth is often the bottleneck, this can be a major performance win. Fusion also improves cache locality, since values are computed exactly where they are needed and the working set stays small. The trade-off, however, is that fusion can cause recomputation: if a consumer uses a neighborhood (like a blur that reads 3×3 or 9×9 pixels), the fused producer may be recalculated multiple times for overlapping regions. Whether fusion is faster depends on the balance between compute cost and memory traffic.

Consider the difference in pseudocode:

    

        
        
for y:
  for x:
    out(x,y) = threshold( sum_{i,j in 3x3} kernel(i,j) * gray(x+i,y+j) )
    // gray(...) is computed on the fly for each (i,j)

    

Materialized with compute_root():

    

        
        
for y: for x: gray(x,y) = ...                // write one planar gray image
for y: for x: out(x,y) = threshold( sum kernel * gray(x+i,y+j) )

    

The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also provides a clean point to parallelize or vectorize the gray stage.

Note that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That’s a different concept and not the focus here. This tutorial focuses specifically on operator fusion—the decision of whether to inline or materialize stages.

How this looks in the live camera demo

The pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, different kinds of fusion are shown:

  • FuseAll. No schedules on producers. gray, blur, and thresholded are all inlined into output. This minimizes memory traffic but recomputes gray repeatedly inside the 3×3 blur.
  • FuseBlurAndThreshold: Adding gray.compute_root() materializes gray once as a planar buffer. This avoids recomputation of gray and makes downstream blur and thresholded vectorize better. blur and thresholded remain fused.
  • Simple. Both gray and blur are materialized across the frame. This avoids recomputation entirely but increases memory traffic.
  • Tile. The output is split into 64×64 tiles and gray is computed per tile (compute_at(output, xo)). This keeps intermediate results local to cache while still fusing blur inside each tile.

By toggling between these modes in the live demo, you can see how the loop nests and throughput numbers change, which makes the abstract idea of fusion much more concrete.

When to use operator fusion

Fusion is Halide’s default and usually the right place to start. It’s especially effective for:

  • Element-wise chains, where each pixel is transformed independently: examples include intensity scaling or offset, gamma correction, channel mixing, color-space conversions, and logical masking.
  • Cheap post-ops after spatial filters: for instance, there’s no reason to materialize a blurred image to threshold it. Fuse the threshold directly into the blur’s consumer.

In the code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse.

When to materialize instead of fuse

Fusion isn’t always best. You’ll want to materialize an intermediate (compute_root() or compute_at()) if:

  • The producer would be recomputed many times under a large stencil.
  • The producer is read from an interleaved source and it’s easier to vectorize a planar buffer.
  • The intermediate is reused by multiple consumers.
  • You need a natural stage to apply parallelization or tiling.

Profiling

The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official Halide profiling tutorial.

Summary

You’ve seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with compute_root() or compute_at() can help, especially for large stencils or multi-use buffers. By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks.
