You’ll explore operator fusion in Halide, where each stage is computed inside its consumer instead of storing intermediate results. This approach reduces memory traffic and improves cache efficiency. You’ll also learn when it’s better to materialize intermediates using compute_root() or compute_at(), such as with large filters or when results are reused by multiple stages. By the end, you’ll understand how to choose between fusion and materialization for real-time image processing on Arm devices.
You’ll also use print_loop_nest() to see how Halide arranges the computation, switch between different scheduling modes (fuse all, fuse blur only, materialize, tile and materialize per tile) in a live camera pipeline, and measure the impact using ms, FPS, and MPix/s.
This section doesn’t cover loop fusion using the fuse directive. You’ll focus instead on operator fusion, which is Halide’s default behavior.
To explore how fusion works in Halide, create a new file called camera-capture-fusion.cpp and copy in the code below. The program runs a live camera pipeline (BGR → gray → 3×3 blur → threshold), adds a few schedule variants that let you toggle between operator fusion and materialization, and prints ms / FPS / MPix/s, so you can see the impact immediately:
#include "Halide.h"
#include <opencv2/opencv.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>
#include <cstdint>
#include <exception>
using namespace Halide;
using namespace cv;
using namespace std;
enum class Schedule : int {
Simple = 0, // materialize gray + blur
FuseBlurAndThreshold = 1,// materialize gray; fuse blur+threshold
FuseAll = 2, // fuse everything (default)
Tile = 3, // tile output; materialize gray per tile; blur fused
};
static const char* schedule_name(Schedule s) {
switch (s) {
case Schedule::Simple: return "Simple";
case Schedule::FuseBlurAndThreshold: return "FuseBlurAndThreshold";
case Schedule::FuseAll: return "FuseAll";
case Schedule::Tile: return "Tile";
default: return "Unknown";
}
}
// Build the BGR->Gray -> 3x3 binomial blur -> threshold pipeline.
// Clamp the *ImageParam* at the borders (Func clamp of ImageParam works in Halide 19).
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
Var x("x"), y("y");
// Assume 3-channel BGR interleaved frames (converted if needed).
input.dim(0).set_stride(3); // x-stride = channels
input.dim(2).set_stride(1); // c-stride = 1
input.dim(2).set_bounds(0, 3); // three channels
Func inputClamped = BoundaryConditions::repeat_edge(input);
// Gray (Rec.601)
Func gray("gray");
gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
+ 0.587f * inputClamped(x, y, 1)
+ 0.299f * inputClamped(x, y, 2));
// 3x3 binomial blur (sum/16)
Func blur("blur");
const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
Expr blurSum = cast<uint16_t>(0);
for (int j = 0; j < 3; ++j)
for (int i = 0; i < 3; ++i)
blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
blur(x, y) = cast<uint8_t>(blurSum / 16);
// Threshold (binary)
Func thresholded("thresholded");
Expr T = cast<uint8_t>(128);
thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
// Final output
Func output("output");
output(x, y) = thresholded(x, y);
output.compute_root(); // always realize 'output'
// Scheduling to demonstrate OPERATOR FUSION vs MATERIALIZATION
// Default in Halide = fusion/inlining (no schedule on producers).
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
switch (schedule) {
case Schedule::Simple:
// Materialize gray and blur (two loop nests); thresholded fuses into output
gray.compute_root();
blur.compute_root();
break;
case Schedule::FuseBlurAndThreshold:
// Materialize gray; blur and thresholded remain fused into output
gray.compute_root();
break;
case Schedule::FuseAll:
// No schedule on producers: gray, blur, thresholded all fuse into output
break;
case Schedule::Tile:
// Tile the output; compute gray per tile; blur stays fused within tile
output.tile(x, y, xo, yo, xi, yi, 64, 64);
gray.compute_at(output, xo);
break;
}
// (Optional) Print loop nest once to “x-ray” the schedule
std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
output.print_loop_nest();
std::cout << "-----------------------------------------------\n";
return Pipeline(output);
}
int main(int argc, char** argv) {
// Optional CLI: start with a given schedule number 0..3
Schedule current = Schedule::FuseAll;
if (argc >= 2) {
int s = std::atoi(argv[1]);
if (s >= 0 && s <= 3) current = static_cast<Schedule>(s);
}
std::cout << "Starting with schedule: " << schedule_name(current)
<< " (press 0..3 to switch; q/Esc to quit)\n";
// Open camera
VideoCapture cap(0);
if (!cap.isOpened()) {
std::cerr << "Error: Unable to open camera.\n";
return 1;
}
cap.set(CAP_PROP_CONVERT_RGB, true); // ask OpenCV for BGR frames
// Grab one frame to get size/channels
Mat frame;
cap >> frame;
if (frame.empty()) {
std::cerr << "Error: empty first frame.\n";
return 1;
}
if (frame.channels() == 4) {
cvtColor(frame, frame, COLOR_BGRA2BGR);
} else if (frame.channels() == 1) {
cvtColor(frame, frame, COLOR_GRAY2BGR);
}
if (!frame.isContinuous()) frame = frame.clone();
const int width = frame.cols;
const int height = frame.rows;
// Halide inputs/outputs
ImageParam input(UInt(8), 3, "input");
Buffer<uint8_t, 2> outBuf(width, height, "out");
// Build pipeline for the starting schedule
Pipeline pipe = make_pipeline(input, current);
bool warmed_up = false;
namedWindow("Fusion Demo (live)", WINDOW_NORMAL);
for (;;) {
cap >> frame;
if (frame.empty()) break;
if (frame.channels() == 4) {
cvtColor(frame, frame, COLOR_BGRA2BGR);
} else if (frame.channels() == 1) {
cvtColor(frame, frame, COLOR_GRAY2BGR);
}
if (!frame.isContinuous()) frame = frame.clone();
// Wrap interleaved frame
Halide::Buffer<uint8_t> inputBuf = Runtime::Buffer<uint8_t>::make_interleaved(
frame.data, frame.cols, frame.rows, frame.channels());
input.set(inputBuf);
// Time the Halide realize() only
auto t0 = std::chrono::high_resolution_clock::now();
try {
pipe.realize(outBuf);
} catch (const Halide::RuntimeError& e) {
std::cerr << "Halide runtime error: " << e.what() << "\n";
break;
} catch (const std::exception& e) {
std::cerr << "std::exception: " << e.what() << "\n";
break;
}
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
double fps = ms > 0.0 ? 1000.0 / ms : 0.0;
double mpixps = ms > 0.0 ? (double(width) * double(height)) / (ms * 1000.0) : 0.0;
std::cout << std::fixed << std::setprecision(2)
<< (warmed_up ? "" : "[warm-up] ")
<< schedule_name(current) << " | "
<< ms << " ms | "
<< fps << " FPS | "
<< mpixps << " MPix/s\r" << std::flush;
warmed_up = true;
// Show result
Mat view(height, width, CV_8UC1, outBuf.data());
imshow("Fusion Demo (live)", view);
int key = waitKey(1);
if (key == 27 || key == 'q' || key == 'Q') break;
// Hotkeys 0..3 to switch schedules live
if (key >= '0' && key <= '3') {
Schedule next = static_cast<Schedule>(key - '0');
if (next != current) {
std::cout << "\nSwitching to schedule: " << schedule_name(next) << "\n";
current = next;
try {
pipe = make_pipeline(input, current); // rebuild JIT with new schedule
} catch (const Halide::CompileError& e) {
std::cerr << "Halide compile error: " << e.what() << "\n";
break;
}
warmed_up = false; // next frame includes JIT, label as warm-up
}
}
}
std::cout << "\n";
destroyAllWindows();
return 0;
}
The heart of this program is the make_pipeline function. This function builds the camera processing pipeline in Halide and lets you switch between different scheduling modes. Each mode changes how intermediate results are handled, by either fusing stages together to minimize memory use, or materializing them to avoid recomputation. By adjusting the schedule, you can see how these choices affect both the loop structure and the real-time performance of your image processing pipeline.
Start by declaring Var x, y to represent pixel coordinates. The camera frames use a 3-channel interleaved BGR format. This means:

- the stride along x is 3, because neighboring pixels are three bytes apart (one byte per channel),
- the stride along the channel dimension is 1, because a pixel's B, G, and R bytes sit next to each other,
- the channel dimension has exactly three entries (indices 0..2).

These settings tell Halide exactly how the image data is organized in memory, so it can process each pixel and channel correctly.
To avoid errors when applying filters near the edges of an image, clamp the input at the borders. In Halide 19, you can use BoundaryConditions::repeat_edge directly on an ImageParam, because it includes dimension information. This ensures that all stages in your pipeline can safely access pixels, even at the image boundaries.
Pipeline make_pipeline(ImageParam& input, Schedule schedule) {
Var x("x"), y("y");
// (a) Interleaved constraints for BGR frames
input.dim(0).set_stride(3); // x stride = channels
input.dim(2).set_stride(1); // channel stride = 1
input.dim(2).set_bounds(0, 3); // channels = 0..2
// (b) Border handling: clamp the *ImageParam* (works cleanly in Halide 19)
Func inputClamped = BoundaryConditions::repeat_edge(input);
The next stage converts the image to grayscale. Use the Rec.601 weights for BGR to gray conversion, just like in the previous section. For the blur, apply a 3×3 binomial kernel with values:
1 2 1
2 4 2
1 2 1
This kernel closely approximates a Gaussian filter. Instead of using Halide’s reduction domain (RDom), unroll the sum directly in C++ using two nested loops over the kernel values. For each pixel, calculate the weighted sum of its 3×3 neighborhood and divide by 16 to get the blurred result. This approach makes the computation straightforward and easy to follow.
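In code, the grayscale and blur stages from the full listing above are:

// Gray (Rec.601): BGR channel order, so channel 0 is blue and channel 2 is red
Func gray("gray");
gray(x, y) = cast<uint8_t>(0.114f * inputClamped(x, y, 0)
                         + 0.587f * inputClamped(x, y, 1)
                         + 0.299f * inputClamped(x, y, 2));

// 3x3 binomial blur (sum/16), unrolled in C++ instead of using an RDom
Func blur("blur");
const uint16_t k[3][3] = {{1,2,1},{2,4,2},{1,2,1}};
Expr blurSum = cast<uint16_t>(0);
for (int j = 0; j < 3; ++j)
    for (int i = 0; i < 3; ++i)
        blurSum = blurSum + cast<uint16_t>(gray(x + i - 1, y + j - 1)) * k[j][i];
blur(x, y) = cast<uint8_t>(blurSum / 16);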
Now, add a threshold stage to your pipeline. This stage checks each pixel value after the blur and sets it to white (255) if it’s above 128, or black (0) otherwise. This produces a binary image, making it easy to see which areas are brighter than the threshold.
Here’s how you define the thresholded stage and the output Func:
// Threshold (binary)
Func thresholded("thresholded");
Expr T = cast<uint8_t>(128);
thresholded(x, y) = select(blur(x, y) > T, cast<uint8_t>(255), cast<uint8_t>(0));
// Final output
Func output("output");
output(x, y) = thresholded(x, y);
output.compute_root(); // Realize 'output' explicitly when running the pipeline
This setup ensures that the output is a binary image, and Halide will compute and store the result when you run the pipeline. By calling compute_root() on the output Func, you tell Halide to materialize the final result, making it available for display or further processing.
Now comes the interesting part: the scheduling choices. Depending on the Schedule enum passed in, you instruct Halide to either fuse everything (the default), materialize some intermediates, or even tile the output.
To help you examine what’s happening, print the loop nest Halide generates for each schedule using print_loop_nest(). This will give you a clear view of how fusion or materialization changes the structure of the computation.
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");
switch (schedule) {
case Schedule::Simple:
// Materialize gray and blur as whole-frame buffers.
gray.compute_root();
blur.compute_root();
break;
case Schedule::FuseBlurAndThreshold:
// Materialize only gray; leave blur+threshold fused into output.
gray.compute_root();
break;
case Schedule::FuseAll:
// No schedules on producers → gray, blur, thresholded all inline into output.
break;
case Schedule::Tile:
// Tile the output; compute gray per tile; blur stays fused inside each tile.
output.tile(x, y, xo, yo, xi, yi, 64, 64);
gray.compute_at(output, xo);
break;
}
// Optional: print loop nest to “x-ray” the shape of the generated loops
std::cout << "\n---- Loop structure (" << schedule_name(schedule) << ") ----\n";
output.print_loop_nest();
std::cout << "-----------------------------------------------\n";
return Pipeline(output);
}
All the camera handling is just like before: open the default webcam with OpenCV, normalize frames to 3-channel BGR if needed, wrap each frame as an interleaved Halide buffer, run the pipeline, and show the result. Time only the realize() call and print ms / FPS / MPix/s, with the first frame marked as [warm-up].
The new part is that you can toggle scheduling modes from the keyboard while the application is running:

- 0: Simple (materialize gray and blur)
- 1: FuseBlurAndThreshold (materialize gray only)
- 2: FuseAll (fuse everything; Halide's default)
- 3: Tile (tile the output; compute gray per tile)
- q or Esc: quit
Under the hood, pressing 0–3 triggers a rebuild of the Halide pipeline with the chosen schedule: make_pipeline is called again with the new Schedule value, the new loop nest is printed to the console, and the next frame is flagged as [warm-up] because it includes the JIT compilation time.
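Here is the corresponding key handling from the main loop in the listing above (error handling omitted for brevity):

// Hotkeys 0..3 switch schedules live; the pipeline is re-JITed on demand
if (key >= '0' && key <= '3') {
    Schedule next = static_cast<Schedule>(key - '0');
    if (next != current) {
        std::cout << "\nSwitching to schedule: " << schedule_name(next) << "\n";
        current = next;
        pipe = make_pipeline(input, current); // rebuild JIT with the new schedule
        warmed_up = false;                    // next frame includes JIT time
    }
}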
This live switching makes fusion tangible: you can watch the loop nest printout change, see the visualization update, and compare throughput numbers in real time as you move between Simple, FuseBlurAndThreshold, FuseAll, and Tile.
Now, build and run the sample:
g++ -std=c++17 camera-capture-fusion.cpp -o camera-capture-fusion \
-I/path/to/halide/include -L/path/to/halide/lib -lHalide \
$(pkg-config --cflags --libs opencv4) -lpthread -ldl \
-Wl,-rpath,/path/to/halide/lib
./camera-capture-fusion
You’ll see output similar to the following (exact timings depend on your camera and CPU):
% ./camera-capture-fusion
Starting with schedule: FuseAll (press 0..3 to switch; q/Esc to quit)
---- Loop structure (FuseAll) ----
produce output:
for y:
for x:
output(...) = ...
-----------------------------------------------
FuseAll | 18.90 ms | 52.92 FPS | 109.74 MPix/s
Switching to schedule: FuseBlurAndThreshold
---- Loop structure (FuseBlurAndThreshold) ----
produce gray:
for y:
for x:
gray(...) = ...
consume gray:
produce output:
for y:
for x:
output(...) = ...
-----------------------------------------------
FuseBlurAndThreshold | 4.85 ms | 206.19 FPS | 427.55 MPix/s
Switching to schedule: FuseAll
---- Loop structure (FuseAll) ----
produce output:
for y:
for x:
output(...) = ...
-----------------------------------------------
FuseAll | 18.14 ms | 55.12 FPS | 114.30 MPix/s
Switching to schedule: Tile
---- Loop structure (Tile) ----
produce output:
for y.yo:
for x.xo:
produce gray:
for y:
for x:
gray(...) = ...
consume gray:
for y.yi in [0, 63]:
for x.xi in [0, 63]:
output(...) = ...
-----------------------------------------------
Tile | 4.98 ms | 200.73 FPS | 416.23 MPix/s
Switching to schedule: Simple
---- Loop structure (Simple) ----
produce gray:
for y:
for x:
gray(...) = ...
consume gray:
produce blur:
for y:
for x:
blur(...) = ...
consume blur:
produce output:
for y:
for x:
output(...) = ...
-----------------------------------------------
Simple | 6.01 ms | 166.44 FPS | 345.12 MPix/s
The console output combines two kinds of information: the loop nests printed by print_loop_nest(), which show how each schedule structures the computation, and the per-frame statistics (ms, FPS, MPix/s), which show what that structure costs at runtime.
Comparing the numbers from this run: FuseAll, the default, is the slowest at roughly 19 ms per frame because gray is recomputed for every tap of the 3×3 blur stencil. Materializing gray with FuseBlurAndThreshold cuts the frame time to about 4.9 ms, and Tile is close behind at about 5.0 ms. Simple, which also materializes blur, lands in between at about 6.0 ms: the extra buffer traffic for blur costs more here than the recomputation it avoids.
By toggling schedules live, you can see and measure how operator fusion and materialization change both the loop structure and the throughput:

- FuseAll produces a single loop nest but recomputes gray under the blur stencil.
- FuseBlurAndThreshold and Tile materialize gray (whole-frame or per tile) and deliver the best throughput in this run.
- Simple materializes both gray and blur, trading extra memory traffic for zero recomputation.
This demo makes these trade-offs concrete: the loop nest diagrams explain the structure, and the live FPS/MPix/s stats show the real performance impact.
Halide’s defining feature is that, by default, it performs operator fusion, also called inlining. This means that if a stage produces some intermediate values, those values aren’t stored in a separate buffer and then re-read later—instead, the stage is computed directly inside the consumer’s loop. In other words, unless you tell Halide otherwise, every producer Func is fused into the next stage that uses it.
Why is this important? Fusion reduces memory traffic, because Halide doesn’t need to write intermediates out to RAM and read them back again. On CPUs, where memory bandwidth is often the bottleneck, this can be a major performance win. Fusion also improves cache locality, since values are computed exactly where they are needed and the working set stays small. The trade-off, however, is that fusion can cause recomputation: if a consumer uses a neighborhood (like a blur that reads 3×3 or 9×9 pixels), the fused producer may be recalculated multiple times for overlapping regions. Whether fusion is faster depends on the balance between compute cost and memory traffic.
Consider the difference in pseudocode. Fully fused (Halide’s default):
for y:
for x:
out(x,y) = threshold( sum_{i,j in 3x3} kernel(i,j) * gray(x+i,y+j) )
// gray(...) is computed on the fly for each (i,j)
Materialized with compute_root():
for y: for x: gray(x,y) = ... // write one planar gray image
for y: for x: out(x,y) = threshold( sum kernel * gray(x+i,y+j) )
The fused version eliminates buffer writes but recomputes gray under the blur stencil. The materialized version performs more memory operations but avoids recomputation, and also provides a clean point to parallelize or vectorize the gray stage.
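For example, once gray is materialized it can be given its own schedule. A minimal sketch, not used in the demo, that parallelizes rows and vectorizes along x:

// Variant of the Simple schedule (illustrative only): materialize gray and
// give it a parallel, vectorized loop nest of its own
gray.compute_root()
    .parallel(y)
    .vectorize(x, 16);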
Note that Halide also supports a loop fusion directive (fuse) that merges two loop variables together. That’s a different concept and not the focus here. This tutorial focuses specifically on operator fusion—the decision of whether to inline or materialize stages.
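For reference, here is a minimal sketch of the fuse directive (not used in this demo); it merges two loop variables of one Func into a single loop and leaves producer inlining untouched:

// Loop fusion: collapse output's x and y loops into a single loop over xy.
// This only reshapes the loop nest; it does not inline or materialize producers.
Var xy("xy");
output.fuse(x, y, xy).parallel(xy);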
The pipeline is: BGR input → gray → 3×3 blur → thresholded → output. Depending on the schedule, different kinds of fusion are shown:

- Simple: gray and blur are materialized as whole-frame buffers; only thresholded is fused into output.
- FuseBlurAndThreshold: only gray is materialized; blur and thresholded stay fused into output.
- FuseAll: no producer is scheduled, so gray, blur, and thresholded are all fused into output.
- Tile: the output is processed in 64×64 tiles; gray is computed per tile and blur stays fused within each tile.
By toggling between these modes in the live demo, you can see how the loop nests and throughput numbers change, which makes the abstract idea of fusion much more concrete.
Fusion is Halide’s default and usually the right place to start. It’s especially effective for:

- cheap per-pixel operations such as color conversion and thresholding,
- stages that are consumed by exactly one other stage,
- pipelines where memory bandwidth, rather than arithmetic, is the bottleneck.
In the code, FuseAll inlines gray, blur, and thresholded into output. FuseBlurAndThreshold materializes only gray, then keeps blur and thresholded fused—a common middle ground that balances memory use and compute reuse.
Fusion isn’t always best. You’ll want to materialize an intermediate (compute_root() or compute_at()) if:

- the consumer reads the producer through a large stencil, so fusion would recompute the producer many times per output pixel,
- the intermediate is reused by multiple downstream stages,
- you want a clean point at which to parallelize or vectorize the producer, or to bound its working set with per-tile storage (one such variant is sketched below).
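As an illustration, here’s a sketch, not part of the demo, that extends the Tile schedule so blur is also materialized per tile and both stages are vectorized:

// Hypothetical extension of the Tile schedule: materialize gray and blur
// per 64x64 tile and vectorize their inner loops
output.tile(x, y, xo, yo, xi, yi, 64, 64);
gray.compute_at(output, xo).vectorize(x, 16);
blur.compute_at(output, xo).vectorize(x, 16);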
The fastest way to check whether fusion helps is to measure it. The demo prints timing and throughput per frame, but Halide also includes a built-in profiler that reports per-stage runtimes. To learn how to enable and interpret the profiler, see the official Halide profiling tutorial.
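As a rough sketch, assuming the JIT path used in this demo, the profiler can be enabled by adding the Profile feature to the target you realize with:

// Enable Halide's built-in profiler for the JIT pipeline (sketch, not in the demo)
Target t = get_jit_target_from_environment().with_feature(Target::Profile);
pipe.realize(outBuf, t);
// A per-Func timing report is written to stderr when the compiled module is
// released (for example, at program exit).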
You’ve seen how operator fusion in Halide can make your image processing pipeline faster and more efficient. Fusion means Halide computes each stage directly inside its consumer, reducing memory traffic and keeping data in cache. You learned when fusion is best—like for simple pixel operations or cheap post-processing—and when materializing intermediates with compute_root() or compute_at() can help, especially for large stencils or multi-use buffers. By switching schedules in the live demo, you saw how fusion and materialization affect both the loop structure and real-time performance. Now you know how to choose the right approach for your own Arm-based image processing tasks.