In this section, you’ll transform the Sudoku system from a working prototype into one that is faster, smaller, and more robust on Arm64 hardware. Start by measuring a baseline, then apply ONNX Runtime optimizations and quantization, and finally address the most common bottleneck: image preprocessing. At each step, re-check accuracy and solve rate so performance gains don’t come at the cost of correctness.
Before applying any optimizations, it is essential to understand where time is actually being spent in the Sudoku pipeline. Without this baseline, it is impossible to tell whether an optimization is effective or whether it simply shifts the bottleneck elsewhere.
The total latency of processing a single Sudoku image is composed of four main stages: grid detection and perspective warping, per-cell preprocessing, ONNX inference, and solving (with cell splitting and optional overlay rendering adding smaller contributions).
To quantify these contributions, you will add simple timing measurements around each stage of the pipeline using a high-resolution clock (time.perf_counter()) and print a per-stage breakdown for each processed image.
In sudoku_processor.py, add the following import:
import time
Then, modify the process_image function as follows:
def process_image(self, bgr: np.ndarray, overlay: bool = True):
    """
    Returns:
        board (9x9 ints with 0 for blank),
        solved_board (9x9 ints, or None if unsolved),
        debug dict (warped, homography, confidence, timing),
        overlay_bgr (optional solution overlay)
    """
    timing = {}
    t_total0 = time.perf_counter()

    # --- Grid detection + warp ---
    t0 = time.perf_counter()
    warped, H, quad = self.detect_and_warp_board(bgr)
    timing["warp_ms"] = (time.perf_counter() - t0) * 1000.0

    # --- Cell splitting ---
    t0 = time.perf_counter()
    cells = self.split_cells(warped)
    timing["split_ms"] = (time.perf_counter() - t0) * 1000.0

    # --- Preprocessing (81 cells) ---
    t0 = time.perf_counter()
    xs = []
    coords = []
    for r, c, cell in cells:
        coords.append((r, c))
        xs.append(self.preprocess_cell(cell))
    X = np.concatenate(xs, axis=0).astype(np.float32)  # [81,1,28,28]
    timing["preprocess_ms"] = (time.perf_counter() - t0) * 1000.0

    # --- ONNX inference ---
    t0 = time.perf_counter()
    logits = self.sess.run([self.output_name], {self.input_name: X})[0]
    timing["onnx_ms"] = (time.perf_counter() - t0) * 1000.0

    # --- Postprocess predictions ---
    probs = softmax(logits, axis=1)
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)

    board = [[0 for _ in range(9)] for _ in range(9)]
    conf_grid = [[0.0 for _ in range(9)] for _ in range(9)]
    for i, (r, c) in enumerate(coords):
        p = int(pred[i])
        cf = float(conf[i])
        if cf < self.blank_conf_threshold:
            p = self.blank_class
        board[r][c] = p
        conf_grid[r][c] = cf

    # --- Solve ---
    t0 = time.perf_counter()
    solved = [row[:] for row in board]
    ok = solve_sudoku(solved)
    timing["solve_ms"] = (time.perf_counter() - t0) * 1000.0

    # --- Overlay (optional) ---
    overlay_img = None
    if overlay and ok:
        t0 = time.perf_counter()
        overlay_img = self.overlay_solution(bgr, H, board, solved)
        timing["overlay_ms"] = (time.perf_counter() - t0) * 1000.0
    else:
        timing["overlay_ms"] = 0.0

    timing["total_ms"] = (time.perf_counter() - t_total0) * 1000.0

    debug = {
        "warped": warped,
        "homography": H,
        "quad": quad,
        "confidence": conf_grid,
        "timing": timing,
    }
    return board, (solved if ok else None), debug, overlay_img
Finally, print the timings in 05_RunSudokuProcessor.py as shown:
def main():
    # Use any image path you like:
    # - a real photo
    # - a synthetic grid, e.g. data/grids/val/000001_cam.png
    img_path = "data/grids/val/000002_cam.png"
    onnx_path = os.path.join("artifacts", "sudoku_digitnet.onnx")

    bgr = cv.imread(img_path)
    if bgr is None:
        raise RuntimeError(f"Could not read image: {img_path}")

    proc = SudokuProcessor(onnx_path=onnx_path, warp_size=450, blank_conf_threshold=0.65)
    board, solved, dbg, overlay = proc.process_image(bgr, overlay=True)

    print_board(board, "Recognized board")
    if solved is None:
        print("\nSolver failed (board might contain recognition errors).")
    else:
        print_board(solved, "Solved board")

    # Save debug outputs
    cv.imwrite("artifacts/warped.png", dbg["warped"])
    if overlay is not None:
        cv.imwrite("artifacts/overlay_solution.png", overlay)
        print("\nSaved: artifacts/overlay_solution.png")
    print("Saved: artifacts/warped.png")

    tim = dbg["timing"]
    print(
        f"warp={tim['warp_ms']:.1f} ms | "
        f"preprocess={tim['preprocess_ms']:.1f} ms | "
        f"onnx={tim['onnx_ms']:.1f} ms | "
        f"solve={tim['solve_ms']:.1f} ms | "
        f"total={tim['total_ms']:.1f} ms"
    )


if __name__ == "__main__":
    main()
Run the script:
python3 05_RunSudokuProcessor.py
The output will look like:
Recognized board
. . . | 7 . . | 6 . .
. . 4 | . . . | 1 . 9
. . . | 1 5 . | . . .
---------------------
. . . | . 1 . | . . .
. . . | . . . | . . .
3 . . | . . . | . 6 .
---------------------
7 . . | . . . | . . .
. . 9 | . . . | . . .
. . . | . . . | . . .
Solved board
1 2 3 | 7 4 9 | 6 5 8
5 6 4 | 2 3 8 | 1 7 9
8 9 7 | 1 5 6 | 2 3 4
---------------------
2 4 5 | 6 1 3 | 8 9 7
9 1 6 | 4 8 7 | 3 2 5
3 7 8 | 5 9 2 | 4 6 1
---------------------
7 3 1 | 8 2 5 | 9 4 6
4 5 9 | 3 6 1 | 7 8 2
6 8 2 | 9 7 4 | 5 1 3
Saved: artifacts/overlay_solution.png
Saved: artifacts/warped.png
warp=11.9 ms | preprocess=3.3 ms | onnx=1.9 ms | solve=3.1 ms | total=48.2 ms
The single-image measurements introduced earlier are useful for understanding the rough structure of the pipeline and for verifying that ONNX inference is not the main computational bottleneck. In our case, batched ONNX inference typically takes less than 2 ms, while grid detection, warping, and preprocessing dominate the runtime. However, individual measurements can be noisy due to caching effects, operating system scheduling, and Python overhead.
To obtain more reliable performance numbers, you can extend the evaluation to multiple images and compute aggregated statistics. This allows us to track not only average performance, but also variability and tail latency, which are particularly important for interactive applications.
To do this, add two helper functions to 05_RunSudokuProcessor.py, and make sure you have import glob and import numpy as np at the top of the runner script.
The first function, summarize, computes basic statistics from a list of timing measurements:
def summarize(values):
    values = np.asarray(values, dtype=np.float64)
    return {
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "p90": float(np.percentile(values, 90)),
        "p95": float(np.percentile(values, 95)),
    }
The second function, benchmark_folder, runs the full Sudoku pipeline on a collection of images and aggregates timing results across multiple runs:
def benchmark_folder(proc, folder_glob, limit=100, warmup=10, overlay=False):
    paths = sorted(glob.glob(folder_glob))
    if not paths:
        raise RuntimeError(f"No images matched: {folder_glob}")
    paths = paths[:limit]

    # Warmup
    for p in paths[:min(warmup, len(paths))]:
        bgr = cv.imread(p)
        if bgr is None:
            continue
        proc.process_image(bgr, overlay=overlay)

    # Benchmark
    agg = {k: [] for k in ["warp_ms", "preprocess_ms", "onnx_ms", "solve_ms", "total_ms"]}
    solved_cnt = 0
    total_cnt = 0
    for p in paths:
        bgr = cv.imread(p)
        if bgr is None:
            continue
        board, solved, dbg, _ = proc.process_image(bgr, overlay=overlay)
        tim = dbg["timing"]
        for k in agg:
            agg[k].append(tim[k])
        total_cnt += 1
        if solved is not None:
            solved_cnt += 1

    print(f"\nSolved {solved_cnt}/{total_cnt} ({(solved_cnt/total_cnt*100.0 if total_cnt else 0):.1f}%)")
    print("\nTiming summary (ms):")
    for k in ["warp_ms", "preprocess_ms", "onnx_ms", "solve_ms", "total_ms"]:
        s = summarize(agg[k])
        print(f"{k:14s} mean={s['mean']:.2f} median={s['median']:.2f} p90={s['p90']:.2f} p95={s['p95']:.2f}")
Finally, invoke the benchmark in the main() function:
def main():
    onnx_path = os.path.join("artifacts", "sudoku_digitnet.onnx")
    proc = SudokuProcessor(onnx_path=onnx_path, warp_size=450, blank_conf_threshold=0.65)
    benchmark_folder(proc, "data/grids/val/*_cam.png", limit=30, warmup=10, overlay=False)


if __name__ == "__main__":
    main()
This evaluates the processor on a representative subset of camera-like validation grids, prints aggregated timing statistics, and reports the overall solve rate.
Aggregated benchmarks provide a much more accurate picture than single measurements, especially when individual stages take only a few milliseconds. Percentiles are particularly useful when a few slow cases exist (for example, harder solves), because they reveal tail latency and show whether an optimization truly improves user-perceived performance rather than just the average. These results form a solid quantitative baseline that you can reuse to evaluate every optimization that follows.
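The value of percentiles is easy to see with a toy example (illustrative numbers only, not measurements from this pipeline):

```python
import numpy as np

# 30 runs: 28 fast solves plus two pathological backtracking cases
latencies = np.array([2.0] * 28 + [400.0, 500.0])

mean = latencies.mean()        # dominated by the two outliers
median = np.median(latencies)  # reflects the typical run
p95 = np.percentile(latencies, 95)

print(f"mean={mean:.1f} median={median:.1f} p95={p95:.1f}")
```

Here the mean is roughly 32 ms even though the typical run takes 2 ms; only the median and p95 together tell you both the common case and the tail.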
Run the updated script:
python3 05_RunSudokuProcessor.py
Here is the sample output of the updated script:
Solved 30/30 (100.0%)
Timing summary (ms):
warp_ms mean=10.25 median=10.27 p90=10.57 p95=10.59
preprocess_ms mean=3.01 median=2.98 p90=3.16 p95=3.21
onnx_ms mean=1.27 median=1.24 p90=1.30 p95=1.45
solve_ms mean=74.76 median=2.02 p90=48.51 p95=74.82
total_ms mean=89.41 median=16.97 p90=62.95 p95=89.43
Notice that solve_ms (and therefore total_ms) has a much larger mean than median. This indicates a small number of outliers where the solver takes significantly longer. In practice, this occurs when one or more digits are misrecognized, forcing the backtracking solver to explore many branches before finding a solution (or failing). For interactive applications, median and p95 latency are more informative than the mean, as they better reflect typical user experience.
Now that you can measure onnx_ms and total_ms, the first low-effort improvement is to enable ONNX Runtime’s built-in graph optimizations and tune CPU threading. These changes do not modify the model, but can reduce inference overhead and improve throughput.
In sudoku_processor.py, update the ONNX Runtime session initialization in __init__ to use SessionOptions:
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optional threading tuning: for a model this small, a low fixed thread
# count often matches or beats the default of one thread per core.
so.intra_op_num_threads = 2
self.sess = ort.InferenceSession(onnx_path, sess_options=so, providers=list(providers))
Re-run 05_RunSudokuProcessor.py and compare onnx_ms and total_ms to the baseline.
Solved 30/30 (100.0%)
Timing summary (ms):
warp_ms mean=10.43 median=10.36 p90=10.89 p95=10.96
preprocess_ms mean=3.13 median=3.11 p90=3.34 p95=3.42
onnx_ms mean=1.28 median=1.26 p90=1.37 p95=1.47
solve_ms mean=78.61 median=2.01 p90=50.15 p95=77.87
total_ms mean=93.58 median=17.06 p90=65.10 p95=92.55
This result is expected for such a small model: ONNX inference is already efficient, and the dominant costs lie in image preprocessing and occasional solver backtracking. This highlights why system-level profiling is essential before focusing on model-level optimizations.
Quantization is one of the most impactful optimizations for Arm64 and mobile deployments because it reduces both model size and compute cost. For CNNs, the most compatible approach is static INT8 quantization in QDQ format. This uses a small calibration set to estimate activation ranges and typically works well across runtimes.
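The arithmetic behind INT8 quantization can be sketched in a few lines of NumPy. This is a simplified, symmetric per-tensor illustration only; the actual script below uses per-channel QDQ quantization via ONNX Runtime:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: x is approximated by scale * q."""
    amax = np.abs(x).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Quantize a fake conv-weight tensor and measure the round-trip error
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(16, 1, 3, 3)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(w - dequantize(q, scale)).max()
# Error is bounded by half a quantization step (scale / 2)
print(f"scale={scale:.6f} max_err={max_err:.6f}")
```

Per-channel quantization simply computes one scale per output channel instead of one for the whole tensor, which is why it usually preserves convolution accuracy better.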
Create a small script 06_QuantizeModel.py with the code below:
import os, glob
import numpy as np
import cv2 as cv
from onnxruntime.quantization import (
    quantize_static, CalibrationDataReader, QuantFormat, QuantType
)

ARTI_DIR = "artifacts"
FP32_PATH = os.path.join(ARTI_DIR, "sudoku_digitnet.onnx")
INT8_PATH = os.path.join(ARTI_DIR, "sudoku_digitnet.int8.onnx")


# ---- Calibration data reader ----
class SudokuCalibReader(CalibrationDataReader):
    def __init__(self, folder_glob="data/train/0/*.png", limit=500, input_name="input", input_size=28):
        self.input_name = input_name
        self.input_size = input_size
        paths = sorted(glob.glob(folder_glob))[:limit]
        self._iter = iter(paths)

    def get_next(self):
        try:
            p = next(self._iter)
        except StopIteration:
            return None
        g = cv.imread(p, cv.IMREAD_GRAYSCALE)
        if g is None:
            return self.get_next()
        g = cv.resize(g, (self.input_size, self.input_size), interpolation=cv.INTER_AREA)
        x = g.astype(np.float32) / 255.0
        x = (x - 0.5) / 0.5
        x = x[None, None, :, :]  # [1,1,28,28]
        return {self.input_name: x}


# ---- Run quantization ----
reader = SudokuCalibReader(folder_glob="data/train/*/*.png", limit=1000)

print("Quantizing (QDQ static INT8)...")
quantize_static(
    model_input=FP32_PATH,
    model_output=INT8_PATH,
    calibration_data_reader=reader,
    quant_format=QuantFormat.QDQ,  # key: keep Conv as Conv with Q/DQ wrappers
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,  # usually helps conv accuracy
)
print("Saved:", INT8_PATH)
Run the script:
python3 06_QuantizeModel.py
Then update the runner script to point to the quantized model:
onnx_path = os.path.join("artifacts", "sudoku_digitnet.int8.onnx")
Re-run the processor and compare the timing statistics and solve rate against the FP32 baseline.
Also compare file sizes:
ls -lh artifacts/sudoku_digitnet.onnx artifacts/sudoku_digitnet.int8.onnx
Expected file size reduction is approximately 4x (for example, from 52 KB to 14 KB). Even when inference time changes only modestly, the size reduction is significant and matters for Android packaging.
In this pipeline, quantization primarily reduces model size and improves deployability, while runtime speedups may be modest because inference is already a small fraction of the total latency.
The measurements above show that ONNX inference accounts for only a small fraction of the total runtime. In practice, the largest performance gains come from optimizing image preprocessing.
The most effective improvements typically include vectorizing the per-cell preprocessing loop so all 81 cells are normalized as one batched NumPy operation, avoiding redundant color conversions and intermediate copies, and reusing preallocated buffers across frames. These changes typically reduce preprocess_ms more than any model-level optimization, and therefore have the greatest impact on end-to-end latency.
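To illustrate the vectorization idea, the sketch below compares a per-cell Python loop with a single batched NumPy operation over all 81 cells. It assumes preprocess_cell applies the same (x/255 - 0.5)/0.5 normalization used during calibration; both paths produce the same [81,1,28,28] tensor:

```python
import numpy as np

def preprocess_loop(cells):
    """Per-cell Python loop: 81 small normalize/reshape operations."""
    xs = []
    for cell in cells:
        x = cell.astype(np.float32) / 255.0
        x = (x - 0.5) / 0.5
        xs.append(x[None, None, :, :])   # [1,1,28,28]
    return np.concatenate(xs, axis=0)    # [81,1,28,28]

def preprocess_vectorized(cells):
    """One batched pass: stack into [81,28,28], normalize in one shot."""
    batch = np.stack(cells).astype(np.float32)  # [81,28,28]
    batch = (batch / 255.0 - 0.5) / 0.5
    return batch[:, None, :, :]                 # [81,1,28,28]

# Verify both paths produce identical input tensors
rng = np.random.default_rng(0)
cells = [rng.integers(0, 256, (28, 28), dtype=np.uint8) for _ in range(81)]
a = preprocess_loop(cells)
b = preprocess_vectorized(cells)
print(a.shape, np.allclose(a, b))
```

The vectorized form replaces 81 rounds of Python-level allocation and arithmetic with a handful of array operations, which is where the preprocess_ms savings come from.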
You transformed the Sudoku solver from a functional prototype into a system with measurable, well-understood performance characteristics. You established quantitative baselines showing that ONNX inference takes approximately 1–2 ms per board, identified image preprocessing as the dominant cost (~3 ms) and the largest optimization opportunity, and applied INT8 quantization for an approximately 4x model size reduction. Along the way you followed a systematic optimization workflow: measure first, optimize second, and always re-validate correctness.
Next, you’ll deploy the optimized Sudoku pipeline as a fully on-device Android application, integrating the ONNX model with camera capture and real-time processing.