Deploying YOLO26 on Edge: Raspberry Pi, Jetson, and Embedded Systems
Deploying computer vision models on edge devices - Raspberry Pi, NVIDIA Jetson, smartphones, ARM microcontrollers - is a fundamentally different engineering challenge from cloud or GPU-server deployment. Resources are constrained: a few watts of power, single-digit gigabytes of RAM instead of dozens, and no dedicated GPU (or an entry-level one at best). Yet millions of applications require local inference: offline surveillance, robotics, portable medical devices, and industrial automation in environments without connectivity.
In this article we'll explore optimization techniques for edge deployment: quantization, pruning, knowledge distillation, optimized formats (ONNX, TFLite, NCNN) and real benchmarks on Raspberry Pi 5 and NVIDIA Jetson Orin.
What You'll Learn
- Edge hardware overview: Raspberry Pi, Jetson Nano/Orin, Coral TPU, Hailo
- Quantization: INT8, FP16 - theory and practical implementation
- Structured and unstructured pruning to reduce parameters
- Knowledge Distillation: training small models from large ones
- TFLite and NCNN: deployment on ARM devices
- TensorRT: maximum speed on NVIDIA GPUs (Jetson)
- ONNX Runtime with optimizations for CPU and NPU
- YOLO26 on Raspberry Pi 5: benchmarks and complete setup
- Real-time video pipeline on Jetson Orin Nano
1. Edge Hardware for Computer Vision
Choosing the right hardware is the first critical decision in edge deployment. There is no single best device: the optimal choice depends on power budget, performance requirements, cost, and deployment environment.
Edge Hardware Comparison 2026
| Device | CPU | GPU/NPU | RAM | TDP | YOLOv8n FPS |
|---|---|---|---|---|---|
| Raspberry Pi 5 | ARM Cortex-A76 4-core | VideoCore VII | 8GB | 15W | ~5 FPS |
| Jetson Nano (2GB) | ARM A57 4-core | 128 CUDA cores | 2GB | 10W | ~20 FPS |
| Jetson Orin Nano | ARM Cortex-A78AE 6-core | 1024 CUDA cores | 8GB | 25W | ~80 FPS |
| Jetson AGX Orin | ARM Cortex-A78AE 12-core | 2048 CUDA + DLA | 64GB | 60W | ~200 FPS |
| Google Coral TPU | ARM Cortex-A53 4-core | 4 TOPS Edge TPU | 1GB | 4W | ~30 FPS (TFLite) |
| Hailo-8 | - (PCIe accelerator) | 26 TOPS Neural Engine | - | 5W | ~120 FPS |
Hardware Selection Guide
The key metric for battery-powered or solar-powered devices is FPS/Watt, not raw FPS. The Coral TPU achieves ~7.5 FPS/Watt, while the Jetson AGX Orin achieves ~3.3 FPS/Watt but with significantly higher absolute throughput. For industrial line inspection or retail analytics, the Jetson Orin Nano strikes the best balance between performance and power consumption.
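The FPS/Watt trade-off is easy to recompute from the comparison table above; a quick sketch (values rounded from the table):

```python
# Efficiency (FPS per watt) derived from the hardware comparison table.
devices = {
    "Raspberry Pi 5":   {"fps": 5,   "tdp_w": 15},
    "Jetson Orin Nano": {"fps": 80,  "tdp_w": 25},
    "Jetson AGX Orin":  {"fps": 200, "tdp_w": 60},
    "Google Coral TPU": {"fps": 30,  "tdp_w": 4},
    "Hailo-8":          {"fps": 120, "tdp_w": 5},
}

def fps_per_watt(fps: float, tdp_w: float) -> float:
    return fps / tdp_w

# Rank devices by energy efficiency, not raw throughput
for name, d in sorted(devices.items(),
                      key=lambda kv: -fps_per_watt(kv[1]["fps"], kv[1]["tdp_w"])):
    print(f"{name:18s} {fps_per_watt(d['fps'], d['tdp_w']):5.1f} FPS/W")
```

Note how the ranking by FPS/Watt differs sharply from the ranking by absolute FPS: the dedicated accelerators dominate on efficiency.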
2. Quantization: From FP32 to INT8
Quantization reduces the numerical precision of model weights and activations: from float32 (32 bits) to float16 (16 bits) or int8 (8 bits). The practical effect: 4x smaller model with INT8, 2-4x faster inference, reduced energy consumption. Accuracy loss with modern techniques is typically under 1%.
Quantization Methods Comparison
| Method | Requires Retraining | Accuracy Loss | Speedup | Use Case |
|---|---|---|---|---|
| Post-Training (PTQ) FP16 | No | <0.1% | 1.5-2x | GPU deployment (Jetson FP16) |
| Post-Training (PTQ) INT8 | No (calibration data only) | 0.5-2% | 2-4x | CPU ARM, Coral TPU |
| Quantization-Aware Training (QAT) | Yes (few epochs) | <0.3% | 2-4x | High accuracy requirements |
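Under the hood, all three methods rely on the same affine INT8 mapping, q = round(x / scale) + zero_point. A minimal framework-free sketch of calibration and round-trip quantization (illustrative values, not tied to any specific library):

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive scale/zero-point from the observed min/max (affine PTQ style)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))        # clamp to the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

acts = [-1.0, -0.2, 0.0, 0.7, 2.0]        # stand-in activation statistics
scale, zp = calibrate(acts)
for x in acts:
    x_hat = dequantize(quantize(x, scale, zp), scale, zp)
    # the round-trip error is bounded by half a quantization step
    assert abs(x - x_hat) <= scale / 2 + 1e-9
```

This is why calibration data matters for PTQ INT8: the observed min/max directly determines the scale, and hence the quantization error.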
2.1 Post-Training Quantization (PTQ) with PyTorch
import copy
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert
def quantize_model_ptq(
model: torch.nn.Module,
calibration_loader,
backend: str = 'qnnpack' # 'qnnpack' for ARM, 'x86' for Intel CPU
) -> torch.nn.Module:
"""
Post-Training Quantization (PTQ): quantize model without retraining.
Only requires a small calibration dataset (~100-1000 images).
Flow:
1. Fuse operations (Conv+BN+ReLU -> single op)
2. Insert observers for calibration
3. Run calibration (forward pass on calibration dataset)
4. Convert to quantized model
"""
torch.backends.quantized.engine = backend
model_to_quantize = copy.deepcopy(model)
model_to_quantize.eval()
# Step 1: Fuse common layers for efficiency
model_to_quantize = torch.quantization.fuse_modules(
model_to_quantize,
[['conv1', 'bn1', 'relu']], # adapt to your model's layer names
inplace=True
)
# Step 2: Set qconfig and prepare for calibration
qconfig = get_default_qconfig(backend)
model_to_quantize.qconfig = qconfig
prepared_model = prepare(model_to_quantize, inplace=False)
# Step 3: Calibration with real data
print("Running quantization calibration...")
prepared_model.eval()
with torch.no_grad():
for i, (images, _) in enumerate(calibration_loader):
prepared_model(images)
if i >= 99: # 100 calibration batches sufficient
break
if i % 10 == 0:
print(f" Batch {i+1}/100")
# Step 4: Convert to quantized model
quantized_model = convert(prepared_model, inplace=False)
# Verify size reduction
def model_size_mb(m: torch.nn.Module) -> float:
param_size = sum(p.nelement() * p.element_size() for p in m.parameters())
buffer_size = sum(b.nelement() * b.element_size() for b in m.buffers())
return (param_size + buffer_size) / (1024 ** 2)
original_size = model_size_mb(model)
quantized_size = model_size_mb(quantized_model)
print(f"Original size: {original_size:.1f} MB")
print(f"Quantized size: {quantized_size:.1f} MB")
print(f"Reduction: {original_size / quantized_size:.1f}x")
return quantized_model
def compare_inference_speed(original_model, quantized_model,
input_tensor: torch.Tensor, n_runs: int = 100) -> dict:
"""Compare inference speed between original and quantized model."""
import time
results = {}
for name, model in [('FP32', original_model), ('INT8', quantized_model)]:
model.eval()
# Warmup
with torch.no_grad():
for _ in range(10):
model(input_tensor)
# Benchmark
start = time.perf_counter()
with torch.no_grad():
for _ in range(n_runs):
model(input_tensor)
elapsed = time.perf_counter() - start
avg_ms = (elapsed / n_runs) * 1000
results[name] = avg_ms
print(f"{name}: {avg_ms:.2f}ms / inference")
speedup = results['FP32'] / results['INT8']
print(f"INT8 Speedup: {speedup:.2f}x")
return results
2.2 YOLO Export for Edge Targets
from ultralytics import YOLO
model = YOLO('yolo26n.pt') # nano variant for edge
# ---- TFLite INT8 for Raspberry Pi / Coral TPU ----
model.export(
format='tflite',
imgsz=320, # reduced resolution for edge
int8=True, # INT8 quantization
data='coco.yaml' # dataset for PTQ calibration
)
# Output: yolo26n_int8.tflite
# ---- NCNN for ARM CPU (Raspberry Pi, Android) ----
model.export(
format='ncnn',
imgsz=320,
half=False # NCNN uses native FP32 or INT8
)
# Output: yolo26n_ncnn_model/
# ---- TensorRT FP16 for Jetson ----
model.export(
format='engine',
imgsz=640,
half=True, # FP16
workspace=2, # GB workspace (reduced for Jetson Nano)
device=0
)
# Output: yolo26n.engine
# ---- ONNX + ONNX Runtime for CPU/NPU ----
model.export(
format='onnx',
imgsz=320,
opset=17,
simplify=True,
dynamic=False # fixed batch size for edge deployment
)
print("Export completed for all edge targets")
3. YOLO on Raspberry Pi 5
The Raspberry Pi 5 with 8GB RAM and the ARM Cortex-A76 processor represents the most accessible entry point for edge AI. With the right optimizations (NCNN, reduced resolution, tracking to reduce inference frequency), you can build a functional real-time detection system.
Critical: Backend Selection
On Raspberry Pi, always use qnnpack as PyTorch quantization backend and
NCNN as inference runtime. The NCNN framework developed by Tencent is the
fastest ARM CPU runtime available, consistently outperforming ONNX Runtime and TFLite on
ARM Cortex-A chips by 20-40%.
# ============================================
# RASPBERRY PI 5 SETUP for Computer Vision
# ============================================
# 1. Install base dependencies
# sudo apt update && sudo apt install -y python3-pip libopencv-dev
# pip install ultralytics ncnn onnxruntime
# 2. System optimizations for AI
# In /boot/firmware/config.txt:
# gpu_mem=256 # Increase GPU memory (VideoCore VII)
# over_voltage=6 # Mild overclock
# arm_freq=2800 # Max CPU frequency (stock 2.4GHz)
# ============================================
# INFERENCE with NCNN on Raspberry Pi
# ============================================
import ncnn
import cv2
import numpy as np
import time
class YOLOncnn:
"""
YOLO inference with NCNN - optimized for ARM CPU.
NCNN by Tencent is the fastest runtime for ARM CPU.
"""
def __init__(self, param_path: str, bin_path: str,
num_threads: int = 4, input_size: int = 320):
self.net = ncnn.Net()
self.net.opt.num_threads = num_threads # use all cores
self.net.opt.use_vulkan_compute = False # no discrete GPU on RPi
self.net.load_param(param_path)
self.net.load_model(bin_path)
self.input_size = input_size
def predict(self, img_bgr: np.ndarray, conf_thresh: float = 0.4) -> list[dict]:
"""NCNN inference on ARM CPU."""
h, w = img_bgr.shape[:2]
# Resize + normalization for NCNN
img_resized = cv2.resize(img_bgr, (self.input_size, self.input_size))
img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
mat_in = ncnn.Mat.from_pixels(
img_rgb, ncnn.Mat.PixelType.PIXEL_RGB,
self.input_size, self.input_size
)
mean_vals = [0.485 * 255, 0.456 * 255, 0.406 * 255]
norm_vals = [1/0.229/255, 1/0.224/255, 1/0.225/255]
mat_in.substract_mean_normalize(mean_vals, norm_vals)
ex = self.net.create_extractor()
ex.input("images", mat_in)
_, mat_out = ex.extract("output0")
return self._parse_output(mat_out, conf_thresh, w, h)
def _parse_output(self, mat_out, conf_thresh,
orig_w, orig_h) -> list[dict]:
"""Parse NCNN output into detection format."""
detections = []
for i in range(mat_out.h):
row = np.array(mat_out.row(i))
confidence = row[4]
if confidence < conf_thresh:
continue
class_scores = row[5:]
class_id = int(np.argmax(class_scores))
class_conf = confidence * class_scores[class_id]
if class_conf >= conf_thresh:
cx, cy, bw, bh = row[:4]
x1 = int((cx - bw/2) * orig_w / self.input_size)
y1 = int((cy - bh/2) * orig_h / self.input_size)
x2 = int((cx + bw/2) * orig_w / self.input_size)
y2 = int((cy + bh/2) * orig_h / self.input_size)
detections.append({
'class_id': class_id,
'confidence': float(class_conf),
'bbox': (x1, y1, x2, y2)
})
return detections
def run_rpi_detection_loop(model_param: str, model_bin: str,
camera_id: int = 0) -> None:
"""Real-time detection loop optimized for Raspberry Pi."""
detector = YOLOncnn(model_param, model_bin,
num_threads=4, input_size=320)
cap = cv2.VideoCapture(camera_id)
# Optimize capture for RPi
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
cap.set(cv2.CAP_PROP_FPS, 30)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1) # minimize latency
    frame_skip = 2  # Run detection on every 2nd frame to save CPU
frame_count = 0
cached_dets = []
fps_history = []
while True:
ret, frame = cap.read()
if not ret:
break
t0 = time.perf_counter()
if frame_count % frame_skip == 0:
cached_dets = detector.predict(frame, conf_thresh=0.4)
elapsed = time.perf_counter() - t0
fps = 1.0 / elapsed if elapsed > 0 else 0
fps_history.append(fps)
# Visualization
for det in cached_dets:
x1, y1, x2, y2 = det['bbox']
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(frame, f"{det['confidence']:.2f}",
(x1, y1-5), cv2.FONT_HERSHEY_SIMPLEX,
0.5, (0,255,0), 2)
avg_fps = sum(fps_history[-30:]) / min(len(fps_history), 30)
cv2.putText(frame, f"FPS: {avg_fps:.1f}", (10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
cv2.imshow('RPi Detection', frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
frame_count += 1
cap.release()
cv2.destroyAllWindows()
print(f"Average FPS: {sum(fps_history)/len(fps_history):.1f}")
4. NVIDIA Jetson Orin: TensorRT and DLA
The Jetson Orin Nano (25W) offers 1024 CUDA cores with 32 Tensor Cores; the larger Orin NX and AGX Orin modules add dedicated DLA (Deep Learning Accelerator) engines that TensorRT can target. With TensorRT FP16 and a YOLO26n model, you can easily exceed 100 FPS on 640x640 video.
from ultralytics import YOLO
import cv2
import time
def setup_jetson_pipeline(model_path: str = 'yolo26n.pt') -> YOLO:
"""
Optimal setup for Jetson Orin:
1. Export to TensorRT FP16
2. Configure jetson_clocks for maximum performance
3. Set performance mode for GPU
"""
import subprocess
# Maximize Jetson performance (run once, requires sudo)
# subprocess.run(['sudo', 'jetson_clocks'], check=True)
# subprocess.run(['sudo', 'nvpmodel', '-m', '0'], check=True) # MAXN mode
model = YOLO(model_path)
print("Exporting to TensorRT FP16...")
model.export(
format='engine',
imgsz=640,
half=True, # FP16 - nearly same accuracy as FP32, 2x faster
workspace=2, # GB GPU workspace (Jetson Orin Nano: 8GB shared)
device=0,
batch=1,
simplify=True
)
# Load the TensorRT model
trt_model = YOLO('yolo26n.engine')
print("TensorRT model ready")
return trt_model
def run_jetson_pipeline(model: YOLO, source=0) -> None:
"""Real-time pipeline optimized for Jetson with performance stats."""
cap = cv2.VideoCapture(source)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
fps_list = []
frame_count = 0
try:
while True:
ret, frame = cap.read()
if not ret:
break
t0 = time.perf_counter()
results = model.predict(
frame, conf=0.35, iou=0.45,
verbose=False, half=True # FP16 inference
)
elapsed = time.perf_counter() - t0
fps = 1.0 / elapsed
fps_list.append(fps)
# Annotate with performance info
annotated = results[0].plot()
avg_fps = sum(fps_list[-30:]) / min(len(fps_list), 30)
info_lines = [
f"FPS: {fps:.0f} (avg: {avg_fps:.0f})",
f"Detections: {len(results[0].boxes)}",
f"Inference: {elapsed*1000:.1f}ms"
]
for i, text in enumerate(info_lines):
cv2.putText(annotated, text, (10, 30 + i * 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imshow('Jetson Pipeline', annotated)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
frame_count += 1
finally:
cap.release()
cv2.destroyAllWindows()
if fps_list:
print(f"\n=== Jetson Stats ===")
print(f"Frames: {frame_count}")
print(f"Average FPS: {sum(fps_list)/len(fps_list):.1f}")
print(f"Peak FPS: {max(fps_list):.1f}")
print(f"Min latency: {1000/max(fps_list):.1f}ms")
5. Pruning and Knowledge Distillation
5.1 Structured Pruning
Structured pruning removes entire filters or neurons based on their L2-norm importance score. Unlike unstructured (weight-level) pruning, structured pruning produces models that are actually faster in inference - not just smaller files.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
def apply_structured_pruning(model: nn.Module,
amount: float = 0.3,
n: int = 2) -> nn.Module:
"""
    Structured L_n-norm pruning: zeroes entire filters/neurons via masks.
    Note: torch.nn.utils.prune only applies masks; to realize the inference
    speedup, the zeroed channels must be physically removed afterwards
    (e.g. with a channel-pruning library) or the model rebuilt without them.
    amount: fraction of filters to prune (0.3 = 30%)
    n: L_n norm used for filter ranking
"""
for name, module in model.named_modules():
if isinstance(module, nn.Conv2d):
prune.ln_structured(
module,
name='weight',
amount=amount,
n=n,
dim=0 # dim=0 = prune output filters
)
elif isinstance(module, nn.Linear):
prune.ln_structured(
module,
name='weight',
amount=amount,
n=n,
dim=0
)
return model
def remove_pruning_masks(model: nn.Module) -> nn.Module:
"""
Make pruning permanent: remove masks and 'orig' parameters,
keeping only the pruned weights. Required before export.
"""
for name, module in model.named_modules():
if isinstance(module, (nn.Conv2d, nn.Linear)):
try:
prune.remove(module, 'weight')
except ValueError:
pass
return model
def prune_and_finetune(model: nn.Module, train_loader, val_loader,
prune_amount: float = 0.2,
finetune_epochs: int = 5) -> nn.Module:
"""
Complete pipeline:
1. Prune the model (remove prune_amount% of filters)
2. Fine-tune to recover lost accuracy
3. Remove masks and finalize
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Applying {prune_amount*100:.0f}% structured pruning...")
model = apply_structured_pruning(model, amount=prune_amount)
# Brief fine-tuning for accuracy recovery
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(finetune_epochs):
model.train()
total_loss = 0.0
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
loss = criterion(model(images), labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
total_loss += loss.item()
model.eval()
correct = total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
preds = model(images).argmax(1)
correct += preds.eq(labels).sum().item()
total += labels.size(0)
print(f" FT Epoch {epoch+1}/{finetune_epochs} | "
f"Loss: {total_loss/len(train_loader):.4f} | "
f"Acc: {100.*correct/total:.2f}%")
model = remove_pruning_masks(model)
print("Pruning completed and finalized")
return model
5.2 Knowledge Distillation
Knowledge distillation trains a small student model to mimic a large teacher model. The student learns not just from hard labels (ground truth) but from the teacher's soft predictions (logits), which contain richer information about class relationships.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DistillationLoss(nn.Module):
"""
Combined loss for knowledge distillation:
L_total = alpha * L_CE(student, labels) + (1-alpha) * L_KD(student, teacher)
L_KD = KL divergence between soft predictions (temperature-scaled)
Temperature T > 1 softens the distributions, revealing inter-class structure.
"""
def __init__(self, temperature: float = 4.0, alpha: float = 0.3):
super().__init__()
self.T = temperature
self.alpha = alpha
self.ce = nn.CrossEntropyLoss()
def forward(self, student_logits: torch.Tensor,
teacher_logits: torch.Tensor,
labels: torch.Tensor) -> torch.Tensor:
# Standard cross-entropy loss
loss_ce = self.ce(student_logits, labels)
# Soft prediction loss (KL divergence)
student_soft = F.log_softmax(student_logits / self.T, dim=1)
teacher_soft = F.softmax(teacher_logits / self.T, dim=1)
loss_kd = F.kl_div(student_soft, teacher_soft,
reduction='batchmean') * (self.T ** 2)
return self.alpha * loss_ce + (1 - self.alpha) * loss_kd
def train_with_distillation(teacher_model: nn.Module,
student_model: nn.Module,
train_loader,
epochs: int = 30,
temperature: float = 4.0) -> nn.Module:
"""Train student model guided by teacher model."""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
teacher_model.to(device).eval() # Teacher is frozen
student_model.to(device)
criterion = DistillationLoss(temperature=temperature, alpha=0.3)
optimizer = torch.optim.AdamW(student_model.parameters(),
lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=epochs
)
for epoch in range(epochs):
student_model.train()
total_loss = 0.0
for images, labels in train_loader:
images = images.to(device)
labels = labels.to(device)
with torch.no_grad():
teacher_logits = teacher_model(images)
student_logits = student_model(images)
loss = criterion(student_logits, teacher_logits, labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
total_loss += loss.item()
scheduler.step()
print(f"Epoch {epoch+1}/{epochs} | "
f"Loss: {total_loss/len(train_loader):.4f} | "
f"LR: {scheduler.get_last_lr()[0]:.2e}")
return student_model
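The temperature scaling in DistillationLoss can be sanity-checked without any framework: the KD term must vanish when student and teacher logits coincide, and the T^2 factor keeps it on the same scale as the cross-entropy term. A small pure-Python recomputation of the same formula (logit values are illustrative):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_term(student_logits, teacher_logits, T):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    return kl * T * T

teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
print(kd_term(teacher, teacher, T=4.0))   # 0.0 - identical logits, no KD signal
print(kd_term(student, teacher, T=4.0))   # positive - student still differs
```

Raising T flattens both distributions, which is exactly what exposes the teacher's inter-class structure to the student.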
6. ONNX Runtime: Cross-Platform Inference
ONNX Runtime provides a unified API for inference across CPU, CUDA, TensorRT, OpenVINO, CoreML, and more. It's the best choice when you need portability across multiple target platforms from a single model file.
import onnxruntime as ort
import numpy as np
import cv2
import time
from typing import Optional
class ONNXInferenceEngine:
"""
Cross-platform ONNX Runtime inference.
Automatically selects best execution provider:
TensorRT > CUDA > CPU
"""
EXECUTION_PROVIDERS = [
'TensorrtExecutionProvider', # Jetson with TensorRT
'CUDAExecutionProvider', # NVIDIA GPU (generic)
'CPUExecutionProvider' # Fallback: any device
]
def __init__(self, model_path: str,
providers: Optional[list] = None):
providers = providers or self.EXECUTION_PROVIDERS
# Session options for optimization
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = (
ort.GraphOptimizationLevel.ORT_ENABLE_ALL
)
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 1
# Check available providers
available = ort.get_available_providers()
selected = [p for p in providers if p in available]
print(f"Available providers: {available}")
print(f"Using: {selected[0]}")
self.session = ort.InferenceSession(
model_path,
sess_options=sess_options,
providers=selected
)
# Get I/O shapes
self.input_name = self.session.get_inputs()[0].name
self.input_shape = self.session.get_inputs()[0].shape
print(f"Input: {self.input_name} {self.input_shape}")
def preprocess(self, img_bgr: np.ndarray,
input_size: int = 320) -> np.ndarray:
"""Preprocess image for ONNX model."""
img = cv2.resize(img_bgr, (input_size, input_size))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = img.astype(np.float32) / 255.0
img = np.transpose(img, (2, 0, 1)) # HWC -> CHW
img = np.expand_dims(img, 0) # CHW -> NCHW
return np.ascontiguousarray(img)
def infer(self, img_bgr: np.ndarray,
input_size: int = 320) -> list:
"""Run inference and return raw outputs."""
input_data = self.preprocess(img_bgr, input_size)
outputs = self.session.run(
None,
{self.input_name: input_data}
)
return outputs
def benchmark(self, input_size: int = 320,
n_runs: int = 100) -> dict:
"""Benchmark inference speed."""
dummy_img = np.random.randint(0, 255,
(480, 640, 3),
dtype=np.uint8)
# Warmup
for _ in range(10):
self.infer(dummy_img, input_size)
times = []
for _ in range(n_runs):
t0 = time.perf_counter()
self.infer(dummy_img, input_size)
times.append(time.perf_counter() - t0)
avg_ms = np.mean(times) * 1000
p99_ms = np.percentile(times, 99) * 1000
print(f"Avg latency: {avg_ms:.2f}ms")
print(f"P99 latency: {p99_ms:.2f}ms")
print(f"Avg FPS: {1000/avg_ms:.1f}")
return {"avg_ms": avg_ms, "p99_ms": p99_ms}
7. Best Practices for Edge Deployment
Edge Deployment Checklist
- Choose the smallest model that meets requirements: YOLOv8n or YOLO26n for RPi, YOLOv8m for Jetson Orin. Never deploy Large or XLarge variants on edge devices.
- Reduce input resolution: 320x320 instead of 640x640 reduces inference time by ~75% with moderate accuracy loss. For large objects, 320 is sufficient.
- Smart frame skipping: If objects move slowly, process 1 frame out of 3-5. Use a tracker (CSRT, ByteTrack) to interpolate positions in skipped frames.
- Optimize capture pipeline: Set CAP_PROP_BUFFERSIZE=1 to minimize acquisition latency. Use V4L2 directly on Linux for lower overhead.
- TensorRT on Jetson: always. The difference between PyTorch and TensorRT FP16 is 5-8x. There is no reason to use PyTorch for production inference on Jetson.
- Thermal throttling: On RPi and Jetson, overheating causes throttling. Add heatsinks, monitor temperature with `vcgencmd measure_temp`, and implement thermal management.
- Measure energy, not just speed: FPS/Watt is the metric that matters for battery devices. A 2x slower but 4x more energy-efficient model is often preferable.
- Profile before optimizing: Use `trtexec` on Jetson and `onnxruntime_perf_test` to identify the actual bottleneck before applying optimizations.
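The thermal-management point above can be sketched as a simple backoff policy. The sysfs path is the standard Linux thermal-zone interface; the thresholds and function names here are illustrative, not from any specific library:

```python
from pathlib import Path

THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")  # millidegrees C

def read_soc_temp() -> float:
    """Read the SoC temperature via the Linux thermal sysfs interface."""
    return int(THERMAL_ZONE.read_text()) / 1000.0

def frame_skip_for_temp(temp_c: float,
                        soft_limit: float = 70.0,
                        hard_limit: float = 80.0) -> int:
    """Back off inference frequency as the SoC heats up.

    Returns the frame-skip divisor: 1 = infer every frame,
    3 = infer every 3rd frame, 0 = pause inference until cooled.
    """
    if temp_c >= hard_limit:
        return 0   # pause before hardware throttling kicks in
    if temp_c >= soft_limit:
        return 3   # degrade gracefully instead of throttling abruptly
    return 1
```

In the detection loops shown earlier, you would poll read_soc_temp() every few seconds and adjust the loop's frame_skip accordingly, keeping the device below its throttling point.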
Edge Optimization Impact (YOLOv8n, Raspberry Pi 5)
| Configuration | FPS | mAP50 | Model Size |
|---|---|---|---|
| PyTorch FP32, 640x640 | 0.8 | 37.3% | 6.2 MB |
| ONNX Runtime FP32, 640x640 | 2.1 | 37.3% | 12.2 MB |
| NCNN FP32, 320x320 | 5.4 | 34.1% | 12.2 MB |
| NCNN FP32, 320x320 + frame skip | 14.2 (effective) | 34.1% | 12.2 MB |
| TFLite INT8, 320x320 | 6.8 | 33.6% | 3.1 MB |
Conclusions
Deploying computer vision models on edge devices requires a holistic approach that combines hardware selection, model optimization, and pipeline engineering:
- Edge hardware: Raspberry Pi 5 for budget scenarios, Jetson Orin for real-time performance
- INT8 quantization: 4x size reduction, 2-4x speedup, <1% accuracy loss
- NCNN for ARM CPU, TensorRT for NVIDIA GPU, TFLite + Coral TPU for ultra-low power
- Structured pruning + fine-tuning: remove 20-30% of filters with minimal loss
- Frame skipping + tracking: reduce compute by 70-80% in slowly-changing scenes
The key insight is that edge deployment is not a single optimization step but a system-level design problem. The best results come from co-designing the model, the runtime, and the acquisition pipeline together.
Cross-Series Resources
- MLOps: Model Serving in Production - cloud deployment with Kubernetes and Triton
- Deep Learning Advanced: Quantization and Compression