Deploying YOLO26 on Edge: Raspberry Pi, Jetson, and Embedded Systems
Deploying computer vision models on edge devices - Raspberry Pi, NVIDIA Jetson, smartphones, ARM microcontrollers - is a fundamentally different engineering challenge from cloud or GPU-server deployment. Resources are constrained: a few watts of power, single-digit gigabytes of RAM instead of dozens, and no dedicated GPU (or an entry-level one at best). Yet millions of applications require local inference: offline surveillance, robotics, portable medical devices, and industrial automation in environments without connectivity.
In this article we'll explore optimization techniques for edge deployment: quantization, pruning, knowledge distillation, optimized formats (ONNX, TFLite, NCNN) and real benchmarks on Raspberry Pi 5 and NVIDIA Jetson Orin.
What You'll Learn
- Edge hardware overview: Raspberry Pi, Jetson Nano/Orin, Coral TPU, Hailo
- Quantization: INT8, FP16 - theory and practical implementation
- Structured and unstructured pruning to reduce parameters
- Knowledge Distillation: training small models from large ones
- TFLite and NCNN: deployment on ARM devices
- TensorRT: maximum speed on NVIDIA GPUs (Jetson)
- ONNX Runtime with optimizations for CPU and NPU
- YOLO26 on Raspberry Pi 5: benchmarks and complete setup
- Real-time video pipeline on Jetson Orin Nano
1. Edge Hardware for Computer Vision
Choosing the right hardware is the first critical decision in edge deployment. There is no single best device: the optimal choice depends on power budget, performance requirements, cost, and deployment environment.
Edge Hardware Comparison 2026
| Device | CPU | GPU/NPU | RAM | TDP | YOLOv8n FPS |
|---|---|---|---|---|---|
| Raspberry Pi 5 | ARM Cortex-A76 4-core | VideoCore VII | 8GB | 15W | ~5 FPS |
| Jetson Nano (2GB) | ARM A57 4-core | 128 CUDA cores | 2GB | 10W | ~20 FPS |
| Jetson Orin Nano | ARM Cortex-A78AE 6-core | 1024 CUDA cores | 8GB | 25W | ~80 FPS |
| Jetson AGX Orin | ARM Cortex-A78AE 12-core | 2048 CUDA + DLA | 64GB | 60W | ~200 FPS |
| Google Coral TPU | ARM Cortex-A53 4-core | 4 TOPS Edge TPU | 1GB | 4W | ~30 FPS (TFLite) |
| Hailo-8 | - (PCIe accelerator) | 26 TOPS Neural Engine | - | 5W | ~120 FPS |
Hardware Selection Guide
The key metric for battery-powered or solar-powered devices is FPS/Watt, not raw FPS. The Coral TPU achieves ~7.5 FPS/Watt, while the Jetson AGX Orin achieves ~3.3 FPS/Watt but with significantly higher absolute throughput. For industrial line inspection or retail analytics, the Jetson Orin Nano strikes the best balance between performance and power consumption.
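The FPS/Watt trade-off is easy to recompute from the comparison table above; a quick sketch (values rounded from the table):

```python
# Efficiency (FPS per watt) derived from the hardware comparison table.
devices = {
    "Raspberry Pi 5":   {"fps": 5,   "tdp_w": 15},
    "Jetson Orin Nano": {"fps": 80,  "tdp_w": 25},
    "Jetson AGX Orin":  {"fps": 200, "tdp_w": 60},
    "Google Coral TPU": {"fps": 30,  "tdp_w": 4},
    "Hailo-8":          {"fps": 120, "tdp_w": 5},
}

def fps_per_watt(fps: float, tdp_w: float) -> float:
    return fps / tdp_w

# Rank devices by energy efficiency, not raw throughput
for name, d in sorted(devices.items(),
                      key=lambda kv: -fps_per_watt(kv[1]["fps"], kv[1]["tdp_w"])):
    print(f"{name:18s} {fps_per_watt(d['fps'], d['tdp_w']):5.1f} FPS/W")
```

Note how the ranking by FPS/Watt differs sharply from the ranking by absolute FPS: the dedicated accelerators dominate on efficiency.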
2. Quantization: From FP32 to INT8
Quantization reduces the numerical precision of model weights and activations: from float32 (32 bits) to float16 (16 bits) or int8 (8 bits). The practical effect: 4x smaller model with INT8, 2-4x faster inference, reduced energy consumption. Accuracy loss with modern techniques is typically under 1%.
Quantization Methods Comparison
| Method | Requires Retraining | Accuracy Loss | Speedup | Use Case |
|---|---|---|---|---|
| Post-Training (PTQ) FP16 | No | <0.1% | 1.5-2x | GPU deployment (Jetson FP16) |
| Post-Training (PTQ) INT8 | No (calibration data only) | 0.5-2% | 2-4x | CPU ARM, Coral TPU |
| Quantization-Aware Training (QAT) | Yes (few epochs) | <0.3% | 2-4x | High accuracy requirements |
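Under the hood, all three methods rely on the same affine INT8 mapping, q = round(x / scale) + zero_point. A minimal framework-free sketch of calibration and round-trip quantization (illustrative values, not tied to any specific library):

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive scale/zero-point from the observed min/max (affine PTQ style)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))        # clamp to the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

acts = [-1.0, -0.2, 0.0, 0.7, 2.0]        # stand-in activation statistics
scale, zp = calibrate(acts)
for x in acts:
    x_hat = dequantize(quantize(x, scale, zp), scale, zp)
    # the round-trip error is bounded by half a quantization step
    assert abs(x - x_hat) <= scale / 2 + 1e-9
```

This is why calibration data matters for PTQ INT8: the observed min/max directly determines the scale, and hence the quantization error.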
2.1 Post-Training Quantization (PTQ) with PyTorch
import copy
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert
def quantize_model_ptq(
model: torch.nn.Module,
calibration_loader,
backend: str = 'qnnpack' # 'qnnpack' for ARM, 'x86' for Intel CPU
) -> torch.nn.Module:
"""
Post-Training Quantization (PTQ): quantize model without retraining.
Only requires a small calibration dataset (~100-1000 images).
Flow:
1. Fuse operations (Conv+BN+ReLU -> single op)
2. Insert observers for calibration
3. Run calibration (forward pass on calibration dataset)
4. Convert to quantized model
"""
torch.backends.quantized.engine = backend
model_to_quantize = copy.deepcopy(model)
model_to_quantize.eval()
# Step 1: Fuse common layers for efficiency
model_to_quantize = torch.quantization.fuse_modules(
model_to_quantize,
[['conv1', 'bn1', 'relu']], # adapt to your model's layer names
inplace=True
)
# Step 2: Set qconfig and prepare for calibration
qconfig = get_default_qconfig(backend)
model_to_quantize.qconfig = qconfig
prepared_model = prepare(model_to_quantize, inplace=False)
# Step 3: Calibration with real data
print("Running quantization calibration...")
prepared_model.eval()
with torch.no_grad():
for i, (images, _) in enumerate(calibration_loader):
prepared_model(images)
if i >= 99: # 100 calibration batches sufficient
break
if i % 10 == 0:
print(f" Batch {i+1}/100")
# Step 4: Convert to quantized model
quantized_model = convert(prepared_model, inplace=False)
# Verify size reduction
def model_size_mb(m: torch.nn.Module) -> float:
param_size = sum(p.nelement() * p.element_size() for p in m.parameters())
buffer_size = sum(b.nelement() * b.element_size() for b in m.buffers())
return (param_size + buffer_size) / (1024 ** 2)
original_size = model_size_mb(model)
quantized_size = model_size_mb(quantized_model)
print(f"Original size: {original_size:.1f} MB")
print(f"Quantized size: {quantized_size:.1f} MB")
print(f"Reduction: {original_size / quantized_size:.1f}x")
return quantized_model
def compare_inference_speed(original_model, quantized_model,
input_tensor: torch.Tensor, n_runs: int = 100) -> dict:
"""Compare inference speed between original and quantized model."""
import time
results = {}
for name, model in [('FP32', original_model), ('INT8', quantized_model)]:
model.eval()
# Warmup
with torch.no_grad():
for _ in range(10):
model(input_tensor)
# Benchmark
start = time.perf_counter()
with torch.no_grad():
for _ in range(n_runs):
model(input_tensor)
elapsed = time.perf_counter() - start
avg_ms = (elapsed / n_runs) * 1000
results[name] = avg_ms
print(f"{name}: {avg_ms:.2f}ms / inference")
speedup = results['FP32'] / results['INT8']
print(f"INT8 Speedup: {speedup:.2f}x")
return results
2.2 YOLO Export for Edge Targets
from ultralytics import YOLO
model = YOLO('yolo26n.pt') # nano variant for edge
# ---- TFLite INT8 for Raspberry Pi / Coral TPU ----
model.export(
format='tflite',
imgsz=320, # reduced resolution for edge
int8=True, # INT8 quantization
data='coco.yaml' # dataset for PTQ calibration
)
# Output: yolo26n_int8.tflite
# ---- NCNN for ARM CPU (Raspberry Pi, Android) ----
model.export(
format='ncnn',
imgsz=320,
half=False # NCNN uses native FP32 or INT8
)
# Output: yolo26n_ncnn_model/
# ---- TensorRT FP16 for Jetson ----
model.export(
format='engine',
imgsz=640,
half=True, # FP16
workspace=2, # GB workspace (reduced for Jetson Nano)
device=0
)
# Output: yolo26n.engine
# ---- ONNX + ONNX Runtime for CPU/NPU ----
model.export(
format='onnx',
imgsz=320,
opset=17,
simplify=True,
dynamic=False # fixed batch size for edge deployment
)
print("Export completed for all edge targets")
3. YOLO on Raspberry Pi 5
The Raspberry Pi 5 with 8GB RAM and the ARM Cortex-A76 processor represents the most accessible entry point for edge AI. With the right optimizations (NCNN, reduced resolution, tracking to reduce inference frequency), you can build a functional real-time detection system.
Critical: Backend Selection
On Raspberry Pi, always use qnnpack as PyTorch quantization backend and
NCNN as inference runtime. The NCNN framework developed by Tencent is the
fastest ARM CPU runtime available, consistently outperforming ONNX Runtime and TFLite on
ARM Cortex-A chips by 20-40%.
# ============================================
# RASPBERRY PI 5 SETUP for Computer Vision
# ============================================
# 1. Install base dependencies
# sudo apt update && sudo apt install -y python3-pip libopencv-dev
# pip install ultralytics ncnn onnxruntime
# 2. System optimizations for AI
# In /boot/firmware/config.txt:
# gpu_mem=256 # Increase GPU memory (VideoCore VII)
# over_voltage=6 # Mild overclock
# arm_freq=2800 # Max CPU frequency (stock 2.4GHz)
# ============================================
# INFERENCE with NCNN on Raspberry Pi
# ============================================
import ncnn
import cv2
import numpy as np
import time
class YOLOncnn:
"""
YOLO inference with NCNN - optimized for ARM CPU.
NCNN by Tencent is the fastest runtime for ARM CPU.
"""
def __init__(self, param_path: str, bin_path: str,
num_threads: int = 4, input_size: int = 320):
self.net = ncnn.Net()
self.net.opt.num_threads = num_threads # use all cores
self.net.opt.use_vulkan_compute = False # no discrete GPU on RPi
self.net.load_param(param_path)
self.net.load_model(bin_path)
self.input_size = input_size
def predict(self, img_bgr: np.ndarray, conf_thresh: float = 0.4) -> list[dict]:
"""NCNN inference on ARM CPU."""
h, w = img_bgr.shape[:2]
# Resize + normalization for NCNN
img_resized = cv2.resize(img_bgr, (self.input_size, self.input_size))
img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
mat_in = ncnn.Mat.from_pixels(
img_rgb, ncnn.Mat.PixelType.PIXEL_RGB,
self.input_size, self.input_size
)
mean_vals = [0.485 * 255, 0.456 * 255, 0.406 * 255]
norm_vals = [1/0.229/255, 1/0.224/255, 1/0.225/255]
mat_in.substract_mean_normalize(mean_vals, norm_vals)
ex = self.net.create_extractor()
ex.input("images", mat_in)
_, mat_out = ex.extract("output0")
return self._parse_output(mat_out, conf_thresh, w, h)
def _parse_output(self, mat_out, conf_thresh,
orig_w, orig_h) -> list[dict]:
"""Parse NCNN output into detection format."""
detections = []
for i in range(mat_out.h):
row = np.array(mat_out.row(i))
confidence = row[4]
if confidence < conf_thresh:
continue
class_scores = row[5:]
class_id = int(np.argmax(class_scores))
class_conf = confidence * class_scores[class_id]
if class_conf >= conf_thresh:
cx, cy, bw, bh = row[:4]
x1 = int((cx - bw/2) * orig_w / self.input_size)
y1 = int((cy - bh/2) * orig_h / self.input_size)
x2 = int((cx + bw/2) * orig_w / self.input_size)
y2 = int((cy + bh/2) * orig_h / self.input_size)
detections.append({
'class_id': class_id,
'confidence': float(class_conf),
'bbox': (x1, y1, x2, y2)
})
return detections
def run_rpi_detection_loop(model_param: str, model_bin: str,
camera_id: int = 0) -> None:
"""Real-time detection loop optimized for Raspberry Pi."""
detector = YOLOncnn(model_param, model_bin,
num_threads=4, input_size=320)
cap = cv2.VideoCapture(camera_id)
# Optimize capture for RPi
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
cap.set(cv2.CAP_PROP_FPS, 30)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1) # minimize latency
    frame_skip = 2  # Run detection on every 2nd frame to save CPU
frame_count = 0
cached_dets = []
fps_history = []
while True:
ret, frame = cap.read()
if not ret:
break
t0 = time.perf_counter()
if frame_count % frame_skip == 0:
cached_dets = detector.predict(frame, conf_thresh=0.4)
elapsed = time.perf_counter() - t0
fps = 1.0 / elapsed if elapsed > 0 else 0
fps_history.append(fps)
# Visualization
for det in cached_dets:
x1, y1, x2, y2 = det['bbox']
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(frame, f"{det['confidence']:.2f}",
(x1, y1-5), cv2.FONT_HERSHEY_SIMPLEX,
0.5, (0,255,0), 2)
avg_fps = sum(fps_history[-30:]) / min(len(fps_history), 30)
cv2.putText(frame, f"FPS: {avg_fps:.1f}", (10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
cv2.imshow('RPi Detection', frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
frame_count += 1
cap.release()
cv2.destroyAllWindows()
print(f"Average FPS: {sum(fps_history)/len(fps_history):.1f}")
4. NVIDIA Jetson Orin: TensorRT and DLA
The Jetson Orin Nano (25W) offers 1024 CUDA cores with 32 Tensor Cores; the larger Orin NX and AGX Orin modules add dedicated DLA (Deep Learning Accelerator) engines that TensorRT can target. With TensorRT FP16 and a YOLO26n model, you can easily exceed 100 FPS on 640x640 video.
from ultralytics import YOLO
import cv2
import time
def setup_jetson_pipeline(model_path: str = 'yolo26n.pt') -> YOLO:
"""
Optimal setup for Jetson Orin:
1. Export to TensorRT FP16
2. Configure jetson_clocks for maximum performance
3. Set performance mode for GPU
"""
import subprocess
# Maximize Jetson performance (run once, requires sudo)
# subprocess.run(['sudo', 'jetson_clocks'], check=True)
# subprocess.run(['sudo', 'nvpmodel', '-m', '0'], check=True) # MAXN mode
model = YOLO(model_path)
print("Exporting to TensorRT FP16...")
model.export(
format='engine',
imgsz=640,
half=True, # FP16 - nearly same accuracy as FP32, 2x faster
workspace=2, # GB GPU workspace (Jetson Orin Nano: 8GB shared)
device=0,
batch=1,
simplify=True
)
# Load the TensorRT model
trt_model = YOLO('yolo26n.engine')
print("TensorRT model ready")
return trt_model
def run_jetson_pipeline(model: YOLO, source=0) -> None:
"""Real-time pipeline optimized for Jetson with performance stats."""
cap = cv2.VideoCapture(source)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
fps_list = []
frame_count = 0
try:
while True:
ret, frame = cap.read()
if not ret:
break
t0 = time.perf_counter()
results = model.predict(
frame, conf=0.35, iou=0.45,
verbose=False, half=True # FP16 inference
)
elapsed = time.perf_counter() - t0
fps = 1.0 / elapsed
fps_list.append(fps)
# Annotate with performance info
annotated = results[0].plot()
avg_fps = sum(fps_list[-30:]) / min(len(fps_list), 30)
info_lines = [
f"FPS: {fps:.0f} (avg: {avg_fps:.0f})",
f"Detections: {len(results[0].boxes)}",
f"Inference: {elapsed*1000:.1f}ms"
]
for i, text in enumerate(info_lines):
cv2.putText(annotated, text, (10, 30 + i * 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
cv2.imshow('Jetson Pipeline', annotated)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
frame_count += 1
finally:
cap.release()
cv2.destroyAllWindows()
if fps_list:
print(f"\n=== Jetson Stats ===")
print(f"Frames: {frame_count}")
print(f"Average FPS: {sum(fps_list)/len(fps_list):.1f}")
print(f"Peak FPS: {max(fps_list):.1f}")
print(f"Min latency: {1000/max(fps_list):.1f}ms")
5. Pruning and Knowledge Distillation
5.1 Structured Pruning
Structured pruning removes entire filters or neurons based on their L2-norm importance score. Unlike unstructured (weight-level) pruning, structured pruning produces models that are actually faster in inference - not just smaller files.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
def apply_structured_pruning(model: nn.Module,
amount: float = 0.3,
n: int = 2) -> nn.Module:
"""
    Structured L_n-norm pruning: zeroes entire filters/neurons via masks.
    Note: torch.nn.utils.prune only applies masks; to realize the inference
    speedup, the zeroed channels must be physically removed afterwards
    (e.g. with a channel-pruning library) or the model rebuilt without them.
    amount: fraction of filters to prune (0.3 = 30%)
    n: L_n norm used for filter ranking
"""
for name, module in model.named_modules():
if isinstance(module, nn.Conv2d):
prune.ln_structured(
module,
name='weight',
amount=amount,
n=n,
dim=0 # dim=0 = prune output filters
)
elif isinstance(module, nn.Linear):
prune.ln_structured(
module,
name='weight',
amount=amount,
n=n,
dim=0
)
return model
def remove_pruning_masks(model: nn.Module) -> nn.Module:
"""
Make pruning permanent: remove masks and 'orig' parameters,
keeping only the pruned weights. Required before export.
"""
for name, module in model.named_modules():
if isinstance(module, (nn.Conv2d, nn.Linear)):
try:
prune.remove(module, 'weight')
except ValueError:
pass
return model
def prune_and_finetune(model: nn.Module, train_loader, val_loader,
prune_amount: float = 0.2,
finetune_epochs: int = 5) -> nn.Module:
"""
Complete pipeline:
1. Prune the model (remove prune_amount% of filters)
2. Fine-tune to recover lost accuracy
3. Remove masks and finalize
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Applying {prune_amount*100:.0f}% structured pruning...")
model = apply_structured_pruning(model, amount=prune_amount)
# Brief fine-tuning for accuracy recovery
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for epoch in range(finetune_epochs):
model.train()
total_loss = 0.0
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
loss = criterion(model(images), labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
total_loss += loss.item()
model.eval()
correct = total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
preds = model(images).argmax(1)
correct += preds.eq(labels).sum().item()
total += labels.size(0)
print(f" FT Epoch {epoch+1}/{finetune_epochs} | "
f"Loss: {total_loss/len(train_loader):.4f} | "
f"Acc: {100.*correct/total:.2f}%")
model = remove_pruning_masks(model)
print("Pruning completed and finalized")
return model
5.2 Knowledge Distillation
Knowledge distillation trains a small student model to mimic a large teacher model. The student learns not just from hard labels (ground truth) but from the teacher's soft predictions (logits), which contain richer information about class relationships.
import torch
import torch.nn as nn
import torch.nn.functional as F
class DistillationLoss(nn.Module):
"""
Combined loss for knowledge distillation:
L_total = alpha * L_CE(student, labels) + (1-alpha) * L_KD(student, teacher)
L_KD = KL divergence between soft predictions (temperature-scaled)
Temperature T > 1 softens the distributions, revealing inter-class structure.
"""
def __init__(self, temperature: float = 4.0, alpha: float = 0.3):
super().__init__()
self.T = temperature
self.alpha = alpha
self.ce = nn.CrossEntropyLoss()
def forward(self, student_logits: torch.Tensor,
teacher_logits: torch.Tensor,
labels: torch.Tensor) -> torch.Tensor:
# Standard cross-entropy loss
loss_ce = self.ce(student_logits, labels)
# Soft prediction loss (KL divergence)
student_soft = F.log_softmax(student_logits / self.T, dim=1)
teacher_soft = F.softmax(teacher_logits / self.T, dim=1)
loss_kd = F.kl_div(student_soft, teacher_soft,
reduction='batchmean') * (self.T ** 2)
return self.alpha * loss_ce + (1 - self.alpha) * loss_kd
def train_with_distillation(teacher_model: nn.Module,
student_model: nn.Module,
train_loader,
epochs: int = 30,
temperature: float = 4.0) -> nn.Module:
"""Train student model guided by teacher model."""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
teacher_model.to(device).eval() # Teacher is frozen
student_model.to(device)
criterion = DistillationLoss(temperature=temperature, alpha=0.3)
optimizer = torch.optim.AdamW(student_model.parameters(),
lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=epochs
)
for epoch in range(epochs):
student_model.train()
total_loss = 0.0
for images, labels in train_loader:
images = images.to(device)
labels = labels.to(device)
with torch.no_grad():
teacher_logits = teacher_model(images)
student_logits = student_model(images)
loss = criterion(student_logits, teacher_logits, labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
total_loss += loss.item()
scheduler.step()
print(f"Epoch {epoch+1}/{epochs} | "
f"Loss: {total_loss/len(train_loader):.4f} | "
f"LR: {scheduler.get_last_lr()[0]:.2e}")
return student_model
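The temperature scaling in DistillationLoss can be sanity-checked without any framework: the KD term must vanish when student and teacher logits coincide, and the T^2 factor keeps it on the same scale as the cross-entropy term. A small pure-Python recomputation of the same formula (logit values are illustrative):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_term(student_logits, teacher_logits, T):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))
    return kl * T * T

teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
print(kd_term(teacher, teacher, T=4.0))   # 0.0 - identical logits, no KD signal
print(kd_term(student, teacher, T=4.0))   # positive - student still differs
```

Raising T flattens both distributions, which is exactly what exposes the teacher's inter-class structure to the student.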
6. ONNX Runtime: Cross-Platform Inference
ONNX Runtime provides a unified API for inference across CPU, CUDA, TensorRT, OpenVINO, CoreML, and more. It's the best choice when you need portability across multiple target platforms from a single model file.
import onnxruntime as ort
import numpy as np
import cv2
import time
from typing import Optional
class ONNXInferenceEngine:
"""
Cross-platform ONNX Runtime inference.
Automatically selects best execution provider:
TensorRT > CUDA > CPU
"""
EXECUTION_PROVIDERS = [
'TensorrtExecutionProvider', # Jetson with TensorRT
'CUDAExecutionProvider', # NVIDIA GPU (generic)
'CPUExecutionProvider' # Fallback: any device
]
def __init__(self, model_path: str,
providers: Optional[list] = None):
providers = providers or self.EXECUTION_PROVIDERS
# Session options for optimization
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = (
ort.GraphOptimizationLevel.ORT_ENABLE_ALL
)
sess_options.intra_op_num_threads = 4
sess_options.inter_op_num_threads = 1
# Check available providers
available = ort.get_available_providers()
selected = [p for p in providers if p in available]
print(f"Available providers: {available}")
print(f"Using: {selected[0]}")
self.session = ort.InferenceSession(
model_path,
sess_options=sess_options,
providers=selected
)
# Get I/O shapes
self.input_name = self.session.get_inputs()[0].name
self.input_shape = self.session.get_inputs()[0].shape
print(f"Input: {self.input_name} {self.input_shape}")
def preprocess(self, img_bgr: np.ndarray,
input_size: int = 320) -> np.ndarray:
"""Preprocess image for ONNX model."""
img = cv2.resize(img_bgr, (input_size, input_size))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = img.astype(np.float32) / 255.0
img = np.transpose(img, (2, 0, 1)) # HWC -> CHW
img = np.expand_dims(img, 0) # CHW -> NCHW
return np.ascontiguousarray(img)
def infer(self, img_bgr: np.ndarray,
input_size: int = 320) -> list:
"""Run inference and return raw outputs."""
input_data = self.preprocess(img_bgr, input_size)
outputs = self.session.run(
None,
{self.input_name: input_data}
)
return outputs
def benchmark(self, input_size: int = 320,
n_runs: int = 100) -> dict:
"""Benchmark inference speed."""
dummy_img = np.random.randint(0, 255,
(480, 640, 3),
dtype=np.uint8)
# Warmup
for _ in range(10):
self.infer(dummy_img, input_size)
times = []
for _ in range(n_runs):
t0 = time.perf_counter()
self.infer(dummy_img, input_size)
times.append(time.perf_counter() - t0)
avg_ms = np.mean(times) * 1000
p99_ms = np.percentile(times, 99) * 1000
print(f"Avg latency: {avg_ms:.2f}ms")
print(f"P99 latency: {p99_ms:.2f}ms")
print(f"Avg FPS: {1000/avg_ms:.1f}")
return {"avg_ms": avg_ms, "p99_ms": p99_ms}
7. Best Practices for Edge Deployment
Edge Deployment Checklist
- Choose the smallest model that meets requirements: YOLOv8n or YOLO26n for RPi, YOLOv8m for Jetson Orin. Never deploy Large or XLarge variants on edge devices.
- Reduce input resolution: 320x320 instead of 640x640 reduces inference time by ~75% with moderate accuracy loss. For large objects, 320 is sufficient.
- Smart frame skipping: If objects move slowly, process 1 frame out of 3-5. Use a tracker (CSRT, ByteTrack) to interpolate positions in skipped frames.
- Optimize capture pipeline: Set CAP_PROP_BUFFERSIZE=1 to minimize acquisition latency. Use V4L2 directly on Linux for lower overhead.
- TensorRT on Jetson: always. The difference between PyTorch and TensorRT FP16 is 5-8x. There is no reason to use PyTorch for production inference on Jetson.
- Thermal throttling: On RPi and Jetson, overheating causes throttling. Add heatsinks, monitor temperature with `vcgencmd measure_temp`, and implement thermal management.
- Measure energy, not just speed: FPS/Watt is the metric that matters for battery devices. A 2x slower but 4x more energy-efficient model is often preferable.
- Profile before optimizing: Use `trtexec` on Jetson and `onnxruntime_perf_test` to identify the actual bottleneck before applying optimizations.
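The thermal-management point above can be sketched as a simple backoff policy. The sysfs path is the standard Linux thermal-zone interface; the thresholds and function names here are illustrative, not from any specific library:

```python
from pathlib import Path

THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")  # millidegrees C

def read_soc_temp() -> float:
    """Read the SoC temperature via the Linux thermal sysfs interface."""
    return int(THERMAL_ZONE.read_text()) / 1000.0

def frame_skip_for_temp(temp_c: float,
                        soft_limit: float = 70.0,
                        hard_limit: float = 80.0) -> int:
    """Back off inference frequency as the SoC heats up.

    Returns the frame-skip divisor: 1 = infer every frame,
    3 = infer every 3rd frame, 0 = pause inference until cooled.
    """
    if temp_c >= hard_limit:
        return 0   # pause before hardware throttling kicks in
    if temp_c >= soft_limit:
        return 3   # degrade gracefully instead of throttling abruptly
    return 1
```

In the detection loops shown earlier, you would poll read_soc_temp() every few seconds and adjust the loop's frame_skip accordingly, keeping the device below its throttling point.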
Edge Optimization Impact (YOLOv8n, Raspberry Pi 5)
| Configuration | FPS | mAP50 | Model Size |
|---|---|---|---|
| PyTorch FP32, 640x640 | 0.8 | 37.3% | 6.2 MB |
| ONNX Runtime FP32, 640x640 | 2.1 | 37.3% | 12.2 MB |
| NCNN FP32, 320x320 | 5.4 | 34.1% | 12.2 MB |
| NCNN FP32, 320x320 + frame skip | 14.2 (effective) | 34.1% | 12.2 MB |
| TFLite INT8, 320x320 | 6.8 | 33.6% | 3.1 MB |
Conclusions
Deploying computer vision models on edge devices requires a holistic approach that combines hardware selection, model optimization, and pipeline engineering:
- Edge hardware: Raspberry Pi 5 for budget scenarios, Jetson Orin for real-time performance
- INT8 quantization: 4x size reduction, 2-4x speedup, <1% accuracy loss
- NCNN for ARM CPU, TensorRT for NVIDIA GPU, TFLite + Coral TPU for ultra-low power
- Structured pruning + fine-tuning: remove 20-30% of filters with minimal loss
- Frame skipping + tracking: reduce compute by 70-80% in slowly-changing scenes
The key insight is that edge deployment is not a single optimization step but a system-level design problem. The best results come from co-designing the model, the runtime, and the acquisition pipeline together.
Cross-Series Resources
- MLOps: Model Serving in Production - cloud deployment with Kubernetes and Triton
- Deep Learning Advanced: Quantization and Compression