Neural Architecture Search and AutoML: Automating Network Design
Designing a neural network architecture has traditionally been a manual process requiring years of expertise, intuition, and computational resources for experimentation. ResNet, EfficientNet, MobileNet — each of these iconic architectures is the result of deeply informed human design choices. Yet with Neural Architecture Search (NAS), this process can be automated: given a computational budget and a task, an algorithm explores the space of possible architectures and identifies an optimal one.
The practical result is striking. EfficientNet — arguably the most influential CNN family of recent years — was discovered via NAS. NASNet, DARTS, Once-for-All, and other NAS-designed models have matched or outperformed manually designed architectures at comparable or lower compute budgets on standard benchmarks. But the real revolution is democratization: with libraries like Optuna and Ray Tune, and weight-sharing algorithms like ENAS, it is now possible to run NAS on a consumer GPU in a few hours, optimizing the architecture for your specific hardware.
In this guide we explore NAS techniques from the ground up — from GridSearch to DARTS — and build a practical AutoML pipeline with Optuna for optimizing deep learning architectures.
What You'll Learn
- What NAS is and why it outperforms manual design
- Search spaces: micro (cell-based) vs macro (layer-based)
- Search strategies: Random Search, RL, Evolutionary, DARTS
- One-Shot NAS and Weight Sharing: reducing cost from years to hours
- NAS implementation with Optuna: hyperparameter + architecture search
- Differentiable Architecture Search (DARTS) with complete PyTorch
- Once-for-All Networks: architectures for heterogeneous hardware
- Hardware-Aware NAS: optimizing for latency, FLOPs, and parameters
- AutoML with AutoKeras and NAS for edge devices
- Real case study: NAS for medical classification on Jetson Nano
Why NAS Outperforms Manual Design
Manual architecture design suffers from three structural limitations. First, expertise bias: researchers tend to reuse familiar patterns (ResBlocks, skip connections) even when they are not optimal for the specific task. Second, hardware mismatch: an architecture optimal on an A100 is rarely optimal on a Cortex-A55. Third, combinatorial explosion: the space of possible architectures is astronomically large — even just varying number of layers, channels, and kernel sizes across 8 stages produces over 10^14 configurations.
NAS solves these problems by formally defining the design problem:
# NAS PROBLEM FORMALIZATION
#
# Input:
# - Search space A: set of possible architectures
# - Dataset D = (D_train, D_val)
# - Cost function c(a, D): measures quality of a on D
# - Computational budget B
#
# Output:
# - a* = argmin_{a in A} c(a, D_val) s.t. cost(search) <= B
#
# Cost typically includes:
# - Validation accuracy (minimize error)
# - Latency on target hardware
# - Number of parameters
# - Power consumption
#
# Example multi-objective cost function:
# c(a) = (1 - val_acc) + lambda * latency_ms / latency_target
import torch
import torch.nn as nn
from typing import Dict, Any, Optional
class NASObjective:
"""
Multi-objective cost function for NAS.
Combines accuracy, latency, and model size.
"""
def __init__(
self,
accuracy_weight: float = 1.0,
latency_weight: float = 0.1,
params_weight: float = 0.01,
latency_target_ms: float = 10.0,
params_target_M: float = 5.0
):
self.w_acc = accuracy_weight
self.w_lat = latency_weight
self.w_par = params_weight
self.lat_target = latency_target_ms
self.par_target = params_target_M * 1e6
def __call__(
self,
val_accuracy: float,
latency_ms: float,
n_params: int
) -> float:
"""
Computes composite cost. Lower = better.
val_accuracy: [0, 1] - we want to maximize
latency_ms: milliseconds - we want to minimize
n_params: parameter count - we want to minimize
"""
acc_cost = self.w_acc * (1.0 - val_accuracy)
lat_cost = self.w_lat * max(0, latency_ms / self.lat_target - 1.0)
par_cost = self.w_par * max(0, n_params / self.par_target - 1.0)
return acc_cost + lat_cost + par_cost
# Usage example:
obj = NASObjective(latency_target_ms=5.0, params_target_M=2.0)
cost = obj(val_accuracy=0.92, latency_ms=4.2, n_params=1_800_000)
print(f"Composite cost: {cost:.4f}") # ~0.08
The NAS Search Space
The heart of NAS is the definition of the search space: the set of all possible architectures the algorithm can explore. The choice of search space is fundamental — too narrow and the optimum is unreachable, too wide and the search becomes computationally intractable.
There are two main approaches:
- Cell-based NAS (micro search space): searches for the optimal structure of a single cell, then replicates it multiple times to build the network. This drastically reduces the search space while maintaining flexibility. Used by NASNet, DARTS, ENAS.
- Macro search space: searches for global architecture parameters such as number of layers, channel widths, connection types. Used by EfficientNet (NAS + scaling), MobileNet v3, Once-for-All.
# Example of a macro search space for a CNN
# Each dimension is a discrete or continuous choice
SEARCH_SPACE = {
# Global structure
"n_layers": [4, 6, 8, 10, 12], # Number of layers
"initial_channels": [16, 32, 48, 64], # Initial channels
"width_multiplier": [0.5, 0.75, 1.0, 1.25, 1.5], # Width multiplier
# Per layer:
"kernel_sizes": [3, 5, 7], # Kernel size
"expansion_ratios": [1, 2, 4, 6], # MBConv expansion ratio
"se_ratios": [0.0, 0.25, 0.5], # Squeeze-and-Excitation ratio
"skip_ops": ["identity", "conv", "pool"], # Skip connection type
# Attention configuration (for hybrid CNN+Transformer networks)
"use_attention": [False, True],
"attention_heads": [1, 2, 4, 8],
}
# Estimate search space size
n_configs = (
    len(SEARCH_SPACE["n_layers"]) *
    len(SEARCH_SPACE["initial_channels"]) *
    len(SEARCH_SPACE["width_multiplier"]) *
    len(SEARCH_SPACE["kernel_sizes"]) ** 8 *   # per-layer kernel choice, 8 layers
    len(SEARCH_SPACE["expansion_ratios"]) ** 8  # per-layer expansion choice
)
print(f"Possible configurations: {n_configs:.2e}")
# ~4.3e10 for this subset alone; adding the per-layer SE ratios, skip ops,
# and attention choices pushes past 10^14: exhaustive exploration is impossible!
# CELL-BASED SEARCH SPACE (much more compact)
# In NASNet/DARTS: search only the cell structure
CELL_SEARCH_SPACE = {
"n_nodes": [3, 4, 5], # Internal nodes per cell
"ops_per_edge": [ # Candidate operations per edge
"sep_conv_3x3",
"sep_conv_5x5",
"dil_conv_3x3",
"dil_conv_5x5",
"avg_pool_3x3",
"max_pool_3x3",
"skip_connect",
"none"
],
"n_cells": [6, 8, 10, 12, 14], # Number of cells in the network
"init_channels": [16, 24, 32, 36], # Initial channels
}
# Reduced space: the cell structure is searched once and replicated, so the
# space no longer grows with network depth — and with a differentiable
# relaxation (DARTS) it can be explored in a single training run.
Search Strategies
Search strategies determine how the algorithm navigates the architecture space. The choice of strategy is as important as the search space itself:
| Strategy | Approach | Pros | Cons | Typical Cost |
|---|---|---|---|---|
| Random Search | Random sampling | Simple, strong baseline | Inefficient | N * full training |
| Grid Search | Exhaustive grid | Complete for small spaces | Exponential in dimensions | K^D * training |
| Bayesian Opt. | Surrogate model + acquisition | Efficient, guided | Expensive for large spaces | 50-200 trials |
| RL (NASNet) | RNN controller | Complex architectures | 400 GPU-days originally | 1000+ trials |
| Evolutionary | Genetic algorithms | Good exploration | Very slow | 500+ trials |
| DARTS | Continuous relaxation | 1-4 GPU-days, near-SOTA | Memory intensive, can collapse to skips | 1 training cycle |
| One-Shot / ENAS | Weight sharing supernet | Hours on single GPU | Ranking approximation | 1 supernet + sampling |
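The table's "strong baseline" claim for Random Search is easy to verify in practice: uniform sampling from a well-designed space often lands within a point or two of sophisticated searches. A minimal sketch under that assumption — the `sample_architecture` and `random_search` helpers are illustrative names, not a library API:

```python
import random
from typing import Any, Callable, Dict

def sample_architecture(space: Dict[str, list]) -> Dict[str, Any]:
    """Uniformly samples one configuration from a macro search space."""
    config = {"n_layers": random.choice(space["n_layers"])}
    config["initial_channels"] = random.choice(space["initial_channels"])
    for i in range(config["n_layers"]):
        # Per-layer choices: this is where the combinatorial explosion lives
        config[f"kernel_{i}"] = random.choice(space["kernel_sizes"])
        config[f"expansion_{i}"] = random.choice(space["expansion_ratios"])
    return config

def random_search(space: Dict[str, list], evaluate: Callable,
                  n_trials: int = 50, seed: int = 0):
    """Keeps the best of n_trials uniformly sampled architectures."""
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_architecture(space)
        score = evaluate(cfg)  # e.g. val accuracy after a short training run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With the SEARCH_SPACE defined earlier, `evaluate` would train each sampled config for a few epochs and return validation accuracy; its cost dominates, which is why Random Search scales as N full trainings.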
Practical NAS with Optuna
Optuna is one of the most widely used libraries for hyperparameter and architecture search. It combines Bayesian-style sampling (TPE), pruning of unpromising trials, and a clean API that integrates naturally with PyTorch.
# pip install optuna optuna-integration[pytorch] torch torchvision
import optuna
from optuna.trial import Trial
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# ============================================================
# SEARCH SPACE: flexible CNN architecture
# ============================================================
class FlexibleCNN(nn.Module):
"""
CNN with parameterized architecture for NAS.
Supports 2-5 convolutional blocks with variable kernels and channels.
"""
def __init__(self, n_conv_layers: int, channels: list,
kernel_sizes: list, use_bn: bool,
dropout_rate: float, n_classes: int = 10):
super().__init__()
layers = []
in_channels = 3
for i in range(n_conv_layers):
out_channels = channels[i]
k = kernel_sizes[i]
layers.extend([
nn.Conv2d(in_channels, out_channels, k, padding=k//2),
nn.BatchNorm2d(out_channels) if use_bn else nn.Identity(),
nn.ReLU(inplace=True),
nn.MaxPool2d(2) if i < n_conv_layers - 1 else nn.AdaptiveAvgPool2d(4)
])
in_channels = out_channels
self.features = nn.Sequential(*layers)
final_size = channels[-1] * 16
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Dropout(dropout_rate),
nn.Linear(final_size, 512),
nn.ReLU(),
nn.Dropout(dropout_rate / 2),
nn.Linear(512, n_classes)
)
def forward(self, x):
return self.classifier(self.features(x))
# ============================================================
# OBJECTIVE FUNCTION for Optuna
# ============================================================
def objective(trial: Trial) -> float:
"""
Objective function: trains an architecture and returns val_accuracy.
Optuna will call this hundreds of times with different configurations.
"""
# === SEARCH SPACE ===
n_conv_layers = trial.suggest_int("n_conv_layers", 2, 5)
channels = [
trial.suggest_categorical(f"channels_{i}", [32, 64, 96, 128, 192, 256])
for i in range(n_conv_layers)
]
kernel_sizes = [
trial.suggest_categorical(f"kernel_{i}", [3, 5])
for i in range(n_conv_layers)
]
use_bn = trial.suggest_categorical("use_bn", [True, False])
dropout_rate = trial.suggest_float("dropout", 0.1, 0.5)
# Training hyperparameters
lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True)
# === DATA ===
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761))
])
transform_val = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761))
])
train_set = torchvision.datasets.CIFAR100(
root='./data', train=True, download=True, transform=transform_train
)
val_set = torchvision.datasets.CIFAR100(
root='./data', train=False, download=True, transform=transform_val
)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False, num_workers=4)
# === MODEL ===
device = "cuda" if torch.cuda.is_available() else "cpu"
model = FlexibleCNN(n_conv_layers, channels, kernel_sizes, use_bn,
dropout_rate, n_classes=100).to(device)
# Optimizer
if optimizer_name == "Adam":
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
elif optimizer_name == "AdamW":
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
else:
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
weight_decay=weight_decay)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)
# === TRAINING (15 epochs for fast trial) ===
for epoch in range(15):
model.train()
for imgs, labels in train_loader:
imgs, labels = imgs.to(device), labels.to(device)
optimizer.zero_grad()
loss = criterion(model(imgs), labels)
loss.backward()
optimizer.step()
scheduler.step()
# Pruning: eliminate unpromising trials after first 5 epochs
if epoch >= 4:
model.eval()
correct = total = 0
with torch.no_grad():
for imgs, labels in val_loader:
imgs, labels = imgs.to(device), labels.to(device)
preds = model(imgs).argmax(1)
correct += (preds == labels).sum().item()
total += labels.size(0)
val_acc = correct / total
# Report to Optuna for pruning
trial.report(val_acc, epoch)
if trial.should_prune():
raise optuna.exceptions.TrialPruned()
# Final accuracy
model.eval()
correct = total = 0
with torch.no_grad():
for imgs, labels in val_loader:
imgs, labels = imgs.to(device), labels.to(device)
preds = model(imgs).argmax(1)
correct += (preds == labels).sum().item()
total += labels.size(0)
return correct / total # Optuna maximizes this value
# ============================================================
# LAUNCH OPTUNA STUDY
# ============================================================
study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(seed=42),
pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=5)
)
# Run 100 trials (parallelizable with n_jobs)
study.optimize(objective, n_trials=100, timeout=3600)
# === RESULTS ===
best_trial = study.best_trial
print(f"Best accuracy: {best_trial.value:.4f}")
print(f"Best configuration:")
for key, val in best_trial.params.items():
print(f" {key}: {val}")
# Visualize parameter importance
fig = optuna.visualization.plot_param_importances(study)
fig.show() # Requires plotly
DARTS: Differentiable Architecture Search
DARTS (Liu et al., 2019) is one of the most elegant and effective NAS algorithms. The key idea: make the operation choice continuous and differentiable, allowing architecture optimization via gradient descent rather than discrete search.
In a DARTS cell, each edge between nodes carries a soft mixture of all candidate operations (conv 3x3, conv 5x5, skip, pool). The mixing weights — the architecture parameters alpha — are optimized by gradient descent alongside the model weights. At the end, the highest-weighted operation is selected for each edge (discretization).
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List
# ============================================================
# PRIMITIVE OPERATIONS for the DARTS cell
# ============================================================
OPS = {
'none': lambda C, stride: Zero(stride),
'skip_connect': lambda C, stride: nn.Identity() if stride == 1 else FactorizedReduce(C),
'sep_conv_3x3': lambda C, stride: SepConv(C, C, 3, stride, 1),
'sep_conv_5x5': lambda C, stride: SepConv(C, C, 5, stride, 2),
'dil_conv_3x3': lambda C, stride: DilConv(C, C, 3, stride, 2, 2),
'avg_pool_3x3': lambda C, stride: nn.AvgPool2d(3, stride, 1, count_include_pad=False),
'max_pool_3x3': lambda C, stride: nn.MaxPool2d(3, stride, 1),
}
PRIMITIVES = list(OPS.keys())
class SepConv(nn.Module):
    """Depthwise separable convolution, applied twice (as in the DARTS paper)."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
            nn.ReLU(),
            nn.Conv2d(C_out, C_out, kernel_size, 1, padding, groups=C_out, bias=False),
            nn.Conv2d(C_out, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class DilConv(nn.Module):
    """Dilated depthwise separable convolution."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class FactorizedReduce(nn.Module):
    """Halves spatial resolution via two offset stride-2 1x1 convs."""
    def __init__(self, C_in, C_out=None):
        super().__init__()
        C_out = C_out or C_in
        self.relu = nn.ReLU()
        self.conv1 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
        self.conv2 = nn.Conv2d(C_in, C_out - C_out // 2, 1, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(C_out)
    def forward(self, x):
        x = self.relu(x)
        return self.bn(torch.cat([self.conv1(x), self.conv2(x[:, :, 1:, 1:])], dim=1))
class Zero(nn.Module):
    """The 'none' operation: outputs zeros (with optional spatial reduction)."""
    def __init__(self, stride):
        super().__init__()
        self.stride = stride
    def forward(self, x):
        if self.stride == 1:
            return x.mul(0.)
        return x[:, :, ::self.stride, ::self.stride].mul(0.)
# ============================================================
# MIXED OPERATION: softmax over all operations
# ============================================================
class MixedOp(nn.Module):
"""
Soft mixture of K operations with weights alpha (architecture parameters).
output = sum_k(softmax(alpha)[k] * op_k(input))
"""
def __init__(self, C: int, stride: int):
super().__init__()
self._ops = nn.ModuleList()
for prim in PRIMITIVES:
op = OPS[prim](C, stride)
if 'pool' in prim:
op = nn.Sequential(op, nn.BatchNorm2d(C))
self._ops.append(op)
def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
"""weights: softmax(alpha) for this edge."""
return sum(w * op(x) for w, op in zip(weights, self._ops))
# ============================================================
# DARTS CELL (Normal Cell)
# ============================================================
class DARTSCell(nn.Module):
    """
    DARTS cell with N_NODES intermediate nodes.
    Each node is the sum of all previous inputs weighted by alpha.
    The cell output concatenates the intermediate nodes: N_NODES * C channels.
    """
    N_NODES = 4
    def __init__(self, C_prev_prev: int, C_prev: int, C: int,
                 reduction: bool = False, reduction_prev: bool = False):
        super().__init__()
        self.reduction = reduction
        stride = 2 if reduction else 1
        # If the previous cell reduced resolution, s0 must be downsampled to match s1
        if reduction_prev:
            self.preprocess0 = FactorizedReduce(C_prev_prev, C)
        else:
            self.preprocess0 = nn.Sequential(
                nn.ReLU(), nn.Conv2d(C_prev_prev, C, 1, bias=False), nn.BatchNorm2d(C)
            )
        self.preprocess1 = nn.Sequential(
            nn.ReLU(), nn.Conv2d(C_prev, C, 1, bias=False), nn.BatchNorm2d(C)
        )
        self._ops = nn.ModuleList()
        for i in range(self.N_NODES):
            for j in range(2 + i):
                s = stride if j < 2 else 1
                self._ops.append(MixedOp(C, s))
    def forward(self, s0: torch.Tensor, s1: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
        s0 = self.preprocess0(s0)
        s1 = self.preprocess1(s1)
        states = [s0, s1]
        offset = 0
        for i in range(self.N_NODES):
            s = sum(
                self._ops[offset + j](h, weights[offset + j])
                for j, h in enumerate(states)
            )
            offset += len(states)
            states.append(s)
        return torch.cat(states[2:], dim=1)
# ============================================================
# DARTS NETWORK with alpha parameters
# ============================================================
class DARTSNetwork(nn.Module):
    def __init__(self, C: int = 16, n_classes: int = 10,
                 n_layers: int = 8, n_nodes: int = 4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList()
        # Track channel counts: each cell outputs N_NODES * C_curr channels,
        # and its two inputs come from the two preceding cells
        C_prev_prev, C_prev, C_curr = C, C, C
        reduction_prev = False
        for i in range(n_layers):
            reduction = i in [n_layers // 3, 2 * n_layers // 3]
            if reduction:
                C_curr *= 2
            cell = DARTSCell(C_prev_prev, C_prev, C_curr, reduction, reduction_prev)
            self.cells.append(cell)
            reduction_prev = reduction
            C_prev_prev, C_prev = C_prev, C_curr * DARTSCell.N_NODES
        n_ops = len(PRIMITIVES)
        n_edges = sum(2 + i for i in range(n_nodes))  # 14 edges for 4 nodes
        # Alpha: architecture parameters (learnable!). The original DARTS shares
        # one alpha set across all normal cells and one across all reduction
        # cells; here each cell gets its own set for simplicity.
        self._arch_parameters = nn.ParameterList([
            nn.Parameter(1e-3 * torch.randn(n_edges, n_ops))
            for _ in range(n_layers)
        ])
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(C_prev, n_classes)
def arch_parameters(self):
return list(self._arch_parameters)
def model_parameters(self):
ids = set(id(p) for p in self.arch_parameters())
return [p for p in self.parameters() if id(p) not in ids]
def forward(self, x: torch.Tensor) -> torch.Tensor:
s0 = s1 = self.stem(x)
for i, cell in enumerate(self.cells):
weights = F.softmax(self._arch_parameters[i], dim=-1)
s0, s1 = s1, cell(s0, s1, weights)
out = self.global_pool(s1)
return self.classifier(out.view(out.size(0), -1))
def genotype(self):
"""Extracts the discrete architecture (argmax of alphas)."""
result = []
for alpha in self._arch_parameters:
ops = F.softmax(alpha, dim=-1).argmax(dim=-1)
result.append([PRIMITIVES[op.item()] for op in ops])
return result
# ============================================================
# DARTS TRAINING LOOP: Bi-level optimization
# ============================================================
def train_darts(model: DARTSNetwork, train_loader, val_loader,
n_epochs: int = 50, arch_lr: float = 3e-4, model_lr: float = 3e-3):
"""
DARTS uses bi-level optimization:
- Step 1: optimize model weights on train set
- Step 2: optimize alpha (architecture) on val set
"""
device = next(model.parameters()).device
optimizer_model = torch.optim.SGD(
model.model_parameters(), lr=model_lr,
momentum=0.9, weight_decay=3e-4
)
optimizer_arch = torch.optim.Adam(
model.arch_parameters(), lr=arch_lr,
betas=(0.5, 0.999), weight_decay=1e-3
)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer_model, T_max=n_epochs
)
val_iter = iter(val_loader)
for epoch in range(n_epochs):
model.train()
total_loss = 0.0
for imgs_train, labels_train in train_loader:
imgs_train = imgs_train.to(device)
labels_train = labels_train.to(device)
# Step 1: Update alpha on validation
try:
imgs_val, labels_val = next(val_iter)
except StopIteration:
val_iter = iter(val_loader)
imgs_val, labels_val = next(val_iter)
imgs_val = imgs_val.to(device)
labels_val = labels_val.to(device)
optimizer_arch.zero_grad()
loss_arch = criterion(model(imgs_val), labels_val)
loss_arch.backward()
optimizer_arch.step()
# Step 2: Update model weights on training
optimizer_model.zero_grad()
loss_model = criterion(model(imgs_train), labels_train)
loss_model.backward()
nn.utils.clip_grad_norm_(model.model_parameters(), 5.0)
optimizer_model.step()
total_loss += loss_model.item()
scheduler.step()
if epoch % 10 == 0:
print(f"Epoch {epoch}/{n_epochs} | Loss: {total_loss/len(train_loader):.4f}")
print(f"Genotype: {model.genotype()[:2]}...")
return model.genotype()
DARTS vs One-Shot NAS: Comparison
| Aspect | DARTS | One-Shot (ENAS, OFA) |
|---|---|---|
| Search cost | 1-4 GPU-days | Hours (post supernet training) |
| Architecture quality | Very high | High (slight approximation) |
| Hardware target | Single target | Multi-target (OFA) |
| GPU memory | High (bi-level opt) | Medium |
| Implementation | Complex | Moderate |
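The weight-sharing idea behind the One-Shot column can be sketched in a few lines: build a supernet containing every candidate op, and for each batch run a single randomly sampled op per layer. All sampled paths update shared weights, so after training, candidate architectures can be ranked by evaluating paths without retraining. A minimal single-path sketch (not the full ENAS controller; op list and sizes are illustrative):

```python
import random
import torch
import torch.nn as nn

class OneShotLayer(nn.Module):
    """Holds all candidate ops; the forward pass runs only the chosen one."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate 0
            nn.Conv2d(channels, channels, 5, padding=2),  # candidate 1
            nn.Identity(),                                # candidate 2 (skip)
        ])
    def forward(self, x: torch.Tensor, choice: int) -> torch.Tensor:
        return self.ops[choice](x)

class OneShotSupernet(nn.Module):
    def __init__(self, channels: int = 16, n_layers: int = 4, n_classes: int = 10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.layers = nn.ModuleList(OneShotLayer(channels) for _ in range(n_layers))
        self.head = nn.Linear(channels, n_classes)
    def forward(self, x: torch.Tensor, choices) -> torch.Tensor:
        x = self.stem(x)
        for layer, c in zip(self.layers, choices):
            x = torch.relu(layer(x, c))
        return self.head(x.mean(dim=(2, 3)))  # global average pool + classify

# One training step trains only the sampled path, but its weights are
# shared with every other architecture that reuses those ops.
net = OneShotSupernet()
choices = [random.randrange(3) for _ in net.layers]
logits = net(torch.randn(2, 3, 32, 32), choices)
```

A loss computed on `logits` backpropagates only through the sampled ops; sampling a fresh path per batch is what amortizes training across the whole space.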
Hardware-Aware NAS with Optuna and Latency Constraints
Theoretical NAS maximizes accuracy. Practical NAS optimizes a multi-objective trade-off between accuracy, latency, FLOPs, and model size. Optuna natively supports multi-objective search with the NSGA-II algorithm.
import optuna
from optuna.samplers import NSGAIISampler
import torch
import time
def estimate_latency_ms(model: nn.Module, input_shape=(1, 3, 224, 224),
n_runs: int = 50, device: str = "cpu") -> float:
"""Measures average latency in milliseconds."""
model = model.to(device).eval()
x = torch.randn(*input_shape).to(device)
with torch.no_grad():
for _ in range(10):
model(x)
t0 = time.perf_counter()
with torch.no_grad():
for _ in range(n_runs):
model(x)
elapsed = (time.perf_counter() - t0) / n_runs * 1000
return elapsed
def count_flops(model: nn.Module, input_shape=(1, 3, 224, 224)) -> int:
"""Estimates FLOPs (use fvcore or ptflops in production)."""
total_flops = 0
x = torch.randn(*input_shape)
def hook(module, input, output):
nonlocal total_flops
if isinstance(module, nn.Conv2d):
B, C_out, H_out, W_out = output.shape
kernel_ops = module.kernel_size[0] * module.kernel_size[1] * module.in_channels
total_flops += 2 * B * C_out * H_out * W_out * kernel_ops
elif isinstance(module, nn.Linear):
B = input[0].shape[0]
total_flops += 2 * B * module.in_features * module.out_features
hooks = []
for m in model.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
hooks.append(m.register_forward_hook(hook))
with torch.no_grad():
model(x)
for h in hooks:
h.remove()
return total_flops
# Multi-objective NAS: maximize accuracy, minimize latency
def multi_objective(trial: optuna.Trial):
    n_channels = trial.suggest_categorical("channels", [32, 64, 128])
    n_layers = trial.suggest_int("layers", 2, 6)
    kernel = trial.suggest_categorical("kernel", [3, 5])
    model = FlexibleCNN(
        n_conv_layers=n_layers,
        channels=[n_channels] * n_layers,
        kernel_sizes=[kernel] * n_layers,
        use_bn=True,
        dropout_rate=0.2,
        n_classes=10
    )
    val_accuracy = 0.80 + 0.05 * (n_channels / 128)  # Simulated: replace with real training
    latency = estimate_latency_ms(model, input_shape=(1, 3, 32, 32))
    flops = count_flops(model, input_shape=(1, 3, 32, 32))  # Could become a third objective
    return val_accuracy, -latency  # Optuna maximizes both, so negate latency
# Multi-objective study with NSGA-II (Pareto-optimal evolutionary algorithm)
study_mo = optuna.create_study(
directions=["maximize", "maximize"],
sampler=NSGAIISampler(seed=42)
)
study_mo.optimize(multi_objective, n_trials=100)
# Pareto front: optimal architectures on the accuracy/latency trade-off
pareto_trials = study_mo.best_trials
print(f"Pareto-optimal architectures: {len(pareto_trials)}")
for t in pareto_trials[:5]:
acc, neg_lat = t.values
print(f" Acc: {acc:.3f}, Latency: {-neg_lat:.1f} ms | {t.params}")
Once-for-All: NAS for Heterogeneous Hardware
Once-for-All (OFA) from MIT solves a fundamental practical problem: training a separate network for each target device is prohibitive. OFA trains a single supernet that supports thousands of sub-architectures, then uses a fast evolutionary search to find the optimal sub-architecture for each device.
OFA training uses progressive shrinking: the supernet is trained starting from the maximum configuration, then progressively reducing dimensions (first kernel sizes, then depth, finally width). This creates shared weights that work well across all configurations.
# Using OFA via the official library
# pip install ofa
from ofa.model_zoo import ofa_net
import torch
import time
# Load pre-trained OFA network (OFA-MobileNetV3)
ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)
def evaluate_subnet(subnet_config):
"""Evaluates a sub-architecture of the OFA network."""
ofa_network.set_active_subnet(
ks=subnet_config['ks'], # kernel sizes
e=subnet_config['e'], # expansion ratios
d=subnet_config['d'] # depths
)
subnet = ofa_network.get_active_subnet(preserve_weight=True)
return subnet
# Example sub-network for a low-latency mobile target (illustrative config,
# in the spirit of the OFA paper's iPhone XS specialization, ~7.5ms target)
iphone_config = {
'ks': [3, 3, 5, 3, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5],
'e': [3, 3, 6, 3, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6],
'd': [2, 3, 3, 3, 3]
}
iphone_subnet = evaluate_subnet(iphone_config)
print(f"Subnet for iPhone XS: {sum(p.numel() for p in iphone_subnet.parameters()):,} params")
# Example configuration for a Raspberry Pi 4 class target (~50ms budget, illustrative)
rpi_config = {
'ks': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
'e': [3, 3, 3, 3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4],
'd': [2, 2, 2, 2, 2]
}
rpi_subnet = evaluate_subnet(rpi_config)
# Benchmark latency on CPU (proxy for RPi4)
rpi_subnet.eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
for _ in range(10): rpi_subnet(x) # warmup
t0 = time.perf_counter()
for _ in range(50): rpi_subnet(x)
lat_ms = (time.perf_counter() - t0) / 50 * 1000
print(f"RPi4 subnet: {lat_ms:.1f}ms")
print(f"Parameters: {sum(p.numel() for p in rpi_subnet.parameters()):,}")
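OFA's second stage — the fast evolutionary search over sub-architectures — can be sketched generically: mutate configurations, score them with a cheap fitness function, keep the best, repeat. In the real pipeline the fitness combines an accuracy predictor with latency lookup tables; the encoding and toy fitness below are illustrative, not the ofa library's API:

```python
import random
from typing import Callable, Dict, List

def mutate(cfg: Dict[str, List[int]], choices: Dict[str, List[int]],
           p: float = 0.1) -> Dict[str, List[int]]:
    """Resamples each gene with probability p."""
    child = {k: list(v) for k, v in cfg.items()}
    for key, options in choices.items():
        for i in range(len(child[key])):
            if random.random() < p:
                child[key][i] = random.choice(options)
    return child

def evolutionary_search(init_cfg, choices, fitness: Callable,
                        n_generations: int = 20, pop_size: int = 16,
                        n_parents: int = 4):
    """Keeps the top n_parents each generation; fills the rest with mutants."""
    population = [mutate(init_cfg, choices, p=0.5) for _ in range(pop_size)]
    for _ in range(n_generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:n_parents]
        population = parents + [mutate(random.choice(parents), choices)
                                for _ in range(pop_size - n_parents)]
    return max(population, key=fitness)

# Toy fitness: prefer larger kernels, penalize expansion (stand-in for latency)
choices = {"ks": [3, 5, 7], "e": [3, 4, 6]}
init = {"ks": [3] * 5, "e": [3] * 5}
fit = lambda c: sum(c["ks"]) - 0.5 * sum(c["e"])
best = evolutionary_search(init, choices, fit, n_generations=30)
```

Because each evaluation is a table lookup rather than a training run, thousands of candidates can be scored in minutes, which is what makes per-device specialization cheap after the supernet is trained.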
End-to-End AutoML with AutoKeras
For those who don't have time to implement NAS from scratch, AutoKeras offers a very high-level API that automatically handles architecture search, preprocessing, and hyperparameter tuning. Internally it uses Keras Tuner with Bayesian algorithms and random search, and integrates with TensorFlow for deployment.
# pip install autokeras tensorflow
import autokeras as ak
import numpy as np
import tensorflow as tf
# ============================================================
# IMAGE CLASSIFICATION with AutoKeras
# ============================================================
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0
# Create searcher with complexity constraints
clf = ak.ImageClassifier(
max_trials=30,
overwrite=True,
project_name='nas_cifar10',
seed=42
)
# Launch NAS (search + training)
clf.fit(
x_train, y_train,
epochs=20,
validation_split=0.15,
callbacks=[
tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
]
)
# Evaluation
loss, acc = clf.evaluate(x_test, y_test)
print(f"Test accuracy: {acc:.4f}")
best_model = clf.export_model()
best_model.summary()
total_params = best_model.count_params()
print(f"Total parameters: {total_params:,}")
print("\nArchitecture found by AutoKeras:")
for i, layer in enumerate(best_model.layers):
print(f" Layer {i}: {type(layer).__name__}")
# Export for deployment
best_model.save('best_nas_model.h5')
# Convert to TFLite for edge deployment
converter = tf.lite.TFLiteConverter.from_keras_model(best_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Dynamic-range quantization
# (full INT8 additionally requires a representative_dataset for calibration)
tflite_model = converter.convert()
with open('best_nas_model_quant.tflite', 'wb') as f:
    f.write(tflite_model)
print(f"TFLite model: {len(tflite_model)/1024:.1f} KB")
Case Study: NAS for Medical Classification on Jetson Nano
A real case clarifies the practical value of hardware-aware NAS. In a dermoscopic image classification project (8 skin lesion classes) on NVIDIA Jetson Nano, the constraints were: latency under 100ms per image, accuracy above 88%, model under 10MB. Standard architectures did not satisfy all constraints simultaneously.
import optuna
from optuna.samplers import NSGAIISampler
import torch
import torch.nn as nn
import time
# ============================================================
# CASE STUDY: NAS for dermoscopy on Jetson Nano
# ============================================================
class SEModule(nn.Module):
"""Squeeze-and-Excitation for channel attention."""
def __init__(self, channels, reduction=4):
super().__init__()
self.se = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
nn.Flatten(),
nn.Linear(channels, channels // reduction),
nn.ReLU(),
nn.Linear(channels // reduction, channels),
nn.Sigmoid()
)
def forward(self, x):
scale = self.se(x).view(x.size(0), -1, 1, 1)
return x * scale
class DermatologyNASModel(nn.Module):
"""Flexible model for dermoscopic classification."""
def __init__(self, n_stages, channels, expansion, use_se, n_classes=8):
super().__init__()
self.stem = nn.Sequential(
nn.Conv2d(3, channels[0], 3, stride=2, padding=1, bias=False),
nn.BatchNorm2d(channels[0]), nn.ReLU6()
)
stages = []
in_ch = channels[0]
for i in range(n_stages):
out_ch = channels[i]
exp = expansion[i]
mid_ch = in_ch * exp
stage = nn.Sequential(
nn.Conv2d(in_ch, mid_ch, 1, bias=False),
nn.BatchNorm2d(mid_ch), nn.ReLU6(),
nn.Conv2d(mid_ch, mid_ch, 3,
stride=2 if i < n_stages-1 else 1,
padding=1, groups=mid_ch, bias=False),
nn.BatchNorm2d(mid_ch), nn.ReLU6(),
SEModule(mid_ch) if use_se[i] else nn.Identity(),
nn.Conv2d(mid_ch, out_ch, 1, bias=False),
nn.BatchNorm2d(out_ch)
)
stages.append(stage)
in_ch = out_ch
self.stages = nn.Sequential(*stages)
self.pool = nn.AdaptiveAvgPool2d(1)
self.classifier = nn.Linear(channels[-1], n_classes)
def forward(self, x):
return self.classifier(self.pool(self.stages(self.stem(x))).flatten(1))
def jetson_nas_objective(trial: optuna.Trial):
"""Hardware-aware objective function for Jetson Nano."""
n_stages = trial.suggest_int("n_stages", 3, 5)
channels = [trial.suggest_categorical(f"ch_{i}", [16, 24, 32, 48, 64]) for i in range(n_stages)]
expansions = [trial.suggest_categorical(f"exp_{i}", [2, 4, 6]) for i in range(n_stages)]
use_se = [trial.suggest_categorical(f"se_{i}", [True, False]) for i in range(n_stages)]
model = DermatologyNASModel(n_stages, channels, expansions, use_se, n_classes=8)
n_params = sum(p.numel() for p in model.parameters())
model_size_mb = n_params * 4 / (1024 ** 2)
x = torch.randn(1, 3, 224, 224)
model.eval()
with torch.no_grad():
for _ in range(5): model(x)
t0 = time.perf_counter()
for _ in range(20): model(x)
latency_ms = (time.perf_counter() - t0) / 20 * 1000
    jetson_latency_ms = latency_ms * 3.5  # Empirical desktop-to-Jetson slowdown; calibrate on-device
if jetson_latency_ms > 150 or model_size_mb > 15:
raise optuna.exceptions.TrialPruned()
    val_accuracy = min(0.85 + 0.05 * (sum(channels) / (64 * n_stages)), 0.94)  # Simulated: replace with real training
return val_accuracy, -jetson_latency_ms
# Multi-objective search
study = optuna.create_study(
directions=["maximize", "maximize"],
sampler=NSGAIISampler(seed=42)
)
study.optimize(jetson_nas_objective, n_trials=80, timeout=7200)
# Select optimal architecture from Pareto front
best = [t for t in study.best_trials if t.values[0] > 0.88 and -t.values[1] < 100]
best.sort(key=lambda t: t.values[0], reverse=True)
if best:
print(f"\nBest architecture for Jetson Nano:")
print(f" Accuracy: {best[0].values[0]:.3f}")
print(f" Estimated latency: {-best[0].values[1]:.1f} ms")
print(f" Configuration: {best[0].params}")
Limitations and Pitfalls of NAS
- Overfitting to the search space: architectures found perform well on the benchmarks chosen for search but may generalize poorly to different datasets. Always evaluate on an independent holdout set not used during search.
- Hidden computational cost: DARTS requires 1-4 GPU-days but the full training of the found architecture adds more GPU-hours. Total cost is often 2-3x that of training a good manually designed architecture.
- DARTS instability: original DARTS suffers from training instability and tends to collapse toward skip connections. Use DARTS+ or R-DARTS for more stable results. Monitor alpha weight entropy to detect collapse early.
- Cross-dataset transfer: an optimal architecture on CIFAR-10 is not necessarily optimal on ImageNet or medical datasets. Perform the search on the final dataset.
- Unreliable proxy tasks: using a simpler proxy task (e.g., CIFAR instead of ImageNet) to reduce cost can lead to incorrect rankings. Always validate the found architecture on the real task.
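The entropy monitoring mentioned above is easy to implement: when the softmax over an edge's architecture weights collapses onto a single operation, its entropy drops toward zero. A minimal sketch (the `(n_edges, n_ops)` alpha layout is the standard DARTS convention; the threshold is illustrative):

```python
import torch
import torch.nn.functional as F

def alpha_entropy(alphas: torch.Tensor) -> float:
    """Mean entropy (in nats) of the softmaxed architecture weights.

    `alphas` has shape (n_edges, n_ops). Entropy near zero means each
    edge's softmax has collapsed onto one operation (often skip-connect).
    """
    probs = F.softmax(alphas, dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return ent.mean().item()

# Uniform logits give maximal entropy log(n_ops); a dominated edge gives ~0.
uniform = torch.zeros(4, 8)      # 4 edges, 8 candidate ops
collapsed = torch.zeros(4, 8)
collapsed[:, 0] = 50.0           # one op dominates after softmax
print(alpha_entropy(uniform))    # ~2.079 (= log 8)
print(alpha_entropy(collapsed))  # ~0.0
```

Logging this value every epoch and flagging a rapid drop (e.g. below half of log(n_ops)) gives an early-warning signal for DARTS collapse before search time is wasted.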
Architecture Comparison: NAS vs Manual on Standard Benchmarks
| Architecture | Method | ImageNet Top-1 | Parameters | FLOPs | Search Cost |
|---|---|---|---|---|---|
| ResNet-50 | Manual | 76.1% | 25.6M | 4.1G | N/A |
| MobileNetV3-Large | NAS + Manual | 75.2% | 5.4M | 0.22G | ~1000 GPU-h |
| EfficientNet-B0 | NAS (MnasNet) | 77.1% | 5.3M | 0.39G | ~6000 GPU-h |
| NASNet-A Mobile | RL-NAS | 74.0% | 5.3M | 0.56G | 400 GPU-days |
| DARTS (2nd order) | DARTS | 73.3% | 4.7M | 0.6G | 4 GPU-days |
| OFA-595M (RPi) | OFA One-Shot | 76.0% | ~4.5M | 0.6G | <1 GPU-h post OFA |
Best Practices for NAS in Production
When to Use NAS and How to Do It Well
- Use fine-tuning before NAS: often a pre-trained ViT-B or EfficientNet-B4 outperforms a NAS architecture found from scratch. Use NAS when the task has very specific requirements (fixed hardware target, domain very different from ImageNet, tight hardware constraints).
- Optuna Bayesian for hyperparameters: even without architecture search, Optuna TPE for LR, batch size, augmentation, and optimizer is often more effective than GridSearch and requires 3-5x fewer trials. This is the first step before full NAS.
- Hardware-aware from the start: include latency/FLOPs in the objective function from the first trial. A model 1% more accurate but 2x slower is useless for real-time deployment on edge devices.
- Aggressive early stopping: use Optuna's MedianPruner. Eliminate 30-40% of unpromising trials in the first epochs, reducing total cost by 2-3x.
- Parallelize across multiple GPUs: Optuna supports native parallelization via a shared database (SQLite or PostgreSQL). Four workers typically cut search time by ~3.5x with minimal code changes — each worker just opens the same study via the shared storage.
- Save architecture checkpoints: after search, save not just weights but also the architecture specification (genotype in DARTS, config dict in Optuna). This lets you reconstruct the model without redoing the search.
Conclusions
Neural Architecture Search has matured significantly from 2017 to today. From algorithms requiring 400 GPU-days to practical tools that run in hours on a single GPU, the field has made automatic architecture design accessible to practitioners. In 2026, the most effective workflow combines: a well-defined search space, Optuna with aggressive pruning for hyperparameters, and hardware-aware objectives for optimized deployment.
For most projects, using pre-existing architectures (ViT, Swin, EfficientNet) with fine-tuning remains more efficient than NAS from scratch. But when a task has very specific hardware requirements — latency under 5ms on Raspberry Pi, model under 1MB for microcontrollers, specialized medical classification — hardware-aware NAS becomes the indispensable tool.
The trend toward edge computing further amplifies the value of NAS: with Gartner predicting that adoption of small language models (SLMs) will surpass that of cloud LLMs threefold by 2027, optimizing architectures for specific hardware is no longer an academic luxury but a practical necessity.
Next Steps
- Next article: Knowledge Distillation: Compressing Complex Models
- Related: Vision Transformer: Architecture and Applications
- Related: LLM on Edge Devices: Raspberry Pi and Jetson
- MLOps series: Experiment Tracking with MLflow and Optuna
- AI Engineering series: Model Optimization for Production