Neural Architecture Search and AutoML: Automating Network Design
Designing a neural network architecture has traditionally been a manual process requiring years of expertise, intuition, and computational resources for experimentation. ResNet, EfficientNet, MobileNet — each of these iconic architectures is the result of deeply informed human design choices. Yet with Neural Architecture Search (NAS), this process can be automated: given a computational budget and a task, an algorithm explores the space of possible architectures and identifies an optimal one.
The practical result is striking. EfficientNet — arguably the most influential CNN family of recent years — was discovered via NAS. NASNet, DARTS, Once-for-All, and other NAS-designed models have matched or outperformed manually designed architectures at comparable or lower compute budgets on standard benchmarks. But the real revolution is democratization: with libraries like Optuna and Ray Tune, and weight-sharing algorithms like ENAS, it is now possible to run NAS on a consumer GPU in a few hours, optimizing the architecture for your specific hardware.
In this guide we explore NAS techniques from the ground up — from GridSearch to DARTS — and build a practical AutoML pipeline with Optuna for optimizing deep learning architectures.
What You'll Learn
- What NAS is and why it outperforms manual design
- Search spaces: micro (cell-based) vs macro (layer-based)
- Search strategies: Random Search, RL, Evolutionary, DARTS
- One-Shot NAS and Weight Sharing: reducing cost from years to hours
- NAS implementation with Optuna: hyperparameter + architecture search
- Differentiable Architecture Search (DARTS) with complete PyTorch
- Once-for-All Networks: architectures for heterogeneous hardware
- Hardware-Aware NAS: optimizing for latency, FLOPs, and parameters
- AutoML with AutoKeras and NAS for edge devices
- Real case study: NAS for medical classification on Jetson Nano
Why NAS Outperforms Manual Design
Manual architecture design suffers from three structural limitations. First, expertise bias: researchers tend to reuse familiar patterns (ResBlocks, skip connections) even when they are not optimal for the specific task. Second, hardware mismatch: an architecture optimal on an A100 is rarely optimal on a Cortex-A55. Third, combinatorial explosion: the space of possible architectures is astronomically large — even just varying number of layers, channels, and kernel sizes across 8 stages produces over 10^14 configurations.
NAS solves these problems by formally defining the design problem:
# NAS PROBLEM FORMALIZATION
#
# Input:
# - Search space A: set of possible architectures
# - Dataset D = (D_train, D_val)
# - Cost function c(a, D): measures quality of a on D
# - Computational budget B
#
# Output:
# - a* = argmin_{a in A} c(a, D_val) s.t. cost(search) <= B
#
# Cost typically includes:
# - Validation accuracy (minimize error)
# - Latency on target hardware
# - Number of parameters
# - Power consumption
#
# Example multi-objective cost function:
# c(a) = (1 - val_acc) + lambda * latency_ms / latency_target
import torch
import torch.nn as nn
from typing import Dict, Any, Optional
class NASObjective:
"""
Multi-objective cost function for NAS.
Combines accuracy, latency, and model size.
"""
def __init__(
self,
accuracy_weight: float = 1.0,
latency_weight: float = 0.1,
params_weight: float = 0.01,
latency_target_ms: float = 10.0,
params_target_M: float = 5.0
):
self.w_acc = accuracy_weight
self.w_lat = latency_weight
self.w_par = params_weight
self.lat_target = latency_target_ms
self.par_target = params_target_M * 1e6
def __call__(
self,
val_accuracy: float,
latency_ms: float,
n_params: int
) -> float:
"""
Computes composite cost. Lower = better.
val_accuracy: [0, 1] - we want to maximize
latency_ms: milliseconds - we want to minimize
n_params: parameter count - we want to minimize
"""
acc_cost = self.w_acc * (1.0 - val_accuracy)
lat_cost = self.w_lat * max(0, latency_ms / self.lat_target - 1.0)
par_cost = self.w_par * max(0, n_params / self.par_target - 1.0)
return acc_cost + lat_cost + par_cost
# Usage example:
obj = NASObjective(latency_target_ms=5.0, params_target_M=2.0)
cost = obj(val_accuracy=0.92, latency_ms=4.2, n_params=1_800_000)
print(f"Composite cost: {cost:.4f}") # ~0.08
The NAS Search Space
The heart of NAS is the definition of the search space: the set of all possible architectures the algorithm can explore. The choice of search space is fundamental — too narrow and the optimum is unreachable, too wide and the search becomes computationally intractable.
There are two main approaches:
- Cell-based NAS (micro search space): searches for the optimal structure of a single cell, then replicates it multiple times to build the network. This drastically reduces the search space while maintaining flexibility. Used by NASNet, DARTS, ENAS.
- Macro search space: searches for global architecture parameters such as number of layers, channel widths, connection types. Used by EfficientNet (NAS + scaling), MobileNet v3, Once-for-All.
# Example of a macro search space for a CNN
# Each dimension is a discrete or continuous choice
SEARCH_SPACE = {
# Global structure
"n_layers": [4, 6, 8, 10, 12], # Number of layers
"initial_channels": [16, 32, 48, 64], # Initial channels
"width_multiplier": [0.5, 0.75, 1.0, 1.25, 1.5], # Width multiplier
# Per layer:
"kernel_sizes": [3, 5, 7], # Kernel size
"expansion_ratios": [1, 2, 4, 6], # MBConv expansion ratio
"se_ratios": [0.0, 0.25, 0.5], # Squeeze-and-Excitation ratio
"skip_ops": ["identity", "conv", "pool"], # Skip connection type
# Attention configuration (for hybrid CNN+Transformer networks)
"use_attention": [False, True],
"attention_heads": [1, 2, 4, 8],
}
# Estimate search space size
n_configs = (
    len(SEARCH_SPACE["n_layers"]) *
    len(SEARCH_SPACE["initial_channels"]) *
    len(SEARCH_SPACE["width_multiplier"]) *
    len(SEARCH_SPACE["kernel_sizes"]) ** 8 *   # per-layer kernel choice, 8 layers
    len(SEARCH_SPACE["expansion_ratios"]) ** 8  # per-layer expansion choice
)
print(f"Possible configurations: {n_configs:.2e}")
# ~4.3e10 for this subset alone; adding the per-layer SE ratios, skip ops,
# and attention choices pushes past 10^14: exhaustive exploration is impossible!
# CELL-BASED SEARCH SPACE (much more compact)
# In NASNet/DARTS: search only the cell structure
CELL_SEARCH_SPACE = {
"n_nodes": [3, 4, 5], # Internal nodes per cell
"ops_per_edge": [ # Candidate operations per edge
"sep_conv_3x3",
"sep_conv_5x5",
"dil_conv_3x3",
"dil_conv_5x5",
"avg_pool_3x3",
"max_pool_3x3",
"skip_connect",
"none"
],
"n_cells": [6, 8, 10, 12, 14], # Number of cells in the network
"init_channels": [16, 24, 32, 36], # Initial channels
}
# Reduced space: the cell structure is searched once and replicated, so the
# space no longer grows with network depth — and with a differentiable
# relaxation (DARTS) it can be explored in a single training run.
Search Strategies
Search strategies determine how the algorithm navigates the architecture space. The choice of strategy is as important as the search space itself:
| Strategy | Approach | Pros | Cons | Typical Cost |
|---|---|---|---|---|
| Random Search | Random sampling | Simple, strong baseline | Inefficient | N * full training |
| Grid Search | Exhaustive grid | Complete for small spaces | Exponential in dimensions | K^D * training |
| Bayesian Opt. | Surrogate model + acquisition | Efficient, guided | Expensive for large spaces | 50-200 trials |
| RL (NASNet) | RNN controller | Complex architectures | 400 GPU-days originally | 1000+ trials |
| Evolutionary | Genetic algorithms | Good exploration | Very slow | 500+ trials |
| DARTS | Continuous relaxation | 1-4 GPU-days, near-SOTA | Memory intensive, can collapse to skips | 1 training cycle |
| One-Shot / ENAS | Weight sharing supernet | Hours on single GPU | Ranking approximation | 1 supernet + sampling |
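The table's "strong baseline" claim for Random Search is easy to verify in practice: uniform sampling from a well-designed space often lands within a point or two of sophisticated searches. A minimal sketch under that assumption — the `sample_architecture` and `random_search` helpers are illustrative names, not a library API:

```python
import random
from typing import Any, Callable, Dict

def sample_architecture(space: Dict[str, list]) -> Dict[str, Any]:
    """Uniformly samples one configuration from a macro search space."""
    config = {"n_layers": random.choice(space["n_layers"])}
    config["initial_channels"] = random.choice(space["initial_channels"])
    for i in range(config["n_layers"]):
        # Per-layer choices: this is where the combinatorial explosion lives
        config[f"kernel_{i}"] = random.choice(space["kernel_sizes"])
        config[f"expansion_{i}"] = random.choice(space["expansion_ratios"])
    return config

def random_search(space: Dict[str, list], evaluate: Callable,
                  n_trials: int = 50, seed: int = 0):
    """Keeps the best of n_trials uniformly sampled architectures."""
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_architecture(space)
        score = evaluate(cfg)  # e.g. val accuracy after a short training run
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With the SEARCH_SPACE defined earlier, `evaluate` would train each sampled config for a few epochs and return validation accuracy; its cost dominates, which is why Random Search scales as N full trainings.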
Practical NAS with Optuna
Optuna is one of the most widely used libraries for hyperparameter and architecture search. It combines Bayesian-style sampling (TPE), pruning of unpromising trials, and a clean API that integrates naturally with PyTorch.
# pip install optuna optuna-integration[pytorch] torch torchvision
import optuna
from optuna.trial import Trial
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
# ============================================================
# SEARCH SPACE: flexible CNN architecture
# ============================================================
class FlexibleCNN(nn.Module):
"""
CNN with parameterized architecture for NAS.
Supports 2-5 convolutional blocks with variable kernels and channels.
"""
def __init__(self, n_conv_layers: int, channels: list,
kernel_sizes: list, use_bn: bool,
dropout_rate: float, n_classes: int = 10):
super().__init__()
layers = []
in_channels = 3
for i in range(n_conv_layers):
out_channels = channels[i]
k = kernel_sizes[i]
layers.extend([
nn.Conv2d(in_channels, out_channels, k, padding=k//2),
nn.BatchNorm2d(out_channels) if use_bn else nn.Identity(),
nn.ReLU(inplace=True),
nn.MaxPool2d(2) if i < n_conv_layers - 1 else nn.AdaptiveAvgPool2d(4)
])
in_channels = out_channels
self.features = nn.Sequential(*layers)
final_size = channels[-1] * 16
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Dropout(dropout_rate),
nn.Linear(final_size, 512),
nn.ReLU(),
nn.Dropout(dropout_rate / 2),
nn.Linear(512, n_classes)
)
def forward(self, x):
return self.classifier(self.features(x))
# ============================================================
# OBJECTIVE FUNCTION for Optuna
# ============================================================
def objective(trial: Trial) -> float:
"""
Objective function: trains an architecture and returns val_accuracy.
Optuna will call this hundreds of times with different configurations.
"""
# === SEARCH SPACE ===
n_conv_layers = trial.suggest_int("n_conv_layers", 2, 5)
channels = [
trial.suggest_categorical(f"channels_{i}", [32, 64, 96, 128, 192, 256])
for i in range(n_conv_layers)
]
kernel_sizes = [
trial.suggest_categorical(f"kernel_{i}", [3, 5])
for i in range(n_conv_layers)
]
use_bn = trial.suggest_categorical("use_bn", [True, False])
dropout_rate = trial.suggest_float("dropout", 0.1, 0.5)
# Training hyperparameters
lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
weight_decay = trial.suggest_float("weight_decay", 1e-5, 1e-2, log=True)
# === DATA ===
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761))
])
transform_val = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761))
])
train_set = torchvision.datasets.CIFAR100(
root='./data', train=True, download=True, transform=transform_train
)
val_set = torchvision.datasets.CIFAR100(
root='./data', train=False, download=True, transform=transform_val
)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False, num_workers=4)
# === MODEL ===
device = "cuda" if torch.cuda.is_available() else "cpu"
model = FlexibleCNN(n_conv_layers, channels, kernel_sizes, use_bn,
dropout_rate, n_classes=100).to(device)
# Optimizer
if optimizer_name == "Adam":
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
elif optimizer_name == "AdamW":
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
else:
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
weight_decay=weight_decay)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15)
# === TRAINING (15 epochs for fast trial) ===
for epoch in range(15):
model.train()
for imgs, labels in train_loader:
imgs, labels = imgs.to(device), labels.to(device)
optimizer.zero_grad()
loss = criterion(model(imgs), labels)
loss.backward()
optimizer.step()
scheduler.step()
# Pruning: eliminate unpromising trials after first 5 epochs
if epoch >= 4:
model.eval()
correct = total = 0
with torch.no_grad():
for imgs, labels in val_loader:
imgs, labels = imgs.to(device), labels.to(device)
preds = model(imgs).argmax(1)
correct += (preds == labels).sum().item()
total += labels.size(0)
val_acc = correct / total
# Report to Optuna for pruning
trial.report(val_acc, epoch)
if trial.should_prune():
raise optuna.exceptions.TrialPruned()
# Final accuracy
model.eval()
correct = total = 0
with torch.no_grad():
for imgs, labels in val_loader:
imgs, labels = imgs.to(device), labels.to(device)
preds = model(imgs).argmax(1)
correct += (preds == labels).sum().item()
total += labels.size(0)
return correct / total # Optuna maximizes this value
# ============================================================
# LAUNCH OPTUNA STUDY
# ============================================================
study = optuna.create_study(
direction="maximize",
sampler=optuna.samplers.TPESampler(seed=42),
pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=5)
)
# Run 100 trials (parallelizable with n_jobs)
study.optimize(objective, n_trials=100, timeout=3600)
# === RESULTS ===
best_trial = study.best_trial
print(f"Best accuracy: {best_trial.value:.4f}")
print(f"Best configuration:")
for key, val in best_trial.params.items():
print(f" {key}: {val}")
# Visualize parameter importance
fig = optuna.visualization.plot_param_importances(study)
fig.show() # Requires plotly
DARTS: Differentiable Architecture Search
DARTS (Liu et al., 2019) is one of the most elegant and effective NAS algorithms. The key idea: make the operation choice continuous and differentiable, allowing architecture optimization via gradient descent rather than discrete search.
In a DARTS cell, each edge between nodes carries a soft mixture of all candidate operations (conv 3x3, conv 5x5, skip, pool). The mixing weights — the architecture parameters alpha — are optimized by gradient descent alongside the model weights. At the end, the highest-weighted operation is selected for each edge (discretization).
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List
# ============================================================
# PRIMITIVE OPERATIONS for the DARTS cell
# ============================================================
OPS = {
'none': lambda C, stride: Zero(stride),
'skip_connect': lambda C, stride: nn.Identity() if stride == 1 else FactorizedReduce(C),
'sep_conv_3x3': lambda C, stride: SepConv(C, C, 3, stride, 1),
'sep_conv_5x5': lambda C, stride: SepConv(C, C, 5, stride, 2),
'dil_conv_3x3': lambda C, stride: DilConv(C, C, 3, stride, 2, 2),
'avg_pool_3x3': lambda C, stride: nn.AvgPool2d(3, stride, 1, count_include_pad=False),
'max_pool_3x3': lambda C, stride: nn.MaxPool2d(3, stride, 1),
}
PRIMITIVES = list(OPS.keys())
class SepConv(nn.Module):
    """Depthwise separable convolution, applied twice (as in the DARTS paper)."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
            nn.ReLU(),
            nn.Conv2d(C_out, C_out, kernel_size, 1, padding, groups=C_out, bias=False),
            nn.Conv2d(C_out, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class DilConv(nn.Module):
    """Dilated depthwise separable convolution."""
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding,
                      dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    def forward(self, x):
        return self.op(x)
class FactorizedReduce(nn.Module):
    """Halves spatial resolution via two offset stride-2 1x1 convs."""
    def __init__(self, C_in, C_out=None):
        super().__init__()
        C_out = C_out or C_in
        self.relu = nn.ReLU()
        self.conv1 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
        self.conv2 = nn.Conv2d(C_in, C_out - C_out // 2, 1, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(C_out)
    def forward(self, x):
        x = self.relu(x)
        return self.bn(torch.cat([self.conv1(x), self.conv2(x[:, :, 1:, 1:])], dim=1))
class Zero(nn.Module):
    """The 'none' operation: outputs zeros (with optional spatial reduction)."""
    def __init__(self, stride):
        super().__init__()
        self.stride = stride
    def forward(self, x):
        if self.stride == 1:
            return x.mul(0.)
        return x[:, :, ::self.stride, ::self.stride].mul(0.)
# ============================================================
# MIXED OPERATION: softmax over all operations
# ============================================================
class MixedOp(nn.Module):
"""
Soft mixture of K operations with weights alpha (architecture parameters).
output = sum_k(softmax(alpha)[k] * op_k(input))
"""
def __init__(self, C: int, stride: int):
super().__init__()
self._ops = nn.ModuleList()
for prim in PRIMITIVES:
op = OPS[prim](C, stride)
if 'pool' in prim:
op = nn.Sequential(op, nn.BatchNorm2d(C))
self._ops.append(op)
def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
"""weights: softmax(alpha) for this edge."""
return sum(w * op(x) for w, op in zip(weights, self._ops))
# ============================================================
# DARTS CELL (Normal Cell)
# ============================================================
class DARTSCell(nn.Module):
    """
    DARTS cell with N_NODES intermediate nodes.
    Each node is the sum of all previous inputs weighted by alpha.
    The cell output concatenates the intermediate nodes: N_NODES * C channels.
    """
    N_NODES = 4
    def __init__(self, C_prev_prev: int, C_prev: int, C: int,
                 reduction: bool = False, reduction_prev: bool = False):
        super().__init__()
        self.reduction = reduction
        stride = 2 if reduction else 1
        # If the previous cell reduced resolution, s0 must be downsampled to match s1
        if reduction_prev:
            self.preprocess0 = FactorizedReduce(C_prev_prev, C)
        else:
            self.preprocess0 = nn.Sequential(
                nn.ReLU(), nn.Conv2d(C_prev_prev, C, 1, bias=False), nn.BatchNorm2d(C)
            )
        self.preprocess1 = nn.Sequential(
            nn.ReLU(), nn.Conv2d(C_prev, C, 1, bias=False), nn.BatchNorm2d(C)
        )
        self._ops = nn.ModuleList()
        for i in range(self.N_NODES):
            for j in range(2 + i):
                s = stride if j < 2 else 1
                self._ops.append(MixedOp(C, s))
    def forward(self, s0: torch.Tensor, s1: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
        s0 = self.preprocess0(s0)
        s1 = self.preprocess1(s1)
        states = [s0, s1]
        offset = 0
        for i in range(self.N_NODES):
            s = sum(
                self._ops[offset + j](h, weights[offset + j])
                for j, h in enumerate(states)
            )
            offset += len(states)
            states.append(s)
        return torch.cat(states[2:], dim=1)
# ============================================================
# DARTS NETWORK with alpha parameters
# ============================================================
class DARTSNetwork(nn.Module):
    def __init__(self, C: int = 16, n_classes: int = 10,
                 n_layers: int = 8, n_nodes: int = 4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.cells = nn.ModuleList()
        # Track channel counts: each cell outputs N_NODES * C_curr channels,
        # and its two inputs come from the two preceding cells
        C_prev_prev, C_prev, C_curr = C, C, C
        reduction_prev = False
        for i in range(n_layers):
            reduction = i in [n_layers // 3, 2 * n_layers // 3]
            if reduction:
                C_curr *= 2
            cell = DARTSCell(C_prev_prev, C_prev, C_curr, reduction, reduction_prev)
            self.cells.append(cell)
            reduction_prev = reduction
            C_prev_prev, C_prev = C_prev, C_curr * DARTSCell.N_NODES
        n_ops = len(PRIMITIVES)
        n_edges = sum(2 + i for i in range(n_nodes))  # 14 edges for 4 nodes
        # Alpha: architecture parameters (learnable!). The original DARTS shares
        # one alpha set across all normal cells and one across all reduction
        # cells; here each cell gets its own set for simplicity.
        self._arch_parameters = nn.ParameterList([
            nn.Parameter(1e-3 * torch.randn(n_edges, n_ops))
            for _ in range(n_layers)
        ])
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(C_prev, n_classes)
def arch_parameters(self):
return list(self._arch_parameters)
def model_parameters(self):
ids = set(id(p) for p in self.arch_parameters())
return [p for p in self.parameters() if id(p) not in ids]
def forward(self, x: torch.Tensor) -> torch.Tensor:
s0 = s1 = self.stem(x)
for i, cell in enumerate(self.cells):
weights = F.softmax(self._arch_parameters[i], dim=-1)
s0, s1 = s1, cell(s0, s1, weights)
out = self.global_pool(s1)
return self.classifier(out.view(out.size(0), -1))
def genotype(self):
"""Extracts the discrete architecture (argmax of alphas)."""
result = []
for alpha in self._arch_parameters:
ops = F.softmax(alpha, dim=-1).argmax(dim=-1)
result.append([PRIMITIVES[op.item()] for op in ops])
return result
# ============================================================
# DARTS TRAINING LOOP: Bi-level optimization
# ============================================================
def train_darts(model: DARTSNetwork, train_loader, val_loader,
n_epochs: int = 50, arch_lr: float = 3e-4, model_lr: float = 3e-3):
"""
DARTS uses bi-level optimization:
- Step 1: optimize model weights on train set
- Step 2: optimize alpha (architecture) on val set
"""
device = next(model.parameters()).device
optimizer_model = torch.optim.SGD(
model.model_parameters(), lr=model_lr,
momentum=0.9, weight_decay=3e-4
)
optimizer_arch = torch.optim.Adam(
model.arch_parameters(), lr=arch_lr,
betas=(0.5, 0.999), weight_decay=1e-3
)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer_model, T_max=n_epochs
)
val_iter = iter(val_loader)
for epoch in range(n_epochs):
model.train()
total_loss = 0.0
for imgs_train, labels_train in train_loader:
imgs_train = imgs_train.to(device)
labels_train = labels_train.to(device)
# Step 1: Update alpha on validation
try:
imgs_val, labels_val = next(val_iter)
except StopIteration:
val_iter = iter(val_loader)
imgs_val, labels_val = next(val_iter)
imgs_val = imgs_val.to(device)
labels_val = labels_val.to(device)
optimizer_arch.zero_grad()
loss_arch = criterion(model(imgs_val), labels_val)
loss_arch.backward()
optimizer_arch.step()
# Step 2: Update model weights on training
optimizer_model.zero_grad()
loss_model = criterion(model(imgs_train), labels_train)
loss_model.backward()
nn.utils.clip_grad_norm_(model.model_parameters(), 5.0)
optimizer_model.step()
total_loss += loss_model.item()
scheduler.step()
if epoch % 10 == 0:
print(f"Epoch {epoch}/{n_epochs} | Loss: {total_loss/len(train_loader):.4f}")
print(f"Genotype: {model.genotype()[:2]}...")
return model.genotype()
DARTS vs One-Shot NAS: Comparison
| Aspect | DARTS | One-Shot (ENAS, OFA) |
|---|---|---|
| Search cost | 1-4 GPU-days | Hours (post supernet training) |
| Architecture quality | Very high | High (slight approximation) |
| Hardware target | Single target | Multi-target (OFA) |
| GPU memory | High (bi-level opt) | Medium |
| Implementation | Complex | Moderate |
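The weight-sharing idea behind the One-Shot column can be sketched in a few lines: build a supernet containing every candidate op, and for each batch run a single randomly sampled op per layer. All sampled paths update shared weights, so after training, candidate architectures can be ranked by evaluating paths without retraining. A minimal single-path sketch (not the full ENAS controller; op list and sizes are illustrative):

```python
import random
import torch
import torch.nn as nn

class OneShotLayer(nn.Module):
    """Holds all candidate ops; the forward pass runs only the chosen one."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate 0
            nn.Conv2d(channels, channels, 5, padding=2),  # candidate 1
            nn.Identity(),                                # candidate 2 (skip)
        ])
    def forward(self, x: torch.Tensor, choice: int) -> torch.Tensor:
        return self.ops[choice](x)

class OneShotSupernet(nn.Module):
    def __init__(self, channels: int = 16, n_layers: int = 4, n_classes: int = 10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.layers = nn.ModuleList(OneShotLayer(channels) for _ in range(n_layers))
        self.head = nn.Linear(channels, n_classes)
    def forward(self, x: torch.Tensor, choices) -> torch.Tensor:
        x = self.stem(x)
        for layer, c in zip(self.layers, choices):
            x = torch.relu(layer(x, c))
        return self.head(x.mean(dim=(2, 3)))  # global average pool + classify

# One training step trains only the sampled path, but its weights are
# shared with every other architecture that reuses those ops.
net = OneShotSupernet()
choices = [random.randrange(3) for _ in net.layers]
logits = net(torch.randn(2, 3, 32, 32), choices)
```

A loss computed on `logits` backpropagates only through the sampled ops; sampling a fresh path per batch is what amortizes training across the whole space.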
Hardware-Aware NAS with Optuna and Latency Constraints
Theoretical NAS maximizes accuracy. Practical NAS optimizes a multi-objective trade-off between accuracy, latency, FLOPs, and model size. Optuna natively supports multi-objective search with the NSGA-II algorithm.
import optuna
from optuna.samplers import NSGAIISampler
import torch
import time
def estimate_latency_ms(model: nn.Module, input_shape=(1, 3, 224, 224),
n_runs: int = 50, device: str = "cpu") -> float:
"""Measures average latency in milliseconds."""
model = model.to(device).eval()
x = torch.randn(*input_shape).to(device)
with torch.no_grad():
for _ in range(10):
model(x)
t0 = time.perf_counter()
with torch.no_grad():
for _ in range(n_runs):
model(x)
elapsed = (time.perf_counter() - t0) / n_runs * 1000
return elapsed
def count_flops(model: nn.Module, input_shape=(1, 3, 224, 224)) -> int:
"""Estimates FLOPs (use fvcore or ptflops in production)."""
total_flops = 0
x = torch.randn(*input_shape)
def hook(module, input, output):
nonlocal total_flops
if isinstance(module, nn.Conv2d):
B, C_out, H_out, W_out = output.shape
kernel_ops = module.kernel_size[0] * module.kernel_size[1] * module.in_channels
total_flops += 2 * B * C_out * H_out * W_out * kernel_ops
elif isinstance(module, nn.Linear):
B = input[0].shape[0]
total_flops += 2 * B * module.in_features * module.out_features
hooks = []
for m in model.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
hooks.append(m.register_forward_hook(hook))
with torch.no_grad():
model(x)
for h in hooks:
h.remove()
return total_flops
# Multi-objective NAS: maximize accuracy, minimize latency
def multi_objective(trial: optuna.Trial):
    n_channels = trial.suggest_categorical("channels", [32, 64, 128])
    n_layers = trial.suggest_int("layers", 2, 6)
    kernel = trial.suggest_categorical("kernel", [3, 5])
    model = FlexibleCNN(
        n_conv_layers=n_layers,
        channels=[n_channels] * n_layers,
        kernel_sizes=[kernel] * n_layers,
        use_bn=True,
        dropout_rate=0.2,
        n_classes=10
    )
    val_accuracy = 0.80 + 0.05 * (n_channels / 128)  # Simulated: replace with real training
    latency = estimate_latency_ms(model, input_shape=(1, 3, 32, 32))
    flops = count_flops(model, input_shape=(1, 3, 32, 32))  # Could become a third objective
    return val_accuracy, -latency  # Optuna maximizes both, so negate latency
# Multi-objective study with NSGA-II (Pareto-optimal evolutionary algorithm)
study_mo = optuna.create_study(
directions=["maximize", "maximize"],
sampler=NSGAIISampler(seed=42)
)
study_mo.optimize(multi_objective, n_trials=100)
# Pareto front: optimal architectures on the accuracy/latency trade-off
pareto_trials = study_mo.best_trials
print(f"Pareto-optimal architectures: {len(pareto_trials)}")
for t in pareto_trials[:5]:
acc, neg_lat = t.values
print(f" Acc: {acc:.3f}, Latency: {-neg_lat:.1f} ms | {t.params}")
Once-for-All: NAS for Heterogeneous Hardware
Once-for-All (OFA) from MIT solves a fundamental practical problem: training a separate network for each target device is prohibitive. OFA trains a single supernet that supports thousands of sub-architectures, then uses a fast evolutionary search to find the optimal sub-architecture for each device.
OFA training uses progressive shrinking: the supernet is trained starting from the maximum configuration, then progressively reducing dimensions (first kernel sizes, then depth, finally width). This creates shared weights that work well across all configurations.
# Using OFA via the official library
# pip install ofa
from ofa.model_zoo import ofa_net
import torch
import time
# Load pre-trained OFA network (OFA-MobileNetV3)
ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)
def evaluate_subnet(subnet_config):
"""Evaluates a sub-architecture of the OFA network."""
ofa_network.set_active_subnet(
ks=subnet_config['ks'], # kernel sizes
e=subnet_config['e'], # expansion ratios
d=subnet_config['d'] # depths
)
subnet = ofa_network.get_active_subnet(preserve_weight=True)
return subnet
# Example sub-network for a low-latency mobile target (illustrative config,
# in the spirit of the OFA paper's iPhone XS specialization, ~7.5ms target)
iphone_config = {
'ks': [3, 3, 5, 3, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5, 3, 5, 5],
'e': [3, 3, 6, 3, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6, 3, 6, 6],
'd': [2, 3, 3, 3, 3]
}
iphone_subnet = evaluate_subnet(iphone_config)
print(f"Subnet for iPhone XS: {sum(p.numel() for p in iphone_subnet.parameters()):,} params")
# Example configuration for a Raspberry Pi 4 class target (~50ms budget, illustrative)
rpi_config = {
'ks': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
'e': [3, 3, 3, 3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4],
'd': [2, 2, 2, 2, 2]
}
rpi_subnet = evaluate_subnet(rpi_config)
# Benchmark latency on CPU (proxy for RPi4)
rpi_subnet.eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
for _ in range(10): rpi_subnet(x) # warmup
t0 = time.perf_counter()
for _ in range(50): rpi_subnet(x)
lat_ms = (time.perf_counter() - t0) / 50 * 1000
print(f"RPi4 subnet: {lat_ms:.1f}ms")
print(f"Parameters: {sum(p.numel() for p in rpi_subnet.parameters()):,}")
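OFA's second stage — the fast evolutionary search over sub-architectures — can be sketched generically: mutate configurations, score them with a cheap fitness function, keep the best, repeat. In the real pipeline the fitness combines an accuracy predictor with latency lookup tables; the encoding and toy fitness below are illustrative, not the ofa library's API:

```python
import random
from typing import Callable, Dict, List

def mutate(cfg: Dict[str, List[int]], choices: Dict[str, List[int]],
           p: float = 0.1) -> Dict[str, List[int]]:
    """Resamples each gene with probability p."""
    child = {k: list(v) for k, v in cfg.items()}
    for key, options in choices.items():
        for i in range(len(child[key])):
            if random.random() < p:
                child[key][i] = random.choice(options)
    return child

def evolutionary_search(init_cfg, choices, fitness: Callable,
                        n_generations: int = 20, pop_size: int = 16,
                        n_parents: int = 4):
    """Keeps the top n_parents each generation; fills the rest with mutants."""
    population = [mutate(init_cfg, choices, p=0.5) for _ in range(pop_size)]
    for _ in range(n_generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:n_parents]
        population = parents + [mutate(random.choice(parents), choices)
                                for _ in range(pop_size - n_parents)]
    return max(population, key=fitness)

# Toy fitness: prefer larger kernels, penalize expansion (stand-in for latency)
choices = {"ks": [3, 5, 7], "e": [3, 4, 6]}
init = {"ks": [3] * 5, "e": [3] * 5}
fit = lambda c: sum(c["ks"]) - 0.5 * sum(c["e"])
best = evolutionary_search(init, choices, fit, n_generations=30)
```

Because each evaluation is a table lookup rather than a training run, thousands of candidates can be scored in minutes, which is what makes per-device specialization cheap after the supernet is trained.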
End-to-End AutoML with AutoKeras
For those who don't have time to implement NAS from scratch, AutoKeras offers a very high-level API that automatically handles architecture search, preprocessing, and hyperparameter tuning. Internally it uses Keras Tuner with Bayesian algorithms and random search, and integrates with TensorFlow for deployment.
# pip install autokeras tensorflow
import autokeras as ak
import numpy as np
import tensorflow as tf
# ============================================================
# IMAGE CLASSIFICATION with AutoKeras
# ============================================================
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0
# Create searcher with complexity constraints
clf = ak.ImageClassifier(
max_trials=30,
overwrite=True,
project_name='nas_cifar10',
seed=42
)
# Launch NAS (search + training)
clf.fit(
x_train, y_train,
epochs=20,
validation_split=0.15,
callbacks=[
tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
]
)
# Evaluation
loss, acc = clf.evaluate(x_test, y_test)
print(f"Test accuracy: {acc:.4f}")
best_model = clf.export_model()
best_model.summary()
total_params = best_model.count_params()
print(f"Total parameters: {total_params:,}")
print("\nArchitecture found by AutoKeras:")
for i, layer in enumerate(best_model.layers):
print(f" Layer {i}: {type(layer).__name__}")
# Export for deployment
best_model.save('best_nas_model.h5')
# Convert to TFLite for edge deployment
converter = tf.lite.TFLiteConverter.from_keras_model(best_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Dynamic-range quantization
# (full INT8 additionally requires a representative_dataset for calibration)
tflite_model = converter.convert()
with open('best_nas_model_quant.tflite', 'wb') as f:
    f.write(tflite_model)
print(f"TFLite model: {len(tflite_model)/1024:.1f} KB")
Case Study: NAS for Medical Classification on Jetson Nano
A real case clarifies the practical value of hardware-aware NAS. In a dermoscopic image classification project (8 skin lesion classes) on NVIDIA Jetson Nano, the constraints were: latency under 100ms per image, accuracy above 88%, model under 10MB. Standard architectures did not satisfy all constraints simultaneously.
import optuna
from optuna.samplers import NSGAIISampler
import torch
import torch.nn as nn
import time
# ============================================================
# CASE STUDY: NAS for dermoscopy on Jetson Nano
# ============================================================
class SEModule(nn.Module):
"""Squeeze-and-Excitation for channel attention."""
def __init__(self, channels, reduction=4):
super().__init__()
self.se = nn.Sequential(
nn.AdaptiveAvgPool2d(1),
nn.Flatten(),
nn.Linear(channels, channels // reduction),
nn.ReLU(),
nn.Linear(channels // reduction, channels),
nn.Sigmoid()
)
def forward(self, x):
scale = self.se(x).view(x.size(0), -1, 1, 1)
return x * scale
class DermatologyNASModel(nn.Module):
"""Flexible model for dermoscopic classification."""
def __init__(self, n_stages, channels, expansion, use_se, n_classes=8):
super().__init__()
self.stem = nn.Sequential(
nn.Conv2d(3, channels[0], 3, stride=2, padding=1, bias=False),
nn.BatchNorm2d(channels[0]), nn.ReLU6()
)
stages = []
in_ch = channels[0]
for i in range(n_stages):
out_ch = channels[i]
exp = expansion[i]
mid_ch = in_ch * exp
stage = nn.Sequential(
nn.Conv2d(in_ch, mid_ch, 1, bias=False),
nn.BatchNorm2d(mid_ch), nn.ReLU6(),
nn.Conv2d(mid_ch, mid_ch, 3,
stride=2 if i < n_stages-1 else 1,
padding=1, groups=mid_ch, bias=False),
nn.BatchNorm2d(mid_ch), nn.ReLU6(),
SEModule(mid_ch) if use_se[i] else nn.Identity(),
nn.Conv2d(mid_ch, out_ch, 1, bias=False),
nn.BatchNorm2d(out_ch)
)
stages.append(stage)
in_ch = out_ch
self.stages = nn.Sequential(*stages)
self.pool = nn.AdaptiveAvgPool2d(1)
self.classifier = nn.Linear(channels[-1], n_classes)
def forward(self, x):
return self.classifier(self.pool(self.stages(self.stem(x))).flatten(1))
def jetson_nas_objective(trial: optuna.Trial):
"""Hardware-aware objective function for Jetson Nano."""
n_stages = trial.suggest_int("n_stages", 3, 5)
channels = [trial.suggest_categorical(f"ch_{i}", [16, 24, 32, 48, 64]) for i in range(n_stages)]
expansions = [trial.suggest_categorical(f"exp_{i}", [2, 4, 6]) for i in range(n_stages)]
use_se = [trial.suggest_categorical(f"se_{i}", [True, False]) for i in range(n_stages)]
model = DermatologyNASModel(n_stages, channels, expansions, use_se, n_classes=8)
n_params = sum(p.numel() for p in model.parameters())
model_size_mb = n_params * 4 / (1024 ** 2)
x = torch.randn(1, 3, 224, 224)
model.eval()
with torch.no_grad():
for _ in range(5): model(x)
t0 = time.perf_counter()
for _ in range(20): model(x)
latency_ms = (time.perf_counter() - t0) / 20 * 1000
    jetson_latency_ms = latency_ms * 3.5  # Empirical desktop-to-Jetson slowdown; calibrate on-device
if jetson_latency_ms > 150 or model_size_mb > 15:
raise optuna.exceptions.TrialPruned()
    val_accuracy = min(0.85 + 0.05 * (sum(channels) / (64 * n_stages)), 0.94)  # Simulated: replace with real training
return val_accuracy, -jetson_latency_ms
# Multi-objective search
study = optuna.create_study(
directions=["maximize", "maximize"],
sampler=NSGAIISampler(seed=42)
)
study.optimize(jetson_nas_objective, n_trials=80, timeout=7200)
# Select optimal architecture from Pareto front
best = [t for t in study.best_trials if t.values[0] > 0.88 and -t.values[1] < 100]
best.sort(key=lambda t: t.values[0], reverse=True)
if best:
print(f"\nBest architecture for Jetson Nano:")
print(f" Accuracy: {best[0].values[0]:.3f}")
print(f" Estimated latency: {-best[0].values[1]:.1f} ms")
print(f" Configuration: {best[0].params}")
Limitations and Pitfalls of NAS
- Overfitting to the search space: architectures found perform well on the benchmarks chosen for search but may generalize poorly to different datasets. Always evaluate on an independent holdout set not used during search.
- Hidden computational cost: DARTS requires 1-4 GPU-days but the full training of the found architecture adds more GPU-hours. Total cost is often 2-3x that of training a good manually designed architecture.
- DARTS instability: original DARTS suffers from training instability and tends to collapse toward skip connections. Use DARTS+ or R-DARTS for more stable results. Monitor alpha weight entropy to detect collapse early.
- Cross-dataset transfer: an optimal architecture on CIFAR-10 is not necessarily optimal on ImageNet or medical datasets. Perform the search on the final dataset.
- Unreliable proxy tasks: using a simpler proxy task (e.g., CIFAR instead of ImageNet) to reduce cost can lead to incorrect rankings. Always validate the found architecture on the real task.
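The entropy monitoring mentioned above is easy to implement: when the softmax over an edge's architecture weights collapses onto a single operation, its entropy drops toward zero. A minimal sketch (the `(n_edges, n_ops)` alpha layout is the standard DARTS convention; the threshold is illustrative):

```python
import torch
import torch.nn.functional as F

def alpha_entropy(alphas: torch.Tensor) -> float:
    """Mean entropy (in nats) of the softmaxed architecture weights.

    `alphas` has shape (n_edges, n_ops). Entropy near zero means each
    edge's softmax has collapsed onto one operation (often skip-connect).
    """
    probs = F.softmax(alphas, dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return ent.mean().item()

# Uniform logits give maximal entropy log(n_ops); a dominated edge gives ~0.
uniform = torch.zeros(4, 8)      # 4 edges, 8 candidate ops
collapsed = torch.zeros(4, 8)
collapsed[:, 0] = 50.0           # one op dominates after softmax
print(alpha_entropy(uniform))    # ~2.079 (= log 8)
print(alpha_entropy(collapsed))  # ~0.0
```

Logging this value every epoch and flagging a rapid drop (e.g. below half of log(n_ops)) gives an early-warning signal for DARTS collapse before search time is wasted.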
Architecture Comparison: NAS vs Manual on Standard Benchmarks
| Architecture | Method | ImageNet Top-1 | Parameters | FLOPs | Search Cost |
|---|---|---|---|---|---|
| ResNet-50 | Manual | 76.1% | 25.6M | 4.1G | N/A |
| MobileNetV3-Large | NAS + Manual | 75.2% | 5.4M | 0.22G | ~1000 GPU-h |
| EfficientNet-B0 | NAS (MnasNet) | 77.1% | 5.3M | 0.39G | ~6000 GPU-h |
| NASNet-A Mobile | RL-NAS | 74.0% | 5.3M | 0.56G | 400 GPU-days |
| DARTS (2nd order) | DARTS | 73.3% | 4.7M | 0.6G | 4 GPU-days |
| OFA-595M (RPi) | OFA One-Shot | 76.0% | ~4.5M | 0.6G | <1 GPU-h post OFA |
Best Practices for NAS in Production
When to Use NAS and How to Do It Well
- Use fine-tuning before NAS: often a pre-trained ViT-B or EfficientNet-B4 outperforms a NAS architecture found from scratch. Use NAS when the task has very specific requirements (fixed hardware target, domain very different from ImageNet, tight hardware constraints).
- Optuna Bayesian for hyperparameters: even without architecture search, Optuna TPE for LR, batch size, augmentation, and optimizer is often more effective than GridSearch and requires 3-5x fewer trials. This is the first step before full NAS.
- Hardware-aware from the start: include latency/FLOPs in the objective function from the first trial. A model 1% more accurate but 2x slower is useless for real-time deployment on edge devices.
- Aggressive early stopping: use Optuna's MedianPruner. Eliminate 30-40% of unpromising trials in the first epochs, reducing total cost by 2-3x.
- Parallelize across multiple GPUs: Optuna supports native parallelization via a shared database (SQLite or PostgreSQL). Four workers typically cut search time by ~3.5x with minimal code changes — each worker just opens the same study via the shared storage.
- Save architecture checkpoints: after search, save not just weights but also the architecture specification (genotype in DARTS, config dict in Optuna). This lets you reconstruct the model without redoing the search.
Conclusions
Neural Architecture Search has matured significantly from 2017 to today. From algorithms requiring 400 GPU-days to practical tools that run in hours on a single GPU, the field has made automatic architecture design accessible to practitioners. In 2026, the most effective workflow combines: a well-defined search space, Optuna with aggressive pruning for hyperparameters, and hardware-aware objectives for optimized deployment.
For most projects, using pre-existing architectures (ViT, Swin, EfficientNet) with fine-tuning remains more efficient than NAS from scratch. But when a task has very specific hardware requirements — latency under 5ms on Raspberry Pi, model under 1MB for microcontrollers, specialized medical classification — hardware-aware NAS becomes the indispensable tool.
The trend toward edge computing further amplifies the value of NAS: with Gartner predicting that adoption of small language models (SLMs) will surpass that of cloud LLMs threefold by 2027, optimizing architectures for specific hardware is no longer an academic luxury but a practical necessity.
Next Steps
- Next article: Knowledge Distillation: Compressing Complex Models
- Related: Vision Transformer: Architecture and Applications
- Related: LLM on Edge Devices: Raspberry Pi and Jetson
- MLOps series: Experiment Tracking with MLflow and Optuna
- AI Engineering series: Model Optimization for Production