Introduction: Deep Learning at the Edge
TinyML is the field that brings deep learning to devices with extremely limited resources: microcontrollers with a few kilobytes of RAM, smartphones, IoT sensors, and embedded devices. Instead of sending data to the cloud for inference, the model runs directly on the device, offering minimal latency (no network round-trip), stronger privacy, and offline operation.
The challenge is compressing models that normally require gigabytes of memory into a few megabytes or even kilobytes while maintaining acceptable accuracy. Techniques like quantization, pruning, and knowledge distillation make this possible. In this article we will explore compression techniques, deployment frameworks, and real-world use cases.
What You Will Learn
- Embedded device constraints: memory, computation, energy
- Quantization: from float32 to int8 and beyond
- Pruning: removing non-essential weights
- Knowledge distillation: transferring knowledge from large to small models
- TensorFlow Lite and ONNX: edge deployment frameworks
- Privacy and on-device inference
- Use cases: gesture recognition, anomaly detection, keyword spotting
Embedded Device Constraints
Embedded devices operate under severe constraints that make running standard deep learning models impossible:
- Memory (RAM): from 256 KB (microcontrollers) to 4-8 GB (smartphones). A ResNet-50 model requires ~100 MB, a GPT-2 small ~500 MB
- Compute (CPU/NPU): from tens of MHz (microcontrollers) to a few GHz (mobile). No dedicated GPU in most cases
- Energy: from milliwatts (battery sensors) to a few watts (smartphones). Inference must be energy-efficient
- Storage: from a few MB (flash) to gigabytes (mobile). The model must fit in available storage
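To make these constraints concrete, a back-of-the-envelope calculation of a model's weight storage (parameters times bytes per parameter) already shows why ResNet-50 cannot fit on a microcontroller. A minimal sketch; the parameter count is the commonly cited figure of roughly 25.6M, and activations and runtime buffers are excluded:

```python
# Rough model weight footprint: parameters x bits per parameter / 8.
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for weights alone (activations excluded)."""
    return num_params * bits_per_param / 8 / 1e6

resnet50_params = 25_600_000  # ~25.6M parameters
print(f"ResNet-50 fp32: {model_size_mb(resnet50_params, 32):.0f} MB")  # ~102 MB
print(f"ResNet-50 int8: {model_size_mb(resnet50_params, 8):.0f} MB")   # ~26 MB
print(f"Microcontroller RAM budget: {256 / 1024:.2f} MB")              # 0.25 MB
```

Even at int8, a full ResNet-50 is two orders of magnitude too large for a 256 KB microcontroller, which is why TinyML relies on purpose-built small architectures in addition to compression.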
Why Edge AI Matters
On-device inference offers crucial advantages: minimal latency (no cloud round-trip), privacy (data never leaves the device), reliability (works without an internet connection), cost (no cloud inference fees), and scalability (billions of devices can perform inference simultaneously).
Quantization: Reducing Numerical Precision
Quantization converts model weights and activations from float32 (32 bits) to representations with fewer bits: int8 (8 bits), int4 (4 bits), or even binary (1 bit). This reduces model size and speeds up inference, since integer operations are more efficient than floating-point operations.
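The float-to-int8 mapping is typically an affine transform: a real value x is approximated as (q - zero_point) * scale, where q is an 8-bit integer. A minimal NumPy sketch of asymmetric quantization, independent of any framework; the helper names are illustrative:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric affine quantization: map [x.min, x.max] onto [-128, 127]."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize_int8(x)
x_hat = dequantize(q, s, z)
print(np.max(np.abs(x - x_hat)))  # rounding error, at most about scale/2
```

The `scale` and `zero_point` pair is exactly what calibration determines for each tensor: the observed min/max range decides how the 256 integer levels are spent.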
Post-Training Quantization (PTQ)
The simplest quantization: take an already trained model and convert the weights. No retraining required, but may result in slight accuracy loss.
Quantization-Aware Training (QAT)
The model is trained while simulating quantization during the forward pass. This allows the network to adapt to reduced precision, typically recovering accuracy lost with PTQ.
```python
import os
import torch
import torch.quantization as quant

# --- Post-Training Quantization ---
# `load_trained_model` and `calibration_loader` are assumed to be defined elsewhere.
model = load_trained_model()
model.eval()

# Quantization configuration: int8 for weights and activations
model.qconfig = quant.get_default_qconfig('fbgemm')    # for x86
# model.qconfig = quant.get_default_qconfig('qnnpack') # for ARM

# Insert observers that will record activation ranges
model_prepared = quant.prepare(model)

# Calibration: run real data through the model to determine ranges
with torch.no_grad():
    for batch in calibration_loader:
        model_prepared(batch)

# Convert to a quantized model
model_quantized = quant.convert(model_prepared)

# Size comparison
torch.save(model.state_dict(), 'model_fp32.pth')
torch.save(model_quantized.state_dict(), 'model_int8.pth')
size_fp32 = os.path.getsize('model_fp32.pth') / 1e6
size_int8 = os.path.getsize('model_int8.pth') / 1e6
print(f"FP32: {size_fp32:.1f} MB")
print(f"INT8: {size_int8:.1f} MB")
print(f"Compression: {size_fp32/size_int8:.1f}x")
```
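The QAT variant described earlier can be sketched with PyTorch's eager-mode API: quant/dequant stubs mark the int8 region, fake-quantization runs during training, and the model is converted afterwards. A minimal sketch; the tiny architecture, data, and single training step are illustrative placeholders:

```python
import torch
import torch.quantization as quant

# --- Quantization-Aware Training (sketch) ---
class SmallNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = quant.QuantStub()      # marks where float -> int8
        self.fc1 = torch.nn.Linear(64, 32)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(32, 10)
        self.dequant = quant.DeQuantStub()  # marks where int8 -> float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet()
model.train()
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
model_prepared = quant.prepare_qat(model)

# Train with fake-quantization in the forward pass (one step shown)
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=0.01)
x = torch.randn(8, 64)
y = torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model_prepared(x), y)
loss.backward()
optimizer.step()

# After training, convert to a real int8 model
model_prepared.eval()
model_int8 = quant.convert(model_prepared)
```

Because the network sees quantization noise throughout training, its weights settle into values that survive the int8 rounding, which is why QAT typically recovers the accuracy that PTQ loses.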
Pruning: Removing Non-Essential Weights
Pruning removes connections (weights) or entire structures (neurons, channels, layers) that contribute little to the model output. Neural networks are typically over-parameterized: up to 90% of weights can be removed with minimal accuracy loss.
- Unstructured pruning: removes individual weights (setting them to zero). High compression ratio, but requires sparse-aware hardware or kernels to translate into real speedups
- Structured pruning: removes entire channels, filters, or layers. Compatible with standard hardware, immediate real acceleration
```python
import torch
import torch.nn.utils.prune as prune

# `load_trained_model` is assumed to be defined elsewhere.
model = load_trained_model()

# Unstructured pruning: zero out the smallest weights by L1 magnitude
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.5)  # 50% of conv weights
    elif isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)  # 30% of linear weights

# Count zeroed (pruned) weights. While the pruning reparametrization is
# active, the masked tensor is the module attribute `weight` (the underlying
# parameter is renamed `weight_orig`), so inspect modules directly.
total = 0
pruned = 0
for module in model.modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        total += module.weight.numel()
        pruned += (module.weight == 0).sum().item()
print(f"Total parameters: {total:,}")
print(f"Pruned parameters: {pruned:,} ({100*pruned/total:.1f}%)")

# Make pruning permanent (bake the mask into the weight tensor)
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        prune.remove(module, 'weight')
```
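The structured variant can use `prune.ln_structured`, which zeroes entire filters rather than scattered weights. A minimal sketch on a standalone layer; the layer shape and pruning fraction are illustrative:

```python
import torch
import torch.nn.utils.prune as prune

# Structured pruning: remove 25% of a conv layer's output filters
# (dim=0 of the weight tensor), ranked by L2 norm.
conv = torch.nn.Conv2d(16, 32, kernel_size=3)
prune.ln_structured(conv, name='weight', amount=0.25, n=2, dim=0)

# Entire filters are now zero; a follow-up step could physically delete
# them (and the matching input channels of the next layer) to shrink the model.
zero_filters = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"Zeroed filters: {zero_filters}/32")  # 8/32
```

Because whole filters disappear, the remaining network is a plain dense model, which is what makes structured pruning deliver speedups on standard hardware.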
Knowledge Distillation: Teacher and Student
Knowledge distillation transfers knowledge from a large model (teacher) to a small model (student). Instead of training the student only on the correct labels, it is also trained to replicate the teacher's soft predictions (the full output probability distribution, not just the predicted class). These soft predictions carry richer information: for an image of a cat, the teacher assigns a relatively high probability to "tiger" and a near-zero one to "airplane", and this similarity structure helps the student learn.
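The soft-label idea is usually implemented as a loss that mixes a KL-divergence term on temperature-softened logits with ordinary cross-entropy on the hard labels. A minimal PyTorch sketch; the temperature T=4 and weight alpha=0.7 are typical choices, not values from this article:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of soft (teacher-matching) and hard (true-label) objectives."""
    # KL divergence between temperature-softened distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: batch of 8 samples, 10 classes
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

Raising the temperature flattens the teacher's distribution, exposing the inter-class similarities (cat vs. tiger) that a one-hot label throws away.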
TensorFlow Lite: Mobile and Edge Deployment
TensorFlow Lite is the reference framework for deploying ML models on mobile and embedded devices. It offers a converter that optimizes the model (quantization, operator fusion) and a lightweight runtime for inference.
```python
import tensorflow as tf

# Convert a Keras model to TFLite with full int8 quantization.
# `representative_data_gen` must yield a few hundred typical input samples
# so the converter can calibrate activation ranges.
model = tf.keras.models.load_model('my_model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

# Save the TFLite model
with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_model)
print(f"TFLite model: {len(tflite_model)/1024:.0f} KB")

# Inference with the TFLite interpreter (`input_data` must match the
# input tensor's shape and int8 dtype)
interpreter = tf.lite.Interpreter(model_path='model_quantized.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]['index'])
```
ONNX: Framework Interoperability
ONNX (Open Neural Network Exchange) is an open format that allows exporting models from any framework (PyTorch, TensorFlow, etc.) and running them with optimized runtimes. ONNX Runtime offers automatic optimizations for different hardware platforms, while TVM compiles models into optimized native code for specific devices.
Real-World Use Cases
- Keyword Spotting: recognizing voice commands ("Hey Siri", "OK Google") on devices with a few KB of RAM. Models of 20-50 KB with >95% accuracy
- Gesture Recognition: recognizing gestures with IMU sensors (accelerometer, gyroscope). Models for smartwatches and wearables
- Anomaly Detection: detecting anomalies in industrial machinery with vibration sensors. Predictive maintenance without cloud connection
- On-Mobile Object Detection: MobileNet and EfficientDet for real-time object recognition on smartphones
- On-Device Facial Recognition: Apple's Face ID uses a dedicated neural model in the A-series chip
Next Steps in the Series
- In the final article we will explore Explainable AI (XAI)
- We will see how to interpret model decisions with SHAP, LIME, and GradCAM
- We will analyze why model transparency is fundamental for GDPR compliance and fairness