Introduction: What Are Neural Networks
Artificial neural networks are the foundation of modern deep learning. Inspired by the structure of the human brain, these computational architectures can learn complex patterns from data through an iterative optimization process called backpropagation. From image classification to machine translation, neural networks power the most advanced artificial intelligence applications in the world.
In this first article of the Deep Learning and Neural Networks series, we will start from the historical origins with the Perceptron (1958) and build up to the fundamental concepts that enable networks to learn: weights, biases, activation functions, gradient descent, and backpropagation. By the end, we will implement a neural network from scratch in Python and PyTorch.
What You Will Learn
- The history of neural networks: from the Perceptron to modern Deep Learning
- Neural network architecture: input, hidden, and output layers
- Activation functions: ReLU, Sigmoid, Tanh and visual comparison
- Backpropagation: how the network computes gradients and updates weights
- Loss functions: MSE and Cross-Entropy for different tasks
- Practical implementation in NumPy and PyTorch
The Perceptron: The First Artificial Neuron
In 1958, Frank Rosenblatt introduced the Perceptron, the first model of an artificial neuron. The idea was simple yet revolutionary: a computational unit that receives numerical inputs, multiplies them by weights, sums the results, and produces a binary output through a threshold function.
Mathematically, the perceptron computes a weighted sum of its inputs plus a bias term, then applies a step function to produce the output:
```python
# Simple Perceptron in Python
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.01):
        self.weights = np.random.randn(n_inputs)
        self.bias = 0.0
        self.lr = learning_rate

    def predict(self, x):
        """Forward pass: weighted sum + threshold"""
        linear_output = np.dot(x, self.weights) + self.bias
        return 1 if linear_output >= 0 else 0

    def train(self, X, y, epochs=100):
        """Perceptron learning rule"""
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                prediction = self.predict(xi)
                error = yi - prediction
                self.weights += self.lr * error * xi
                self.bias += self.lr * error

# Example: AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

p = Perceptron(n_inputs=2)
p.train(X, y, epochs=50)
print([p.predict(xi) for xi in X])  # [0, 0, 0, 1]
```
The perceptron works perfectly for linearly separable problems like AND and OR. However, in 1969 Minsky and Papert demonstrated that a single perceptron cannot solve the XOR problem, where classes are not separable by a straight line. This discovery slowed neural network research for over a decade, a period known as the AI Winter.
The XOR Limitation and the Need for Deep Learning
The XOR problem showed that hidden layers are needed to solve non-linearly separable problems. This insight led to the development of Multi-Layer Perceptrons (MLPs) and, decades later, modern deep learning. Today we know that even a single hidden layer with a non-linear activation function allows a neural network to approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons (Universal Approximation Theorem).
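To make this concrete, here is a hand-constructed 2-2-1 network that computes XOR, something no single perceptron can do. The weights are chosen by hand purely for illustration, not learned:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Hidden neuron 1 computes x1 + x2; hidden neuron 2 computes x1 + x2 - 1
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# Output combines them as h1 - 2*h2
W2 = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(x @ W1 + b1)           # hidden layer with non-linear activation
    return int(h @ W2 >= 0.5)       # threshold the output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print([xor_net(x) for x in X])  # [0, 1, 1, 0]
```

The hidden layer bends the input space so that the two classes become linearly separable, which is exactly what a single perceptron cannot do on its own.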
Neural Network Architecture: Layers and Neurons
A neural network is organized into layers of interconnected neurons. The classic architecture comprises three types of layers:
- Input Layer: receives raw data. Each neuron represents a dataset feature (e.g., image pixels, text words)
- Hidden Layer(s): one or more intermediate layers where processing occurs. Each neuron receives input from the previous layer, applies weights and bias, and passes the result through an activation function
- Output Layer: produces the final prediction. For binary classification: 1 neuron with sigmoid. For multi-class: N neurons with softmax
The forward pass is the process by which data flows from input to output through all layers. For each neuron, the computation follows three steps: weighted sum of inputs, bias addition, and activation function application.
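The three steps can be sketched in a few lines of NumPy. The layer sizes and random inputs below are illustrative choices, not prescribed by any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_forward(x, W, b, activation):
    z = x @ W + b          # steps 1 and 2: weighted sum plus bias
    return activation(z)   # step 3: activation function

relu = lambda z: np.maximum(0, z)

x = rng.normal(size=(4,))       # 4 input features
W1 = rng.normal(size=(4, 3))    # input -> hidden layer (3 neurons)
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2))    # hidden -> output layer (2 neurons)
b2 = np.zeros(2)

h = layer_forward(x, W1, b1, relu)          # hidden activations
y = layer_forward(h, W2, b2, lambda z: z)   # linear output layer
print(y.shape)  # (2,)
```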
Activation Functions: ReLU, Sigmoid, and Tanh
Activation functions introduce non-linearity into the network, enabling it to learn complex relationships in the data. Without them, a network with N layers would be equivalent to a single linear layer, regardless of depth.
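The collapse of stacked linear layers into a single one can be verified directly; the shapes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(4, 3))
x = rng.normal(size=(10, 5))

# Two linear layers applied in sequence (no activation in between)...
two_layers = (x @ W1) @ W2
# ...are exactly one linear layer with the merged weight matrix W1 @ W2
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```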
Sigmoid
The sigmoid function squashes any value into the range (0, 1). Historically used as the standard activation, today it is primarily employed in the output layer for binary classification. Its main problem is vanishing gradient: for very high or very low values, the gradient becomes nearly zero, drastically slowing down learning.
Tanh
The tanh (hyperbolic tangent) function maps values to the range (-1, 1). Zero-centered, it offers stronger gradients than sigmoid, making it preferable in hidden layers. However, it also suffers from vanishing gradient for extreme values.
ReLU (Rectified Linear Unit)
ReLU is the most widely used activation function in modern deep learning. Its formula is extremely simple: f(x) = max(0, x). The advantages are numerous: efficient computation, no vanishing gradient for positive values, and promotion of sparse representations. The main downside is the dying ReLU problem: neurons that consistently receive negative inputs output zero, their gradient vanishes, and they stop learning entirely. Variants like Leaky ReLU mitigate this by allowing a small slope for negative inputs.
```python
import numpy as np
import matplotlib.pyplot as plt

# Activation function implementations
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Derivatives for backpropagation
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_derivative(x):
    return np.where(x > 0, 1, 0)

# Visual comparison
x = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, func, name in zip(axes, [sigmoid, tanh, relu, leaky_relu],
                          ['Sigmoid', 'Tanh', 'ReLU', 'Leaky ReLU']):
    ax.plot(x, func(x), linewidth=2)
    ax.set_title(name)
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150)
```
Loss Functions: Measuring Error
The loss function quantifies how much the network's predictions deviate from the actual values. It provides the error signal that guides learning during training.
MSE (Mean Squared Error)
Used for regression problems, MSE computes the average of squared differences between predictions and actual values. It penalizes large errors more heavily, making it sensitive to outliers.
Cross-Entropy
For classification problems, Cross-Entropy measures the distance between the predicted probability distribution and the actual one. For binary classification, Binary Cross-Entropy is used; for multi-class, Categorical Cross-Entropy. Cross-Entropy produces stronger gradients when the network is very confident but wrong, accelerating correction.
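Both losses are short enough to implement directly. The example values below are illustrative; the epsilon clip in the cross-entropy avoids taking log(0):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared differences: penalizes large errors quadratically
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logarithms stay finite
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.2, 0.8])    # mostly correct predictions
bad = np.array([0.01, 0.2, 0.8])    # confidently wrong on the first sample

print(mse(y_true, good))                      # 0.03
print(binary_cross_entropy(y_true, good))
print(binary_cross_entropy(y_true, bad))      # much larger loss
```

Note how the confidently wrong prediction (0.01 for a true label of 1) inflates the cross-entropy far more than the mild errors do: that steep penalty is exactly what accelerates correction.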
Backpropagation: How the Network Learns
Backpropagation is the fundamental algorithm that enables neural networks to learn. Introduced by Rumelhart, Hinton, and Williams in 1986, it applies the chain rule of calculus to compute the gradient of the loss function with respect to every weight in the network.
The process consists of four phases:
- Forward Pass: data flows through the network from input to output, computing activations at each layer
- Loss Computation: the error between predicted output and actual value is measured
- Backward Pass: gradients are computed starting from the output towards the input, propagating the error backwards
- Weight Update: each weight is modified in the opposite direction of the gradient, proportionally to the learning rate
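The four phases above can be sketched from scratch in NumPy on a tiny network learning XOR. The hidden size, learning rate, and iteration count here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr, losses = 0.5, []

for _ in range(5000):
    # 1. Forward pass
    a1 = np.tanh(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # 2. Loss computation (binary cross-entropy)
    losses.append(-np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2)))
    # 3. Backward pass: chain rule, from output towards input
    dz2 = (a2 - y) / len(X)                # dL/dz2 for sigmoid + cross-entropy
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # 4. Weight update: step against the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print(preds.ravel(), losses[-1])
```

After training, the loss has dropped far below its initial value and the network separates the XOR classes that defeated the single perceptron.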
Gradient Descent: The Fundamental Optimizer
Gradient Descent updates weights according to the formula: w = w - lr * dL/dw. Modern variants include SGD with momentum (accumulates velocity in the gradient direction), Adam (adaptive learning rate per parameter), and AdamW (Adam with corrected weight decay). Adam is the default optimizer in most deep learning applications.
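The update rule is easiest to see on a toy problem: minimizing the quadratic L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3). The learning rate, momentum coefficient, and iteration counts below are illustrative:

```python
def grad(w):
    return 2 * (w - 3)   # dL/dw for L(w) = (w - 3)^2

# Vanilla gradient descent: w = w - lr * dL/dw
w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)
w_gd = w

# SGD with momentum: accumulate velocity in the gradient direction
w, v, beta = 0.0, 0.0, 0.9
for _ in range(200):
    v = beta * v + grad(w)
    w -= lr * v
w_momentum = w

print(w_gd, w_momentum)  # both approach the minimum at w = 3
```

Adam builds on the same idea, additionally keeping a running estimate of the squared gradient so each parameter gets its own effective learning rate.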
Complete Implementation: MLP in PyTorch
Let us put everything together by implementing a Multi-Layer Perceptron for handwritten digit classification (MNIST dataset) using PyTorch:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# MLP Model Definition
class MLP(nn.Module):
    def __init__(self, input_size=784, hidden_sizes=[256, 128], num_classes=10):
        super().__init__()
        self.flatten = nn.Flatten()
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_sizes[0]),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_sizes[0], hidden_sizes[1]),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_sizes[1], num_classes)
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.network(x)

# Dataset and DataLoader setup
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    model.train()
    total_loss = 0
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        optimizer.zero_grad()
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Evaluation
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_x, batch_y in test_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = model(batch_x)
            pred = output.argmax(dim=1)
            correct += (pred == batch_y).sum().item()
    accuracy = 100. * correct / len(test_dataset)
    print(f'Epoch {epoch+1}: Loss={total_loss/len(train_loader):.4f}, '
          f'Accuracy={accuracy:.2f}%')
```
This model achieves approximately 98% accuracy on MNIST after 10 epochs. The network has two hidden layers (256 and 128 neurons), uses ReLU activation, Dropout for regularization, and the Adam optimizer with a learning rate of 0.001.
Deep Learning: Why Multiple Layers Work
Deep learning is distinguished from traditional machine learning by the use of networks with many hidden layers. But why is depth so important?
The answer lies in hierarchical feature composition. Each layer learns to recognize patterns at an increasing level of abstraction:
- Layer 1: Detects edges, gradients, and simple textures
- Layer 2: Combines edges into geometric shapes (corners, curves)
- Layer 3: Recognizes object parts (eyes, wheels, windows)
- Layer 4+: Identifies complete objects and composed scenes
This hierarchy of representations is why deep networks such as ResNet (152 layers) can achieve superhuman performance in image classification, while a single layer could never capture the same complexity.
However, depth also brings challenges: vanishing gradient makes training very deep networks difficult because the error signal attenuates as it passes through many layers. Modern solutions include skip connections (ResNet), batch normalization, and activation functions like ReLU that maintain more stable gradients.
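A skip connection is simple to express in PyTorch. The block below is a minimal sketch in the spirit of ResNet, not a faithful ResNet block (those use convolutions and batch normalization); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Skip connection: the output is f(x) + x, so the gradient always
        # has an identity path back through the block, even when f's
        # gradients are small
        return torch.relu(self.f(x) + x)

x = torch.randn(8, 64)
block = ResidualBlock()
print(block(x).shape)  # torch.Size([8, 64])
```

Because the identity path bypasses the learned transformation, stacking many such blocks no longer multiplies small gradients together layer after layer, which is what makes very deep networks trainable.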
Next Steps in the Series
- In the next article we will explore Convolutional Neural Networks (CNNs), the architecture that revolutionized computer vision
- We will see how convolutions and pooling extract spatial features from images
- We will implement classic architectures like VGG and ResNet in PyTorch