Introduction: Learning from Rewards
Reinforcement Learning (RL) is a learning paradigm fundamentally different from supervised and unsupervised learning. Instead of learning from labeled data, an agent interacts with an environment, takes actions, and receives rewards. The goal is to learn a policy (strategy) that maximizes cumulative reward over time.
From chess (AlphaZero) to robotics (object manipulation), from algorithmic trading to autonomous driving, reinforcement learning underpins some of the most impressive AI applications. In this article we will explore the fundamental concepts, from classic Q-Learning to modern deep RL algorithms like DQN and PPO.
What You Will Learn
- Markov Decision Process (MDP): state, action, reward, transition
- Q-Learning: the state-action value table
- Deep Q-Network (DQN): approximating Q with neural networks
- Policy Gradient: directly optimizing the policy
- Proximal Policy Optimization (PPO): stability and performance
- Exploration vs exploitation: balancing discovery and utilization
- Practical implementation with Gymnasium (the maintained successor to OpenAI Gym)
Markov Decision Process (MDP)
The formal framework of RL is the Markov Decision Process, defined by four components:
- States (S): the possible situations the agent can be in (e.g., position on a grid, game frame)
- Actions (A): the available moves in each state (e.g., up, down, left, right)
- Rewards (R): the numerical feedback received after each action (e.g., +1 for winning, -1 for losing)
- Transitions (P): the probability of moving to the next state given an action. The Markov property states that the future depends only on the current state, not the history
The agent seeks an optimal policy that maximizes the discounted sum of future rewards. The discount factor gamma (between 0 and 1) balances immediate vs future rewards: gamma close to 0 makes the agent myopic, close to 1 makes it farsighted.
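To make the discount factor concrete, here is a minimal sketch (with made-up reward values) computing the discounted return G = R_0 + gamma*R_1 + gamma^2*R_2 + ...:

```python
def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... by folding backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, 0.0))   # myopic: only the first reward counts
print(discounted_return(rewards, 0.99))  # farsighted: the distant +10 still matters
```

With gamma = 0 the agent sees only the immediate +1; with gamma = 0.99 the +10 three steps away dominates the return.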
Q-Learning: State-Action Values
Q-Learning is an off-policy algorithm that learns the Q(s, a) function: the expected cumulative reward of choosing action a in state s and following the optimal policy from that point on. The update rule is:
Q(s, a) ← Q(s, a) + alpha * [R + gamma * max_a' Q(s', a') - Q(s, a)]
```python
import numpy as np
import gymnasium as gym

class QLearningAgent:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=1.0):
        self.q_table = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

    def choose_action(self, state):
        """Epsilon-greedy: explore with prob epsilon, exploit otherwise."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])
        return np.argmax(self.q_table[state])

    def learn(self, state, action, reward, next_state, done):
        """Q-table update."""
        target = reward
        if not done:
            target += self.gamma * np.max(self.q_table[next_state])
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])
        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# Training on FrozenLake
env = gym.make('FrozenLake-v1', is_slippery=False)
agent = QLearningAgent(n_states=16, n_actions=4)

for episode in range(5000):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    if episode % 1000 == 0:
        print(f"Episode {episode}, Reward: {total_reward}, "
              f"Epsilon: {agent.epsilon:.3f}")
```
Exploration vs Exploitation
The fundamental dilemma of RL: the agent must explore (try new actions to discover better rewards) and exploit (use current knowledge to maximize reward). The epsilon-greedy strategy balances both aspects: with probability epsilon it chooses a random action, otherwise the best known action. Epsilon is gradually reduced during training.
Deep Q-Network (DQN)
Q-Learning with tables only works for small, discrete state spaces. For complex environments (images, continuous states), the Deep Q-Network (DQN) replaces the Q-table with a neural network that approximates the Q function. Two key innovations stabilize training:
- Experience Replay: transitions are stored in a buffer and randomly sampled for training, breaking temporal correlation between consecutive samples
- Target Network: a separate copy of the network, periodically updated, computes target Q-values, stabilizing learning
```python
import numpy as np
import torch
import torch.nn as nn
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_net = DQN(state_dim, action_dim).to(self.device)
        self.target_net = DQN(state_dim, action_dim).to(self.device)
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.memory = deque(maxlen=100000)
        self.gamma = gamma
        self.batch_size = 64

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def learn(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        # Stack into arrays first: building tensors from lists of arrays is slow
        states = torch.as_tensor(np.array(states), dtype=torch.float32, device=self.device)
        actions = torch.as_tensor(actions, dtype=torch.int64, device=self.device)
        rewards = torch.as_tensor(rewards, dtype=torch.float32, device=self.device)
        next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32, device=self.device)
        dones = torch.as_tensor(dones, dtype=torch.float32, device=self.device)
        # Q(s, a) for the actions actually taken
        q_values = self.q_net(states).gather(1, actions.unsqueeze(1))
        # Targets come from the frozen target network (no gradient)
        with torch.no_grad():
            next_q = self.target_net(next_states).max(1)[0]
            targets = rewards + (1 - dones) * self.gamma * next_q
        loss = nn.functional.mse_loss(q_values.squeeze(1), targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def update_target(self):
        self.target_net.load_state_dict(self.q_net.state_dict())
```
Policy Gradient and Actor-Critic
Policy Gradient methods directly optimize the policy without going through the Q function. The policy gradient theorem states that the gradient of the expected return with respect to the policy parameters is an expectation of the return weighted by the gradient of the log-probability of the chosen actions: grad J(theta) = E[G_t * grad log pi_theta(a_t | s_t)].
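The estimator can be sketched on a toy two-armed bandit (a hypothetical problem, not from the article): a softmax policy over two logits, where arm 0 deterministically pays 1.0 and arm 1 pays 0.0. Each step nudges the logits by the reward times the gradient of the log-probability, and probability mass shifts toward the better arm.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # policy parameters (logits)
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 0 else 0.0  # arm 0 is the better arm
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs
    grad_log = -probs
    grad_log[action] += 1.0
    theta += lr * reward * grad_log  # gradient ascent on expected reward

print(softmax(theta))  # most probability mass ends up on arm 0
```

This is vanilla REINFORCE without a baseline; real implementations subtract a baseline to reduce variance, which is exactly where the Critic comes in below.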
Actor-Critic
The Actor-Critic architecture combines both approaches: the Actor (policy network) chooses actions, the Critic (value network) estimates how good the current state is. The Critic reduces the variance of Actor updates, making training more stable.
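A minimal illustration of the Critic's role, using made-up tabular state values: the TD error r + gamma * V(s') - V(s) serves as the advantage estimate that weights the Actor's log-probability gradient.

```python
import numpy as np

# Hypothetical critic: tabular values for a 3-state toy problem
V = np.array([0.0, 0.5, 1.0])
gamma = 0.99

def td_advantage(reward, s, s_next, done):
    """TD error r + gamma * V(s') - V(s): positive means the action
    turned out better than the critic expected, so the actor should
    make it more likely; negative means the opposite."""
    target = reward + (0.0 if done else gamma * V[s_next])
    return target - V[s]

print(td_advantage(1.0, 0, 1, False))  # better than expected -> positive
print(td_advantage(0.0, 2, 0, True))   # worse than expected -> negative
```

Because the TD error is centered around the critic's prediction, it fluctuates far less than the raw return G_t, which is what stabilizes the Actor's updates.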
PPO: The Industry Standard
Proximal Policy Optimization (PPO), developed by OpenAI, is the most widely used RL algorithm in practice. Its key innovation is the clipped objective: it limits how much the new policy can deviate from the old one at each update, preventing overly drastic changes that would destabilize training.
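The clipped objective can be sketched in a few lines (numpy scalars for brevity; real implementations work on log-probabilities over batches):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO surrogate, where ratio = pi_new(a|s) / pi_old(a|s).
    Clipping removes any incentive to push the ratio outside
    [1 - eps, 1 + eps]; taking the min keeps a pessimistic bound."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# Positive advantage: no extra credit beyond ratio = 1 + eps
print(ppo_clip_objective(1.5, 1.0))
# Negative advantage: the min still penalizes large deviations
print(ppo_clip_objective(0.5, -1.0))
```

With eps = 0.2, a ratio of 1.5 on a positive advantage is capped at 1.2, so the gradient through the ratio vanishes and the policy cannot move too far in a single update.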
PPO underpins many successes: InstructGPT and RLHF (LLM alignment), OpenAI Five (Dota 2), robotic agent training, and many others.
Next Steps in the Series
- In the next article we will explore Advanced Transfer Learning with BERT, GPT, and Hugging Face
- We will cover fine-tuning, prompt engineering, and RAG (Retrieval-Augmented Generation)
- We will compare open-source models: Llama, Mistral, Falcon