Introduction: How Models Find the Best Weights
Training a machine learning model means finding the parameter values that minimize a loss function. This process, called optimization, lies at the heart of deep learning. The workhorse algorithm is gradient descent, together with its many variants, each balancing speed, stability, and generalization differently.
What You Will Learn
- Vanilla Gradient Descent and its limitations
- Stochastic Gradient Descent (SGD) and mini-batches
- Momentum and Nesterov Accelerated Gradient
- Adaptive algorithms: RMSprop and Adam
- Learning rate scheduling: warmup, cosine annealing
- Convergence, saddle points, and the loss landscape
Gradient Descent: The Base Algorithm
Gradient descent updates parameters by moving in the direction opposite to the gradient of the loss:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t)

where \eta is the learning rate (a critical hyperparameter) and \nabla_{\theta} L is the gradient computed on the entire training set (batch gradient descent).
Analogy: imagine standing on a hill in the fog, wanting to reach the lowest valley. At each step, you feel the slope under your feet (gradient) and walk in the steepest downhill direction. The learning rate is the step size: too large and you overshoot the valley, too small and it takes forever.
The Learning Rate Problem
The choice of learning rate is crucial:
- \eta too large: the loss oscillates or diverges
- \eta too small: extremely slow convergence
- \eta just right: stable convergence to the minimum
```python
import numpy as np

# Objective function: f(x) = x^4 - 3x^2 + 2 (has two minima)
def f(x):
    return x**4 - 3*x**2 + 2

def grad_f(x):
    return 4*x**3 - 6*x

# Gradient descent with different learning rates
for lr in [0.01, 0.05, 0.1]:
    x = 2.0  # starting point
    history = [x]
    for _ in range(100):
        x = x - lr * grad_f(x)
        history.append(x)
    print(f"lr={lr}: x_final={x:.6f}, f(x)={f(x):.6f}")
```
Stochastic Gradient Descent (SGD)
Computing the gradient on the entire dataset is expensive. SGD uses a single sample (or a mini-batch) to estimate the gradient:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L(\theta_t; x^{(i)}, y^{(i)})
The estimated gradient is noisy but on average points in the right direction. The noise has a surprising benefit: it helps escape local minima and acts as an implicit regularizer.
In practice, mini-batch SGD is used with batches of 32-256 samples, a compromise between gradient variance (lower with larger batches) and computational efficiency.
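To make the mini-batch estimate concrete, here is a minimal sketch on ordinary least-squares linear regression. The data, batch size, and learning rate are illustrative choices, not from the text; each update uses the gradient of the loss on one mini-batch only:

```python
import numpy as np

# Illustrative mini-batch SGD on least-squares linear regression
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 64
for epoch in range(20):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # Gradient of 0.5 * mean squared error, on this mini-batch only
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad

print(w)  # close to true_w despite every step using a noisy estimate
```

Each individual gradient is noisy, but over many updates the estimates average out and w converges to the true coefficients.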
Momentum: Accelerating Convergence
Momentum adds a "velocity" term to the update, accumulating past gradients as an exponential moving average:

v_t = \beta v_{t-1} + (1 - \beta) \nabla_{\theta} L(\theta_t)
\theta_{t+1} = \theta_t - \eta v_t

where \beta \approx 0.9 controls how much "memory" the momentum has.
Intuition: think of a ball rolling down a hill. Without momentum, it stops every time it hits a small bump. With momentum, the ball accumulates speed and rolls over small dips, converging faster to the valley. Momentum reduces oscillations in directions with alternating gradients and accelerates in directions with consistent gradient.
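A minimal sketch of these mechanics, using the exponential-moving-average form of momentum on the elliptic quadratic f(x, y) = x^2 + 10y^2 that also appears later in this article (the lr and beta values are illustrative, not from the text):

```python
import numpy as np

# f(x, y) = x**2 + 10*y**2: steep in y, shallow in x
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

p_plain = np.array([5.0, 5.0])   # plain gradient descent
p_mom = np.array([5.0, 5.0])     # gradient descent with momentum
v = np.zeros(2)
lr, beta = 0.05, 0.9
for _ in range(100):
    p_plain -= lr * grad(p_plain)
    v = beta * v + (1 - beta) * grad(p_mom)  # velocity: EMA of gradients
    p_mom -= lr * v
print(p_plain, p_mom)  # both near the minimum (0, 0)
```

On a well-conditioned toy like this both variants converge; momentum's damping of oscillations pays off most when gradients alternate sign along steep directions, as in noisy high-dimensional losses.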
Nesterov Accelerated Gradient (NAG)
NAG improves momentum by computing the gradient at the "future" (look-ahead) position:

v_t = \beta v_{t-1} + (1 - \beta) \nabla_{\theta} L(\theta_t - \eta \beta v_{t-1})
\theta_{t+1} = \theta_t - \eta v_t
"First look where I am going, then correct the direction." NAG converges faster than standard momentum and decelerates before minima.
RMSprop: Adaptive Learning Rate
RMSprop adapts the learning rate individually for each parameter, dividing by the square root of a running average of squared gradients:

s_t = \rho s_{t-1} + (1 - \rho) g_t^2
\theta_{t+1} = \theta_t - \eta \, g_t / (\sqrt{s_t} + \epsilon)

where g_t = \nabla_{\theta} L(\theta_t) and \rho \approx 0.9.
For parameters with large gradients, the effective learning rate decreases; for parameters with small gradients, it increases. This solves the problem of different scales across features.
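A minimal sketch of this per-parameter scaling on a badly conditioned quadratic (the rho, lr, and epsilon values are illustrative):

```python
import numpy as np

# f(x, y) = x**2 + 100*y**2: the two parameters have gradient scales
# that differ by a factor of 100
def grad(p):
    return np.array([2 * p[0], 200 * p[1]])

p = np.array([5.0, 5.0])
s = np.zeros(2)                       # running average of squared gradients
lr, rho, eps = 0.1, 0.9, 1e-8
for _ in range(300):
    g = grad(p)
    s = rho * s + (1 - rho) * g**2    # per-parameter second-moment estimate
    p -= lr * g / (np.sqrt(s) + eps)  # each coordinate gets its own step size
print(p)  # both coordinates near 0 despite the 100x scale difference
```

Without the division by \sqrt{s_t}, a single learning rate small enough for the steep y-direction would crawl along the shallow x-direction.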
Adam: The State-of-the-Art
Adam (Adaptive Moment Estimation) combines the best of momentum and RMSprop, maintaining both a running average of the gradient (first moment) and a running average of squared gradients (second moment):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

With bias correction to compensate for the zero initialization:

\hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)

Final update:

\theta_{t+1} = \theta_t - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
Recommended default hyperparameters: \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}.
```python
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads**2
        # Bias correction
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
        return params

# Test: minimize f(x, y) = x^2 + 10*y^2 (elliptic landscape)
def f(params):
    return params[0]**2 + 10 * params[1]**2

def grad_f(params):
    return np.array([2*params[0], 20*params[1]])

# Compare SGD vs Adam
params_sgd = np.array([5.0, 5.0])
params_adam = np.array([5.0, 5.0])
optimizer = Adam(lr=0.1)
print("Step | SGD f(x) | Adam f(x)")
for step in range(50):
    params_sgd -= 0.01 * grad_f(params_sgd)
    params_adam = optimizer.update(params_adam, grad_f(params_adam))
    if step % 10 == 0:
        print(f"{step:4d} | {f(params_sgd):8.4f} | {f(params_adam):8.4f}")
```
Learning Rate Scheduling
Keeping the learning rate fixed throughout training is rarely optimal. Scheduling strategies adapt \eta as training progresses:
Step Decay
\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}

where \gamma (typically 0.1) is the decay factor and s is the number of epochs between reductions.
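A short sketch of this schedule (eta0, gamma, and s are illustrative values):

```python
# Step decay: eta_t = eta0 * gamma ** (t // s)
def step_decay(t, eta0=0.1, gamma=0.1, s=30):
    return eta0 * gamma ** (t // s)

for epoch in [0, 29, 30, 60, 90]:
    print(epoch, step_decay(epoch))
# 0.1 for epochs 0-29, then 0.01, 0.001, 0.0001 at each 30-epoch boundary
```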
Cosine Annealing
Gradually reduces the learning rate along a cosine curve:

\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos(\pi t / T)\right)

The decay is slow at first, fastest mid-training, and flattens out near the end.
Warmup + Decay
Used in Transformers: the learning rate increases linearly for T_w steps (warmup), then decays, e.g.

\eta_t = \eta_{max} \cdot \min\left(t / T_w, \sqrt{T_w / t}\right)
```python
import numpy as np

def cosine_annealing(t, T, eta_min=1e-6, eta_max=1e-3):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(t * np.pi / T))

def warmup_cosine(t, warmup_steps, total_steps, eta_max=1e-3):
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    else:
        progress = (t - warmup_steps) / (total_steps - warmup_steps)
        return eta_max * 0.5 * (1 + np.cos(progress * np.pi))

# Print the schedule values
total_steps = 1000
warmup = 100
print("Step | Cosine LR | Warmup+Cosine LR")
for t in range(0, total_steps, 100):
    cos_lr = cosine_annealing(t, total_steps)
    warm_lr = warmup_cosine(t, warmup, total_steps)
    print(f"{t:4d} | {cos_lr:.6f} | {warm_lr:.6f}")
```
Saddle Points and Loss Landscape
In high-dimensional spaces (millions of parameters), bad local minima are rare. The real obstacle is saddle points: points where the gradient is zero but which are neither minima nor maxima. The number of saddle points relative to local minima grows exponentially with dimensionality.
Fortunately, SGD with momentum and Adam can escape saddle points thanks to stochastic noise and accumulated momentum.
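The escape mechanism can be seen on the classic toy saddle f(x, y) = x^2 - y^2, which has a saddle point at the origin (the noise scale and step count are illustrative; this toy function is unbounded below, so "escaping" here simply means leaving the saddle):

```python
import numpy as np

# f(x, y) = x**2 - y**2: minimum in x, maximum in y, saddle at (0, 0)
def grad(p):
    return np.array([2 * p[0], -2 * p[1]])

rng = np.random.default_rng(0)
lr = 0.1
p_det = np.array([1.0, 0.0])    # started exactly on the x-axis
p_noisy = np.array([1.0, 0.0])  # same start, with SGD-like gradient noise
for _ in range(60):
    p_det -= lr * grad(p_det)
    p_noisy -= lr * (grad(p_noisy) + 0.01 * rng.normal(size=2))
print(p_det)    # stays at the saddle: y remains exactly 0
print(p_noisy)  # y has drifted away from 0: the noise broke the symmetry
```

Deterministic gradient descent started on the x-axis can never develop a y-component, so it converges to the saddle; the tiny injected noise seeds the unstable direction, which the negative curvature then amplifies.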
Summary and Connections to ML
Key Takeaways
- Gradient Descent: \theta \leftarrow \theta - \eta \nabla L - the base algorithm
- SGD: uses mini-batches for efficiency, noise helps generalization
- Momentum: accumulates velocity, overcomes irregularities in the loss surface
- Adam: combines momentum + adaptive learning rate, the DL default
- Learning rate scheduling: warmup + cosine decay is the Transformer standard
- Saddle points: more problematic than local minima in high dimensions
In the Next Article: we will explore information theory. We will cover entropy, cross-entropy (the most used classification loss), KL divergence, and the deep connections with maximum likelihood.