Introduction: The Transformer Revolution
Transformers, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., have revolutionized AI. From GPT to BERT, from DALL-E to Stable Diffusion, from Claude to Gemini: all are built on the Transformer architecture. The secret of their success is the self-attention mechanism, which captures long-range dependencies without the limitations of recurrent networks.
What You Will Learn
- Query, Key, Value: the three linear projections
- Scaled Dot-Product Attention: the central formula
- Why divide by the square root of d_k
- Multi-Head Attention: multiple perspectives
- Positional Encoding: adding order without recurrence
- Step-by-step NumPy implementation
Q, K, V Projections: Three Perspectives on Data
The attention mechanism operates on three linear transformations of the input \mathbf{X} \in \mathbb{R}^{n \times d_{\text{model}}}:

\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V

where \mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d_{\text{model}} \times d_k} and \mathbf{W}^V \in \mathbb{R}^{d_{\text{model}} \times d_v}.
Intuition:
- Query (\mathbf{Q}): "what am I looking for?" - the question each token asks
- Key (\mathbf{K}): "what do I offer?" - each token's label
- Value (\mathbf{V}): "what is my content?" - the actual information
The attention mechanism computes similarity between each Query and all Keys, uses these similarities as weights, and combines the corresponding Values.
Scaled Dot-Product Attention
The central formula of the Transformer:

\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}
Step by step:
- \mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}: similarity matrix between all tokens (dot products)
- Division by \sqrt{d_k}: scaling for numerical stability
- Row-wise softmax: converts scores into weights (probabilities) summing to 1
- Multiplication by \mathbf{V}: weighted average of the Values
Why the Scaling Factor?
Without \sqrt{d_k}, the dot product grows with the vector dimension. If q and k have i.i.d. components with mean 0 and variance 1, then:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \quad \mathbb{E}[q \cdot k] = 0, \quad \text{Var}(q \cdot k) = d_k

For large d_k, the dot product therefore takes very large or very small values, pushing softmax into saturation regions (near-zero gradients). Dividing by \sqrt{d_k} brings the variance back to 1.
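The variance argument above is easy to verify empirically. The following sketch (sample sizes and dimensions chosen arbitrarily for the demo) estimates the variance of q · k before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check: Var(q . k) grows like d_k for i.i.d. standard-normal
# components, while dividing by sqrt(d_k) keeps the variance near 1.
for d_k in (4, 64, 512):
    q = rng.standard_normal((20_000, d_k))
    k = rng.standard_normal((20_000, d_k))
    dots = np.einsum("ij,ij->i", q, k)  # one dot product per row
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 3))
```

The unscaled variance tracks d_k almost exactly, which is why the raw logits saturate softmax as dimensions grow.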
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled Dot-Product Attention."""
    d_k = Q.shape[-1]
    # 1. Compute similarity scores
    scores = Q @ K.T
    scores = scores / np.sqrt(d_k)
    # 2. Apply mask (optional, for the decoder)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)
    # 3. Softmax to get attention weights
    attention_weights = softmax(scores)
    # 4. Weighted average of the Values
    output = attention_weights @ V
    return output, attention_weights
# Example: sequence of 4 tokens, dimension 8
np.random.seed(42)
seq_len, d_model, d_k = 4, 8, 8
X = np.random.randn(seq_len, d_model)
# Projection matrices
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_k) * 0.1
# Projections
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
# Attention
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Input shape: {X.shape}")
print(f"Attention weights:\n{np.round(weights, 3)}")
print(f"Output shape: {output.shape}")
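The `mask` argument deserves a quick illustration: in a decoder, a causal mask prevents each token from attending to future positions. A minimal self-contained sketch (the softmax helper is repeated so the snippet runs on its own; sizes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
n, d_k = 4, 8
Q, K, V = (np.random.randn(n, d_k) for _ in range(3))

# Causal mask: token i may attend only to positions j <= i
mask = np.tril(np.ones((n, n)))

scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(mask == 0, -1e9, scores)  # block future positions
weights = softmax(scores)

# The upper triangle of the weight matrix is zero: no information
# flows backward from future tokens.
print(np.round(weights, 3))
```

Setting masked scores to a large negative number (rather than exactly -inf) is a common trick: after softmax those entries underflow to zero while avoiding NaN issues.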
Multi-Head Attention: Multiple Perspectives
Instead of a single attention mechanism, Transformers use h parallel heads, each with its own projection matrices:

\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V), \quad \text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,\mathbf{W}^O

Each head has dimension d_k = d_{\text{model}} / h. With 8 heads and d_{\text{model}} = 512, each head operates in a 64-dimensional space.
Why multiple heads? Each head can capture a different type of relationship: one might focus on syntax, another on semantics, another on co-reference.
import numpy as np

def multi_head_attention(X, n_heads, d_model):
    """Simplified Multi-Head Attention implementation."""
    d_k = d_model // n_heads
    heads_output = []
    all_weights = []
    for h in range(n_heads):
        # Per-head projection matrices (randomly initialized for the demo)
        W_Q = np.random.randn(d_model, d_k) * 0.1
        W_K = np.random.randn(d_model, d_k) * 0.1
        W_V = np.random.randn(d_model, d_k) * 0.1
        Q = X @ W_Q
        K = X @ W_K
        V = X @ W_V
        # Scaled dot-product attention
        scores = Q @ K.T / np.sqrt(d_k)
        weights = softmax(scores)
        head_out = weights @ V
        heads_output.append(head_out)
        all_weights.append(weights)
    # Concatenate heads
    concat = np.concatenate(heads_output, axis=-1)  # (seq_len, d_model)
    # Output projection
    W_O = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_O
    return output, all_weights
np.random.seed(42)
seq_len, d_model, n_heads = 6, 16, 4
X = np.random.randn(seq_len, d_model)
output, weights = multi_head_attention(X, n_heads, d_model)
print(f"Input: {X.shape}, Output: {output.shape}")
print(f"Number of heads: {n_heads}, d_k per head: {d_model // n_heads}")
Positional Encoding: Order Without Recurrence
Attention is permutation-invariant: it does not distinguish token order. To add positional information, a positional encoding based on sinusoidal functions is used:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
Why sine and cosine? Because PE_{pos+k} can be expressed as a linear transformation of PE_{pos}, allowing the model to easily learn to "look" at fixed relative distances.
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal Positional Encoding."""
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)  # Even indices: sine
    PE[:, 1::2] = np.cos(position * div_term)  # Odd indices: cosine
    return PE
# Generate positional encoding
max_len, d_model = 100, 64
PE = positional_encoding(max_len, d_model)
print(f"PE shape: {PE.shape}")
print(f"PE[0, :8]: {np.round(PE[0, :8], 4)}")
print(f"PE[1, :8]: {np.round(PE[1, :8], 4)}")
# Final embedding is: token_embedding + positional_encoding
token_embedding = np.random.randn(10, d_model) # 10 tokens
input_with_position = token_embedding + PE[:10]
print(f"\nEmbedding + PE shape: {input_with_position.shape}")
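The relative-position property mentioned earlier can also be checked numerically: for each (sin, cos) frequency pair, PE at position pos + k is a fixed 2×2 rotation of PE at position pos, independent of pos. A self-contained sketch (the encoding function is repeated so the snippet runs standalone; the offset k = 5 is arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)
    PE[:, 1::2] = np.cos(position * div_term)
    return PE

d_model, k = 8, 5
PE = positional_encoding(100, d_model)
freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))

# For each frequency w: [sin((p+k)w), cos((p+k)w)] = R_k @ [sin(pw), cos(pw)],
# where the rotation R_k depends only on the offset k, not on the position p.
for i, w in enumerate(freqs):
    R_k = np.array([[np.cos(k * w), np.sin(k * w)],
                    [-np.sin(k * w), np.cos(k * w)]])
    for p in (0, 10, 40):
        pair_p = PE[p, 2 * i:2 * i + 2]
        pair_pk = PE[p + k, 2 * i:2 * i + 2]
        assert np.allclose(R_k @ pair_p, pair_pk)
print("PE(pos + k) is a fixed linear map of PE(pos) for every pos")
```

This is exactly the angle-addition identity for sine and cosine, which is why a model can attend at fixed relative offsets with a single learned linear map.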
Layer Normalization and Residual Connections
Each Transformer sub-layer uses residual connections and layer normalization:

\mathbf{y} = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))

Layer normalization normalizes along the feature dimension:

\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta

where \mu and \sigma are the mean and standard deviation computed per sample across features, and \gamma, \beta are learnable parameters.
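A minimal sketch of one residual + LayerNorm step in NumPy (the random linear map standing in for the sub-layer, and all names here, are illustrative, not part of any library API):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize along the feature dimension (last axis), per sample
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

np.random.seed(0)
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

sublayer_out = x @ np.random.randn(d_model, d_model) * 0.1  # stand-in sub-layer
y = layer_norm(x + sublayer_out, gamma, beta)  # residual + LayerNorm

print(np.round(y.mean(axis=-1), 6))  # ~0 per token
print(np.round(y.std(axis=-1), 6))   # ~1 per token
```

After normalization, every token's feature vector has roughly zero mean and unit variance, which keeps activations in a well-behaved range as layers stack.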
Summary
Key Takeaways
- Attention: \text{softmax}(QK^T / \sqrt{d_k})\,V - weighted similarity between tokens
- Scaling by \sqrt{d_k}: prevents softmax saturation
- Multi-head: h parallel heads capture different relationships
- Positional Encoding: sine/cosine add sequential order
- Residual + LayerNorm: stabilize training of deep networks
- The entire architecture is composed of linear algebra operations and softmax
In the Next Article: we will explore data augmentation and synthetic data generation techniques. SMOTE, Mixup, augmentation for images and text, and when augmentation actually helps.