Introduction: The Architecture That Changed Everything
The Transformer, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized deep learning by completely eliminating the recurrence of RNNs. In its place, the self-attention mechanism allows every element of the sequence to directly "look at" all other elements, capturing long-range dependencies without the vanishing gradient bottleneck.
Since its introduction, the Transformer has become the dominant architecture not only in NLP (BERT, GPT, T5) but also in computer vision (Vision Transformer), audio (Whisper), and image generation (DALL-E, Stable Diffusion). Understanding this architecture is essential for anyone working in modern deep learning.
What You Will Learn
- Self-Attention: how each token "looks at" others in the sequence
- Query, Key, Value: the mechanics of attention
- Multi-Head Attention: capturing different patterns simultaneously
- Positional Encoding: how the Transformer knows sequence order
- Complete Encoder-Decoder architecture
- BERT vs GPT: encoder-only vs decoder-only
- Practical implementation with Hugging Face Transformers
Self-Attention: The Heart of the Transformer
Self-attention (or intra-attention) allows every position in the sequence to compute an attention weight with respect to all other positions. This means that to understand the word "bank" in a sentence, the model can directly look at whether the context contains words like "river" (river bank) or "money" (financial bank).
The mechanism is based on three vectors computed for each token:
- Query (Q): represents "what am I looking for" - the question each token asks of others
- Key (K): represents "what do I offer" - the label with which each token presents itself
- Value (V): represents "my content" - the actual information to transmit
The attention score between two tokens is the dot product between the Query of the first and the Key of the second, divided by the square root of the key dimension (√d_k) to keep the scores in a numerically stable range. After a softmax, these scores become weights that combine the Values into the output.
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    """Scaled Dot-Product Attention: softmax(QK^T / sqrt(d_k)) V"""
    def __init__(self, d_k):
        super().__init__()
        self.scale = math.sqrt(d_k)

    def forward(self, query, key, value, mask=None):
        # query, key, value: (batch, seq_len, d_k)
        scores = torch.matmul(query, key.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, value)
        return output, attention_weights

# Example
batch_size, seq_len, d_model = 2, 10, 64
Q = torch.randn(batch_size, seq_len, d_model)
K = torch.randn(batch_size, seq_len, d_model)
V = torch.randn(batch_size, seq_len, d_model)

attention = ScaledDotProductAttention(d_k=d_model)
output, weights = attention(Q, K, V)
print(f"Output: {output.shape}")   # [2, 10, 64]
print(f"Weights: {weights.shape}") # [2, 10, 10]
Multi-Head Attention
A single attention mechanism captures only one type of relationship between tokens. Multi-Head Attention performs attention in parallel with different linear projections (heads), allowing the model to simultaneously capture syntactic, semantic, positional, and coreference relationships.
Each head operates on a sub-dimension of the embedding space: if d_model=512 and we have 8 heads, each head works on d_k=64 dimensions. The results are concatenated and projected through a final linear transformation.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Linear projections, then split into heads: (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed for all heads at once
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)
        # Concatenate heads and apply the final projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(context)

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(4, 20, 512)  # batch=4, seq=20, dim=512
output = mha(x, x, x)
print(f"MHA output: {output.shape}")  # [4, 20, 512]
Positional Encoding
Unlike RNNs that process tokens sequentially, the Transformer processes all tokens in parallel. Without positional information, the model treats the sequence as an unordered set. Positional encoding adds position information to each token's embedding.
The original paper uses sinusoidal functions of different frequencies across the embedding dimensions. The authors hypothesized that this scheme may generalize to sequences longer than those seen during training, because for any fixed offset k, the encoding at position pos+k can be expressed as a linear function of the encoding at pos, which lets the model learn to attend by relative position.
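The sinusoidal scheme can be implemented in a few lines. Below is a minimal sketch (the function name `sinusoidal_positional_encoding` is ours, not from any library):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # Frequencies decay geometrically across the even dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=64)
print(pe.shape)  # torch.Size([100, 64])
```

The resulting matrix is simply added to the token embeddings before the first encoder layer; since it is fixed, it adds no trainable parameters.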
Why the Transformer Is Better Than RNNs
Three key advantages: (1) Parallelization - all tokens are processed simultaneously, fully leveraging GPUs. (2) Long-range dependencies - every token can directly "look at" any other token, without sequential propagation. (3) Scalability - the architecture scales efficiently to billions of parameters (GPT-3: 175B; GPT-4: roughly 1.8T by unofficial estimates), which is impractical with RNNs because their sequential training cannot be parallelized across time steps.
Encoder-Decoder Architecture
The original Transformer has an encoder-decoder architecture:
- Encoder: 6 identical layers, each combining Multi-Head Self-Attention and a position-wise Feed-Forward Network, with a residual connection and Layer Normalization around each sub-layer. It processes the entire input sequence at once
- Decoder: 6 identical layers with Masked Self-Attention (which prevents attending to future tokens), Cross-Attention over the encoder output, and a Feed-Forward Network. It generates the output autoregressively, one token at a time
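The masking in the decoder's self-attention is just a lower-triangular matrix applied to the attention scores before the softmax. A minimal illustration (the helper name `causal_mask` is ours; this is not a full decoder layer):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])

# Applied to attention scores: future positions become -inf before the softmax,
# so they receive exactly zero attention weight
scores = torch.randn(4, 4)
masked = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(masked, dim=-1)
print(weights[0])  # the first token can only attend to itself: [1., 0., 0., 0.]
```

This same mask can be passed to the `mask` argument of the attention modules shown earlier.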
BERT vs GPT: Two Philosophies
BERT (Encoder-Only)
BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder. During pre-training, it randomly masks 15% of tokens and predicts them (Masked Language Modeling), learning bidirectional representations. It excels at understanding tasks: classification, NER, question answering.
GPT (Decoder-Only)
GPT (Generative Pre-trained Transformer) uses only the decoder with masked attention. Pre-trained to predict the next token (Causal Language Modeling), it excels at text generation. GPT-3 and GPT-4 demonstrated emergent capabilities as scale increases.
from transformers import pipeline, AutoTokenizer, AutoModel

# Sentiment analysis with Hugging Face pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was absolutely fantastic!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
text = generator("Deep learning is", max_length=50, num_return_sequences=1)
print(text[0]['generated_text'])

# BERT embeddings for downstream tasks
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("The transformer architecture is revolutionary",
                   return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state: (1, seq_len, 768)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
print(f"CLS embedding: {cls_embedding.shape}")  # [1, 768]
Vision Transformer (ViT) and Beyond
The success of Transformers in NLP inspired their application in other domains. The Vision Transformer (ViT) divides an image into patches (typically 16x16 pixels), treats each patch as a token, and applies the standard Transformer encoder. With modest training data ViT lags behind CNNs, which benefit from built-in inductive biases like locality; but when pre-trained on massive datasets it matches and then surpasses them.
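The patch-splitting step is the only part of ViT that differs from the text Transformer. A common trick, sketched below, is to implement it as a strided convolution (the class name `PatchEmbedding` is ours; real ViT implementations also prepend a class token and add positional embeddings):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to d_model.

    A Conv2d with kernel_size == stride == patch_size is equivalent to
    flattening each patch and applying a shared linear layer.
    """
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                     # (batch, d_model, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model)

patches = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```

After this step, the 196 patch tokens are processed exactly like word tokens in the text encoder.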
Today Transformers are the foundation of: language models (GPT-4, Claude, Llama), image generation (DALL-E, Stable Diffusion), speech recognition (Whisper), robotics (RT-2), and multimodal models (GPT-4V, Gemini). The architecture has proven to be a universal foundation for modern artificial intelligence.
Next Steps in the Series
- In the next article we will explore GANs (Generative Adversarial Networks)
- We will see how two competing networks generate realistic synthetic data
- We will analyze DCGAN, StyleGAN, and the challenges of adversarial training