## Introduction: From Experiment to Production
Using an LLM in the playground is easy. Bringing it to production is an entirely different engineering challenge. In production you need to handle rate limiting, retries on errors, caching to cut costs, monitoring to track latency and quality, fallbacks when a provider goes down, and monthly budgets that can balloon without controls.
This article covers the entire journey: from the APIs of major providers (OpenAI, Anthropic) to deploying open source models, with proven architectural patterns for robust and scalable LLM applications.
## What You'll Learn in This Article
- OpenAI and Anthropic APIs: setup, models, and pricing
- Deploying open source models with Ollama and vLLM
- Production patterns: retry, caching, rate limiting
- Streaming for a reactive UX
- Fallback and multi-provider strategies
- Monitoring, logging, and cost management
## OpenAI API: The Market Leader
OpenAI offers the most mature and widespread API ecosystem. GPT-4 and GPT-4o models represent the de facto standard for many applications, with extensive documentation and an active community.
```python
# Complete OpenAI API setup with error handling
import os
import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # Read from environment, never hardcode
    timeout=30.0,   # Timeout in seconds
    max_retries=3   # SDK-level automatic retries
)

def call_openai_with_retry(
    messages: list,
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0
) -> str:
    """Call OpenAI with exponential backoff on errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.7,
                max_tokens=1000
            )
            return response.choices[0].message.content
        except RateLimitError:
            delay = base_delay * (2 ** attempt)
            print(f"Rate limited. Retry in {delay}s...")
            time.sleep(delay)
        except APITimeoutError:
            print(f"Timeout. Attempt {attempt + 1}/{max_retries}")
            time.sleep(base_delay)
        except APIError as e:
            # Not every APIError carries a status code (e.g. connection errors)
            status = getattr(e, "status_code", None)
            print(f"API error: {status} - {e.message}")
            if status is not None and status >= 500:
                time.sleep(base_delay * (2 ** attempt))
            else:
                raise
    raise Exception("Max retries exceeded")

# Usage
result = call_openai_with_retry(
    messages=[{"role": "user", "content": "Explain the Repository pattern"}]
)
```
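Retries handle rate limits reactively, after the provider has already rejected a request. A client-side limiter can avoid hitting the limit in the first place. A minimal token-bucket sketch (the capacity and refill rate below are illustrative, not OpenAI's actual limits):

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow bursts of up to `capacity` requests,
    refilled at `rate` requests per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        """Block until a request slot is available."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the missing fraction of a token
            time.sleep((1 - self.tokens) / self.rate)

# e.g. bursts of at most 5 requests, sustained 2 requests/second
limiter = TokenBucket(capacity=5, rate=2)
# Call limiter.acquire() before each call_openai_with_retry(...)
```

The bucket smooths traffic so concurrent workers share one request budget instead of each retrying independently.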
## Anthropic API: Safety and Reliability
Anthropic offers the Claude model family, with a focus on safety, hallucination reduction, and long context windows (up to 200K tokens). The API is structurally similar to OpenAI's, with two key differences: the system prompt is a top-level parameter rather than a message, and `max_tokens` is required.
```python
# Anthropic API setup with streaming
from anthropic import Anthropic

client = Anthropic(api_key="sk-ant-...")  # Better from environment variable

# Basic call
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system="You are a software architecture expert. Respond concisely.",
    messages=[
        {"role": "user", "content": "Compare monolith vs microservices"}
    ]
)
print(response.content[0].text)
print(f"Tokens used: {response.usage.input_tokens} in + {response.usage.output_tokens} out")

# Streaming for reactive UX
def stream_claude_response(prompt: str) -> str:
    """Stream the response token by token for a reactive UX."""
    full_response = ""
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text
    print()  # final newline
    return full_response

response = stream_claude_response("Write a quick guide to Python testing")
```
## Open Source Models: Freedom and Control
Open source models like Llama 3 and Mistral offer total control over data and infrastructure. No data leaves your environment and there is no per-token cost, but you have to manage the GPU infrastructure yourself.
### Ollama: The Simplest Way
Ollama is the fastest way to run open source models locally. A single command downloads and starts the model, exposing an OpenAI-compatible API.
```python
# Using Ollama with the OpenAI-compatible API
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require an API key
)

# Exactly the same interface as OpenAI!
response = ollama_client.chat.completions.create(
    model="llama3.1:8b",  # Local model
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Explain Docker in 3 points"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
# Cost: $0 (only electricity and hardware)
```
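The outline also mentions vLLM, the usual choice when you need to serve an open source model on a GPU server at higher throughput than Ollama. A minimal sketch (model name and flags are illustrative, adjust to your hardware; vLLM likewise exposes an OpenAI-compatible API, by default on port 8000):

```shell
# Install vLLM and serve a model behind an OpenAI-compatible API
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192

# The same OpenAI client then works by pointing base_url at the server:
#   OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
```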
## Production Patterns: Caching
Caching is the most effective strategy to reduce costs in production. If the same question (or a similar one) is asked repeatedly, there's no need to call the LLM every time.
```python
# Caching system for LLM responses
import hashlib
import json
from datetime import datetime, timedelta

class LLMCache:
    """Simple in-memory cache for LLM responses with TTL."""

    def __init__(self, ttl_hours: int = 24):
        self.cache: dict = {}
        self.ttl = timedelta(hours=ttl_hours)
        self.hits = 0
        self.misses = 0

    def _make_key(self, model: str, messages: list, temperature: float) -> str:
        """Generate a deterministic cache key."""
        content = json.dumps({
            "model": model,
            "messages": messages,
            "temperature": temperature
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, model: str, messages: list, temperature: float) -> str | None:
        """Look up in cache. Returns None on miss."""
        key = self._make_key(model, messages, temperature)
        if key in self.cache:
            entry = self.cache[key]
            if datetime.now() - entry["timestamp"] < self.ttl:
                self.hits += 1
                return entry["response"]
            del self.cache[key]  # expired entry
        self.misses += 1
        return None

    def set(self, model: str, messages: list, temperature: float, response: str):
        """Store in cache."""
        key = self._make_key(model, messages, temperature)
        self.cache[key] = {
            "response": response,
            "timestamp": datetime.now()
        }

    def stats(self) -> dict:
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": f"{self.hits / total * 100:.1f}%" if total > 0 else "N/A",
            "cached_entries": len(self.cache)
        }

# Usage
cache = LLMCache(ttl_hours=24)

def cached_llm_call(messages: list, model: str = "gpt-4o") -> str:
    cached = cache.get(model, messages, 0.7)
    if cached is not None:  # an empty string is still a valid cached response
        return cached
    response = call_openai_with_retry(messages, model)
    cache.set(model, messages, 0.7, response)
    return response
```
## Multi-Provider Fallback
In production, relying on a single provider is risky: outages, rate limits, and model deprecations all happen. A multi-provider fallback system keeps the application available even when one provider has issues.
```python
# Multi-provider router with automatic fallback
from anthropic import Anthropic
from openai import OpenAI

class LLMRouter:
    """Router that tries multiple providers in priority order."""

    def __init__(self):
        self.providers = [
            {"name": "anthropic", "client": Anthropic(), "model": "claude-3-5-sonnet-20241022"},
            {"name": "openai", "client": OpenAI(), "model": "gpt-4o"},
            {"name": "ollama", "client": OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"), "model": "llama3.1:8b"},
        ]

    def call(self, messages: list, max_tokens: int = 1000) -> dict:
        """Try each provider in order. Return on first success."""
        errors = []
        for provider in self.providers:
            try:
                if provider["name"] == "anthropic":
                    response = provider["client"].messages.create(
                        model=provider["model"],
                        max_tokens=max_tokens,
                        messages=messages
                    )
                    content = response.content[0].text
                else:
                    # OpenAI-compatible providers (OpenAI itself and Ollama)
                    response = provider["client"].chat.completions.create(
                        model=provider["model"],
                        messages=messages,
                        max_tokens=max_tokens
                    )
                    content = response.choices[0].message.content
                return {
                    "content": content,
                    "provider": provider["name"],
                    "model": provider["model"]
                }
            except Exception as e:
                errors.append(f"{provider['name']}: {e}")
                continue
        raise Exception(f"All providers failed: {errors}")

# Usage
router = LLMRouter()
result = router.call([{"role": "user", "content": "Hello!"}])
print(f"Response from {result['provider']}: {result['content']}")
```
## Monitoring and Cost Management
Without monitoring, LLM API costs can spiral out of control quickly. A tracking system is essential to maintain control.
### Cost Comparison by Provider (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Best all-round |
| GPT-4o-mini | $0.15 | $0.60 | Excellent quality/price ratio |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K context, safety |
| Claude 3.5 Haiku | $0.25 | $1.25 | Fast and economical |
| Llama 3.1 8B (Ollama) | $0 | $0 | Fixed hardware cost |
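The table makes routing decisions concrete. For an illustrative workload of 10M input and 2M output tokens per month, the arithmetic from the table above:

```python
# Monthly cost from the pricing table (prices per 1M tokens: input, output)
def monthly_cost(input_m: float, output_m: float, in_price: float, out_price: float) -> float:
    """Cost in USD for input_m / output_m millions of tokens."""
    return input_m * in_price + output_m * out_price

gpt4o = monthly_cost(10, 2, 2.50, 10.00)      # 10 * 2.50 + 2 * 10.00 = 45.00
gpt4o_mini = monthly_cost(10, 2, 0.15, 0.60)  # 10 * 0.15 + 2 * 0.60 = 2.70
print(f"GPT-4o: ${gpt4o:.2f}/month, GPT-4o-mini: ${gpt4o_mini:.2f}/month")
```

Routing simple queries to the cheaper model cuts this hypothetical bill by over 90%, which is why model routing is usually the first cost optimization to try.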
```python
# Cost monitoring system
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class UsageTracker:
    """Track LLM API usage and costs."""
    daily_budget_usd: float = 50.0
    records: list = field(default_factory=list)

    # Pricing per 1M tokens (input, output)
    PRICING = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "claude-3-5-sonnet-20241022": (3.00, 15.00),
        "claude-3-5-haiku-20241022": (0.25, 1.25),
    }

    def log_usage(self, model: str, input_tokens: int, output_tokens: int):
        pricing = self.PRICING.get(model, (0, 0))
        cost = (input_tokens * pricing[0] + output_tokens * pricing[1]) / 1_000_000
        self.records.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost
        })
        # Alert if near budget
        daily_total = self.get_daily_cost()
        if daily_total > self.daily_budget_usd * 0.8:
            print(f"ALERT: {daily_total:.2f}/{self.daily_budget_usd} USD daily budget!")

    def get_daily_cost(self) -> float:
        today = datetime.now().date()
        return sum(
            r["cost_usd"] for r in self.records
            if r["timestamp"].date() == today
        )

    def report(self) -> dict:
        return {
            "total_requests": len(self.records),
            "total_cost": f"${sum(r['cost_usd'] for r in self.records):.2f}",
            "daily_cost": f"${self.get_daily_cost():.2f}"
        }
```