Scaling ML on Kubernetes: Production Deployment and Orchestration
Your machine learning model has passed all offline tests, the metrics are excellent, and FastAPI serving works perfectly locally. Then the critical moment arrives: you need to handle 10,000 requests per second, scale dynamically based on load, guarantee high availability, and achieve zero-downtime updates. A single container is no longer enough.
Kubernetes has become the de facto standard for orchestrating ML workloads in production, with 78% of enterprise organizations using it for model deployment according to the CNCF Survey 2025. But simply placing a model in a Pod is not sufficient: you need to manage GPU scheduling, event-driven autoscaling, resource quotas, canary deployments, and monitoring specialized for inference. The MLOps market, projected to reach $4.38 billion in 2026 at a CAGR of 39.8%, has Kubernetes as its primary scaling engine.
In this article, we explore the complete architecture for bringing ML models to Kubernetes in production: from KServe and Seldon Core for inference serving, to GPU scheduling with NVIDIA Device Plugin, to intelligent autoscaling with HPA, VPA and KEDA, through to monitoring with Prometheus and Grafana.
What You'll Learn
- Why Kubernetes is the standard for ML in production and when to use it
- GPU scheduling and sharing with NVIDIA Device Plugin and MIG
- KServe: deploying InferenceService with canary rollout and scale-to-zero
- Seldon Core v2: composable pipelines and multi-model serving
- Advanced autoscaling: HPA, VPA and KEDA for event-driven ML workloads
- Resource management: requests, limits, priority classes and node affinity
- Monitoring with Prometheus + Grafana specialized for ML inference
- Cost optimization and best practices for GPU clusters
Why Kubernetes for ML in Production
Before diving into technical configuration, it is essential to understand why Kubernetes has established itself as the reference platform for ML workloads, surpassing solutions like bare metal, dedicated VMs, or proprietary cloud services.
ML workloads have unique characteristics compared to traditional web applications. Training requires massive GPUs for hours or days, then resources must be released. Inference has unpredictable spikes and critical latency requirements. Models must be updated without downtime. Datasets can be enormous and require specialized storage. Kubernetes addresses all these scenarios with native primitives: Pod scheduling on specific GPU nodes, Persistent Volumes for datasets, Jobs for batch training, HorizontalPodAutoscaler for inference scaling.
Kubernetes vs Cloud-Managed Alternatives
- Self-managed or managed Kubernetes (EKS, GKE, AKS): full control, multi-cloud portability, optimizable costs, but high operational complexity.
- SageMaker / Vertex AI / Azure ML: fast setup, cloud-native integration, but vendor lock-in, higher long-term costs, and less flexibility for custom architectures.
Practical rule: team of fewer than 5 people or a limited budget? Start with managed ML. Team larger than 5 with multiple models in production? Kubernetes pays back the investment in 6-12 months.
Reference Architecture
A Kubernetes cluster for ML in production is typically structured across three distinct layers, each with well-defined responsibilities:
- Infrastructure Layer: CPU nodes for lightweight serving and orchestration, GPU nodes for training and heavy inference, storage nodes for datasets and artifacts. GPU pools are separated with specific node labels.
- Platform Layer: KServe or Seldon Core for inference serving, Kubeflow for training pipelines, MLflow for experiment tracking (see article 4), Argo Workflows for complex orchestration.
- Observability Layer: Prometheus for metrics, Grafana for dashboards, Jaeger for distributed tracing, Loki for log aggregation.
# Namespace structure for ML cluster
# Separate environments and responsibilities
kubectl create namespace ml-training # Training jobs
kubectl create namespace ml-serving # Inference services
kubectl create namespace ml-monitoring # Prometheus, Grafana
kubectl create namespace mlflow # Experiment tracking
kubectl create namespace kubeflow # Pipeline orchestration
# Label nodes for GPU scheduling
kubectl label nodes gpu-node-1 accelerator=nvidia-a100
kubectl label nodes gpu-node-2 accelerator=nvidia-t4
kubectl label nodes cpu-node-1 workload=inference-cpu
GPU Scheduling and Sharing
GPUs are the most expensive resource in an ML cluster. Managing them poorly means wasting
tens of thousands of euros per month. Kubernetes exposes GPUs as schedulable resources
through the NVIDIA Device Plugin, a DaemonSet that automatically detects
GPUs on nodes and registers them as nvidia.com/gpu in the kubelet.
# Install NVIDIA Device Plugin via Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace kube-system \
--set failOnInitError=false
# Verify available GPUs on nodes
kubectl describe nodes | grep -A5 "Allocatable:"
# Expected output:
# nvidia.com/gpu: 8
# cpu: 96
# memory: 768Gi
# Pod requesting 1 full GPU
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
    resources:
      limits:
        nvidia.com/gpu: 1     # Request 1 full GPU
        cpu: "8"
        memory: "32Gi"
      requests:
        nvidia.com/gpu: 1
        cpu: "4"
        memory: "16Gi"
  nodeSelector:
    accelerator: nvidia-a100  # Force scheduling on A100
For inference workloads that do not require a full GPU, NVIDIA offers two GPU sharing strategies: Time-Slicing and Multi-Instance GPU (MIG).
# Time-Slicing configuration (for T4/V100 GPUs, software sharing)
# Each physical GPU is divided into N logical replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # 4 pods share 1 physical GPU
# Point the device plugin at the config (when using the NVIDIA GPU Operator)
kubectl patch clusterpolicies/cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'
# MIG configuration for A100/H100 (hardware isolation)
# Partition an A100 80GB into 7 MIG instances of 10GB each
# (profile 19 = 1g.10gb; -C also creates the compute instances)
nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
# Pod using one MIG slice
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1  # Use 1 MIG instance of 10GB
Time-Slicing vs MIG: When to Use Which
Time-Slicing: suitable for lightweight inference workloads (models <2GB VRAM), introduces latency from context switching. Works on any NVIDIA GPU. MIG: complete hardware isolation, guaranteed dedicated memory, zero interference between workloads. Available only on A100, A30, H100. Ideal for stringent latency SLAs. Never combine both approaches on the same node.
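The economics of sharing are easy to quantify. The sketch below (Python; the $3/hour A100 rate is an illustrative assumption, not vendor pricing) compares the effective monthly GPU cost attributed to one model under each strategy:

```python
# Effective per-model GPU cost under sharing strategies.
# The $3/hour A100 rate is an illustrative assumption, not vendor pricing.

HOURS_PER_MONTH = 730

def monthly_cost_per_model(gpu_hourly_rate: float, models_per_gpu: int) -> float:
    """Monthly GPU cost attributed to one model sharing the card."""
    return gpu_hourly_rate * HOURS_PER_MONTH / models_per_gpu

dedicated = monthly_cost_per_model(3.0, 1)    # one model, one full GPU
time_sliced = monthly_cost_per_model(3.0, 4)  # 4 replicas via time-slicing
mig = monthly_cost_per_model(3.0, 7)          # 7x 1g.10gb MIG instances

print(f"dedicated GPU:  ${dedicated:,.2f}/month")    # $2,190.00
print(f"time-sliced /4: ${time_sliced:,.2f}/month")  # $547.50
print(f"MIG 1g.10gb /7: ${mig:,.2f}/month")
```

The arithmetic is trivial, but it is the argument you will make to finance: a lightweight model that tolerates time-slicing costs a quarter of a dedicated card.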
KServe: Native Inference Serving on Kubernetes
KServe (formerly KFServing) is the CNCF standard for inference serving on
Kubernetes, born from collaboration between Google, IBM, Bloomberg and others. It provides
an InferenceService abstraction that hides deployment complexity, automatically
handling canary rollouts, scale-to-zero, request-based autoscaling, and support for multiple
frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost, ONNX, Hugging Face).
# Install KServe (version 0.13+)
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
# Verify installation
kubectl get pods -n kserve
# kserve-controller-manager-xxx Running
# kserve-gateway-xxx Running
# InferenceService for scikit-learn model
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "churn-predictor"
  namespace: ml-serving
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 50   # Target: 50 req/sec per replica
    scaleMetric: rps  # Scale based on requests-per-second
    sklearn:
      storageUri: "gs://my-ml-bucket/models/churn-model/v3"
      runtimeVersion: "1.5.2"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
One of KServe's key strengths is native support for Canary Rollouts: you can send a percentage of traffic to the new model version while the rest continues using the stable one, exactly like the A/B testing we covered in the previous article of this series.
# Canary Rollout: 20% traffic to new version
# InferenceGraph for explicit traffic splitting
apiVersion: "serving.kserve.io/v1alpha1"
kind: "InferenceGraph"
metadata:
  name: "churn-ab-split"
  namespace: ml-serving
spec:
  nodes:
    root:
      routerType: Splitter   # weighted traffic split between steps
      steps:
      - serviceName: churn-predictor-v3
        weight: 80
      - serviceName: churn-predictor-v4
        weight: 20
# Test the endpoint
curl -X POST \
http://churn-predictor.ml-serving.svc.cluster.local/v1/models/churn-predictor:predict \
-H 'Content-Type: application/json' \
-d '{"instances": [[35, 12000, 2, 1, 0.8, 3]]}'
KServe's scale-to-zero feature (based on Knative Serving) is particularly valuable for occasionally used models: the pod shuts down after a configurable inactivity period and automatically restarts on the first request, with a typical cold start under 30 seconds for pre-cached models.
# Scale-to-zero configuration with custom timeout
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "batch-analyzer"
  namespace: ml-serving
  annotations:
    # Keep the idle pod for 5 minutes before scaling to zero
    autoscaling.knative.dev/scale-to-zero-pod-retention-period: "300s"
    # Window for scale-up calculation
    autoscaling.knative.dev/window: "60s"
    # Utilization target (percentage)
    autoscaling.knative.dev/target-utilization-percentage: "70"
spec:
  predictor:
    minReplicas: 0   # Enables scale-to-zero
    maxReplicas: 5
    pytorch:
      storageUri: "gs://my-ml-bucket/models/analyzer/v1"
      runtimeVersion: "2.5.0"
Seldon Core v2: Composable ML Pipelines
While KServe excels at serving individual models, Seldon Core v2 shines in managing complex ML architectures: multi-step pipelines, model ensembles, A/B routing with business logic, and Kafka integration for stream processing. Seldon v2 uses MLServer as its inference runtime, compatible with the V2 protocol (KFServing Inference Protocol), and natively supports PyTorch, scikit-learn, XGBoost, Hugging Face, and custom models.
# Install Seldon Core v2 via Helm (CRDs first, then the runtime)
helm repo add seldonio https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core-v2-crds seldonio/seldon-core-v2-crds \
  --namespace seldon-mesh --create-namespace
helm install seldon-core-v2 seldonio/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set controller.clusterwide=true
# Model: single XGBoost model
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: churn-xgb
  namespace: ml-serving
spec:
  storageUri: "gs://my-ml-bucket/models/churn-xgb/v2"
  requirements:
  - xgboost
  memory: 100Mi
  # Autoscaling bounds for the model replicas
  minReplicas: 1
  maxReplicas: 8
---
# Pipeline: preprocessing + prediction + postprocessing
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: churn-pipeline
  namespace: ml-serving
spec:
  steps:
  - name: preprocessor
    inputs:
    - churn-pipeline.inputs
  - name: churn-xgb
    inputs:
    - preprocessor.outputs
  - name: postprocessor
    inputs:
    - churn-xgb.outputs
  output:
    steps:
    - postprocessor
KServe vs Seldon Core: Which to Choose
- KServe: better for single models, integration with Knative and Istio, native scale-to-zero, CNCF community. Ideal for teams starting with K8s ML.
- Seldon Core v2: better for complex pipelines, ensembles, Kafka integration, multi-model serving. Ideal for advanced ML architectures with business routing logic.
- Both: support V2 protocol, Prometheus monitoring, Triton for GPU serving. They are not mutually exclusive - some organizations use both for different use cases.
Intelligent Autoscaling: HPA, VPA and KEDA
Scaling ML workloads is more complex than traditional web applications. CPU and memory metrics often do not accurately reflect the real load of a model: a GPU-bound model can saturate the graphics card while the CPU sits at 80% idle. Kubernetes offers three complementary autoscaling mechanisms that, when used correctly together, cover all ML scenarios.
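The HPA control loop drives all of the horizontal scaling below with one documented formula: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the configured bounds. A quick sketch makes its behavior easy to predict:

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Kubernetes HPA core formula:
    desired = ceil(currentReplicas * currentMetric / targetMetric),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# 4 replicas observing 800ms P99 against a 500ms target -> scale up
print(hpa_desired_replicas(4, 0.8, 0.5, 2, 20))   # 7
# Load drops to 200ms P99 -> scale down (subject to the stabilization window)
print(hpa_desired_replicas(7, 0.2, 0.5, 2, 20))   # 3
```

Note that the formula is ratio-based: halving the metric roughly halves the replicas, which is why a GPU-bound model scaled on an idle CPU metric never reacts.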
# HPA (Horizontal Pod Autoscaler): scales replica count
# Configuration for inference service based on custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: churn-predictor-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: churn-predictor
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # Scale on CPU (fallback)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale on custom metric: inference latency P99
  - type: Pods
    pods:
      metric:
        name: inference_request_duration_p99
      target:
        type: AverageValue
        averageValue: "500m"   # 500ms P99 latency target
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30    # Fast reaction to traffic
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300   # Slow to remove pods (warm models)
# VPA (Vertical Pod Autoscaler): optimizes resource requests/limits
# NOTE: do not use VPA and HPA on the same CPU/Memory metrics!
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-trainer-vpa
  namespace: ml-training
spec:
  targetRef:
    apiVersion: batch/v1
    kind: Job
    name: model-training
  updatePolicy:
    updateMode: "Off"   # Recommendations only, does not auto-apply
  resourcePolicy:
    containerPolicies:
    - containerName: trainer
      minAllowed:
        cpu: "1"
        memory: 4Gi
      maxAllowed:
        cpu: "16"
        memory: 128Gi
      controlledResources: ["cpu", "memory"]
# Read VPA recommendations
kubectl describe vpa batch-trainer-vpa
# Output:
# Recommendation:
# Container Recommendations:
# Container Name: trainer
# Lower Bound: cpu: 2, memory: 8Gi
# Target: cpu: 6, memory: 32Gi
# Upper Bound: cpu: 12, memory: 64Gi
KEDA (Kubernetes Event-Driven Autoscaler, CNCF graduated project) is the most powerful tool for event-driven ML workloads: it scales pods based on events from message queues, databases, Prometheus metrics, or HTTP triggers, enabling scale-to-zero for batch processing.
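For a queue-length trigger, the effective replica count KEDA converges on can be sketched as follows (a simplification: KEDA actually feeds the metric into an HPA it manages, but the steady-state result is the same):

```python
import math

def keda_queue_replicas(queue_length: int, messages_per_replica: int,
                        min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Queue-length trigger: one replica per `value` messages,
    clamped to [minReplicaCount, maxReplicaCount];
    an empty queue allows scale-to-zero."""
    if queue_length == 0:
        return min_replicas
    desired = math.ceil(queue_length / messages_per_replica)
    return max(min_replicas, min(desired, max_replicas))

print(keda_queue_replicas(0, 10))     # 0  -> scale-to-zero
print(keda_queue_replicas(95, 10))    # 10
print(keda_queue_replicas(500, 10))   # 30 (capped at maxReplicaCount)
```

The `value: "10"` in the ScaledObject below is exactly the `messages_per_replica` divisor here.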
# KEDA: scale ML workers based on inference request queue
# Installation
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
# ScaledObject for batch ML processing from RabbitMQ
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-batch-processor-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: batch-ml-processor
  minReplicaCount: 0    # Scale-to-zero when queue is empty
  maxReplicaCount: 30   # Max 30 workers for GPU cluster
  pollingInterval: 15   # Check queue every 15 seconds
  cooldownPeriod: 60    # Wait 60s before scale-down
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq.ml-serving.svc.cluster.local
      queueName: inference-requests
      mode: QueueLength
      value: "10"       # 1 pod per 10 messages in queue
  # Alternative trigger: Prometheus metric
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: inference_queue_depth
      query: sum(inference_queue_depth{namespace="ml-serving"})
      threshold: "50"   # 1 replica per 50 pending requests
ML Autoscaling Pattern: Recommendation
Combine the three mechanisms with distinct responsibilities:
- HPA on latency/RPS for online inference (reactive, fast)
- VPA in Off mode to optimize training job requests (consult and update manually)
- KEDA for batch processing and event-driven pipelines (scale-to-zero included)
Never use HPA and VPA on the same resource (CPU/Memory) simultaneously: scaling conflicts
cause unpredictable oscillations and resource waste.
Resource Management and Priority Classes
On a cluster shared among different ML teams, resource management is fundamental to prevent a training job from blocking production inference, or an experiment from consuming all available GPUs. Kubernetes offers three tools: ResourceQuota, LimitRange, and PriorityClass.
# ResourceQuota: limits resources per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-serving-quota
  namespace: ml-serving
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 320Gi
    requests.nvidia.com/gpu: "4"   # Max 4 GPUs for inference namespace
    limits.nvidia.com/gpu: "4"
    pods: "50"
---
# LimitRange: sets defaults and limits per container
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-container-limits
  namespace: ml-serving
spec:
  limits:
  - type: Container
    default:          # Default limits if not specified
      cpu: "2"
      memory: 4Gi
    defaultRequest:   # Default requests if not specified
      cpu: "500m"
      memory: 1Gi
    max:              # Maximum per container
      cpu: "8"
      memory: 32Gi
      nvidia.com/gpu: "2"
    min:              # Minimum per container
      cpu: "100m"
      memory: 256Mi
---
# PriorityClass: ensures inference is not preempted by training
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-inference-critical
value: 1000000   # High priority for serving
globalDefault: false
description: "Critical inference services - not preemptable"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-training-batch
value: 100000    # Low priority for training
preemptionPolicy: PreemptLowerPriority
description: "Training jobs - preemptable when necessary"
Monitoring with Prometheus and Grafana for ML
Monitoring an ML system on Kubernetes requires metrics at two levels: standard Kubernetes infrastructure metrics (CPU, memory, network) and ML inference-specific metrics (latency per model, throughput, error rate, data drift signal). KServe and Seldon automatically expose Prometheus metrics in the standard format.
# Prometheus configuration for scraping KServe metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kserve-inference-monitor
  namespace: ml-monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
    - ml-serving
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: "true"
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
    honorLabels: true
# KServe automatically exposed metrics:
# kserve_inference_request_total{model_name, namespace, status_code}
# kserve_inference_request_duration_seconds{model_name, quantile}
# kserve_inference_request_size_bytes{model_name}
# kserve_inference_response_size_bytes{model_name}
# PrometheusRule: alert for high latency
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-inference-alerts
  namespace: ml-monitoring
spec:
  groups:
  - name: ml-inference.rules
    rules:
    - alert: HighInferenceLatency
      expr: |
        histogram_quantile(0.99,
          sum by (le, model_name) (
            rate(kserve_inference_request_duration_seconds_bucket[5m])
          )
        ) > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 latency > 1s for {{ $labels.model_name }}"
        description: "Model {{ $labels.model_name }} has P99 latency of {{ $value }}s"
    - alert: ModelErrorRateHigh
      expr: |
        sum by (model_name) (rate(kserve_inference_request_total{status_code!="200"}[5m]))
          / sum by (model_name) (rate(kserve_inference_request_total[5m])) > 0.05
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Error rate > 5% for {{ $labels.model_name }}"
# Grafana Dashboard: key queries for ML inference monitoring
# 1. Model throughput (req/sec)
rate(kserve_inference_request_total[5m])
# 2. P50, P95, P99 latency
histogram_quantile(0.50, rate(kserve_inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(kserve_inference_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(kserve_inference_request_duration_seconds_bucket[5m]))
# 3. GPU Utilization per node (requires NVIDIA DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL{namespace="ml-serving"}
# 4. GPU Memory usage
DCGM_FI_DEV_FB_USED{namespace="ml-serving"} /
DCGM_FI_DEV_FB_TOTAL{namespace="ml-serving"} * 100
# 5. Active replicas per model
kube_deployment_status_replicas_available{
  namespace="ml-serving",
  deployment=~".*-predictor.*"
}
# 6. Scaling events (useful for autoscaler debugging)
kube_horizontalpodautoscaler_status_desired_replicas{
  namespace="ml-serving"
}
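The `histogram_quantile` calls above estimate percentiles by linear interpolation within the cumulative buckets a `_bucket` series exposes. A pure-Python sketch of the same estimate (simplified: real PromQL also handles missing and zero-width buckets) helps demystify what the Grafana panels display:

```python
# Pure-Python sketch of PromQL's histogram_quantile():
# estimate quantile q from cumulative (upper_bound, count) buckets,
# interpolating linearly inside the bucket where the rank falls.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    buckets = sorted(buckets)       # +Inf sorts last
    total = buckets[-1][1]          # +Inf bucket holds the total count
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls in the +Inf bucket: return last finite bound
                return prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Latency buckets in seconds, cumulative request counts (like a _bucket series)
buckets = [(0.1, 400), (0.5, 900), (1.0, 980), (float("inf"), 1000)]
print(histogram_quantile(0.95, buckets))  # ~0.8125: inside the 0.5-1.0s bucket
print(histogram_quantile(0.99, buckets))  # 1.0: rank lands in the +Inf bucket
```

The interpolation explains a common gotcha: quantile estimates are only as precise as your bucket boundaries, so define latency buckets around your SLO thresholds.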
Cost Optimization for ML Clusters
GPUs are the dominant cost item in an ML cluster: an NVIDIA A100 SXM4 runs approximately $2-3/hour on cloud, an H100 $4-5/hour. On a 20-GPU cluster, monthly costs can exceed $100,000 once nodes, storage, and networking are included. Cost optimization is not optional.
Cost Optimization Strategies
- Spot/Preemptible instances for training: 60-80% savings for interruption-tolerant jobs. Use frequent checkpoints and Argo Workflows for automatic resume.
- Scale-to-zero for occasional models: KServe with minReplicas=0 eliminates GPU costs when the model receives no traffic.
- GPU Time-Slicing for lightweight inference: 4-8 models per physical GPU reduces cost per model by 4-8x.
- Cluster Autoscaler with mixed node pools: GPU nodes added/removed automatically based on actual cluster load.
- Node consolidation with Karpenter: consolidates pods on fewer nodes before terminating empty ones (20-40% savings on clusters with variable utilization).
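The first two strategies are worth quantifying before proposing them. A rough sketch (Python; the GPU rate and spot discount are illustrative assumptions, not provider pricing):

```python
# Rough monthly-savings sketch for spot training and time-slicing.
# GPU rates and discounts are illustrative assumptions, not provider pricing.

HOURS_PER_MONTH = 730

def monthly(rate_per_gpu_hour: float, gpus: int) -> float:
    """Monthly cost of running `gpus` GPUs continuously at a given rate."""
    return rate_per_gpu_hour * HOURS_PER_MONTH * gpus

# Spot instances for training: assumed 70% discount on an assumed $3/h A100
on_demand = monthly(3.00, 8)
spot = monthly(3.00 * 0.3, 8)
print(f"8 training GPUs on-demand: ${on_demand:,.0f}/month")
print(f"8 training GPUs on spot:   ${spot:,.0f}/month "
      f"({1 - spot / on_demand:.0%} saved)")

# Time-slicing: 12 lightweight models on 3 shared GPUs (4 replicas each)
dedicated = monthly(3.00, 12)
shared = monthly(3.00, 3)
print(f"12 dedicated GPUs:  ${dedicated:,.0f}/month")
print(f"3 time-sliced GPUs: ${shared:,.0f}/month "
      f"({1 - shared / dedicated:.0%} saved)")
```

Spot pricing fluctuates and interruptions are not free (checkpointing overhead, re-queued work), so treat the discount as an upper bound on realizable savings.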
Best Practices and Anti-Patterns
After exploring the technical implementation, here are the lessons learned from ML deployments on Kubernetes in enterprise environments:
Best Practices
- Always specify both resource requests AND limits: without requests, the scheduler cannot correctly place pods. Without limits, an OOM model can destabilize the entire node.
- Use ML-specific readiness probes: a container can be "running" but the model still loading. The readiness probe must verify the model is actually ready to serve.
- Pre-pull images on GPU nodes: PyTorch images with CUDA often exceed 10GB. Use DaemonSets or image pre-caching to avoid high cold starts.
- Separate namespaces by environment: staging and production on different namespaces with distinct ResourceQuotas prevents accidental interference.
- Implement circuit breakers: if a model has error rate > 10%, automatically stop traffic with Istio or a sidecar proxy.
- Explicit model versioning: every InferenceService must have a version tag in the name or labels. Never use "latest" in production.
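The ML-specific readiness probe practice can be made concrete with a minimal sketch (stdlib only; the endpoint paths and the `MODEL` flag are illustrative conventions, not a serving framework's API): liveness reports that the process is up, while readiness returns 503 until the model is actually in memory.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL = {"loaded": False}   # flipped to True once weights are in memory

def load_model():
    # Placeholder for the real load step (e.g. torch.load / joblib.load)
    MODEL["loaded"] = True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":      # liveness: process is up
            self._reply(200, {"status": "alive"})
        elif self.path == "/ready":      # readiness: model can serve
            if MODEL["loaded"]:
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "loading"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code: int, body: dict):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

In the Pod spec, the readinessProbe's httpGet would then target /ready, so Kubernetes withholds traffic until the model has finished loading rather than as soon as the container starts.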
Anti-Patterns to Avoid
- Training on inference nodes: training jobs saturate CPU/GPU and cause latency spikes on production models. Always use separate node pools with taints.
- HPA on CPU for GPU-bound models: CPU may be low while GPU is saturated. Always use custom metrics (latency, RPS, GPU utilization) for GPU workloads.
- No graceful shutdown: ML pods must complete in-flight requests before terminating. Always configure terminationGracePeriodSeconds >= 30s.
- Models baked into Docker images: including model weights in the Docker image makes updates slow and images enormous. Use separate model storage (S3, GCS).
- No disruption budget: without PodDisruptionBudget, a cluster update can remove all model replicas simultaneously.
Budget for Small Teams: Starting Under 5,000 EUR/Year
You do not need an enterprise cluster costing $100K/month to get started with ML on Kubernetes. Here is a realistic stack for a team with a limited budget:
- K3s cluster on cloud VMs (2 nodes, 8 vCPU, 32GB RAM): approximately 150-200 EUR/month. K3s is Rancher's lightweight Kubernetes distribution, perfect for small clusters.
- 1 NVIDIA T4 GPU node (spot instance): 0.35-0.50 EUR/hour, approximately 120-180 EUR/month if used 12 hours/day. Scale-to-zero when not needed.
- KServe + MLflow + Prometheus: all free, open-source, installable with Helm in 30 minutes.
- S3-compatible storage (self-hosted MinIO): zero licensing costs. 100GB of models and datasets: approximately 2-5 EUR/month on cloud object storage.
Estimated total: 300-400 EUR/month, under 5,000 EUR/year for a production-ready environment with GPU, full monitoring, and autoscaling. By comparison, SageMaker with equivalent configuration would cost 3-5x more.
Conclusions
Kubernetes has become the industry standard for ML model deployment in production for good reason: it offers the unique combination of GPU scheduling, event-driven autoscaling, workload isolation, and a specialized tool ecosystem (KServe, Seldon, KEDA) that no other platform can match in terms of flexibility and long-term cost.
The optimal path for those starting out: configure a K3s cluster with one GPU node, install KServe for first model serving, add Prometheus and Grafana for monitoring, and only when the cluster grows beyond 5-10 production models invest in KEDA and Seldon Core for more complex architectures. Kubernetes' complexity pays off only when the volume of workloads justifies it.
In the next article of the series, we explore ML Governance: how to ensure compliance with the EU AI Act, implement explainability with SHAP and LIME, manage audit trails and fairness of models in production.
Related Articles in This Series
- Serving ML Models: FastAPI + Uvicorn in Production - The model before Kubernetes
- A/B Testing ML Models - Canary rollout and traffic splitting
- ML Governance: Compliance, Audit, Ethics - Next article
- Model Drift Detection and Automated Retraining - Advanced monitoring
Cross-Series
- Advanced Deep Learning Series - Training complex models to deploy on K8s
- Computer Vision Series - CV models optimized for GPU inference