Modern ETL vs ELT: dbt, Airbyte and Fivetran
Data pipelines are the circulatory system of every data-driven architecture. Without a reliable, documented and tested flow from source to warehouse, machine learning models produce inaccurate predictions, business reports show inconsistent numbers, and strategic decisions are built on unstable foundations. Yet in most companies, data pipelines are still a tangle of SQL scripts scheduled with cron jobs, manually updated spreadsheets, and ETL processes built decades ago that nobody dares to touch.
The landscape has changed radically over the past five years. The advent of cloud computing overturned the fundamental paradigm of data movement: it no longer makes sense to transform data before loading it into the warehouse, as traditional ETL did, when the cloud's computational power is available at negligible cost and the modern warehouse can execute complex transformations in seconds. This gave rise to the ELT (Extract, Load, Transform) paradigm, which inverts the order of operations by moving transformation inside the warehouse itself.
The data pipeline market reflects this transformation: the ELT segment is growing at 26% annually, with the overall data pipeline tools market estimated at $12 billion in 2024 and projected to reach $48 billion by 2030. Fivetran, the leading managed ELT player, reached $300M ARR with 50% year-over-year growth. This is not a niche: it is the new standard.
In this article we will explore both paradigms in depth, analyze the tools dominating the market (dbt, Airbyte, Fivetran), build a complete reference architecture, and provide practical recommendations for choosing the right stack based on team size and skills.
What You Will Learn in This Article
- The fundamental differences between traditional ETL and modern ELT, with pros/cons and use cases
- How dbt (Data Build Tool) works: SQL models, testing, documentation, lineage
- dbt Core vs dbt Cloud: which to choose and at what cost
- Airbyte: open-source connectors, architecture and deployment for self-hosting
- Fivetran: the managed SaaS model and the new MAR pricing (2025)
- Detailed comparison of the three tools for different business scenarios
- The modern ELT reference architecture: Airbyte/Fivetran + Warehouse + dbt + BI
- Best practices for testing, documentation and pipeline orchestration
Traditional ETL: How It Works and Why It Is No Longer Enough
For almost thirty years, ETL (Extract, Transform, Load) was the dominant paradigm for enterprise data movement. The process unfolds in three sequential phases:
- Extract: Data is extracted from source systems, typically OLTP databases (ERP, CRM, e-commerce), flat files (CSV, XML) or external APIs, without modification.
- Transform: Extracted data is transformed in an intermediate system, called a staging area or ETL engine, separate from both the source and destination. Transformations include cleaning, normalization, enrichment, aggregation and conforming to warehouse standards.
- Load: The transformed data is loaded into the destination data warehouse in its final form, already optimized for analytical queries.
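The three phases can be sketched in a few lines of Python, with sqlite3 standing in for the destination warehouse (table and field names are invented for illustration):

```python
import sqlite3

# Minimal ETL sketch: the transform runs *before* the load, outside the
# warehouse. sqlite3 stands in for the destination; source rows are inlined.

def extract():
    # In a real pipeline this would read from an OLTP database, CSV or API.
    return [
        {"order_id": 1, "amount": "100.50", "status": " SHIPPED "},
        {"order_id": 2, "amount": "250.00", "status": "pending"},
    ]

def transform(rows):
    # Cleaning and normalization happen on the ETL server, not in the warehouse.
    return [
        (r["order_id"], float(r["amount"]), r["status"].strip().lower())
        for r in rows
    ]

def load(rows, conn):
    # Data lands in the warehouse already in its final, query-ready form.
    conn.execute("CREATE TABLE orders (order_id INT, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT status FROM orders ORDER BY order_id").fetchall())
# already clean on arrival: [('shipped',), ('pending',)]
```

Note that the warehouse only ever sees clean data; all compute for cleaning happens on whatever machine runs `transform()`.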
Traditional ETL Tools
| Tool | Vendor | Indicative Cost | Target |
|---|---|---|---|
| SSIS | Microsoft | Included in SQL Server | SMBs in Microsoft ecosystem |
| Informatica PowerCenter | Informatica | $50,000-$500,000/year | Enterprise banking/insurance |
| Oracle Data Integrator | Oracle | Bundled with Oracle DB | Oracle ecosystem |
| Talend Open Studio | Qlik | Free (core) / $1,170+/month | SMBs, open source |
| Pentaho Data Integration | Hitachi Vantara | Free (CE) / Custom (EE) | SMBs, open source |
These tools work. They have been (and in many cases still are) the backbone of critical systems moving billions in transactions every day. But they have structural limitations that the cloud paradigm has made increasingly evident.
Structural Limits of Classic ETL
Why Traditional ETL Struggles in the Modern Context
- Costly external compute: Transformations run on a separate server (ETL server), which must be sized for peak load. If the nightly batch processes 100 million rows, the server must be powerful enough to handle that time window, even if it sits nearly idle for 22 hours a day.
- Rigidity and cost of change: Adding a field to an SSIS transformation requires modifying the visual flow, testing in staging, and coordinating the release. In structured teams this takes weeks.
- No native versioning: ETL flows are not testable, versionable code like application software. Governance becomes difficult: who changed this transformation? When? Why?
- Complex debugging: When a transformation produces unexpected results, tracing the problem through a visual ETL flow can take hours. There is no standard data lineage showing where each column comes from.
- Wasting warehouse power: You invest tens of thousands per year in Snowflake or BigQuery for their computational power, yet process data on a separate server. The ELT paradigm recognizes that the warehouse itself is the most efficient transformation engine.
Modern ELT: Why Cloud Changed Everything
ELT (Extract, Load, Transform) inverts the order of the last two phases: data is first extracted from sources and loaded into the warehouse in raw form, then transformed using the warehouse's own compute power.
This paradigm shift was enabled by three converging factors:
- Cheap cloud storage: Amazon S3, Azure Blob Storage and Google Cloud Storage cost $20-23 per TB per month. There is no longer a reason to be frugal with storage — load everything and decide later what to transform.
- Elastic compute: Snowflake, BigQuery and Databricks auto-scale compute. A complex transformation query executes in seconds leveraging clusters of hundreds of nodes, paying only for actual execution time.
- SQL as the universal language: Data analysts know SQL far better than proprietary ETL tools. With ELT, transformations are plain SQL queries that anyone on the team can read, modify and review.
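The same pipeline in ELT form loads the data raw and transforms it with SQL inside the warehouse — here sketched with sqlite3 standing in for Snowflake or BigQuery, and invented table names:

```python
import sqlite3

# Minimal ELT sketch: raw data is loaded untouched, then transformed with
# plain SQL *inside* the warehouse (sqlite3 is a stand-in here).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "100.50", " SHIPPED "), ("2", "250.00", "pending")],
)

# The "T" is standard SQL that anyone on the team can read and review.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount,
           LOWER(TRIM(status))       AS status
    FROM raw_orders
""")

print(conn.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall())
```

Because the raw table is preserved, the transformation can be re-run, fixed, or extended at any time without re-extracting from the source — a key operational difference from ETL.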
ETL vs ELT: Decision Table
ETL vs ELT: Full Comparison
| Dimension | Traditional ETL | Modern ELT |
|---|---|---|
| Where transformation happens | Dedicated ETL server (external to warehouse) | Inside the data warehouse (native SQL) |
| When to transform | Before loading | After loading into the raw layer |
| Data volume support | Limited by ETL server capacity | Scales with the warehouse (potentially unlimited) |
| Unstructured data | Difficult or impossible | Supported (native JSON, semi-structured) |
| Language | Proprietary GUI / Java / Python | Standard SQL (+ Jinja in dbt) |
| Version control | Difficult, often absent | Git-native (models are .sql files) |
| Testing | Manual or limited | Built-in test framework (dbt test) |
| Data lineage | Often absent or manual | Automatic (visual DAG in dbt) |
| Security / Compliance | Sensitive data never reaches warehouse unmasked | Raw data in warehouse: masking and governance required |
| Transformation latency | Depends on ETL server | Depends on warehouse (batch or on-demand) |
| Learning curve | High (proprietary GUIs) | Low for SQL-proficient teams |
| Ideal for | Sensitive data, legacy systems, strict compliance | Cloud-first, SQL teams, high data volumes |
When to Choose ETL vs ELT
- Choose ETL when: compliance requirements prohibit loading raw data to the cloud (e.g. PII without anonymization), you work with on-premise legacy systems where the warehouse is too distant from the source, or you have computationally intensive transformations that do not benefit from warehouse SQL.
- Choose ELT when: your warehouse is cloud-based (Snowflake, BigQuery, Databricks, Redshift), the team has solid SQL skills, you want to version transformations with Git, and you work with growing data volumes and want to leverage cloud elasticity.
- Hybrid approach: Many companies adopt a mixed approach: ETL for anonymizing sensitive data before loading, ELT for all subsequent analytical transformations.
dbt: The Transformation Layer of the Modern Stack
dbt (Data Build Tool) is the tool that defined the modern ELT paradigm. Created in 2016 by Fishtown Analytics (now dbt Labs), dbt transforms how data analysts write transformations: instead of scattered SQL procedures without structure, dbt introduces a software-engineering-inspired development framework with versioned models, automated tests and auto-generated documentation.
The core concept is simple: every dbt model is a .sql file containing a SELECT statement. dbt handles creating the table or view in the warehouse, managing dependencies between models and building a DAG (Directed Acyclic Graph) of transformations.
dbt Architecture: How It Works
dbt has no runtime of its own: it uses the destination warehouse's compute. It acts as a SQL compiler + orchestrator: it takes .sql files with Jinja macros, compiles them into pure SQL, and executes them on the warehouse in the correct order based on declared dependencies. The result is a set of tables and views in the warehouse, built in a reproducible way.
-- models/staging/stg_orders.sql
-- Staging model: cleaning and standardizing orders
-- dbt creates a view (or table) called 'stg_orders' in the warehouse
WITH source AS (
-- Reference to raw source table (loaded by Airbyte/Fivetran)
SELECT * FROM {{ source('erp', 'raw_orders') }}
),
cleaned AS (
SELECT
order_id::BIGINT AS order_id,
customer_id::INT AS customer_id,
product_id::INT AS product_id,
quantity::INT AS quantity,
unit_price::DECIMAL(10, 2) AS unit_price,
COALESCE(discount, 0.0)::DECIMAL(5, 2) AS discount,
CAST(order_date AS TIMESTAMP) AS order_date,
LOWER(TRIM(status)) AS status,
-- Calculated net amount
quantity * unit_price * (1 - COALESCE(discount, 0))
AS net_amount,
-- Audit metadata
_loaded_at AS ingested_at
FROM source
WHERE order_id IS NOT NULL
AND customer_id IS NOT NULL
AND quantity > 0
AND unit_price > 0
)
SELECT * FROM cleaned
-- models/marts/finance/fct_orders_monthly.sql
-- Fact model: monthly aggregations for the finance team
-- Depends on stg_orders and dim_customers (dependencies resolved automatically)
{{ config(
materialized='table',
schema='finance',
tags=['finance', 'monthly']
) }}
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
SELECT * FROM {{ ref('dim_customers') }}
),
monthly_metrics AS (
SELECT
DATE_TRUNC('month', o.order_date) AS month,
c.region,
c.segment,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.net_amount) AS gross_revenue,
AVG(o.net_amount) AS avg_order_value,
SUM(o.quantity) AS units_sold
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
GROUP BY 1, 2, 3
)
SELECT
*,
gross_revenue / NULLIF(unique_customers, 0) AS revenue_per_customer
FROM monthly_metrics
ORDER BY month DESC, gross_revenue DESC
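dbt resolves the {{ ref() }} calls in models like these into a dependency DAG and runs the models in topological order. A minimal sketch of that mechanism, with hypothetical models and deliberately simplified regex-based parsing (dbt's real parser also handles sources, macros and configs):

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical models: each maps a model name to its (Jinja-templated) SQL.
models = {
    "stg_orders": "SELECT * FROM raw_orders",
    "dim_customers": "SELECT * FROM raw_customers",
    "fct_orders_monthly": "SELECT * FROM {{ ref('stg_orders') }} "
                          "JOIN {{ ref('dim_customers') }} USING (customer_id)",
}

# A model depends on every model it ref()s.
deps = {
    name: set(re.findall(r"ref\('(\w+)'\)", sql))
    for name, sql in models.items()
}

# Topological order: upstream models first, fct_orders_monthly last.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

This is why declaring dependencies with `ref()` instead of hard-coding table names matters: the run order, the lineage graph and the docs all fall out of the same parsed DAG.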
Testing in dbt: Guaranteeing Data Quality
One of dbt's most important strengths is its native testing system. dbt tests are declarative: defined in a YAML file and executed automatically after each run with the dbt test command.
# models/staging/schema.yml
# Test definitions for the stg_orders model
version: 2
models:
- name: stg_orders
description: >
Cleaned and standardized orders from the source ERP system.
Loaded hourly by Airbyte, transformed by this model.
columns:
- name: order_id
description: "Unique order identifier (PK)"
tests:
- unique # No duplicates allowed
- not_null # Every row must have an order_id
- name: customer_id
description: "Reference to the customers dimension"
tests:
- not_null
- relationships: # Referential integrity check
to: ref('dim_customers')
field: customer_id
- name: status
description: "Order status"
tests:
- accepted_values:
values: ['pending', 'confirmed', 'shipped', 'delivered', 'cancelled']
- name: net_amount
description: "Net order amount (quantity * price * (1 - discount))"
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "net_amount >= 0"
- name: order_date
description: "Order timestamp"
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "order_date <= CURRENT_TIMESTAMP"
sources:
- name: erp
description: "Raw data loaded from the ERP via Airbyte"
schema: raw_erp
tables:
- name: raw_orders
loaded_at_field: _loaded_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
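Under the hood, each declarative test compiles to a SQL query that selects violating rows; the test passes when the query returns nothing. A simplified sketch of that idea — the templates and data below are illustrative, not dbt's actual compiled SQL:

```python
import sqlite3

# Sketch: a dbt test compiles to a SELECT returning violating rows.
# The test passes when the query returns zero rows.
NOT_NULL = "SELECT * FROM {model} WHERE {column} IS NULL"
UNIQUE = """
    SELECT {column}, COUNT(*) AS n FROM {model}
    GROUP BY {column} HAVING COUNT(*) > 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INT, status TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                 [(1, "pending"), (2, "shipped"), (2, "shipped")])

failures = conn.execute(
    UNIQUE.format(model="stg_orders", column="order_id")).fetchall()
print("unique test:", "PASS" if not failures else f"FAIL {failures}")
# order_id 2 appears twice, so the unique test fails: FAIL [(2, 2)]
```

The failing rows themselves are the test output, which is what makes debugging a red `dbt test` run fast: the query tells you exactly which records broke the contract.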
dbt Core vs dbt Cloud
dbt comes in two variants: dbt Core (open-source, free) and dbt Cloud (managed SaaS, paid). The choice depends on the team's infrastructure needs and requirement for advanced features.
dbt Core vs dbt Cloud: Full Comparison
| Feature | dbt Core (Open Source) | dbt Cloud Team | dbt Cloud Enterprise |
|---|---|---|---|
| Cost | Free | $100/seat/month + $0.01 per model above 15,000 | Custom (contact sales) |
| Job execution | Manual / External orchestrator (Airflow, Dagster) | Native scheduler, integrated CI/CD | Advanced scheduler + SLA monitoring |
| Web IDE | No (local editor only) | dbt Cloud IDE (browser-based) | dbt Cloud IDE |
| Documentation | Generated locally (dbt docs generate) | Automatically hosted | Hosted + Advanced Data Catalog |
| Visual lineage | Local only | Cloud hosted, shareable | Cloud hosted + Cross-project lineage |
| Team collaboration | Via Git (manual) | Multi-user, PR-based workflow | RBAC, SSO, audit logs |
| Semantic Layer | No | Included (5,000 metrics/month) | Included (20,000 metrics/month) |
| SOC 2 compliance | N/A (self-managed) | Included | Included + PrivateLink |
| Ideal for | Technical teams, limited budget, already have Airflow | 2-20 person teams without ETL infrastructure | Enterprise, multi-team, strict compliance |
The practical recommendation: SMBs with a team of 1-3 data engineers should start with dbt Core + Airflow or Dagster, which offers full flexibility at zero license cost (infrastructure aside). If the team grows or you want to avoid managing orchestration infrastructure, dbt Cloud becomes cost-effective with just 2-3 developer seats.
Airbyte: The Open-Source Ingestion Layer
If dbt handles the T (transform) in ELT, Airbyte manages the E (extract) and L (load). Founded in 2020 with over $180M in funding, Airbyte is the most widely adopted open-source data integration platform on the market, with 600+ pre-built connectors and a Python SDK for building custom connectors.
Airbyte's main strength compared to competitors is the combination of open-source (zero vendor lock-in, modifiable code) with a cloud-native architecture that supports Change Data Capture (CDC) for real-time database replication.
Airbyte Architecture
Airbyte consists of several components working in tandem. Every sync is executed by a Worker that spins up two separate Docker containers: the Source Connector (reading from the source) and the Destination Connector (writing to the destination). Data flows between them via the Airbyte Protocol, a standard JSON format.
# Example: Airbyte connector configuration via API (YAML/JSON)
# Creating a PostgreSQL -> Snowflake connection
# 1. Source configuration (PostgreSQL)
POST /api/v1/sources/create
{
"sourceDefinitionId": "decd338e-5647-4c0b-adf4-da0e75f5a750",
"connectionConfiguration": {
"host": "db.production.mycompany.com",
"port": 5432,
"database": "erp_production",
"username": "airbyte_reader",
"password": "{{ secrets.POSTGRES_PASSWORD }}",
"schemas": ["public", "orders"],
"replication_method": {
"method": "CDC",
"replication_slot": "airbyte_slot",
"publication": "airbyte_publication"
}
},
"name": "ERP Production PostgreSQL",
"workspaceId": "workspace-uuid"
}
# 2. Destination configuration (Snowflake)
POST /api/v1/destinations/create
{
"destinationDefinitionId": "424892c4-daac-4491-b35d-c6688ba547ba",
"connectionConfiguration": {
"host": "abc12345.eu-west-1.snowflakecomputing.com",
"role": "AIRBYTE_ROLE",
"warehouse": "AIRBYTE_WH",
"database": "RAW_DATA",
"schema": "erp",
"username": "airbyte_user",
"credentials": {
"auth_type": "key_pair",
"private_key": "{{ secrets.SNOWFLAKE_PRIVATE_KEY }}"
}
},
"name": "Snowflake Production",
"workspaceId": "workspace-uuid"
}
# 3. Create the connection
POST /api/v1/connections/create
{
"sourceId": "source-uuid",
"destinationId": "destination-uuid",
"syncCatalog": {
"streams": [
{
"stream": {"name": "orders", "namespace": "public"},
"config": {
"syncMode": "incremental",
"destinationSyncMode": "append_dedup",
"cursorField": ["updated_at"],
"primaryKey": [["order_id"]]
}
},
{
"stream": {"name": "customers", "namespace": "public"},
"config": {
"syncMode": "full_refresh",
"destinationSyncMode": "overwrite"
}
}
]
},
"scheduleType": "cron",
"scheduleData": {"cron": {"cronExpression": "0 */1 * * *", "cronTimeZone": "Europe/Rome"}},
"namespaceDefinition": "customformat",
"namespaceFormat": "raw_{{SOURCE_NAMESPACE}}"
}
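Between the two containers, data travels as newline-delimited JSON messages. A simplified sketch of the protocol's RECORD and STATE message types — the field layout follows the Airbyte Protocol, but the payload values here are invented:

```python
import json

# Simplified sketch of the Airbyte Protocol: the source emits newline-delimited
# JSON on stdout; the destination reads it on stdin. RECORD carries data,
# STATE carries a resumable checkpoint (cursor) for incremental syncs.
messages = [
    {"type": "RECORD",
     "record": {"stream": "orders",
                "data": {"order_id": 42, "status": "shipped"},
                "emitted_at": 1735689600000}},
    {"type": "STATE",
     "state": {"data": {"orders": {"cursor": "2025-01-01T00:00:00Z"}}}},
]

wire = "\n".join(json.dumps(m) for m in messages)

# Destination side: write records, checkpoint on STATE messages so a failed
# sync can resume from the last committed cursor instead of restarting.
records, checkpoints = [], []
for line in wire.splitlines():
    msg = json.loads(line)
    if msg["type"] == "RECORD":
        records.append(msg["record"]["data"])
    elif msg["type"] == "STATE":
        checkpoints.append(msg["state"])

print(records, len(checkpoints))
```

This language-agnostic wire format is what lets any pair of source and destination connectors interoperate without knowing anything about each other.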
Airbyte: Deployment Options
Airbyte: Open Source vs Cloud vs Enterprise
| Aspect | Open Source (Self-hosted) | Airbyte Cloud | Airbyte Enterprise |
|---|---|---|---|
| Cost | Free (infrastructure at your expense) | Pay-per-use ($2.50-$10 per credit) | Custom (contact sales) |
| Maintenance | Your team's responsibility | Managed by Airbyte | Managed by Airbyte |
| Connectors | 600+ | 600+ | 600+ + certified premium connectors |
| CDC | Supported | Supported | Supported + advanced CDC |
| RBAC | Basic | Included | Advanced (SSO, audit logs) |
| SLA | N/A | 99.9% uptime | 99.99% uptime |
| Ideal for | Technical teams, limited budget, sensitive data | SMBs wanting speed without ops | Enterprise with strict compliance |
# Deploy Airbyte Open Source with Docker Compose
# Prerequisites: Docker, Docker Compose, 8GB RAM, 20GB disk
# 1. Clone the repository
git clone --depth=1 https://github.com/airbytehq/airbyte.git
cd airbyte
# 2. Start Airbyte (first run: ~15 minutes for image downloads)
./run-ab-platform.sh
# Airbyte is accessible at http://localhost:8000
# Username: airbyte / Password: password (change in production!)
# 3. For Kubernetes production deployment with Helm
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm install airbyte airbyte/airbyte \
--namespace airbyte \
--create-namespace \
--set global.state.storage.type=MINIO \
--set global.storage.bucket.log=airbyte-logs \
--values custom-values.yaml
# 4. Environment variables for production (.env)
# DATABASE_URL=postgresql://airbyte:password@postgres:5432/airbyte
# SECRET_PERSISTENCE=GOOGLE_SECRET_MANAGER
# LOG_LEVEL=INFO
# TRACKING_STRATEGY=segment
Fivetran: The Simplicity of Managed SaaS
While Airbyte focuses on open-source flexibility and control, Fivetran chose the opposite path: maximum simplicity, zero maintenance, enterprise-grade connectors fully managed by the vendor. Founded in 2012, Fivetran is today the leader in managed ELT with $300M ARR, 6,300+ customers and a $5.6 billion valuation.
Fivetran's value proposition is clear: a Fivetran connector to Salesforce, Shopify or any other SaaS system is maintained by a dedicated team of Fivetran engineers who handle every source API change, every breaking change, every schema update. The customer does nothing.
The New Fivetran 2025 Pricing Model
In March 2025, Fivetran significantly updated its pricing model. The Starter and Private Deployment plans were eliminated and replaced with four tiers: Free, Standard, Enterprise and Business Critical. The billing metric remains MAR (Monthly Active Rows), but the calculation changed: each connection is now billed separately, no longer aggregated per account.
Fivetran: Plans and Pricing 2025
| Plan | Included MAR | Additional cost | Key features |
|---|---|---|---|
| Free | 500,000 MAR/month | N/A | All Standard features, up to 5,000 model runs |
| Standard | Unlimited (pay-per-use) | $5 base/connection + MAR pricing | 600+ connectors, CDC, dbt integration, scheduling |
| Enterprise | Negotiable | Custom (volume discount) | SSO/SAML, RBAC, VPN, priority support, SLA |
| Business Critical | Negotiable | Custom | PrivateLink, HIPAA compliance, dedicated support, 99.99% SLA |
How MAR Calculation Works in Fivetran
MAR (Monthly Active Row) counts distinct rows synced in a calendar month, tracked via primary key. A row modified 30 times in a month counts as 1 MAR, not 30. The advantage: cost does not explode with sync frequency, but with the number of unique records that actually change.
Practical example: A company with 50,000 active orders per month and 500,000 products in the catalog (rarely updated) primarily pays for the 50,000 orders that change status each month, not for the entire product catalog.
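The counting logic behind MAR can be sketched directly — the sync events below are invented, but the rule is the one described above: distinct primary keys per table per calendar month:

```python
# Sketch of the MAR metric: distinct primary keys synced in a calendar month,
# regardless of how many times each row changed within that month.
sync_events = [
    ("2025-03", "orders", 1001),
    ("2025-03", "orders", 1001),   # same row synced twice: still 1 MAR
    ("2025-03", "orders", 1002),
    ("2025-03", "products", 501),
    ("2025-04", "orders", 1001),   # counts again in the next month
]

mar_per_month = {}
for month, table, pk in sync_events:
    mar_per_month.setdefault(month, set()).add((table, pk))

for month, rows in sorted(mar_per_month.items()):
    print(month, len(rows))
# 2025-03 -> 3 MAR (order 1001 counted once), 2025-04 -> 1 MAR
```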
dbt vs Airbyte vs Fivetran: Which to Choose?
It is important to understand that dbt, Airbyte and Fivetran are not alternative tools that exclude each other: they solve different problems within the same stack. dbt handles transformations, Airbyte and Fivetran handle ingestion. The right question is: Airbyte vs Fivetran for ingestion?
Airbyte vs Fivetran: Scenario Comparison
| Dimension | Airbyte Open Source | Airbyte Cloud | Fivetran Standard |
|---|---|---|---|
| Base cost | $0 (infrastructure extra) | Pay-per-use | $5/connection/month + MAR |
| Connectors | 600+ (community maintained) | 600+ | 650+ (enterprise-grade) |
| Connector maintenance | Community / internal team | Airbyte team | Fivetran team (guaranteed SLA) |
| CDC | Supported (Debezium) | Supported | Supported (log-based) |
| Custom connectors | Python SDK (free) | Python SDK | Custom connector SDK (paid) |
| Data residency | Full (self-hosted) | Region-specific | Region-specific |
| Setup time | 1-4 hours (infrastructure) | 30 minutes | 15 minutes |
| dbt integration | Manual / Airflow | Native | Native (dbt Cloud) |
| Ideal for | Technical teams, custom sources, local GDPR | Tech-savvy SMBs, variable budget | SMBs/Enterprise with standard SaaS sources |
Practical Decision Rule
- Use Fivetran if your sources are standard SaaS (Salesforce, HubSpot, Shopify, Stripe, Google Analytics, Facebook Ads) and the team does not want to deal with connector maintenance. The productivity gained is worth the cost.
- Use Airbyte Cloud if you have a mix of standard and custom sources, or if you are starting out and want to control costs with pay-per-use.
- Use Airbyte Open Source if you have GDPR/data residency requirements that prevent data from transiting through third-party infrastructure, or if you have highly custom sources requiring internally written connectors.
Modern ELT Reference Architecture
Combining all components, here is the modern ELT architecture adopted by leading companies today. Each layer of the stack has a precise purpose and reference tools.
The Modern ELT Stack: Layer by Layer
| Layer | Function | Primary Tools |
|---|---|---|
| 1. Ingestion | Extract and load raw data into the warehouse | Fivetran, Airbyte, Stitch, custom scripts |
| 2. Storage (Raw) | Unmodified raw data (Bronze layer) | Snowflake, BigQuery, Databricks, Redshift |
| 3. Transformation | SQL models, testing, documentation, lineage | dbt Core, dbt Cloud |
| 4. Serving (Gold) | Tables optimized for analytics and ML | Snowflake, BigQuery, Databricks (warehouse) |
| 5. Orchestration | Scheduling, dependencies, pipeline monitoring | Airflow, Dagster, Prefect, dbt Cloud |
| 6. BI and Reporting | Dashboards, ad-hoc queries, self-service analytics | Looker, Metabase, Power BI, Tableau |
| 7. Data Quality | Monitoring, alerting, anomaly detection | dbt tests, Great Expectations, Monte Carlo |
# Example: Complete ELT pipeline orchestrated with Dagster
from dagster import asset, AssetIn, define_asset_job, ScheduleDefinition
# Asset 1: Raw data (loaded by Fivetran/Airbyte - external to Dagster)
# Dagster can "observe" tables loaded by Fivetran as external assets
@asset(
group_name="raw",
description="Raw orders loaded by Fivetran (ERP)"
)
def raw_orders():
# Fivetran writes here automatically every hour
# This asset "declares" the table for visual lineage
pass
# Asset 2: dbt staging (transformation with dbt)
@asset(
group_name="staging",
deps=["raw_orders"],
description="Cleaned and standardized orders (dbt model stg_orders)"
)
def stg_orders(context):
# Runs the dbt stg_orders model
context.log.info("Running dbt model: stg_orders")
return {"rows_processed": 15000}
# Asset 3: dbt marts (gold model ready for business)
@asset(
group_name="marts",
ins={"staging": AssetIn("stg_orders")},
description="Monthly revenue for finance reporting"
)
def fct_orders_monthly(context, staging):
context.log.info(f"Building monthly metrics from {staging['rows_processed']} rows")
return {"mart_rows": 360}
# Scheduling: runs hourly, updates all assets
elt_pipeline_job = define_asset_job(
name="elt_pipeline",
selection=["stg_orders", "fct_orders_monthly"]
)
elt_schedule = ScheduleDefinition(
job=elt_pipeline_job,
cron_schedule="0 * * * *", # Every hour
execution_timezone="Europe/Rome"
)
Best Practices for Reliable Data Pipelines
Building an ELT pipeline that works in a demo is straightforward. Building one that scales, is monitorable and maintainable by a team over time requires discipline and the application of established engineering practices.
1. Layering and Naming Conventions
Adopt a clear, consistent dbt directory structure that mirrors the layered architecture (Medallion or equivalent):
# Recommended dbt project structure
dbt_project/
├── models/
│ ├── staging/ # Silver layer: source cleaning
│ │ ├── erp/
│ │ │ ├── stg_orders.sql
│ │ │ ├── stg_customers.sql
│ │ │ └── schema.yml # Tests + documentation
│ │ ├── crm/
│ │ │ ├── stg_contacts.sql
│ │ │ └── schema.yml
│ │ └── ecommerce/
│ │ ├── stg_sessions.sql
│ │ └── schema.yml
│ ├── intermediate/ # Intermediate models (complex joins)
│ │ ├── int_customer_orders.sql
│ │ └── schema.yml
│ └── marts/ # Gold layer: business-ready
│ ├── finance/
│ │ ├── fct_orders_monthly.sql
│ │ ├── fct_revenue_by_product.sql
│ │ └── schema.yml
│ ├── marketing/
│ │ ├── fct_campaign_performance.sql
│ │ └── schema.yml
│ └── operations/
│ ├── fct_fulfillment_kpis.sql
│ └── schema.yml
├── seeds/ # Static data (lookup tables, mappings)
│ ├── country_codes.csv
│ └── product_categories.csv
├── snapshots/ # SCD Type 2 (Slowly Changing Dimensions)
│ └── snap_customers.sql
├── tests/ # Custom SQL tests
│ └── assert_revenue_positive.sql
├── macros/ # Reusable Jinja macros
│ ├── generate_schema_name.sql
│ └── audit_columns.sql
└── dbt_project.yml # Global configuration
2. Testing as a Data Quality Contract
dbt tests are not optional: they are the contract that ensures transformations produce reliable data. Every model should have at minimum:
- unique + not_null on the primary key: Guarantees the integrity of every model.
- relationships: Verifies referential integrity between models.
- accepted_values: Ensures categorical fields contain only valid values.
- freshness on source: Alerts when data is not updated within expected windows.
- dbt_utils.expression_is_true: For business-specific constraints (e.g. revenue >= 0).
3. Versioning and CI/CD for Data Pipelines
# .github/workflows/dbt-ci.yml
# CI/CD to validate dbt models on every PR
name: dbt CI Pipeline
on:
pull_request:
branches: [main]
paths:
- 'dbt_project/**'
jobs:
dbt-lint-and-test:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dbt
run: |
pip install dbt-snowflake==1.8.0
pip install sqlfluff==3.0.0
- name: dbt debug (connection test)
run: dbt debug
        env:
          DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          DBT_SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          DBT_SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          DBT_SNOWFLAKE_DATABASE: DBT_CI_DB
          DBT_SNOWFLAKE_SCHEMA: CI_${{ github.event.pull_request.number }}
- name: sqlfluff lint (SQL style check)
run: sqlfluff lint models/ --dialect snowflake
- name: dbt compile (syntax check)
run: dbt compile
      - name: dbt run (modified models only)
        # state:modified+ compares against the production manifest; --defer
        # reuses prod relations for unchanged upstream models
        # (prod-artifacts/ is an illustrative path to that manifest)
        run: dbt run --select state:modified+ --defer --state prod-artifacts/
      - name: dbt test (tests on modified models)
        run: dbt test --select state:modified+ --defer --state prod-artifacts/
- name: dbt docs generate
run: dbt docs generate
      - name: Cleanup CI schema
        if: always()
        # 'drop_schema' is a project macro, not a dbt built-in operation
        run: dbt run-operation drop_schema --args '{"schema": "CI_${{ github.event.pull_request.number }}"}'
Anti-Patterns to Avoid in ELT Pipelines
- Models without tests: A dbt model without at least a unique + not_null test is a time bomb. Eventually it will produce duplicated or null data that invalidates reports.
- Missing schema.yml: Without documentation, models become incomprehensible to anyone who did not write them, including the author six months later.
- Always full refresh: Reloading the entire table instead of doing incremental append/upsert is expensive and fragile for large datasets.
- Business logic in the staging layer: Staging should only clean and standardize. Aggregations and business logic belong in marts. Mixing layers creates unmanageable dependencies.
- Ignoring data lineage: When a report shows a wrong number, without documented lineage tracing the problem takes hours. With dbt docs, you reach the source in a few clicks.
- No freshness alerting: If Fivetran stops syncing overnight and nobody monitors it, morning reports show yesterday's data with no visible warning to business users.
Recommendations for Growing Teams
The data pipeline landscape can seem overwhelming for organizations with limited resources and small teams. The good news is you do not need to adopt the entire stack at once. Here is a progressive three-phase journey:
Adoption Roadmap: 3 Phases
| Phase | Timeline | Recommended Stack | Indicative Monthly Cost |
|---|---|---|---|
| Phase 1: Foundations | Month 1-3 | Fivetran Free + BigQuery sandbox + dbt Core | $0 - $200 |
| Phase 2: Production | Month 4-9 | Fivetran Standard + Snowflake/BigQuery + dbt Core + Managed Airflow | $800 - $3,000 |
| Phase 3: Scale | Month 10+ | Fivetran Enterprise + Snowflake + dbt Cloud + Dagster Cloud | $3,000 - $15,000 |
Phase 1 - Foundations: Start with Fivetran's free plan (500,000 MAR/month) to connect 2-3 critical sources (ERP, CRM, e-commerce). Use BigQuery's sandbox tier (free up to 10GB storage + 1TB queries/month). Install dbt Core locally and write your first 10-15 models. The goal is learning patterns and demonstrating business value with no upfront investment.
Phase 2 - Production: Once the PoC has demonstrated value, scale to Fivetran Standard for more connectors, move to a production cloud warehouse with an SLA (Snowflake or BigQuery production), and add managed orchestration with Airflow (Cloud Composer on GCP or MWAA on AWS). Monthly cost remains contained and business impact is measurable.
Phase 3 - Scale: When the data team grows to 3-5 people, adding dbt Cloud for the collaborative IDE and automatic CI/CD becomes worthwhile, along with Dagster Cloud for more sophisticated orchestration with observability. At this point the pipeline becomes a strategic business asset managed with professional engineering standards.
Conclusions and Next Steps
The evolution from ETL to ELT is not simply a change in letter order: it is a fundamental transformation in how data pipelines are designed, developed and maintained. The cloud made compute elastic and affordable, rendering the pre-load transformation model on dedicated servers obsolete.
The modern stack — dbt + Airbyte/Fivetran + cloud warehouse — represents the 2025 state of the art and is adopted by thousands of companies from startups to Fortune 500. The concrete benefits are measurable: pipelines versionable as code, automated tests guaranteeing data quality, documentation and lineage generated automatically, and costs that scale with the business rather than requiring upfront investment.
Key Takeaways
- ETL is not dead: it still makes sense for sensitive data that cannot transit unmasked in the cloud, or for on-premise legacy sources with low latency requirements.
- ELT is the new standard for cloud-first architectures: load everything raw, transform where you have the most power (the warehouse).
- dbt is the transformation layer: it brings software engineering best practices (versioning, testing, documentation) to the SQL world.
- Airbyte is the open-source choice for technical teams with custom sources or GDPR data residency requirements.
- Fivetran is the managed choice for teams wanting zero connector maintenance for standard SaaS sources (Salesforce, HubSpot, Shopify, etc.).
- For growing teams: start with Fivetran Free + BigQuery + dbt Core to validate value before investing in the full infrastructure.
The next article in the series dives into pipeline orchestration: Airflow, Dagster and Prefect compared. If dbt is the transformation engine and Airbyte/Fivetran are the ingestion system, the orchestrator is the brain coordinating everything: who runs what, when, in which order, and what to do when something goes wrong. A critical component that deserves dedicated coverage.
Hands-On Exercise: Set Up dbt Core in 30 Minutes
Before moving on to the next article, try configuring dbt Core on an existing warehouse (even the free BigQuery sandbox):
# 1. Install dbt with the profile for your warehouse
pip install dbt-bigquery # for BigQuery
# or: pip install dbt-snowflake, dbt-redshift, dbt-duckdb
# 2. Create a new dbt project
dbt init my_dbt_project
# 3. Configure the profile (~/.dbt/profiles.yml for BigQuery)
# my_dbt_project:
# target: dev
# outputs:
# dev:
# type: bigquery
# method: oauth
# project: my-gcp-project
# dataset: dbt_dev
# threads: 4
# timeout_seconds: 300
# 4. Test the connection
cd my_dbt_project
dbt debug
# 5. Create your first model
cat > models/staging/stg_example.sql << 'EOF'
SELECT
id,
name,
created_at,
UPPER(TRIM(email)) AS email_normalized
FROM {{ source('raw', 'users') }}
WHERE id IS NOT NULL
EOF
# 6. Run and test
dbt run
dbt test
dbt docs generate
dbt docs serve # Opens visual lineage at http://localhost:8080