Your Analytics Stack Doesn’t Need a SaaS Bill
You’ve got data sitting in Postgres. Or maybe DuckDB. Or ClickHouse if you’ve been reading too many HN threads. Either way, you want to transform that raw garbage into something a dashboard can actually use — star schema, aggregated fact tables, the whole shebang — without writing raw SQL scripts that you’ll never remember the order of in six months.
Enter dbt-core. The open-source part of dbt (data build tool). The part that doesn’t cost $100/seat/month.
Here’s the thing: dbt-cloud is a great product if you’re a data team at a company with a budget. But for a homelab, a small team, or anyone self-hosting their analytics stack, dbt-core + a local warehouse is genuinely all you need. SQL transformations with dependency graphs, built-in testing, auto-generated docs, and scheduling via whatever you already run (cron, Dagster, a systemd timer, whatever). No vendor lock-in. No subscription.
This article walks through the full setup: dbt-core with Postgres, a real example ingesting Plausible/Umami analytics data into a star schema, wiring it up to Metabase or Lightdash, and a quick look at SQLMesh if you want to live dangerously.
If you’re running ClickHouse for metrics, check out the related post on self-hosted observability with ClickHouse — it pairs well with what we’re building here.
What dbt Actually Does (And What It Doesn’t)
dbt doesn’t move data. It doesn’t extract from APIs, doesn’t load CSVs into your warehouse. That’s your ETL tool’s job (Airbyte, custom scripts, whatever). dbt’s job is the T in ELT — transforming data that’s already in your database into clean, documented, tested models.
A “model” in dbt is just a SQL SELECT statement saved as a .sql file. dbt wraps it in a CREATE TABLE AS or CREATE VIEW AS and runs it in the right order based on dependencies.
The key primitives:
models/— Your SQL SELECT files. Each file = one table or view.ref()— How you declare dependencies between models.ref('orders')tells dbt “run that model first.”sources— Raw tables from your ingestion layer, declared in YAML.tests— Generic (not_null, unique, accepted_values) or custom SQL tests that run against your models.snapshots— Slowly changing dimension tracking. Captures row history over time.
The compile vs. runtime distinction matters: dbt compiles your Jinja+SQL templates into plain SQL first, then runs the compiled SQL against your warehouse. You can inspect what it generated in target/compiled/ before anything touches the database.
Install and Project Setup
pip install dbt-core dbt-postgres# or for DuckDB:pip install dbt-core dbt-duckdb# or ClickHouse:pip install dbt-core dbt-clickhouseInitialize a project:
dbt init analyticscd analyticsThis creates the scaffold. The only files you care about immediately:
analytics/ dbt_project.yml # project config profiles.yml # connection config (usually in ~/.dbt/) models/ # your SQL models live here tests/ # custom test SQL snapshots/ # SCD snapshots macros/ # reusable Jinja macrosConfigure your Postgres connection in ~/.dbt/profiles.yml:
analytics: target: dev outputs: dev: type: postgres host: localhost user: dbt_user password: "{{ env_var('DBT_PASSWORD') }}" port: 5432 dbname: analytics schema: dbt_dev threads: 4Pull the password from an env var — don’t hardcode it. Your future self will thank you at 2 AM when you’re rotating credentials.
Test the connection:
dbt debugYou want to see all green. If you don’t, fix it now before you’ve got 30 models depending on a broken profile.
Real Example: Plausible/Umami Analytics Into a Star Schema
Let’s say you’re self-hosting Plausible or Umami for website analytics, and you want to expose session/event data to Metabase without giving your BI tool direct access to the raw OLTP tables.
First, declare your source tables — the raw tables that dbt didn’t create:
version: 2
sources: - name: plausible_raw database: analytics schema: public tables: - name: events description: "Raw pageview and custom events from Plausible" columns: - name: id tests: - unique - not_null - name: session_id tests: - not_null - name: timestamp tests: - not_null - name: sessions description: "Session-level data"Now build a staging model that cleans up the raw events:
with source as ( select * from {{ source('plausible_raw', 'events') }}),
cleaned as ( select id as event_id, session_id, timestamp::timestamptz as event_at, coalesce(pathname, '/') as page_path, lower(trim(referrer)) as referrer, lower(trim(browser)) as browser, lower(trim(operating_system)) as os, country_code, props::jsonb as event_props from source where timestamp >= '2024-01-01' -- trim old junk during dev)
select * from cleanedThen a fact table that uses ref() to depend on the staging model:
with events as ( select * from {{ ref('stg_events') }}),
sessions as ( select * from {{ ref('stg_sessions') }})
select e.event_id, e.event_at, date_trunc('day', e.event_at)::date as event_date, e.page_path, e.referrer, e.browser, e.os, e.country_code, s.duration_seconds, s.bouncefrom events eleft join sessions s on e.session_id = s.session_idwhere e.page_path not like '/api/%'And a daily aggregation dimension:
select event_date, page_path, country_code, count(distinct event_id) as pageviews, count(distinct session_id) as sessions, avg(duration_seconds) as avg_session_duration, sum(bounce::int)::float / nullif(count(distinct session_id), 0) as bounce_ratefrom {{ ref('fct_pageviews') }}group by 1, 2, 3Run the whole thing:
dbt rundbt figures out the dependency order automatically from ref() calls. stg_events and stg_sessions run first, then fct_pageviews, then dim_daily_traffic. You don’t have to manage that. That’s the whole point.
Tests: The Part Most People Skip (Don’t)
This is where dbt earns its keep. Run dbt test and it validates your models against assertions you defined.
Add schema tests in YAML:
version: 2
models: - name: fct_pageviews description: "One row per pageview event" columns: - name: event_id tests: - unique - not_null - name: event_date tests: - not_null - name: country_code tests: - accepted_values: values: ['US', 'GB', 'DE', 'FR', 'CA', 'AU', 'NL', 'SE'] quote: true config: severity: warn # warn, don't fail — you'll get unexpected countriesFor more complex assertions, write a custom test — just a SQL file that returns rows when the test fails:
-- Fails if any events are timestamped in the future (clock skew or bad data)select event_idfrom {{ ref('fct_pageviews') }}where event_at > now() + interval '1 hour'Run tests:
dbt test# or test a specific model:dbt test --select fct_pageviewsIf something breaks during a load, you’ll know before your dashboard shows garbage. That’s the tradeoff you’re making when you skip the SaaS layer — you own the data quality, so you’d better test it.
Incremental Models: Don’t Reprocess Everything Every Run
By default dbt rebuilds the whole table on every dbt run. That’s fine for small datasets. For anything over a few million rows, you want incremental models:
{{ config( materialized='incremental', unique_key='event_id', on_schema_change='sync_all_columns' )}}
with events as ( select * from {{ ref('stg_events') }}
{% if is_incremental() %} -- Only process events newer than the last run where event_at > (select max(event_at) from {{ this }}) {% endif %})
-- ... rest of your queryOn the first run, is_incremental() is false and you build the full table. On subsequent runs, dbt only processes new rows and merges them in. unique_key tells it how to handle duplicates (upsert behavior).
Docs: Free BI-Adjacent Documentation
dbt docs generatedbt docs serveThat’s it. dbt generates a full documentation site at http://localhost:8080 with your model lineage DAG, column descriptions, test results, and source definitions. Point a teammate at it when they ask “where does this number come from.” It’s not fancy but it works.
For a self-hosted setup, you can dump the generated target/ directory to a static file server or just run dbt docs serve in a screen/tmux session when you need it.
Scheduling: Pick Your Poison
dbt doesn’t have a built-in scheduler. Options by complexity:
Cron / systemd timer — Simplest. A systemd timer unit that runs dbt run && dbt test on a schedule. Zero dependencies.
[Unit]Description=dbt analytics run
[Service]Type=oneshotUser=analyticsWorkingDirectory=/opt/analyticsExecStart=/opt/analytics/.venv/bin/dbt run --profiles-dir /opt/analyticsExecStartPost=/opt/analytics/.venv/bin/dbt test --profiles-dir /opt/analytics[Unit]Description=Run dbt analytics every hour
[Timer]OnCalendar=hourlyPersistent=true
[Install]WantedBy=timers.targetDagster — If you want a proper orchestrator with a UI, retries, and asset lineage. Has a first-class dbt integration (dagster-dbt). Self-hosts fine with Docker Compose.
Airflow — Battle-tested, heavy. Overkill for a homelab. Use Dagster instead unless you already have Airflow running.
dbt-core vs SQLMesh: The New Kid
SQLMesh is worth knowing about. It’s newer, dbt-compatible (can import dbt projects), and has a built-in semantic layer — you define metrics once and they’re reusable across queries. It also has smarter incremental execution: it tracks which models changed and only reruns what’s actually affected, not just what you tell it to.
Honest take: SQLMesh is impressive and the semantic layer is genuinely better designed than dbt’s approach. But the ecosystem is smaller, the docs are thinner, and you’ll hit rough edges if you’re off the happy path. For a homelab analytics project, dbt-core has a bigger community, more adapters, and more Stack Overflow answers when things break. SQLMesh is the one to watch as it matures.
If you’re starting fresh and willing to ride the cutting edge, try SQLMesh. If you want something that Just Works today and has years of production hardening behind it, dbt-core wins.
Wiring to Metabase or Lightdash
Once fct_pageviews and dim_daily_traffic exist in your dbt_dev schema (or dbt_prod once you promote), point your BI tool at them.
Metabase: Connect to the analytics database, select the dbt_prod schema, sync. Your dbt model descriptions even show up as table descriptions in Metabase if you’ve written them in your schema YAML.
Lightdash: Reads dbt project metadata directly. Point it at your dbt_project.yml and it builds the semantic layer from your model YAML. Worth a look if you want a more data-analyst-friendly interface and you’re already maintaining dbt schema files.
Neither tool needs to know about dbt internals — they just see clean, tested tables in Postgres. That’s the whole value prop: your BI tool gets a stable, documented interface instead of raw OLTP tables that change whenever a dev refactors the schema.
The Bottom Line
dbt-core + Postgres (or DuckDB for local analytics, ClickHouse for high-volume metrics) is a genuinely complete analytics-engineering stack. You get dependency-aware transformations, data quality tests, auto-generated documentation, and incremental processing — everything the cloud version sells you — running entirely on infrastructure you control.
The tradeoff is you own the scheduling and you write your own CI. That’s maybe two hours of setup with systemd timers and a basic test run in your deployment pipeline. Not bad for not paying $1,200/year for the privilege of using SQL.
Set up the project, write your first staging model, run dbt test, and you’ll immediately understand why this tool ate the data engineering world. Your data pipeline will finally have the same kind of structure and confidence you expect from application code. About time.