Skip to content
Go back

dbt-core for Self-Hosters: SQL With Tests, Without dbt Cloud

By SumGuy 10 min read
dbt-core for Self-Hosters: SQL With Tests, Without dbt Cloud

Your Analytics Stack Doesn’t Need a SaaS Bill

You’ve got data sitting in Postgres. Or maybe DuckDB. Or ClickHouse if you’ve been reading too many HN threads. Either way, you want to transform that raw garbage into something a dashboard can actually use — star schema, aggregated fact tables, the whole shebang — without writing raw SQL scripts that you’ll never remember the order of in six months.

Enter dbt-core. The open-source part of dbt (data build tool). The part that doesn’t cost $100/seat/month.

Here’s the thing: dbt-cloud is a great product if you’re a data team at a company with a budget. But for a homelab, a small team, or anyone self-hosting their analytics stack, dbt-core + a local warehouse is genuinely all you need. SQL transformations with dependency graphs, built-in testing, auto-generated docs, and scheduling via whatever you already run (cron, Dagster, a systemd timer, whatever). No vendor lock-in. No subscription.

This article walks through the full setup: dbt-core with Postgres, a real example ingesting Plausible/Umami analytics data into a star schema, wiring it up to Metabase or Lightdash, and a quick look at SQLMesh if you want to live dangerously.

If you’re running ClickHouse for metrics, check out the related post on self-hosted observability with ClickHouse — it pairs well with what we’re building here.


What dbt Actually Does (And What It Doesn’t)

dbt doesn’t move data. It doesn’t extract from APIs, doesn’t load CSVs into your warehouse. That’s your ETL tool’s job (Airbyte, custom scripts, whatever). dbt’s job is the T in ELT — transforming data that’s already in your database into clean, documented, tested models.

A “model” in dbt is just a SQL SELECT statement saved as a .sql file. dbt wraps it in a CREATE TABLE AS or CREATE VIEW AS and runs it in the right order based on dependencies.

The key primitives:

The compile vs. runtime distinction matters: dbt compiles your Jinja+SQL templates into plain SQL first, then runs the compiled SQL against your warehouse. You can inspect what it generated in target/compiled/ before anything touches the database.


Install and Project Setup

Terminal window
pip install dbt-core dbt-postgres
# or for DuckDB:
pip install dbt-core dbt-duckdb
# or ClickHouse:
pip install dbt-core dbt-clickhouse

Initialize a project:

Terminal window
dbt init analytics
cd analytics

This creates the scaffold. The only files you care about immediately:

analytics/
dbt_project.yml # project config
profiles.yml # connection config (usually in ~/.dbt/)
models/ # your SQL models live here
tests/ # custom test SQL
snapshots/ # SCD snapshots
macros/ # reusable Jinja macros

Configure your Postgres connection in ~/.dbt/profiles.yml:

~/.dbt/profiles.yml
analytics:
target: dev
outputs:
dev:
type: postgres
host: localhost
user: dbt_user
password: "{{ env_var('DBT_PASSWORD') }}"
port: 5432
dbname: analytics
schema: dbt_dev
threads: 4

Pull the password from an env var — don’t hardcode it. Your future self will thank you at 2 AM when you’re rotating credentials.

Test the connection:

Terminal window
dbt debug

You want to see all green. If you don’t, fix it now before you’ve got 30 models depending on a broken profile.


Real Example: Plausible/Umami Analytics Into a Star Schema

Let’s say you’re self-hosting Plausible or Umami for website analytics, and you want to expose session/event data to Metabase without giving your BI tool direct access to the raw OLTP tables.

First, declare your source tables — the raw tables that dbt didn’t create:

models/sources.yml
version: 2
sources:
- name: plausible_raw
database: analytics
schema: public
tables:
- name: events
description: "Raw pageview and custom events from Plausible"
columns:
- name: id
tests:
- unique
- not_null
- name: session_id
tests:
- not_null
- name: timestamp
tests:
- not_null
- name: sessions
description: "Session-level data"

Now build a staging model that cleans up the raw events:

models/staging/stg_events.sql
with source as (
select * from {{ source('plausible_raw', 'events') }}
),
cleaned as (
select
id as event_id,
session_id,
timestamp::timestamptz as event_at,
coalesce(pathname, '/') as page_path,
lower(trim(referrer)) as referrer,
lower(trim(browser)) as browser,
lower(trim(operating_system)) as os,
country_code,
props::jsonb as event_props
from source
where timestamp >= '2024-01-01' -- trim old junk during dev
)
select * from cleaned

Then a fact table that uses ref() to depend on the staging model:

models/marts/fct_pageviews.sql
with events as (
select * from {{ ref('stg_events') }}
),
sessions as (
select * from {{ ref('stg_sessions') }}
)
select
e.event_id,
e.event_at,
date_trunc('day', e.event_at)::date as event_date,
e.page_path,
e.referrer,
e.browser,
e.os,
e.country_code,
s.duration_seconds,
s.bounce
from events e
left join sessions s on e.session_id = s.session_id
where e.page_path not like '/api/%'

And a daily aggregation dimension:

models/marts/dim_daily_traffic.sql
select
event_date,
page_path,
country_code,
count(distinct event_id) as pageviews,
count(distinct session_id) as sessions,
avg(duration_seconds) as avg_session_duration,
sum(bounce::int)::float
/ nullif(count(distinct session_id), 0) as bounce_rate
from {{ ref('fct_pageviews') }}
group by 1, 2, 3

Run the whole thing:

Terminal window
dbt run

dbt figures out the dependency order automatically from ref() calls. stg_events and stg_sessions run first, then fct_pageviews, then dim_daily_traffic. You don’t have to manage that. That’s the whole point.


Tests: The Part Most People Skip (Don’t)

This is where dbt earns its keep. Run dbt test and it validates your models against assertions you defined.

Add schema tests in YAML:

models/marts/schema.yml
version: 2
models:
- name: fct_pageviews
description: "One row per pageview event"
columns:
- name: event_id
tests:
- unique
- not_null
- name: event_date
tests:
- not_null
- name: country_code
tests:
- accepted_values:
values: ['US', 'GB', 'DE', 'FR', 'CA', 'AU', 'NL', 'SE']
quote: true
config:
severity: warn # warn, don't fail — you'll get unexpected countries

For more complex assertions, write a custom test — just a SQL file that returns rows when the test fails:

tests/assert_no_future_events.sql
-- Fails if any events are timestamped in the future (clock skew or bad data)
select event_id
from {{ ref('fct_pageviews') }}
where event_at > now() + interval '1 hour'

Run tests:

Terminal window
dbt test
# or test a specific model:
dbt test --select fct_pageviews

If something breaks during a load, you’ll know before your dashboard shows garbage. That’s the tradeoff you’re making when you skip the SaaS layer — you own the data quality, so you’d better test it.


Incremental Models: Don’t Reprocess Everything Every Run

By default dbt rebuilds the whole table on every dbt run. That’s fine for small datasets. For anything over a few million rows, you want incremental models:

models/marts/fct_pageviews.sql
{{
config(
materialized='incremental',
unique_key='event_id',
on_schema_change='sync_all_columns'
)
}}
with events as (
select * from {{ ref('stg_events') }}
{% if is_incremental() %}
-- Only process events newer than the last run
where event_at > (select max(event_at) from {{ this }})
{% endif %}
)
-- ... rest of your query

On the first run, is_incremental() is false and you build the full table. On subsequent runs, dbt only processes new rows and merges them in. unique_key tells it how to handle duplicates (upsert behavior).


Docs: Free BI-Adjacent Documentation

Terminal window
dbt docs generate
dbt docs serve

That’s it. dbt generates a full documentation site at http://localhost:8080 with your model lineage DAG, column descriptions, test results, and source definitions. Point a teammate at it when they ask “where does this number come from.” It’s not fancy but it works.

For a self-hosted setup, you can dump the generated target/ directory to a static file server or just run dbt docs serve in a screen/tmux session when you need it.


Scheduling: Pick Your Poison

dbt doesn’t have a built-in scheduler. Options by complexity:

Cron / systemd timer — Simplest. A systemd timer unit that runs dbt run && dbt test on a schedule. Zero dependencies.

/etc/systemd/system/dbt-analytics.service
[Unit]
Description=dbt analytics run
[Service]
Type=oneshot
User=analytics
WorkingDirectory=/opt/analytics
ExecStart=/opt/analytics/.venv/bin/dbt run --profiles-dir /opt/analytics
ExecStartPost=/opt/analytics/.venv/bin/dbt test --profiles-dir /opt/analytics
/etc/systemd/system/dbt-analytics.timer
[Unit]
Description=Run dbt analytics every hour
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target

Dagster — If you want a proper orchestrator with a UI, retries, and asset lineage. Has a first-class dbt integration (dagster-dbt). Self-hosts fine with Docker Compose.

Airflow — Battle-tested, heavy. Overkill for a homelab. Use Dagster instead unless you already have Airflow running.


dbt-core vs SQLMesh: The New Kid

SQLMesh is worth knowing about. It’s newer, dbt-compatible (can import dbt projects), and has a built-in semantic layer — you define metrics once and they’re reusable across queries. It also has smarter incremental execution: it tracks which models changed and only reruns what’s actually affected, not just what you tell it to.

Honest take: SQLMesh is impressive and the semantic layer is genuinely better designed than dbt’s approach. But the ecosystem is smaller, the docs are thinner, and you’ll hit rough edges if you’re off the happy path. For a homelab analytics project, dbt-core has a bigger community, more adapters, and more Stack Overflow answers when things break. SQLMesh is the one to watch as it matures.

If you’re starting fresh and willing to ride the cutting edge, try SQLMesh. If you want something that Just Works today and has years of production hardening behind it, dbt-core wins.


Wiring to Metabase or Lightdash

Once fct_pageviews and dim_daily_traffic exist in your dbt_dev schema (or dbt_prod once you promote), point your BI tool at them.

Metabase: Connect to the analytics database, select the dbt_prod schema, sync. Your dbt model descriptions even show up as table descriptions in Metabase if you’ve written them in your schema YAML.

Lightdash: Reads dbt project metadata directly. Point it at your dbt_project.yml and it builds the semantic layer from your model YAML. Worth a look if you want a more data-analyst-friendly interface and you’re already maintaining dbt schema files.

Neither tool needs to know about dbt internals — they just see clean, tested tables in Postgres. That’s the whole value prop: your BI tool gets a stable, documented interface instead of raw OLTP tables that change whenever a dev refactors the schema.


The Bottom Line

dbt-core + Postgres (or DuckDB for local analytics, ClickHouse for high-volume metrics) is a genuinely complete analytics-engineering stack. You get dependency-aware transformations, data quality tests, auto-generated documentation, and incremental processing — everything the cloud version sells you — running entirely on infrastructure you control.

The tradeoff is you own the scheduling and you write your own CI. That’s maybe two hours of setup with systemd timers and a basic test run in your deployment pipeline. Not bad for not paying $1,200/year for the privilege of using SQL.

Set up the project, write your first staging model, run dbt test, and you’ll immediately understand why this tool ate the data engineering world. Your data pipeline will finally have the same kind of structure and confidence you expect from application code. About time.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Next Post
Argo Workflows vs Tekton

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts