dbt-core for Self-Hosters: SQL With Tests, Without dbt Cloud

Your Analytics Stack Doesn’t Need a SaaS Bill

You’ve got data sitting in Postgres. Or maybe DuckDB. Or ClickHouse if you’ve been reading too many HN threads. Either way, you want to transform that raw garbage into something a dashboard can actually use, star schema, aggregated fact tables, the whole shebang, without writing raw SQL scripts that you’ll never remember the order of in six months.

Enter dbt-core. The open-source part of dbt (data build tool). The part that doesn’t cost $100/seat/month.

dbt-cloud is a great product if you’re a data team at a company with a budget. But for a homelab, a small team, or anyone self-hosting their analytics stack, dbt-core + a local warehouse is genuinely all you need. SQL transformations with dependency graphs, built-in testing, auto-generated docs, and scheduling via whatever you already run (cron, Dagster, a systemd timer, whatever). No vendor lock-in. No subscription.

Here’s the full setup: dbt-core with Postgres, a real example ingesting Plausible/Umami analytics data into a star schema, wiring it up to Metabase or Lightdash, and a quick look at SQLMesh if you want to live dangerously.

If you’re running ClickHouse for metrics, check out the related post on self-hosted observability with ClickHouse, it pairs well with what we’re building here.

What dbt Actually Does (And What It Doesn’t)

dbt doesn’t move data. It doesn’t extract from APIs, doesn’t load CSVs into your warehouse. That’s your ETL tool’s job (Airbyte, custom scripts, whatever). dbt’s job is the T in ELT, transforming data that’s already in your database into clean, documented, tested models.

A “model” in dbt is just a SQL SELECT statement saved as a .sql file. dbt wraps it in a CREATE TABLE AS or CREATE VIEW AS and runs it in the right order based on dependencies.

The key primitives:

models/: Your SQL SELECT files. Each file = one table or view.
ref(): How you declare dependencies between models. ref('orders') tells dbt “run that model first.”
sources: Raw tables from your ingestion layer, declared in YAML.
tests: Generic (not_null, unique, accepted_values) or custom SQL tests that run against your models.
snapshots: Slowly changing dimension tracking. Captures row history over time.

The compile vs. runtime distinction matters: dbt compiles your Jinja+SQL templates into plain SQL first, then runs the compiled SQL against your warehouse. You can inspect what it generated in target/compiled/ before anything touches the database.

Install and Project Setup

pip install dbt-core dbt-postgres
# or for DuckDB:
pip install dbt-core dbt-duckdb
# or ClickHouse:
pip install dbt-core dbt-clickhouse

Initialize a project:

dbt init analytics
cd analytics

This creates the scaffold. The only files you care about immediately:

analytics/
  dbt_project.yml      # project config
  profiles.yml         # connection config (usually in ~/.dbt/)
  models/              # your SQL models live here
  tests/               # custom test SQL
  snapshots/           # SCD snapshots
  macros/              # reusable Jinja macros

Configure your Postgres connection in ~/.dbt/profiles.yml:

analytics:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      user: dbt_user
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: dbt_dev
      threads: 4

Pull the password from an env var. Don’t hardcode it. Your future self will thank you at 2 AM when you’re rotating credentials.

Test the connection:

dbt debug

You want to see all green. If you don’t, fix it now before you’ve got 30 models depending on a broken profile.

Real Example: Plausible/Umami Analytics Into a Star Schema

Let’s say you’re self-hosting Plausible or Umami for website analytics, and you want to expose session/event data to Metabase without giving your BI tool direct access to the raw OLTP tables.

First, declare your source tables, the raw tables that dbt didn’t create:

version: 2

sources:
  - name: plausible_raw
    database: analytics
    schema: public
    tables:
      - name: events
        description: "Raw pageview and custom events from Plausible"
        columns:
          - name: id
            tests:
              - unique
              - not_null
          - name: session_id
            tests:
              - not_null
          - name: timestamp
            tests:
              - not_null
      - name: sessions
        description: "Session-level data"

Now build a staging model that cleans up the raw events:

with source as (
    select * from {{ source('plausible_raw', 'events') }}
),

cleaned as (
    select
        id                                          as event_id,
        session_id,
        timestamp::timestamptz                      as event_at,
        coalesce(pathname, '/')                     as page_path,
        lower(trim(referrer))                       as referrer,
        lower(trim(browser))                        as browser,
        lower(trim(operating_system))               as os,
        country_code,
        props::jsonb                                as event_props
    from source
    where timestamp >= '2024-01-01'  -- trim old junk during dev
)

select * from cleaned

Then a fact table that uses ref() to depend on the staging model:

with events as (
    select * from {{ ref('stg_events') }}
),

sessions as (
    select * from {{ ref('stg_sessions') }}
)

select
    e.event_id,
    e.event_at,
    date_trunc('day', e.event_at)::date  as event_date,
    e.page_path,
    e.referrer,
    e.browser,
    e.os,
    e.country_code,
    s.duration_seconds,
    s.bounce
from events e
left join sessions s on e.session_id = s.session_id
where e.page_path not like '/api/%'

And a daily aggregation dimension:

select
    event_date,
    page_path,
    country_code,
    count(distinct event_id)      as pageviews,
    count(distinct session_id)    as sessions,
    avg(duration_seconds)         as avg_session_duration,
    sum(bounce::int)::float
        / nullif(count(distinct session_id), 0) as bounce_rate
from {{ ref('fct_pageviews') }}
group by 1, 2, 3

Run the whole thing:

dbt run

dbt figures out the dependency order automatically from ref() calls. stg_events and stg_sessions run first, then fct_pageviews, then dim_daily_traffic. You don’t have to manage that. That’s the whole point.

Tests: The Part Most People Skip (Don’t)

This is where dbt earns its keep. Run dbt test and it validates your models against assertions you defined.

Add schema tests in YAML:

version: 2

models:
  - name: fct_pageviews
    description: "One row per pageview event"
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: event_date
        tests:
          - not_null
      - name: country_code
        tests:
          - accepted_values:
              values: ['US', 'GB', 'DE', 'FR', 'CA', 'AU', 'NL', 'SE']
              quote: true
              config:
                severity: warn  # warn, don't fail — you'll get unexpected countries

For more complex assertions, write a custom test: just a SQL file that returns rows when the test fails:

-- Fails if any events are timestamped in the future (clock skew or bad data)
select event_id
from {{ ref('fct_pageviews') }}
where event_at > now() + interval '1 hour'

Run tests:

dbt test
# or test a specific model:
dbt test --select fct_pageviews

If something breaks during a load, you’ll know before your dashboard shows garbage. That’s the tradeoff you’re making when you skip the SaaS layer: you own the data quality, so you’d better test it.

Incremental Models: Don’t Reprocess Everything Every Run

By default dbt rebuilds the whole table on every dbt run. That’s fine for small datasets. For anything over a few million rows, you want incremental models:

{{
    config(
        materialized='incremental',
        unique_key='event_id',
        on_schema_change='sync_all_columns'
    )
}}

with events as (
    select * from {{ ref('stg_events') }}

    {% if is_incremental() %}
    -- Only process events newer than the last run
    where event_at > (select max(event_at) from {{ this }})
    {% endif %}
)

-- ... rest of your query

On the first run, is_incremental() is false and you build the full table. On subsequent runs, dbt only processes new rows and merges them in. unique_key tells it how to handle duplicates (upsert behavior).

Docs: Free BI-Adjacent Documentation

dbt docs generate
dbt docs serve

That’s it. dbt generates a full documentation site at http://localhost:8080 with your model lineage DAG, column descriptions, test results, and source definitions. Point a teammate at it when they ask “where does this number come from.” It’s not fancy but it works.

For a self-hosted setup, you can dump the generated target/ directory to a static file server or just run dbt docs serve in a screen/tmux session when you need it.

Scheduling: Pick Your Poison

dbt doesn’t have a built-in scheduler. Options by complexity:

Cron / systemd timer: Simplest. A systemd timer unit that runs dbt run && dbt test on a schedule. Zero dependencies.

[Unit]
Description=dbt analytics run

[Service]
Type=oneshot
User=analytics
WorkingDirectory=/opt/analytics
ExecStart=/opt/analytics/.venv/bin/dbt run --profiles-dir /opt/analytics
ExecStartPost=/opt/analytics/.venv/bin/dbt test --profiles-dir /opt/analytics

[Unit]
Description=Run dbt analytics every hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

Dagster: If you want a proper orchestrator with a UI, retries, and asset lineage. Has a first-class dbt integration (dagster-dbt). Self-hosts fine with Docker Compose.

Airflow: Battle-tested, heavy. Overkill for a homelab. Use Dagster instead unless you already have Airflow running.

dbt-core vs SQLMesh: The New Kid

SQLMesh is worth knowing about. It’s newer, dbt-compatible (can import dbt projects), and has a built-in semantic layer, you define metrics once and they’re reusable across queries. It also has smarter incremental execution: it tracks which models changed and only reruns what’s actually affected, not just what you tell it to.

Honest take: SQLMesh is impressive and the semantic layer is genuinely better designed than dbt’s approach. But the ecosystem is smaller, the docs are thinner, and you’ll hit rough edges if you’re off the happy path. For a homelab analytics project, dbt-core has a bigger community, more adapters, and more Stack Overflow answers when things break. SQLMesh is the one to watch as it matures.

If you’re starting fresh and willing to ride the cutting edge, try SQLMesh. If you want something that Just Works today and has years of production hardening behind it, dbt-core wins.

Wiring to Metabase or Lightdash

Once fct_pageviews and dim_daily_traffic exist in your dbt_dev schema (or dbt_prod once you promote), point your BI tool at them.

Metabase: Connect to the analytics database, select the dbt_prod schema, sync. Your dbt model descriptions even show up as table descriptions in Metabase if you’ve written them in your schema YAML.

Lightdash: Reads dbt project metadata directly. Point it at your dbt_project.yml and it builds the semantic layer from your model YAML. Worth a look if you want a more data-analyst-friendly interface and you’re already maintaining dbt schema files.

Neither tool needs to know about dbt internals; they just see clean, tested tables in Postgres. That’s the whole value prop: your BI tool gets a stable, documented interface instead of raw OLTP tables that change whenever a dev refactors the schema.

The Bottom Line

dbt-core + Postgres (or DuckDB for local analytics, ClickHouse for high-volume metrics) is a genuinely complete analytics-engineering stack. You get dependency-aware transformations, data quality tests, auto-generated documentation, and incremental processing, everything the cloud version sells you, running entirely on infrastructure you control.

The tradeoff is you own the scheduling and you write your own CI. That’s maybe two hours of setup with systemd timers and a basic test run in your deployment pipeline. Not bad for not paying $1,200/year for the privilege of using SQL.

Set up the project, write your first staging model, run dbt test, and you’ll immediately understand why this tool ate the data engineering world. Your data pipeline will finally have the same kind of structure and confidence you expect from application code. About time.

dbt-core for Self-Hosters: SQL With Tests, Without dbt Cloud

Your Analytics Stack Doesn’t Need a SaaS Bill

What dbt Actually Does (And What It Doesn’t)

Install and Project Setup

Real Example: Plausible/Umami Analytics Into a Star Schema

Tests: The Part Most People Skip (Don’t)

Incremental Models: Don’t Reprocess Everything Every Run

Docs: Free BI-Adjacent Documentation

Scheduling: Pick Your Poison

dbt-core vs SQLMesh: The New Kid

Wiring to Metabase or Lightdash

The Bottom Line

Responses from around the web

Discussion

Related Posts

Gitea vs Forgejo vs GitLab CE: Self-Hosted Git

Memos: Self-Hosted Microblog Meets Daily Note

Convoy: Self-Hosted Webhook Delivery That Doesn't Drop Events

Forgejo Actions: Self-Hosted GitHub-Style CI Without GitHub

dbt-core for Self-Hosters: SQL With Tests, Without dbt Cloud

Your Analytics Stack Doesn’t Need a SaaS Bill

What dbt Actually Does (And What It Doesn’t)

Install and Project Setup

Real Example: Plausible/Umami Analytics Into a Star Schema

Tests: The Part Most People Skip (Don’t)

Incremental Models: Don’t Reprocess Everything Every Run

Docs: Free BI-Adjacent Documentation

Scheduling: Pick Your Poison

dbt-core vs SQLMesh: The New Kid

Wiring to Metabase or Lightdash

The Bottom Line

Related Reading

Responses from around the web

Discussion

Related Posts

Gitea vs Forgejo vs GitLab CE: Self-Hosted Git

Memos: Self-Hosted Microblog Meets Daily Note

Convoy: Self-Hosted Webhook Delivery That Doesn't Drop Events

Forgejo Actions: Self-Hosted GitHub-Style CI Without GitHub