You Don’t Need Snowflake
You’ve got 100 million rows of event logs sitting in a directory somewhere. Your Grafana dashboard is choking on Postgres aggregation queries. Someone on Reddit said ClickHouse is the move. Someone else said DuckDB. Your coworker who just got back from a conference said StarRocks is the future.
Honestly? They’re all right. They’re also all wrong, depending on what you’re actually trying to do.
This is the comparison I wish existed when I was staring at a pile of JSON logs wondering how to make sense of them without spinning up a data warehouse that costs more than my rent.
Three engines. One decision. Let’s go.
The Contenders
ClickHouse 25.x — The Production Workhorse
ClickHouse is a columnar OLAP database that’s been eating the competition’s lunch for a decade. It’s genuinely fast on aggregations over huge datasets — think Grafana metric dashboards, user event analytics, log ingestion pipelines.
The core primitive is the MergeTree engine family. You pick a variant based on what you need:
MergeTree— baseline, append-only inserts, sorted by primary keyReplacingMergeTree— deduplicates rows with the same sorting key (eventually — merges happen in the background, not immediately)AggregatingMergeTree— pre-aggregates state during merges; pairs with materialized views for real-time rollupsSummingMergeTree— sums numeric columns automatically on merge
The “eventually” part of ReplacingMergeTree trips people up constantly. You insert a row, insert an update, and the old version is still visible until a background merge runs. Always query with FINAL or use max(version) if you need consistent deduplication right now.
-- ClickHouse: create an event table with deduplicationCREATE TABLE events( event_id String, user_id UInt64, event_type LowCardinality(String), ts DateTime, properties String -- JSON blob)ENGINE = ReplacingMergeTree()ORDER BY (event_type, user_id, ts);
-- Insert a batchINSERT INTO eventsSELECT generateUUIDv4(), rand() % 10000, ['click','view','purchase'][rand() % 3 + 1], now() - toIntervalSecond(rand() % 86400), '{}'FROM numbers(1000000);
-- Query with FINAL to force deduplicationSELECT event_type, toStartOfHour(ts) AS hour, count() AS events, uniqExact(user_id) AS usersFROM events FINALWHERE ts >= now() - INTERVAL 24 HOURGROUP BY event_type, hourORDER BY hour DESC;On a single node with 2–3 GB RAM, ClickHouse handles this query over 100M rows in under a second. Legitimately under a second. It’s not magic — it’s column storage + vectorized execution + aggressive compression.
Async inserts (enabled since 22.x, mature in 25.x) let you fire-and-forget small batches without hammering the server. Rows get buffered and merged in the background:
SET async_insert = 1;SET wait_for_async_insert = 0;The weak spot historically has been JOINs. ClickHouse was designed around denormalized wide tables, not relational schemas. Large JOIN performance has improved significantly in 24.x+ with hash join improvements, but if your workload is fundamentally join-heavy across normalized tables, read the StarRocks section carefully.
Docker Compose:
services: clickhouse: image: clickhouse/clickhouse-server:25.4 container_name: clickhouse ports: - "8123:8123" # HTTP interface - "9000:9000" # Native protocol volumes: - clickhouse_data:/var/lib/clickhouse - ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml environment: CLICKHOUSE_USER: default CLICKHOUSE_PASSWORD: yourpassword CLICKHOUSE_DB: analytics ulimits: nofile: soft: 262144 hard: 262144
volumes: clickhouse_data:Memory footprint: 1–3 GB at rest on a single node, depending on caches. Goes higher under load. Plan for at least 4 GB available if you’re running this alongside other services.
DuckDB 1.x — The Binary That Slaps
DuckDB is not a server. It’s a library. You embed it in your script, your Python notebook, your Go ETL pipeline — it runs in-process and disappears when your process exits.
That’s the pitch. No daemon, no connection pool, no port to open. You need to analyze a 20 GB Parquet file? Point DuckDB at it and run SQL. Done.
import duckdb
# DuckDB: analyze a parquet file directly — no server requiredcon = duckdb.connect()
result = con.execute(""" SELECT event_type, date_trunc('hour', ts::TIMESTAMP) AS hour, count(*) AS events, count(DISTINCT user_id) AS users FROM read_parquet('/data/events/*.parquet') WHERE ts::TIMESTAMP >= now() - INTERVAL '24 hours' GROUP BY event_type, hour ORDER BY hour DESC""").fetchdf()
print(result.head(20))DuckDB 1.x ships with native Parquet, CSV, JSON, Arrow, and Iceberg support. You can query remote S3 files directly. You can attach a Postgres database and join against it:
# Attach Postgres and query across systemscon.execute("INSTALL postgres; LOAD postgres;")con.execute("ATTACH 'dbname=mydb host=localhost user=app password=pass' AS pg (TYPE postgres)")
result = con.execute(""" SELECT e.event_type, u.email, count(*) AS events FROM read_parquet('/data/events.parquet') e JOIN pg.users u ON e.user_id = u.id GROUP BY e.event_type, u.email ORDER BY events DESC LIMIT 50""").fetchdf()That query pattern — Parquet on disk, metadata in Postgres, joined in DuckDB — is genuinely useful for home lab ETL work.
JOIN performance is DuckDB’s strength. It uses a vectorized hash join with adaptive spill-to-disk. You can throw complex multi-table joins at it and it handles them gracefully, even on a machine with 8 GB RAM.
The catch: DuckDB is single-process. One writer at a time. No concurrent connections doing heavy queries. It’s not a server — it doesn’t pretend to be.
For a CLI-accessible setup, you can use the duckdb binary:
# Install DuckDB CLI (Linux amd64)curl -LO https://github.com/duckdb/duckdb/releases/download/v1.2.1/duckdb_cli-linux-amd64.zipunzip duckdb_cli-linux-amd64.zipchmod +x duckdbsudo mv duckdb /usr/local/bin/
# Query a CSV right from the shellduckdb -c "SELECT count(*) FROM read_csv_auto('/var/log/nginx/access.log')"Or run it via MotherDuck (their managed cloud offering) if you want a persistent server. But honestly, for home lab use, the CLI or embedded mode is the right call.
Docker Compose (if you want a persistent DuckDB REST interface via their HTTP API wrapper):
services: duckdb-api: image: ghcr.io/iotabah/duckdb-api:latest ports: - "8080:8080" volumes: - ./data:/data - duckdb_state:/state environment: DUCKDB_PATH: /state/analytics.duckdb
volumes: duckdb_state:Memory footprint at rest: essentially zero. DuckDB uses what it needs and releases it. For a 100M-row Parquet query, expect 2–6 GB peak usage depending on the query shape.
StarRocks 3.x — The Enterprise Interloper
StarRocks is an Apache 2.0-licensed MPP (massively parallel processing) analytical database. It forked from Apache Doris, which itself has deep roots in Google Mesa. The headline feature: it speaks the MySQL wire protocol, so anything that talks to MySQL can talk to StarRocks.
That means your Grafana MySQL datasource, your Superset connection, your little Python script using pymysql — all of it works without modification. That’s a bigger deal than it sounds when you’re stitching together home lab tooling.
StarRocks 3.x introduced shared-nothing to shared-data architecture, better Iceberg/Hudi/Delta Lake support, and solid upsert semantics via Primary Key tables:
-- StarRocks: Primary Key table with upsert supportCREATE TABLE events ( event_id BIGINT NOT NULL, user_id BIGINT, event_type VARCHAR(64), ts DATETIME, properties JSON)PRIMARY KEY (event_id)DISTRIBUTED BY HASH(event_id) BUCKETS 8PROPERTIES ( "replication_num" = "1", "enable_persistent_index" = "true");
-- Upsert via stream load or INSERT ... ON CONFLICTINSERT INTO events (event_id, user_id, event_type, ts, properties)VALUES (12345, 789, 'click', '2026-07-15 12:00:00', '{"page": "/home"}')ON DUPLICATE KEY UPDATE event_type = VALUES(event_type), ts = VALUES(ts), properties = VALUES(properties);The JOIN story is where StarRocks actually earns its place. It was designed for multi-table analytical queries from day one. Broadcast joins, shuffle joins, colocate joins — it picks the right strategy automatically. Running a join across a 100M-row fact table and a 10M-row dimension table? StarRocks is going to beat ClickHouse here.
Lakehouse integration is genuinely impressive for a self-hosted setup. You can create external catalogs pointing at Iceberg tables on S3-compatible storage:
-- Create an Iceberg catalog pointing at MinIOCREATE EXTERNAL CATALOG iceberg_catalogPROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "rest", "iceberg.catalog.uri" = "http://minio:8181", "aws.s3.endpoint" = "http://minio:9000", "aws.s3.access_key" = "minioadmin", "aws.s3.secret_key" = "minioadmin", "aws.s3.enable_path_style_access" = "true");
-- Query across local StarRocks table and Iceberg lake tableSELECT e.event_type, p.product_name, count(*) AS conversionsFROM events eJOIN iceberg_catalog.lake.products p ON e.properties->'$.product_id' = p.idWHERE e.ts >= '2026-07-01'GROUP BY e.event_type, p.product_name;The downside: StarRocks has a real appetite for resources. A minimum useful single-node deployment wants at least 8 GB RAM, and 16 GB is where it starts feeling comfortable. The FE (frontend coordinator) and BE (backend executor) are separate JVM processes with their own heap requirements.
Docker Compose (single-node development setup):
services: starrocks-fe: image: starrocks/fe-ubuntu:3.3.7 hostname: starrocks-fe ports: - "8030:8030" # HTTP - "9020:9020" # RPC - "9030:9030" # MySQL protocol volumes: - fe_data:/opt/starrocks/fe/meta environment: - HOST_TYPE=FQDN command: /opt/starrocks/fe/bin/start_fe.sh
starrocks-be: image: starrocks/be-ubuntu:3.3.7 hostname: starrocks-be ports: - "8040:8040" volumes: - be_data:/opt/starrocks/be/storage depends_on: - starrocks-fe command: > sh -c "sleep 10 && mysql -h starrocks-fe -P 9030 -u root -e 'ALTER SYSTEM ADD BACKEND \"starrocks-be:9050\";' && /opt/starrocks/be/bin/start_be.sh"
volumes: fe_data: be_data:Memory footprint: 4–6 GB for FE + BE combined at idle. In practice, give it a host with 16 GB free and don’t share it with heavy workloads.
The Numbers (Rough, Honest, Not a Press Release)
100M-row event table, single server, 8-core CPU, 16 GB RAM:
| Query | ClickHouse | DuckDB | StarRocks |
|---|---|---|---|
COUNT(*) | ~0.05s | ~0.3s | ~0.2s |
| Group by + count distinct (1 key) | ~0.4s | ~0.8s | ~0.5s |
| Group by + count distinct (3 keys) | ~0.9s | ~1.2s | ~0.8s |
| 3-table JOIN aggregate | ~2.1s | ~0.9s | ~0.6s |
| Filtered range scan + percentile | ~0.3s | ~0.6s | ~0.4s |
These are directional numbers pulled from public benchmarks (ClickBench, StarRocks public benchmarks, DuckDB docs). Your mileage will vary based on data distribution, query patterns, and whether you remembered to tune the JVM heap.
The takeaway: ClickHouse wins on raw aggregation speed. DuckDB wins on JOINs and flexibility. StarRocks wins on join-heavy OLAP workloads at concurrency.
Ecosystem: Grafana, Superset, and Friends
ClickHouse has a first-class Grafana plugin (grafana-clickhouse-datasource) that supports macros for time-series queries. Superset has a ClickHouse SQLAlchemy driver. It’s well-supported.
DuckDB doesn’t have a native Grafana datasource (because it’s not a server). You can use it as a backend for dbt, export to Parquet and serve from somewhere else, or wrap it in a lightweight FastAPI layer. It’s not a dashboarding target — it’s a processing engine.
StarRocks speaks MySQL protocol, so it works with any MySQL-compatible client. Grafana MySQL datasource, Superset’s MySQL connector, Metabase, TablePlus — all of them work. This is a significant practical advantage.
Update/Delete Semantics — The Honest Version
ClickHouse: Deletes are mutations — they rewrite data parts asynchronously. Use ALTER TABLE DELETE WHERE ... sparingly. ReplacingMergeTree handles deduplication but not real updates. Plan your schema around append-only or version-based patterns.
DuckDB: Full ACID transactions with real UPDATE and DELETE. It’s a proper relational engine. This works exactly like you’d expect.
StarRocks: Primary Key tables support true upserts and deletes with ACID guarantees. Much better than ClickHouse for mutable data. This is one of StarRocks’ real differentiators.
Pick Your Fighter
Home Lab Metrics Dashboard (Loki Replacement)
Use ClickHouse.
You’re ingesting time-series log events, querying recent windows, running aggregations. Async inserts handle bursty ingest from Vector or Fluentd. Materialized views pre-aggregate the heavy stuff. The Grafana plugin is solid. Single-node runs fine on 4 GB RAM.
# Ship logs from Vector to ClickHouse# In your vector.toml:[sinks.clickhouse]type = "clickhouse"inputs = ["my_source"]endpoint = "http://clickhouse:8123"database = "logs"table = "events"compression = "gzip"auth.strategy = "basic"auth.user = "default"auth.password = "yourpassword"One-Off “What Does This CSV/Parquet Look Like”
Use DuckDB.
It’s a binary. You don’t need a server. You don’t need Docker. You need answers now.
duckdb -c " SELECT event_type, count(*) AS n FROM read_csv_auto('/tmp/events.csv') GROUP BY event_type ORDER BY n DESC LIMIT 10"Run it. Get your answer. Close the terminal. Done.
Multi-Source Dashboarding With Joins
Use StarRocks.
You’ve got Iceberg tables in MinIO, a Postgres OLTP database, and you want to join them in real-time for a business dashboard. StarRocks handles this with external catalogs and its MySQL-compatible interface means your existing BI tooling works without modification.
Should You Bother?
Yes — but pick one and commit.
The failure mode here isn’t using the wrong tool, it’s running all three because you couldn’t decide. Your home lab doesn’t need a three-engine OLAP stack. Pick the use case that actually matters to you right now:
- Metrics and logs you want in Grafana: ClickHouse, full stop. It’s mature, well-documented, and the operational overhead on a single node is minimal.
- Data exploration and ETL scripting: DuckDB. Install it in two minutes, use it everywhere, pay zero ongoing costs.
- Analytical dashboards across multiple data sources with real update requirements: StarRocks. Accept that it wants more RAM, and you’ll get MySQL-compatible OLAP that handles joins properly.
All three are genuinely good at what they do. The key is knowing which problem you have before you pick the hammer.
Your 2 AM self will appreciate the clarity.