Skip to content
Go back

ClickHouse for Self-Hosted Logs

By SumGuy 11 min read
ClickHouse for Self-Hosted Logs

Loki Was Great Until You Tried to Query 30 Days of Nginx Logs

You set up Loki. You felt good about it — lightweight, Prometheus-native, no JVM to wrestle with. Then six months later you tried to pull nginx access logs across 30 days to debug a weird spike, and Loki looked you dead in the eye and said “query timeout.”

Sound familiar?

Here’s the thing: Loki is purpose-built for label-indexed log streams. It’s not a general-purpose log store, and it was never pretending to be. But when your homelab starts eating 50–500 GB of logs per day — container stdout, syslog, nginx, Postgres slow query logs, whatever — you need something that can actually query all of it at scale, fast, without a dedicated cluster or a $2,000/month cloud bill.

Enter ClickHouse. It’s a columnar OLAP database that was originally built at Yandex to handle petabyte-scale analytics. It compresses your raw JSON logs by 10–20x, queries terabytes in seconds on commodity hardware, and runs happily on a single well-tuned box. It’s not a time-series DB, it’s not a log aggregator — it’s a data warehouse that happens to eat logs for breakfast.

Let’s wire it up.


Why ClickHouse Actually Makes Sense for Logs

The secret is the storage model. ClickHouse is a columnar store, which means each field in your log row lives in its own compressed byte stream on disk rather than packed row-by-row.

For logs, this is massive. A log line has maybe 15–20 fields. When you query WHERE status_code = 500, ClickHouse reads only the status_code column, skips everything else, and decompresses a tiny slice of data. Compare that to row-oriented stores (MySQL, Postgres, Elasticsearch in some respects) that drag entire rows off disk even if you only care about one field.

Then add the codecs. ClickHouse supports LZ4 and ZSTD compression per column. LZ4 is blazing fast with decent compression. ZSTD is slower but pushes compression ratios much higher. On raw nginx JSON logs, expect:

In practice, 100 GB of raw logs becomes 6–10 GB on disk. A terabyte of logs fits comfortably on a 100 GB NVMe partition. This is not a gimmick — it’s the reason ClickHouse Cloud sells observability tiers that beat Datadog pricing by 80%.

For queries: a SELECT count(), avg(response_time) FROM logs WHERE level = 'ERROR' AND timestamp > now() - INTERVAL 7 DAY on 500M rows typically runs in under 2 seconds on an 8-core machine with an NVMe drive. Elasticsearch would need 30 seconds and a prayer.

The engine making this work is MergeTree — ClickHouse’s primary storage engine, purpose-built for append-heavy workloads with time-range queries. Data is written in small sorted parts, merged in the background, and indexed with sparse primary key indexes that make range scans brutally efficient.


Schema Design 101: Get This Right Before You Ingest Anything

The schema is where most people shoot themselves in the foot. Take five minutes to think it through.

create_table.sql
CREATE TABLE logs
(
-- Primary timestamp — always first in ORDER BY
timestamp DateTime64(3, 'UTC'),
-- Source identity
host LowCardinality(String),
service LowCardinality(String),
level LowCardinality(String), -- info, warn, error, debug
-- High-cardinality fields
message String CODEC(ZSTD(3)),
trace_id String,
span_id String,
-- HTTP-specific (nullable for non-HTTP sources)
http_method LowCardinality(String),
http_status UInt16,
http_path String CODEC(ZSTD(3)),
response_time Float32,
-- Catch-all for everything else
attributes Map(String, String) CODEC(ZSTD(3))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service, level, timestamp)
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

Key decisions explained:

LowCardinality(String) — Use this for fields with under ~10,000 unique values (hostnames, services, log levels, HTTP methods). ClickHouse uses dictionary encoding under the hood. Storage shrinks dramatically, and filters on these columns are 2–5x faster.

ORDER BY (service, level, timestamp) — This is your primary sort key and determines query performance. Put the fields you filter most first. Timestamp alone is the naive choice — adding service and level means “give me all errors from nginx in the last hour” hits a tiny fraction of the data.

PARTITION BY toYYYYMM(timestamp) — Monthly partitions let you drop old data instantly with ALTER TABLE logs DROP PARTITION '202501' instead of slow row-by-row deletes. Also helps cold storage tiering later.

TTL timestamp + INTERVAL 90 DAY — ClickHouse will automatically expire and delete rows older than 90 days. Set it and forget it.

CODEC(ZSTD(3)) on message and http_path — These are your biggest, most compressible columns. ZSTD(3) adds a bit of CPU on ingest but cuts disk usage roughly in half versus LZ4 on free-text fields.


Ingest Layer: Vector Does the Heavy Lifting

You need something to collect logs from all your sources and ship them to ClickHouse. Vector is the right tool here — it’s written in Rust, uses ~30 MB of RAM idle, handles backpressure correctly, supports ClickHouse as a native sink, and doesn’t require a JVM tax. If you’re already running an OpenTelemetry Collector, there’s a ClickHouse exporter — but Vector is simpler for most homelabs.

vector.yaml
sources:
docker_logs:
type: docker_logs
include_containers: [] # empty = all containers
syslog_in:
type: syslog
mode: udp
address: "0.0.0.0:514"
transforms:
parse_logs:
type: remap
inputs: ["docker_logs", "syslog_in"]
source: |
# Normalize timestamp
.timestamp = to_timestamp!(.timestamp ?? now())
# Tag with service name (use container name for Docker sources)
.service = .container_name ?? .host ?? "unknown"
.host = .hostname ?? get_hostname!()
# Normalize level
.level = downcase(string!(.severity ?? .level ?? "info"))
# Everything else that doesn't map cleanly goes to attributes
.attributes = {}
if exists(.http_status) {
.attributes.http_status = to_string!(.http_status)
}
sinks:
clickhouse_out:
type: clickhouse
inputs: ["parse_logs"]
endpoint: "http://clickhouse:8123"
database: default
table: logs
compression: gzip
encoding:
timestamp_format: unix
# Batch for efficiency — don't send one row at a time
batch:
max_bytes: 10485760 # 10 MB
max_events: 50000
timeout_secs: 5
buffer:
type: disk
max_size: 268435456 # 256 MB on-disk buffer for backpressure
when_full: block

The disk buffer is important. If ClickHouse restarts or is slow, Vector will queue logs to disk instead of dropping them. On a homelab, this saves your sanity at 2 AM when ClickHouse OOMs on a bad query.


Full Compose Stack

docker-compose.yml
services:
clickhouse:
image: clickhouse/clickhouse-server:24.3
container_name: clickhouse
volumes:
- clickhouse_data:/var/lib/clickhouse
- ./clickhouse-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro
ports:
- "8123:8123" # HTTP interface (Grafana, Vector)
- "9000:9000" # Native protocol
environment:
CLICKHOUSE_DB: default
CLICKHOUSE_USER: default
CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: 1
ulimits:
nofile:
soft: 262144
hard: 262144
vector:
image: timberio/vector:0.38.0-alpine
container_name: vector
volumes:
- ./vector.yaml:/etc/vector/vector.yaml:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- vector_buffer:/var/lib/vector
depends_on:
- clickhouse
command: ["--config", "/etc/vector/vector.yaml"]
grafana:
image: grafana/grafana:11.0.0
container_name: grafana
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
environment:
GF_INSTALL_PLUGINS: grafana-clickhouse-datasource
# Fake log producer for testing
log_generator:
image: mingrammer/flog
container_name: flog
command: ["-f", "json", "-d", "100ms", "-l"]
volumes:
clickhouse_data:
vector_buffer:
grafana_data:

Spin it up, run the CREATE TABLE statement above, and you’re ingesting within 60 seconds.

Terminal window
docker compose up -d
docker exec -it clickhouse clickhouse-client
# Paste the CREATE TABLE statement, then:
SELECT count() FROM logs;

Grafana: The Altinity Plugin

Grafana doesn’t ship with a ClickHouse datasource by default. Install the official grafana-clickhouse-datasource plugin (the one from Grafana Labs, not the older Altinity one — both work, the official one is better maintained as of 2024).

Point it at http://clickhouse:8123, database default, no auth if you’re just running local.

For a log explorer panel, use this query as a starting point:

grafana_logs_query.sql
SELECT
timestamp,
level,
service,
message
FROM logs
WHERE
timestamp BETWEEN {dateFrom:DateTime} AND {dateTo:DateTime}
AND ($level = 'all' OR level = $level)
AND service ILIKE {service:String}
ORDER BY timestamp DESC
LIMIT 1000

ClickHouse’s Grafana plugin supports template variables and time range macros, so your dashboards can filter by service, level, and time range just like you’d expect. The log line rendering isn’t as polished as Grafana Loki’s built-in explore view, but it’s usable and infinitely more queryable.

Add a time-series panel for error rate:

grafana_error_rate.sql
SELECT
toStartOfMinute(timestamp) AS minute,
countIf(level = 'error') AS errors,
count() AS total
FROM logs
WHERE timestamp >= now() - INTERVAL 24 HOUR
GROUP BY minute
ORDER BY minute

ClickHouse vs. The Alternatives: Honest Comparison

vs. Loki

Loki is a streaming log aggregator. It indexes labels (key=value pairs at stream level), not log content. This is great when you have 20 services with predictable labels. It’s terrible when you need to grep across full log text, filter by arbitrary fields, or run aggregations. Loki’s query language (LogQL) is also… an acquired taste.

ClickHouse indexes nothing by default at the row level — it relies on column sorting and sparse primary key indexes. This means full-text search is a table scan, which is fine when you add AND service = 'nginx' to reduce the scan scope. For pure log tailing, Loki wins. For analytics over historical logs, ClickHouse wins badly.

vs. Elasticsearch / OpenSearch

Elasticsearch builds an inverted Lucene index for every field by default. This is why it’s great at full-text search and why it eats 5–8x more disk than ClickHouse for the same data. The JVM overhead is real: a production Elasticsearch node wants 8–16 GB of heap just to start. ClickHouse on the same hardware uses 2 GB and goes faster.

ELK makes sense if you need real-time full-text search across unstructured log content (security logs, user-generated content). For structured logs with known schemas, it’s overkill by a large margin.

vs. SigNoz

SigNoz is an OpenTelemetry-native observability platform that uses ClickHouse under the hood. If you want traces + metrics + logs in one UI without wiring anything together yourself, SigNoz is the answer. If you already have Grafana and just need logs storage, ClickHouse-direct saves you the overhead and gives you more control over the schema.

vs. Quickwit

Quickwit is a cloud-native search engine purpose-built for logs, also written in Rust, with S3 as primary storage. It’s genuinely interesting for cloud-first setups where you want to index logs in object storage at minimal cost. But it’s newer, the ecosystem is smaller, and for a local NVMe-backed homelab, ClickHouse is more mature and better documented.


Production Tips (Even for Homelabs)

Separate disk for ClickHouse data. Give it a dedicated NVMe or at least a separate partition. ClickHouse is merge-heavy on disk I/O during compaction. If it shares a disk with your OS or other services, you’ll notice.

Enable async_insert for high-volume ingest. Instead of ClickHouse blocking on every insert, async inserts buffer in memory and flush in batches. Reduces lock contention significantly under heavy load.

async_insert_settings.sql
-- Set per connection or per user
SET async_insert = 1;
SET wait_for_async_insert = 0;
SET async_insert_max_data_size = 10000000;
SET async_insert_busy_timeout_ms = 5000;

Materialized views for pre-aggregated metrics. If you’re repeatedly running the same hourly error count query, make it a materialized view that updates incrementally on insert. Queries drop from seconds to milliseconds.

materialized_view.sql
CREATE MATERIALIZED VIEW logs_error_hourly
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (service, hour)
AS SELECT
service,
toStartOfHour(timestamp) AS hour,
countIf(level = 'error') AS errors,
count() AS total
FROM logs
GROUP BY service, hour;

Tiered storage for cold logs. ClickHouse supports S3-backed cold storage with automatic tiering. Hot data (last 7 days) stays on local NVMe. Older data moves to S3/Backblaze automatically. Configure in clickhouse-config.xml under <storage_configuration>. This is how you store a year of logs for single-digit dollars.

ZSTD on everything. Default is LZ4. Switch your message and attributes columns to ZSTD(3) at table creation time. The CPU overhead on ingest is negligible compared to the disk savings.


The Bottom Line

For a homelab dumping 50–500 GB/day of logs, a single 8-core box with an NVMe will swallow it without sweating. ClickHouse will compress your logs to 5–10% of their original size, let you query months of history in seconds, and cost you nothing beyond the hardware you already own.

Loki is still great for log tailing and streaming alert rules. Keep it if you use it. But if you’ve ever watched a Loki range query spin for two minutes before timing out, or hit the “too many label values” wall, you know the feeling. ClickHouse is the answer you were looking for the second time you hit that wall.

Wire up Vector, create a sane schema, and watch 30-day queries come back in under three seconds. Your future self — the one debugging a production incident at 2 AM — will owe you a beer.


Share this post on:

Send a Webmention

Written about this post on your own site? Send a webmention and it'll show up above once verified.


Next Post
iperf3 + nload: Network Diagnosis

Discussion

Powered by Garrul . Sign in with GitHub or Google, or post anonymously.

Related Posts