You Don’t Need a Data Lake for 50 Million Rows
Somewhere along the way, “analytics” became synonymous with “spin up a Kafka cluster, a Spark fleet, and a BI platform that costs more than your mortgage.” You’ve got a 5 GB CSV from your Frigate camera events. You want to know which hour of the day gets the most motion triggers. Someone in a conference room is about to suggest you “build a data lake.”
Don’t let them.
DuckDB is what happens when the SQLite team looks at the analytics world and says, “this is embarrassing.” It’s a single binary, in-process columnar OLAP database that reads CSV, Parquet, JSON, and Arrow directly—no import, no server, no YAML manifests for a Kubernetes operator. You run a SQL query against a file on disk and get an answer in seconds.
Let’s get into it.
What DuckDB Actually Is
Think SQLite, but built for the query patterns that SQLite hates: full table scans, group-bys, aggregations, window functions over 50 million rows. SQLite is row-oriented and optimized for transactional workloads—lots of small reads and writes to individual records. DuckDB is column-oriented, which means when you SELECT sum(amount) FROM sales, it only reads the amount column from disk. The other 40 columns don’t exist for that query.
They’re complements, not competitors. The DuckDB team will tell you this themselves. Use SQLite for your application’s operational data. Use DuckDB when you need to ask questions about that data at scale.
DuckDB 1.x (the current stable line as of 2026) ships with:
- Vectorized execution engine
- Parallel query execution out of the box
- Native Parquet, CSV, JSON, and Arrow support
- Extensions for S3/R2/GCS (httpfs), Iceberg, Delta Lake, and more
- Python, Node, Rust, Go, and Java bindings
- A dead-simple CLI
Zero servers. Zero configuration. One file or no file at all.
Installing It
Three reasonable ways depending on your setup:
# Python ecosystem (recommended for data work)pipx install duckdb
# macOS / Linux with Homebrewbrew install duckdb
# Or just grab the static binarywget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zipunzip duckdb_cli-linux-amd64.zipchmod +x duckdbsudo mv duckdb /usr/local/bin/Then pop into the CLI:
duckdb mydb.duckdbNo database file? No problem—just run duckdb with no arguments and you get an in-memory session. Perfect for one-off queries against files you’ll never import.
Query a CSV Without Importing It
This is where DuckDB earns its keep immediately. You’ve got nginx_access.csv sitting there, maybe 2 million rows. In the old workflow you’d load it into pandas, wait, curse the memory usage, then ask your question.
In DuckDB:
SELECT strftime(timestamp, '%H') AS hour, count(*) AS hits, count(*) FILTER (WHERE status >= 500) AS errorsFROM read_csv_auto('nginx_access.csv')GROUP BY hourORDER BY hits DESC;DuckDB reads only the columns it needs, streams the file, and projects away everything else. read_csv_auto sniffs the schema automatically—types, delimiter, header row, the works. If it guesses wrong you can override with read_csv('file.csv', delim='|', header=true, columns={'ts': 'TIMESTAMP', 'ip': 'VARCHAR'}).
You can also just drop the function call entirely and use the filename as a table:
SELECT * FROM 'big_file.csv' WHERE response_time_ms > 2000 LIMIT 20;That’s it. That’s the whole import story.
Parquet: DuckDB’s Native Language
If CSV is duct tape, Parquet is load-bearing steel. DuckDB treats Parquet as a first-class citizen with full predicate pushdown—it’ll skip entire row groups that don’t match your WHERE clause without reading them from disk.
-- Local ParquetSELECT customer_id, sum(amount) AS total_spentFROM read_parquet('sales_2026.parquet')GROUP BY customer_idORDER BY total_spent DESCLIMIT 10;
-- S3/R2 with httpfs extensionINSTALL httpfs;LOAD httpfs;
SET s3_region = 'auto';SET s3_endpoint = 'your-account.r2.cloudflarestorage.com';SET s3_access_key_id = 'YOUR_KEY';SET s3_secret_access_key = 'YOUR_SECRET';
SELECT customer_id, sum(amount) AS total_spentFROM 's3://your-bucket/data/year=2026/*.parquet'GROUP BY 1ORDER BY 2 DESC;That glob pattern on the S3 path? DuckDB handles partition pruning. If your Parquet files are partitioned by year/month and your WHERE clause filters on those columns, DuckDB skips files it doesn’t need entirely. It’s not magic—it’s just smart.
On a modern laptop, a 5 GB Parquet file with 50 million rows, running a group-by aggregation, finishes in roughly 2 seconds. Not “fast for Python” fast. Actually fast.
Want to see why? Run EXPLAIN ANALYZE on any query:
EXPLAIN ANALYZESELECT region, sum(revenue)FROM read_parquet('sales.parquet')GROUP BY region;You’ll get a physical plan showing row counts, timing per operator, and where time is actually spent. It’s not beautiful output, but it tells you exactly what the engine decided to do.
Pandas Integration: Zero-Copy via Arrow
Here’s where the Python crowd perks up. DuckDB’s Python API can read from—and write to—pandas DataFrames directly via Apache Arrow. No serialization. The data doesn’t move.
import duckdbimport pandas as pd
# Your existing DataFramedf_in = pd.read_csv("events.csv")
# Query it with SQL — DuckDB sees the variable by nameresult = duckdb.sql("SELECT event_type, count(*) AS n FROM df_in GROUP BY 1").df()
print(result)That df_in in the SQL string? DuckDB finds it in your Python scope automatically. No registration step, no .to_sql(), no SQLAlchemy URI. It just works.
Going the other way—DuckDB result back to pandas—is equally clean:
import duckdb
conn = duckdb.connect("analytics.duckdb")
# Returns a pandas DataFramedf = conn.sql(""" SELECT date_trunc('hour', ts) AS hour, avg(response_ms) AS p50_approx FROM read_parquet('logs/*.parquet') GROUP BY 1 ORDER BY 1""").df()When does pandas still win? Interactive Jupyter cell-by-cell munging where you want to inspect intermediate state. Plotting—matplotlib and seaborn speak pandas natively. And for truly small datasets (under ~100K rows), the overhead difference is irrelevant and pandas is more ergonomic. DuckDB wins the moment you’re doing aggregations, joins, or window functions at scale.
JSON: Application Logs Without Preprocessing
Your application dumps JSON event logs. You want to ask questions about them without standing up Elasticsearch or hand-writing a parser.
-- Read a newline-delimited JSON logSELECT json_extract_string(data, '$.user.email') AS email, json_extract_string(data, '$.event') AS event_type, count(*) AS occurrencesFROM read_json_auto('events.jsonl')GROUP BY 1, 2ORDER BY 3 DESC;read_json_auto handles both JSON arrays and newline-delimited JSONL. For deeply nested structures, json_extract and json_extract_string let you pull specific paths. If the schema is consistent enough, DuckDB will auto-infer columns from the JSON structure and let you query them like regular columns—no extraction needed.
This is extremely useful for Frigate event logs, Home Assistant history exports, or any application that logs structured JSON.
Iceberg and Delta Lake (Because You Asked)
For the home labbers who’ve gone deep on modern table formats, DuckDB speaks both:
-- Apache IcebergINSTALL iceberg;LOAD iceberg;
SELECT count(*), max(event_time)FROM iceberg_scan('s3://your-bucket/warehouse/events/');
-- Delta LakeINSTALL delta;LOAD delta;
SELECT * FROM delta_scan('s3://your-bucket/delta/sales/')WHERE year = 2026;Both extensions handle snapshot isolation, schema evolution, and partition pruning. If you’re already running these table formats (maybe via Spark or Trino at work and you’ve exported some tables), DuckDB can query them directly with zero additional infrastructure.
Real Home Lab Use Cases
Let’s get concrete about where this actually earns its place on your server.
Nginx access logs as CSV. Drop a log_format csv directive in your nginx config, or just pipe the default access log through a quick transform, and suddenly every request is a queryable row. Find your top 10 slowest endpoints in seconds.
Frigate camera event database. Frigate stores events in SQLite. DuckDB can attach to it:
ATTACH '/data/frigate/frigate.db' AS frigate (TYPE SQLITE);
SELECT camera, label, count(*) AS events, avg(score) AS avg_confidenceFROM frigate.eventsWHERE start_time > epoch_ms(unixepoch('now', '-7 days') * 1000)GROUP BY camera, labelORDER BY events DESC;Yes, DuckDB can cross-join against a live SQLite file. Frigate stays running. You get analytics.
Jellyfin playback history. Same trick—Jellyfin uses SQLite. Attach it, query your own watch history, figure out which shows you actually finish versus abandon at episode 3.
Restic backup manifest comparison. Restic can export snapshot metadata as JSON. Dump two snapshots, diff them with DuckDB to find files that changed between backups.
Prometheus metrics dump. Export a metric time series as JSON or CSV, load into DuckDB, run window functions to find anomalies without PromQL gymnastics.
Embedding DuckDB: Quick Python ETL Script
Here’s a real pattern for ingesting JSON event logs and writing aggregated Parquet—the kind of lightweight ETL you’d run as a cron job:
import duckdbfrom datetime import date
conn = duckdb.connect() # in-memory
conn.execute(""" CREATE TABLE events AS SELECT * FROM read_json_auto('/var/log/app/events.jsonl')""")
conn.execute(f""" COPY ( SELECT json_extract_string(data, '$.event_type') AS event_type, date_trunc('hour', CAST(json_extract_string(data, '$.ts') AS TIMESTAMP)) AS hour, count(*) AS n, avg(CAST(json_extract_string(data, '$.duration_ms') AS DOUBLE)) AS avg_duration_ms FROM events GROUP BY 1, 2 ORDER BY 2, 1 ) TO '/data/analytics/events_{date.today()}.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)""")
print("Done.")Run this nightly. After a month you’ve got 30 daily Parquet files you can query in aggregate:
SELECT event_type, sum(n) AS totalFROM read_parquet('/data/analytics/events_*.parquet')GROUP BY 1ORDER BY 2 DESC;No Airflow. No Spark. No Kubernetes operator. Just a cron job and a folder.
DuckDB vs. Pandas: The Real Comparison
| DuckDB | Pandas | |
|---|---|---|
| Execution | Lazy, vectorized, parallel | Eager, single-threaded by default |
| Memory | Predictable, spillable to disk | Loads everything upfront |
| SQL | Full ANSI SQL + window functions | .groupby(), .merge(), .agg()… |
| Schema enforcement | Strong typing | Dtype chaos (object columns, hello) |
| File I/O | Read CSV/Parquet/JSON natively, streaming | pd.read_csv loads entire file |
| Plotting | No native | Native matplotlib/seaborn integration |
| Jupyter UX | SQL strings in cells | Chainable operations feel natural |
The short version: if you’re writing a for loop over a DataFrame to do an aggregation, stop. That’s a GROUP BY. Use DuckDB. If you’re doing interactive exploration where you want to see the DataFrame shape after each transformation and then plot it, pandas is still your friend.
Should You Bother?
Yes, if any of these describe you:
- You’ve ever waited more than 30 seconds for a pandas operation on a file bigger than 1 GB
- You have logs, exports, or dumps sitting on disk that you only query occasionally
- You want to run SQL against files without loading a PostgreSQL or MySQL server
- You already have Parquet files and want to query them without Spark
- You’ve got multiple SQLite databases (Frigate, Jellyfin, Home Assistant) and want to join across them
No, if:
- Your data fits in 10 MB and pandas feels fine
- You need transactional writes—DuckDB is OLAP, not OLTP. Don’t build your app’s primary database on it
- You need multi-user concurrent writes at high throughput. Use Postgres
DuckDB is the kind of tool that makes you feel slightly embarrassed about how much complexity you were tolerating before. It doesn’t replace your database—it replaces the 200-line Python script you wrote to answer a question you could have answered with two lines of SQL.
Install it. Query a CSV. You’ll understand in about 90 seconds.