DuckDB for Local Analytics: Pandas on Steroids

You Don’t Need a Data Lake for 50 Million Rows

Somewhere along the way, “analytics” became synonymous with “spin up a Kafka cluster, a Spark fleet, and a BI platform that costs more than your mortgage.” You’ve got a 5 GB CSV from your Frigate camera events. You want to know which hour of the day gets the most motion triggers. Someone in a conference room is about to suggest you “build a data lake.”

Don’t let them.

DuckDB is what happens when the SQLite team looks at the analytics world and says, “this is embarrassing.” It’s a single binary, in-process columnar OLAP database that reads CSV, Parquet, JSON, and Arrow directly—no import, no server, no YAML manifests for a Kubernetes operator. You run a SQL query against a file on disk and get an answer in seconds.

Let’s get into it.

What DuckDB Actually Is

Think SQLite, but built for the query patterns that SQLite hates: full table scans, group-bys, aggregations, window functions over 50 million rows. SQLite is row-oriented and optimized for transactional workloads—lots of small reads and writes to individual records. DuckDB is column-oriented, which means when you SELECT sum(amount) FROM sales, it only reads the amount column from disk. The other 40 columns don’t exist for that query.

They’re complements, not competitors. The DuckDB team will tell you this themselves. Use SQLite for your application’s operational data. Use DuckDB when you need to ask questions about that data at scale.

DuckDB 1.x (the current stable line as of 2026) ships with:

Vectorized execution engine
Parallel query execution out of the box
Native Parquet, CSV, JSON, and Arrow support
Extensions for S3/R2/GCS (httpfs), Iceberg, Delta Lake, and more
Python, Node, Rust, Go, and Java bindings
A dead-simple CLI

Zero servers. Zero configuration. One file or no file at all.

Installing It

Three reasonable ways depending on your setup:

# Python ecosystem (recommended for data work)
pipx install duckdb

# macOS / Linux with Homebrew
brew install duckdb

# Or just grab the static binary
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
sudo mv duckdb /usr/local/bin/

Then pop into the CLI:

duckdb mydb.duckdb

No database file? No problem—just run duckdb with no arguments and you get an in-memory session. Perfect for one-off queries against files you’ll never import.

Query a CSV Without Importing It

This is where DuckDB earns its keep immediately. You’ve got nginx_access.csv sitting there, maybe 2 million rows. In the old workflow you’d load it into pandas, wait, curse the memory usage, then ask your question.

In DuckDB:

SELECT
  strftime(timestamp, '%H') AS hour,
  count(*) AS hits,
  count(*) FILTER (WHERE status >= 500) AS errors
FROM read_csv_auto('nginx_access.csv')
GROUP BY hour
ORDER BY hits DESC;

DuckDB reads only the columns it needs, streams the file, and projects away everything else. read_csv_auto sniffs the schema automatically—types, delimiter, header row, the works. If it guesses wrong you can override with read_csv('file.csv', delim='|', header=true, columns={'ts': 'TIMESTAMP', 'ip': 'VARCHAR'}).

You can also just drop the function call entirely and use the filename as a table:

SELECT * FROM 'big_file.csv' WHERE response_time_ms > 2000 LIMIT 20;

That’s it. That’s the whole import story.

Parquet: DuckDB’s Native Language

If CSV is duct tape, Parquet is load-bearing steel. DuckDB treats Parquet as a first-class citizen with full predicate pushdown—it’ll skip entire row groups that don’t match your WHERE clause without reading them from disk.

-- Local Parquet
SELECT
  customer_id,
  sum(amount) AS total_spent
FROM read_parquet('sales_2026.parquet')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;

-- S3/R2 with httpfs extension
INSTALL httpfs;
LOAD httpfs;

SET s3_region = 'auto';
SET s3_endpoint = 'your-account.r2.cloudflarestorage.com';
SET s3_access_key_id = 'YOUR_KEY';
SET s3_secret_access_key = 'YOUR_SECRET';

SELECT
  customer_id,
  sum(amount) AS total_spent
FROM 's3://your-bucket/data/year=2026/*.parquet'
GROUP BY 1
ORDER BY 2 DESC;

That glob pattern on the S3 path? DuckDB handles partition pruning. If your Parquet files are partitioned by year/month and your WHERE clause filters on those columns, DuckDB skips files it doesn’t need entirely. It’s not magic—it’s just smart.

On a modern laptop, a 5 GB Parquet file with 50 million rows, running a group-by aggregation, finishes in roughly 2 seconds. Not “fast for Python” fast. Actually fast.

Want to see why? Run EXPLAIN ANALYZE on any query:

EXPLAIN ANALYZE
SELECT region, sum(revenue)
FROM read_parquet('sales.parquet')
GROUP BY region;

You’ll get a physical plan showing row counts, timing per operator, and where time is actually spent. It’s not beautiful output, but it tells you exactly what the engine decided to do.

Pandas Integration: Zero-Copy via Arrow

Here’s where the Python crowd perks up. DuckDB’s Python API can read from—and write to—pandas DataFrames directly via Apache Arrow. No serialization. The data doesn’t move.

import duckdb
import pandas as pd

# Your existing DataFrame
df_in = pd.read_csv("events.csv")

# Query it with SQL — DuckDB sees the variable by name
result = duckdb.sql("SELECT event_type, count(*) AS n FROM df_in GROUP BY 1").df()

print(result)

That df_in in the SQL string? DuckDB finds it in your Python scope automatically. No registration step, no .to_sql(), no SQLAlchemy URI. It just works.

Going the other way—DuckDB result back to pandas—is equally clean:

import duckdb

conn = duckdb.connect("analytics.duckdb")

# Returns a pandas DataFrame
df = conn.sql("""
  SELECT
    date_trunc('hour', ts) AS hour,
    avg(response_ms) AS p50_approx
  FROM read_parquet('logs/*.parquet')
  GROUP BY 1
  ORDER BY 1
""").df()

When does pandas still win? Interactive Jupyter cell-by-cell munging where you want to inspect intermediate state. Plotting—matplotlib and seaborn speak pandas natively. And for truly small datasets (under ~100K rows), the overhead difference is irrelevant and pandas is more ergonomic. DuckDB wins the moment you’re doing aggregations, joins, or window functions at scale.

JSON: Application Logs Without Preprocessing

Your application dumps JSON event logs. You want to ask questions about them without standing up Elasticsearch or hand-writing a parser.

-- Read a newline-delimited JSON log
SELECT
  json_extract_string(data, '$.user.email') AS email,
  json_extract_string(data, '$.event') AS event_type,
  count(*) AS occurrences
FROM read_json_auto('events.jsonl')
GROUP BY 1, 2
ORDER BY 3 DESC;

read_json_auto handles both JSON arrays and newline-delimited JSONL. For deeply nested structures, json_extract and json_extract_string let you pull specific paths. If the schema is consistent enough, DuckDB will auto-infer columns from the JSON structure and let you query them like regular columns—no extraction needed.

This is extremely useful for Frigate event logs, Home Assistant history exports, or any application that logs structured JSON.

Iceberg and Delta Lake (Because You Asked)

For the home labbers who’ve gone deep on modern table formats, DuckDB speaks both:

-- Apache Iceberg
INSTALL iceberg;
LOAD iceberg;

SELECT count(*), max(event_time)
FROM iceberg_scan('s3://your-bucket/warehouse/events/');

-- Delta Lake
INSTALL delta;
LOAD delta;

SELECT * FROM delta_scan('s3://your-bucket/delta/sales/')
WHERE year = 2026;

Both extensions handle snapshot isolation, schema evolution, and partition pruning. If you’re already running these table formats (maybe via Spark or Trino at work and you’ve exported some tables), DuckDB can query them directly with zero additional infrastructure.

Real Home Lab Use Cases

Let’s get concrete about where this actually earns its place on your server.

Nginx access logs as CSV. Drop a log_format csv directive in your nginx config, or just pipe the default access log through a quick transform, and suddenly every request is a queryable row. Find your top 10 slowest endpoints in seconds.

Frigate camera event database. Frigate stores events in SQLite. DuckDB can attach to it:

ATTACH '/data/frigate/frigate.db' AS frigate (TYPE SQLITE);

SELECT
  camera,
  label,
  count(*) AS events,
  avg(score) AS avg_confidence
FROM frigate.events
WHERE start_time > epoch(now() - INTERVAL 7 DAY)
GROUP BY camera, label
ORDER BY events DESC;

Yes, DuckDB can cross-join against a live SQLite file. Frigate stays running. You get analytics.

Jellyfin playback history. Same trick—Jellyfin uses SQLite. Attach it, query your own watch history, figure out which shows you actually finish versus abandon at episode 3.

Restic backup manifest comparison. Restic can export snapshot metadata as JSON. Dump two snapshots, diff them with DuckDB to find files that changed between backups.

Prometheus metrics dump. Export a metric time series as JSON or CSV, load into DuckDB, run window functions to find anomalies without PromQL gymnastics.

Embedding DuckDB: Quick Python ETL Script

Here’s a real pattern for ingesting JSON event logs and writing aggregated Parquet—the kind of lightweight ETL you’d run as a cron job:

import duckdb
from datetime import date

conn = duckdb.connect()  # in-memory

conn.execute("""
  CREATE TABLE events AS
  SELECT *
  FROM read_json_auto('/var/log/app/events.jsonl')
""")

conn.execute(f"""
  COPY (
    SELECT
      json_extract_string(data, '$.event_type') AS event_type,
      date_trunc('hour', CAST(json_extract_string(data, '$.ts') AS TIMESTAMP)) AS hour,
      count(*) AS n,
      avg(CAST(json_extract_string(data, '$.duration_ms') AS DOUBLE)) AS avg_duration_ms
    FROM events
    GROUP BY 1, 2
    ORDER BY 2, 1
  )
  TO '/data/analytics/events_{date.today()}.parquet'
  (FORMAT PARQUET, COMPRESSION ZSTD)
""")

print("Done.")

Run this nightly. After a month you’ve got 30 daily Parquet files you can query in aggregate:

SELECT event_type, sum(n) AS total
FROM read_parquet('/data/analytics/events_*.parquet')
GROUP BY 1
ORDER BY 2 DESC;

No Airflow. No Spark. No Kubernetes operator. Just a cron job and a folder.

DuckDB vs. Pandas: The Real Comparison

	DuckDB	Pandas
Execution	Lazy, vectorized, parallel	Eager, single-threaded by default
Memory	Predictable, spillable to disk	Loads everything upfront
SQL	Full ANSI SQL + window functions	`.groupby()`, `.merge()`, `.agg()`…
Schema enforcement	Strong typing	Dtype chaos (`object` columns, hello)
File I/O	Read CSV/Parquet/JSON natively, streaming	`pd.read_csv` loads entire file
Plotting	No native	Native matplotlib/seaborn integration
Jupyter UX	SQL strings in cells	Chainable operations feel natural

The short version: if you’re writing a for loop over a DataFrame to do an aggregation, stop. That’s a GROUP BY. Use DuckDB. If you’re doing interactive exploration where you want to see the DataFrame shape after each transformation and then plot it, pandas is still your friend.

Should You Bother?

Yes, if any of these describe you:

You’ve ever waited more than 30 seconds for a pandas operation on a file bigger than 1 GB
You have logs, exports, or dumps sitting on disk that you only query occasionally
You want to run SQL against files without loading a PostgreSQL or MySQL server
You already have Parquet files and want to query them without Spark
You’ve got multiple SQLite databases (Frigate, Jellyfin, Home Assistant) and want to join across them

No, if:

Your data fits in 10 MB and pandas feels fine
You need transactional writes—DuckDB is OLAP, not OLTP. Don’t build your app’s primary database on it
You need multi-user concurrent writes at high throughput. Use Postgres

DuckDB is the kind of tool that makes you feel slightly embarrassed about how much complexity you were tolerating before. It doesn’t replace your database—it replaces the 200-line Python script you wrote to answer a question you could have answered with two lines of SQL.

Install it. Query a CSV. You’ll understand in about 90 seconds.

DuckDB for Local Analytics: Pandas on Steroids

You Don’t Need a Data Lake for 50 Million Rows

What DuckDB Actually Is

Installing It

Query a CSV Without Importing It

Parquet: DuckDB’s Native Language

Pandas Integration: Zero-Copy via Arrow

JSON: Application Logs Without Preprocessing

Iceberg and Delta Lake (Because You Asked)

Real Home Lab Use Cases

Embedding DuckDB: Quick Python ETL Script

DuckDB vs. Pandas: The Real Comparison

Should You Bother?

Responses from around the web

Discussion

Related Posts

ClickHouse vs DuckDB vs StarRocks: Light OLAP

Adding NOT NULL on a Big Table Without Downtime

Postgres HA: Patroni + etcd + HAProxy

Dragonfly vs Redis: Single-Binary Performance

DuckDB for Local Analytics: Pandas on Steroids

You Don’t Need a Data Lake for 50 Million Rows

What DuckDB Actually Is

Installing It

Query a CSV Without Importing It

Parquet: DuckDB’s Native Language

Pandas Integration: Zero-Copy via Arrow

JSON: Application Logs Without Preprocessing

Iceberg and Delta Lake (Because You Asked)

Real Home Lab Use Cases

Embedding DuckDB: Quick Python ETL Script

DuckDB vs. Pandas: The Real Comparison

Should You Bother?

Related Reading

Responses from around the web

Discussion

Related Posts

ClickHouse vs DuckDB vs StarRocks: Light OLAP

Adding NOT NULL on a Big Table Without Downtime

Postgres HA: Patroni + etcd + HAProxy

Dragonfly vs Redis: Single-Binary Performance