SMART Disk Monitoring with smartmontools

Disks Fail. The Question Is Whether You’ll Know in Time.

Every hard drive and SSD you own will fail eventually. Not metaphorically. Physically. And when it does, you want to know about it, ideally before your NAS starts rebuilding a RAID array at 3 AM, or worse, before you discover a silent data loss in your backup.

That’s where SMART comes in.

SMART stands for Self-Monitoring, Analysis, and Reporting Technology. It’s been baked into every modern drive for decades. Your drive is constantly measuring things: temperature, seek errors, sector reallocations, command timeouts. The problem? Most monitoring setups are completely useless. Your NAS tells you “SMART OK” and you assume everything’s fine. It’s not. It’s lying to you.

This guide shows you how to read what your drives are actually saying, configure smartmontools to actually catch failures before they wreck you, and integrate that data into your monitoring stack so you can sleep at night.

What SMART Actually Measures (And Why “OK” Doesn’t Mean OK)

SMART is an old standard. It predates SSDs. It was designed by drive manufacturers to tell you (the user) that a drive is about to die, not to give you deep insight into drive health. This is important. SMART status is binary: PASSED or FAILED. That “OK” badge you see is just the PASSED state. It tells you almost nothing.

Here’s the trap: a drive can have several hundred reallocated sectors and still report “OK.” It can be losing sectors in real time and report “OK.” The SMART FAILED state is more like a dead-man’s switch; by the time it trips, you’ve usually got hours to days before total failure, not weeks or months of warning.

Backblaze, the cloud backup company, analyzed petabytes of real drive telemetry. They found that specific SMART attributes correlate with failure rates. Most attributes? Useless noise. The ones that matter, the ones that actually predict failure, are:

Reallocated_Sector_Count (5): The drive found bad sectors and moved the data to a spare pool. One or two reallocations might be normal wear. More than that? Your drive is degrading.
Current_Pending_Sector (197): Sectors the drive suspects are bad but hasn’t reallocated yet. These will become reallocated sectors. If this number is rising, your drive is failing.
Offline_Uncorrectable (198): Sectors the drive can’t read even offline. Game over. This should always be zero.
Reported_Uncorrectable_Errors (187): Drive firmware couldn’t correct errors on read. Should be zero.
Command_Timeout (188): Drive didn’t respond to a command in time. A few timeouts over months? Meh. Dozens in a week? Replace the drive.

Everything else, like Power_On_Hours, Temperature (within normal ranges), and Spin-up time, is mostly decorative. Your 5-year-old drive running at 45°C is fine. Power-on hours don’t kill drives; degradation does.

Getting Started with smartctl

smartmontools gives you two tools: smartctl for one-off queries, and smartd for continuous monitoring. Start with smartctl to get comfortable reading your drives.

Basic Commands

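# Dump everything: health verdict, attribute table, error and self-test logs
# (use -H instead if you only want the PASSED/FAILED verdict)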
smartctl -a /dev/sda

# Get detailed info and firmware
smartctl -i /dev/sda

# Run a quick self-test (usually 1-2 mins)
smartctl -t short /dev/sda

# Run the long test (takes 2+ hours)
smartctl -t long /dev/sda

# Check self-test results
smartctl -l selftest /dev/sda

The -a flag (all) is your main weapon. It dumps the whole SMART table: current values, worst values, thresholds, and raw values. Read it top to bottom. For the attributes that matter, the RAW_VALUE column is the one to watch: anything non-zero there deserves a closer look.
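To make that reading step less manual, here is a minimal sketch that filters the attribute table down to the five IDs discussed earlier and prints only those with a non-zero raw value. The function name nonzero_critical_attrs is made up for this example, and it assumes the classic ATA table layout where the raw value sits in the tenth column. Some drives pack extra data into the raw field, so treat a hit as a prompt to look closer rather than a verdict.

```shell
#!/bin/sh
# Sketch: read `smartctl -A` output on stdin and print ID, name and raw value
# for the failure-predicting attributes (5, 187, 188, 197, 198) that are non-zero.
# Assumes the standard ATA table layout: column 1 = ID, column 10 = RAW_VALUE.
nonzero_critical_attrs() {
  awk '$1 ~ /^(5|187|188|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}
```

Typical use would be sudo smartctl -A /dev/sda | nonzero_critical_attrs, where empty output means none of the five attributes has a non-zero raw value.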

For NVMe drives (the -x flag is your friend):

# NVMe-specific details
smartctl -x /dev/nvme0n1

NVMe attributes are different. Look for:

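Critical_Warning: Should be 0. Anything else means the drive has flagged a serious condition such as low spare capacity, overheating, degraded reliability, or read-only mode.
Available_Spare: How much spare capacity is left. SSDs use these spare blocks to replace worn-out ones. Below 10%? You’re getting close.
Media_Errors: Unrecovered errors on the flash cells. This is a counter, so it never goes down; it should be zero or at least stable, not climbing.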
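Those three checks can be sketched as a small filter over smartctl’s NVMe output. The function name nvme_health_warnings is hypothetical. The field labels ("Critical Warning", "Available Spare", "Media and Data Integrity Errors") are an assumption about how smartctl prints the NVMe health log, so compare them with your own output first. The 10% cutoff is simply the rule of thumb from the list above.

```shell
#!/bin/sh
# Sketch: read `smartctl -x` (or -A) NVMe output on stdin and print one line
# per problem found. Field labels are assumed to match smartctl's usual NVMe
# health section; the 10% spare cutoff is the rule of thumb from the text.
nvme_health_warnings() {
  awk -F': *' '
    $1 == "Critical Warning" && $2 != "0x00" { print "critical warning set: " $2 }
    $1 == "Available Spare" {
      spare = $2; sub(/%.*/, "", spare)
      if (spare + 0 < 10) print "available spare low: " spare "%"
    }
    $1 == "Media and Data Integrity Errors" && $2 + 0 > 0 { print "media errors: " $2 }
  '
}
```

Run it as sudo smartctl -x /dev/nvme0n1 | nvme_health_warnings. No output means none of the three checks tripped.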

Installing smartmontools

On most distros, it’s trivial:

# Debian/Ubuntu
sudo apt install smartmontools

# RHEL/Rocky/CentOS
sudo dnf install smartmontools

# Arch
sudo pacman -S smartmontools

On macOS (if you’re doing this locally):

brew install smartmontools

After install, check whether smartd is already running:

sudo systemctl status smartd

If it’s not enabled, that’s fine. We’ll configure it properly next.

Setting Up smartd for Continuous Monitoring

smartctl is great for poking at a drive once. But you need something running 24/7 to catch degradation in real time. That’s smartd.

The config file is /etc/smartd.conf. Out of the box, it’s often commented out or pointing to all drives without useful alerts. Let’s fix that.

# Monitor /dev/sda with aggressive attribute monitoring
/dev/sda -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m admin@example.com -M exec /path/to/alert-script.sh

# NVMe drives (-M exec needs -m; <nomailer> runs the script without sending mail)
/dev/nvme0n1 -a -n standby,q -m <nomailer> -M exec /path/to/alert-script.sh

# No second /dev/sda line is needed for the logs: -a already includes -l error and -l selftest

Breaking this down:

-a = turn on the default set of checks: health status, attribute failures and changes, error and self-test logs, pending and uncorrectable sectors
-o on = turn on automatic offline testing
-S on = enable automatic attribute autosave
-n standby,q = skip the check if the drive is in standby, so polling doesn’t spin it up (q keeps the skipped checks out of the log)
-s (S/../.././02|L/../../6/03) = run short tests every day at 2 AM, long tests every Saturday at 3 AM
-W 4,45,50 = log temperature changes of 4°C or more, log when the drive reaches 45°C, send a warning at 50°C
-m admin@example.com = email alerts to this address (a placeholder; requires mail setup). Use -m <nomailer> to skip email and only run the script
-M exec /path/to/alert-script.sh = execute a custom script on alerts (only valid together with -m)
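The -s schedule is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day, weekday with 1 = Monday, hour). A rough way to sanity-check a schedule before trusting it is to build that string for a given time and match it yourself. The helper name schedule_fires is invented here, and the sketch assumes a date command that understands -d with a "YYYY-MM-DD HH:MM" argument (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# Sketch: would a smartd -s schedule fire a given test type at a given time?
# smartd compares the regex against a string of the form T/MM/DD/d/HH, where
# T is the test type (S = short, L = long) and d is the weekday (1 = Monday).
# Assumes a `date` that accepts -d "YYYY-MM-DD HH:MM" (GNU coreutils, BusyBox).
schedule_fires() {
  regex=$1 kind=$2 when=$3
  stamp=$(date -d "$when" '+%m/%d/%u/%H') || return 2
  printf '%s/%s\n' "$kind" "$stamp" | grep -Eq "^${regex}\$"
}
```

For example, schedule_fires '(S/../.././02|L/../../6/03)' L '2024-03-16 03:00' succeeds because that date is a Saturday, while the same call with a Sunday date fails.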

The script part is where the real magic happens. Email often doesn’t work in home labs (no MTA). Instead, use an exec script to send alerts to your monitoring system.

Example alert script:

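#!/bin/bash
# smartd passes alert details in SMARTD_* environment variables,
# not as positional arguments
DEVICE="${SMARTD_DEVICE:-unknown}"
MESSAGE="${SMARTD_MESSAGE:-no message}"
FAILTYPE="${SMARTD_FAILTYPE:-unknown}"

# Send to syslog so systemd-journald picks it up
logger -t smartd-alert -p user.warning "[$DEVICE] ($FAILTYPE) $MESSAGE"

# Or send to a webhook/Prometheus pushgateway
# Use the short name (sda, not /dev/sda) so the slashes don't break the URL path
curl -s -X POST "http://localhost:9091/metrics/job/smartd/instance/${DEVICE##*/}" \
  --data-binary @- << EOF
# HELP smartd_alert_count Number of SMART alerts
# TYPE smartd_alert_count counter
smartd_alert_count{device="${DEVICE}",failtype="${FAILTYPE}"} 1
EOF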
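smartd tells an -M exec script what kind of problem triggered the alert through the SMARTD_FAILTYPE environment variable. If you want louder log entries for the scarier categories, one option is a small lookup like the sketch below. The function name failtype_priority and the grouping of categories into priorities are illustrative choices for this example, not something smartd defines.

```shell
#!/bin/sh
# Sketch: pick a syslog priority from smartd's SMARTD_FAILTYPE value.
# The grouping below is an illustrative choice, not something smartd defines.
failtype_priority() {
  case "$1" in
    Health)
      echo user.crit ;;
    CurrentPendingSector|OfflineUncorrectableSector|SelfTest|ErrorCount)
      echo user.err ;;
    EmailTest)
      echo user.info ;;
    *)
      echo user.warning ;;
  esac
}
```

Inside an alert script it could be used as logger -t smartd -p "$(failtype_priority "$SMARTD_FAILTYPE")" "$SMARTD_MESSAGE".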

Start smartd:

sudo systemctl enable smartd
sudo systemctl start smartd
sudo systemctl status smartd

Check the logs:

sudo journalctl -u smartd -f

Automating Tests with Cron

smartd can handle scheduled tests, but for more control, run them via cron. This is useful if you want to stagger tests across multiple drives so they don’t all spin up at once (and cause a power spike).

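# These entries include a user field (root), so they belong in /etc/crontab
# or a file under /etc/cron.d/, not in a personal crontab -e

# Run short test on /dev/sda at 1 AM daily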
0 1 * * * root smartctl -t short /dev/sda

# Run long test on /dev/sdb every Sunday at 2 AM
0 2 * * 0 root smartctl -t long /dev/sdb

# Log SMART status to a file every 6 hours
0 */6 * * * root smartctl -a /dev/sda >> /var/log/smartctl-sda.log

Then read that log with something like:

# Show only reallocated sectors and pending sectors
grep -E "Reallocated_Sector|Current_Pending" /var/log/smartctl-sda.log
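Since the log is just repeated snapshots appended one after another, a trend is easier to spot if you pull out one attribute’s raw value per snapshot. The sketch below does that and adds a one-word verdict. The name attr_trend is hypothetical, the attribute name must be spelled the way smartctl prints it (for example Reallocated_Sector_Ct), and the raw value is again assumed to be the tenth column of the ATA table.

```shell
#!/bin/sh
# Sketch: print the raw value of one attribute from every snapshot in a log
# made by appending `smartctl -a` output, oldest first, then report whether
# the newest value is higher than the oldest. Raw value = column 10 (ATA table).
attr_trend() {
  log=$1 attr=$2
  awk -v attr="$attr" '
    $2 == attr { v = $10 + 0; if (!seen++) first = v; last = v; print v }
    END {
      if (!seen) { print "no data"; exit 2 }
      print (last > first ? "rising" : "not rising")
    }
  ' "$log"
}
```

For instance, attr_trend /var/log/smartctl-sda.log Current_Pending_Sector prints one number per logged snapshot followed by "rising" or "not rising".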

Integrating with Prometheus

If you’re running Prometheus (for a home lab this is overkill, but mention-worthy), use the smartctl_exporter:

# Build and install the Prometheus smartctl exporter (needs Go and make)
git clone https://github.com/prometheus-community/smartctl_exporter
cd smartctl_exporter
make build
sudo cp ./smartctl_exporter /usr/local/bin/

Set up a systemd service:

[Unit]
Description=Prometheus smartctl exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/smartctl_exporter
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Add to your Prometheus config:

scrape_configs:
  - job_name: 'smartctl'
    static_configs:
      - targets: ['localhost:9633']

Now you can graph SMART attributes over time and set alerts when reallocated sectors or pending sectors climb.
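Before writing dashboards or alert rules, it helps to see which metrics the exporter actually serves on your machine. The sketch below lists the distinct metric names from Prometheus text-format output. The function name smartctl_metric_names is invented, and the smartctl_ prefix is an assumption about how the exporter names its metrics, so adjust the pattern if your output looks different.

```shell
#!/bin/sh
# Sketch: list the distinct metric names an exporter serves, given its
# Prometheus text-format output on stdin. The smartctl_ prefix is an assumption
# about how smartctl_exporter names its metrics; adjust it if yours differ.
smartctl_metric_names() {
  grep -E '^smartctl_' | sed 's/[{ ].*//' | sort -u
}
```

With the exporter running on the port used above, curl -s http://localhost:9633/metrics | smartctl_metric_names gives you the names to build graphs and alerts from.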

What To Do When A Drive Starts Failing

You saw it coming. Maybe Current_Pending_Sector jumped from 0 to 47. Maybe Reallocated_Sector_Count started climbing. What now?

Don’t panic. You have time. That drive isn’t dead yet. But it will be.

Step 1: Verify It’s Really Failing

Run the long test and wait for results:

smartctl -t long /dev/sda
sleep 2h  # Wait for test; big drives can take much longer (smartctl -c /dev/sda shows the drive's estimate)
smartctl -x /dev/sda  # Check results

If the long test itself throws errors or the drive doesn’t complete the test, that’s a bad sign. The drive is struggling.
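If you would rather script that check than eyeball it, the self-test log can be reduced to a pass/fail answer. This is a minimal sketch with a made-up function name, latest_selftest_ok. It assumes the ATA self-test log layout, where the newest entry is the row starting with "# 1" and a clean run reads "Completed without error". NVMe drives print a different table, so it does not apply there as written.

```shell
#!/bin/sh
# Sketch: read `smartctl -l selftest` output on stdin and judge the newest
# entry (the "# 1" row on ATA drives). Exit 0 = completed without error,
# 1 = anything else (failure, aborted, still running), 2 = no entries found.
latest_selftest_ok() {
  awk '
    /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/); print; exit }
    END { if (!found) exit 2; exit (ok ? 0 : 1) }
  '
}
```

Usage would look like sudo smartctl -l selftest /dev/sda | latest_selftest_ok && echo "last test clean". The function also prints the row it judged so you can see the details.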

Step 2: Back Up Everything It Holds

If this drive is in a RAID array, stop here for a moment. You have options:

RAID 1 (mirror): The other drive has everything. You’re fine.
RAID 5 or 6: Get a replacement in and start the rebuild now, before a second drive fails. Yes, a rebuild is stressful for the remaining drives, but it’s better than hoping.
ZFS: If you’re running ZFS (Linux), use zpool replace to swap in a new drive. ZFS will resilver intelligently and you can watch it:

zpool replace poolname /dev/sda /dev/sdc  # Replace /dev/sda with /dev/sdc
zpool status -v  # Watch resilver progress

If the drive is a standalone backup or data drive, just copy everything off to another disk.

Step 3: Order a Replacement

Don’t wait. Buy the replacement drive now. Expect 3-7 business days. Your failing drive will probably last that long, but you don’t want to be surprised.

Step 4: Replace and Retest

Once the new drive arrives:

# For RAID (mdadm): mark the failing drive as failed and remove it from the array first
sudo mdadm /dev/md0 --fail /dev/sda --remove /dev/sda

# Shut down gracefully
sudo shutdown -h now

# Physically swap the drive
# Power back on

# For ZFS pools:
zpool replace poolname /dev/sda

# For RAID: add the new drive (check its device name first, it may have changed)
sudo mdadm /dev/md0 --add /dev/sda

# For standalone drives, just copy data back
rsync -av /backup/ /mnt/newdrive/
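For mdadm arrays, the kernel reports rebuild state in /proc/mdstat, where each array shows a member map such as [UU] (all members up) or [U_] (one slot missing or still rebuilding). A small filter over that file tells you which arrays are not yet whole. The function name degraded_md_arrays is hypothetical and the sketch assumes the usual mdN array naming.

```shell
#!/bin/sh
# Sketch: read /proc/mdstat-style text on stdin and print the name of every
# md array whose member map (e.g. [U_]) shows a missing or failed slot.
degraded_md_arrays() {
  awk '
    /^md[0-9]+ *:/ { name = $1 }
    name != "" && /\[[U_]*_[U_]*\]/ { print name }
  '
}
```

Run degraded_md_arrays < /proc/mdstat after adding the new drive. Once it prints nothing, every array is back to full strength.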

Run the long test on the new drive to make sure it’s healthy:

smartctl -t long /dev/sda

Common Gotchas

“SMART says OK, but the drive is failing.” SMART status is binary. It lags reality. Watch the attributes, not the status.

“I don’t see any SMART data.” Some systems require elevated privileges, or the drive doesn’t support SMART (rare). Try sudo smartctl -a /dev/sda. If you get “Unknown USB bridge” or “No SMART device” it might be behind a controller that doesn’t expose SMART data. For USB enclosures, forcing the device type with sudo smartctl -d sat -a /dev/sda sometimes gets through.

“smartd won’t start.” Check /etc/smartd.conf for syntax errors. Run sudo smartd -d to keep smartd in the foreground in debug mode and see what’s wrong, or sudo smartd -q onecheck to register the drives, run one check cycle, and exit.

“My NVMe drive shows no SMART data.” Some controllers don’t expose NVMe SMART over the standard interface. Try nvme smart-log /dev/nvme0n1 directly (from the nvme-cli package), or check if the drive manufacturer has their own monitoring tool.

“Reallocated sectors jumped overnight. Am I losing data?” No, not yet. The drive found bad sectors and moved the data to spares. You have days or weeks. Start the replacement process calmly. Panicking at 2 AM doesn’t help.

The Real Talk

SMART monitoring is boring. It’s the kind of thing you set up once and then ignore for years. That’s exactly when it’s working. The moment you see an alert about rising pending sectors or offline uncorrectable errors, you’ll be glad you bothered.

For a home lab or small NAS, this setup takes maybe 30 minutes:

Install smartmontools (apt install smartmontools)
Edit /etc/smartd.conf to monitor your drives and log to syslog
Enable smartd (systemctl enable smartd)
Set a cron job to run long tests weekly
Glance at journalctl -u smartd once a month

That’s it. You’re now catching disk failures weeks or months before they destroy your data. Your future self, the one at 3 AM when a drive dies, will thank you profusely.

Disks fail. But you’ll know when they’re about to.

Full example: If you’re running this on a Proxmox cluster or bare-metal Debian, the config above works once you swap in your own device names, alert address, and script path. Other systems (Unraid, TrueNAS, etc.) often have built-in SMART monitoring that’s already wired up, so check their docs first. Don’t reinvent the wheel there.
