Disks Fail. The Question Is Whether You’ll Know in Time.
Here’s the thing: every hard drive and SSD you own will fail eventually. Not metaphorically. Physically. And when it does, you want to know about it—ideally before your NAS starts rebuilding a RAID array at 3 AM, or worse, before you discover a silent data loss in your backup.
That’s where SMART comes in.
SMART stands for Self-Monitoring, Analysis, and Reporting Technology. It’s been baked into practically every drive for decades. Your drive is constantly measuring things: temperature, seek errors, sector reallocations, command timeouts. The problem? Most default monitoring setups only surface the overall verdict. Your NAS tells you “SMART OK” and you assume everything’s fine. It may not be. That badge hides most of what the drive is actually reporting.
This guide shows you how to read what your drives are actually saying, configure smartmontools to actually catch failures before they wreck you, and integrate that data into your monitoring stack so you can sleep at night.
What SMART Actually Measures (And Why “OK” Doesn’t Mean OK)
SMART is an old standard. It predates SSDs. It was designed by drive manufacturers to tell you (the user) that a drive is about to die, not to give you deep insight into drive health. This is important. SMART status is binary: PASSED or FAILED. That “OK” badge you see is just the PASSED state. It tells you almost nothing.
Here’s the trap: a drive can have several hundred reallocated sectors and still report “OK.” It can be losing sectors in real time and report “OK.” The SMART FAILED state is more like a dead-man’s switch—by the time it trips, you’ve usually got hours to days before total failure, not weeks or months of warning.
Backblaze, the cloud backup company, analyzed petabytes of real drive telemetry. They found that specific SMART attributes correlate with failure rates. Most attributes? Useless noise. The ones that matter—the ones that actually predict failure—are:
- Reallocated_Sector_Count (5): The drive found bad sectors and moved the data to a spare pool. One or two reallocations might be normal wear. More than that, or a count that keeps growing? Your drive is degrading. (smartctl prints this one as Reallocated_Sector_Ct.)
- Current_Pending_Sector (197): Sectors the drive suspects are bad but hasn’t reallocated yet. They either get reallocated or clear once they are rewritten successfully. If this number is rising, your drive is failing.
- Offline_Uncorrectable (198): Sectors the drive couldn’t read or correct during its offline scan. Game over for whatever was stored there. This should always be zero.
- Reported_Uncorrectable_Errors (187): Read errors the drive firmware couldn’t correct with ECC. Should be zero. (smartctl prints it as Reported_Uncorrect.)
- Command_Timeout (188): Drive didn’t respond to a command in time. A few timeouts over months? Meh. Dozens in a week? Replace the drive.
Everything else—Power_On_Hours, Temperature (within normal ranges), Spin-up time—is mostly decorative. Your 5-year-old drive running at 45°C is fine. Power-on hours don’t kill drives; degradation does.
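As a rough sketch of how to pull just those five attributes out of a drive’s report, the function below filters the attribute table that smartctl -A prints. The name flag_predictive_attrs is made up for this example, and it assumes the usual ten-column ATA layout where the raw value is the tenth column. Some vendors pack several counters into one raw value, so treat a hit as a prompt to look closer rather than a verdict.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# "<id> <name> <raw>" for each of the five predictive attributes whose
# raw value (assumed to be column 10) is non-zero.
flag_predictive_attrs() {
    awk '$1 ~ /^(5|187|188|197|198)$/ && ($10 + 0) > 0 { print $1, $2, $10 }'
}
```

Typical use would be sudo smartctl -A /dev/sda | flag_predictive_attrs. No output means all five raw values are zero.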
Getting Started with smartctl
smartmontools gives you two tools: smartctl for one-off queries, and smartd for continuous monitoring. Start with smartctl to get comfortable reading your drives.
Basic Commands
# Dump all SMART info: health status, attributes, error and self-test logs
smartctl -a /dev/sda
# Get identity info: model, serial number, firmware version
smartctl -i /dev/sda
# Run the short self-test (usually takes a couple of minutes)
smartctl -t short /dev/sda
# Run the long test (takes 2+ hours)
smartctl -t long /dev/sda
# Check test results
smartctl -x /dev/sda
The -a flag (all) is your main weapon. It dumps the whole SMART table: current values, thresholds, worst values. Read it top to bottom. For the attributes listed earlier, a non-zero raw value (and especially a rising one) is the warning sign.
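smartctl also summarizes what it found in its exit status, which is handy in scripts. The sketch below decodes that bitmask using the bit meanings described in the smartctl man page. The function name decode_smartctl_status is only illustrative, and the wording of each message is a paraphrase.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one line per bit set in a smartctl exit status.
decode_smartctl_status() {
    local status=$1
    local -a meanings=(
        "command line did not parse"
        "device open failed"
        "a SMART command failed or returned bad data"
        "SMART status reports DISK FAILING"
        "prefail attributes at or below threshold"
        "attributes were at or below threshold in the past"
        "device error log contains errors"
        "self-test log contains errors"
    )
    local bit
    for bit in 0 1 2 3 4 5 6 7; do
        if (( status & (1 << bit) )); then
            echo "bit ${bit}: ${meanings[bit]}"
        fi
    done
}
```

One way to use it: run sudo smartctl -a /dev/sda >/dev/null, then decode_smartctl_status $? on the next line. An exit status of 0 prints nothing.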
For NVMe drives (the -x flag is your friend):
# NVMe-specific details
smartctl -x /dev/nvme0n1
NVMe attributes are different. Look for:
- Critical_Warning: Should be 0. Any set bit flags a problem such as low spare capacity, overheating, degraded reliability, or the drive dropping to read-only.
- Available_Spare: Percentage of spare blocks left. The drive uses these to replace worn-out or failed flash blocks. Below 10%? You’re getting close.
- Media_Errors: Unrecovered data integrity errors on the flash. It’s a cumulative counter, so it should be zero or at least staying stable, not climbing.
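To make those three fields easier to script against, here is a minimal sketch that reduces smartctl’s NVMe health section to key=value pairs. It assumes the labels smartctl typically prints (Critical Warning, Available Spare, Media and Data Integrity Errors); the function name nvme_health_summary and the output keys are invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -x /dev/nvmeXnY` output on stdin and
# prints the three fields discussed above as key=value lines.
nvme_health_summary() {
    awk -F': *' '
        /^Critical Warning:/                { print "critical_warning=" $2 }
        /^Available Spare:/                 { sub(/%/, "", $2); print "available_spare_pct=" $2 }
        /^Media and Data Integrity Errors:/ { gsub(/,/, "", $2); print "media_errors=" $2 }
    '
}
```

Usage would look like sudo smartctl -x /dev/nvme0n1 | nvme_health_summary. The Available Spare Threshold line is deliberately ignored, since only the current spare percentage is reported here.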
Installing smartmontools
On most distros, it’s trivial:
# Debian/Ubuntu
sudo apt install smartmontools
# RHEL/Rocky/CentOS
sudo dnf install smartmontools
# Arch
sudo pacman -S smartmontools
On macOS (if you’re doing this locally):
brew install smartmontools
After install, check that smartd isn’t auto-running:
sudo systemctl status smartd
If it’s not enabled, that’s fine. We’ll configure it properly next.
Setting Up smartd for Continuous Monitoring
smartctl is great for poking at a drive once. But you need something running 24/7 to catch degradation in real time. That’s smartd.
The config file is /etc/smartd.conf. Out of the box, it’s often commented out or pointing to all drives without useful alerts. Let’s fix that.
# Monitor a SATA drive with aggressive attribute monitoring
/dev/sda -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,50 -m [email protected] -M exec /path/to/alert-script.sh
# NVMe drives (-M exec only works together with -m; <nomailer> means: run the script, send no mail)
/dev/nvme0n1 -a -n standby,q -m <nomailer> -M exec /path/to/alert-script.sh
# Error-log and self-test-log tracking (-l error, -l selftest) is already part of -a,
# so /dev/sda needs no second entry
Breaking this down:
- -a = monitor all attributes, plus health status and the error and self-test logs
- -o on = turn on automatic offline testing
- -S on = enable automatic attribute autosave
- -n standby,q = skip the check instead of spinning up a drive that’s in standby (q keeps the log quiet about it)
- -s (S/../.././02|L/../../6/03) = run short tests every day at 2 AM, long tests every Saturday at 3 AM
- -W 4,45,50 = log temperature changes of 4°C or more, warn at 45°C, critical at 50°C
- -m [email protected] = email alerts to this address (requires mail setup)
- -M exec /path/to/alert-script.sh = execute a custom script on alerts
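The -s argument is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH: test type, month, day of month, day of week (Monday is 1, Sunday is 7), and hour. The snippet below is a small sketch for sanity-checking a schedule in plain bash before it goes into smartd.conf. The helper name selftest_matches is invented here, and bash’s regex engine stands in for smartd’s own matching, so treat it as an approximation.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if a T/MM/DD/d/HH string matches a smartd -s
# schedule regex. Bash's extended regex support is used as a stand-in for
# smartd's matcher (an assumption of this sketch).
selftest_matches() {
    local regex=$1 candidate=$2
    [[ $candidate =~ ^(${regex})$ ]]
}
```

For example, selftest_matches '(S/../.././02|L/../../6/03)' 'L/06/01/6/03' succeeds (a long test on a Saturday at 03:00), while the same call with 'L/06/02/7/03' fails because day 7 is Sunday.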
The script part is where the real magic happens. Email often doesn’t work in home labs (no MTA). Instead, use an exec script to send alerts to your monitoring system.
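One detail worth knowing before writing that script: smartd describes each alert through environment variables such as SMARTD_DEVICE, SMARTD_MESSAGE, and SMARTD_FAILTYPE. The sketch below maps SMARTD_FAILTYPE values to syslog priorities suitable for logger -p. The mapping itself and the function name failtype_to_priority are this example’s assumptions, not anything smartd prescribes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a syslog priority for a SMARTD_FAILTYPE value.
# The severity assigned to each failure type is a judgment call for this sketch.
failtype_to_priority() {
    case "$1" in
        EmailTest)
            echo "user.info" ;;
        Temperature|Usage)
            echo "user.warning" ;;
        Health|FailedHealthCheck|SelfTest|ErrorCount|CurrentPendingSector|OfflineUncorrectableSector)
            echo "user.err" ;;
        *)
            # Unknown or read-failure types: warn rather than stay silent.
            echo "user.warning" ;;
    esac
}
```

Inside an alert script this could be used as logger -t smartd -p "$(failtype_to_priority "$SMARTD_FAILTYPE")" "$SMARTD_MESSAGE".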
Example alert script:
#!/bin/bash
# smartd passes alert details in environment variables, not positional arguments
DEVICE="${SMARTD_DEVICE:-unknown}"
MESSAGE="${SMARTD_MESSAGE:-}"
FAILTYPE="${SMARTD_FAILTYPE:-unknown}"
# Send to syslog so systemd-journald picks it up
logger -t smartd -p user.warning "[$DEVICE] ($FAILTYPE) $MESSAGE"
# Or send to a webhook/Prometheus pushgateway
# (use the device basename: the slashes in /dev/sda would break the URL path)
curl -s -X POST "http://localhost:9091/metrics/job/smartd/instance/${DEVICE##*/}" \
  --data-binary @- << EOF
# HELP smartd_alert_count Number of SMART alerts
# TYPE smartd_alert_count counter
smartd_alert_count{device="${DEVICE}",failtype="${FAILTYPE}"} 1
EOF
Start smartd:
sudo systemctl enable smartd
sudo systemctl start smartd
sudo systemctl status smartd
Check the logs:
sudo journalctl -u smartd -f
Automating Tests with Cron
smartd can handle scheduled tests, but for more control, run them via cron. This is useful if you want to stagger tests across multiple drives so they don’t all spin up at once (and cause a power spike).
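To illustrate the staggering idea, here is a small generator that gives each drive its own weekday for the long test. It is only a sketch: the function name stagger_long_tests, the 2 AM slot, and the /etc/cron.d-style line format (with a user field) are choices made for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one cron line per drive, each on a different
# weekday, so long tests do not overlap.
stagger_long_tests() {
    local i=0 dev
    for dev in "$@"; do
        # Cron weekdays run 0 (Sunday) to 6 (Saturday); wrap around after seven drives.
        printf '0 2 * * %d root smartctl -t long %s\n' $(( i % 7 )) "$dev"
        i=$(( i + 1 ))
    done
}
```

Calling stagger_long_tests /dev/sda /dev/sdb /dev/sdc prints three lines scheduled for Sunday, Monday, and Tuesday, which can be reviewed and then pasted into a file under /etc/cron.d/.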
# These lines use the system crontab format (/etc/crontab or /etc/cron.d/), which includes a user field
# Run short test on /dev/sda at 1 AM daily
0 1 * * * root smartctl -t short /dev/sda
# Run long test on /dev/sdb every Sunday at 2 AM
0 2 * * 0 root smartctl -t long /dev/sdb
# Log SMART status to a file every 6 hours
0 */6 * * * root smartctl -a /dev/sda >> /var/log/smartctl-sda.log
Then read that log with something like:
# Show only reallocated sectors and pending sectors
grep -E "Reallocated_Sector|Current_Pending" /var/log/smartctl-sda.log
Integrating with Prometheus
If you’re running Prometheus (overkill for most home labs, but worth mentioning), use the smartctl_exporter:
# Install prometheus smartctl exporter
git clone https://github.com/prometheus-community/smartctl_exporter
cd smartctl_exporter
make build
sudo cp ./smartctl_exporter /usr/local/bin/
Set up a systemd service:
[Unit]
Description=Prometheus smartctl exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/smartctl_exporter
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
Add to your Prometheus config:
scrape_configs:
  - job_name: 'smartctl'
    static_configs:
      - targets: ['localhost:9633']
Now you can graph SMART attributes over time and set alerts when reallocated sectors or pending sectors climb.
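If Prometheus is more than you need, the same “is it climbing?” question can be asked of the plain-text log written by the cron job earlier. A minimal sketch, assuming the log holds repeated smartctl -a dumps in chronological order and that the raw value is the tenth column of each attribute line; attr_is_climbing is a name invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads concatenated `smartctl -a` dumps on stdin and
# exits 0 if the named attribute's raw value is higher in the last sample
# than in the first, 1 otherwise (including when the attribute never appears).
attr_is_climbing() {
    local attr=$1
    awk -v attr="$attr" '
        $2 == attr {
            if (!seen) { first = $10 + 0; seen = 1 }
            last = $10 + 0
        }
        END { exit !(seen && last > first) }
    '
}
```

For instance, attr_is_climbing Current_Pending_Sector < /var/log/smartctl-sda.log && echo "pending sectors are rising" gives a crude alert without any extra infrastructure. Note that the attribute name must match what smartctl prints, such as Reallocated_Sector_Ct.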
What To Do When A Drive Starts Failing
You saw it coming. Maybe Current_Pending_Sector jumped from 0 to 47. Maybe Reallocated_Sector_Count started climbing. What now?
Don’t panic. You have time. That drive isn’t dead yet. But it will be.
Step 1: Verify It’s Really Failing
Run the long test and wait for results:
smartctl -t long /dev/sda
sleep 2h             # Wait for the test (smartctl prints an estimated completion time; large drives can take much longer)
smartctl -x /dev/sda # Check results
If the long test itself throws errors or the drive doesn’t complete the test, that’s a bad sign. The drive is struggling.
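Rather than reading the whole -x dump, the verdict of the most recent test can be pulled from the self-test log. A hedged sketch for ATA drives: it assumes the table layout printed by smartctl -l selftest, where the newest entry is numbered # 1, and the function name latest_selftest_passed is made up here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl -l selftest` output on stdin and
# succeeds only if the newest entry ("# 1") completed without error.
latest_selftest_passed() {
    # Keep only the newest log entry, then look for the success status text.
    grep -E '^# *1 ' | grep -q 'Completed without error'
}
```

Used as sudo smartctl -l selftest /dev/sda | latest_selftest_passed || echo "latest self-test did not pass". It also fails when the log is empty, which is the cautious default.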
Step 2: Back Up Everything It Holds
If this drive is in a RAID array, stop here for a moment. You have options:
- RAID 1 (mirror): The other drive has everything. You’re fine.
- RAID 5 or 6: Replace the failing disk and start the rebuild now, before a second drive fails. Yes, rebuild is stressful, but it’s better than hoping.
- ZFS: If you’re running ZFS (Linux), use zpool replace to swap in a new drive. ZFS will resilver intelligently and you can watch it:
zpool replace poolname /dev/sda /dev/sdc # Replace /dev/sda with /dev/sdc
zpool status -v                          # Watch resilver progress
If the drive is a standalone backup or data drive, just copy everything off to another disk.
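For Linux software RAID (mdadm), it helps to confirm whether an array is already running degraded before touching anything. The sketch below reads /proc/mdstat-style text and prints the names of arrays with a missing member, shown as an underscore in the [UU] status field. The function name degraded_md_arrays is hypothetical, and the parsing assumes the usual layout where the status line follows the array’s name line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads /proc/mdstat content on stdin and prints the
# name of every md array whose member status field contains "_".
degraded_md_arrays() {
    awk '
        /^md/ { name = $1 }               # remember which array this block describes
        /\[[U_]*_[U_]*\]/ { print name }  # status field with at least one missing member
    '
}
```

Run it as degraded_md_arrays < /proc/mdstat. Empty output means every array has all of its members.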
Step 3: Order a Replacement
Don’t wait. Buy the replacement drive now. Expect shipping to take 3-7 business days. Your failing drive will probably last that long, but you don’t want to be surprised.
Step 4: Replace and Retest
Once the new drive arrives:
# For mdadm RAID, mark the failing member as failed and remove it from the array first
sudo mdadm /dev/md0 --fail /dev/sda --remove /dev/sda
# Shut down gracefully
sudo shutdown -h now
# Physically swap the drive
# Power back on
# For ZFS pools (new drive in the same slot):
zpool replace poolname /dev/sda
# For mdadm RAID, add the new drive so the array rebuilds:
sudo mdadm /dev/md0 --add /dev/sda
# For standalone drives, just copy data back
rsync -av /backup/ /mnt/newdrive/
Run the long test on the new drive to make sure it’s healthy:
smartctl -t long /dev/sda
Common Gotchas
“SMART says OK, but the drive is failing.” SMART status is binary. It lags reality. Watch the attributes, not the status.
“I don’t see any SMART data.” Some systems require elevated privileges, or the drive doesn’t support SMART (rare). Try sudo smartctl -a /dev/sda. If you get “Unknown USB bridge” or “No SMART device” it might be behind a controller that doesn’t expose SMART data.
“smartd won’t start.” Check /etc/smartd.conf for syntax errors. Run sudo smartd -d to run smartd in the foreground in debug mode and see what’s wrong, or sudo smartd -q onecheck to register the drives, run a single check, and exit.
“My NVMe drive shows no SMART data.” Some controllers don’t expose NVMe SMART over the standard interface. Try nvme smart-log /dev/nvme0n1 (from the nvme-cli package) directly, or check if the drive manufacturer has their own monitoring tool.
“Reallocated sectors jumped overnight. Am I losing data?” No, not yet. The drive found bad sectors and moved the data to spares. You have days or weeks. Start the replacement process calmly. Panicking at 2 AM doesn’t help.
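For the “I don’t see any SMART data” case above, smartctl --scan lists the devices smartctl can see along with the -d device type it would use, and USB enclosures sometimes respond once the type is forced with -d sat. The helper below is a small sketch that reduces the scan output to device and type pairs. scan_device_types is an invented name, and the line format (device, -d, type, then a comment) is assumed from typical smartmontools output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `smartctl --scan` output on stdin and prints
# "<device> <type>" for each detected device.
scan_device_types() {
    awk '$2 == "-d" { print $1, $3 }'
}
```

Something like sudo smartctl --scan | scan_device_types shows what was detected. If a USB drive is missing or misidentified, sudo smartctl -d sat -a /dev/sdX is worth a try, though whether it works depends on the bridge chip.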
The Real Talk
SMART monitoring is boring. It’s the kind of thing you set up once and then ignore for years. That’s exactly when it’s working. The moment you see an alert about rising pending sectors or offline uncorrectable errors, you’ll be glad you bothered.
For a home lab or small NAS, this setup takes maybe 30 minutes:
- Install smartmontools (apt install smartmontools)
- Edit /etc/smartd.conf to monitor your drives and log to syslog
- Enable smartd (systemctl enable smartd)
- Set a cron job to run long tests weekly
- Glance at journalctl -u smartd once a month
That’s it. You’re now set up to catch most disk failures days or weeks before they destroy your data. Your future self (the one at 3 AM when a drive dies) will thank you profusely.
Disks fail. But you’ll know when they’re about to.
Full example: If you’re running this on a Proxmox cluster or bare-metal Debian, the config above works verbatim. For other systems (UnRaid, TrueNAS, etc.), check their docs—they often have built-in SMART monitoring that’s already wired up. Don’t reinvent the wheel there.