Your hard disk is where your data lives, so it is prudent to monitor it with periodic self-testing.
For this task we’ll be using smartmontools. I use Linux and these instructions reflect that, but smartmontools can also be used on other operating systems.
Set up postfix or another local mail transport agent. The hard disk self-tests we will set up in a moment will notify us of problems via local mail.
Use your distribution’s package manager to install smartmontools. I also like to install gsmartcontrol, which lets you view the self-test results in an easy-to-use GUI.
SET UP HARD DISK SELF-TESTING
Follow these instructions to set up daily automatic testing of the hard disk and have smartd mail you if a test detects a problem. In doing so, take note of the following:
- When you first edit /etc/smartd.conf, you should confirm that error messages will be sent and received correctly. Do so by adding -M test to the end of the smartd configuration line. Then restart smartd (as root, service smartd restart). You should immediately receive a test error message. Remove -M test once all is working.
- The instructions above add -M exec /usr/share/smartmontools/smartd-runner to smartd.conf; I find it unnecessary even on Debian. Your mileage may vary.
- Once smartd.conf is working to your liking, add it to your backup routine.
And you’re done. For reference, here’s my smartd.conf:
# Schedule short tests daily at 8 am and long tests Monday at 1 pm
/dev/sda -a -o on -S on -s (S/../.././08|L/../../1/13) -m warren@verdi
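To see how the -s expression reads, here is a minimal Python sketch. It is not part of smartmontools. It assumes, per the smartd.conf man page, that smartd compares the pattern against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour), and it assumes the whole string has to match. The function name due_tests is made up for illustration, and Python's re module stands in for the POSIX regular expressions smartd uses.

```python
import re
from datetime import datetime

# The -s pattern from the smartd.conf line above.
SCHEDULE = r"(S/../.././08|L/../../1/13)"


def due_tests(when, schedule=SCHEDULE):
    """Return the test types (S = short, L = long) the pattern selects at `when`."""
    due = []
    for test_type in "SL":
        # Build T/MM/DD/d/HH; isoweekday() gives 1 for Monday through 7 for Sunday.
        stamp = f"{test_type}/{when:%m/%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, stamp):
            due.append(test_type)
    return due


print(due_tests(datetime(2010, 5, 18, 8)))   # a Tuesday at 8 am  -> ['S']
print(due_tests(datetime(2010, 5, 17, 13)))  # a Monday at 1 pm   -> ['L']
print(due_tests(datetime(2010, 5, 18, 13)))  # a Tuesday at 1 pm  -> []
```

Each dot in the pattern matches any single character, which is why S/../.././08 fires every day at hour 08 while L/../../1/13 fires only when the day-of-week field is 1.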
RESPONDING TO ERROR MESSAGES
Once set up, no news is good news. But one sad day you may receive a message like this:
From: root <email@example.com>
Subject: SMART error (CurrentPendingSector) detected on host: verdi.home.invalid
Date: Tue, 18 May 2010 17:56:45 -0600
This email was generated by the smartd daemon running on:
host name: verdi.home.invalid
DNS domain: home.invalid
NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/sda, 1 Currently unreadable (pending) sectors
What to do? First, view the test results. You can do this from a terminal (as root, smartctl -l selftest /dev/sda, changing the path to the device as appropriate) or (again, as root) with the GUI tool gsmartcontrol. In the particular case cited above, I saw:
# smartctl -l selftest /dev/sda
smartctl version 5.38 [i586-mandriva-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17664         -
# 2  Short offline       Completed without error       00%     17663         -
# 3  Extended offline    Completed without error       00%     17615         -
# 4  Extended offline    Completed without error       00%     17435         -
# 5  Extended offline    Completed without error       00%     17287         -
# 6  Extended offline    Completed without error       00%     17163         -
# 7  Extended offline    Completed without error       00%     17016         -
# 8  Extended offline    Completed: read failure       80%     16867         43131442
# 9  Extended offline    Completed: read failure       80%     16837         43131442
#10  Extended offline    Completed without error       00%     16686
[snipped rest of tests]
This is a list of the 21 most recent self-tests, most recent first. This particular report shows that two tests had errors, but since then seven more recent tests completed without error. So this particular report does not worry me. Some errors, however, are serious. Make sure your backups are up to date and consult the smartmontools FAQ for information on specific errors.
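If you would rather not read the log by eye, the same check can be sketched in Python. This is only a rough illustration: the regular expression is my guess at the row layout based on the report above, the function names are invented, and treating every status other than "Completed without error" as a problem is a simplification (an aborted or in-progress test is not a disk error).

```python
import re

# One log row: "# N  <description and status>  NN%  <hours>  <LBA or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)$")

# Rows 1 to 9 of the report shown above.
SAMPLE = """\
# 1  Extended offline    Completed without error       00%     17664         -
# 2  Short offline       Completed without error       00%     17663         -
# 3  Extended offline    Completed without error       00%     17615         -
# 4  Extended offline    Completed without error       00%     17435         -
# 5  Extended offline    Completed without error       00%     17287         -
# 6  Extended offline    Completed without error       00%     17163         -
# 7  Extended offline    Completed without error       00%     17016         -
# 8  Extended offline    Completed: read failure       80%     16867         43131442
# 9  Extended offline    Completed: read failure       80%     16837         43131442
"""


def parse_selftest_log(text):
    """Return one dict per self-test row, in the order listed (newest first)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # banner, column header, or blank line
        num, summary, _remaining, hours, lba = match.groups()
        rows.append({
            "num": int(num),
            "summary": summary,
            "clean": summary.endswith("Completed without error"),
            "hours": int(hours),
            "lba": None if lba == "-" else int(lba),
        })
    return rows


def clean_tests_since_last_failure(rows):
    """Count the clean tests listed before the first (most recent) failure."""
    count = 0
    for row in rows:
        if not row["clean"]:
            break
        count += 1
    return count


rows = parse_selftest_log(SAMPLE)
print(clean_tests_since_last_failure(rows))            # 7
print([r["lba"] for r in rows if not r["clean"]])      # [43131442, 43131442]
```

Both failures point at the same LBA, and seven clean tests have run since, which matches the reading of the report given above.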
Each time a self-test is run, the most recent test becomes #1 in the list and the oldest test is discarded. So in the example above the two failed tests will be flushed from the list over time. Be forewarned that when the last failed test is removed, a bug in smartmontools generates the warning message “new Self-Test Log error at hour timestamp 0”. This message can be safely ignored.
I don’t use it, but there is smart-notifier, which, when added to the user’s session, displays hard disk error messages on the user’s screen. Keep in mind that on-screen messages are easily missed if the computer is running unattended.
Self-testing your hard disk is just one aspect of hardware monitoring. Don’t overlook the bigger picture.
Google’s Disk Failure Experience
These notes were last updated 26 August 2014 with reference to smartmontools 6.2.