If you want a reliable box, you need to monitor your critical hardware. Here’s how to keep an eye on your hard disk, processor, and fans. Overclockers will also want to watch their voltages; I do not address that here, but the same procedure applies.
TODO: This article is a mess. Make it an easily understood overview, spinning details off into separate logical articles that this links to. Worse, it implements monitoring in userspace. Better, do the monitoring at the system level.
We will do this in three parts. First, we gather the needed information on our hardware. Second, we install and configure back-end utilities to monitor the hardware of interest. Finally, we install and configure a front-end application to display the sensor values.
Before we begin, I should reveal my priorities. Although processor temperature is important, I do not share some users’ obsession with it. My utmost concern is for the hard disk. As Bill Clinton would have said, “It’s the data, stupid.”
I originally wrote these notes from the perspective of monitoring, but this information can also be used in a diagnostic context. It could help you determine, for example, if your intermittent problem is correlated with overheating — but to do that you first need to have been watching your hardware since before the problem began. So let’s start monitoring:
Job one is to identify your hard disk and processor and determine what their minimum, maximum, and recommended operating temperatures are.
Hardware identification is straightforward: the command line tool lshw (must be run as root) or the graphical tool hardinfo usually tells you all you need to know. They may report some data in different terms than the manufacturer uses. For example, lshw reported my processor’s version as “6.15.13”, where the final number corresponds to Intel’s designation of stepping M0 for this processor (M being the 13th letter of the alphabet). Sometimes you may just have to open the case and peek inside: lshw could tell me many things about my hard disk but not the model number.
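To make the version string easier to read, here is a minimal sketch of a helper that splits it into its parts. The function name cpu_version_fields is my own invention, and it assumes the three dot-separated fields are family, model, and stepping, as they appeared to be on my machine.

```shell
# Hypothetical helper: split an lshw-style processor version such as "6.15.13".
# Assumption: the three dot-separated fields are family, model, and stepping.
cpu_version_fields() {
    # $1: version string as reported by lshw
    echo "$1" | awk -F. 'NF == 3 {
        printf "family=%s model=%s stepping=%s\n", $1, $2, $3
    }'
}
```

You would feed it the value from the version line that lshw -class processor prints (as root), for example cpu_version_fields 6.15.13. Look up the resulting stepping number in the manufacturer's tables rather than guessing at the letter designation.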
Pinning down a chip maker on a maximum temperature for a product can be a challenge. Intel hides such information so well that your best bet may be to consult a reputable third-party site such as CPU World. You may find, as I did, that the maximum temperature of your processor depends upon which stepping you have, in which case lshw is again your friend.
Thankfully, the hard disk manufacturers I have consulted all openly provide thermal ranges for their products.
Minimum and maximum are one thing; recommended temperature is something else. Curiously, I have found few formal studies addressing this, and those I have found examine datacenters rather than workstations and so ignore the stresses of thermal cycling and drive spin-up/spin-down. Lacking objective data, I must grudgingly settle for the anecdotal advice floating around the net, which is:
- Processors: It is generally accepted that there is no recommended operating temperature for processors; just don’t exceed the rated maximum. A greater concern is to minimize thermal cycling. On a constantly running server this is of course not an issue. On a workstation this can be addressed by not turning it on until you need it, and then leaving it on until you are finished with it for the day.
- Hard disks: It is commonly believed that the warmer your disk runs, the shorter its expected life, even if well below the drive’s rated maximum operating temperature. The most common recommendation is 45°C or less.
Poll the processor and fans
Install lm_sensors. On some distributions, the package is named lm-sensors or sensors.
As root, run sensors-detect and follow the instructions; the default values are usually okay. Older versions of sensors-detect required you to manually edit a configuration file, but modern versions should create /etc/sysconfig/lm_sensors for you.
The lm_sensors service should now be running with your sensors detected; confirm this by running sensors as an ordinary user. You should see something like this:
    $ sensors
    w83627dhg-isa-0290
    Adapter: ISA adapter
    VCore:     +1.18 V  (min = +0.00 V, max = +1.74 V)
    in1:      +10.61 V  (min = +0.00 V, max = +3.38 V)  ALARM
    AVCC:      +3.38 V  (min = +1.28 V, max = +2.19 V)  ALARM
    3VCC:      +3.38 V  (min = +0.08 V, max = +0.77 V)  ALARM
    in4:       +1.85 V  (min = +0.02 V, max = +1.09 V)  ALARM
    in5:       +1.26 V  (min = +0.06 V, max = +1.02 V)  ALARM
    in6:       +0.31 V  (min = +0.51 V, max = +0.00 V)  ALARM
    VSB:       +3.28 V  (min = +0.06 V, max = +0.03 V)  ALARM
    VBAT:      +3.17 V  (min = +1.02 V, max = +0.77 V)  ALARM
    Case Fan:    0 RPM  (min = 0 RPM, div = 128)  ALARM
    CPU Fan:  1480 RPM  (min = 4963 RPM, div = 8)  ALARM
    Aux Fan:     0 RPM  (min = 0 RPM, div = 128)  ALARM
    fan4:        0 RPM  (min = 10546 RPM, div = 128)  ALARM
    fan5:        0 RPM  (min = 10546 RPM, div = 128)  ALARM
    Sys Temp:    +64°C  (high = +26°C, hyst = -109°C)  [CPU diode ]  ALARM
    CPU Temp:  +50.0°C  (high = +80.0°C, hyst = +75.0°C)  [CPU diode ]
    AUX Temp: +127.0°C  (high = +80.0°C, hyst = +75.0°C)  [thermistor]  ALARM
    vid:      +0.000 V

    coretemp-isa-0000
    Adapter: ISA adapter
    Core 0:      +59°C  (high = +100°C)

    coretemp-isa-0001
    Adapter: ISA adapter
    Core 1:      +58°C  (high = +100°C)
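With this many lines it helps to pick out only the sensors currently in alarm. The following is a rough sketch, not part of lm_sensors itself: the function name alarmed_sensors is made up, and it assumes the output format shown above, where an alarmed line ends in the word ALARM and the label precedes the first colon.

```shell
# Hypothetical helper: list the labels of sensors that are in alarm.
# Reads `sensors` output on standard input.
# Assumption: alarmed lines end in "ALARM" and the label precedes the first colon.
alarmed_sensors() {
    awk -F: '/ALARM[[:space:]]*$/ { print $1 }'
}
```

Typical use would be sensors | alarmed_sensors, which on the output above would list in1, AVCC, the fans, and so on, one per line.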
If needed, edit the sensors’ minimum or maximum values with an lm_sensors configuration file. While it is not strictly necessary, it is best practice: sensor alarm messages are logged to /var/log/messages every minute, and spurious messages make the log needlessly hard to read.
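As an illustration, the sketch below writes a small configuration fragment that hides the two unconnected fan inputs (fan4 and fan5) from the sample output above. The function name write_sensors_override is hypothetical, the chip pattern is taken from my output and will differ on your board, and where the file belongs (for example a drop-in directory such as /etc/sensors.d/) is an assumption to check against your distribution's documentation.

```shell
# Hypothetical helper: write a configuration fragment that hides unused fan inputs.
# $1: path of the configuration file to create (location is an assumption;
#     consult the lm_sensors documentation for your distribution)
write_sensors_override() {
    cat > "$1" <<'EOF'
# Hide fan inputs that have nothing connected.
chip "w83627dhg-*"
    ignore fan4
    ignore fan5
EOF
}
```

The ignore statement hides a feature from the sensors output. Limits are adjusted with set statements in the same kind of chip block, and those generally need to be applied as root with sensors -s.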
Use your distribution’s tools to confirm that the lm_sensors service is configured to start on boot.
We have lm_sensors running, but the raw data it is reporting might need to be calibrated. To do this we shall assume that your BIOS hardware monitor reports correct data which we can use as a reference.
Print out the output of sensors (as user, sensors | lp). Immediately reboot and quickly enter the BIOS hardware monitor. Compare the sensors and BIOS values. Do not expect the values or even the names to be identical, but they should be similar. If they are not, continue with the reboot and edit the lm_sensors configuration file to calibrate its values per its documentation.
Keep in mind as you do this that processor temperature fluctuates greatly and quickly depending on load, and that your processor is probably idle while you are in the BIOS. A typical and reasonable processor temperature reported by the BIOS would be somewhat lower than what lm_sensors reported a moment ago (because the processor is now idle) and would be falling fast.
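If a sensor differs from the BIOS by a steady amount, the correction can be expressed as a compute statement in the lm_sensors configuration file. The sketch below generates such a line from the two readings. It assumes a simple constant offset is good enough, the function name calibration_line is my own, and temp1 in the usage example is only a placeholder feature name.

```shell
# Hypothetical helper: print a "compute" line that shifts a sensor reading
# so that it matches the BIOS value.
# $1: feature name as used in the configuration file (placeholder, e.g. temp1)
# $2: value reported by sensors
# $3: value reported by the BIOS hardware monitor
# Assumption: the difference between the two is a constant offset.
calibration_line() {
    awk -v f="$1" -v s="$2" -v b="$3" 'BEGIN {
        d = b - s
        # First expression converts raw to displayed; second converts back.
        if (d >= 0) printf "compute %s @+%g, @-%g\n", f, d, d
        else        printf "compute %s @-%g, @+%g\n", f, -d, -d
    }'
}
```

For example, calibration_line temp1 50 47 prints compute temp1 @-3, @+3, which would go inside the matching chip block. For the reasons just given, I would not trust a processor temperature offset derived this way as much as one for a slower-moving sensor.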
Once rebooted, ensure that hddtemp and sensors are operating properly by again running the commands hddtemp /dev/sda and sensors as user.
Display sensor values and act on alarms
Now that your hardware of interest is being polled, you will want to display the collected data in a monitoring application. There are several options available; some are:
- Many desktop environments come with their own monitors, often designed for use in the tray or panel. This would be the easiest solution. One example of this is the xfce4-sensors-plugin.
- Conky is configurable and scriptable. Displays on the root window. Requires (sometimes extensive) configuration; doesn’t work out of the box. For configurability it is unmatched.
- Gkrellm is dockable, skinnable, and comes with too many plugins to count. This is another easy-to-implement solution that works out of the box for many users. Considered best of breed by many.
I consider it essential that a monitoring application is able to run a script on an alarm condition, for example to hibernate an unattended box if the hard disk overheats. Conky and gkrellm can do this. Without this ability, a monitor is just eye candy. For this reason, popular projects like adesklets, gdesklets, and screenlets don’t make my list.
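What such an alarm script might look like is sketched below. This is only an outline under my own assumptions: the function name on_disk_overheat is made up, the temperature is passed in as a whole number of degrees Celsius, and the action to take is whatever command you supply.

```shell
# Hypothetical helper: run a command when the disk is hotter than a limit.
# $1: current temperature in whole degrees Celsius
# $2: limit in degrees Celsius (45 is the common recommendation cited earlier)
# remaining arguments: the command to run on alarm
on_disk_overheat() {
    temp=$1
    limit=$2
    shift 2
    if [ "$temp" -gt "$limit" ]; then
        "$@"
    fi
}
```

A call might look like on_disk_overheat "$(hddtemp -n /dev/sda)" 45 systemctl hibernate, where hddtemp -n prints only the number and systemctl hibernate is an assumption that holds only on systemd-based systems; substitute your distribution's hibernate command. How you hook the script into Conky or gkrellm depends on that application's own alarm settings.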
Test your processor cooling and voltages
Once you have monitoring and alarms in place, you can safely test your cooling and voltages. Do so and correct any problems detected.