Linux monitoring tools to keep your hardware cool


Have you ever noticed that light bulbs (the incandescent ones especially) seem to burn out most frequently at the instant they're turned on? Or that electronic components like home theater systems or TVs worked fine yesterday but don't today when you turn them on? I have, too.

Have you ever wondered why that happens?

Thermal stress

There are many factors that affect the longevity of electronic equipment. One of the most ubiquitous sources of failure is heat. In fact, the heat generated by most electronic devices as they perform their assigned tasks is the very heat that shortens their electronic lives.

When I worked at IBM in Boca Raton at the dawn of the PC era, I worked as part of a group that was responsible for the maintainability of computers and other hardware of all types. One task of the labs in Boca Raton was to ensure that hardware broke very infrequently, and that when it did, it was easy to repair. I learned some interesting things about the effects of heat on the life of computers while I was there.

Let's go back to the light bulb because it is an easily visible, if somewhat infrequent, example.

Every time a light bulb is turned on, electric current surges into the filament and heats it very rapidly from room temperature to about 340° Fahrenheit (the temperature depends upon the wattage of the bulb). This causes thermal stress through vaporization of the metal of which the filament is made, as well as the rapid expansion of the metal caused by the heating itself. When a light bulb is turned off, the thermal stress is repeated—though less severely—during the cooling phase as the filament shrinks. The more times a bulb is cycled on and off, the more the effects of this stress accumulate.

The primary effect of thermal stress is that some small parts of the filament—usually due to minute manufacturing variances—tend to become hotter than the other parts, causing the metal at those points to evaporate faster. This makes the filament even weaker at that point and more susceptible to rapid overheating in subsequent power-on cycles. Eventually, the last of the metal evaporates when the bulb is turned on and the filament dies in a very bright flash.

The electrical circuitry in computers is much the same as the filament in a light bulb. Repeated heating and cooling cycles can damage the computer's internal electronic components just as the filament of the light bulb was damaged over time.

Cooling is essential

Keeping computers cool is essential for helping to ensure that they have a long life. Large data centers spend a great deal of energy to keep the computers in them cool. Without going into the details, designers need to ensure that the flow of cool air is directed into the data center and specifically into the racks of computers to keep them cool. It is even better if they can be kept at a fairly constant temperature.

Proper cooling is essential even in a home or office environment. In fact, it is even more essential in those environments because the ambient temperature is so much higher (it is set primarily for the comfort of the humans, not the computers).

Temperature monitoring

One can measure the temperature of many different points in a data center as well as within individual racks. But how can the temperature of the internals of a computer be measured?

Fortunately, modern computers have many sensors built into various components to enable monitoring of temperatures, fan speeds, and voltages. If you have ever looked at some of the data available when a computer is in BIOS configuration mode, you can see many of these values. But this does not show what is happening inside the computer when it is in a real-world situation under loads of various types.

Linux has some software tools available to allow system administrators to monitor those internal sensors. Those tools are all based on the lm_sensors, SMART (smartmontools), and hddtemp library modules, which are available on all Red Hat-based distributions and most others as well.
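
If these packages are not already installed, they are generally available from the standard repositories. On a Red Hat-based distribution the package names are typically lm_sensors, hddtemp, and smartmontools, though the names can vary between distributions and releases:

# dnf install lm_sensors hddtemp smartmontools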

The simplest tool is the sensors command. Before sensors can be used, run the sensors-detect command to detect as many of the sensors installed on the host system as possible. The sensors command then produces output that includes motherboard and CPU temperatures, voltages at various points on the motherboard, and fan speeds. It also displays the ranges of temperatures considered to be normal, high, and critical.
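
A first run might look something like the following; sensors-detect asks a series of questions about which chips to probe, and the sensor names, readings, and thresholds will vary with your hardware:

# sensors-detect
$ sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +42.0°C  (high = +76.0°C, crit = +100.0°C)
Core 1:       +37.0°C  (high = +76.0°C, crit = +100.0°C)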

The hddtemp command displays the temperature of a specified hard drive. The smartctl command shows the current temperature of the hard drive, various measurements that indicate the potential for hard drive failure, and, in some cases, an ASCII text history graph of the hard drive's temperatures. This last output can be especially helpful when diagnosing some types of problems.
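
For example, with a drive at /dev/sda (a stand-in device name; the model string below is invented for illustration), the temperature and SMART data can be queried like this. The SCT temperature history table is only available on drives that support it:

# hddtemp /dev/sda
/dev/sda: ExampleDisk 1TB: 38°C
# smartctl -A /dev/sda | grep -i temperature
194 Temperature_Celsius     0x0022   062   053   000    Old_age   Always       -       38
# smartctl -l scttemphist /dev/sda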

When used with the appropriate library modules, the glances command can display hard drive temperatures as well as all of the same temperatures provided by the sensors command. glances is a top-like command that provides a great deal of information about a running system, including CPU and memory usage, I/O information for network devices and hard drive partitions, and a list of the processes using the largest amounts of various system resources.
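
Installing and starting glances is simple; the package name below assumes your distribution's standard repositories carry it, and the client/server options depend on the version you have installed:

# dnf install glances
$ glances

Running glances -s on a remote host and glances -c <hostname> locally displays the remote host's data, which is handy for checking machines in another room or rack.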

There are also a number of good graphical monitoring tools that can be used to monitor the thermal status of your computers. I like GKrellM for my desktop. There are plenty of others available for you to choose from.

I suggest installing these tools and monitoring their output on every newly installed system. That way, you can learn what temperatures are normal for your computers. Using a tool like glances allows you to monitor the temperatures in real time and understand how added loads of various types affect those temperatures. The other tools are useful for taking quick snapshots of a computer's thermal state.
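
One simple way to watch the temperatures change under load is to rerun the sensors command on an interval with the watch utility, or with a plain shell loop:

$ watch -n 1 sensors
$ while true; do sensors; sleep 1; done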

Taking action

Doing something about high temperatures is pretty straightforward. It is usually a matter of replacing defective fans; installing newer, higher-capacity fans; and reducing the ambient temperature.

When building new computers or refurbishing older ones, I always install additional case fans or replace existing ones with larger ones where possible. Maximum airflow is important to efficient cooling. In some extreme environments, such as for gamers, liquid cooling can replace air cooling; most of us don't need to take it to that level.

I also typically replace the standard CPU cooling units with high-capacity ones. At the very least, I replace the thermal compound between the CPU and the cooling radiator. I find that the thermal compound from the factory or computer store is not always evenly distributed over the surface of the CPU, which can leave some areas of the CPU with insufficient heat dissipation.

I have a large room over my attached garage that my wife and I use for our offices. Altogether I have 10 running computers, two laser printers (in sleep mode most of the time), multiple external hard drive enclosures holding one to four drives each, and six uninterruptible power supplies (UPS). These devices all generate significant amounts of heat.

Over the years I have had to deal with several window-mounted air-conditioning units to keep our home office at a reasonable temperature. A couple of years ago our HVAC unit died, and it made sense to install a zoning system so that the upstairs office space would be cooled directly and the remaining cool air, being denser than the warm air downstairs, would flow down to the lower level. This works very well for me and keeps me and the computers at a comfortable temperature.

It is also possible to test the efficacy of your cooling solutions. There are a number of options, and the one I prefer also performs useful work.

I have BOINC (Berkeley Open Infrastructure for Network Computing) installed on many of my computers, and I run SETI@home to do something productive with all of the otherwise wasted CPU cycles I own. It also provides a great test of my cooling solutions. There are also commercially available test suites that allow stress testing of memory, CPU, and I/O devices, which can be used to test cooling solutions as a side benefit.
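
If you want a dedicated load generator instead, an open source tool such as stress-ng (one possibility among several; it is packaged by most distributions) can load the CPU in one terminal while you watch the sensors in another:

$ stress-ng --cpu 4 --timeout 10m
$ watch -n 1 sensors

The --cpu value sets the number of CPU worker threads; matching it to the number of cores produces the greatest heat load.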

So keep cool and compute on!

David Both
David Both is an Open Source Software and GNU/Linux advocate, trainer, writer, and speaker. He has been working with Linux and Open Source Software since 1996 and with computers since 1969. He is a strong proponent of and evangelist for the "Linux Philosophy for System Administrators."

8 Comments

Lately, I've been having some problems with excessive overheating of my notebook (brand omitted). Playing 1080p (full HD) videos from YouTube made this machine jump to 86°C (186°F, detected with the "sensors" command), dangerously near the hardware limit (90°C/194°F).

Among other things, I tried to restrict the CPU to low speeds (around 1GHz), thinking that it could run cooler. It didn't work, so I tried the opposite (around 2.5GHz, all the time), which also didn't work.

I then noticed that the kernel didn't have SMP configured (but the CPU does). For a variety of reasons, I chose a 32-bit kernel instead of a 64-bit one. And that particular distro (Sparkylinux with Xfce: excellent) offered a 586 kernel without SMP (for older CPUs, it seems). In other words, that meant only one processor was in use.

After upgrading the kernel to one called "linux-image-686-pae" there was a 20°C (36°F) drop in the temperature while playing a 1080p video. Quite impressive.

It seems to run a little cooler for normal tasks, too.

I live in a country far hotter than the US, and sometimes off-the-shelf parts cannot endure the higher ambient temperature during summer (even indoors).

That is very interesting. Did you use something like top to see what processes were taking up the most CPU cycles? I can see that a different, perhaps more efficient, kernel could make a difference, but that sounds like a very large change in temperature.

Just for a comparison, the ambient temperature this morning in my computer room is about 23.3°C, and my computers (the ones that have high computing loads) are running anywhere from the low 40s to the high 60s Celsius. I run all 64-bit kernels.

In reply to Unbeknownst

First of all, for the sake of scientific reproducibility :), let me make clear I was comparing apples and oranges: there was a great drop in temperature from playing the video with flashplayer-plugin in FF (actually Iceweasel) to playing the same video with mplayer.

The flashplayer has been so much lambasted I don't even need to comment on its ability to fry computers, and mplayer is a jewel of high-quality programming, as far as I can say as a layman.

Nevertheless, I had formerly tested the same situation with just one processor (i.e., without SMP) and the difference was not as meaningful (some 5°C, or about 9°F). My theory is that about the same heat is being produced, but now in a more distributed way.

With that out of the way, it's a 5-year-old notebook, probably with lots of dust. It has an i3 processor and integrated Intel video; it was my daughter's, used with Windows, and now I'm trying to make it work because it has more RAM than my other (all Linux) computers. I'm also not very used to dealing with thermal paste (I don't even have any at home), and I suppose it may be quite dry after all these years.

I used top and it shows Firefox being responsible for most of the CPU use when playing 1080p videos (with the flashplayer). 720p doesn't help much, but 480p is quite easy on the CPU. Again, this notebook needs to go through some serious maintenance. I tried using a vacuum (some people say it can damage fans) but there was not much apparent dust outside.

A few sites have many small animations, and they also cause overheating (e.g., the news/email portal www.ig.com.br).

Amazingly, some little programs like a dialog tool ("yad") also made the CPU reach a high temperature quickly. Seems to be this one: http://sourceforge.net/projects/yad-dialog/

> Just for a comparison, the ambient temperature this morning in my computer room is about 23.3°C, and my computers (the ones that have high computing loads) are running anywhere from the low 40s to the high 60s Celsius. I run all 64-bit kernels.

I have a tower desktop on the same table. Right now, sensors show:

Core 0: +42.0°C (high = +76.0°C, crit = +100.0°C)
Core 1: +37.0°C (high = +76.0°C, crit = +100.0°C)

It also plays 1080p very well (and it's a 2009 Core 2 Duo computer, Mint KDE i686 SMP).

The notebook is using mplayer to play a 1080p video with the SMP kernel (it shows 4 processors, probably two dual-cores). I have a window with a bash one-liner repeating sensors and sleep 1. Temperature varies from 58°C to 62°C (136 to 144°F). Pausing mplayer makes the temperature cool down to 43~45°C (110~113°F), which is comfortably low.

Today is colder than yesterday. It's cloudy, 22°C outside, probably some 24°C in this room now. There's a fan under the notebook, and I have one on the ceiling, too. Getting fresh air from outside apparently makes the computers a little cooler (but not on sunny days!).

Iceweasel uses the most CPU, about 9%, until I unpause mplayer, which then uses 33% CPU (playing separate video and audio files simultaneously); curiously, Iceweasel jumps to 12%. (How does top calculate these percentages?)

In reply to dboth

Another approach is to leave the equipment powered on all the time; that largely removes the thermal stress of power cycling.

Many years ago I visited the Multics development centre in Cambridge, Mass. I asked why they powered the development system down every night and back up in the morning.

The point was precisely to stress the components. The unreliable ones tended to fail very quickly, which was good, because when you are writing OS code the last thing you need is a flaky CPU.

Although I did not explicitly spell it out, I knew that most folks would understand that I advocate leaving computers on 24/7 to extend their lifetime.

And thanks for the information about Multics. I was unaware that they used thermal cycling to intentionally fail the unreliable components. That is a good thing to know, and especially cool coming from one who was there.

Thanks!

In reply to dgrb

Thanks David! I installed the software on my laptop and now I can begin to learn how to use it.

I'm not sure where you got your figure of 340° Fahrenheit for the temperature of a light bulb filament. This seemed exceptionally low to me and upon checking I see that it's "roughly 2550 degrees Celsius, or roughly 4600 degrees Fahrenheit".

I think you are looking at color temperatures rather than thermal temperatures; the two are vastly different. I got my temperature info here: https://en.wikipedia.org/wiki/Incandescent_light_bulb but it is the temperature of the surface of the glass bulb and not the filament.

So you are very likely correct that the *filament* temperature would be significantly higher than that. The following article provides a color chart giving the approximate thermal temperature of white light (the surface of the filament, for example) at about 1300 Celsius.

https://en.wikipedia.org/wiki/Incandescence

Thanks for pointing that out.

In reply to John Pennifold

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.