Tips for troubleshooting with Linux
A troubleshooting process for Linux problems
Although it would be nice to believe that cars, home theater systems, computers, and Linux never break, the reality is that they do.
Many people have no problems with Linux, but those who do want the best information and guidance possible. You can obtain professional help from a number of places. For example, if you purchased Linux from a major vendor such as Red Hat, you are entitled to some level of service from that vendor. In fact, what you are actually purchasing is the service. Other help is available on the internet on various web sites and forums. Local user groups may also be available in your geographical area, and you may even have some friends who use Linux and are willing to offer a hand. Do not hesitate to use any and all resources available to you.
Most of the time those of us who use Linux prefer—even enjoy—doing our own troubleshooting.
Solving problems of any kind is an art and a science. Solving technical problems, such as those that occur with computers, requires a good deal of specialized knowledge as well.
Any approach to solving problems of any nature—including problems with computers and Linux—must include more than just a list of symptoms and the steps necessary to fix or circumvent the problems which caused the symptoms. This so-called "symptom-fix" approach looks good on paper to the old-style managers (those managers who do not participate in The Open Organization) but sucks in practice.
There are five basic steps in the problem-solving process that I use: knowledge, observation, reasoning, action, and testing.
You probably already follow these steps when you troubleshoot a problem but do not even realize it. If you follow these steps each time you engage in solving a problem, you should be successful most of the time. These steps are universal and apply to solving most any type of problem, not just problems with computers or Linux.
I used these steps for years in solving electronic and computer problems without realizing it. Having them codified for me made me much more effective at solving problems because when I became stuck, I could review the steps I had taken, verify where I was in the process and restart at an appropriate step if necessary.
You may have heard a couple of other terms applied to problem solving in the past. The first three steps of this process are also known as problem determination, that is, finding the cause of the problem. The last two steps are problem resolution, which is actually fixing the problem.
Knowledge of the subject in which you are attempting to solve a problem is the first step. You must be knowledgeable about Linux at the very least. Better still, you should also understand the other factors that can interact with and affect Linux, such as hardware, the network, and environmental factors including the temperature, humidity, and electrical environment in which the Linux system operates.
Knowledge can be gained by reading books and magazines about Linux and those other topics. You can attend classes, seminars, and conferences. You can also set up a number of Linux computers in a networked environment and experiment with them, and you can learn through interaction with other knowledgeable people.
My personal preference is to play—uh, experiment—with Linux or with a particular piece such as networking, and then take a class or two to formalize the knowledge I have gained.
Remember that without knowledge, "resistance is futile," to paraphrase the Borg. Knowledge is power.
The second step in solving the problem is to observe the symptoms of the problem. It is important to take note of all of the problem symptoms. It is also important to observe what is working properly.
This is not the time to try to fix the problem; merely observe.
An important part of observation is to ask yourself questions about what you see and what you do not see. Aside from the questions you need to ask that are specific to the problem, there are some general questions to ask:
- Is this problem caused by hardware, Linux, application software, or perhaps by lack of user knowledge or training?
- Is this problem similar to others I have seen?
- Is there an error message?
- Are there any log entries pertaining to the problem?
- What was taking place on the computer just before the error occurred?
- What did I expect to happen if the error had not occurred?
- Has anything about the system hardware or software changed recently?
Other questions will reveal themselves as you work to answer these. The important thing to remember here is to gather as much information as possible. This increases the knowledge you have about this specific problem and aids in finding the root cause.
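The question about log entries is one that a few lines of code can help answer. The following is a minimal sketch, not a definitive tool: the keyword list, the function name `suspicious_lines`, and the sample log entries are all assumptions made up for illustration.

```python
import re

# The keyword list is an assumption; tune it to the symptoms being investigated.
KEYWORDS = re.compile(r"\b(error|failed|failure|denied|no space left)\b", re.IGNORECASE)


def suspicious_lines(lines, pattern=KEYWORDS):
    """Return the log lines that match the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Invented sample entries, used only so the sketch runs anywhere.
SAMPLE_LOG = [
    "Jan 10 10:00:01 host systemd[1]: Started Session 12 of user tester.",
    "Jan 10 10:00:05 host app[4321]: ERROR: write failed: No space left on device",
    "Jan 10 10:00:09 host sshd[99]: Accepted publickey for tester",
]

for entry in suspicious_lines(SAMPLE_LOG):
    print(entry)
```

On a real system the same function can be given an open log file instead of the sample list. Which file to read depends on the distribution, and on systems that log only to the systemd journal the entries would have to be exported with `journalctl` first.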
Use on-line resources to search for similar bugs. Perhaps this problem has already been reported and there is a fix for it.
As you gather data, never assume that the information obtained from someone else is correct. Observe everything yourself. This can be a major problem if you are working with someone who is at a remote location. Careful questioning is essential and tools that allow remote access to the system in question are extremely helpful when attempting to confirm the information that you are given.
When questioning a person at a remote site, never ask leading questions; they will try to be helpful by answering with what they think you want to hear.
At other times the answers you receive will depend upon how much or how little knowledge the person has of Linux and computers in general. When a person knows—or thinks he knows—about computers, the answers you receive may contain assumptions that can be difficult to disprove. Rather than ask, "Did you check..." it is better to have the other person actually perform the task required to check the item. And rather than telling the person what he or she should see, simply have the user explain or describe to you what he or she sees. Again, remote access to the machine can allow you to confirm the information you are given.
The best problem solvers are those who never take anything for granted. They never assume that the information they have is 100% accurate or complete. When the information you have seems to contradict itself or the symptoms, start over from the beginning as if you have no information at all.
The third step is reasoning: deduce from your observations of the symptoms what the problem might be.
This is where art applies to problem solving. Deducing the cause from your observations, your knowledge, and your past experience is where art, and perhaps a bit of magic, mixes with science to produce inspiration, intuition, or some other mystical mental process that provides a clue to the root cause of the problem.
In some cases this is a fairly easy process. You can see an error code and look up its meaning from the sources available to you. You can then apply the vast knowledge you have to deduce—the artful part—the cause of the problem. In other cases it can be a very difficult part of the problem determination process.
It helps to remember that the symptom is not the problem. The problem causes the symptom. You want to fix the true problem, not just the symptom.
The fourth step is action: now is the time to perform the appropriate repair action. This is usually the simple part. The hard part is what came before: figuring out what to do. Once you know the cause of the problem, it is usually easy to determine the correct repair action to take.
This might be to replace a defective hard drive or motherboard, or it might be necessary to upgrade or even fix some software.
For software with bugs, if you do not have the skills to fix it yourself or within your organization, the very least you should do is to report the bug using the appropriate means. I have reported a few bugs to Red Hat using Bugzilla. Anyone can create a Bugzilla account and search for existing similar bugs or report a new bug.
The fifth step is testing. After taking some overt repair action, the repair should be tested. This usually means performing the task that failed in the first place or something that exercises the broken bit.
If the repair action has not been successful, you should begin the procedure over again starting with the observed symptoms. It is possible that they have changed due to the action you have taken and you need to be aware of this in order to make informed decisions during the next iteration of the process. Even if the problem has not been resolved, the altered symptom could be very valuable in determining how to proceed.
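Testing a repair often amounts to re-running the operation that failed and checking whether it now succeeds. The sketch below shows one way to do that, assuming the failed operation can be expressed as a command. The `retest` helper is hypothetical, and the command it runs here is only a stand-in for the real failing task.

```python
import subprocess
import sys


def retest(command, timeout=60):
    """Run the command that previously failed; return (succeeded, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, (result.stdout + result.stderr).strip()


# Stand-in for the real failing operation: a tiny Python command that succeeds.
ok, output = retest([sys.executable, "-c", "print('operation completed')"])
print("repair verified" if ok else "still failing", "-", output)
```

If the command still fails, the captured output is a fresh set of symptoms to carry into the next pass through the observation step.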
One example of solving a problem from my own experience occurred a few years ago in my role as a part-time Linux System Administrator in a test lab environment. It is fairly simple but can illustrate the process flow of the steps I have outlined.
I received an email from one of our testers indicating that an application he had installed as part of a test was crashing. It was giving error messages indicating that it was out of swap space. This was the initial observation, performed by the user and transmitted to me.
My knowledge told me that the system being used for testing this application had 16GB of RAM and 2GB of swap space. Previous experience (knowledge) told me that swap space in these computers was almost never touched and that RAM usage was typically far below 25% of the 16GB of RAM in these boxes.
At this point I deduced that the problem was not really a problem with swap space as that would seem highly improbable. I could still hold that possibility open, though only very slightly. You will find that many error messages provided by programs can be quite misleading and user observations can be even more so.
I made some observations of my own. I logged into the box and used the free command to view memory and swap space. There was plenty of free RAM, and swap space usage was at zero. I know that if swap space usage is actually zero, then it is very likely that none of the available swap space has ever been allocated and no paging has occurred since the last boot.
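The figures that free reports can also be read from /proc/meminfo, which makes this check easy to script. The following is a rough sketch rather than the method used in the lab: the function names are hypothetical, and the sample values are invented to resemble a system with about 16GB of RAM and 2GB of swap.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(info):
    """Swap in use is the total swap minus the swap still free."""
    return info["SwapTotal"] - info["SwapFree"]


# Invented sample values; on a live system use open("/proc/meminfo").read() instead.
SAMPLE_MEMINFO = """
MemTotal:       16384000 kB
MemAvailable:   13000000 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print("swap used (kB):", swap_used_kb(info))
print("RAM available (%):", round(100 * info["MemAvailable"] / info["MemTotal"], 1))
```

With swap usage at zero and most of the RAM available, numbers like these would point away from memory as the exhausted resource.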
I also deduced from previous experience (knowledge) that there might be a kernel of truth in that error message: the system was very likely out of some resource or other. The other primary consumable resources are CPU cycles and disk space.
This did not seem like a CPU problem, so I observed disk space using the df command, which showed that the /var filesystem was full. I deduced that the full filesystem was the cause of the problem.
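A check like the one df provides can be approximated with Python's standard library. This is a minimal sketch under stated assumptions: `nearly_full` and `percent_used` are hypothetical helper names, the 90% threshold is arbitrary, and the list of mount points is only an example.

```python
import shutil


def percent_used(total, used):
    """Percentage of the filesystem in use; 0.0 for an empty or pseudo filesystem."""
    return 100.0 * used / total if total else 0.0


def nearly_full(paths, threshold=90.0, usage=shutil.disk_usage):
    """Return (path, percent used) for each path at or above the threshold.

    `usage` must return a (total, used, free) tuple, as shutil.disk_usage does.
    Note: df computes Use% from the space available to unprivileged users, so
    its figure can differ slightly from this simple used/total ratio.
    """
    report = []
    for path in paths:
        total, used, _free = usage(path)
        pct = percent_used(total, used)
        if pct >= threshold:
            report.append((path, round(pct, 1)))
    return report


# Example mount points; only those that exist on this machine are checked.
for path, pct in nearly_full(["/"], threshold=90.0):
    print(f"{path} is {pct}% full")
```

The `usage` parameter exists so the logic can be exercised with made-up numbers, for example `nearly_full(["/var"], usage=lambda p: (1500, 1500, 0))`, without needing a filesystem that is actually full.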
All of our systems were kickstarted with a /var filesystem of 1.5GB. Our policy was to install application programs in /opt, which is where the ones we tested were designed to be installed.
I discussed this with the tester and was told that he had indeed installed the application in /var. I told him to uninstall it from /var and install the application in /opt where it belonged. After he took this action, I had him test the corrective action by performing the operation that had previously failed. The test was successful and the problem was solved.
As you work through a problem it will be necessary to loop back through at least some of the steps. If, for example, performing a given corrective action does not resolve the problem, you may need to try another action or you may need to go back to the observation step and gather more information about the problem.
Analyze your process
I have been teaching people to repair both hardware and software for many years. I think that many of us use some form of problem solving process whether it has been formalized or not. When I was taught about this process it enabled me to understand when and where it was breaking down for me as I worked to solve problems. That allowed me to analyze where I was going wrong and to get back on track.
Your process may be different, and you may not realize that you actually have a describable and repeatable process. But, if you are successful at solving computer problems, you do. Awareness of that process, whatever it may be for you, can help you in resolving future problems.