My first sysadmin mistake

How to focus on finding a solution amid the panic.

Opensource.com

If you work in IT, you know that things never go completely as you think they will. At some point, you'll hit an error or something will go wrong, and you'll end up having to fix things. That's the job of a systems administrator.

As humans, we all make mistakes. Sometimes we are the error in the process, or we are the thing that went wrong, and we end up having to fix our own mistakes. It happens to everyone.

As a young systems administrator, I learned this lesson the hard way. I made a huge blunder. But thanks to some coaching from my supervisor, I learned not to dwell on my errors, but to create a "mistake strategy" to set things right. Learn from your mistakes. Get over it, and move on.

My first job was as a Unix systems administrator for a small company. Really, I was a junior sysadmin, but I worked alone most of the time. We were a small IT team, just the three of us. I was the only sysadmin for 20 or 30 Unix workstations and servers. The other two supported the Windows servers and desktops.

Any systems administrators reading this probably won't be surprised to know that, as an unseasoned, junior sysadmin, I eventually ran the rm command in the wrong directory. As root. I thought I was deleting some stale cache files for one of our programs. Instead, I wiped out all files in the /etc directory by mistake. Ouch.

My clue that I'd done something wrong was an error message that rm couldn't delete certain subdirectories. But the cache directory should contain only files! I immediately stopped the rm command and looked at what I'd done. And then I panicked. All at once, a million thoughts ran through my head. Did I just destroy an important server? What was going to happen to the system? Would I get fired?

Fortunately, I'd run rm * and not rm -rf * so I'd deleted only files. The subdirectories were still there. But that didn't make me feel any better.

Immediately, I went to my supervisor and told her what I'd done. She saw that I felt really dumb about my mistake, but I owned it. Despite the urgency, she took a few minutes to do some coaching with me. "You're not the first person to do this," she said. "What would someone else do in your situation?" That helped me calm down and focus. I started to think less about the stupid thing I had just done, and more about what I was going to do next.

I put together a simple strategy: Don't reboot the server. Use an identical system as a template, and re-create the /etc directory.

Once I had my plan of action, the rest was easy. It was just a matter of running the right commands to copy the /etc files from another server and edit the configuration so it matched the system. Thanks to my practice of documenting everything, I used my existing documentation to make any final adjustments. I avoided having to completely restore the server, which would have meant a huge disruption.
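A minimal sketch of that recovery approach, with a hypothetical twin-server hostname and file names, and a throwaway staging directory standing in for the real /etc:

```shell
# Sketch of the recovery described above. Hostname and file names are
# hypothetical; temp directories stand in for /etc so nothing real is touched.
set -eu
STAGE=$(mktemp -d)           # staging area to review before promoting
GOOD=$(mktemp -d)            # stand-in for the identical "template" server
printf 'myhost\n' > "$GOOD/hostname"
# On a real system you might pull from the twin server instead, e.g.:
#   scp -rp twin-server:/etc/. "$STAGE"/
cp -a "$GOOD/." "$STAGE/"
# Review and edit the copied configs so they match THIS system before
# promoting them; hostname is the obvious first candidate to fix up.
cat "$STAGE/hostname"
```

Staging first means you can diff against whatever survived the accident instead of blindly overwriting it.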

To be sure, I learned from that mistake. For the rest of my years as a systems administrator, I always confirmed what directory I was in before running any command.
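That habit can be sketched as a three-step routine; the paths here are throwaway examples created just for the demonstration:

```shell
# Sketch of the "confirm before you delete" habit. All paths are
# illustrative temp files, not real system directories.
set -eu
WORK=$(mktemp -d)
mkdir -p "$WORK/cache"
touch "$WORK/cache/stale.tmp"
cd "$WORK/cache"
pwd                  # 1. confirm where you actually are
ls -- *              # 2. preview exactly what the glob matches
rm -- *.tmp          # 3. only then delete, with a pattern tighter than bare *
```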

I also learned the value of building a "mistake strategy." When things go wrong, it's natural to panic and think about all the bad things that might happen next. That's human nature. But creating a "mistake strategy" helps me stop worrying about what just went wrong and focus on making things better. I may still think about it, but knowing my next steps allows me to "get over it."

photo of Jim Hall
Jim Hall is an open source software advocate and developer, best known for usability testing in GNOME and as the founder + project coordinator of FreeDOS. At work, Jim is CEO of Hallmentum, an IT executive consulting company that provides hands-on IT Leadership training, workshops, and coaching.

12 Comments

I think we've all deleted something we shouldn't have. My worst was deleting the product file on a live point-of-sale system. Most of the clients had several copies of the product file in several places on their systems, and I was asked to dial in and delete the copies because they were running out of disk space. So I deleted a few copies and did one more with cd / rm / pwd (see what I did wrong there?). I immediately saw my error and phoned the client, who were phoning us to say that a product they knew they had loads of in stock was missing from their system.

I was very lucky - I deleted the file at 16:45 and the business closed at 17:15 so they only had to hand write orders for 30 mins. Had I done it in the morning, they would not have been best pleased, as the restore took 2 hours from tape. I had to stay late to fix my mess.

Moral of the story: always make backups.

-and backup everything, not just your data directories. Having a backup of /etc would have made recovery much easier in my case (and yes, I learned, and made backups of everything after that).

Also: test backups regularly to make sure they work! :-)

In reply to Shawn H Corey

Thanks for sharing this. I deleted files under /var by mistake when I first started, too.

My first "oh no, Benjamin, what have you done?!" moment was when I was moving data around to upgrade disks on a storage array. I did an `rsync --delete`... in the wrong direction. Fortunately, it was on the newest of our arrays, so it only impacted a dozen or so users (me included, which helped with the apology). It took the better part of a day for all the restores to finish.

I've been a contractor for years. I've made a fair number of mistakes, but eventually settled into patterns like canary'ing (apply changes to a throwaway server first, then health-checking).

I once deleted the MySQL data directory because I was tired and a forum told me it would make MySQL faster. Btw thanks trolls, for giving me a reason to work more reasonable hours.

The worst I've seen from another contractor was from a former Yahoo employee I still occasionally work with. I only know they used to work for Yahoo because they told me so. They once said, "don't worry, I fixed your code for you," and when I asked them not to modify my code, they replied, "I've forgotten more since we last spoke than you know," or other snotty words to that effect. I was not very happy.

I know it's wrong, but I was so happy to learn they had wiped out 48 hours of server time, and that I was the one asked to fix it, despite them being "smarter" than me. The lesson is that we all make mistakes, that big-company success doesn't mean you're never wrong, and that it's a good idea not to climb too high.

Work out a process before the disaster, like backing up MySQL data folders or committing code to a repo. Then it's muscle memory to the rescue, and you get to work normal hours too!

How not to do things: this is valuable. Especially when one goes deep into analyzing what went wrong, outlines it, and proposes better ways to do things. Hopefully, these will get adopted by others.

I think the best part of this post is how the supervisor reacted. This is how you create an atmosphere where employees are not afraid to raise their hand when they make a mistake, which greatly improves the speed of recovery from those mistakes. Not to mention that the supervisor coached instead of handing the problem to someone more experienced.

I was a system administrator of some Novell NetWare 6.0 and 6.5 systems back in 2003. We used Tivoli Storage Manager to back up our users' files. Because we were just beginning to use TSM, and we hadn't done the training yet, we didn't know how to set up log rotation correctly, so we used to just delete the logs when they got too big. There were also other services that filled the HDs with logs, but TSM was by far the most eager.

The fact is, it was a Friday night and I just had to clean up some logs before I could go get a beer with my friends. So I used a tool that summarizes the used space on a NetWare volume to measure the amount of free space left on the SYS volume (something like C:\Windows for Windows systems).

Before I hit the "delete" key on the keyboard, I hit "F5" to refresh the contents of the screen. But refreshing the screen also reset the selection, selecting the root directory of the SYS volume.

I was just waiting for it to finish deleting when I saw that some NLM files were being deleted. NLM files are the executable binaries on a NetWare system. I immediately cancelled the deletion and thought about what I should do. Then I realized I could use the "salvage" option. Salvage works like the Windows recycle bin, but way better, because it kept multiple versions of the same file and tracked the time, date, and user who deleted it. I just had to filter for recent files deleted by me, and I could restore them with no downtime to the system.

Of course, I was lucky, because salvage was enabled (why do they even have the option to disable it?) and because the server didn't stop, so it wouldn't crash at boot.

It was a bad time, but in the end it became a funny story to tell my friends at the bar, and we laughed...

For config files I keep a /root/archive folder, and have a little script that copies files there with a datetime stamp appended. I mentally work through a rollback plan before implementing any changes. I also ask myself what the impact is if I make the change successfully, and what it is if I break something.
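That timestamped-copy habit fits in a few lines of shell. This is a sketch under stated assumptions: the commenter's /root/archive is replaced by a temp directory, and the file name is a made-up stand-in for a real config:

```shell
# Sketch of a timestamped config-backup helper, as described above.
# A temp dir stands in for /root/archive; sshd_config is illustrative.
set -eu
WORK=$(mktemp -d); cd "$WORK"
ARCHIVE="$WORK/archive"; mkdir -p "$ARCHIVE"   # stand-in for /root/archive
backup_cfg() {
    # copy a file into the archive with a datetime stamp appended,
    # e.g. sshd_config.20240101-120000
    cp -p "$1" "$ARCHIVE/$(basename -- "$1").$(date +%Y%m%d-%H%M%S)"
}
printf 'Port 22\n' > sshd_config    # stand-in for a real /etc config file
backup_cfg sshd_config
ls "$ARCHIVE"
```

Run it before every edit and the rollback plan writes itself: copy the newest stamped file back.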

For rm, especially rm -rf, I use absolute paths, AND I visually confirm the path. I use the default RHEL .bash_profile, which includes the current host and directory name as part of the prompt. If the prompt isn't specific enough, or if I feel like I've lost track of where I am, I use pwd.

Worst break by me was implementing a broken SELinux module. I used automated tools to generate a new module. It was empty, and when it was executed, it added a new context that matched "/.*". Then I relabeled and, predictably enough, stuff started breaking. In our organization, I'm a sysadmin, and we have engineers who build the new stuff. I asked a couple of them who have been doing this longer than I have what they would do, and their answer was restore from backup, or rebuild from scratch.

Well, this server is an admin server. It hosts a yum repo and a file share for random files that I need as admin for my little network, and it serves as my host for Ansible. As such, I did not have a full backup. Actually, I didn't have any, since there would be zero user impact if the server dropped dead. Or so I thought.

As soon as I broke it, I got a request from a user to install a piece of software that had recently been approved for the network. Easy enough, pop it in a custom repo and push it with Ansible. Except, oops, admin server is Tango Uniform. Rebuilding would take a while, because I needed to move the yum repo (four full RHEL repos, and the network is offline) off the server then rebuild. So I estimated a day and a half of waiting, plus a couple hours of kicking things off and reconfiguration.

So I took a deep breath and did what I should have done in the beginning: I looked in the logs and found the source of the problem. It was a line in the SELinux contexts file. The first line, actually. I tried to use the utilities to remove the line, but it was malformed and they wouldn't touch it. So I backed up the file (LOL) and then deleted the line. Then I kicked off a relabel and everything was fixed.
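The back-up-then-strip-the-line step can be sketched like this, run against a throwaway copy of a contexts file rather than the live SELinux configuration. The file contents and path are made up for illustration:

```shell
# Sketch of the fix described above, on a throwaway copy of a contexts
# file. The entries below are invented for illustration only.
set -eu
WORK=$(mktemp -d)
CTX="$WORK/file_contexts.local"
printf '%s\n' '/.* BROKEN_ENTRY' \
  '/srv/repo(/.*)? system_u:object_r:httpd_sys_content_t:s0' > "$CTX"
cp -p "$CTX" "$CTX.bak"          # back up the file first (LOL, but it works)
tail -n +2 "$CTX.bak" > "$CTX"   # drop the malformed first line
cat "$CTX"
# On the real system, finish with a relabel, e.g.:
#   restorecon -R /    # or: touch /.autorelabel && reboot
```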

Moral of the story is that even more experienced Linux gurus don't always have the best answers, and always, ALWAYS, go to the logs first when troubleshooting.

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.