5 sysadmin horror stories

Happy System Administration Appreciation Day!

The job ain't easy. There are constantly systems to update, bugs to fix, users to please, and on and on. A sysadmin's job might even entail fixing the printer (sorry). To celebrate the hard work our sysadmins do for us, keeping our machines up and running, we've collected five horror stories that prove just how scary and difficult the job can be.

Do you have your own sysadmin horror story? Let us know in the comments below.

Screech! Crash! Boom.

from David Both

Back in the late 1970s I was working for IBM as a customer engineer in a small town in northwest Ohio. There were a lot of what were, even then, very old unit record devices like keypunches, card sorters, and other similar devices that I worked on quite frequently. There were also some more modern mid-range and mainframe computers that we serviced. On one late summer evening, I was working on one of the keypunches and was more or less a witness to the worst thing that can happen to a business.

It seems that this company had hired a new night operator who had been on the job for only a couple of weeks. He was following the instructions in the run book to the letter while running payroll: he loaded the payroll disk pack on one of the large IBM disk drives, probably an IBM 3350, and started it up. At that point the newly minted operator heard a very loud screeching sound, and the disk failed to come online.

As a more experienced operator would have known, the drive had suffered a head crash, or what IBM called Head-Disk Interference (HDI). This meant that both the heads and the disk itself were damaged.

The new operator then placed the same disk pack on a different drive unit with exactly the same result. He knew that was not good, but he had been told where the backup payroll disk pack was located, so he proceeded to load that onto the first, already damaged drive unit. When he tried to load it, that combination also resulted in the same bone-chilling screech. He now figured that he should call the lead operator who immediately rushed on-site and, after hearing what had happened, fired the poor newbie on the spot.

It only took the IBM field engineer a few hours to rebuild the two damaged drive units, but it took the company weeks to recover all of the lost data by hand. Sometimes a single backup is not enough, and complete operator training is of paramount importance.

The accidental spammer

An anonymous story

It's a pretty common story that new sysadmins have to tell: They set up an email server and don't restrict access as a relay, and months later they discover they've been sending millions of spam emails across the world. That's not what happened to me.
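For reference, the classic open-relay mistake usually comes down to a missing relay restriction in the Postfix configuration. A minimal sketch of the kind of setting that prevents it (these are standard Postfix parameters, not the configuration from this story):

    # A sketch, not the server from the story: relay mail only for the
    # server's own networks and authenticated users, and reject relay
    # attempts to any other destination.
    postconf -e 'mynetworks = 127.0.0.0/8'
    postconf -e 'smtpd_relay_restrictions = permit_mynetworks, permit_sasl_authenticated, reject_unauth_destination'
    postfix reload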

I set up a Postfix and Dovecot email server. It ran fine, with all the right permissions and all the right restrictions, and it worked brilliantly for years. Then one morning, I was given a file of a few hundred email addresses. I was told it was an art organization list, and there was an urgent announcement that had to be made to the list as soon as possible. So, I got right on it. I set up an email list, wrote a quick sed command to pull the addresses out of the file, and imported all the addresses. Then, I activated everything.
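The story doesn't include the actual command, but a quick address extraction like the one described might have looked something like this (the file name and the regex are my own guesses):

    # Hypothetical input file; pull out anything that looks like an
    # email address, one per line, and deduplicate the result.
    sed -nE 's/.*([[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}).*/\1/p' contacts.txt | sort -u > addresses.txt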

Within ten minutes, my server nearly fell over. It turned out I had been asked to set up a mailing list for people we'd never met, never contacted before, and who had no idea they were being added to a mailing list. I had unknowingly set up a way for us to spam hundreds of people at arts organizations and universities. Our address got blacklisted by a few places, and it took a week for the angry emails to stop. Lesson: Ask for more information, especially if someone is asking you to import hundreds of addresses.

The rogue server

from Don Watkins

I am a liberal arts person who wound up being a technology director. With the exception of 15 credit hours earned on my way to a Cisco Certified Network Associate credential, all of the rest of my learning came on the job. I believe that learning what not to do from real experiences is often the best teacher. However, those experiences can frequently come at the expense of emotional pain. Prior to my Cisco experience, I had very little experience with TCP/IP networking and the kinds of havoc I could create, albeit innocently, due to my lack of understanding of the nuances of routing and DHCP.

At the time our school network was an Active Directory domain with DHCP and DNS provided by a Windows 2000 server. All of our staff access to email, the Internet, and network shares was served this way. I had been researching the use of the K12 Linux Terminal Server (K12LTSP) project and had built a Fedora Core box with a single network card in it. I wanted to see how well my new project worked, so without talking to my network support specialists I connected it to our main LAN segment. In a very short period of time our help desk phones were ringing with principals, teachers, and other staff who could no longer access their email, printers, shared directories, and more. I had no idea that the Windows clients would see another DHCP server on our network (my test computer) and pick up an IP address and DNS information from it.

I had unwittingly created a "rogue" DHCP server and was oblivious to the havoc it would create. I shared with the support specialist what had happened, and I can still see him making a beeline for that rogue computer and disconnecting it from the network. All of our client computers had to be rebooted, along with many of our switches, which resulted in a lot of confusion and lost time due to my ignorance. That's when I learned that it is best to test new products on their own subnet.
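If you ever suspect the same thing is happening on one of your own segments, a broadcast DHCP probe can flush out the unexpected server. A rough sketch using nmap's broadcast discovery script (the interface name is just a placeholder, and this wasn't part of the original incident):

    # Send a DHCPDISCOVER broadcast and list every server that answers;
    # more than one responder usually means a rogue DHCP server.
    sudo nmap --script broadcast-dhcp-discover -e eth0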

Licensing woes

Another anonymous story

I was working at a small non-profit organisation where the CEO would only pay for software from the company he owned stock in; everything else, he had the IT department use illegally (purchase one copy, distribute many). He did this by making it a requirement of the job that certain software was on every computer, but he never authorised the purchase of a site license or more licenses than we had to begin with.

I was new to IT and had a grand scheme of how I'd convince people to use free and open source versions of the software, but when a company's CEO and culture explicitly permit illegal use of software, open source can be a tough sell (aside from when it fills in gaps the closed source software can't cover anyway, but then it's not replacing anything, so the problem remains).

I left the job after it became clear that management truly understood what they were doing and why it was wrong, and had no intention of ever rectifying it. I did this partly because I didn't approve of the ethics (if you're going to use software that requires a license, then pay the licensing fee; that's part of the deal), and partly because I was pretty sure that if the lawyers came knocking, the organization was not going to indemnify the IT department (more likely, they'd throw us under the bus).

Sure enough, about a year after I'd left, they got hit with a lawsuit from one of the companies whose software they were using illegally. I moved on to a company that uses about 90% open source software (some of it paid, some of it $0).

Cover the hole!

from Don Watkins

It was early 2004, and I had recently attended Red Hat system administration training. I was looking for ways to apply my newfound knowledge when the Western New York Regional Information Center began looking for a pilot school to try Lotus Notes on a Linux server. I volunteered our school district for the pilot.

Working with a Linux-experienced microcomputer support specialist supplied by the regional information center, we used a spare rackmount server and installed Red Hat Enterprise Linux on it. As part of our installation, we configured the server to use an included DDS3 tape drive to back up the email server once a day. Each day I would simply insert a tape marked for that day of the week; with a tape for each of the five weekdays over a two-week cycle, we used ten tapes. Everything worked well for a period of time until our tape drive ceased to work properly. Email is mission critical. What were we going to do with a non-functioning tape drive?

Necessity is frequently the mother of invention. I knew very little about Bash scripting, but that was about to change rapidly. Working with the existing script and using online help forums, search engines, and some printed documentation, I set up a Linux network-attached storage computer running Fedora Core. I learned how to create an SSH keypair and configure that, along with rsync, to move the backup file from the email server to the storage server. That worked well for a few days until I noticed that the storage server's disk space was rapidly disappearing. What was I going to do?
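For readers who haven't built something like this, the keypair-and-rsync arrangement described above usually boils down to a couple of commands. A rough sketch, with hypothetical hostnames, paths, and file names:

    # One-time setup on the email server: generate a passwordless keypair
    # and install the public key on the storage box so a cron job can
    # connect without prompting.
    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    ssh-copy-id -i ~/.ssh/id_rsa.pub backup@storage.example.com

    # Nightly job: push the day's backup file across to the storage server.
    rsync -avz -e ssh /var/backups/notes-backup.tar.gz backup@storage.example.com:/srv/backups/mail/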

That's when I learned more about Bash scripting. I modified my rsync command to delete backed-up files older than ten days. In both cases I learned that a little knowledge can be a dangerous thing, but each time my experience and confidence as a Linux user and system administrator grew, and because of that I was able to serve as a resource for others. On the plus side, we soon realized that the disk-to-disk backup system was superior to tape when it came to restoring email files. In the long run it was a win, but there was a lot of uncertainty and anxiety along the way.
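The original ten-day retention was built into the rsync command; a find-based cleanup is another common way to get the same effect, sketched here with the same hypothetical paths:

    # Run on the storage server after each transfer: remove backup files
    # older than ten days so the disk doesn't fill up again.
    find /srv/backups/mail/ -name 'notes-backup-*.tar.gz' -mtime +10 -delete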

Opensource.com publishes stories about creating, adopting, and sharing open source solutions. Follow us on Twitter @opensourceway.

4 Comments

Ouch - the first one made me wince all over again. I don't think there is a single data centre which had the old top-loading disk packs that hasn't had this happen (usually only once, as the number of disks and drives that get wrecked tends to grow in the telling - pretty sure that our mid-'80s horror story was up to three drives and four disk packs by the time I retired - two and two was the real story). We didn't sack the culprit, we promoted him out of the computer hall.

To add one of my own mainframe horror stories involving operators: I once had the dubious pleasure of helping my boss recover from this one. Middle of the night, print queue deactivated, and a bored operator was wondering what would happen if you detached the last printer on the system and then reactivated the queue - so he tried it!! As the last printer on the system was also the halt/load printer, what he got was a full system dump for each of the first twenty jobs in the queue, after which the system ran out of space on Dumpdisk, Swapdisk, and main memory and froze absolutely solid. We had to do a hardware-level halt/load to get out of that one. My suggestion on the SA team response to the incident report that we should cut the operator's fingers off so he couldn't do it again was endorsed by my boss and reluctantly deleted by the team leader.

So question - what is more dangerous on the night shift - inexperienced but well meaning, or experienced but bored?

I have a horror story from another IT person. One day they were tasked with adding a new server to a rack in their data center. They added the server... being careful not to bump a cable to the nearby production servers, SAN, and network switch. The physical install went well. But when they powered on the server, the ENTIRE RACK went dark. Customers were not happy. :( It turns out that the power circuit they attached the server to was already at max capacity, and thus they caused the breaker to trip. Lessons learned... use redundant power and monitor power consumption.
Another issue was being a newbie on a Cisco switch, making a few changes, and thinking the innocent-sounding "reload" command would work like Linux does when you restart a daemon. Watching 48 link activity LEDs go dark on your VMware cluster switch... Priceless.

I have two:

I was the Windows admin for a school district, and the Linux admin left abruptly. About three days after I took over the Linux admin duties, I came into work and the web server was down. I jumped on the console and saw hundreds of disk read errors for the RAID array. This particular server had Linux on one drive, and all the home dirs and web dirs on a three-disk array. I thought, no big deal, it was only barking about one drive, and I had a cold spare HDD for the server. So I got it out of the box and went to the console to verify which drive to pull, and that's when I noticed the RAID level was 0! After the "oh crap!" moment, I thought, no big deal, the array is gone anyway, so I will just replace the failed disk, rebuild it as RAID 5, and restore the latest backup. I couldn't find a tape for the server, so I looked in cron to see the destination of the backup job and found that he was tarring up the home dirs and web dirs to a file that lived on the same array!!! This scenario went from bad to worst case just that fast. In the end, and $7,000 later, we had to send the array to Drive Savers to get the data back.

Years later, one of the UPSes in the server rack died, and while the replacement was on order, the battery failure light and beeper on the working UPS started going off. Perfect! :-( We got the first replacement UPS in, and when we unplugged the failed one to replace it, we found that the 220 outlet was melted and charred black. Upon inspection we found that the UPS's built-in power cord was not screwed down to the terminals properly, which had caused the power to arc internally and periodically until the power cord and outlet were melted to the point of no connection. We replaced the outlet and got the new UPS racked up. Keep in mind that this whole time the rack was running on a single UPS with a failed battery.

I was reaching down to power up the new UPS as my guy was stepping out from behind the rack, and the whole rack went dark. His foot had caught the power cord of the working UPS and pulled it just enough to break the contacts, and since the battery had failed, the UPS couldn't provide power and shut off. It took about 30 minutes to bring everything back up.

Things went much better with the second UPS replacement. :-)

This one seems to be a classic too:

Working for a large UK-based international IT company, I had a call from the newest guy in the internal IT department: "The main server, you know ..."

"Yes?"

"I was cleaning out somebody's homedir ..."

"Yes?"

"Well, the server stopped running properly ..."

"Yes?"

"... and I can't seem to get it to boot now ..."

"Oh-kayyyy. I'll just totter down to you and give it an eye."

I went down to the basement where the IT department was located and had a look at his terminal screen on his workstation. Going back through the terminal history, just before a hefty amount of error messages, I found his last command: 'rm -rf /home/johndoe /*'. And I probably do not have to say that he was root at the time (it was them there days before sudo, not that that would have helped in his situation).

"Right," I said. "Time to get the backup."

I knew I had to leave when I saw his face start twitching and he whispered: "Backup ...?"

==========

Bonus entries from same company:

It was the days of the 5.25" floppy disks (Wikipedia is your friend, if you belong to the younger generation). I sometimes had to ask people to send a copy of a floppy to check why things weren't working properly. Once I got a nice photocopy and another time, the disk came with a polite note attached ... stapled through the disk, to be more precise!

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.