System administrator responsibilities: 9 critical tasks

Sysadmins are responsible for a wide range of duties, but these are the most essential.
572 readers like this
572 readers like this
Tools for the sysadmin

opensource.com

System administrators are critical to the reliable and successful operation of an organization and its network operations center and data center. A sysadmin must have expertise with the system's underlying platform (i.e., Windows, Linux) as well as be familiar with multiple areas including networking, backup, data restoration, IT security, database operations, middleware basics, load balancing, and more. Sysadmin tasks are not limited to server management, maintenance, and repair, but also any functions that support a smoothly running production environment with minimal (or no) complaints from customers and end users.

Although sysadmins have a seemingly endless list of responsibilities, some are more critical than others. If you work in a sysadmin role (or hope to one day), make sure you are ready to follow these best practices.

Documentation

Documentation is how sysadmins keep records of assets, including hardware and software types, counts, and licenses. Should there be any issues in the production environment, documentation helps identify the hardware, virtual machine, appliance, software, etc., that may be involved.

Hardware inventory

Maintain lists of all your physical and virtual servers with the following details:

  • OS: Linux or Windows, hypervisor with versions
  • RAM: DIMM slots in physical servers
  • CPU: Logical and virtual CPUs
  • HDD: Type and size of hard disks
  • External storage (SAN/NAS): Make and model of storage with management IP address and interface IP address
  • Open ports: Ports opened at the server end for incoming traffic
  • IP address: Management and interface IP address with VLANs
  • Engineering appliances: e.g., Exalogic, PureApp, etc.

Software inventory

  • Configured applications: e.g., Oracle WebLogic, IBM WebSphere Application Server, Apache Tomcat, Red Hat JBoss, etc.
  • Third-party software: Any software not shipped with the installed OS

License details

Maintain license counts and details for physical servers and virtual servers (VMs), including licenses for Windows, subscriptions for Linux OS, and the license limit of hypervisor host.

Server health checkup

  • Running processes: Check for processes that are consuming more resources than expected, and take action to fine-tune the applications (with the help of the application team).
  • CPU utilization: Consistently monitor and check the CPU utilization of the critical process like "java", "http", "mysql" etc. to ensure that these are not consuming the CPU resources more than expected. If it is so, then coordinate with the application team to check it at application level  and fine tune the same. Parallely analyse the OS parameters like "Ulimits".
  • Memory utilization: Check memory utilization and clear the cache, if required.
  • Zombie processes: Check for processes where the PID still exists in the process table after it is terminated. Zombie processes degrade server performance, so find and kill any that exist.
  • Load average: If you're having performance issues, check the load average and tune the server for performance.
  • Disk/SAN/NAS utilization: Check the I/O reports for externally attached storage to track and check the speed of read/write operations. If you find any issues, coordinate with the storage and network teams immediately to correct them.

Backup and disaster recovery planning

Communicate with the backup team and provide them the data and client priorities for backup. The recommended backup criteria for production servers is:

  • Incremental backups: Daily, Monday to Friday
  • Full backup: Saturday and Sunday
  • Disaster recovery drills: Perform restoration mock drills once a month (preferably, or quarterly if necessary) with the backup team to ensure the data can be restored in case of an issue.

Patching

Operating system patches for known vulnerabilities must be implemented promptly. There are many types and levels of patches, including:

  • Security 
  • Critical 
  • Moderate

When a patch is released, check the bug or vulnerability details to see how it applies to your system (e.g., does the vulnerability affect the hardware in your system?), and take any necessary actions to apply the patches when required. Make sure to cross-verify applications' compatibility with patches or upgrades.

Application compatibility

Before going live with any application, check its compatibility with your hardware and operating system, and make sure to do load testing (with the support of application team).

Server hardening

Linux:

  • Set a BIOS password: This prevents users from altering BIOS settings.
  • Set a GRUB password: This stops users from altering the GRUB bootloader.
  • Deny root access: Rejecting root access minimizes the probability of intrusions.
  • Sudo users: Make sudo users and assign limited privileges to invoke commands.
  • TCP wrappers: This is the weapon to protect a server from hackers. Apply a rule for the SSH daemon to allow only trusted hosts to access the server, and deny all others. Apply similar rules for other services like FTP, SSH File Transfer Protocol, etc.
  • Firewalld/iptables: Configure firewalld and iptables rules for incoming traffic to the server. Include the particular port, source IP, and destination IP and allow, reject, deny ICMP requests, etc. for the public zone and private zone.
  • Antivirus: Install antivirus software and update virus definitions regularly.
  • Secure and audit logs: Check the logs regularly and when required.
  • Rotate the logs: Keep the logs for limited period of time like "for 7 days", to keep the sufficient disk space for flawless operation.

Windows:

  • Set a BIOS password: This prevents users from altering BIOS settings.
  • Antivirus: Install antivirus software and update virus definitions regularly.
  • Configure firewall rules: Prevent unauthorized parties from accessing your systems.
  • Deny administrator login: Limit users' ability to make changes that could increase your systems' vulnerabilities.

Use a syslog server

By configuring a syslog server in the environment to keep records of system and application logs, in the event of an intrusion or issue, the sysadmin can check previous and real-time logs to diagnose and resolve the problem.

Automation

Many sysadmin tasks (such as server health checkups, resource utilization, backup triggers, transfer files and logs, etc.) must be done at specific times. Therefore, the sysadmin must write scripts or use external tools and configure them as cron jobs to do the tasks automatically at the proper time.

Monitoring tools

Install and configure live monitoring tools like Nagios, HP, etc., to monitor your IT infrastructure and issue alerts about potential problems.

Conclusion

While these are the most important tasks a sysadmin is responsible for, there is much more to the role than the duties on this list.

For example, the sysadmin must coordinate with multiple teams to resolve issues, communicate with and update customers, maintain 100% uptime, hold discussions with the audit team, prepare weekly/monthly/quarterly reports, do continuous monitoring of servers and services using appropriate tools, and maintain the hardware console and respond to any triggered alarms.

The sysadmin is always a single point of content (SPOC) in the data center or network operations center for issues related to web hosting, application and server outages, and other critical IT operations problems.

What other tasks or best practices do you think are essential for sysadmins? Please share your opinion in the comments.

What to read next
Tags
Alok Sharma was born and raised in Jaipur (India). Alok has Masters in VLSI Deisgn (M.Tech.) from Malaviya National Institute of Technology (M.N.I.T), Jaipur. Alok has worked with the Higher Technical Institution in Reseach Lab and other IT organization as a Linux System Administrator and managed Server Team as a Lead.

7 Comments

Very nice list, thank you. It is a great template for new sysadmins to use as a starting point.

To be honest, I think this article is not exactly on target these days.

For example, "Maintain lists of your servers?" That's an OK approach if you're maintaining one server and maybe a dozen PCs. Even a small shop has too many devices to manage by hand any more. Besides, what about the need to maintain container images that are deployed to third party services? A far better recommendation would be, "Learn how to use auto-discovery tools and how to integrate them with your CMDB."

Automation -- "...write scripts or use external tools..." A far better recommendation would be, learn the concepts around Infrastructure As Code. Learn to use tools like Ansible instead of doing any task by hand. Learn how to use version control systems like git to manage your playbooks, config files, scripts, etc.

The point I'm trying to make here is that the world of IT is changing incredibly rapidly. Agile and DevOps practices are penetrating into every organization. The practices that are outlined in this article simply don't align with that new world. Sysadmins have to adapt if they are going to remain relevant.

Thanks for the feedback.
I would like to mention that the not every system admin is expert/cabable in Ansible, deveops etc. because generally there are L1, L2, L3 level of System admins with different level of responsibilities like documentation, monitoring, ticket handling, escalation handling etc.
And the System Admin on client locations must have to follow the instructions from the client with constraints and minimum tools and resources. Like, in many organizations the Patching is manual and in some patching may be automated.
Autodiscovery tools are an option but inventory can be done simply by exporting the data from Virtual environment (like VMware) and same can be done for physical servers. Still there are multiple organizations, which are not using containeraized environment, Instead they are using private cloud only.
Integration, autodiscovery, what to use, how to use, when to use are only decided by customer and the System Admin must have to bind and follow the Instructions to make the things better with his best effort.
So, the point is, that the article is general for 'To Dos' as a System Admin, it is not for specific or customized environment.
How to do can only be decided by Service provider and Customer agreements by keeping the view of limitations of resources.

In reply to by sgtrock

I both agree and disagree with you. :) Yes, a contractor sysadmin has to adapt to a client's practices. However, to write an article such as this without mentioning what is happening to the industry and the transformative affect it is having on the job leaves the reader with at best an incomplete picture.

For example, I work for a fairly large U.S. based corporation (approaching 100,000 employees). We are just taking baby steps towards adopting DevOps practices today but it's clear to just about all of us if you're not on the Agile/DevOps train you're likely to be run over. At that, we're very late to the game with many of our competitors and suppliers years ahead of us.

In other words, I think that any article about sysadmin practices that doesn't discuss the impact of the truly transformative shift in the way that they will be doing their jobs in the very near future is doing the readers a disservice.

Does that all make sense?

In reply to by Alok Sharma

Creative Commons LicenseThis work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.