A guide to scientific computing system administration

Image by:

Opensource.com

When developing applications for science there are times when you need to move beyond the desktop, but a fast, single node system may also suffice. In my time as a researcher and scientific software developer I have had the opportunity to work on a vast array of different systems, from old systems churning through data to some of the largest supercomputers on the planet.

Single systems

Due to licensing restrictions, budgets for hardware, and a number of other factors single systems are pretty common, and they are administered in a fairly ad-hoc fashion. They may require special drivers for NVIDIA graphics cards, but other than that they are installed and configured to run a few applications well. They also rely on ad-hoc sharing of resources between a small number of researchers, and are updated and maintained by a student, postdoctoral researcher, or in some cases research staff. This was one of the things that got me interested in Gentoo: squeezing maximum performance out of a system. Back then 64 bit processors were still quite new, so running 64 bit was fancy.

The systems are often set up to run without a monitor attached, and Linux is an obvious choice with SSH for remote access and administration. A small number of people, or in my case often just me, had root, and other people would log in and do their work. They were only exposed to the local network, and files were transferred over SCP. One or two had Samba shares to make it easier for the Windows and Mac users to share files using their own familiar tools. Single systems allow for a lot of specialization for a particular purpose, and are quite effectively shared in research group contexts.

Small research clusters

This setup is quite common, a mixture of dedicated clusters for a particular research group, or buying an additional set of nodes to add to a larger departmental cluster. The machines may be administered by dedicated staff, or it may fall to research group members. You are often afforded greater access to the machines purchased through a higher priority in the scheduler, exclusive access for a period of time, the ability to install custom software, or to optimize the hardware for particular tasks.

The first cluster I administered in an academic context was an Apple XServe cluster, a now failed experiment by Apple to make an impact in the server space. Most other groups had Linux-based clusters, but the decision was made to go with the XServe cluster in that case. Most of my previous experience was in Linux so it was quite a learning curve. I had installed and administered the Sun Grid Engine on a cluster of Sun machines about five years prior to this, and so I had experience on other Unix flavors.

Typical network topology

The typical set up for a cluster, and this generally holds even up to supercomputers, is that users log in to the head node (or login nodes). These are the only nodes accessible from the network, and the only nodes where users can directly log in and interact. The bulk of the cluster is made up of compute nodes, and these are on a private network with scheduling controlled by some kind of job scheduler such as Portable Batch System (PBS), Sun Grid Engine, Torque, or Slurm. User accounts are common to all nodes, a shared file system makes it easy to get data to and from the compute nodes, and many compute nodes have local disk storage that is faster, but only accessible from the node.

If you need to compile custom code, that is typically done on the head nodes. On some of the systems that I used the architecture did not match the compute nodes, so we had to use cross-compilation, but that is less common these days. The compute nodes often use a fast network fabric, such as InfiniBand, and it is necessary to link to custom Message Passing Interfaces (MPI) and communication libraries in order to get the best performance. The machines are generally designed to be as heterogeneous as possible, often with older Linux distributions and outdated libraries. This is mitigated to a degree by the use of environment modules. The module system enables you to execute commands that add custom libraries, compilers, or whatever you need to your environment. These are used on larger clusters and supercomputers quite frequently, and smaller clusters may make use of them if the software requires mutually exclusive library versions.

Running jobs

One of the things I have been working on in the last few years is making it easier for scientists to use clusters and supercomputers in their work. It is still quite typical for researchers to log in to the head/login node, schedule their jobs, and monitor them using command line tools. Some of the larger projects script the execution of jobs, but it is still a manual process. This is often made more difficult by two-factor authentication systems that are tied to SSH and other security restrictions, which do not lend themselves to being used by any higher-level systems. Allocations of time are usually made based upon sharing agreements, or proposals for larger supercomputers.

The sysadmins ensure that the batch scheduler is set up to record use, and enforce allocations made to the various system users. Supercomputers are generally set up to run enormous jobs in parallel, which utilize a large portion of the machine through the use of distributed memory programming models. More and more these also require the use of local shared memory parallelism, and in the most extreme cases can require three models for parallelism: distributed memory, shared memory on CPU, and GPGPU shared memory. There is a range, and the OpenACC specification offers some automated parallelism of code when the right compilers are used for the cluster or supercomputer. Supercomputers and clusters are becoming more complex, and it is increasingly important to parallelize on multiple levels. A lot of investment in the network fabric enables low latency and high bandwidth communication between nodes. This leaves disk I/O as a huge bottleneck that is just getting worse.

Exascale

The current push in the supercomputing community is to exascale computing. This is the next leap in computational power, requiring huge amounts of power and improvements in both power efficiency and fault tolerance. As the systems become larger, and disk bandwidth is at a premium, novel approaches to alleviate these issues are being explored such as burst buffers, non-volatile RAM, and in-situ processing techniques that minimize what needs to be written to disk.

Open source way

Open source has been enormously important for the supercomputing community, and to a lesser extent clusters. Licensing models often do not scale well to these applications, and make it difficult to run large jobs. A larger advantage of open source, beyond licensing fees and models, is the ability to modify the source, optimize it for the supercomputer or cluster, recompile against custom MPI libraries, math libraries, etc., and share those results openly with other computing centers. This has enabled rapid progress, and unhindered sharing of both the major successes and some of the failures to optimize code for supercomputers.

As the community moves to exascale I think that this will become increasingly important. Libraries that were not necessarily developed with supercomputers in mind can be patched, updated, and in many cases these changes can be merged. Supercomputers share some of the problems the wider community is concerned with, such as compute nodes with relatively small amounts of memory running many copies of the code in a distributed fashion that make both memory footprint and startup time critical points that must be optimized. These optimizations can also benefit the embedded applications of the same libraries such as mobile phones, tablets, and system on chip (e.g. Raspberry Pi).

Supercomputers and clusters often require unusual build environments, porting to less common compilers, and working to get the most out of specialized hardware. I have had to compile pretty large software stacks in the past in order to get code working on scientific computational resources, but things are changing as containers, virtualization, and other solutions are being explored. Cloud resources are also supplementing, or in some cases replacing, computational needs in academia/research but often suffer from reduced communication speed and increased latency between nodes/instances.

Hopefully this whistle stop tour of my experiences in system administration and software deployment on these systems has provided useful insights.