The beauty of building extra-large Linux clusters is that it's easy. Hadoop, OpenStack, hypervisor, and high-performance computing (HPC) installers let you build on commodity hardware and handle node failure reasonably simply. Administering Linux on a small scale involves basic day-to-day tasks; however, once you plan and scale production clusters to several thousand nodes, the work can take over your life, including your weekends and holidays.
Specific requirements for encrypting people-related data in transit and at rest have been heavily discussed elsewhere, so I won't be covering them here. Rather, we'll focus on preparations to keep an audit off the backs of your Linux admin team.
1. Fundamentals: Connecting your cluster to the world
It's tempting to build a cluster on a standalone network with admin access on a second corporate LAN interface. Like Oracle databases in the past, Hadoop and HPC clusters tend to run every task in the cluster under a single user identification (UID) account (e.g., "hadoop").
An audit needs to prove not only how personal data is stored, but also how data is manipulated, aggregated, or anonymized, and that includes who can create, change, or log in to these application-specific accounts. That puts you and your admin team in the spotlight.
2. Don't let software installers create accounts or Linux groups
Use your favorite configuration manager or identity manager to create the needed accounts on each cluster node (or in your directory) first. If the hadoop account and group already exist, the cluster software installer will use those instead of creating its own. There are several reasons we want this behavior, as outlined in the next three steps.
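As a rough illustration, a pre-install check along these lines could confirm that the accounts exist before the installer runs. This is only a sketch: the function names (`parse_names`, `preflight`) and the sample file contents are assumptions made for the example, and a real check would read `/etc/passwd` and `/etc/group` or query your directory service.

```python
def parse_names(file_text):
    """Return the names in the first field of a passwd- or group-style file."""
    names = set()
    for line in file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split(":", 1)[0])
    return names


def preflight(passwd_text, group_text, required_users, required_groups):
    """List users and groups that still need to be created before installing."""
    users = parse_names(passwd_text)
    groups = parse_names(group_text)
    return {
        "missing_users": sorted(set(required_users) - users),
        "missing_groups": sorted(set(required_groups) - groups),
    }


# Illustrative file contents (hypothetical): the hadoop user exists,
# but the hadoop group has not been created yet.
PASSWD = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "hadoop:x:10011:10011::/home/hadoop:/bin/bash\n"
)
GROUP = "root:x:0:\n"

result = preflight(PASSWD, GROUP, ["hadoop"], ["hadoop"])
print(result)
```

If either list is non-empty, stop and fix the configuration manager run rather than letting the installer fill the gap.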
3. Maintain UID/GID consistency everywhere
For traceability later, ensure your organization has a consistent UID/group identification (GID) strategy—a way individuals and groups can be identified within the system. For your cluster's software, the unique application UIDs and GIDs need to fit into that matrix across the organization's infrastructure, not just in your cluster.
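One way to spot drift is to compare the passwd entries collected from each node. The sketch below assumes you have already gathered that text per node (how you collect it is up to your tooling); the node names and the helper `find_inconsistent` are hypothetical.

```python
from collections import defaultdict


def find_inconsistent(passwd_by_node):
    """Return accounts whose (UID, GID) pair differs between nodes.

    passwd_by_node maps a node name to the text of its passwd file.
    The result maps account name -> {(uid, gid): [nodes using that pair]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in passwd_by_node.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split(":")
            # Skip malformed entries or entries without numeric IDs.
            if len(fields) < 4 or not (fields[2].isdigit() and fields[3].isdigit()):
                continue
            seen[fields[0]][(int(fields[2]), int(fields[3]))].append(node)
    return {
        name: {ids: sorted(nodes) for ids, nodes in pairs.items()}
        for name, pairs in seen.items()
        if len(pairs) > 1
    }


# Illustrative data (hypothetical): node47 carries a different hadoop UID.
nodes = {
    "node46": "hadoop:x:10011:10011::/home/hadoop:/bin/bash",
    "node47": "hadoop:x:13011:10011::/home/hadoop:/bin/bash",
}

drift = find_inconsistent(nodes)
for name, pairs in drift.items():
    for (uid, gid), where in sorted(pairs.items()):
        print(f"{name}: uid={uid} gid={gid} on {', '.join(where)}")
```

The same comparison can be extended to group files or to hosts outside the cluster, since the strategy has to hold across the whole organization.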
4. Sudo, not Sudon't
If you are manually distributing sudoers files into your cluster or managing a site-specific scripting environment, it will be up to you and your team to prove exactly what state the sudoers file on cluster node 47 was in at a given moment several weeks ago. That is a headache we can all do without.
For self-protection, your team needs a strategy that keeps sudoers files centrally managed and under version control. This can be achieved by using a tool like Ansible during node OS setup or by versioning machine images for auto-deployment.
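To illustrate the idea, the sketch below compares each node's sudoers content against the version-controlled reference by hashing both. It assumes the file contents have already been collected; the sample rules, node names, and the helper `drifted_nodes` are hypothetical. The hash only shows whether a node matches the reference today; the history of what the reference was on a given date comes from your version control system.

```python
import hashlib


def digest(text):
    """Return the SHA-256 hex digest of a file's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drifted_nodes(reference_text, sudoers_by_node):
    """Return the nodes whose sudoers content differs from the reference."""
    expected = digest(reference_text)
    return sorted(
        node for node, text in sudoers_by_node.items() if digest(text) != expected
    )


# Illustrative content (hypothetical): node47 has an extra, unmanaged rule.
REFERENCE = "root ALL=(ALL:ALL) ALL\n%admins ALL=(ALL) ALL\n"
collected = {
    "node46": REFERENCE,
    "node47": REFERENCE + "david ALL=(ALL) NOPASSWD: ALL\n",
}

changed = drifted_nodes(REFERENCE, collected)
print(changed)
```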
5. Attach the cluster to your organization's SIEM
Clusters can generate a large wave of log files. For example, the Hortonworks distribution of Hadoop generates hundreds or thousands of "su hadoop" messages in a few minutes. Security information and event management (SIEM) platforms (open source or not) are a fantastic way to make sense of correlated events, and they speed up the identification and analysis of security events and the recovery from them. For example, a SIEM can tie together a chain of events like this one:
- David logged into the corporate network from home via a VPN using multifactor authentication (MFA)
- David SSHed into the production jumpstart server
- David SSHed into cluster node 47, then SUed to root
- David changed the UID of the Hadoop account from 10011 to 13011
- The cluster ran 138 SU jobs as the Hadoop account on node 47 until 18:00
The log viewers of an operating system, application, or cluster manager may show you only slices of this picture. Sending everything to a SIEM is safer and more complete, and, frankly, it makes creating reports another team's responsibility. Auditors actually prefer this hands-off model, where someone separate from your Linux admin team proves what happened.
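The event chain above can be sketched as a simple correlation: merge records from several sources and order one user's activity by time. This is a toy illustration of what a SIEM does at far larger scale, not a substitute for one; the record fields, source names, dates, and times are invented for the example.

```python
from datetime import datetime


def timeline(events, user):
    """Return one user's events from all sources in chronological order."""
    mine = [event for event in events if event["user"] == user]
    return sorted(mine, key=lambda event: datetime.fromisoformat(event["time"]))


# Illustrative records (hypothetical), deliberately out of order, as they
# might arrive from separate VPN, jump server, and cluster node logs.
events = [
    {"time": "2024-01-15T16:10:00", "source": "node47", "user": "david",
     "action": "changed UID of hadoop from 10011 to 13011"},
    {"time": "2024-01-15T16:00:00", "source": "vpn", "user": "david",
     "action": "logged in with MFA"},
    {"time": "2024-01-15T16:05:00", "source": "node47", "user": "david",
     "action": "SSH login, then su to root"},
    {"time": "2024-01-15T16:02:00", "source": "jump", "user": "david",
     "action": "SSH login"},
    {"time": "2024-01-15T16:03:00", "source": "vpn", "user": "maria",
     "action": "logged in with MFA"},
]

david = timeline(events, "david")
for event in david:
    print(f"{event['time']} {event['source']}: {event['action']}")
```

No single source in this list shows the whole story; only the merged view links the VPN login to the UID change on node 47.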
6. Get the right training and tools
A big sign that your team is overwhelmed is a team member spending more than four days per audit cycle helping auditors. When that happens, something is broken or not obvious. Your ideal is two days maximum (and one day if possible). Obtaining sufficient training is critical for productivity.
For example, if your cluster processes people data, get some operations-focused training on each compliance regime under which your cluster will operate. Make sure there is an exam: achieving certification for a specific version of the requirements is not just good for your resume; it makes an audit review of your team a quick checkbox.
Finally, know when to ask your boss to open the wallet for a commercial tool when pure open source won't pass audit. If you process people data, all the tools need maintenance contracts. There are good reasons why open source vendors have commercial distributions and why companies pay for maintenance. If you need to make the case, ask your boss whether she or he would like to go on vacation around audit time. I think we all know the answer.
7. Keep the auditors' needs in mind
Operational time-saving techniques are great for optimizing efficiency, but it's important to maintain records in case of an audit.
Specifically for audit trails, hand the reporting tools that run against your company's SIEM to another team and have them build the reports. Most open source and commercial SIEM systems have interactive reporting capabilities, and there are robust third-party reporting tool vendors, often specializing in specific market sectors. Providing meaningful daily, weekly, and monthly operational information from the SIEM to your own data and system holders is an excellent side effect. Again, it should not be your admin team doing the grunt work. Also, have your team incorporate configuration management products, such as Puppet or Ansible.
What tips do you have for avoiding auditors' wrath? Please share your ideas in the comments.