How containers and DevOps transformed Duke University's IT department

When Duke started looking into taming its VM sprawl, it became obvious that not just its infrastructure but its entire culture would need to change.

It's difficult, even in retrospect, to know which came first for us: containers or a shift towards DevOps culture.

At Duke University's Office of Information Technology (OIT), we began looking at containers as a way to achieve higher density from the virtualized infrastructure used to host websites. Virtual machine (VM) sprawl had started to become a problem. We favored separating each client's website onto its own VM for both segregation and organization, but steady growth meant we were managing more servers than we could handle. As we looked for ways to lower management overhead and make better use of resources, Docker hit the news, and we began to experiment with containerization for our web applications.

For us, the initial investigation of containers mirrored a shift toward a DevOps culture.

Where we started

When we first looked into container technology, OIT was highly process driven and composed of monolithic applications and a monolithic organizational structure. Some early forays into automation were beginning to lead the shift toward a new cultural organization inside the department, but even so, the vast majority of our infrastructure consisted of "pet" servers (to use the pets vs. cattle analogy). Developers created their applications on staging servers designed to match production hosting environments and deployed by migrating code from the former to the latter. Operations still approached hosting as it always had: creating dedicated VMs for individual services and filing manual tickets for monitoring and backups. A service's lifecycle was marked by change requests, review boards, standard maintenance windows, and lots of personal attention.

A shift in culture

As we began to embrace containers, some of these longstanding attitudes toward development and hosting began to shift. Two of the larger container success stories came from our investigation into cloud infrastructure. The first project was created to host hundreds of RStudio containers for student classes on Microsoft Azure hosts, breaking from our existing model of individually managed servers and moving toward "cattle"-style infrastructure designed for hosting containerized applications.

The other was a rapid containerization and deployment of the Duke website to Amazon Web Services while in the midst of a denial-of-service attack, dynamically creating infrastructure and rapidly deploying services.

The success of these two wildly nonstandard projects helped to legitimize containers within the department, and we put more time and effort into exploring their benefits and those of on-demand, disposable cloud infrastructure, both on-premises and through public cloud providers.

It became apparent early on that containers lived within a different timescale from traditional infrastructure. We started to notice cases where short-lived, single-purpose services were created, deployed, lived their entire lifecycle, and were decommissioned before we completed the tickets created to enter them into inventory, monitoring, or backups. Our policies and procedures were not able to keep up with the timescales that accompanied container development and deployment.

In addition, humans couldn't keep up with the automation that went into creating and managing the containers on our hosts. In response, we began to automate processes that had usually been gated by humans. For example, the dynamic migration of containers from one host to another required a change in our approach to monitoring. It is no longer enough to tie host and service monitoring together or to submit a ticket manually, because containers are automatically destroyed and recreated on other hosts in response to events.
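The monitoring change described above can be illustrated with a minimal Python sketch. It assumes a stream of container lifecycle events and derives the current monitoring targets from them, rather than from a fixed host-to-service mapping. The event names ("start", "destroy"), the container and host names, and the MonitoringRegistry class are all hypothetical; this is not OIT's actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringRegistry:
    """Tracks where each container currently runs, keyed by container name."""

    locations: dict = field(default_factory=dict)

    def handle_event(self, action: str, container: str, host: str) -> None:
        # "start" and "destroy" are hypothetical event names for this sketch.
        if action == "start":
            self.locations[container] = host
        elif action == "destroy":
            # Only forget the container if it was last seen on this host, so a
            # late "destroy" from the old host cannot erase the new placement.
            if self.locations.get(container) == host:
                del self.locations[container]

    def targets(self) -> list:
        """Return (container, host) pairs that monitoring should check now."""
        return sorted(self.locations.items())


registry = MonitoringRegistry()
events = [
    ("start", "svc-auth", "host-a"),
    ("start", "svc-web", "host-a"),
    ("start", "svc-auth", "host-b"),    # recreated on another host
    ("destroy", "svc-auth", "host-a"),  # old copy torn down afterwards
]
for action, container, host in events:
    registry.handle_event(action, container, host)

print(registry.targets())
```

Because the targets are recomputed from events, no ticket has to be filed when a container moves; the monitoring system only needs to read the registry.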

Some of this was already in the works for us; automation and container adoption seem to parallel one another. At some point, they become inextricably intertwined.

As containers continued to grow in popularity and OIT began to develop tools for container orchestration, we tried to further reinforce the "cattle not pets" approach to infrastructure. We restricted login on the hosts to operations staff (breaking with tradition) and gave all hosts destined for container hosting a generic name. Similar to being coached to avoid naming a stray animal in an effort to prevent attachment, servers with generic names became literally forgettable. Management of the infrastructure itself became the responsibility of automation, not humans, and humans focused their efforts on the services inside the containers.

Containers also helped to usher continuous integration into our everyday workflows. OIT's Identity Management team members were early adopters and began to build Kerberos key distribution centers (KDCs) inside containers using Jenkins, building regularly to incorporate patches and test the resulting images. This allowed the team to catch breaking builds before they were pushed out onto production servers. Prior to that, the complexity of the environment and the widespread impact of an outage made patching the systems a difficult task.

Embracing continuous deployment

Since that initial use case, we've also embraced continuous deployment. Projects that join our continuous integration/continuous deployment (CI/CD) system tend to follow a consistent pattern. Many teams initially have a lot of hesitation about automatically deploying when tests pass, and they tend to build checkpoints requiring human intervention. However, as they become more comfortable with the system and learn how to write good tests, they almost always remove these checkpoints.
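The checkpoint pattern can be expressed as a small decision function. The following Python sketch is illustrative only: the Pipeline class, its manual_checkpoint flag, and the project name are assumptions made for the example, not part of any real Jenkins configuration.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    manual_checkpoint: bool  # True while the team still wants a human sign-off


def should_deploy(pipeline: Pipeline, tests_passed: bool, approved: bool = False) -> bool:
    """Decide whether a freshly built image may be deployed."""
    if not tests_passed:
        return False  # failing tests always block deployment
    if pipeline.manual_checkpoint:
        return approved  # wait for a person to approve
    return True  # tests are the only gate


cautious = Pipeline("new-project", manual_checkpoint=True)
confident = Pipeline("new-project", manual_checkpoint=False)

print(should_deploy(cautious, tests_passed=True))                  # False
print(should_deploy(cautious, tests_passed=True, approved=True))   # True
print(should_deploy(confident, tests_passed=True))                 # True
print(should_deploy(confident, tests_passed=False))                # False
```

Removing a checkpoint amounts to flipping one flag, which only feels safe once the tests themselves are trusted.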

Within our container orchestration automation, we use Jenkins to patch base images on a regular basis and rebuild all the child images when the parent changes. We made the decision early that the images could be rebuilt and redeployed at any time by automated processes. This meant that any code included in the branch of the git repository used in the build job would be included in the image and potentially deployed without any humans involved. While some developers initially were uncomfortable with this, it ultimately led to better development practices: Developers merge into the production branch only code that is truly ready to be deployed.
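Rebuilding every child image when a parent changes is essentially a walk over the image dependency tree. The Python sketch below shows one way to compute a rebuild order, assuming each image records a single parent. The image names and the rebuild_order function are hypothetical, and the real build jobs run in Jenkins rather than in a script like this.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional


def rebuild_order(parent_of: Dict[str, Optional[str]], changed: str) -> List[str]:
    """Return the changed image followed by all descendants, parents first."""
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for image, parent in parent_of.items():
        if parent is not None:
            children[parent].append(image)

    # A breadth-first walk guarantees a parent is rebuilt before its children.
    order = []
    queue = deque([changed])
    while queue:
        image = queue.popleft()
        order.append(image)
        queue.extend(sorted(children[image]))
    return order


# Hypothetical image hierarchy: each image maps to the image it is built FROM.
parent_of = {
    "base": None,
    "php": "base",
    "python": "base",
    "drupal": "php",
    "wordpress": "php",
    "flask-app": "python",
}

print(rebuild_order(parent_of, "base"))
# ['base', 'php', 'python', 'drupal', 'wordpress', 'flask-app']
print(rebuild_order(parent_of, "php"))
# ['php', 'drupal', 'wordpress']
```

Patching the base image triggers a rebuild of everything beneath it, while a change to an intermediate image only touches its own subtree.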

This practice lets us rebuild container images immediately when code is merged into the production branch and automatically deploy the new image once it's built. At this point, almost every project using the automatic rebuild has also enabled automated deployment.

Looking ahead

Today the adoption of both containers and DevOps is still a work in progress for OIT.

Internally we still have to fight the entropy of history even as we adopt new tools and culture. Our biggest challenge will be convincing people to break away from the repetitive break-fix mentality that currently dominates their jobs and to focus more on automation. While time is always short, and the first step always daunting, in the long run adopting automation for day-to-day tasks will free them to work on more interesting and complex projects.

Thankfully, people within the organization are starting to embrace working in organized or ad hoc groups of cross-discipline members and developing automation together. This will definitely become necessary as we embrace automated orchestration and complex systems. A group of talented individuals who possess complementary skills will be required to fully manage the new environments.