After helping to put the dot in .com by building and configuring enterprise-class solutions with WorldCom as a Sun hardware and software engineer, Jason Smith went on to AAAS (the American Association for the Advancement of Science, publisher of the journal Science) to direct the technical needs of the education directorate.
Jason has built or architected solutions ranging from enterprise class to small-business class and has found in Drupal a flexible, scalable, rapid development framework for targeting all levels of projects. A longtime beneficiary of the open source movement, Jason, now a senior software architect at The Weather Company, is an avid supporter of open source projects and believes strongly in giving back to the community that supported him.
I imagine that more active searches, such as New York during Hurricane Sandy, would generate a higher workload than a search for a quiet town in Idaho at the same time. What problems did this cause and how did you handle balancing the workload?
We built the site so that we can lean heavily on intermediate and edge caches. Caching changes the whole dynamic of traffic management in a subtle way. Because pages can be cached for substantially longer times, only a few requests reach origin no matter how many requests we get for the content.
Caching doesn't solve our problem completely; it shifts our concerns. No single resource is requested from origin more often than once per TTL (Time To Live), but if a million distinct resources are requested in that time period, each of those /will/ go to origin. You can increase TTLs, but at some point you'll hit a ceiling imposed by your content and marketing teams.
We have millions of locations to support forecast pages for, but the pages differ only in the forecast data unique to each location. So part of the strategy was increasing the cacheability of our pages (raising TTLs), and another part was leveraging client-side resources, rather than calls to origin, to build pages unique to each user and location.
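The way caching shifts the problem from request volume to the number of distinct resources can be sketched with a small model. This is only an illustration under stated assumptions: a single cache layer with a fixed TTL, and hypothetical function names and request logs that are not taken from The Weather Company's systems.

```python
def worst_case_origin_rate(distinct_resources, ttl_seconds):
    """Upper bound on origin requests per second when every distinct
    resource is requested at least once in each TTL window."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return distinct_resources / ttl_seconds


def origin_requests(request_log, ttl_seconds):
    """Count the requests that miss a simple TTL cache and reach origin.

    request_log is an iterable of (timestamp_seconds, resource_key) pairs
    in non-decreasing timestamp order. Hypothetical model: one cache, no
    eviction other than TTL expiry.
    """
    expires_at = {}  # resource_key -> time at which the cached copy expires
    misses = 0
    for timestamp, key in request_log:
        if expires_at.get(key, float("-inf")) <= timestamp:
            # Not cached, or the cached copy has expired: go to origin.
            misses += 1
            expires_at[key] = timestamp + ttl_seconds
    return misses


# One hot page requested 1,000 times within a single TTL window.
hot_page = [(i * 0.05, "/forecast/new-york") for i in range(1000)]

# 1,000 different pages requested once each in the same window.
long_tail = [(i * 0.05, "/forecast/location-%d" % i) for i in range(1000)]
```

In this model the hot page costs origin a single request per TTL window however popular it is, while the long tail costs one request per distinct page. That is why reducing the number of distinct cacheable resources, for example by assembling location-specific pages on the client, helps alongside raising TTLs.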
How did you leverage different caching layers to make Drupal scale?
One challenge with high TTLs is the times when you need to get content out quickly. Flushing Akamai caches per URL can take up to an hour under certain circumstances, and this will make an editorial team break out in hives. Clearly high TTLs could not be the entirety of the solution.
Varnish caches can be cleared much more quickly, as they are (generally) much closer to you, there are fewer of them, and you have more control over them. What you gain in flexibility in managing cache lifetimes, you lose in cache distribution. So we looked for a way to win on both counts.
The trick is to use multiple caching layers, in our case both Akamai and Varnish. In this setup, we can set our Akamai cache TTLs relatively low (roughly 1 to 5 minutes) and our Varnish TTLs much higher. Since Varnish cache clears are simple and within our control, we get all the benefits of the distributed CDN plus the ability to manage cache staleness much closer to origin.
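The layered setup described above can be sketched as a toy simulation. It is a minimal model, not a description of how Akamai or Varnish work internally: the class names, the 120-second edge TTL (chosen from within the 1 to 5 minute range mentioned), and the 3600-second inner TTL are all assumptions made for illustration.

```python
class Origin:
    """Stand-in for the origin application; counts how often it is hit."""

    def __init__(self, content):
        self.content = dict(content)
        self.hits = 0

    def fetch(self, key, now):
        self.hits += 1
        return self.content[key]


class TtlCache:
    """A cache layer that keeps entries for `ttl` seconds and asks the
    next layer (another cache or the origin) on a miss."""

    def __init__(self, ttl, upstream):
        self.ttl = ttl
        self.upstream = upstream  # callable taking (key, now)
        self.store = {}           # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        value = self.upstream(key, now)
        self.store[key] = (value, now + self.ttl)
        return value

    def purge(self, key):
        """Drop one entry immediately, as an editor-triggered clear would."""
        self.store.pop(key, None)


def simulate():
    origin = Origin({"/story": "v1"})
    inner = TtlCache(ttl=3600, upstream=origin.fetch)  # long TTL, easy to purge
    edge = TtlCache(ttl=120, upstream=inner.get)       # short TTL, hard to purge

    seen = []
    seen.append(edge.get("/story", now=0))    # cold caches: request reaches origin
    seen.append(edge.get("/story", now=200))  # edge expired, inner layer answers
    origin.content["/story"] = "v2"           # editor publishes an update
    inner.purge("/story")                     # only the inner layer is purged
    seen.append(edge.get("/story", now=250))  # edge still holds v1 until t=320
    seen.append(edge.get("/story", now=330))  # edge expired, inner miss, origin hit
    return seen, origin.hits
```

Calling `simulate()` returns `(["v1", "v1", "v1", "v2"], 2)`: origin is hit only twice, and after the inner purge the stale copy survives at the edge for at most one edge TTL. In this model, that bound is what lets the edge TTL stay short while the inner TTL stays long.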
How do you manage your content generation and workflow? (e.g. content from author to publishing in production)
The Weather Channel editorial team was adamant that there be as few hurdles to content development as possible. To this end, there are only two "workflow" states: published and unpublished. Editorial planning and workflow are managed as a task distinct from content entry and publishing. Previewing content changes and staging content in the dev environments is a perennial challenge, but it is not a blocker to getting things done.
Were there any unexpected hurdles encountered once development was under way and how did you overcome them?
There are always unexpected hurdles, but we planned for a certain degree of them. In our case the biggest hurdles were related to media management. Initially we had planned to treat it as a problem solved by a different platform. It became more and more difficult to balance timeline and the level of integration desired, so the base Drupal platform assumed a great deal of the media management responsibility.
Media management is a huge effort when you consider the needs of an enterprise organization: you must manage sharing of content, de-duplicating, expiry, translations, short/long term storage, CDN, and DRM (and enforcement), among many other challenging and prickly problems.
Continuous integration: How do you push out new features?
We've made some rather impressive jumps toward continuous integration, but are hitting the same hurdles many do with the needs of multiple active and parallel development teams in Drupal. Due to the enormous number of modules, pages, and permutations of behaviors, our testing/regression suite is large and unwieldy. We also face the challenge that individual plugins/modules are not as independent as they would need to be for us to deploy (or roll back) in isolation, nor siloed enough to avoid requiring a full regression test. My focus is currently elsewhere on the project, but the QA (quality assurance) team is very capable and is exploring a number of options to close the gap.
This article is part of the Speaker Interview Series for DrupalCon 2015. DrupalCon 2015 brings together thousands of people from across the globe who use, develop, design, and support the Drupal platform. It takes place in Los Angeles, California on May 11 - 15, 2015.