Moving Faster, not Slower, to Minimize Outages and Deliver Innovation

All service providers agree that minimizing customer-impacting issues and outages is a top priority. However, with the advent of cloud computing and a growing number of cloud-native service providers like Amazon and Netflix, there is disagreement on how the rate of change for applications and their environment affects customer-impacting outages.

Traditional service providers have long held that in order to minimize customer issues and outages, they should minimize the rate of change to the applications and their environment. We see this in the use of “maintenance windows” that restrict changes to just a few hours each day, typically at a time when customers are not expected to be using the service, and in “blackouts” that forbid any changes during expected peak use of the service. We also see this in the organizational separation between application development teams and application operations teams, where the dev teams are incentivized to make frequent changes to the application in order to deliver new features, but the ops teams are incentivized to minimize changes in order to keep the application up and running. And since only the ops teams can make production changes, they often win the rate-of-change tug-of-war.

However, by treating change as a “necessary evil” to be minimized or avoided, operators continue to increase the technical debt of their production systems. Technical debt accumulates for each application based on missing updates for features, security, and runtime environments. When change is finally mandated to roll out a new feature or security fix, the gap between current state and future state is so large that the update almost never happens without a hitch. So by minimizing the rate of change, we increase technical debt, which makes necessary changes very risky when they are finally performed.

Instead, service providers should view high rates of change as the solution, not the problem. Consider Amazon, which in 2014 made over 50 million changes to its production applications, an average of more than one per second. Rather than creating problems, this extreme rate of change bestows two large benefits. First, lots of changes means smaller changes, typically just a single fix, feature addition, or security patch. This makes it far easier to test the change and its impact before rolling it out to production, and to roll it back quickly if a problem does occur. Second, in order to make this many changes that rapidly, service providers must master serious automation to manage the entire deployment process, and automation is the key to successful change management.
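As a quick sanity check on the "more than one per second" figure, the arithmetic can be worked through as below. The only input taken from the text is the 50 million changes per year; the non-leap-year length is an assumption made for illustration.

```python
# Rough check of the "more than one change per second" figure.
changes_per_year = 50_000_000          # figure quoted in the text for 2014
seconds_per_year = 365 * 24 * 60 * 60  # 31,536,000 seconds (non-leap year, assumed)

changes_per_second = changes_per_year / seconds_per_year
print(f"{changes_per_second:.2f} changes per second")  # prints "1.59 changes per second"
```

Averaged over the whole year, that is roughly one and a half production changes every second, around the clock.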

There are many components that are “requirements” for building a production environment that can undergo near-constant change, but here are a few of the more important ones:

  • Automation for deployment, completely and thoroughly. No manual changes to production, no exceptions.
  • Automation for testing, both in pre-production and production environments. Test before (pre-conditions) and after (post-conditions) each deployment.
  • A pre-production environment that matches your production environment, exactly. This is most easily achieved in a cloud platform where you can build and support multiple environments at the same time, on the same gear.
  • Cloud-based infrastructure supporting programmatic, on-demand capacity (can be private or public).
  • A very close relationship between the dev and ops teams to minimize the conflict over rate of change. Ideally there is one team (i.e., DevOps), but if there are two teams, then it is essential that their incentives are aligned.
  • Applications should be immutable, such that there are really no “updates”: just deploy the new version and throw away the old. A Service Oriented Architecture (SOA) approach makes this most feasible.
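
To make a few of these requirements concrete, here is a minimal sketch of an automated deployment step that tests pre-conditions, swaps in a new version rather than patching the old one, tests post-conditions, and rolls back on failure. Everything in it (the Environment class, the deploy function, the check callables) is hypothetical and invented for illustration; it is not any particular tool's API, and a real pipeline would replace running instances rather than a version label.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A check takes a version label and returns True if the condition holds.
Check = Callable[[str], bool]


@dataclass
class Environment:
    """Hypothetical stand-in for a production environment."""
    live_version: Optional[str] = None
    history: List[str] = field(default_factory=list)


def deploy(env: Environment, new_version: str,
           pre_checks: List[Check], post_checks: List[Check]) -> bool:
    """Deploy new_version; return True on success, False if skipped or rolled back."""
    # Pre-conditions: do not touch production unless every check passes.
    if not all(check(new_version) for check in pre_checks):
        return False

    # Immutable style: the new version replaces the old one wholesale;
    # nothing is updated in place.
    previous = env.live_version
    env.live_version = new_version
    env.history.append(new_version)

    # Post-conditions: if any check fails, restore the previous version.
    if not all(check(new_version) for check in post_checks):
        env.live_version = previous
        env.history.append(f"rollback:{new_version}")
        return False
    return True


if __name__ == "__main__":
    env = Environment(live_version="v1")
    ok = deploy(env, "v2", pre_checks=[lambda v: True], post_checks=[lambda v: True])
    print(ok, env.live_version)
```

Because each change is small and the previous version is kept until the post-conditions pass, rolling back is a single assignment rather than a repair job.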

All of the “cloud-native” service providers already operate like this, moving at light speed, so if your enterprise does not, then you’re losing the race on customer-facing innovation. Work is needed not just on a technology uplift for things like cloud and automation, but critically on cultural acceptance by executives and the broader organization. Of the two, the latter is much more difficult.

At IBB, our Cloud and Software Transformation group has real-world experience in creating and running cloud platforms, continuous delivery automation, and DevOps teams and processes, and in preparing for the cultural and organizational challenges that these changes bring to your enterprise. Let us know how we can help you.

