Before designing or implementing an Edge Cloud, it’s important to understand the challenges of the edge and consider how you will address them.
There are three main categories of edge cloud challenges discussed below: Fault and Failure Handling, Infrastructure Automation and Security, and Resource Abstraction. Many of the specific challenges span categories, and underlying all of them is simply Massive Distributed Scale. Over the past several decades, the number of systems that an IT infrastructure team manages has grown by orders of magnitude, from tens to tens of thousands. With each jump in scale, the challenges grow. A manual process that takes a team member ten minutes on one system may now take weeks or months to complete across all of them. Response to failure, which previously involved paging an engineer for a manual fix, must now be self-healing. Teams that are used to specifying exactly where their workloads run cannot be expected to choose among tens of thousands of sites unless those resources are abstracted.
Fault and Failure Handling
One consideration for fault handling in the edge cloud is heterogeneous hardware. In previous system designs, teams might run many identical servers or servers of the same class. In an edge cloud, teams will manage systems ranging from rack-mount servers from major hardware vendors to stand-alone ARM-based systems the size of a home router. Each of these types, and the types in between, has different mechanisms for management, different performance characteristics, and different rates of failure. Software teams must plan for the higher failure rates of commodity hardware, and software must be designed to scale down to the smallest component of the edge cloud.
In addition to the heterogeneous commodity hardware, teams need to be wary of the environments that the edge cloud will run in. An edge cloud will not be running in a Tier 3+ data center with redundant power and network links. Edge cloud systems will more likely run in places with single points of failure in both power and network uplinks. Some of these will be locations that are remote and difficult to physically secure. The sites can range from racks in a data center to several servers at a radio site to a small box attached to a power pole. Due to the number of locations on the edge and the remoteness of many of the nodes, field work on these devices will be difficult and expensive. Therefore, remote lights-out management is a hard requirement. Additionally, teams may want to design the systems with redundancy in components where failures are expected (like storage), as long as it is within cost requirements.
The two challenges listed previously, heterogeneous commodity hardware and running edge cloud in remote locations with non-redundant power and networking, lead to a requirement for handling failures gracefully. This challenge falls partly on the owners of the applications that will use the edge cloud, but teams running the edge cloud also need to consider it. Edge cloud nodes will fail, and some that fail may be offline for days or weeks before they can be repaired or rebuilt. In a system with 50,000 locations, each containing 1-10 servers, some will always be offline. Even if there are no hardware failures on a given day, in a system that large at least some nodes will be in the middle of a rebuild or upgrade. This means that the edge cloud API needs to handle down sites without degrading service. The edge cloud API may also need to assist in moving workloads to other locations during an outage.
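To make the "some will always be offline" point concrete, here is a back-of-the-envelope sketch. The 50,000-site figure comes from the text above. The per-site unavailability of 0.1% and the assumption that sites fail independently are illustrative choices, not measured values.

```python
def expected_offline_sites(num_sites: int, p_offline: float) -> float:
    """Expected number of sites offline at any moment, assuming independent sites."""
    return num_sites * p_offline


def prob_all_sites_up(num_sites: int, p_offline: float) -> float:
    """Probability that every site is up at the same time, under the same assumption."""
    return (1.0 - p_offline) ** num_sites


NUM_SITES = 50_000   # from the text
P_OFFLINE = 0.001    # ASSUMPTION: each site is unavailable 0.1% of the time

expected = expected_offline_sites(NUM_SITES, P_OFFLINE)
all_up = prob_all_sites_up(NUM_SITES, P_OFFLINE)

print(f"expected offline sites: {expected:.0f}")
print(f"probability that all sites are up: {all_up:.2e}")
```

Under these assumed numbers, about 50 sites are offline at any given moment, and the chance of finding every site up at once is effectively zero. An API that treats a down site as an exceptional event would be in an exceptional state all the time.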
In a system with so many possible daily failures, teams will need to rethink how they respond to issues. Systems will need to self-heal whenever possible. When self-healing is not possible, simple automated procedures that can be run from a push-button system in a NOC need to gracefully take the nodes offline. In both cases, software can help assess the impact on applications and end customers and help decide the severity of the incident and the escalation path. The focus when monitoring these systems should be on the impact and not the issue. A failed hard drive alarm doesn’t provide much context about the impact, but if we know instead that one edge site is at 70% capacity and running at reduced redundancy, workloads can be migrated and maintenance planned.
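One way to picture impact-focused monitoring is a small classifier that looks at the state of a site rather than at the raw alarm. This is only a sketch: `SiteStatus`, `classify_impact`, and the severity names are hypothetical, and the 70% threshold is borrowed from the example above rather than from any standard.

```python
from dataclasses import dataclass


@dataclass
class SiteStatus:
    """Hypothetical snapshot of one edge site, as a monitoring system might report it."""
    name: str
    capacity_used: float      # fraction of usable capacity in use, 0.0 to 1.0
    redundancy_lost: bool     # True if a failure has removed redundancy
    serving_customers: bool   # True if the site can still serve traffic


def classify_impact(site: SiteStatus, capacity_threshold: float = 0.7) -> str:
    """Map a site's state to a severity based on impact, not on the underlying alarm."""
    if not site.serving_customers:
        return "critical"   # customers are affected right now
    if site.redundancy_lost and site.capacity_used >= capacity_threshold:
        return "major"      # little headroom left: migrate workloads
    if site.redundancy_lost:
        return "minor"      # plan maintenance, no immediate action
    return "ok"


# The failed-drive example from the text: 70% capacity, reduced redundancy.
site = SiteStatus("edge-0421", capacity_used=0.7, redundancy_lost=True, serving_customers=True)
print(site.name, classify_impact(site))
```

The same failed drive yields a different severity depending on how loaded the site is, which is the information an escalation path actually needs.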
Related to recovering from failure, but also applicable to new systems, is the need for edge cloud software to be “throw away” (also known as immutable). This means that it is built and deployed, and when changes are needed, the old installation is simply tossed out and a new install is done. This greatly simplifies the complex problem of in-place upgrades, which are difficult to test and whose corner cases are nearly impossible to account for. Additionally, it allows easy integration with a CI/CD system, which will build this software and then perform rolling updates across the edge cloud. Due to the scale of the edge, teams cannot do semi-annual, edge-wide 2 AM upgrades, so planning a consistent rolling upgrade strategy is critical, along with the CI process to test it.
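A rolling upgrade of immutable images can be sketched as a loop over waves of sites that halts when too many replacements fail. Everything named here is hypothetical: `replace_image` stands in for whatever mechanism discards the old install and deploys the new one, and the wave size and failure budget are arbitrary illustrative values.

```python
from typing import Callable, Dict, List, Sequence


def plan_waves(sites: Sequence[str], wave_size: int) -> List[List[str]]:
    """Split the site list into consecutive waves of at most wave_size sites."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    return [list(sites[i:i + wave_size]) for i in range(0, len(sites), wave_size)]


def rolling_upgrade(
    sites: Sequence[str],
    wave_size: int,
    replace_image: Callable[[str], bool],
    max_failures: int,
) -> Dict[str, List[str]]:
    """Replace the image wave by wave; stop once failures exceed the budget."""
    upgraded: List[str] = []
    failed: List[str] = []
    for wave in plan_waves(sites, wave_size):
        for site in wave:
            (upgraded if replace_image(site) else failed).append(site)
        if len(failed) > max_failures:
            break  # halt the rollout so a bad image does not reach every site
    touched = set(upgraded) | set(failed)
    skipped = [s for s in sites if s not in touched]
    return {"upgraded": upgraded, "failed": failed, "skipped": skipped}


# Usage with a stand-in replace_image that fails on two sites.
all_sites = [f"site-{i}" for i in range(10)]
bad_sites = {"site-2", "site-3"}
result = rolling_upgrade(all_sites, 4, lambda s: s not in bad_sites, max_failures=1)
print(result)
```

Checking the failure budget after each wave keeps the blast radius of a bad build to one wave, which is the property a CI/CD pipeline would rely on at this scale.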
Infrastructure Automation and Security
Although automated infrastructure is nothing new, teams need to be cognizant of automating the entire stack, not just the software and configuration. Upgrading the BIOS of a field unit, updating RAID controller firmware, or doing a full re-install of the operating system on bare metal are all likely to be needed. Failing to plan for these may leave teams facing the impossible task of field flashing systems at 30,000 locations.
Somewhat related to automation is security, which can be baked into the automation. By deploying well-tested immutable machine images and by preventing unknown devices from joining the edge cloud, good automation plays a role in edge cloud security. Security should be a major consideration any time you deploy a cloud system, and this is especially true in edge cloud for three reasons. First, as previously mentioned, edge nodes can run in difficult-to-secure environments. Imagine an edge cloud node sitting in a fiber node in someone’s backyard. If that node is processing PII, you will need to consider what happens if the box is opened. Second, edge cloud, even if only built for internal use, will be multi-tenant. Companies will have different teams with different classes of applications running on the edge. Some of these apps will have different levels of security and different connections into the company’s back-end systems. If one app is compromised, good multi-tenant design and security will prevent that application from reading data from others running in your cloud. Third, edge cloud by definition will be the first hop for customer traffic into your network. Therefore, the edge cloud will need to handle rogue customer devices performing DDoS attacks or other activity detrimental to the rest of the network.
Resource Abstraction
From an application owner’s point of view, an edge cloud is a great resource, but it can be difficult to use if not abstracted. Application owners should care about location, but only in a general sense. For example, an application owner does not want to specify which 300 radio sites should run their app. They’ll instead want to specify a general area and some latency and bandwidth constraints and let the system handle scheduling the workloads. This abstraction layer would also deal with sites that go down, spinning up workloads at alternate sites if available or notifying the application owners that they are running at reduced capacity.
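A minimal sketch of such an abstraction layer, assuming a hypothetical data model: the application owner submits a `PlacementRequest` with a region and constraints, and a `place` function picks the sites. The field names, units, and the lowest-latency-first ordering are illustrative assumptions, not a description of any particular scheduler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """Hypothetical view of one edge site as seen by the scheduler."""
    name: str
    region: str
    latency_ms: float        # latency from the site to the target users
    bandwidth_mbps: float    # available uplink bandwidth
    healthy: bool = True


@dataclass
class PlacementRequest:
    """What the application owner specifies: an area and constraints, not sites."""
    region: str
    max_latency_ms: float
    min_bandwidth_mbps: float
    replicas: int


def place(request: PlacementRequest, sites: List[Site]) -> List[str]:
    """Return up to request.replicas healthy sites that meet the constraints."""
    candidates = [
        s for s in sites
        if s.healthy
        and s.region == request.region
        and s.latency_ms <= request.max_latency_ms
        and s.bandwidth_mbps >= request.min_bandwidth_mbps
    ]
    candidates.sort(key=lambda s: s.latency_ms)  # prefer the lowest latency
    return [s.name for s in candidates[:request.replicas]]


sites = [
    Site("radio-001", "north", latency_ms=4.0, bandwidth_mbps=900.0),
    Site("radio-002", "north", latency_ms=9.0, bandwidth_mbps=400.0),
    Site("radio-003", "north", latency_ms=3.0, bandwidth_mbps=800.0, healthy=False),
    Site("radio-101", "south", latency_ms=2.0, bandwidth_mbps=950.0),
]
request = PlacementRequest("north", max_latency_ms=10.0, min_bandwidth_mbps=300.0, replicas=3)
chosen = place(request, sites)
if len(chosen) < request.replicas:
    print(f"reduced capacity: placed {len(chosen)} of {request.replicas} replicas on {chosen}")
```

Because the request names no sites, the same call can be rerun when a site goes down, and a short result is the signal to notify the application owner of reduced capacity.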
All of the challenges above have major implications for the workloads that you want to run on an edge cloud. Application owners need to understand that they are not running on five-nines hardware, and that to achieve that level of reliability they need to design their applications to scale and migrate seamlessly when failures occur. This can be difficult for applications that were migrated from other systems, as app owners would always prefer that the underlying system never fail. Edge cloud teams can assist with this effort by providing guidelines, examples, and hooks into monitoring systems so that application owners know when a failure has occurred. The model of announcing outages via email two weeks in advance and then performing them in the middle of the night will not scale to this size.
In summary, edge cloud has many challenges that teams need to carefully consider before deployment. The old ways of operating IT systems will not scale to the scope of an edge cloud, so everything must be automated and failures must be assumed. Both the team running the edge cloud and application owners must have software in place to respond to failures and handle workload migration and possible connectivity loss.
Summary of the challenges of edge cloud:
- Fault and Failure Handling – Given enough commodity hardware running at sites with single points of failure, errors will happen. Planning how the edge cloud and the apps running on the edge cloud respond is critical.
- Infrastructure Automation and Security – At the scale of edge cloud, everything needs to be automated and managed lights-out. It is critical not to leave security considerations out of the design and automation.
- Resource Abstraction – Latency and bandwidth requirements are what’s valuable to application owners, not which specific server hosts a workload. An abstraction layer isolates the hosted applications from caring about exactly where workloads are hosted.
Underlying each challenge is the Massive Distributed Scale that comes with running an Edge Cloud.
At IBB, our Cloud and Software Transformation group has real-world experience with the solutions and scale problems for Edge Cloud. Let us know how we can help.