Monitoring your application stack is a critical part of any software deployment. If done properly, monitoring can be a powerful tool for finding and fixing issues. Good monitoring can make it easier to meet Service Level Agreements (SLAs) and lead to quicker issue resolution. There are many good monitoring solutions available for cloud-based applications, with options ranging from hosted SaaS products to self-hosted, open-source tools. Although picking the right technology is important, before selecting a monitoring solution it is critical to step back and figure out what you’re trying to achieve.
The first set of considerations for monitoring is to determine how the monitoring will fit into your organization.
Determining the audience for the application monitoring is critical. A good monitoring solution is not only useful for a DevOps team, but might be used by a Network Operations Center (NOC) or another operations team. The NOC or operations team may have less visibility into the application design and less software engineering expertise on staff, which means that some alerts may lack the context that the NOC needs to respond properly. Another example of a mixed audience is a cloud infrastructure team that exposes their infrastructure monitoring to application/DevOps teams in order to facilitate issue resolution for their applications. In some instances, the same monitoring tool can be used by all audiences, with different views and features exposed, and in other cases it makes sense to choose different tools for each audience.
Another organizational consideration is how the monitoring solution will interact with the on-call process. If there is an alert that causes a page-out, everyone in the page-out chain will need to be able to see, understand, and resolve the alert. Nobody wants to roll out of bed for an issue and then be unable to do anything about it.
Once the right audience has been determined, the next step is deciding what should be monitored.
The first, and not always obvious, thing to monitor is what the customer sees. If customers follow a specific workflow in the application, a scripted check that exercises that workflow can alert you to user-facing issues before your customers encounter them. However, this type of monitoring is not sufficient by itself, because it may not provide enough information to help the team resolve an issue quickly. For example, a customer-facing website that fails to load clearly indicates a problem, but that fact alone says little about which part of the web application stack is at fault.
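One way to monitor what the customer sees is a scripted check that walks through the workflow one step at a time and records where it breaks. The sketch below is illustrative only: the names `StepResult` and `run_workflow` are hypothetical, and each step is a plain callable so the example stays independent of any particular monitoring product.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str        # label of the workflow step, e.g. "login"
    ok: bool         # True if the step finished without raising
    seconds: float   # how long the step took
    detail: str = "" # error message when the step failed


def run_workflow(steps: List[Tuple[str, Callable[[], None]]]) -> List[StepResult]:
    """Run each step in order and stop at the first failure.

    Later steps usually depend on earlier ones (you cannot check out
    without logging in), so continuing after a failure adds only noise.
    """
    results: List[StepResult] = []
    for name, action in steps:
        started = time.monotonic()
        try:
            action()
            results.append(StepResult(name, True, time.monotonic() - started))
        except Exception as exc:  # any failure marks the step as broken
            results.append(
                StepResult(name, False, time.monotonic() - started, str(exc))
            )
            break
    return results
```

In a real check, each callable would issue the same requests a customer's browser would make (for instance with `urllib.request.urlopen`) and raise on an unexpected response. The per-step names and timings give the team a starting point, though they still need component-level monitoring to find the root cause.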
Another key area to monitor is application performance. This should be measured for the application as a whole, but monitoring individual components (such as database transactions) can also highlight bottlenecks under load. Setting up this type of monitoring should include establishing a performance baseline and a threshold for alerting. The values and metrics selected will need to adapt as the application changes, and may also need to change if the application spans public and private clouds.
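As a rough illustration of a baseline and an alert threshold, the sketch below derives a threshold from response times collected during normal operation and flags a sustained breach. The function names, the three-standard-deviation rule, and the requirement of three consecutive breaches are all assumptions chosen for the example, not recommendations for any particular application.

```python
import statistics
from typing import Iterable, Sequence


def alert_threshold(baseline_samples: Sequence[float], sigmas: float = 3.0) -> float:
    """Threshold = baseline mean plus `sigmas` sample standard deviations."""
    mean = statistics.mean(baseline_samples)
    spread = statistics.stdev(baseline_samples)  # needs at least two samples
    return mean + sigmas * spread


def sustained_breach(recent: Iterable[float], threshold: float,
                     min_consecutive: int = 3) -> bool:
    """True if `min_consecutive` readings in a row exceed the threshold.

    Requiring a run of breaches keeps a single slow request from paging anyone.
    """
    run = 0
    for value in recent:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Recomputing the baseline after each significant release is one way to keep the threshold in step with the application as it changes.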
Security should be an important consideration in what to monitor as well, especially if the monitoring system is shared with other parts of the organization. Monitoring scripts that run with administrative privileges, scripts that cause unnecessary load, and scripts that interfere with each other are some of the things that might cause issues in the application. The last thing the monitoring system should be doing is interfering with the application that it’s trying to monitor. Fortunately, many monitoring solutions offer role-based access, which can be used to restrict access while still allowing useful information to be visible.
Finally, once a monitoring solution has been chosen and implemented, there are several key processes which need to be followed in order to ensure that the monitoring is successful.
The most common and important goal for an organization is to have a “green” status on the monitoring dashboard at all times. Allowing constant alarms and warnings just leads to the monitoring system being ignored. If the same six (or sixty) warnings have been present for a month, then at a glance it’s difficult to notice that now there’s a seventh, and the seventh might be serious enough that you should care. In order to avoid this problem, all alarms need owners. Ownership can be either on a rotating basis (like an on-call rotation) or on an application-by-application basis. The owner is responsible for getting the alarm condition back to normal or, if the alarm is too sensitive, adjusting its threshold.
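The two ownership models can be combined in a few lines. In this sketch, which uses hypothetical names throughout, an application with a named owner keeps that owner, and every other alarm falls to whoever holds the weekly rotation. The weekly hand-off and the fixed rotation start date are assumptions made for the example.

```python
from datetime import date
from typing import Dict, List


def alarm_owner(app: str, app_owners: Dict[str, str], rotation: List[str],
                rotation_start: date, today: date) -> str:
    """Return who owns an alarm raised by `app` on `today`."""
    # Application-by-application ownership takes precedence.
    if app in app_owners:
        return app_owners[app]
    # Otherwise use the rotation: one slot per full week since the start date.
    weeks_elapsed = (today - rotation_start).days // 7
    return rotation[weeks_elapsed % len(rotation)]
```

Because the function always returns a name, no alarm can be raised without someone being responsible for clearing it.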
A second process to consider is the breadth of coverage. A monitoring system is only useful if it’s monitoring all instances of the application. If an application is scaled up or moved to a new host, then the monitoring system needs to be updated to include the new instance(s) of the application. Some monitoring systems will do this automatically, while others will require configuration changes to the monitoring platform. If your solution is in the latter camp, ensure that these configuration changes are part of the automation for deploying and scaling your application. Relying on warm bodies to remember anything is a bad idea.
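For platforms that need explicit configuration, the deployment automation can compare what is running with what is monitored and apply the difference. The sketch below shows only that comparison. `reconcile_targets` is a hypothetical helper, and how the two sets are obtained (and how changes are pushed to the monitoring platform) depends entirely on the tools in use.

```python
from typing import Iterable, List, Tuple


def reconcile_targets(deployed: Iterable[str],
                      monitored: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (instances to start monitoring, stale targets to remove)."""
    deployed_set, monitored_set = set(deployed), set(monitored)
    to_add = sorted(deployed_set - monitored_set)     # running but unmonitored
    to_remove = sorted(monitored_set - deployed_set)  # monitored but gone
    return to_add, to_remove
```

Running a step like this at the end of every deploy or scaling event means coverage does not depend on anyone remembering to update the monitoring configuration.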
A third process consideration is that the monitoring solution should be deployed to all application environments, including dev/test and staging. This allows the monitoring system to be tested along with the application and ensures that there are no problematic interactions. New alarms and modifications to thresholds also need to be tested before they are rolled out to production.
In addition to the process steps we’ve covered, there are many other implementation considerations to take into account when selecting and deploying a monitoring solution, including self-healing, availability, thresholding, and ensuring that alerts are actionable. Choosing a solution that accounts for all of these greatly improves the odds of success. The bottom line is that you need to plan for your monitoring solution as carefully as you plan for the application itself.
At IBB, our Cloud and Software Transformation group has real-world experience in scoping, implementing, and running monitoring systems for cloud infrastructure and cloud-based applications. We understand the importance of choosing the right solution and setting up processes that ensure success. Let us know how we can help you.