Improve Release Safety and Diagnostics Through Automated Canary Analysis for Spinnaker
Introduction
Spinnaker is a continuous delivery platform that is pioneering the ability to release software faster. It allows thousands of enterprises to achieve release velocity never seen before. The key to increasing velocity is the ability to determine with confidence that a new release can be promoted across the different testing stages and eventually to production through Canary, Red/Black (aka blue-green) or Rolling update release strategies.
Leading enterprises (like Netflix, which deploys more than 4000 updates a day) have proprietary decision engines that allow them to promote builds to production with confidence. However, most enterprises still depend on manual analysis and judgment to promote builds.
Manual judgment is error-prone because decisions are based on incomplete analysis, and time-consuming because the analysis is laborious. Bad builds in production introduce significant risks due to business disruptions and brand damage.
OpsMx Enterprise for Spinnaker is a real-time analytics platform for CI/CD pipelines that is designed to aid manual decisions when promoting builds across test stages and into production. The OpsMx solution helps reduce errors and diagnostics time through complete, consistent, real-time automated analysis for Spinnaker.
Practices for Promoting Builds To Production Today
Before we look at the challenges and risks, let us review some of the enterprise practices for promoting builds to production:
- Checking key service performance metrics (e.g., orders or users served by the new instances or latency and error rates being consistent with the baseline version)
- Ensuring no system SLA violation alerts occur with the new release of the service
- Additional checks with custom scripts created for each service.
- Releasing the new version during a slow time of day (nights or holidays) or to a less critical customer base, such as overseas countries, low-volume sites or backup sites.
The core philosophy of the above strategies is to reduce the impact of bad deployments: how soon can one find out that a new service update is bad, and how soon can one roll back without causing too much business disruption? However, Ops teams face significant challenges in reliably validating builds, as shown in Figure 1.
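The first practice in the list above, checking that latency and error rates of the new instances stay consistent with the baseline version, can be sketched roughly as follows. This is a minimal illustration only: the metric names, the sample values and the 10 percent tolerance are assumptions made for the example, not values from any particular tool.

```python
from statistics import mean

# Hypothetical tolerance: how far the new release may drift above the
# baseline before a metric is flagged (an assumption for this sketch).
TOLERANCE = 0.10  # 10 percent


def relative_change(baseline, candidate):
    """Relative change of the candidate mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return 0.0 if mean(candidate) == 0 else float("inf")
    return (mean(candidate) - base) / base


def regressed_metrics(baseline_metrics, candidate_metrics, tolerance=TOLERANCE):
    """Names of metrics where the candidate exceeds the baseline by more
    than the tolerance. Every metric here is treated as 'higher is worse'."""
    flagged = []
    for name, baseline in baseline_metrics.items():
        if relative_change(baseline, candidate_metrics[name]) > tolerance:
            flagged.append(name)
    return flagged


# Made-up samples collected from baseline and new instances.
baseline = {"latency_ms": [100, 110, 105], "error_rate": [0.010, 0.012, 0.011]}
candidate = {"latency_ms": [104, 108, 109], "error_rate": [0.020, 0.025, 0.022]}

print(regressed_metrics(baseline, candidate))  # ['error_rate']
```

A check like this is easy to write for two metrics; the difficulty discussed in the next section is doing it consistently across every metric a service emits.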
Risks with Incomplete Analysis With Current Manual Judgment Process
The manual judgment of new releases to be deployed into production introduces tremendous business risk. Manual analysis is inherently incomplete due to the complexity of the services, their interactions, the changes in this build compared to the previous build, and the sheer volume of metrics collected for any build during the various pipeline stages. The incomplete analysis can be viewed in 3 specific dimensions, as shown in Figure 2:
- Metrics: Manual analysis, as indicated earlier, can look at crucial system metrics, but a typical service emits 1000+ metrics for a build. It is humanly impossible to consistently detect deviations and trends across all of them and to understand the relevance of each metric to your business requirements.
- Application Complexity: Applications that are built from multiple components (in-house, open source or 3rd party software) are challenging to analyze manually. Also, each application's behavior is unique and continually evolving. It takes experts or application architects to understand the nuances of each service and of the application overall. In the case of open source or 3rd party services, it is challenging to find an in-house expert who understands the expected behavior of the application across versions.
- The Rate of Change: Manual analysis may be sufficient initially, but as the rate of change of applications and services increases, it tends not to keep up with the changes in application behavior. The analysis becomes less reliable over time, and eventually bad builds are likely to be promoted to production, causing disruptions. Enterprises are increasingly moving towards multiple updates in a single day, and even less dynamic organizations need weekly updates to their applications.
Such an incomplete analysis leads to the following issues with the release validations:
- Error-prone
- Time-consuming
- Expensive to create a custom analysis for every new service
- Difficult to debug the root cause of the issues that are found
The above issues can cause significant business loss. Recent examples include an improperly approved new version of software with bugs that resulted in the grounding of American Airlines flights for 6 hours, and the case of Starbucks losing millions in revenue. There is a need for a more reliable, data-driven approach that ensures consistency and fewer errors, improving safety and ease of diagnostics for Spinnaker while minimizing the business risk of new builds.
Improving Safety with Automated Canary Analysis
OpsMx Enterprise for Spinnaker is a CI/CD analytics platform that provides DevOps engineers with an intelligent, automated, real-time and actionable risk assessment, so they can make a reliable judgment about a new release before production deployment. To validate a new release, OpsMx compares it with the baseline or production release of the service. OpsMx leverages machine learning and Artificial Intelligence (AI) techniques to analyze thousands of metrics and perform in-depth analysis of architectural regressions, performance, scalability and security violations of new releases in a way that scales for enterprises. OpsMx integrates with Spinnaker through the existing Canary analysis service APIs. OpsMx addresses three prevalent use cases with Spinnaker:
- Automated Canary Analysis
Automated Canary Analysis is the most well-known use case for enterprises that are interested in canary deployment to reduce risk. If the Canary deployment analysis fails (Figure 3), the pipeline execution terminates so that the release can be diagnosed further.
- Red/Black Deployment Analysis
Red/Black is the most traditional deployment option in Spinnaker. In this case, OpsMx compares the new release with the production or baseline release. If the Red/Black deployment analysis fails (Figure 4), the release is rolled back either manually or in an automated fashion.
- Staging or Testing Deployments Analysis
In many cases, it is safest to avoid exposing a bad release in production even for a few hours, so performing the analysis in the staging environment is preferred. If the release passes, it is promoted to production via Red/Black deployment. If it fails in staging (Figure 5), the bad deployment is averted without any production traffic exposure. OpsMx provides detailed diagnostics for each of the analyzed stages to help diagnose the issues in the release, as shown in Figure 6.
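The outcomes described for the three use cases above can be summarized as a small decision table. The sketch below is illustrative only: the `Stage` and `next_action` names are hypothetical, and the success actions for Canary and Red/Black are assumptions, since the text only states what happens on success for staging.

```python
from enum import Enum


class Stage(Enum):
    CANARY = "canary"
    RED_BLACK = "red/black"
    STAGING = "staging"


# What the text above says happens when the analysis fails at each stage.
ON_FAILURE = {
    Stage.CANARY: "terminate the pipeline and diagnose the release",
    Stage.RED_BLACK: "roll back the release (manually or automatically)",
    Stage.STAGING: "stop before production; no production traffic exposed",
}

# On success, only the staging behavior is stated in the text; the other
# two entries are assumptions made for illustration.
ON_SUCCESS = {
    Stage.CANARY: "continue the rollout (assumed)",
    Stage.RED_BLACK: "keep the new release in production (assumed)",
    Stage.STAGING: "promote to production via Red/Black deployment",
}


def next_action(stage: Stage, analysis_passed: bool) -> str:
    """Look up the follow-up action for a stage and an analysis verdict."""
    table = ON_SUCCESS if analysis_passed else ON_FAILURE
    return table[stage]


print(next_action(Stage.CANARY, analysis_passed=False))
print(next_action(Stage.STAGING, analysis_passed=True))
```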
OpsMx Automated Canary Analysis Benefits
- Validate and approve builds with low risk to production: With the OpsMx build risk assessment report for a new version of the service, the Ops team has an accurate automated report on the safety and readiness of the build. If the safety score is above the pass threshold, the Ops team can promote the build for further deployment. OpsMx compares the current build to the characteristics of the production baseline, and the score reflects the risks of the new build. OpsMx can be configured to do real-time canary, Red/Black or staging/testing analysis and provide safety scores.
- Identify root cause of issues with the build: The OpsMx risk assessment report provides a detailed sub-score for the components of each build across various metric groups. If any significant deviations or issues are found between the current build and the baseline version, OpsMx automatically flags them and provides root cause analysis, including the offending code commit. OpsMx performs in-depth analysis, including the interactions between services and transactions, to narrow down the problematic service. OpsMx thus saves the Ops team time through fully automated issue and root cause identification.
- Automated, scalable and less error-prone: The OpsMx risk assessment report is fully automated and can analyze thousands of metrics for every build through integration with existing data monitoring and collection tools. OpsMx can analyze known services as well as unknown new services, using machine learning to learn the characteristics of a service and evaluate new builds of it. Because it is an automated tool, the OpsMx solution is scalable, consistent and less error-prone, giving Ops engineers a reliable method to assist their judgment of new builds.
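To make the idea of a safety score with per-group sub-scores concrete, here is a minimal sketch. It is not the OpsMx scoring method, which the text does not describe; the equal weighting, the 0 to 100 scale, the threshold of 80 and the metric group names are all assumptions chosen for illustration.

```python
# Hypothetical pass threshold on a 0-100 scale (an assumption for this sketch).
PASS_THRESHOLD = 80.0


def overall_score(group_scores, weights=None):
    """Weighted average of the per-group sub-scores (equal weights by default)."""
    weights = weights or {group: 1.0 for group in group_scores}
    total_weight = sum(weights[group] for group in group_scores)
    weighted_sum = sum(group_scores[group] * weights[group] for group in group_scores)
    return weighted_sum / total_weight


def assess(group_scores, threshold=PASS_THRESHOLD):
    """Return the overall score, the pass/fail verdict, and the groups that
    scored below the threshold, worst first, as a starting point for diagnosis."""
    score = overall_score(group_scores)
    weakest_first = sorted(group_scores, key=group_scores.get)
    return {
        "score": score,
        "passed": score >= threshold,
        "investigate_first": [g for g in weakest_first if group_scores[g] < threshold],
    }


# Made-up sub-scores for four hypothetical metric groups.
report = assess({"latency": 95, "errors": 60, "cpu": 90, "memory": 85})
print(report)
# {'score': 82.5, 'passed': True, 'investigate_first': ['errors']}
```

Note that in this example the build passes overall while one group still scores poorly, which is why sub-scores are useful alongside a single pass/fail number.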
Summary
OpsMx provides an effective data-driven solution that automates real-time judgment of new software releases for Ops teams using Spinnaker in an enterprise. The OpsMx solution integrates with Spinnaker for analysis during Canary, Red/Black or Staging/Test deployment stages. With the OpsMx solution, Ops teams can reliably validate and approve builds with low risk for deployment, scale to validate multiple deployments a day, reduce the time spent on analysis and debugging, and reduce human errors in release decisions. Overall, the OpsMx solution lowers the business risk of bad deployments and makes release judgment safer. For more information about the OpsMx solution for Spinnaker or a free trial, fill out the form below or email us at info@opsmx.com.