How Customers Improve CI/CD Velocity Using Autopilot
This is the third part of a three-part series. Part 1 discusses the need for an intelligence layer in the software delivery process. Part 2 provides a brief overview of OpsMx Autopilot, the layer of intelligence for your CI/CD process.
In this blog, we review two customer use cases to show how Autopilot saves time on risk assessment and approvals and helps developers and Ops teams diagnose issues in software delivery.
Case 1: Reduce the Risk of Errors in Production
Challenges:
A destination web property with more than 200M average monthly users needed to roll out new features to remain competitive. The primary bottleneck in software delivery was the slow manual approval process required to promote changes from dev to test and from test to production.
- The large and complex environment made risk assessment difficult: Their website consists of many monoliths as well as Kubernetes-based microservices, and they deploy roughly 1000 changes every month. Each change was reviewed manually to reduce errors in production, and this process had become too slow as the number and complexity of changes grew. Further, it was difficult for less-experienced engineers to understand the service dependencies, test results, performance metrics, and logs. As a result, only the most senior engineers could effectively promote a change, which delayed their most important projects.
- Cost of assessing risk was more than $1M per year: The risk assessment process required the equivalent of six full-time expert engineers to gather and analyze vast amounts of data from various tools. The company estimated the direct cost of this expert time at more than $1M per year.
Because the process was difficult to master and perform consistently, too many errors reached production, directly affecting revenue and customer satisfaction.
Solution:
The customer implemented OpsMx Autopilot and integrated it into their CI/CD toolchain in less than a day. Autopilot gathers logs from Elasticsearch, metrics from Datadog and Prometheus, test results from JMeter, and information from other software delivery tools. Using machine learning algorithms, Autopilot provides a confidence score at each decision stage of the delivery pipeline.
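The post does not describe Autopilot's machine-learning model. As a deliberately simple stand-in, the Python sketch below shows how per-source health scores for logs, metrics, and test results might be folded into a single confidence score using a weighted average. The `SourceScore` type, the 0-100 scale, and the weights are hypothetical assumptions for illustration, not Autopilot's API or algorithm.

```python
from dataclasses import dataclass


@dataclass
class SourceScore:
    """Health score for one data source (hypothetical structure, not Autopilot's API)."""
    name: str      # e.g. "logs", "metrics", "tests"
    score: float   # 0-100; higher means the change looks healthier
    weight: float  # relative importance (hypothetical values)


def overall_confidence(sources: list[SourceScore]) -> float:
    """Weighted average of per-source scores: a stand-in, not Autopilot's ML model."""
    total_weight = sum(s.weight for s in sources)
    if total_weight <= 0:
        raise ValueError("at least one source needs a positive weight")
    return sum(s.score * s.weight for s in sources) / total_weight


signals = [
    SourceScore("logs (Elasticsearch)", 90.0, 1.0),
    SourceScore("metrics (Datadog, Prometheus)", 70.0, 2.0),
    SourceScore("tests (JMeter)", 100.0, 1.0),
]
print(overall_confidence(signals))  # 82.5
```

Weighting metrics twice as heavily as the other sources is only an example; in practice, how each source is scored and weighted is exactly what a learned model would determine.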
When the confidence score is high enough, the pipeline automatically promotes the deployment to the next stage, including automatic promotion into production. When the score is low, the deployment is automatically rejected and returned to the development team for correction. When the decision is unclear, the deployment is flagged for the DevOps team to resolve. Autopilot highlights potential issues for the DevOps engineers, enables collaboration between teams, and learns from the DevOps team's promotion decision, which improves its ability to make decisions in the future.
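The three-way promotion decision can be sketched as a simple threshold gate. This is a minimal illustration assuming a 0-100 confidence scale; the `gate` function and the 80/50 thresholds are hypothetical, not Autopilot's configuration, and the learning from DevOps decisions is not modeled.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"       # high confidence: move to the next stage automatically
    REJECT = "reject"         # low confidence: return to the development team
    MANUAL_REVIEW = "review"  # unclear: flag for the DevOps team


# Hypothetical thresholds on a 0-100 scale, not Autopilot's actual settings.
PROMOTE_AT = 80.0
REJECT_BELOW = 50.0


def gate(score: float, promote_at: float = PROMOTE_AT,
         reject_below: float = REJECT_BELOW) -> Decision:
    """Map a confidence score to a promotion decision."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= promote_at:
        return Decision.PROMOTE
    if score < reject_below:
        return Decision.REJECT
    return Decision.MANUAL_REVIEW
```

For example, `gate(92.5)` returns `Decision.PROMOTE`, `gate(65)` returns `Decision.MANUAL_REVIEW`, and `gate(30)` returns `Decision.REJECT`. The band between the two thresholds is what routes unclear cases to people instead of forcing an automatic decision.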
Results:
Accuracy in the promotion decision was a key success criterion for the company. Autopilot catches deployment errors more consistently than the expert engineers did, because it analyzes the full volume of data for every deployment, which was more than engineers could review each time. This has significantly improved the performance and stability of the company’s website.
Additionally, Autopilot has reduced the number of deployments that are mistakenly rejected, so more deployments move to production without any human review.
Because Autopilot automated many deployment decisions, the company was able to free the expert engineers previously tied up in cumbersome risk assessment. This saved more than $1M per year in direct costs of the verification process.
“We automatically verify more than 1000 releases each month.”
Further, by automating verification, Autopilot has freed experts to spend more time on their core innovative work. This improves velocity as well as job satisfaction, since key engineers can concentrate on innovation rather than deployment validation.
Case 2: Reduce Build Troubleshooting Time by 75%
Challenges:
A leading network and cybersecurity solution provider frequently releases software changes to its existing customer base. It uses Jenkins to execute more than one thousand builds per day.
As in any organization, build errors were inevitable, but they took too much time to identify and correct. The result was poor software delivery velocity that the company needed to improve.
- Manual verification was time-consuming: It took multiple developers roughly an hour or more to manually investigate each build error: gathering build logs from Jenkins, diagnosing the problem, finding the root cause, applying the fix, and re-executing the build.
- High reliance on experts: The company relied heavily on senior developers with tribal knowledge to sift through a torrent of build logs and troubleshoot suspected errors.
Both issues reduced the productivity of key members of the development team and slowed software delivery.
Solution:
The company made OpsMx Autopilot part of their software delivery pipeline to automatically determine whether there are any build errors and identify the appropriate resolution. Autopilot gathers build logs from Jenkins and uses AI/ML algorithms to separate genuine errors from the sea of log data.
Some build errors are caused by transient infrastructure issues. Identifying these errors and automatically re-executing the build process typically resolves these issues, speeding the promotion process and reducing time spent reviewing builds.
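The retry idea can be sketched as follows. This assumes a build can be modeled as a callable returning a success flag and its log text; the regular expressions in `TRANSIENT_PATTERNS` are hypothetical examples standing in for Autopilot's AI/ML classification, which the post does not detail. Only failures that look transient are re-run, and retries are capped so a persistent infrastructure outage still surfaces as a failure.

```python
import re
from typing import Callable

# Hypothetical patterns for transient infrastructure failures; a real system
# would learn these from historical build data instead of hard-coding them.
TRANSIENT_PATTERNS = [
    re.compile(r"connection (timed out|reset)", re.IGNORECASE),
    re.compile(r"could not resolve host", re.IGNORECASE),
    re.compile(r"503 service unavailable", re.IGNORECASE),
]


def is_transient(log_text: str) -> bool:
    """True if the failure log matches any known transient pattern."""
    return any(p.search(log_text) for p in TRANSIENT_PATTERNS)


def run_with_retry(run_build: Callable[[], tuple[bool, str]],
                   max_retries: int = 2) -> tuple[bool, int]:
    """Run a build, re-running it only while failures look transient.

    run_build returns (succeeded, log_text). Returns (succeeded, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        ok, log_text = run_build()
        if ok:
            return True, attempts
        # Code defects are not retried; transient failures are retried
        # at most max_retries times after the first attempt.
        if not is_transient(log_text) or attempts > max_retries:
            return False, attempts


# Simulated build: the first attempt hits a network timeout, the second succeeds.
outcomes = iter([(False, "ERROR: Connection timed out fetching artifact"),
                 (True, "BUILD SUCCESS")])
print(run_with_retry(lambda: next(outcomes)))  # (True, 2)
```

A failure such as a compiler error does not match any transient pattern, so it is returned after a single attempt and goes straight to the developers instead of wasting build capacity on re-runs.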
Autopilot de-duplicates and aggregates critical errors, determines their root cause (such as an infrastructure or code issue), and notifies stakeholders through tools such as PagerDuty and Microsoft Teams. This all happens automatically on every build and takes just a few minutes.
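De-duplication and root-cause tagging can be illustrated with a small log-processing sketch. It assumes error lines contain the token `ERROR`, normalizes away numbers and hex addresses so repeats of the same error collapse into one signature, and tags each signature as an infrastructure or code issue using hypothetical keyword hints. Autopilot's actual ML-based analysis is not described in the post, and sending the summary to PagerDuty or Microsoft Teams is left to those tools' own integrations rather than shown here.

```python
import re
from collections import Counter

# Hypothetical normalization rules: replace details that differ between repeats
# of the same error (memory addresses, line numbers, durations) with placeholders.
_NORMALIZERS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDR>"),
    (re.compile(r"\d+"), "<N>"),
]

# Hypothetical keywords suggesting an infrastructure (not code) root cause.
INFRA_HINTS = ("timed out", "connection", "disk", "agent")


def signature(line: str) -> str:
    """Normalize one log line so duplicates map to the same key."""
    sig = line.strip()
    for pattern, placeholder in _NORMALIZERS:
        sig = pattern.sub(placeholder, sig)
    return sig


def aggregate_errors(log_lines: list[str]) -> list[tuple[str, int]]:
    """Return (signature, count) pairs for ERROR lines, most frequent first."""
    counts = Counter(signature(line) for line in log_lines if "ERROR" in line)
    return counts.most_common()


def classify(sig: str) -> str:
    """Crude keyword-based guess at the root-cause category."""
    lowered = sig.lower()
    return "infrastructure" if any(h in lowered for h in INFRA_HINTS) else "code"


def summarize(log_lines: list[str]) -> str:
    """One line per distinct error, suitable as a notification body."""
    return "\n".join(
        f"[{classify(sig)}] x{count}: {sig}" for sig, count in aggregate_errors(log_lines)
    )


build_log = [
    "ERROR Foo.java:12 cannot find symbol",
    "INFO compiling module",
    "ERROR Foo.java:40 cannot find symbol",
    "ERROR Connection timed out after 30000 ms",
]
print(summarize(build_log))
# [code] x2: ERROR Foo.java:<N> cannot find symbol
# [infrastructure] x1: ERROR Connection timed out after <N> ms
```

Collapsing repeats before notifying is what keeps a single underlying failure from producing dozens of alerts.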
Results:
With OpsMx Autopilot, the company’s IT team has reduced the time required to identify and correct build errors by 75%. The process of build issue identification and sending feedback to developers has been completely automated.
Autopilot has freed the equivalent of more than five full-time developers to focus on developing new features.
Conclusion
Autopilot is in use today at companies around the world that are improving their release cycles, reducing errors in production, and improving their governance processes.
Read more Autopilot user stories:
- Networking Leader Automates Build Analysis with OpsMx Autopilot
- Online Leader Accelerates Software Delivery
- Telecom Leader Accelerates Time to Market with OpsMx
To learn more about Autopilot or request a demonstration, please book an Autopilot demo with us.
OpsMx is a leading provider of Continuous Delivery solutions that help enterprises safely deliver software at scale and without any human intervention. We help engineering teams take the risk and manual effort out of releasing innovations at the speed of modern business. For additional information, contact OpsMx Support.