Online Leader Accelerates Software Delivery
OpsMx works with a company that provides a well-known destination site, founded in the early 2000s, with more than 200M unique monthly viewers worldwide. It has received more than 300M user reviews and is ranked as a leader in its segment.
Challenge
Accelerate Innovation by Reducing Software Delivery Time
Faced with stiff competition, the organization needed to increase the speed of delivering enhancements to their end customers while simultaneously reducing production failures caused by defective updates.
The primary bottleneck they faced was a lengthy manual approval process to move updates from staging to production. However, shortening the approval process had been proven to increase problems in production.
Their IT architecture is complex, aggravating the problem. They deploy a broad range of microservices-based applications on Kubernetes, as well as a large number of monolithic applications. Their CI/CD system was built using Jenkins, plugins, and custom scripts.
Moving more quickly was a key goal. They were able to process only 50 to 100 updates per month, and their goal was many hundreds of updates per month. Of course, like all companies, they are also under pressure to reduce costs. The process of verifying updates was estimated to have annual direct costs of more than $1M.
Every significant deployment is evaluated as it moves to production. The analysis requires three expert engineers, including at least one technical lead and one product engineer, and it takes an hour or more to decide whether to move the deployment forward.
The analysis process for every update was time-consuming because of the mountains of data generated. With hundreds of thousands of concurrent users, the applications produce a tremendous volume of metrics and logs. Consistently finding the “needle in the haystack” that indicates a potential problem is difficult, even for experienced engineers. As the team’s frequency of updates increased, the severity of the problem grew until they were nearly at a breaking point.
The analysis and decision process is a bottleneck
Gathering and filtering the metrics and logs was monotonous and time-consuming. The data analysis was slow and problematic because it depended on subtle differences that were hard to find. And many times the final approval decision was difficult because conflicting data would point to both promoting and rejecting the update.
The leader of the developer productivity team, who was responsible for improving this situation, said it best: “Even though most updates should be moved to production, you can’t assume that everything works all the time. We strive for uniformly fast AND reliable releases.”
Too many errors
Any impact on the availability and performance of the customer-facing applications directly affects company revenue and indirectly, but substantially, affects the image and reputation of the brand. Even so, too many updates were being approved incorrectly. Worse still, errors that should have been caught – because they had occurred before – continued to be made.
Requires Expensive Experts
In order to reduce the chance of an incorrect decision, the most senior engineers conducted most reviews and analyses. It was challenging to train inexperienced engineers due to the time pressure, the complexity and subtlety of the analysis, and the limited number of people who were qualified to train.
Solution
Autopilot: A Layer of Intelligence for Deploying with Jenkins
The best way to increase the speed and reliability of a process is to automate it using machine intelligence. The team had the vision to improve productivity: use ML to automate the verification and approval process in the deployment cycle.
To enable this vision, the solution needed to be as accurate and consistent as a human team of experienced experts. Any errors – either rejecting an update that should be approved or accepting an update that was later determined to be faulty – would have large consequences, so any solution needed to perform better than the human experts.
“Autopilot is our layer of intelligence that makes continuous delivery effective.” – Director of Developer Productivity
After a thorough evaluation of potential solutions, including trying to build the solution on their own, they chose to work with OpsMx and implement OpsMx Autopilot. Autopilot is an intelligence layer for software delivery that integrates with any CI/CD platform. It uses AI/ML to automate verification (see the screenshot below) and approvals, provide continual governance, and create visibility and insights into operations and best practices.
In this case, Autopilot gathers and evaluates logs stored in Elasticsearch and metrics from Datadog and others. Using natural language processing, statistical analysis, and machine learning algorithms, Autopilot analyzes every deployment and assigns a confidence score. The pipelines are configured to automatically promote updates to production when they are very likely to be successful, and reject them and return them for rework if the confidence score is too low.
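The promote-or-reject gate described above can be sketched in a few lines. This is only an illustration, not Autopilot's actual implementation or API: the 0 to 100 score range, the threshold values, the names `Decision` and `gate`, and the middle "review" band for scores that are neither clearly good nor clearly bad are all assumptions made for the example.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"  # move the update to production automatically
    REVIEW = "review"    # hypothetical middle band: hand off to an engineer
    REJECT = "reject"    # return the update for rework


def gate(confidence_score: float,
         promote_at: float = 90.0,
         reject_below: float = 60.0) -> Decision:
    """Map a confidence score (assumed to range from 0 to 100) to a pipeline action.

    The thresholds are illustrative defaults, not values from the case study.
    """
    if not 0.0 <= confidence_score <= 100.0:
        raise ValueError("confidence_score must be between 0 and 100")
    if confidence_score >= promote_at:
        return Decision.PROMOTE
    if confidence_score < reject_below:
        return Decision.REJECT
    return Decision.REVIEW


if __name__ == "__main__":
    for score in (97.5, 72.0, 41.0):
        print(score, gate(score).value)
```

In a real pipeline the thresholds would presumably be tuned per application, since a mission-critical service may warrant a stricter promotion bar than an internal tool.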
Results
Faster and More Reliable CI/CD Pipelines
Since the deployment of Autopilot, this company has seen significant improvements in software delivery velocity. Most production approvals now require zero time from an engineer; even decisions that need to be reviewed are completed more quickly because the data is gathered and initial analysis is completed automatically. Additionally, the history of similar errors is automatically retrieved, along with the appropriate corrective action, speeding the resolution of issues.
With Autopilot, the number of updates has increased from 100 per month to more than 1000, and errors in production have decreased as well.
The system has also improved the quality of the approval decision, both approving acceptable updates more quickly and rejecting more errors before they reach production. This improvement in accuracy is especially important in their most mission-critical applications – some applications run the Autopilot verification and approval process more than five times a day.
Overall, they have been able to increase the update velocity thanks in large part to reducing the approval cycle. They have moved from 100 updates per month to more than 1000, enabling them to more quickly respond to their customers.
The leader of the developer productivity team says, “Autopilot has really helped us by automating the analysis of our deployments. It is very reliable in finding potential issues and has proven itself to be better than our experts at evaluating risk. Because it is automated, it is very consistent – we don’t worry that it will have a bad day and miss an issue.”
Because the system continually learns, expert engineers can train Autopilot. Over time, this dramatically reduces the time they spend analyzing updates and frees them to work on higher-value activities.
“Autopilot is more effective than our experts at evaluating updates.” – Director of Developer Productivity
Overall, the new system has improved production reliability and has enabled faster development of new capabilities, adding the equivalent of more than six full-time senior engineers to the team.
The deployment of Autopilot is now moving to its second phase: automatic policy checking. For example, to more easily meet SOX regulatory compliance, the person implementing any given change cannot approve moving the change into production. Similarly, a QA manager must approve all significant updates. These policies and many others can be validated before an update is considered for promotion to production, saving even more time.
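The two example policies above can be expressed as simple pre-promotion checks. The sketch below is a hypothetical illustration and does not reflect how Autopilot models or evaluates policies: the `Change` record, its fields, the `policy_violations` function, and the idea of passing in a set of QA manager names are all assumed for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a change awaiting promotion to production."""
    implementer: str
    approvers: frozenset[str]
    significant: bool = False


def policy_violations(change: Change, qa_managers: frozenset[str]) -> list[str]:
    """Return the list of violated policies; an empty list means the change may proceed."""
    violations = []
    # Separation of duties: the implementer may not approve their own change.
    if change.implementer in change.approvers:
        violations.append("implementer approved their own change")
    # Significant updates need sign-off from at least one QA manager.
    if change.significant and not (change.approvers & qa_managers):
        violations.append("significant update lacks QA manager approval")
    return violations


if __name__ == "__main__":
    qa = frozenset({"dana"})
    ok = Change("alice", frozenset({"bob", "dana"}), significant=True)
    bad = Change("alice", frozenset({"alice"}), significant=True)
    print(policy_violations(ok, qa))
    print(policy_violations(bad, qa))
```

Returning every violation, rather than stopping at the first, lets the person reworking a change see all the problems in one pass.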
These policy checks will pay off in faster releases and better compliance, which in turn lead to higher-quality releases. The productivity team leader concluded, “We’re glad to be partnering with OpsMx and believe that Autopilot is the layer of intelligence that makes our continuous software delivery system effective.”
If you want to know more about Autopilot or request a demonstration, please book a meeting with us for an Autopilot demo.
OpsMx is a leading provider of a Continuous Delivery platform that helps enterprises safely deliver software at scale and without human intervention. We help engineering teams take the risk and manual effort out of releasing innovations at the speed of modern business. For additional information, contact us.