Optimizing Our CI Pipeline
My team is in the process of releasing a new service, and as part of that we’ve set up a continuous delivery pipeline using Jenkins. We use the pipeline to automatically run tests against each commit and, once the commit is approved, deploy it to each of our environments with as little risk as possible. After our initial implementation, we quickly realized that the pipeline was much slower than we wanted, which left all of us spending large amounts of time waiting for things to run. I looked into what was taking so long and how we could speed things up, ultimately cutting our pipeline time roughly in half. While I’m not going to get into all of our nitty-gritty details, I did want to share a few of the things I learned.
Consider Value vs Time
There’s always a trade-off. In the case of a pipeline, the trade-off is often some additional level of assurance that you won’t break something in exchange for some amount of time. For example, you can be sure you didn’t break any tests if you run all of them rather than just a subset, but running all of them also takes longer. While it can be easy to fall into the knee-jerk reaction of always picking speed or always picking robustness, the best answer is likely somewhere in the middle.
I took the time to evaluate every step in our pipeline and decide how much value it provided. For example, we were originally deploying to sandboxed instances of each environment before deploying to the actual environment. So we had a step to deploy to an isolated dev environment before hitting our shared dev environment, and likewise an isolated staging environment before our shared staging environment. The isolated dev environment is our chance to run our code in a container and is the first step in our pipeline after local tests, so it has a decent chance of catching things. Running in the isolated staging environment, however, was likely only going to catch problems with environment-specific configuration. Changes to that configuration should be very rare, so this environment is unlikely to catch anything. As a result, I decided to cut the isolated staging environment with the understanding that we will monitor the shared staging environment and fix it immediately if we break it.
Each additional step in the pipeline adds both time and some assurance that you won’t introduce a specific type of bug. It’s important to consider exactly what you’re hoping to prevent with a particular pipeline step. Once you know that, you can estimate how likely the scenario is, how severe it would be if it did happen and how difficult it would be to fix. With a rough idea of these, you can make a judgement about which steps are worth keeping.
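One way to make that judgement a little more concrete is to compare the time a step adds to every run against the expected cost of the problem it guards against. The sketch below is only an illustration: the `PipelineStep` fields, the `worth_keeping` rule and every number in it are made-up assumptions, not measurements from our pipeline, and severity and difficulty to fix are folded into a single "minutes to fix late" estimate.

```python
from dataclasses import dataclass


@dataclass
class PipelineStep:
    name: str
    minutes_per_run: float      # time the step adds to every pipeline run
    catch_probability: float    # guessed chance per run that it catches a real problem
    minutes_to_fix_late: float  # guessed cost of finding and fixing that problem later


def worth_keeping(step: PipelineStep) -> bool:
    """Keep a step when the expected cost it prevents exceeds the time it adds."""
    expected_cost_prevented = step.catch_probability * step.minutes_to_fix_late
    return expected_cost_prevented > step.minutes_per_run


# Illustrative numbers only.
isolated_dev = PipelineStep("isolated dev deploy", 5, 0.10, 120)
isolated_staging = PipelineStep("isolated staging deploy", 5, 0.01, 60)

for step in (isolated_dev, isolated_staging):
    print(step.name, "->", "keep" if worth_keeping(step) else "cut")
```

The estimates will always be rough, but writing them down makes it easier to see which steps are clearly worth their time and which ones deserve a second look.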
Speed Up Your Tests
In my first perusal of our pipeline, my initial thought was that of course we had to keep all of our tests and that there likely wasn’t anything I could do to speed them up without cutting things. This wasn’t actually the case. As it turned out, a number of our integration and end-to-end tests make calls to an external system to set up data. Because these calls go across a network, they can be slow. When I looked into it, I found that we were setting up this data before every test despite the fact that none of the tests modified it. We could simply reuse the same data in all of our tests without any impact on the quality of the test runs. By turning a call per test into a call per test class, we managed to cut a significant amount of time.
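As a minimal sketch of the per-class setup idea, here is how it might look with Python’s standard `unittest` module, which is not necessarily the framework we use. `create_external_test_data` is a hypothetical stand-in for the slow network call, and the test names and data are invented for illustration.

```python
import unittest

SETUP_CALLS = {"count": 0}  # counts simulated calls to the external system


def create_external_test_data():
    """Hypothetical stand-in for a slow network call that creates test data."""
    SETUP_CALLS["count"] += 1
    return {"account_id": "test-account", "balance": 100}


class AccountReadTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class instead of once per test. This is
        # only safe because none of the tests below modify the shared data.
        cls.data = create_external_test_data()

    def test_account_id_is_present(self):
        self.assertEqual(self.data["account_id"], "test-account")

    def test_balance_is_readable(self):
        self.assertEqual(self.data["balance"], 100)


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccountReadTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
print("tests run:", result.testsRun, "setup calls:", SETUP_CALLS["count"])
```

Had the same call lived in `setUp` instead of `setUpClass`, it would have run once per test method. The catch is that any test which mutates the shared data needs its own setup again.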
While you may not be in exactly the same situation, it is worth considering which pieces of your tests are slow and whether there’s any way to speed them up without impacting the integrity of the test. Are you creating actual data in a case where you could use a mock? Are you making an external call when that’s not what you’re trying to test? Are you setting up more data than the test needs? It’s easy to think that a small amount of time in a single test doesn’t matter, but these small amounts can quickly add up.
Look For Overlap
When building each step of the pipeline, it’s easy to think about that step in isolation. For example, when we publish to an environment, we want to run a smoke test as a quick gut check that everything is in place. For us, that was a call to make sure the expected version was deployed, verification that the health endpoint says it’s healthy and one quick happy path end-to-end test. If I’m running this smoke test manually, these steps make complete sense. However, I noticed that when we run the smoke test in the context of our pipeline, we always run it right before we run our full suite of end-to-end tests. That final end-to-end test in the smoke test no longer makes sense because it’s going to be immediately run again. In our case, we didn’t want to get rid of that check entirely in case someone was using the smoke test outside of the pipeline, so we added a parameter to the smoke test that skips the end-to-end test, and the pipeline passes in that parameter.
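A skip parameter like that could be sketched roughly as follows. The function name, the `skip_end_to_end` flag and the stubbed checks are all hypothetical and stand in for whatever the real smoke test calls.

```python
from typing import Callable, List


def run_smoke_test(
    check_version: Callable[[], bool],
    check_health: Callable[[], bool],
    run_happy_path: Callable[[], bool],
    skip_end_to_end: bool = False,
) -> List[str]:
    """Run the smoke checks and return the names of any that failed."""
    checks = [("version", check_version), ("health", check_health)]
    if not skip_end_to_end:
        # Skipped in the pipeline, where the full end-to-end suite runs next.
        checks.append(("happy path end-to-end", run_happy_path))
    return [name for name, check in checks if not check()]


calls = []


def stub(name: str) -> Callable[[], bool]:
    """Build a fake check that records that it ran and always passes."""
    def check() -> bool:
        calls.append(name)
        return True
    return check


# Pipeline usage: the end-to-end check is skipped.
failures = run_smoke_test(
    stub("version"), stub("health"), stub("e2e"), skip_end_to_end=True
)
print("failures:", failures, "checks run:", calls)
```

Defaulting the flag to `False` keeps the manual behavior unchanged, so only the pipeline has to opt in to skipping.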
This overlap can often be hidden inside other tasks and may not be immediately obvious, so it’s good to actually dig into each step. While I may not want to cut out an entire step, is there a piece of it that doesn’t make sense in the context of the pipeline? Does that piece make sense when run separately from the pipeline? If so, can I find a way to only skip it for the pipeline?
Branch Runs vs. Master Runs
The way our team works, we run the first part of our pipeline against every branch that’s pushed for code review. Then, after the code is approved and merged to master, we run our entire pipeline, including deploying to the shared environments. At first, we included full runs of everything on each branch run. However, while we’re actively working on a branch, we find ourselves waiting for the tests to finish. Once we’ve merged to master, we’re often moving on to something different and only need a notification telling us whether everything succeeded or there was a problem. Therefore, speeding up the branch runs saved us much more wasted time than speeding up the master run would have.
To that end, we looked at each step we were running against each branch and made a similar value assessment of what we were hoping to get out of a branch run and how bad the consequences would be if we didn’t learn about failures until after a merge to master. We ended up doing a few things including running reduced performance tests against the branches and removing a few static analysis-type checks. In the end, this is similar to the question of whether a step is needed in the pipeline at all. We’re considering the value of the step as a pre-merge step vs the time it takes to run each time.
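To illustrate the split, here is a small sketch of a step list where each step declares which kind of run it belongs to. The step names and the `runs_on` field are hypothetical and only loosely modeled on the changes described above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Step:
    name: str
    runs_on: FrozenSet[str]  # which run types include this step


BOTH = frozenset({"branch", "master"})
BRANCH_ONLY = frozenset({"branch"})
MASTER_ONLY = frozenset({"master"})

# Illustrative step list, not our actual pipeline definition.
STEPS = [
    Step("unit tests", BOTH),
    Step("reduced performance tests", BRANCH_ONLY),
    Step("full performance tests", MASTER_ONLY),
    Step("static analysis checks", MASTER_ONLY),
    Step("deploy to shared environments", MASTER_ONLY),
]


def steps_for(branch_name: str) -> List[str]:
    """Return the names of the steps to run for the given branch."""
    run_type = "master" if branch_name == "master" else "branch"
    return [step.name for step in STEPS if run_type in step.runs_on]


print("branch run:", steps_for("feature/faster-pipeline"))
print("master run:", steps_for("master"))
```

Keeping the decision in one place makes it easy to revisit later if a master-only check starts failing often enough that it belongs back in the branch run.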
Parallelize All The Things!
Our original pipeline was beautiful. It had some things running in parallel, but they were all grouped by task type. We had our tests running in one step and our code analysis tasks in another. We had a separate step for the build and yet another for some security checks and so on. I quickly realized that while this gave us a clean progression of “x passed, so now it makes sense to run y,” it wasn’t as efficient as it could be.
As a result, I worked to try to parallelize as much as possible. Some things still obviously don’t make sense. For example, we can’t run our check for code coverage until after the tests run since the test runs are what generate the individual reports on how much code is covered by those tests. However, there was a lot we could parallelize. In one case, I also found two steps that appeared impossible to parallelize, but when I dug into them it turned out that one small subtask was the only thing that couldn’t be run in parallel, so I was able to separate that small piece out and run everything else together. It’s especially valuable to look at what the slowest steps are and make sure that we’re running as much as possible at the same time as those slow steps.
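The shape of that change can be sketched with Python’s `concurrent.futures`. This is not how a Jenkins pipeline is written. The task names and durations are made up, and `run_task` simply sleeps in place of real work. Independent tasks run side by side, while the coverage check stays chained behind the tests it depends on.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def run_task(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a pipeline task; sleeps instead of working."""
    time.sleep(seconds)
    return name


def run_chain(tasks: List[Tuple[str, float]]) -> List[str]:
    """Run dependent tasks one after another and return their names in order."""
    return [run_task(name, seconds) for name, seconds in tasks]


def run_pipeline() -> List[str]:
    chains = [
        # Coverage needs the reports the tests generate, so it stays sequential.
        [("tests", 0.2), ("coverage check", 0.05)],
        [("static analysis", 0.2)],
        [("security checks", 0.2)],
    ]
    completed: List[str] = []
    with ThreadPoolExecutor(max_workers=len(chains)) as pool:
        futures = [pool.submit(run_chain, chain) for chain in chains]
        for future in futures:
            completed.extend(future.result())
    return completed


start = time.perf_counter()
completed = run_pipeline()
elapsed = time.perf_counter() - start
print(completed, f"finished in {elapsed:.2f}s")
```

With these made-up durations, running everything sequentially would take about 0.65 seconds, while the parallel version is bounded by the slowest chain at roughly 0.25 seconds. That is why it pays to schedule as much as possible alongside the slowest step.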
Leave No Stone Unturned
When I was talking through many of these changes, I would get responses from people along the lines of ‘yeah, but that will only save 30 seconds.’ It’s easy to try to only go after the big things and ignore all of the small things. However, as it turned out, I was able to make a lot of those small changes and in the long run, they added up to a lot.
For a few other steps, we quickly realized that they were obviously slow. These were steps owned by other teams and based on my initial conversations with others, it didn’t seem like there was much chance they could be sped up. The steps had to be run and there was no way to get rid of them. However, at some point, I actually talked to the team who owned those tasks and while there was no way to remove them, they did have ideas on how to make them a lot faster.
In both of these cases, if I had stuck with my initial assumptions about what was possible or what was worth following up, I would have missed out on a lot. The vast majority of the time I was able to save in the pipeline was based on small improvements that I was able to find by digging into everything.
Our pipeline could still be improved, and as we add more to it, this will likely need to be an ongoing process. While the individual things I changed may not apply to our pipeline in the future or to any other pipeline, many of the principles remain the same. When optimizing a pipeline, it’s important to carefully consider the value trade-off of each task, avoid overlap and parallelize everywhere possible.