[Cover image: photo by Der Berzerker via Creative Commons]

Twice as Many Tests in a Quarter of the Time

How Adwerx engineering doubled its test suite and optimized the CI build to run in only 5 minutes

Doug Hughes
Adwerx Engineering
Nov 2, 2018


Not that long ago, the Adwerx engineering team had a problem. Our test suite was taking 25 minutes (or more) to run through our continuous integration system. While that could perhaps be tolerated, the fact was that our tests were just plain slow. Running even a single example took at least 45 seconds. This was increasingly becoming a serious bottleneck that slowed down our entire engineering team, wasted time and money, and encouraged developers to cut corners when writing and running tests.

This is the story of how Adwerx took a 25-minute build down to as little as 6 minutes, while doubling the size of the test suite and making the tests more reliable and easier to read. We certainly haven’t achieved perfection, but read on to learn how we tackled this problem.

Adwerx uses CircleCI for continuous integration. At the time, we had 12 containers and each build used three. That meant we could only have four builds running at the same time. CircleCI is configured to run builds when code is pushed to GitHub. So, if you were the fifth developer to push to GitHub at any point in time, you might have to wait up to 25 minutes before your own 25-minute build would even start. Also, we had a number of tests that failed intermittently, meaning any one of these 25-minute builds could fail arbitrarily, sending you to the back of the CI line. Unfortunately, we ran into this situation frequently. To make matters worse, running the entire test suite locally took more than an hour and was generally impractical.

The Problems With Our Rails Test Suite

In November of last year we had 388 spec files containing 51,000 lines of code. Builds of our master branch failed 21% of the time due to issues like test-order dependence and data precision. We use RSpec, and our tests made heavy use of let, before, and subject, which made our specs convoluted and difficult to understand.

Additionally, we had an in-house tool we created called Archetypes. Archetypes are basically factories, but much more rigid, convoluted, and specific to Adwerx. Like most things, they started out as a good idea: we should have a way to easily create instances of our model objects for use in testing. But, as with most simple things in software development, they rapidly became complex as they tried to meet the needs of every test case in our test suite. Initially we just wanted to create basic objects like a Campaign. But later, we needed to create a basic campaign that was slightly different. Then we wanted a Campaign that was significantly different. And so forth. This led to an extremely complex class with more than 100 interdependent methods, all of which read and write to the database. Calling a single method to create a campaign could result in dozens of database interactions.

At that time, Adwerx had about 13 engineers. Each of us worked in different branches, rapidly making changes. As good developers should, we committed our code early and often, pushing changes to GitHub. This triggered CI builds, meaning builds were almost always queued up during the workday.

There was one thing we knew for sure was a problem: running db:seed for the test suite. At Adwerx we use Rails seed data in a fairly atypical way. Rather than using seeds to ensure that we have the minimum set of data required for our application to run, we used them almost as a form of configuration. For example, we have 342 different subscription plans at Adwerx. These are all generated by a 1300 line seed file. We have dozens of similar seed files. All together, it could take 45 seconds or more to seed the database for testing. Unfortunately, our tests depended on seed data to run successfully. This meant that running even one example would take at least 45 seconds. As you can imagine, this was a big deterrent to TDD.
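To give a sense of the shape, a seed file like ours might look roughly like this (the model and attribute names here are invented for illustration):

```ruby
# db/seeds/subscription_plans.rb (illustrative sketch)
PLANS = [
  { name: "Realtor Basic", price_cents: 59_00, duration_days: 30 },
  { name: "Realtor Plus",  price_cents: 99_00, duration_days: 30 },
  # ...hundreds of similar entries...
].freeze

PLANS.each do |attrs|
  # Idempotent so re-seeding doesn't duplicate plans
  SubscriptionPlan.find_or_create_by!(name: attrs[:name]) do |plan|
    plan.price_cents   = attrs[:price_cents]
    plan.duration_days = attrs[:duration_days]
  end
end
```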

Introducing Spec Redux

Now that we had acknowledged a problem, we needed to find a solution. One thing we couldn’t do was throw away the old test suite. Despite the cruft, we had lots of valuable tests that we didn’t want to lose. We couldn’t simply rip out seeds and archetypes without breaking the entire test suite, and we didn’t have the time to rework everything. So, what to do?

In the end, we decided to make a clean break with the past. We created a brand new test suite that was separate from our original test suite. We would create new specs in this new suite and endeavor to apply modern testing best practices. We called this new suite spec_redux.

Factories

The first thing we did in spec_redux was stop automatically seeding our database before running tests. Now we only needed to wait for Rails to start up. We also introduced Factory Bot as a replacement for our archetypes system. These two things knocked 45 seconds off the startup times for our tests, making TDD feasible again!

With Factory Bot, we made every effort to use the build and build_stubbed methods where possible. The build method creates a full-fledged ActiveRecord object, but doesn’t save it to the database by default. build_stubbed will create a stub object that cannot actually be persisted. Our team now defaults to using these two methods, only switching to create when the other two won’t meet our needs. This saves us the time and expense of unnecessary database interactions.
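For instance, assuming a campaign factory along these lines (the attribute names are invented, and the bare build/create helpers assume FactoryBot::Syntax::Methods is included in the RSpec configuration):

```ruby
FactoryBot.define do
  factory :campaign do
    name   { "Spring Listing Push" }
    budget { 99_00 }
  end
end

campaign = build(:campaign)         # instantiated but not saved; no SQL by default
stubbed  = build_stubbed(:campaign) # fake id, never touches the database; raises if you try to persist it
saved    = create(:campaign)        # saved to the database; the slowest option
```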

Flattening Out Our Tests

Around this time we also started to rethink our standards related to testing. If we were going to create a whole new test suite, we should probably make every effort to do our best work. One of the primary decisions we settled on was to flatten out our test structure. What this means is that we decided to stop using let, before, and after in our specs. We also stopped using subject. We did so because these all effectively obscure information and operations that are vital to the tests working. Furthermore, they can be difficult to untangle mentally. Consider this example extracted from a real spec:
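Reconstructed and trimmed, with illustrative names (including a hypothetical Archetypes call), it looked something like this:

```ruby
describe AnxCampaign do
  before :all do
    @campaign = Archetypes.campaign_with_anx # hypothetical archetype method
  end

  after :all do
    @campaign.anx_campaign.destroy
  end

  context "when the campaign is synced" do
    let(:campaign) { @campaign }
    let(:anx_campaign) { campaign.anx_campaign }

    subject { campaign.anx_campaign }

    before :each do
      campaign.reload
      anx_campaign.reload
      allow(subject).to receive(:synced?).and_return(true)
    end

    # ...roughly 170 more lines of unrelated lets, instance
    # variables, nested contexts, and examples...

    it { is_expected.to be_active }
  end
end
```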

There are numerous problems with this example:

  • Hiding Context: Let’s say this test fails at the expect. What is subject? Note that a developer might look at the class name specified by the describe block and miss the subject block hidden in the first context block. Also, how does subject relate to campaign or anx_campaign? My point here is that a developer has to stop and mentally unpack subject.
  • Cognitive Complexity: What order does this code run in? Perhaps it’s bad that we even have to think about this?
  • Sharing State Across Tests: The first before :each ensures that we reload campaign and anx_campaign. This is presumably because they’re reused across multiple examples. This creates the potential for order-dependent test failures, not to mention allowing different examples to impact each other. At the same time, we now need an after :all that deletes the anx_campaign.
  • Everything All Over the Place: One thing not shown here is that in our real code, there are 170 lines above the expect call, meaning you can’t see all of the context at once! These lines contain a number of unrelated and unused let blocks, instance variables, contexts, and examples.
  • Mocking The Subject Under Test: This test is supposed to be confirming that the AnxCampaign class behaves correctly, but we’re modifying its behavior while testing it.
  • Difficult To Edit: Adding tests to such a specific and convoluted setup is surprisingly difficult. It’s very challenging to refactor the subject, before blocks, lets, and so forth for a new test case.

This example shows that it can take a lot of cognitive effort to unpack even a one-line example.

To address these problems, we decided to adopt the four-phase test pattern. With this pattern, each test has distinct sections dedicated to setting up dependencies, exercising the code, asserting that the code behaved correctly, and, if needed, cleaning up after ourselves.

To adopt this pattern we had to make a conscious decision not to prioritize the “DRYness” of our specs. Instead, we would create all of the context needed for a specific example in that example, using factories as much as possible, even if very similar code appeared in another example. Now, everything we need for a test is in one place, and it’s much easier to see what influences what.

In spec_redux, the above example might look something like this:
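```ruby
it "is active when synced" do
  # Setup (factory and attribute names are illustrative)
  campaign = create(:campaign)
  anx_campaign = create(:anx_campaign, campaign: campaign, synced: true)

  # Exercise
  result = anx_campaign.active?

  # Verify
  expect(result).to be(true)
end
```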

This is significantly easier to understand!

Making Things Easier For Developers

Since we had two test suites, our developers needed to know how to run one or the other. To address this, we created a simple binstub for RSpec:
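```ruby
#!/usr/bin/env ruby
# bin/rspec: a sketch of the idea rather than our exact script.
# If any argument points into spec_redux, flag the run so the app
# skips database seeding, and point RSpec at that directory.
if ARGV.any? { |arg| arg.start_with?("spec_redux") }
  ENV["SPEC_REDUX"] = "true"
  ARGV.unshift("--default-path", "spec_redux")
end

require "bundler/setup"
load Gem.bin_path("rspec-core", "rspec")
```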

This script checks to see if we’re trying to run tests in our old test suite, spec, or in spec_redux. In the case that we’re running in spec_redux, we set an environment variable SPEC_REDUX that tells our application to disable database seeding.
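On the application side, honoring that flag can be as simple as guarding the seed step in the suite setup. A minimal sketch, assuming the hook lives in the RSpec configuration:

```ruby
# spec/rails_helper.rb (sketch): seed once per run, but only for the
# legacy suite; spec_redux builds its own data with factories.
RSpec.configure do |config|
  config.before(:suite) do
    Rails.application.load_seed unless ENV["SPEC_REDUX"]
  end
end
```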

Now developers don’t need to think about what test suite they’re working in. To run a test in the old spec test suite they’d use:
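```
bin/rspec spec/models/campaign_spec.rb   # file path illustrative
```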

For the new spec_redux suite they’d use:
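```
bin/rspec spec_redux/models/campaign_spec.rb   # file path illustrative
```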

Test Days

Once we defined our new testing standards and created spec_redux, we needed to begin using it. We started by instructing developers to only create new specs in spec_redux. If a developer was creating a new spec in spec_redux that already existed in spec, we encouraged them to try to migrate the old examples from spec to spec_redux. More often than not, this was not a trivial task and we had to leave the old examples where they were. This meant that we’d often have specs for the same class in both spec and spec_redux. This was OK. We just considered the old specs to be legacy.

An astute reader may have noticed that we haven’t really done anything about the overall runtimes yet. For the most part, all we’ve done is add yet another test suite on top of an already slow test suite. To address this, we used the CircleCI API and our build artifacts to collect runtime and reliability statistics about our test suite. We identified the 20 slowest spec files, the 20 slowest individual examples (which might not be in the slowest specs), and the examples that failed the most frequently.
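The collection script itself can be small. A sketch using CircleCI’s v1.1 API (ORG/REPO and the token are placeholders, and the per-test endpoint assumes builds store JUnit-formatted test results):

```ruby
require "net/http"
require "json"

TOKEN = ENV.fetch("CIRCLE_TOKEN")
BASE  = "https://circleci.com/api/v1.1/project/github/ORG/REPO"

def fetch_json(url)
  JSON.parse(Net::HTTP.get(URI(url)))
end

# Recent completed builds, then the per-test metadata for each.
builds   = fetch_json("#{BASE}?limit=30&filter=completed&circle-token=#{TOKEN}")
runtimes = Hash.new(0.0)
failures = Hash.new(0)

builds.each do |build|
  tests = fetch_json("#{BASE}/#{build["build_num"]}/tests?circle-token=#{TOKEN}")["tests"]
  tests.each do |test|
    runtimes[test["file"]] += test["run_time"].to_f
    failures[test["name"]] += 1 if test["result"] == "failure"
  end
end

puts "Slowest spec files (cumulative seconds):"
runtimes.sort_by { |_, secs| -secs }.first(20).each do |file, secs|
  puts format("%8.1f  %s", secs, file)
end

puts "Most frequently failing examples:"
failures.sort_by { |_, count| -count }.first(20).each do |name, count|
  puts "#{count}x  #{name}"
end
```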

Armed with that information, we set aside a day — Test Day — where our entire team focused only on addressing these problems. Our goal was to improve the overall performance, reliability, and readability of our test suite. In addition to addressing as many of the problematic specs as possible, we pruned useless and pointlessly repeated tests. For example, some tests iterated over all of our partners, running the exact same assertions for each one.

Test Day was a resounding success! We ended up with builds as short as 17:43! A single day of work saved up to seven minutes per build. Our failure rate also dropped from 21% to 13%, which was a huge boon to productivity.

We learned a few things of note from Test Day. First off, some tests are much easier to migrate than others. It took some experimentation to find strategies that worked well. Also, some of our most problematic specs use shared examples. We’ve learned to dread it_behaves_like. In most cases we were unable to untangle these specs in only one day. We also learned that some of our team members didn’t feel as confident in their testing skills as they’d like.

Given our success, we decided to hold a second Test Day — Test Day Redux. Again we collected information about the slowest and most frequently failing tests. But this time we gave our engineers two options: they could either focus on addressing the problematic tests, or they could attend a day’s worth of training on testing. About half the team went to training.

At the end of Test Day Redux we had knocked another three minutes off our builds. We were now down to about 15 minutes per build and had leveled up our team’s testing skills. All in all, we’d knocked 10 minutes off our test suite and greatly improved its reliability, while simultaneously becoming better overall at testing.

Continuous Integration Updates

After holding two Test Days, we felt that we’d hit most of the low-hanging fruit. It was going to be difficult to knock more minutes off our build times. At the same time, we needed to migrate from CircleCI 1.0 to 2.0, since 1.0 was reaching its end of life. This presented the perfect opportunity to reevaluate our CircleCI configuration.

Our CircleCI 1.0 configuration provided 12 containers. Each build would use three containers and distribute tests sequentially across them, not taking into account runtimes.

The first thing we did was double the number of containers at CircleCI from 12 to 24 and dedicate six containers per build. Surprisingly, this didn’t significantly improve our build times. There were two primary reasons for this:

  1. Each spec takes a different amount of time to run. Distributing the specs sequentially across six containers still resulted in some containers that would take much longer to finish than others. We needed a better way to balance tests across containers.
  2. You might remember that our original test suite (AKA spec) requires us to seed the database, which takes about 45 seconds. Since spec and spec_redux would run on the same containers, we would have to set up the database, seed it for spec, run spec tests, drop and recreate the database, and finally run spec_redux tests. That meant we still paid the 45-second seeding penalty for spec on every container.

A new CircleCI 2.0 feature is what they call workflows. Workflows allow you to define multiple jobs that can run at the same time on different containers. That meant that we could configure different jobs to run spec and spec_redux in parallel. Additionally, CircleCI has a command line tool that can be used to identify test files and balance them across containers based on their runtimes. Using this, we dedicated 4 containers to running spec and 2 to running spec_redux. Again, we realized significant improvements on our runtimes.
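Trimmed way down, the relevant parts of a CircleCI 2.0 config along these lines can look roughly like this (the image, rake tasks, and job set are illustrative):

```yaml
version: 2
jobs:
  spec:
    parallelism: 4
    docker:
      - image: circleci/ruby:2.5
    steps:
      - checkout
      - run: bundle exec rake db:create db:schema:load db:seed
      # Balance spec files across containers using stored timing data
      - run: bundle exec rspec $(circleci tests glob "spec/**/*_spec.rb" | circleci tests split --split-by=timings)
  spec_redux:
    parallelism: 2
    docker:
      - image: circleci/ruby:2.5
    steps:
      - checkout
      - run: bundle exec rake db:create db:schema:load   # no seeding needed
      - run: SPEC_REDUX=true bundle exec rspec --default-path spec_redux $(circleci tests glob "spec_redux/**/*_spec.rb" | circleci tests split --split-by=timings)
workflows:
  version: 2
  build:
    jobs:
      - spec
      - spec_redux
```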

But still, there were areas we could improve. For example, we knew that two of our old spec files would take several minutes each to run. We decided to isolate these two bad actors to their own job that would also run in parallel to spec and spec_redux. We dubbed these spec_slow, though the files were still in spec. We dedicated two more containers to spec_slow.

We also addressed how we were building assets and running our JavaScript tests. With CircleCI 1.0 we would arbitrarily run them on only the first container. This meant that the first container always had a penalty of a few minutes that the other containers didn’t. There was no reason to do this, so we created yet another job, yarn, and ran it on a single container.

After various other optimizations, our container usage now looked something like this:

spec

Containers: 8
Runtime: ~7:30

spec_slow

Containers: 2
Runtime: ~6:45

spec_redux

Containers: 2
Runtime: ~5:30

yarn

Containers: 1
Runtime: ~3:00

Since these jobs all run in parallel, our builds now took around 7 minutes and 30 seconds to complete, assuming there was no queuing. That’s a 70% improvement from where we started! And, since some of these jobs complete early, those containers are immediately freed up for other builds.

Where We Are Today

The work described above was done in small projects spread over the last year. Here’s a comparison of where we were in November 2017 compared to October 2018:

November 2017

Number of Spec Files: 388
Lines of Testing Code: 51,436
Build Time: > 25 minutes
Failure Rate: 21%

October 2018

Number of Spec Files: 664
Lines of Testing Code: 82,651
Build Time: < 6 minutes
Failure Rate: 12% (including cancellations)

Additionally, our focus on improving our test suite has led to 95% coverage of new code.

At Adwerx, we continue to diligently improve our test suite. We’ve eliminated almost all of our intermittent failures and migrated many more specs to spec_redux. In fact, spec has started to take less time than spec_redux. We’ve been forced to reallocate containers to our jobs to balance things out again. These days our builds can take as few as 5 minutes and 30 seconds and our median queue time is only 2 seconds.
