Four load testing mistakes developers love to make

As I’ve argued previously, it makes a lot of sense to have developers load test their own code. However, developers are not typically experienced performance testers. They’re typically optimistic about how their service will perform under load, and are keen to move on to building the next cool thing. (This is in contrast to dedicated performance testers who sometimes feel like they only show their value when they prove that something is wrong.)

This combination of inexperience and optimism can lead to developers taking the wrong load testing shortcuts, and letting their performance tests give them a false sense of security. Here are four load testing mistakes I’ve seen developers (mostly me) make:

Running a load test that is too short

A developer new to performance testing may believe that if their service can sustain high loads for a few minutes at a time, then it’ll be able to handle that load indefinitely. This is generally not the case, because the ability to handle short spikes is built into modern systems at many levels: your virtual hardware might be optimised for burstable performance, your database likely has buffers and the ability to postpone certain types of work until the spike is over, and your web server likely has some built-in queueing. If your load tests are too short, these measures will pick up the slack.

So how long should your load tests be? The typical advice is ‘until they appear to be in a steady state’. However, waiting for your app’s response time to stabilise is not enough. You’ll also want to confirm that resource usage (CPU, memory, disk throughput and latency etc.) on your app servers, databases and other dependencies has stabilised.

When you are working in a cloud environment with components designed for burstable performance, it can be especially hard to know if you’ve reached a steady state. Let’s take a couple of examples from AWS: if you are using EC2 with T2 instances then you’ll see a sudden and dramatic performance drop (the burst credit cliff of doom) when they run out of CPU credits. If you are using General Purpose SSD storage then you’ll see something similar when you run out of I/O credits. Because the performance drop is so sudden, the only way you’ll know it is coming is by checking your burst credit usage during the tests. RDS instances using General Purpose SSD storage are especially problematic as, at the time of writing, you cannot directly view their I/O credit balance.

Beware the burst credit cliff of doom (font credit: xkcd)
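
As a concrete (if minimal) sketch, the script below polls the relevant CloudWatch metrics with boto3 so you can watch credit balances drain while a test is running. It assumes the standard CPUCreditBalance (EC2) and BurstBalance (EBS) metrics; the region, instance ID and volume ID are placeholders you’d swap for your own.

```python
# Minimal sketch: check burst credit balances during a load test using boto3.
# Region, instance ID and volume ID below are placeholders, not real resources.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def latest_metric(namespace, metric, dimensions):
    """Fetch the most recent datapoint for a CloudWatch metric, if any."""
    now = datetime.datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - datetime.timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None

# CPU credits remaining on a T2 instance (placeholder instance ID).
cpu_credits = latest_metric(
    "AWS/EC2", "CPUCreditBalance",
    [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
)

# Remaining I/O burst balance (percent) on a gp2 EBS volume (placeholder volume ID).
io_balance = latest_metric(
    "AWS/EBS", "BurstBalance",
    [{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
)

print(f"CPUCreditBalance: {cpu_credits}, EBS BurstBalance: {io_balance}%")
```

Running something like this every few minutes during a long test makes the cliff visible before you fall off it.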

Another factor to keep in mind is that your service may need to compete for resources with other processes running on the server. Cron jobs are likely to impact your application’s performance, so make sure that your performance tests are long enough to take this into account.

As a general rule, when you are first getting to know a system it’s worth making your performance tests a little longer than you think necessary. Once you start to understand your system’s performance characteristics better you’ll learn when a shorter test could be appropriate.

Ignoring warning signs

If you’re too focussed on answering one particular question then it can be easy to ignore warning signs. If you notice your application’s behaviour changing as you increase load then it’s important that you understand why. I saw an example recently that illustrates this nicely:

One of the services I worked on was switching from serving Klarna’s newer markets (UK, US) to serving all of Klarna’s markets. This was shortly before Black Friday, when the number of people buying online more than doubles. These factors combined meant that we needed to prepare for a dramatic increase in traffic. The service is a part of Klarna’s purchase flow so we had to be absolutely sure that it would stay up and stay fast.

One of our experienced developers was asked to answer the question “can it sustain X requests per second with an acceptable response time?” He configured our tests to run load that was a little higher than the maximum we expected over Black Friday. When he ran the tests, the app’s queueing time increased over the first couple of minutes, but soon stabilised at a response time that was slower than typical yet still within acceptable limits. He also saw that CPU usage was at 100% and that JMeter was delivering a few percent under the target throughput.

What the developer saw (font credit: xkcd)

Focussing on the question he’d been asked to answer, the developer concluded that the service performed well and would be able to handle the required load with some minor, acceptable slowdown. He assumed that the machine running JMeter wasn’t quite powerful enough to deliver the intended load, but that it came close enough for the test to be valid.

In reality the service was buckling under the load. If JMeter had been able to maintain the intended load, the queuing time would have grown rapidly, the servers would have stopped responding to health checks from the load balancers, and they would eventually have gone offline. Fortunately the developer shared his findings with the rest of the team and the warning signs were spotted. Increasing the instance count (and repeating the test!) was a quick win that allowed the service to perform well over Black Friday.

So why did the service survive the load test? JMeter tests are configured with a fixed number of threads. The default behaviour is for a thread to send a request to the web server and wait for the response. After two minutes of testing, all the threads were busy waiting for the web server to respond (mostly waiting in the queue). At this point JMeter could only send new requests as fast as old requests completed, so the web server was effectively applying back-pressure on JMeter. This was why JMeter could not generate the target load. If the thread count had been much higher, or the threads had been set to time out, then JMeter would have been able to generate the load. The developer would have seen a much more dramatic increase in queue time and, if he was lucky, he’d also have seen the servers fail.
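
To illustrate the mechanism (not the original test plan), here’s a minimal sketch of a closed-loop load generator in Python: a fixed pool of threads that each wait for a response before sending the next request. The URL, thread count and duration are made up; the point is that the achieved throughput is capped at threads divided by response time, so a struggling server quietly drags the offered load down.

```python
# Minimal sketch of why a fixed thread pool self-throttles (closed-loop load).
# Each worker sends a request and blocks until the response arrives, so the
# offered rate can never exceed threads / response_time. The URL, thread count
# and duration are illustrative placeholders.
import threading
import time
import urllib.request

TARGET_URL = "http://localhost:8080/health"   # placeholder endpoint
THREADS = 50                                  # fixed pool, like a JMeter thread group
DURATION = 60                                 # seconds

completed = 0
lock = threading.Lock()

def worker(stop_at):
    global completed
    while time.time() < stop_at:
        try:
            # Blocks until the server responds: a slow server slows this loop down,
            # which is exactly the back-pressure described above.
            urllib.request.urlopen(TARGET_URL, timeout=120).read()
        except Exception:
            # If the endpoint is unreachable, back off briefly instead of spinning.
            time.sleep(0.1)
        with lock:
            completed += 1

stop_at = time.time() + DURATION
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If the server's response time grows, this number falls short of the target
# rate instead of the queue exploding.
print(f"Achieved {completed / DURATION:.1f} requests/second with {THREADS} threads")
```

JMeter’s fixed thread groups behave the same way, which is why the test reported a few percent under the target throughput rather than a collapsing service.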

The take home message here is that it is always worth investigating surprising results, even if they aren’t what you originally set out to test.

Reusing test data

Generating test data to use in your requests can be quite a slow process, so developers are often tempted to reuse the same data for consecutive tests. This is problematic for a number of reasons. Applications often have radically different logic when writing data that has been seen before, and databases will behave differently for values that already appear in their indexes. There are also plenty of places where reads may have been cached: the database could have cached the query, or the page on disk with the relevant data, and the application could have caches for the data or, in the case of a web app, for parts of the page.

In short, you should either generate new data each time you run the test, or have some way of spinning up a new environment for each test.
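
As a sketch of the first option, the script below regenerates a data file with unique values immediately before each run. The field names and file format are illustrative rather than taken from any particular service, but something like this can be fed to most load testing tools (for example via JMeter’s CSV Data Set Config).

```python
# Minimal sketch: generate fresh, unique test data for every run so caches and
# unique indexes behave as they would with real traffic. Field names are
# illustrative placeholders.
import csv
import random
import uuid

def fresh_order(run_id: str) -> dict:
    """Build an order payload the system has never seen before."""
    return {
        "order_id": f"{run_id}-{uuid.uuid4()}",                      # unique per request
        "email": f"loadtest+{uuid.uuid4().hex[:12]}@example.com",    # unique per request
        "amount_cents": random.randint(500, 50_000),
    }

def write_test_data(path: str, rows: int) -> None:
    """Write a CSV that the load test tool reads one row per request."""
    run_id = uuid.uuid4().hex[:8]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "email", "amount_cents"])
        writer.writeheader()
        for _ in range(rows):
            writer.writerow(fresh_order(run_id))

if __name__ == "__main__":
    # Regenerate the data file immediately before each test run.
    write_test_data("orders.csv", rows=100_000)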

Assuming the best

When developers draw conclusions from performance tests they tend to assume that the production environment is permanently, perfectly, healthy. We all know that this isn’t the case, and that in distributed systems failure happens all the time. By deliberately causing things to go wrong during your load test you can find out how your service will perform during these failures, and how quickly it can recover from them.

Picture yourself as Godzilla in a lab, smash things and take careful note of what happens. (Credit: Yatir Keren)

There are plenty of things you can deliberately break during your tests. You could tell databases to fail over, halt all your instances in an Availability Zone or cause dependencies to become slow, buggy or go offline. Picture yourself as Godzilla in a lab, smash things and take careful note of what happens. This is one of the areas where I’ve seen the most performance testing wins.
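
As one small example of the database case, the sketch below uses boto3 to force a Multi-AZ RDS failover partway through a test, logging the time so you can line it up with your response-time graphs. The instance identifier and region are placeholders, and you would of course only point this at a test environment.

```python
# Minimal sketch: force an RDS Multi-AZ failover during a load test and record
# when it happened. The region and DB identifier are placeholders.
import datetime
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

def force_failover(db_instance_identifier: str) -> None:
    """Reboot the instance with a forced failover to its standby replica."""
    print(f"{datetime.datetime.utcnow().isoformat()} forcing failover of {db_instance_identifier}")
    rds.reboot_db_instance(
        DBInstanceIdentifier=db_instance_identifier,
        ForceFailover=True,   # only valid for Multi-AZ deployments
    )

if __name__ == "__main__":
    force_failover("my-loadtest-db")  # placeholder identifier
```

Watching how long your service takes to notice the failover, and how it behaves while the database is unavailable, is usually far more informative than another happy-path run.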

Have no fear at all

Don’t let these examples scare you; you wrote your application and you’re best placed to understand its performance. The more you test it, the better you’ll understand it. Even an imperfect load test will teach you something.

That said, if you have experienced testers in your organisation, don’t hesitate to ask for their input when you are getting started. They can give you some great pointers and help you make sense of what you are seeing. They’re also usually very happy to see developers testing their own applications so that they can focus on system level performance tests.

Finally, don’t forget that production is the ultimate load test. Examine your application’s production metrics and make sure that they make as much sense to you as the metrics from your load test.

Have fun!
