By **Walker Rowe**

Here we are going to explain one way that you can analyze load performance test results using analytics. We pick the simplest possible example.

Suppose your load performance test fails because the mobile app screen response time showed 450 ms of latency. Is this statistically significant?

Well, first of all, how do you know if your test “failed” due to latency? You must have set some threshold. Where did you get that number? Was it based on service levels? Was it a hunch? Do you have a notion of what is “slow,” “fast,” or “unacceptable”? When are such results “out of bounds,” meaning they are not to be trusted because the chance of that happening again is remote?

The analytics way to do this is to use statistics to determine whether results are statistically significant. To do it any other way is no better than guessing. (You might remember that was the theme of the Brad Pitt movie “Moneyball.” The stubborn baseball scout preferred to guess. He was fired when proven wrong.)

**The Signal-to-Noise Problem**

So what does “statistically significant” mean?

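Flagging anomalies is what we are interested in here. When a load performance value is out of bounds, we call that an anomaly. That means it differs from the rest of the data by more than chance alone would explain, which is what statisticians call a statistically significant difference. Another term for such a value is “outlier.”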

From the field of radio communications we have the analogy of the signal-to-noise problem when we look at data using statistics. This answers the question: what data points in a sample should we reject because they are just noise? These are outliers that would skew our measurement if we took them into consideration.

To understand this thinking, you have to look back through the fog of college and see if you can remember what a standard deviation is. You have probably heard this term used in, for example, political surveys. A Gallup poll might say “Senator Marco Rubio leads in polling at 23% with a margin of error of ±3%” or “Senator Clinton is polling at 24% with a margin of error of ±2%.” The margin of error is the uncertainty of the statistic due to the size of the sample.

Political polling is based on the idea of a normal distribution. That is also called the bell curve, because it looks like a bell. This says that if you look at lots of data points, the values will tend to cluster around a central point called the mean. How scattered the test results are is measured by the variance, which is the average of the squared differences between each data point and the mean. The standard deviation is the square root of the variance. (The differences are squared because a data point can be above or below the average, so the raw differences are positive or negative and would cancel each other out. Taking the square root afterward brings the result back to the original units, such as milliseconds.)
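To make the definitions concrete, here is a minimal Python sketch that computes the mean, variance, and standard deviation by hand. The latency values are hypothetical, invented for illustration only, and the sketch uses the sample variance (dividing by n - 1), which is an assumption about which variant you want.

```python
import math
import statistics

# Hypothetical latency samples in milliseconds (not from the article).
latencies_ms = [100, 105, 98, 110, 102, 95]

mean_ms = statistics.mean(latencies_ms)

# Square each deviation so values above and below the mean do not cancel out.
squared_deviations = [(x - mean_ms) ** 2 for x in latencies_ms]

# Sample variance: sum of squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (len(latencies_ms) - 1)

# The square root brings the measure back to milliseconds.
std_dev = math.sqrt(variance)

print(f"mean={mean_ms:.2f} ms, variance={variance:.2f}, std dev={std_dev:.2f} ms")
```

The standard library functions `statistics.variance` and `statistics.stdev` should give the same answers in one call each.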

This idea is represented in the drawing of the bell curve below. In this drawing, the data is normalized, meaning the average is subtracted from each data point and the result is divided by the standard deviation, so that we can operate on the simple scale of 0, 1, 2, 3, and so forth. In other words, if the average latency in your load performance test is 100, then a test result of 100 lands on the center point, 100 - 100 = 0. We do that because it’s easier to deal with a small number like 0 than a large one like 100.
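As a rough illustration of normalizing, the sketch below converts latencies into distances from the mean measured in standard deviations (often called z-scores). The three sample values are hypothetical, chosen so that the average is 100 as in the example above.

```python
import statistics

# Hypothetical latencies in milliseconds, chosen so the mean is exactly 100.
latencies_ms = [90, 100, 110]

mean_ms = statistics.mean(latencies_ms)
std_dev = statistics.stdev(latencies_ms)

# Subtract the mean, then divide by the standard deviation.
z_scores = [(x - mean_ms) / std_dev for x in latencies_ms]

for latency, z in zip(latencies_ms, z_scores):
    print(f"{latency} ms is {z:+.1f} standard deviations from the mean")
```

A result equal to the average lands at 0, and everything else is expressed as a number of standard deviations above or below it.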

The numbers -3, -2, -1, 0, 1, 2, 3 are standard deviations. The numbers under the curve are the probabilities associated with each band. For example, the probability that a data point is between 0 and 3 standard deviations above the mean is roughly 2.1% + 13.6% + 34.1% = 49.8%. If you are wondering why that does not sum to 50%, it is because a small fraction of data points, well under 1%, lie more than 3 standard deviations away from the mean. In other words, any measurement more than 3 standard deviations above or below the mean is unlikely.
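These probabilities can be checked with Python's standard library. This is a minimal sketch using `statistics.NormalDist` (available in Python 3.8 and later); the helper name `probability_between` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def probability_between(low, high):
    """Probability that a value falls between low and high standard deviations."""
    return standard_normal.cdf(high) - standard_normal.cdf(low)

# Chance of landing between the mean and 3 standard deviations above it.
zero_to_three = probability_between(0, 3)

# Chance of landing more than 3 standard deviations from the mean, either side.
beyond_three = 1 - probability_between(-3, 3)

print(f"0 to 3 standard deviations: {zero_to_three:.2%}")
print(f"beyond 3 standard deviations: {beyond_three:.2%}")
```

The first number comes out just short of 50%, and the second is the small remainder split between the two tails of the curve.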

You might have heard this concept in business when people talk of Six Sigma, or in the 99.99% uptime promised by cloud vendors. Under the Six Sigma convention, the chance of a defect is 0.0000034 (3.4 per million), or hardly any at all. So if your cloud vendor says they operate at Six Sigma, you should doubt that.
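The 0.0000034 figure can be checked against the normal curve. The sketch below assumes the usual Six Sigma convention, in which the process average is allowed to drift by 1.5 standard deviations, so the quoted defect rate is the one-sided tail beyond 4.5 standard deviations rather than a full 6. The variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# One-sided tail beyond 4.5 standard deviations (6 minus the assumed 1.5 drift).
six_sigma_defect_rate = standard_normal.cdf(-4.5)

# Two-sided tail beyond a full 6 standard deviations, with no drift allowed.
beyond_six_sigma = 2 * standard_normal.cdf(-6)

print(f"Six Sigma convention: {six_sigma_defect_rate:.7f}")
print(f"Strictly beyond 6 standard deviations: {beyond_six_sigma:.2e}")
```

Either way the probability is tiny, which is the reason for skepticism when a vendor claims to operate at that level.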

**T-Test**

That was a lot of mathematical mumbo jumbo. How do we put this to practical use?

There are lots of different statistical algorithms you can use to determine whether a data point is an outlier. Most are complicated for everyone but a statistician. But here we pick one that most IT people can understand. (We assume since you work with computers you can count.) This is called the T-Test.

The T-Test takes two data sets and, from their means and variances, calculates the probability of seeing a difference this large if both really came from the same population. If that probability is very small (below 0.05, a conventional threshold that corresponds to roughly 2 standard deviations on the bell curve), then the new data is an outlier, meaning that result should be thrown out.

Here we apply the T-Test to the problem of load testing. We want to know whether one test is in line with the others or is a statistical fluke to be dismissed. To do this in Excel, we use the T.TEST() function. If the function is not available in your copy of Excel, turn on the Analysis ToolPak add-in. It ships with the product; you just need to enable it.

**Sample Load Test Results**

Suppose our load performance test includes 6 runs of the same set of programs. We are keeping a tally of our results (latency in milliseconds) over time and saving them in column B.

We just ran another test cycle and saved the results in column C. We want to know if this test cycle should be considered an outlier. In other words, given the average latency in milliseconds over time and the results of the latest test in column C, does this mean test C failed? Or can this be written off because such results are highly unlikely?

The arguments to the Excel T.TEST function are T.TEST(control range, new test range, 2, 1). The third argument is the number of tails and the fourth is the type of test. You can read the Excel help if you want to understand why we chose the values 2 and 1 for the last two arguments. That part is complicated. For the sake of simplicity let’s just say that we chose 1 (a paired test) because the data came from the same population (i.e., the 6 tests are all of the same type and not some others), and 2 (a two-tailed test) because we are not testing a directional hypothesis, such as that the latest test is slower than the control sample because we changed something.
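For readers who want to see what such a call works out, here is a minimal Python sketch of a paired, two-tailed t-test using only the standard library. It is an illustration under stated assumptions, not a description of Excel's internals: the latency values are hypothetical, the function names are invented for this example, and the p-value is approximated by numerically integrating the t distribution.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value, using Simpson's rule on the density from 0 to |t|."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return max(0.0, 1 - central_area)

def paired_t_test(control, latest):
    """Return (t statistic, two-tailed p-value) for two paired samples."""
    if len(control) != len(latest):
        raise ValueError("paired samples must have the same length")
    differences = [new - old for old, new in zip(control, latest)]
    n = len(differences)
    std_err = statistics.stdev(differences) / math.sqrt(n)
    t_stat = statistics.mean(differences) / std_err
    return t_stat, two_tailed_p(t_stat, n - 1)

# Hypothetical latencies in milliseconds for 6 runs (not the article's data).
control_ms = [100, 105, 98, 110, 102, 95]   # plays the role of column B
latest_ms = [120, 118, 130, 112, 125, 119]  # plays the role of column C

t_stat, p_value = paired_t_test(control_ms, latest_ms)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("below the 0.05 threshold" if p_value < 0.05 else "within normal variation")
```

If SciPy is installed, `scipy.stats.ttest_rel` performs the same paired test in one call and is the more practical choice; the hand-rolled version is here only to show the arithmetic.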

Below are the results.

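So we see that for the control set in column B and the test we just ran in column C, T.TEST returns 0.0493, which is < 0.05. So the latest test differs from the control set by more than chance would explain; in statistical terms, the difference is significant at the 0.05 level. We treat this test as an outlier and should not include it in our collection of valid tests.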

Someone reading this might say, “What’s he talking about? If one test is slower than another then we have to see why.” Maybe so. But in end-to-end performance testing there are many components. Internet latency is one item that you cannot control. If the test is one SQL statement then sure, dive in and see why it was so slow. It might have been a blip with the disk controller or fiber channel connection, in which case you would be chasing the wrong problem if you are looking to, say, index a field.

Our goal here is to get you to understand the analytics tools that the salesmen have sold you so that you can learn when to trust them or doubt them. That makes such tools more valuable and makes you a better load testing analyst.