Thursday, April 14, 2016

Using the Presidential Election to Understand Google Analytics Sampling

When I coach people on how to use Google Analytics, one issue comes up over and over: not understanding sampling.  In many cases Google Analytics will use a sample of your data instead of all your data when you run a report.  Is that bad?  Not necessarily.  However, you need to understand sampling to be aware of its potential impact.

The way I explain sampling to folks is that when you see poll results for who is leading in the presidential election, the pollsters are only calling a sample of the 220 million eligible voters in the U.S.  They often only call several thousand people.  As long as those people are representative of the overall voter base in the U.S., you can accurately predict election results with a very small sample.

If you have a large site, you have a lot of data in Google Analytics.  Some reports would take a very long time to run if Google were to use all your data.  Hence the use of sampling.

One of the big issues with sampling is that Google doesn't tell you how reliable the sampled results are.  Think back to the election polls.  If you read the fine print, you will almost always see something that says, "Margin of error for this poll is +/- x%."  Often the margin of error is just a few percent, and it gives you confidence in the results.  Google Analytics, however, doesn't tell you the margin of error.  If you are measuring an event that doesn't happen very often, the margin of error can be huge.  One of the first things a lot of people notice when they have sampling is that their results are very inconsistent.  They run a report one day, run it again the next day, and get very different results.  Similarly, if they pull the same data using two slightly different approaches in GA and get very different results, they likely have a lot of sampling going on and a large margin of error in the results.

The first thing to check is how much sampling Google is doing.  If you see a yellow sampling message in the upper right-hand corner of your report, you have sampling going on.

So how much sampling is bad?  A lot of people say that you have a problem if the sampling percentage falls below 5-10%.  However, it really depends.  In the presidential polls, they only call a couple thousand people out of 220 million.  That is ~0.001% sampling, well below the rule of thumb some people call for in Google Analytics.  The difference is that in the election polls the statisticians can calculate the margin of error.
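
To make the poll example concrete, here is a minimal sketch of the textbook margin-of-error calculation for a proportion.  It assumes a simple random sample and a 95% confidence level (z = 1.96), and it uses the worst case of a 50/50 split.  The function name and the sample sizes are my own choices for illustration, and this is not necessarily how Google Analytics samples or how any particular pollster does the math.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample and the normal approximation;
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Illustrative sample sizes (assumed, not taken from any real poll).
for n in (500, 1_000, 2_000, 10_000):
    print(f"{n:>6,} respondents: +/- {margin_of_error(n):.1%}")
```

Notice that the size of the overall population never appears in the formula.  That is why roughly 2,000 respondents can give a margin of error of about +/- 2.2% even with 220 million voters, provided the sample is representative.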

The margin of error depends a lot on how often the event you are measuring happens in GA.  For example, let's say you are looking at device category under the mobile overview.  There are only three choices in that report (desktop, mobile, and tablet), every session has a device, and each category shows up fairly often.  As a result, you can have a low sampling percentage and still be confident in your results.
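
A quick simulation shows why frequent categories hold up well under sampling.  This is only a sketch: the "true" device shares below are made-up numbers, the sample size of 10,000 sessions is an assumption, and the sample is drawn as a simple random sample, which may not match how GA actually picks sessions.

```python
import random

# Hypothetical true shares of sessions by device category (illustrative only).
TRUE_SHARES = {"desktop": 0.55, "mobile": 0.35, "tablet": 0.10}

def sampled_shares(true_shares, sample_size, rng):
    """Draw one random sample of sessions and return each category's estimated share."""
    categories = list(true_shares)
    weights = [true_shares[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=sample_size)
    return {c: draws.count(c) / sample_size for c in categories}

rng = random.Random(42)  # fixed seed so reruns give the same numbers
runs = [sampled_shares(TRUE_SHARES, 10_000, rng) for _ in range(5)]

# Each "run" stands in for running the same sampled report again.
for i, shares in enumerate(runs, start=1):
    line = ", ".join(f"{c} {s:.1%}" for c, s in shares.items())
    print(f"run {i}: {line}")
```

Each run should land within a point or two of the true shares, so the report looks about the same every time you pull it.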

However, let's say that Google Analytics is using 10,000 data points in your report, and you are trying to estimate the frequency of a rare error message on your site.  Let's say your error only happens 1 in every million times, although you don't know that yet.  The problem is that in this case GA is almost always going to tell you that this event happens 0% of the time, because the odds of the event showing up in your 10,000 data points are very low (10,000 / 1,000,000, or about 1%).  When GA does see the event in your sampled data, it is usually going to say the event happens 1 in every 10,000 times, because it has 10,000 data points and it sees one occurrence.  However, both answers, 0 and 1 in 10,000, are way off from the truth: the first says the error never happens, and the second overstates it by a factor of 100.  Sampling is a huge problem in this case.
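
The arithmetic behind the rare-error example can be checked in a few lines.  This sketch assumes each of the 10,000 sampled data points independently has a 1-in-a-million chance of being the error, which is a simplification; the function and variable names are mine.

```python
def rare_event_odds(true_rate, sample_size):
    """Return the chance the sample contains zero events and at least one event,
    assuming each sampled data point is independent."""
    p_zero = (1 - true_rate) ** sample_size
    return p_zero, 1 - p_zero

TRUE_RATE = 1 / 1_000_000   # the error happens 1 in every million times
SAMPLE_SIZE = 10_000        # data points GA uses in the report

p_zero, p_at_least_one = rare_event_odds(TRUE_RATE, SAMPLE_SIZE)
print(f"Chance the sample shows 0 errors:  {p_zero:.1%}")
print(f"Chance the sample shows 1 or more: {p_at_least_one:.1%}")

# If exactly one error lands in the sample, the estimated rate is 1 / 10,000.
estimate_if_seen_once = 1 / SAMPLE_SIZE
overstatement = estimate_if_seen_once / TRUE_RATE
print(f"Seeing one error overstates the true rate by about {overstatement:.0f}x")
```

About 99% of the time the sampled report shows no errors at all, and in the remaining 1% or so it shows a rate roughly 100 times too high.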

In the next post we will talk about your options if you suspect sampling is impacting your GA results.  Until then, watch for the yellow sampling message in the upper right-hand corner of your GA reports to see how much sampling you are experiencing.