Tuesday, 29 October 2013

The Problem with Small Sample Sizes

If you've ever read through the comment section of an online science article or joined a discussion on recent research amongst science enthusiasts, then undoubtedly you've heard the complaint that a study is flawed because of its small sample size. Recently, this has been particularly true for fields like neuroscience, where costly and time-consuming techniques like fMRI limit the subject pool (for example: "Power failure: why small sample size undermines the reliability of neuroscience").

The knee-jerk response of rejecting studies with small sample sizes has become quite common, in a way that I'll argue is similar to how the popular notion of "correlation does not equal causation" is used. To be clear, such responses aren't wrong in the sense of never being accurate; rather, they are wrong in their flippancy and blanket use. When writers like Steven Novella criticise the use of "correlation does not equal causation", they aren't saying that it's always wrong to point out the problems of inferring causation from a simple correlation, but rather that it is wrong to dismiss the importance that correlations play in the determination of causation.


It's true that there are problems with small sample sizes; however, what many people fail to recognise is that practically all studies suffer from small sample sizes. In any given study there are a number of relevant samples that can be limited in size: 1) the subject pool, 2) the stimuli, and 3) the behaviors/responses. The first case is the most obvious, and it is usually what people have in mind when they criticise small sample sizes, but the second presents the lesser-known problem of biases in sample materials, as described in this study: "Treating stimuli as a random factor in social psychology: a new and comprehensive solution to a pervasive but largely ignored problem".

As explained by Neuroskeptic here, this is the problem where the advantages of having a large subject pool can be offset by limitations in the size of the stimulus sample:

Going with the blonde vs. dark example, suppose you take 1000 volunteers, show them some pictures of blonde guys and dark guys, and get them to rate them on trustworthiness. You find a significant difference between the two groups of stimuli. You conclude that your volunteers are hair-bigots and submit it as a paper. The reviewers think, 1000 volunteers? That’s a big sample size. They publish it.  
Now that study I just described might be perfectly valid. But it might be seriously flawed. The problem is that while your sample size may be large in terms of volunteers, it might be very small in another way. Suppose you have just 10 photos per group. Your ‘sample size’, as regards the sample of stimuli, is only 20. And that sample size is just as important as the other one.  
It might be that there’s no real hair difference in perceived trustworthiness, but there are individual differences – some men just look dodgy and it’s nothing to do with hair – and in your stimuli, you’ve happened to pick some dodgy looking blonde guys. Or whatever.  
Now you can run your statistical analyses taking these possible stimulus variation effects into account. But according to Judd, Westfall and Kenny, authors of this paper, this is rarely done. They show with both real and hypothetical data, that unless you take care of this, you can find “statistically significant” differences from pure random noise. This is not a new argument, but they say it’s been ignored for too long.
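Neuroskeptic's point is easy to demonstrate with a quick simulation (a hypothetical sketch with made-up numbers, not data from the paper): give 1000 subjects two sets of 10 stimuli whose underlying means differ only by chance, then run the usual by-subject paired t-test that ignores stimulus variation. Because every subject sees the same handful of stimuli, any chance difference between the two stimulus sets shows up consistently in every subject, and "significant" results appear from pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n_subj=1000, n_stim=10, stim_sd=1.0, noise_sd=1.0):
    # No true condition effect: the two stimulus sets differ only by chance.
    stim_a = rng.normal(0, stim_sd, n_stim)   # e.g. 10 "blonde" photos
    stim_b = rng.normal(0, stim_sd, n_stim)   # e.g. 10 "dark" photos
    # Each subject's condition mean = fixed stimulus-set mean + averaged trial noise.
    mean_a = stim_a.mean() + rng.normal(0, noise_sd / np.sqrt(n_stim), n_subj)
    mean_b = stim_b.mean() + rng.normal(0, noise_sd / np.sqrt(n_stim), n_subj)
    # Paired t-test over subjects only, treating the stimuli as fixed.
    return stats.ttest_rel(mean_a, mean_b).pvalue

pvals = np.array([one_study() for _ in range(500)])
false_pos = np.mean(pvals < 0.05)
print(f"Type I error rate: {false_pos:.2f}  (nominal: 0.05)")
```

With 1000 subjects the test has enormous power to detect the chance difference between the two stimulus samples, so the false positive rate ends up far above the nominal 5%.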

The third problem with sample size, which is often discussed informally amongst researchers but rarely communicated to the public, is the number of behaviors and responses collected from each subject. This distinction can be seen in the debate between "large-N vs small-N" statistics, or between "between-subjects vs within-subject" designs. The former categories refer to the research most people are familiar with: gather together large numbers of people, place them into different conditions, and then compare the results across the groups. The latter categories refer to the less common research that uses only a few subjects, each of whom is placed in multiple conditions, with the results analysed by comparing each subject's own responses across conditions.
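To make the contrast concrete, here is a small simulation (an illustrative sketch with invented parameters): when stable individual differences are large relative to the effect of interest, an analysis that compares each subject against themselves can reliably detect an effect that a between-subjects comparison of the very same data misses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate(n=20, effect=0.5, between_sd=2.0, noise_sd=0.5):
    base = rng.normal(0, between_sd, n)        # large, stable individual differences
    a = base + rng.normal(0, noise_sd, n)      # each subject in condition A
    b = base + effect + rng.normal(0, noise_sd, n)  # same subjects in condition B
    return (stats.ttest_ind(a, b).pvalue,      # between-subjects analysis
            stats.ttest_rel(a, b).pvalue)      # within-subject analysis

results = np.array([simulate() for _ in range(300)])
power_between = np.mean(results[:, 0] < 0.05)
power_within = np.mean(results[:, 1] < 0.05)
print(f"power: between-subjects = {power_between:.2f}, within-subject = {power_within:.2f}")
```

The data are identical in both analyses; the within-subject comparison simply cancels out each subject's baseline, which is exactly the variability that swamps the between-subjects test.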


At this point, the problem might not be apparent, so I'll explain it in terms of the idea of a "learning curve". This is a concept that everybody is familiar with: when you start learning a new skill, you experience a gradual increase in ability until it plateaus at some level. Makes sense, right? Well actually, it doesn't. Think about some time when you learnt a new skill, like playing a sport. Your improvement doesn't follow a smoothly increasing curve; rather, it is often composed of disjointed periods of steep increases and periods of little-to-no improvement, which makes the "learning curve" look more like a "learning staircase".

In other words, it will look a little like this:

(Yes, that graph was made in Paint. What of it?).

The red line represents what we think of as the "learning curve", whereas the blue line and dotted green line represent the "learning staircases" of two hypothetical subjects. So, if people learn in 'staircases', why does the myth of the curve persist? The answer can be found in the graph above: when we average the results of two (or more) people, the observable differences between individuals are ironed out - and this smoothed-out average is the "learning curve".
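You can reproduce this averaging artefact in a few lines (a toy simulation, not real learning data): generate "staircase" learners whose skill jumps at a handful of random trials, then average them. Each individual changes on only a few trials, but the group mean creeps upward on almost every trial:

```python
import numpy as np

rng = np.random.default_rng(42)

def staircase(n_trials=100, n_jumps=4, plateau=10.0):
    """One subject: flat stretches separated by sudden jumps in skill."""
    jump_points = np.sort(rng.choice(np.arange(5, n_trials - 5), n_jumps, replace=False))
    jump_sizes = rng.dirichlet(np.ones(n_jumps)) * plateau  # positive jumps summing to the plateau
    skill = np.zeros(n_trials)
    for point, size in zip(jump_points, jump_sizes):
        skill[point:] += size
    return skill

curves = np.array([staircase() for _ in range(200)])
group_mean = curves.mean(axis=0)

# Each individual curve is piecewise-flat; the group average rises nearly everywhere.
individual_steps = np.mean([np.count_nonzero(np.diff(c)) for c in curves])
average_steps = np.count_nonzero(np.diff(group_mean))
print(f"trials where an individual improves: {individual_steps:.0f}")
print(f"trials where the group average improves: {average_steps}")
```

The individuals each improve on only a few trials, yet the averaged curve improves on almost all of them - the "staircases" have been ironed into a curve.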

Some people understand concepts best by looking at dry graphs (even beautifully coloured ones like the above), and some learn best by relating them to an everyday situation, so I'll illustrate the basic issue by retelling a joke about statistics:

A physicist, a biologist, and a statistician go hunting one day. They're creeping around the forest until they come across a deer in the distance, so they crouch down and all three take aim. The physicist is the first to fire but his shot veers off to the left by about 5 metres. The biologist takes his turn and his shot also misses by 5 metres, but this time it goes off to the right of the deer. At this point the statistician yells, "We hit it!".

As much as I am loath to explain a joke, the relevance to this discussion lies in the fact that even though aggregate data such as averages can give us useful and important information, we also have to recognise that averages applied to individuals can lead to erroneous (and frankly ridiculous) conclusions. The statistician suggesting that they've hit (and thus killed) the deer because two missed shots average out to a direct hit is just as wrong as a statistician looking at the averages of thousands of people learning a new skill and suggesting that we progressively learn in the form of a smooth curve.


Ideally, we would have massive studies consisting of thousands of subjects, thousands of stimuli, and thousands of behaviors/responses taken from a number of conditions. Practically speaking, however, this is not going to happen, so our next best step is to view samples within the context of the research they appear in.

This means that our first response to a study utilising a small subject sample shouldn't be something along the lines of, "...but we can't conclude much given the small sample"; instead, we should be asking whether the small sample actually affects the results and whether the appropriate statistical designs and analyses were used.

The conclusion is that large and small studies both have their advantages and disadvantages. Large studies allow us to accurately identify trends and generalise results to populations, whereas small studies allow us to examine the behavior of individuals to a much greater degree. So whilst large studies consisting of few responses from each subject can be useful, they don't always supplant or eliminate the need for small studies consisting of many responses from a few subjects.


  1. Hi Mike, been following the blog for several months but this is my first time posting.

    You are totally right about the sample size of stimuli being an underappreciated issue in most research practice. Most researchers are not accustomed to thinking about sampling stimuli in as clear and privileged a way as they are likely to think about the sampling of participants. This extends to the common thinking about what are reasonable sample sizes of stimuli.

    There are some very concrete and potentially dire statistical power consequences of this. Judd, Kenny and I are right now writing a follow-up paper that discusses power analysis and optimal design in experiments with multiple random factors (e.g., participants and stimuli). We find that when the number of stimuli is small, studies will tend to be badly underpowered even for detecting true large effects. The main reason we don't see as many failures to detect effects in these situations as we know we should be seeing, based on the power results, is because researchers are routinely applying the RM-ANOVA analysis that we have long known is badly biased in a crossed random effects context. (This bias issue was the subject of our 2012 paper.)

    One interesting and important thing to note about the lack of power when stimulus sample size is small is that this will tend to be true regardless of how many participants you end up recruiting! Because it turns out that for any fixed sample of stimuli, power does *not* approach 1 as the number of participants goes to infinity, but instead asymptotes at some smaller value that depends on the number of stimuli and the degree of stimulus variation. And in many realistic situations this maximum asymptotic power can be quite low. So researchers really need to think hard about their samples of stimuli *before* data collection begins, at which point the stimulus sample becomes effectively fixed and you are stuck with at most whatever asymptotic power the stimulus sample implies.

    1. Hi Jake, thanks for commenting and I'm pleased to hear that you're not new to my blog. I appreciate you taking the time to provide some interesting detail on your study.

      I remember reading Neuroskeptic's comments on the paper and thinking that it's such an obvious possible flaw in many studies. It was baffling that practically nobody had dedicated any serious thought to it, but I guess that's the benefit of hindsight.

      Jake Westfall says: "One interesting and important thing to note about the lack of power when stimulus sample size is small is that this will tend to be true regardless of how many participants you end up recruiting! Because it turns out that for any fixed sample of stimuli, power does *not* approach 1 as the number of participants goes to infinity, but instead asymptotes at some smaller value that depends on the number of stimuli and the degree of stimulus variation. And in many realistic situations this maximum asymptotic power can be quite low."

      This doesn't surprise me at all; it's basically a specific instance of "garbage in, garbage out", isn't it? It doesn't matter how awesome your car stereo is if you're still using the stock speakers in your '88 Honda Civic; there will be a maximum level of sound quality possible, bottlenecked by the substandard speakers.

      Thanks again for the comment, it gives me hope that I'm not just talking to myself here.

  2. Need to come back and read more closely later, but... In animal behavior there was a big crisis 15-20 years ago about "pseudo replication". This is exactly what you describe when they treat 1000 subjects as 1000 replications, but they have a very limited range of stimuli. I'm not sure how psychology has avoided this crisis so far.

    1. Yeah, I hadn't thought of it that way but I can definitely see how pseudoreplication would fit into what I'm discussing above. As for how pseudoreplication is dealt with, have you read this paper: Pseudoreplication is a Pseudoproblem?

    2. So, I've heard people in ecology and related fields talk about "pseudo-replication" before... to be clear, this is just a slightly provocative term for what everyone else calls non-independence, right? Maybe I am missing something. But yeah, what it basically comes down to is that in data where participants are crossed with stimuli, the traditional analysis accounts only for one source of "pseudo-replication," namely the non-independence due to subjects, but ignores the other major source of pseudo-replication, namely the non-independence due to stimuli (two responses made to the same stimulus are usually more similar than two responses made to different stimuli).

    3. Jake, reading quickly, I think the answer is "Yes". So let's say, for example, I want to study song learning in birds, and I do a study with 100 baby birds, in which I play an adult song to the developing birds, and determine that the birds replicate the short trill notes almost perfectly, but that they do not reliably replicate the long melodic notes as well. The 100 baby birds would seem to make us very confident in our conclusion, right? But what if there was only one adult song I played to all of them? Well, then I don't get to make conclusions about all birds and all songs; I only get to have confidence that this is how a typical bird responds to this single adult call.

  3. Jeff Schank's paper?!? He was my original grad school mentor at UC Davis ;- )