It is not enough that your data are sampled from a population. Statistical tests are also based on the assumption that each subject (or each experimental unit) was sampled independently of the rest. Data are independent when any random factor that causes a value to be too high or too low affects only that one value. If a random factor (one that you didn’t account for in the analysis of the data) can affect more than one value, but not all of the values, then the data are not independent.
The concept of independence can be difficult to grasp. Consider the following three situations.
- You are measuring blood pressure in animals. You have five animals in each group, and measure the blood pressure three times in each animal. You do not have 15 independent measurements. If one animal has higher blood pressure than the rest, all three measurements in that animal are likely to be high. You should average the three measurements in each animal. Now you have five mean values that are independent of each other.
- You have done a biochemical experiment three times, each time in triplicate. You do not have nine independent values, as an error in preparing the reagents for one experiment could affect all three triplicates. If you average the triplicates, you do have three independent mean values.
- You are doing a clinical study and recruit 10 patients from an inner-city hospital and 10 more patients from a suburban clinic. You have not independently sampled 20 subjects from one population. The data from the 10 inner-city patients may be more similar to each other than to the data from the suburban patients. You have sampled from two populations and need to account for that in your analysis.
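The averaging step from the first two situations can be sketched in a few lines of Python. This is only an illustration: the blood pressure readings, the animal labels, and the helper name `per_unit_means` are made up for the example and do not come from any real data set.

```python
from statistics import mean

# Hypothetical readings (mmHg): three repeated measurements per animal.
readings = {
    "animal_1": [118, 121, 119],
    "animal_2": [131, 129, 133],
    "animal_3": [110, 112, 111],
    "animal_4": [124, 126, 122],
    "animal_5": [115, 117, 116],
}

def per_unit_means(replicates):
    """Collapse repeated measurements into one mean per experimental unit."""
    return {unit: mean(values) for unit, values in replicates.items()}

animal_means = per_unit_means(readings)

# The number of independent values is the number of animals (5),
# not the number of individual readings (15).
n_independent = len(animal_means)
print(n_independent, animal_means)
```

Any later analysis would then use the five per-animal means as the data, rather than the fifteen raw readings.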
How you can use statistics to extrapolate from sample to population
Statisticians have devised three basic approaches for drawing conclusions about populations from samples of data:
The first method is to assume that the values in the population follow a special distribution, known as the Gaussian (bell-shaped) distribution. Once you assume that a population is distributed in that manner, statistical tests let you make inferences about the mean (and other properties) of the population. Most commonly used statistical tests assume that the population is Gaussian. These tests are sometimes called parametric tests.
The second method is to rank all values from low to high and then compare the distributions of ranks. This is the principle behind most commonly used nonparametric tests, which are used to analyze data from non-Gaussian distributions.
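To make the ranking idea concrete, here is a minimal Python sketch that ranks the pooled values of two groups from low to high and sums the ranks in each group. The data and the helper name `average_ranks` are assumptions made for illustration, and giving tied values their average rank is one common convention rather than something specified above. The sketch covers only the ranking step; a complete rank-based test would go on to convert the rank sums into a P value.

```python
def average_ranks(values):
    """Rank values from low to high (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

# Hypothetical measurements for two groups.
group_a = [3.1, 4.7, 2.8, 5.0]
group_b = [6.2, 5.9, 4.7, 7.4]

# Rank all values together, then split the ranks back into the two groups.
ranks = average_ranks(group_a + group_b)
rank_sum_a = sum(ranks[:len(group_a)])
rank_sum_b = sum(ranks[len(group_a):])
print(rank_sum_a, rank_sum_b)
```

If the two groups came from the same distribution, the rank sums would tend to be similar; a large difference between them is what a rank-based test looks for.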
The third method is known as resampling. With this method, you create a population of sorts by repeatedly sampling values from your sample. This is best understood by an example. Assume you have a single sample of five values, and want to know how close that sample mean is likely to be to the true population mean. Write each value on a card and place the cards in a hat. Create a pseudo sample by drawing a card from the hat, writing down its number, and returning the card to the hat, repeating until you have written down five numbers. Generate many samples of N=5 this way. Since you can draw the same value more than once, the samples won’t all be the same (but some might be). When randomly selecting cards gets tedious, use a computer program instead. The distribution of the means of these computer-generated samples gives you information about how accurately you know the mean of the entire population. The idea of resampling can be difficult to grasp.
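The card-and-hat procedure can be sketched in Python as follows. The five sample values, the function name `resampled_means`, the number of pseudo samples, and the fixed seed are all assumptions chosen for illustration. Summarizing the resulting means with their 2.5th and 97.5th percentiles is one simple option, not the only one.

```python
import random
from statistics import mean

def resampled_means(sample, n_resamples=5000, seed=0):
    """Draw pseudo samples with replacement and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # rng.choices draws n values with replacement: a card is returned
    # to the hat after every draw.
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical sample of five values.
sample = [4.2, 5.1, 3.8, 6.0, 4.9]

means = sorted(resampled_means(sample))

# Middle 95% of the resampled means, as a rough indication of how
# precisely the sample pins down the population mean.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean = {mean(sample):.2f}, middle 95% of resampled means: {lower:.2f} to {upper:.2f}")
```

A narrow range suggests the sample mean is a fairly precise estimate; a wide range suggests it is not. With only five values, the range is itself quite rough.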