How to visualise sampling variability with Python

In real life, and in statistics too, it is rarely feasible to measure every member of a population. Instead, we work with a smaller group (a sample), compute an approximation from it, and hope that the answer we get isn’t too far from the truth.

Sampling variability describes how much a value measured on a sample differs from the true population value, or parameter.
In other words, it is the extent to which the measures of a sample differ from the corresponding measure of the population, and from one sample to the next.
A measure that refers to a sample is called a statistic.
The parameter of a population is fixed, but a statistic changes from sample to sample, because no two samples contain exactly the same items and the items themselves are not all alike. With a large enough sample, the statistic generally gets close to the population parameter.
Sampling variability is usually quantified by the variance or standard deviation of a statistic across samples; the standard deviation of a statistic is called its standard error. It underlies several types of statistical tests used in data analysis.
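As a minimal sketch of the idea that a parameter stays fixed while a statistic changes from sample to sample, the snippet below draws a few samples from a hypothetical standard-normal population and prints the mean of each. The seed, the population size, the number of samples, and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Hypothetical population: standard-normal values (true mean close to 0)
population = rng.standard_normal(100_000)
parameter = population.mean()  # fixed: does not change between samples

# Draw several independent samples and compute the same statistic on each
n_samples, sample_size = 5, 30
statistics = np.array([
    rng.choice(population, size=sample_size).mean()  # sampling with replacement
    for _ in range(n_samples)
])

print("population mean (parameter):", parameter)
print("sample means (statistics):  ", statistics)
print("spread of the sample means: ", statistics.std(ddof=1))
```

Each run of the loop gives a slightly different mean, even though every sample comes from the same population; that spread is the sampling variability of the mean.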

Theoretical distribution (population) and experiment data (sample):

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

## a theoretical normal distribution
x = np.linspace(-5,5,10101)
theoNormDist = stats.norm.pdf(x)
# (normalize to pdf)
# theoNormDist = theoNormDist*np.mean(np.diff(x))

# now for our experiment
numSamples = 40

# initialize
sampledata = np.zeros(numSamples)

# run the experiment!
for expi in range(numSamples):
    sampledata[expi] = np.random.randn()

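# show the results
plt.hist(sampledata, density=True, label='Sample data')
plt.plot(x, theoNormDist, 'r', linewidth=3, label='Theoretical distribution')
plt.xlabel('Data values')
plt.ylabel('Probability density')
plt.legend()
plt.show()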

Figure: population and sample data

Show the mean of samples of a known distribution:

# generate population data with known mean
populationN = 1000000
population  = np.random.randn(populationN)
population  = population - np.mean(population) # demean

# now we draw a random sample from that population
samplesize = 30

# the random indices to select from the population
sampleidx = np.random.randint(0,populationN,samplesize)
samplemean = np.mean(population[ sampleidx ])

### how does the sample mean compare to the population mean?
# (the population mean is 0 after demeaning)
print(samplemean)

OUT: -0.050885435985437565
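A single sample mean says little about how far such a mean typically lands from the population mean. The sketch below, which is an illustration rather than part of the original listing, repeats the same draw many times and compares the spread of the resulting sample means with the textbook standard error, sigma / sqrt(n). The seed, the number of repeats, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility

# Population with known mean, built the same way as in the text
population = rng.standard_normal(1_000_000)
population -= population.mean()  # demean

sample_size = 30
n_repeats = 2000  # hypothetical number of repeated experiments

# One row of random indices per experiment (drawn with replacement)
idx = rng.integers(0, population.size, size=(n_repeats, sample_size))
sample_means = population[idx].mean(axis=1)

empirical_se = sample_means.std(ddof=1)
theoretical_se = population.std() / np.sqrt(sample_size)

print("mean of the sample means:", sample_means.mean())
print("empirical standard error:", empirical_se)
print("sigma / sqrt(n):         ", theoretical_se)
```

Passing `sample_means` to `plt.hist` would show the sampling distribution of the mean directly, centred near the population mean of 0.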

Sample means vs. sample sizes:

samplesizes = np.arange(30,1000)

samplemeans = np.zeros(len(samplesizes))

for sampi in range(len(samplesizes)):
    # nearly the same code as above
    sampleidx = np.random.randint(0,populationN,samplesizes[sampi])
    samplemeans[sampi] = np.mean(population[ sampleidx ])

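# show the results!
plt.plot(samplesizes, samplemeans, 's-')
plt.plot(samplesizes[[0,-1]], [np.mean(population), np.mean(population)], 'r', linewidth=3)
plt.xlabel('sample size')
plt.ylabel('mean value')
plt.legend(('Sample means','Population mean'))
plt.show()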

Figure: sample means vs. sample sizes
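The loop above draws one sample per sample size, so it shows the scatter narrowing only by eye. As a rough sketch under assumed settings (four hypothetical sample sizes, 500 repeats each, an arbitrary seed), the snippet below measures the spread of the sample mean at each size and compares it with sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility

population = rng.standard_normal(200_000)
population -= population.mean()  # demean, as in the text
sigma = population.std()

sample_sizes = np.array([30, 100, 300, 1000])  # hypothetical choices
n_repeats = 500                                # experiments per sample size

empirical_se = np.zeros(len(sample_sizes))
for i, n in enumerate(sample_sizes):
    # n_repeats samples of size n, drawn with replacement
    idx = rng.integers(0, population.size, size=(n_repeats, n))
    empirical_se[i] = population[idx].mean(axis=1).std(ddof=1)

theoretical_se = sigma / np.sqrt(sample_sizes)

for n, emp, theo in zip(sample_sizes, empirical_se, theoretical_se):
    print(f"n={n:5d}  empirical SE={emp:.4f}  sigma/sqrt(n)={theo:.4f}")
```

The measured spread should fall as the sample size grows, roughly following the 1/sqrt(n) curve, which is why the sample means in the plot cluster more tightly around the population mean at larger sample sizes.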
