An easy way to perform a two-sample t-test with Python.


The two-sample t-test (or independent samples t-test) is one of the most commonly used hypothesis tests. It is applied to check whether the difference between the means of two groups is statistically significant.
Two-sample means that we have two sets of samples.

The formula used by the Python scipy.stats library differs depending on whether the two data groups are paired or unpaired, have equal or unequal variances, and have equal or unequal sample sizes. Choosing the formula that matches the nature of the tested data is therefore important. One part is common to every formula: the numerator, which is the difference between the means of the two groups.
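As a minimal sketch of one of these variants, the equal-variance (pooled) form can be written out by hand and compared with `stats.ttest_ind`. The helper name `pooled_t`, the seed, and the sample data below are illustrative assumptions, not part of the original example.

```python
import numpy as np
from scipy import stats

def pooled_t(x, y):
    """Student's t statistic for two independent samples, assuming equal variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    df = n1 + n2 - 2
    # pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    # numerator: difference of the sample means; denominator: its standard error
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

# illustrative data only (fixed seed so the run is repeatable)
rng = np.random.default_rng(0)
a = 1.0 + rng.standard_normal(30)
b = 1.2 + rng.standard_normal(40)

t_manual, df = pooled_t(a, b)
t_scipy, _ = stats.ttest_ind(a, b, equal_var=True)
print(t_manual, t_scipy, np.isclose(t_manual, t_scipy))
```

The numerator is the same mean difference in every variant; only the standard error in the denominator (and the degrees of freedom) changes.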

Paired means that both samples consist of the same test subjects, e.g. testing a group of students before and after taking a drug. Unpaired means that the samples consist of distinct test subjects, e.g. testing a group of students taking the drug and a reference group of students. A common rule of thumb is that if the ratio of the larger variance to the smaller variance is less than 4, the variances can be treated as approximately equal.
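The choice described above could be sketched roughly as follows. The helper names `variance_ratio` and `two_sample_ttest`, the `ratio_threshold` parameter, and the generated data are hypothetical and only illustrate the rule of thumb; `stats.ttest_rel` is the scipy function for paired samples, and `equal_var=False` makes `stats.ttest_ind` run Welch's unequal-variance test.

```python
import numpy as np
from scipy import stats

def variance_ratio(x, y):
    """Ratio of the larger sample variance to the smaller one."""
    v1 = np.var(x, ddof=1)
    v2 = np.var(y, ddof=1)
    return max(v1, v2) / min(v1, v2)

def two_sample_ttest(x, y, paired=False, ratio_threshold=4.0):
    """Pick a t-test variant from the pairing and the variance-ratio rule of thumb."""
    if paired:
        # same subjects measured twice; both samples must have equal length
        return stats.ttest_rel(x, y)
    # treat variances as equal only when the ratio is below the threshold
    equal_var = variance_ratio(x, y) < ratio_threshold
    return stats.ttest_ind(x, y, equal_var=equal_var)

# illustrative data only
rng = np.random.default_rng(1)
before = rng.normal(1.0, 1.0, 25)
after = before + rng.normal(0.2, 0.5, 25)   # same subjects, second measurement
control = rng.normal(1.0, 1.0, 40)          # distinct subjects

res_paired = two_sample_ttest(before, after, paired=True)
res_unpaired = two_sample_ttest(before, control)
print(res_paired.pvalue, res_unpaired.pvalue)
```

The threshold of 4 is only a rule of thumb; when in doubt, the unequal-variance form is a reasonable default for unpaired data.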



Generate the data for a two-sample t-test:



import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# parameters
n1 = 30   # samples in dataset 1
n2 = 40   # ...and 2
mu1 = 1   # population mean in dataset 1
mu2 = 1.2 # population mean in dataset 2


# generate the data
data1 = mu1 + np.random.randn(n1)
data2 = mu2 + np.random.randn(n2)

# show their histograms
plt.hist(data1,bins='fd',color=[1,0,0,.5],label='Data 1')
plt.hist(data2,bins='fd',color=[0,0,1,.5],label='Data 2')
plt.xlabel('Data value')
plt.ylabel('Count')
plt.legend()
plt.show()


T-test using the Python scipy library:



t,p = stats.ttest_ind(data1,data2,equal_var=True)

df = n1+n2-2
print('t(%g) = %g, p=%g'%(df,t,p))

OUT: t(68) = 0.0974228, p=0.922677
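To see where the p-value comes from, it can be recomputed from the t-value and the degrees of freedom. This is a small sketch on freshly generated data (the seed and the variable names `x`, `y`, `p_manual` are assumptions), using the default two-sided alternative of `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# illustrative data only, generated the same way as in the example above
rng = np.random.default_rng(2)
x = 1.0 + rng.standard_normal(30)
y = 1.2 + rng.standard_normal(40)

t, p = stats.ttest_ind(x, y, equal_var=True)
df = x.size + y.size - 2

# two-sided p-value: probability mass of the t distribution in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(t), df)
print('t(%g) = %g, p=%g, manual p=%g' % (df, t, p, p_manual))
```

A large p-value, such as the one in the output above, means the observed mean difference is well within what sampling noise alone would produce.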


T-values as a function of mean difference and variance:



# ranges for t-value parameters
meandiffs = np.linspace(-3,3,80)
pooledvar = np.linspace(.5,4,100)

# group sample size
n1 = 40
n2 = 30

# initialize output matrix
allTvals = np.zeros((len(meandiffs),len(pooledvar)))

# loop over the parameters...
for meani in range(len(meandiffs)):
    for vari in range(len(pooledvar)):
        
        # t-value denominator
        df = n1 + n2 - 2
        s  = np.sqrt(( (n1-1)*pooledvar[vari] + (n2-1)*pooledvar[vari]) / df)
        t_den = s * np.sqrt(1/n1 + 1/n2)
        
        # t-value in the matrix
        allTvals[meani,vari] = meandiffs[meani] / t_den

        
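# origin='lower' puts row 0 (the smallest mean difference) at the bottom, so the y-axis labels from extent match the data
plt.imshow(allTvals,vmin=-4,vmax=4,extent=[pooledvar[0],pooledvar[-1],meandiffs[0],meandiffs[-1]],aspect='auto',origin='lower')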
plt.xlabel('Variance')
plt.ylabel('Mean differences')
plt.colorbar()
plt.title('t-values as a function of difference and variance')
plt.show()

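The same grid of t-values can also be computed without explicit loops. The following is a sketch using NumPy broadcasting, under the same assumption as the loop version that both groups share one variance; the name `allTvals_vec` is introduced here for illustration.

```python
import numpy as np

# same parameter ranges and group sizes as in the loop version
meandiffs = np.linspace(-3, 3, 80)
pooledvar = np.linspace(.5, 4, 100)
n1 = 40
n2 = 30

# standard error of the mean difference for each variance value, shape (100,)
# (with equal group variances the pooled variance equals that common variance)
t_den = np.sqrt(pooledvar * (1/n1 + 1/n2))

# broadcasting: (80, 1) / (1, 100) -> (80, 100), one t-value per (difference, variance) pair
allTvals_vec = meandiffs[:, None] / t_den[None, :]
print(allTvals_vec.shape)
```

Each row holds one mean difference and each column one variance, so the t-value grows with the mean difference and shrinks as the variance increases.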

