20 sep 2018

## Gosset

In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

In the English-language literature it takes its name from William Sealy Gosset's 1908 paper in Biometrika under the pseudonym "Student". Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3.

Source: Wikipedia

## Population distribution

layout(matrix(c(2:6,1,1,7:8,1,1,9:13), 4, 4))

n  = 56    # Sample size
df = n - 1 # Degrees of freedom

mu    = 100
sigma = 15

IQ = seq(mu-45, mu+45, 1)

par(mar=c(4,2,2,0))
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red", main = "Population Distribution")

n.samples = 12

for(i in 1:n.samples) {
  par(mar=c(2,2,2,0))
  hist(rnorm(n, mu, sigma), main="Sample Distribution", cex.axis=.5, col="beige", cex.main = .75)
}

## T-statistic

$T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}$

So the t-statistic expresses the deviation of the sample mean $$\bar{x}$$ from the population mean $$\mu$$ in units of the standard error $$SE_x = s_x / \sqrt{n}$$, which accounts for the sample size. The sample size also determines the degrees of freedom, $$df = n - 1$$.
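As a small worked example, the sketch below computes the t-statistic by hand for a made-up sample of five values and compares it with the statistic reported by R's built-in `t.test()`. The vector `scores` and the names `t.manual` and `t.builtin` are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical sample of five IQ scores (illustrative values only)
scores <- c(96, 104, 110, 99, 101)
mu     <- 100                            # population mean to compare against

n.small  <- length(scores)
se       <- sd(scores) / sqrt(n.small)   # standard error: s / sqrt(n)
t.manual <- (mean(scores) - mu) / se     # deviation in standard-error units

# t.test() reports the same statistic, with df = n - 1
t.builtin <- unname(t.test(scores, mu = mu)$statistic)

t.manual
t.builtin
```

Both values should agree, since `t.test()` uses the sample standard deviation divided by the square root of the sample size in the denominator.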

## A sample

Let's take one sample from our normal population and calculate the t-value.

x = rnorm(n, mu, sigma); x
##  [1] 102.16687 114.15483  86.69264  99.02376 101.82758  83.25325  89.57004
##  [8]  88.78985  61.47332 112.33939  78.83421 104.51506 105.29668 107.23038
## [15]  85.19226 108.68843  77.78700 106.51962 101.36222 142.87580  99.63164
## [22]  93.92062  88.47106 131.24375 113.91682 115.85990  92.24418 118.89392
## [29]  94.82804  63.32423 111.77896  96.79166  82.33991 116.93666  99.65081
## [36] 107.08591  88.39150 126.88769  83.43686  99.61964  65.37463  93.30822
## [43] 122.72570 110.15269  79.84345  90.44909  88.32921 110.10466 105.55144
## [50] 102.77819 106.96157 116.37308 123.50946  72.66226 127.29991  96.57863
hist(x, main = "Sample distribution", col = "beige")

mean(x)
## [1] 99.90802

## t-value

$T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}$

t = (mean(x) - mu) / (sd(x) / sqrt(n))
t
## [1] -0.04025698

## More samples

Let's take more samples.

n.samples     = 1000
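mean.x.values = numeric(n.samples) # preallocate one slot per sample
se.x.values   = numeric(n.samples)

for(i in 1:n.samples) {
  x = rnorm(n, mu, sigma)
  mean.x.values[i] = mean(x)           # sample mean
  se.x.values[i]   = (sd(x) / sqrt(n)) # standard error of that sample
}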

## Mean and SE for all samples

head(cbind(mean.x.values, se.x.values))
##      mean.x.values se.x.values
## [1,]     100.79155    2.046958
## [2,]      98.84425    1.853184
## [3,]      98.00282    2.100768
## [4,]      98.08265    2.403656
## [5,]     100.60792    1.684585
## [6,]      99.30550    2.053344

## Sampling distribution

hist(mean.x.values,
     col  = "beige",
     main = "Sampling distribution",
     xlab = "all sample means")
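The spread of this histogram is what the standard error estimates. The following is a minimal sketch of that check, assuming the same `n`, `mu` and `sigma` as above. The fixed seed (`set.seed(1)`) and the names `sample.means`, `theoretical.se` and `empirical.se` are additions for illustration and reproducibility, not part of the original.

```r
set.seed(1)          # assumption: fixed seed so the run is reproducible

n     <- 56
mu    <- 100
sigma <- 15

# Draw 1000 samples of size n and keep each sample mean
sample.means <- replicate(1000, mean(rnorm(n, mu, sigma)))

theoretical.se <- sigma / sqrt(n)    # 15 / sqrt(56), roughly 2
empirical.se   <- sd(sample.means)   # spread of the simulated sample means

c(theoretical = theoretical.se, empirical = empirical.se)
```

The two numbers should be close, which is why the per-sample standard errors listed earlier all hover around 2.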

## Calculate t-values

$T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}$

t.values = (mean.x.values - mu) / se.x.values

tail(cbind(mean.x.values, mu, se.x.values, t.values))
##         mean.x.values  mu se.x.values   t.values
##  [995,]      94.37146 100    2.227314 -2.5270530
##  [996,]     100.64958 100    1.781037  0.3647228
##  [997,]     104.23438 100    1.955161  2.1657437
##  [998,]      98.49562 100    2.228782 -0.6749779
##  [999,]     101.39357 100    1.996812  0.6978980
## [1000,]     103.46712 100    1.965430  1.7640512

## Sampled t-values

What is the distribution of all these t-values?

hist(t.values,
     freq = FALSE,
     main = "Sampled T-values",
     xlab = "T-values",
     col  = "beige",
     ylim = c(0, .4))
t.grid = seq(-4, 4, .01) # avoid the name T, which is R's shorthand for TRUE
lines(t.grid, dt(t.grid, df), col = "red")
legend("topright", lty = 1, col="red", legend = "T-distribution")
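Besides the visual overlay, the match can be checked numerically. The sketch below re-simulates the t-values under the same settings and compares their quantiles with those of the t-distribution from `qt()`. The seed and the names `sim.t` and `probs` are assumptions introduced for this illustration.

```r
set.seed(1)          # assumption: fixed seed for reproducibility

n     <- 56
df    <- n - 1
mu    <- 100
sigma <- 15

# Simulate 1000 t-values, one per sample
sim.t <- replicate(1000, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

probs <- c(.025, .25, .5, .75, .975)

# Side by side: quantiles of the simulation vs. the t-distribution
round(cbind(simulated   = quantile(sim.t, probs),
            theoretical = qt(probs, df)), 3)

# Share of simulated t-values beyond the two-sided 5% critical value
mean(abs(sim.t) > qt(.975, df))
```

If the simulated values follow the t-distribution, the two columns should be similar and the final proportion should land near 0.05, up to sampling noise.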

## T-distribution

So if the population is normally distributed (assumption of normality), the t-distribution describes how far sample means deviate from the population mean ($$\mu$$), measured in standard errors, for a given sample size ($$df = n - 1$$).

The t-distribution is therefore different for different sample sizes, and it converges to a standard normal distribution as the sample size grows.
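One way to see this convergence is to compare critical values. The sketch below, with an arbitrarily chosen set of degrees of freedom in `dfs`, lists the two-sided 5% critical value of the t-distribution next to its distance from the standard normal value given by `qnorm()`.

```r
dfs <- c(2, 5, 30, 55, 1000)   # illustrative degrees of freedom

# Two-sided 5% critical values for increasing degrees of freedom
crit.t <- qt(.975, dfs)
crit.z <- qnorm(.975)          # standard normal counterpart

round(cbind(df = dfs, crit.t, diff.from.normal = crit.t - crit.z), 4)
```

The difference shrinks steadily as the degrees of freedom increase, so for large samples the t-distribution and the standard normal give nearly the same cut-offs.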

The probability density function of the t-distribution is:

$\textstyle f(x) = \frac{\Gamma \left(\frac{\nu+1}{2} \right)} {\sqrt{\nu\pi}\,\Gamma \left(\frac{\nu}{2} \right)} \left(1+\frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}\!$

where $$\nu$$ is the number of degrees of freedom and $$\Gamma$$ is the gamma function.
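To connect the formula with R, the sketch below writes the density out by hand and compares it with the built-in `dt()`. The function name `t.density` and the grid `x.grid` are hypothetical. The sketch works on the log scale with `lgamma()` rather than calling `gamma()` directly, a choice made here to avoid overflow for large $$\nu$$.

```r
# Density of the t-distribution written out from the formula above.
# lgamma() is the log of the gamma function; working on the log scale
# keeps the computation stable for large nu.
t.density <- function(x, nu) {
  log.const <- lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
  exp(log.const - (nu + 1) / 2 * log1p(x^2 / nu))
}

x.grid <- seq(-4, 4, .5)
nu     <- 55                      # df = n - 1 for n = 56

# Largest absolute difference from R's built-in dt(); should be near zero
max(abs(t.density(x.grid, nu) - dt(x.grid, nu)))
```

With $$\nu = 1$$ the same function reduces to the Cauchy density, so `t.density(0, 1)` should equal `1 / pi`.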

Source: Wikipedia