# Layout: plot region 1 (the population) fills the centre 2x2 block,
# surrounded by 12 small panels for the samples
layout(matrix(c(2:6, 1, 1, 7:8, 1, 1, 9:13), 4, 4))
n = 56 # Sample size
df = n - 1 # Degrees of freedom
mu = 100
sigma = 15
IQ = seq(mu-45, mu+45, 1)
par(mar=c(4,2,2,0))
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red", main = "Population Distribution")
n.samples = 12
for(i in 1:n.samples) {
par(mar=c(2,2,2,0))
hist(rnorm(n, mu, sigma), main="Sample Distribution", cex.axis=.5, col="beige", cex.main = .75)
}
T-distribution and the
One-sample t-test
T-distribution
Gosset
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
In the English-language literature it takes its name from William Sealy Gosset’s 1908 paper in Biometrika under the pseudonym “Student”. Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3.
Source: Wikipedia
Population distribution
A sample
Let’s take one sample from our normal population and calculate the t-value.
x = rnorm(n, mu, sigma); x
[1] 109.43112 96.87128 125.02820 100.50236 103.15677 86.33173 79.40158
[8] 109.37891 98.09908 99.08930 93.71331 92.56543 97.07199 122.73654
[15] 120.66219 114.96873 59.54988 109.57176 96.82111 116.20960 123.09352
[22] 99.23364 87.89970 108.11215 111.38267 128.05283 101.35790 99.22410
[29] 111.57814 105.22128 99.32861 95.55047 65.06105 91.23018 88.75616
[36] 112.04127 108.23778 89.89440 74.23339 109.12790 81.36150 108.44184
[43] 116.13111 114.97818 138.63833 93.69768 91.26753 101.30472 94.33761
[50] 84.74939 103.41192 107.03687 109.93603 84.34081 77.82622 85.21784
hist(x, main = "Sample distribution", col = "beige", breaks = 15)
text(80, 10, round(mean(x),2))
More samples
Let’s take more samples.
n.samples = 1000
mean.x.values = vector()
se.x.values = vector()
for(i in 1:n.samples) {
  x = rnorm(n, mu, sigma)
  mean.x.values[i] = mean(x)
  se.x.values[i] = (sd(x) / sqrt(n))
}
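For reference, the same simulation can be written without an explicit loop. This is a minimal sketch using base R’s replicate(), where each column of the matrix holds one sample:
# Vectorised equivalent: one sample of size n per column
samples = replicate(n.samples, rnorm(n, mu, sigma))
mean.x.values = colMeans(samples)
se.x.values = apply(samples, 2, sd) / sqrt(n)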
Mean and SE for all samples
head(cbind(mean.x.values, se.x.values))
mean.x.values se.x.values
[1,] 100.13359 2.178297
[2,] 100.01375 1.951004
[3,] 96.20884 2.114315
[4,] 101.50074 1.848675
[5,] 98.53697 2.541891
[6,] 99.86697 2.057775
Sampling distribution
Of the mean
hist(mean.x.values,
col = "beige",
main = "Sampling distribution",
xlab = "all sample means")
T-statistic
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
So the t-statistic represents the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\), taking the sample size into account through the degrees of freedom \(df = n - 1\).
t-value
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t = (mean(x) - mu) / (sd(x) / sqrt(n))
t
[1] -0.2164327
Calculate t-values
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t.values = (mean.x.values - mu) / se.x.values
tail(cbind(mean.x.values, mu, se.x.values, t.values))
mean.x.values mu se.x.values t.values
[995,] 101.35934 100 2.135869 0.63643459
[996,] 101.98231 100 1.782861 1.11187169
[997,] 101.91252 100 2.148384 0.89021105
[998,] 96.29151 100 2.309735 -1.60559106
[999,] 99.88058 100 2.041275 -0.05850079
[1000,] 99.56213 100 2.023119 -0.21643269
Sampled t-values
What is the distribution of all these t-values?
hist(t.values,
freq = F,
main = "Sampled T-values",
xlab = "T-values",
col = "beige",
ylim = c(0, .4))
T = seq(-4, 4, .01)
lines(T, dt(T, df), col = "red")
legend("topright", lty = 1, col="red", legend = "T-distribution")
T-distribution
So if the population is normally distributed (assumption of normality), the t-distribution represents the deviation of sample means from the population mean (\(\mu\)), given a certain sample size (\(df = n - 1\)).
The t-distribution therefore differs across sample sizes and converges to a standard normal distribution when the sample size is large enough.
The t-distribution is defined by:
\[\textstyle\frac{\Gamma \left(\frac{\nu+1}{2} \right)} {\sqrt{\nu\pi}\,\Gamma \left(\frac{\nu}{2} \right)} \left(1+\frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}\!\]
where \(\nu\) is the number of degrees of freedom and \(\Gamma\) is the gamma function.
Source: Wikipedia
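This convergence is easy to visualise by overlaying t-densities for increasing degrees of freedom on the standard normal curve; a minimal sketch:
# t-densities approach the standard normal curve as df grows
xs = seq(-4, 4, .01)
plot(xs, dnorm(xs), type = "l", lwd = 2, ylab = "Density",
     main = "Convergence to the standard normal")
for(v in c(2, 5, 30)) lines(xs, dt(xs, v), lty = 2)
legend("topright", lty = c(1, 2),
       legend = c("standard normal", "t (df = 2, 5, 30)"))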
One or two sided
Two sided
- \(H_A: \bar{x} \neq \mu\)
One sided
- \(H_A: \bar{x} > \mu\)
- \(H_A: \bar{x} < \mu\)
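The choice of tails determines where the rejection region lies. As a quick illustration with \(\alpha = .05\) and the \(df = 55\) from the running example:
# Critical t-values at alpha = .05
alpha = .05
qt(1 - alpha, df)      # one sided: all of alpha in one tail
qt(1 - alpha / 2, df)  # two sided: alpha split over both tails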
Effect-size
The effect-size is the standardised difference between the sample mean and the expected \(\mu\). For the t-test, the effect-size is expressed as \(r\).
\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]
r = sqrt(t^2 / (t^2 + df))
r
[1] 0.2603778
Effect-sizes
We can also calculate effect-sizes for all our calculated t-values. Under the assumption of \(H_0\) the effect-size distribution looks like this.
r = sqrt(t.values^2 / (t.values^2 + df))
tail(cbind(mean.x.values, mu, se.x.values, t.values, r))
mean.x.values mu se.x.values t.values r
[995,] 101.35934 100 2.135869 0.63643459 0.085502558
[996,] 101.98231 100 1.782861 1.11187169 0.148267669
[997,] 101.91252 100 2.148384 0.89021105 0.119180489
[998,] 96.29151 100 2.309735 -1.60559106 0.211595753
[999,] 99.88058 100 2.041275 -0.05850079 0.007887999
[1000,] 99.56213 100 2.023119 -0.21643269 0.029171358
Effect-size distribution
Cohen (1988)
- Small: \(.1 \leq r < .3\)
- Medium: \(.3 \leq r < .5\)
- Large: \(r \geq .5\)
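To see how the effect-sizes simulated under \(H_0\) relate to these benchmarks, we can mark them on the histogram; a minimal sketch using the r values computed above:
# Effect-sizes under H0, with Cohen's benchmarks as dashed lines
hist(r, col = "beige", main = "Effect-size distribution", xlab = "r")
abline(v = c(.1, .3, .5), lty = 2)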
Power
- Strive for 80%
- Based on known effect size
- Calculate number of subjects needed
- Use G*Power to calculate (or base R, as sketched below)
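G*Power is a standalone application, but base R’s power.t.test() performs the same calculation. A minimal sketch, assuming an illustrative medium effect of \(d = .5\) (not a value from these slides):
# Sample size needed for 80% power, one-sample t-test, alpha = .05
power.t.test(delta = .5, sd = 1, sig.level = .05, power = .80,
             type = "one.sample", alternative = "two.sided")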
Alpha and power
T = seq(-3, 6, .01)  # x-axis grid
N = 45               # sample size
E = 2                # noncentrality parameter under HA
# Set plot
plot(0,0,
type = "n",
ylab = "Density",
xlab = "T",
ylim = c(0,.5),
xlim = c(-3,6),
main = "T-Distributions under H0 and HA")
critical_t = qt(.05, N-1, lower.tail = FALSE)
# Alpha
range_x = seq(critical_t,6,.01)
polygon(c(range_x,rev(range_x)),
c(range_x*0,rev(dt(range_x,N-1,ncp=0))),
col = "grey",
density = 10,
angle = 90,
lwd = 2)
# Power
range_x = seq(critical_t,6,.01)
polygon(c(range_x,rev(range_x)),
c(range_x*0,rev(dt(range_x,N-1,ncp=E))),
col = "grey",
density = 10,
angle = 45,
lwd = 2)
lines(T,dt(T,N-1,ncp=0),col="red", lwd=2) # H0 line
lines(T,dt(T,N-1,ncp=E),col="blue",lwd=2) # HA line
# Critical value
lines(rep(critical_t,2),c(0,dt(critical_t,N-1,ncp=E)),lwd=2,col="black")
text(critical_t,dt(critical_t,N-1,ncp=E),"critical T-value",pos=2, srt = 90)
# H0 and HA
text(0,dt(0,N-1,ncp=0),expression(H[0]),pos=3,col="red", cex=2)
text(E,dt(E,N-1,ncp=E),expression(H[A]),pos=3,col="blue",cex=2)
# Mu H0 line
lines(c(0,0),c(0,dt(0,N-1)), col="red", lwd=2,lty=2)
text(0,dt(0,N-1,ncp=0)/2,expression(mu),pos=4,cex=1.2)
# Mu HA line
lines(c(E,E),c(0,dt(E,N-1,ncp=E)),col="blue",lwd=2,lty=2)
text(E,dt(0,N-1,ncp=0)/2,expression(paste(mu)),pos=4,cex=1.2)
# Rejection region along the x-axis
lines(c(critical_t+.01,6),c(0,0),col="green",lwd=4)
# Legend
legend("topright", c(expression(alpha),'POWER'),density=c(10,10),angle=c(90,45))
One-sample t-test
IQ next to you
http://goo.gl/T6Lo2s
Models
\[\text{outcome} = \text{model} + \text{error}\]
For the one-sample t-test the model is simply the population mean \(\mu\): each observation is \(\mu\) plus error.
Compare sample mean
We use the one-sample t-test to compare the sample mean \(\bar{x}\) to the population mean \(\mu\).
Let’s take a different sample and calculate the mean of this sample.
mu = 120
n = length(IQ.next.to.you)  # IQ.next.to.you: the IQ responses collected via the form above
x = IQ.next.to.you
mean_x = mean(x, na.rm = TRUE)
sd_x = sd(x, na.rm = TRUE)
cbind(n, mean_x, sd_x)
n mean_x sd_x
[1,] 77 119.9091 12.71673
Does this mean differ significantly from the population mean \(\mu = 120\)?
Hypothesis
Null hypothesis
- \(H_0: \bar{x} = \mu\)
Alternative hypothesis
- \(H_A: \bar{x} \neq \mu\)
- \(H_A: \bar{x} > \mu\)
- \(H_A: \bar{x} < \mu\)
Assumptions
- Normally distributed sampling distribution (informal check sketched after this list)
- Random sample
- Measurement level
  - Interval
  - Ratio
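With large samples, normality of the sampling distribution is usually granted by the central limit theorem; with small samples, one typically inspects the normality of the data themselves. A minimal sketch (the rnorm() call is a placeholder for your own data):
# Informal normality checks for a sample vector x
x = rnorm(56, 100, 15)  # placeholder data; substitute the real sample
qqnorm(x); qqline(x)    # points close to the line suggest normality
shapiro.test(x)         # a small p-value flags non-normality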
T-statistic
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}} = \frac{119.91 - 120 }{12.72 / \sqrt{77}}\]
So the t-statistic represents the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\), considering the sample size.
t = (mean_x - mu) / (sd_x / sqrt(n)); t
[1] -0.06273026
Type I error
To determine if this t-value significantly differs from the population mean we have to specify a type I error that we are willing to make.
- Type I error / \(\alpha\) = .05
P-value one sided
Finally, we have to calculate our p-value, for which we need the degrees of freedom \(df = n - 1\) to determine the shape of the t-distribution.
df = n - 1; df
[1] 76
if(!"visualize" %in% installed.packages()) { install.packages("visualize") }
library("visualize")
visualize.t(t, df, section = "upper")
P-value two sided
visualize.t(c(-t, t), df, section = "tails")
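For reference, the same p-values can be computed directly with pt(), without the visualize package:
pt(t, df, lower.tail = FALSE)  # one sided (upper tail)
2 * pt(-abs(t), df)            # two sided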
Effect-size
\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]
r = sqrt(t^2 / (t^2 + df))
r
[1] 0.007195468